[jira] [Commented] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
[ https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865983#comment-17865983 ] Hongbing Wang commented on HDFS-17535: -- [~ruilaing] We had similar problems without PR HDFS-15240 in previous years and there seemed to be no convenient tool to fix them. We also use structured data (orc/parquet) features for verifying data, and the overall idea is similar to yours. For RS-6-3, if, unfortunately, more than 3 blocks are broken, it will not be recoverable. > I have confirmed the EC corrupt file, can this corrupt file be restored? > > > Key: HDFS-17535 > URL: https://issues.apache.org/jira/browse/HDFS-17535 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Blocker > > I learned that EC does have a major bug with file corruption > https://issues.apache.org/jira/browse/HDFS-15759 > 1: I have confirmed the EC corrupt file, can this corrupt file be restored? > We have important data that is causing us production data loss issues. Is > there a way to recover it? > Checking EC block group: blk_-9223372036361352768 > Status: ERROR, message: EC compute result not match.:ip is 10.12.66.116 block > is : -9223372036361352765 > 2: [https://github.com/apache/orc/issues/1939] I was wondering, if I cherry > picked your current code (GitHub pull request #2869), can I skip patches > related to HDFS-14768, HDFS-15186, and HDFS-15240? > hdfs version 3.1.0 > thank you -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
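The RS-6-3 recoverability rule mentioned in the comment can be sketched as follows (a minimal illustration of the erasure-coding arithmetic, not Hadoop's actual decoder; the class and method names are hypothetical):

```java
// Minimal sketch of the Reed-Solomon recoverability rule discussed above.
// For an RS(d, p) policy, a block group survives the loss of at most p
// internal blocks; losing more than p makes the group unrecoverable.
public class EcRecoverability {
    static boolean isRecoverable(int dataBlocks, int parityBlocks, int lostBlocks) {
        // Any combination of up to `parityBlocks` missing internal blocks
        // (data or parity) can be reconstructed from the remaining ones.
        return lostBlocks <= parityBlocks;
    }

    public static void main(String[] args) {
        // RS-6-3: 6 data blocks + 3 parity blocks per block group.
        System.out.println(isRecoverable(6, 3, 3)); // true  - still recoverable
        System.out.println(isRecoverable(6, 3, 4)); // false - permanent data loss
    }
}
```

This is why the comment above says a file with more than 3 broken blocks per group cannot be restored; only the structured-data (ORC/Parquet) level checks can then tell you which records were lost.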
[jira] [Commented] (HDFS-16867) Exiting Mover due to an exception in MoverMetrics.create
[ https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696878#comment-17696878 ] Hongbing Wang commented on HDFS-16867: -- [~Happy-shi] Is this still being followed up? I had the same problem with balancer. {code:java} 2023-03-06 17:40:53,264 ERROR org.apache.hadoop.hdfs.server.balancer.Balancer: Exiting balancer due an exception org.apache.hadoop.metrics2.MetricsException: Metrics source Balancer-BP-332003681-10.196.164.22-1648632173322 already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:225) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:198) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:55) at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:344) at org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:809) at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:847) at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:952) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1102){code} > Exiting Mover due to an exception in MoverMetrics.create > > > Key: HDFS-16867 > URL: https://issues.apache.org/jira/browse/HDFS-16867 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ZhiWei Shi >Assignee: ZhiWei Shi >Priority: Major > Labels: pull-request-available > > After the Mover process is started for a period of time, the process exits > unexpectedly and an error is reported in the log > {code:java} > [hdfs@${hostname} hadoop-3.3.2-nn]$ nohup bin/hdfs mover -p > /test-mover-jira9534 > mover.log.jira9534.20221209.2 & > [hdfs@{hostname} hadoop-3.3.2-nn]$ tail -f mover.log.jira9534.20221209.2 > ... 
> 22/12/09 14:22:32 INFO balancer.Dispatcher: Start moving > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:32 INFO balancer.Dispatcher: Successfully moved > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Stopping Mover metrics > system... > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system stopped. > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system shutdown > complete. > Dec 9, 2022, 2:22:42 PM Mover took 13mins, 19sec > 22/12/09 14:22:42 ERROR mover.Mover: Exiting Mover due to an exception > org.apache.hadoop.metrics2.MetricsException: Metrics source > Mover-${BlockpoolID} already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.mover.MoverMetrics.create(MoverMetrics.java:49) > at org.apache.hadoop.hdfs.server.mover.Mover.<init>(Mover.java:162) > at org.apache.hadoop.hdfs.server.mover.Mover.run(Mover.java:684) > at org.apache.hadoop.hdfs.server.mover.Mover$Cli.run(Mover.java:826) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) > at org.apache.hadoop.hdfs.server.mover.Mover.main(Mover.java:908) > {code} > 1. “final ExitStatus r = m.run()” returns right after scheduling only one replica move > 2. When “r == ExitStatus.IN_PROGRESS”, iter.remove() won’t run > 3. “new Mover” and “this.metrics = MoverMetrics.create(this)” execute multiple > times for the same nnc, which leads to the error > {code:java} > //Mover.java > for (final StorageType t : diff.existing) { > for (final MLocation ml : locations) { > final Source source = storages.getSource(ml); > if 
(ml.storageType == t && source != null) { > // try to schedule one replica move. > if (scheduleMoveReplica(db, source, diff.expected)) { // 1. returns right > after scheduling only one replica move > return true; > } > } > } > } > while (connectors.size() > 0) { > Collections.shuffle(connectors); > Iterator iter = connectors.iterator(); > while (iter.hasNext()) { > NameNodeConnector nnc = iter.next(); > // 3. “new Mover” and “this.metrics = MoverMetrics.create(this)” execute > multiple times for the same nnc, which leads to the error > final Mover m = new Mover(nnc, co
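The root cause described in this thread (and in the balancer stack trace above) is re-registering the same metrics source name. A toy model, assuming nothing about Hadoop's real `DefaultMetricsSystem` beyond the duplicate-name check visible in the stack trace (the class and method names below are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of a metrics system that, like DefaultMetricsSystem, rejects
// re-registration of a source name. Creating a second Mover/Balancer for
// the same block pool registers the same name twice and throws.
public class ToyMetricsSystem {
    private final Set<String> sourceNames = new HashSet<>();

    public void register(String name) {
        if (!sourceNames.add(name)) {
            throw new IllegalStateException(
                "Metrics source " + name + " already exists!");
        }
    }

    public void unregister(String name) {
        sourceNames.remove(name);
    }

    public static void main(String[] args) {
        ToyMetricsSystem ms = new ToyMetricsSystem();
        ms.register("Mover-BP-1");      // first Mover for this block pool: fine
        try {
            ms.register("Mover-BP-1");  // second Mover for the same nnc: fails
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // One possible fix: unregister the source when an iteration finishes,
        // so the next `new Mover(nnc, ...)` for the same nnc can register again.
        ms.unregister("Mover-BP-1");
        ms.register("Mover-BP-1");      // now succeeds
    }
}
```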
[jira] [Created] (HDFS-16763) MoverTool: Make valid for the number of mover threads per DN
Hongbing Wang created HDFS-16763: Summary: MoverTool: Make valid for the number of mover threads per DN Key: HDFS-16763 URL: https://issues.apache.org/jira/browse/HDFS-16763 Project: Hadoop HDFS Issue Type: Bug Components: balancer & mover Reporter: Hongbing Wang When running the Mover tool, the number of mover threads per DN is always 1, resulting in very slow data movement. This JIRA fixes the problem that the current config does not take effect.
[jira] [Updated] (HDFS-16656) Fix some incorrect descriptions in SPS
[ https://issues.apache.org/jira/browse/HDFS-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-16656: - Summary: Fix some incorrect descriptions in SPS (was: Fixed some incorrect descriptions in SPS) > Fix some incorrect descriptions in SPS > -- > > Key: HDFS-16656 > URL: https://issues.apache.org/jira/browse/HDFS-16656 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Reporter: Hongbing Wang >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > There are some incorrect descriptions in the SPS module on the web site, as follows: > [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] > and > [hdfs-default.xml|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml] > Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.
[jira] [Updated] (HDFS-16656) Fixed some incorrect descriptions in SPS
[ https://issues.apache.org/jira/browse/HDFS-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-16656: - Description: There are some incorrect descriptions in the SPS module on the web site, as follows: [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] and [hdfs-default.xml|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml] Fix them in `ArchivalStorage.md` and `hdfs-default.xml`. (was: There are some incorrect descriptions in SPS module in web site, as follows: [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] and [hdfs-default.xml|[https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml].] Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.) > Fixed some incorrect descriptions in SPS > > > Key: HDFS-16656 > URL: https://issues.apache.org/jira/browse/HDFS-16656 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Reporter: Hongbing Wang >Priority: Minor > > There are some incorrect descriptions in the SPS module on the web site, as follows: > [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] > and > [hdfs-default.xml|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml] > Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.
[jira] [Created] (HDFS-16656) Fixed some incorrect descriptions in SPS
Hongbing Wang created HDFS-16656: Summary: Fixed some incorrect descriptions in SPS Key: HDFS-16656 URL: https://issues.apache.org/jira/browse/HDFS-16656 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Reporter: Hongbing Wang There are some incorrect descriptions in the SPS module on the web site, as follows: [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] and [hdfs-default.xml|[https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml].] Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.
[jira] [Updated] (HDFS-16512) Improve oiv tool to parse fsimage file in parallel with XML format
[ https://issues.apache.org/jira/browse/HDFS-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-16512: - Parent: HDFS-14617 Issue Type: Sub-task (was: Improvement) > Improve oiv tool to parse fsimage file in parallel with XML format > -- > > Key: HDFS-16512 > URL: https://issues.apache.org/jira/browse/HDFS-16512 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major >
[jira] [Created] (HDFS-16512) Improve oiv tool to parse fsimage file in parallel with XML format
Hongbing Wang created HDFS-16512: Summary: Improve oiv tool to parse fsimage file in parallel with XML format Key: HDFS-16512 URL: https://issues.apache.org/jira/browse/HDFS-16512 Project: Hadoop HDFS Issue Type: Improvement Reporter: Hongbing Wang Assignee: Hongbing Wang
[jira] [Commented] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
[ https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405682#comment-17405682 ] Hongbing Wang commented on HDFS-15987: -- Report [^Improve_oiv_tool_001.pdf] is given, and the corresponding code is in [commit 66502f90.|https://github.com/apache/hadoop/pull/2918/commits/66502f901c3d5ec3410965ea5fdef2b31947d816] > Improve oiv tool to parse fsimage file in parallel with delimited format > > > Key: HDFS-15987 > URL: https://issues.apache.org/jira/browse/HDFS-15987 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Labels: pull-request-available > Attachments: Improve_oiv_tool_001.pdf > > Time Spent: 2h 20m > Remaining Estimate: 0h > > The purpose of this Jira is to improve oiv tool to parse fsimage file with > sub-sections (see -HDFS-14617-) in parallel with delimited format. > 1. Serial parsing is time-consuming > The time to serially parse a large fsimage with delimited format (e.g. `hdfs > oiv -p Delimited -t ...`) is as follows: > {code:java} > 1) Loading string table: -> Not time consuming. > 2) Loading inode references: -> Not time consuming > 3) Loading directories in INode section: -> Slightly time consuming (3%) > 4) Loading INode directory section: -> A bit time consuming (11%) > 5) Output: -> Very time consuming (86%){code} > Therefore, output is the stage most worth parallelizing. > 2. How to output in parallel > The sub-sections are grouped in order, and each thread processes a group and > outputs it to the file corresponding to each thread, and finally merges the > output files. > 3. 
The result of a test > {code:java} > input fsimage file info: > 3.4G, 12 sub-sections, 55976500 INodes > - > Threads TotalTime OutputTime MergeTime > 1 18m37s 16m18s – > 4 8m7s 4m49s 41s{code} > > >
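The grouped, per-thread output scheme described above can be sketched as follows (an illustrative skeleton only, not the actual oiv code; it writes arbitrary strings instead of parsed INode records):

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch of the parallel-output strategy described above: sub-sections are
// split into ordered groups, each thread writes its group to its own file,
// and the per-thread files are concatenated in order at the end.
public class ParallelOutputSketch {
    public static Path run(List<String> subSections, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Path> parts = new ArrayList<>();
        List<Future<?>> futures = new ArrayList<>();
        int groupSize = (subSections.size() + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            Path part = Files.createTempFile("oiv-part-" + t, ".txt");
            parts.add(part);
            int from = Math.min(t * groupSize, subSections.size());
            int to = Math.min(from + groupSize, subSections.size());
            List<String> group = subSections.subList(from, to);
            futures.add(pool.submit(() -> {
                Files.write(part, group);  // each thread writes only its own file
                return null;
            }));
        }
        for (Future<?> f : futures) f.get();   // wait for all output threads
        pool.shutdown();
        Path merged = Files.createTempFile("oiv-merged", ".txt");
        for (Path part : parts) {              // merge in group order
            Files.write(merged, Files.readAllLines(part), StandardOpenOption.APPEND);
            Files.delete(part);
        }
        return merged;
    }

    public static void main(String[] args) throws Exception {
        Path merged = run(Arrays.asList("sec-0", "sec-1", "sec-2", "sec-3"), 2);
        System.out.println(Files.readAllLines(merged));  // sections in original order
        Files.delete(merged);
    }
}
```

Because the groups are assigned in order and merged in the same order, the merged file preserves the serial output order, which matches the MergeTime column in the measurement above.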
[jira] [Updated] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
[ https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15987: - Attachment: Improve_oiv_tool_001.pdf > Improve oiv tool to parse fsimage file in parallel with delimited format > > > Key: HDFS-15987 > URL: https://issues.apache.org/jira/browse/HDFS-15987 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Labels: pull-request-available > Attachments: Improve_oiv_tool_001.pdf > > Time Spent: 2h 20m > Remaining Estimate: 0h > > The purpose of this Jira is to improve oiv tool to parse fsimage file with > sub-sections (see -HDFS-14617-) in parallel with delimited format. > 1. Serial parsing is time-consuming > The time to serially parse a large fsimage with delimited format (e.g. `hdfs > oiv -p Delimited -t ...`) is as follows: > {code:java} > 1) Loading string table: -> Not time consuming. > 2) Loading inode references: -> Not time consuming > 3) Loading directories in INode section: -> Slightly time consuming (3%) > 4) Loading INode directory section: -> A bit time consuming (11%) > 5) Output: -> Very time consuming (86%){code} > Therefore, output is the stage most worth parallelizing. > 2. How to output in parallel > The sub-sections are grouped in order, and each thread processes a group and > outputs it to the file corresponding to each thread, and finally merges the > output files. > 3. The result of a test > {code:java} > input fsimage file info: > 3.4G, 12 sub-sections, 55976500 INodes > - > Threads TotalTime OutputTime MergeTime > 1 18m37s 16m18s – > 4 8m7s 4m49s 41s{code} > > >
[jira] [Commented] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
[ https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402581#comment-17402581 ] Hongbing Wang commented on HDFS-15987: -- [~mofei] The PR works well in our cluster. I will give an online report in the next few days. Thank you for your attention. > Improve oiv tool to parse fsimage file in parallel with delimited format > > > Key: HDFS-15987 > URL: https://issues.apache.org/jira/browse/HDFS-15987 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Labels: pull-request-available > Time Spent: 2h 20m > Remaining Estimate: 0h > > The purpose of this Jira is to improve oiv tool to parse fsimage file with > sub-sections (see -HDFS-14617-) in parallel with delimited format. > 1. Serial parsing is time-consuming > The time to serially parse a large fsimage with delimited format (e.g. `hdfs > oiv -p Delimited -t ...`) is as follows: > {code:java} > 1) Loading string table: -> Not time consuming. > 2) Loading inode references: -> Not time consuming > 3) Loading directories in INode section: -> Slightly time consuming (3%) > 4) Loading INode directory section: -> A bit time consuming (11%) > 5) Output: -> Very time consuming (86%){code} > Therefore, output is the stage most worth parallelizing. > 2. How to output in parallel > The sub-sections are grouped in order, and each thread processes a group and > outputs it to the file corresponding to each thread, and finally merges the > output files. > 3. The result of a test > {code:java} > input fsimage file info: > 3.4G, 12 sub-sections, 55976500 INodes > - > Threads TotalTime OutputTime MergeTime > 1 18m37s 16m18s – > 4 8m7s 4m49s 41s{code} > > >
[jira] [Commented] (HDFS-14788) Use dynamic regex filter to ignore copy of source files in Distcp
[ https://issues.apache.org/jira/browse/HDFS-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372761#comment-17372761 ] Hongbing Wang commented on HDFS-14788: -- Is there a plan to filter files by modtime? In the scenario of incremental data synchronization, if files in certain time windows can be specified, efficiency can be greatly improved. > Use dynamic regex filter to ignore copy of source files in Distcp > - > > Key: HDFS-14788 > URL: https://issues.apache.org/jira/browse/HDFS-14788 > Project: Hadoop HDFS > Issue Type: Improvement > Components: distcp >Affects Versions: 3.2.1 >Reporter: Mukund Thakur >Assignee: Mukund Thakur >Priority: Major > Fix For: 3.3.0 > > > There is a feature in Distcp where we can ignore specific files to get copied > to the destination. This is currently based on a filter regex which is read > from a specific file. The process of creating different regex file for > different distcp jobs seems like a tedious task. What we are proposing is to > expose a regex_filter parameter which can be set during Distcp job creation > and use this filter in a new implementation CopyFilter class.
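The proposed `regex_filter` parameter could look roughly like this (a hypothetical sketch; the real Distcp `CopyFilter` works on Hadoop `Path` objects and configuration, which are omitted here):

```java
import java.util.regex.Pattern;

// Sketch of a regex-based copy filter along the lines described above:
// a single filter regex supplied as a job parameter decides which source
// paths are skipped. Class and method names here are illustrative, not
// the actual Distcp CopyFilter API.
public class RegexCopyFilter {
    private final Pattern pattern;

    public RegexCopyFilter(String filterRegex) {
        this.pattern = Pattern.compile(filterRegex);
    }

    /** Returns true if the path should be copied (i.e. does NOT match the filter). */
    public boolean shouldCopy(String path) {
        return !pattern.matcher(path).matches();
    }

    public static void main(String[] args) {
        // Skip in-flight and temporary files anywhere under the source tree.
        RegexCopyFilter filter = new RegexCopyFilter(".*\\._COPYING_$|.*/_temporary/.*");
        System.out.println(filter.shouldCopy("/data/part-00000"));        // true
        System.out.println(filter.shouldCopy("/data/_temporary/part-0")); // false
    }
}
```

A modtime-window filter, as asked in the comment, would be a natural second implementation of the same `shouldCopy` contract, comparing a file's modification time against a configured window instead of matching its path.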
[jira] [Commented] (HDFS-15650) Make the socket timeout for computing checksum of striped blocks configurable
[ https://issues.apache.org/jira/browse/HDFS-15650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371155#comment-17371155 ] Hongbing Wang commented on HDFS-15650: -- [~yhaya] [~weichiu] Hi! In our practice, when there are a large number of EC checksum operations (such as distcp with checksum), there are many socket timeouts, which generally succeed on retry. (Note: -HDFS-15709- has been merged). I think it makes sense to fix the hard-code. New config `dfs.checksum.ec.socket-timeout` looks good. Do you have any plan to fix this issue? Thanks! > Make the socket timeout for computing checksum of striped blocks configurable > - > > Key: HDFS-15650 > URL: https://issues.apache.org/jira/browse/HDFS-15650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, ec, erasure-coding >Reporter: Yushi Hayasaka >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When the DataNode tries to get the checksum of EC internal blocks from > another DataNode to compute the checksum of striped blocks, the timeout is > hard-coded now; it should be configurable.
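The proposed fix pattern, reading `dfs.checksum.ec.socket-timeout` with a fallback to a hard-coded value, might look like this (the default below and the toy configuration map are illustrative, not taken from the actual patch):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of making the striped-checksum socket timeout configurable, as
// proposed above. The plain Map stands in for Hadoop's Configuration class;
// the default value is illustrative only.
public class EcChecksumTimeout {
    static final String KEY = "dfs.checksum.ec.socket-timeout";
    static final int DEFAULT_MILLIS = 3000;  // stand-in for the old hard-coded value

    static int socketTimeout(Map<String, String> conf) {
        // Fall back to the former hard-coded value when the key is unset,
        // so existing deployments keep their current behavior.
        String v = conf.get(KEY);
        return v == null ? DEFAULT_MILLIS : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(socketTimeout(conf));  // default when unset
        conf.put(KEY, "10000");
        System.out.println(socketTimeout(conf));  // operator-tuned value
    }
}
```

With Hadoop's real `Configuration`, the equivalent call would be a `getInt(key, default)` lookup at the point where the checksum socket is created.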
[jira] [Created] (HDFS-16018) Optimize the display of hdfs "count -e" or "count -t" command
Hongbing Wang created HDFS-16018: Summary: Optimize the display of hdfs "count -e" or "count -t" command Key: HDFS-16018 URL: https://issues.apache.org/jira/browse/HDFS-16018 Project: Hadoop HDFS Issue Type: Improvement Components: dfsclient Reporter: Hongbing Wang Assignee: Hongbing Wang Attachments: fs_count_fixed.png, fs_count_origin.png The display of `fs -count -e` or `fs -count -t` is not aligned. *Current display:* *!fs_count_origin.png|width=1184,height=156!* *Fixed display:* *!fs_count_fixed.png|width=1217,height=157!*
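The alignment fix amounts to printing every column with a fixed width, e.g. via `String.format` (the widths and column names below are illustrative, not the actual `Count` command code):

```java
// Sketch of the fixed-width formatting fix implied above: pad each column
// to a constant width so `fs -count -t` output lines up regardless of how
// many digits each value has. Widths and column names are illustrative.
public class CountDisplaySketch {
    static String formatRow(String storageType, long quota, long used) {
        // %13s / %17d pick widths at least as wide as the header text,
        // right-aligning numbers under their headers.
        return String.format("%13s %17d %17d", storageType, quota, used);
    }

    public static void main(String[] args) {
        System.out.println(String.format("%13s %17s %17s", "STORAGE_TYPE", "QUOTA", "USED"));
        System.out.println(formatRow("DISK", 107374182400L, 9663676416L));
        System.out.println(formatRow("ARCHIVE", -1, 0));
    }
}
```

Every row then has the same length as the header line, which is exactly the misalignment the attached before/after screenshots illustrate.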
[jira] [Created] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
Hongbing Wang created HDFS-15987: Summary: Improve oiv tool to parse fsimage file in parallel with delimited format Key: HDFS-15987 URL: https://issues.apache.org/jira/browse/HDFS-15987 Project: Hadoop HDFS Issue Type: Improvement Reporter: Hongbing Wang The purpose of this Jira is to improve oiv tool to parse fsimage file with sub-sections (see -HDFS-14617-) in parallel with delimited format. 1. Serial parsing is time-consuming The time to serially parse a large fsimage with delimited format (e.g. `hdfs oiv -p Delimited -t ...`) is as follows: {code:java} 1) Loading string table: -> Not time consuming. 2) Loading inode references: -> Not time consuming 3) Loading directories in INode section: -> Slightly time consuming (3%) 4) Loading INode directory section: -> A bit time consuming (11%) 5) Output: -> Very time consuming (86%){code} Therefore, output is the stage most worth parallelizing. 2. How to output in parallel The sub-sections are grouped in order, and each thread processes a group and outputs it to the file corresponding to each thread, and finally merges the output files. 3. The result of a test {code:java} input fsimage file info: 3.4G, 12 sub-sections, 55976500 INodes - Threads TotalTime OutputTime MergeTime 1 18m37s 16m18s – 4 8m7s 4m49s 41s{code}
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: (was: HDFS-15858-branch-3.1.002.patch) > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.1.002.patch > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.1.002.patch, > HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Commented] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292106#comment-17292106 ] Hongbing Wang commented on HDFS-15858: -- Resubmit [^HDFS-15858-branch-3.1.002.patch] to trigger UT. > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Commented] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291443#comment-17291443 ] Hongbing Wang commented on HDFS-15858: -- {{Note:}} * {{backport to branch-3.1: {color:#0747a6}use branch-3.1.xxx.patch{color}}} * {{backport to branch-3.2: {color:#0747a6}use branch-3.2.xxx.patch{color}}} * {{backport to branch-3.3: }}{{Directly use the -HDFS-14694- latest patch}} Considering that the lower-version PR in -HDFS-15684- depends on this Jira, we should complete this PR first. > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Summary: Backport HDFS-14694 to branch-3.1/3.2/3.3 (was: Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3) > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.2.002.patch > Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3 > > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.1.002.patch > Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3 > > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: (was: HDFS-15858-branch-3.2.001.patch)
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: (was: HDFS-15858-branch-3.3.001.patch)
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.2.001.patch
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.3.001.patch
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.1.001.patch
[jira] [Created] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
Hongbing Wang created HDFS-15858: Summary: Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3 Key: HDFS-15858 URL: https://issues.apache.org/jira/browse/HDFS-15858 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client Reporter: Hongbing Wang Assignee: Hongbing Wang -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. The original patches conflict with the lower-version branches, so backport them.
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290745#comment-17290745 ] Hongbing Wang commented on HDFS-15684: -- [~ferhui] ok. Because this PR depends on -HDFS-14694,- I will backport them in another Jira later. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, HDFS-15684.003.patch > > > -HDFS-14694- added a feature that calls the recoverLease operation automatically when DFSOutputStream close encounters an exception. When we wanted to apply this feature to our cluster, we found that it does not support EC files. > I think this feature should take effect for both replicated and EC files. > This Jira proposes to make it effective in the case of EC files.
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277006#comment-17277006 ] Hongbing Wang commented on HDFS-15779: -- [~ferhui] Thanks for the guidance. Fixed the code style in [^HDFS-15779.002.patch]. > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15779.001.patch, HDFS-15779.002.patch > > > The NullPointerException in the DN log is as follows: > {code:java} > 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY > //... > 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out > 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 > java.lang.NullPointerException > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50010 > {code} > The NPE occurs at `writer.getTargetBuffer()` in this code: > {code:java} > // StripedWriter#clearBuffers > void clearBuffers() { > for (StripedBlockWriter writer : writers) { > ByteBuffer targetBuffer = writer.getTargetBuffer(); > if (targetBuffer != null) { > targetBuffer.clear(); > } > } > } > {code} > So, why is the writer null? Let's track when the writer is initialized and when reconstruct() is called: > {code:java} > // StripedBlockReconstructor#run > public void run() { > try { > initDecoderIfNecessary(); > getStripedReader().init(); > stripedWriter.init(); //① > reconstruct(); //② > stripedWriter.endTargetBlocks(); > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > // ...{code} > They are called at ① and ② above, respectively. `stripedWriter.init()` -> `initTargetStreams()`: > {code:java} > // StripedWriter#initTargetStreams > int initTargetStreams() { > int nSuccess = 0; > for (short i = 0; i < targets.length; i++) { > try { > writers[i] = createWriter(i); > nSuccess++; > targetsStatus[i] = true; > } catch (Throwable e) { > LOG.warn(e.getMessage()); > } > } > return nSuccess; > } > {code} > The NPE occurs when createWriter() throws an exception and 0 < nSuccess < targets.length.
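The guard this description calls for can be sketched in isolation. The classes below are simplified stand-ins, not the real org.apache.hadoop.hdfs classes: `writers[i]` stays null when `createWriter(i)` threw, so `clearBuffers()` must skip null slots before dereferencing them.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for StripedBlockWriter: just owns a target buffer.
class StripedBlockWriterStub {
    private final ByteBuffer targetBuffer = ByteBuffer.allocate(8);
    ByteBuffer getTargetBuffer() { return targetBuffer; }
}

public class ClearBuffersDemo {
    // Sketch of the fixed clearBuffers(): returns how many buffers were cleared.
    static int clearBuffers(StripedBlockWriterStub[] writers) {
        int cleared = 0;
        for (StripedBlockWriterStub writer : writers) {
            if (writer == null) {
                // The missing guard: createWriter(i) failed, slot was never filled.
                continue;
            }
            ByteBuffer targetBuffer = writer.getTargetBuffer();
            if (targetBuffer != null) {
                targetBuffer.clear();
                cleared++;
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        // Simulate 0 < nSuccess < targets.length: the middle writer failed to init.
        StripedBlockWriterStub[] writers = {
            new StripedBlockWriterStub(), null, new StripedBlockWriterStub()
        };
        System.out.println(clearBuffers(writers)); // prints 2: null slot is skipped, no NPE
    }
}
```

Without the null check, the loop would throw the same NullPointerException as soon as it reached the failed slot.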
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Attachment: HDFS-15779.002.patch
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276167#comment-17276167 ] Hongbing Wang commented on HDFS-15779: -- [~ferhui] Thanks for the review! From a structural point of view, using *if (targetsStatus[i])* would be cleanest, but I was worried it would cause problems. Because targetsStatus[i] may be changed in _StripedWriter#transferData2Targets_, targetsStatus[i] and writers[i] would no longer correspond one to one. Note that they do correspond before that point. {code:java} // StripedWriter#transferData2Targets int transferData2Targets() { int nSuccess = 0; for (int i = 0; i < targets.length; i++) { if (targetsStatus[i]) { boolean success = false; try { writers[i].transferData2Target(packetBuf); nSuccess++; success = true; } catch (IOException e) { LOG.warn(e.getMessage()); } targetsStatus[i] = success; // may be false here } } return nSuccess; } {code} If _transferData2Target()_ throws an IOException, _writers[i]_ may still need clearBuffers() to be called, I think. Is that so? Thanks again.
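The one-to-one concern raised in the comment above can be made concrete with a toy model (plain arrays, not the actual Hadoop classes): once transferData2Targets() flips targetsStatus[i] to false for a writer that *was* successfully created, a status-based guard in clearBuffers() would skip that writer's live buffer, while a null check still reaches it.

```java
public class TargetsStatusDemo {
    // Returns { buffersReachedWithStatusGuard, buffersReachedWithNullGuard }.
    // Writers are plain marker objects here; what matters is null vs non-null.
    static int[] countCleared(boolean[] targetsStatus, Object[] writers) {
        int byStatus = 0, byNull = 0;
        for (int i = 0; i < writers.length; i++) {
            if (targetsStatus[i]) byStatus++; // skips a writer whose transfer failed
            if (writers[i] != null) byNull++; // reaches every writer that was created
        }
        return new int[]{ byStatus, byNull };
    }

    public static void main(String[] args) {
        // Writer 1 was created (non-null), but its transfer threw an IOException,
        // so transferData2Targets() left its status as false.
        boolean[] targetsStatus = { true, false, true };
        Object[] writers = { new Object(), new Object(), new Object() };
        int[] cleared = countCleared(targetsStatus, writers);
        System.out.println(cleared[0] + " vs " + cleared[1]); // prints "2 vs 3"
    }
}
```

The status guard misses one live buffer; the null guard in the posted patch clears all three, which matches the argument that a failed transfer should not exempt a writer from clearBuffers().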
[jira] [Comment Edited] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276167#comment-17276167 ] Hongbing Wang edited comment on HDFS-15779 at 2/1/21, 9:16 AM
[jira] [Resolved] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang resolved HDFS-15797. -- Resolution: Duplicate > EC: reconstruction threads limit parameter does not take effect > --- > > Key: HDFS-15797 > URL: https://issues.apache.org/jira/browse/HDFS-15797 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > -HDFS-12044- changed _SynchronousQueue_ in stripedReconstructionPool to > unbounded _LinkedBlockingQueue_, which caused the _maximumPoolSize_ to be > invalid. The parameter +dfs.datanode.ec.reconstruction.threads+ (defaults to > 8) is therefore invalid. This parameter is misleading here, or we need to > modify the code to make it effective. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
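The queue/maximumPoolSize interaction described in HDFS-15797 is standard java.util.concurrent behavior and can be checked directly: with an unbounded LinkedBlockingQueue, ThreadPoolExecutor only spawns threads up to corePoolSize and queues everything else, so the larger maximumPoolSize never takes effect. The pool sizes below are illustrative, not the DataNode's actual configuration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizeDemo {
    // Submit more blocking tasks than corePoolSize and report how many
    // threads the pool actually created before the tasks are released.
    static int observedPoolSize(int core, int max, int tasks) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            core, max, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try { release.await(); } catch (InterruptedException ignored) { }
            });
        }
        // Threads beyond corePoolSize are only created when the queue rejects
        // an offer; an unbounded LinkedBlockingQueue never rejects.
        int size = pool.getPoolSize();
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return size;
    }

    public static void main(String[] args) throws InterruptedException {
        // maximumPoolSize=8 is ignored: the unbounded queue absorbs the backlog.
        System.out.println(observedPoolSize(2, 8, 8)); // prints 2, not 8
    }
}
```

This is why a "threads" limit wired only into maximumPoolSize silently stops working after switching from a SynchronousQueue (which always forces new threads up to the maximum) to an unbounded queue.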
[jira] [Commented] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273487#comment-17273487 ] Hongbing Wang commented on HDFS-15797: -- Thanks [~sodonnell]! Yes, it should be closed.
[jira] [Commented] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273427#comment-17273427 ] Hongbing Wang commented on HDFS-15797: -- Sorry, HDFS-14367 has already solved this.
[jira] [Updated] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15797: - Description: -HDFS-12044- changed _SynchronousQueue_ in stripedReconstructionPool to unbounded _LinkedBlockingQueue_, which caused the _maximumPoolSize_ to be invalid. The parameter +dfs.datanode.ec.reconstruction.threads+ (defaults to 8) is therefore invalid. This parameter is misleading here, or we need to modify the code to make it effective.
[jira] [Created] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
Hongbing Wang created HDFS-15797: Summary: EC: reconstruction threads limit parameter does not take effect Key: HDFS-15797 URL: https://issues.apache.org/jira/browse/HDFS-15797 Project: Hadoop HDFS Issue Type: Bug Reporter: Hongbing Wang Assignee: Hongbing Wang
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272020#comment-17272020 ] Hongbing Wang commented on HDFS-15779: -- just fix NPE in [^HDFS-15779.001.patch]. If the writer that is not involved in the reconstruction is null, the reconstruction can be also successful. So don’t care about writer which is null when clearBuffers(). > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15779.001.patch > > > The NullPointerException in DN log as follows: > {code:java} > 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY > //... > 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Connection timed out > 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Failed to reconstruct striped block: > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Receiving > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 > src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 > 010 > {code} > NPE occurs at `writer.getTargetBuffer()` in codes: > {code:java} > // StripedWriter#clearBuffers > void clearBuffers() { > for (StripedBlockWriter writer : writers) { > ByteBuffer targetBuffer = writer.getTargetBuffer(); > if (targetBuffer != null) { > targetBuffer.clear(); > } > } > } > {code} > So, why is the writer null? Let's track when the writer is initialized and > when reconstruct() is called, as follows: > {code:java} > // StripedBlockReconstructor#run > public void run() { > try { > initDecoderIfNecessary(); > getStripedReader().init(); > stripedWriter.init(); //① > reconstruct(); //② > stripedWriter.endTargetBlocks(); > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > // ...{code} > They are called at ① and ② above respectively. `stripedWriter.init()` -> > `initTargetStreams()`, as follows: > {code:java} > // StripedWriter#initTargetStreams > int initTargetStreams() { > int nSuccess = 0; > for (short i = 0; i < targets.length; i++) { > try { > writers[i] = createWriter(i); > nSuccess++; > targetsStatus[i] = true; > } catch (Throwable e) { > LOG.warn(e.getMessage()); > } > } > return nSuccess; > } > {code} > NPE occurs when createWriter() gets an exception and 0 < nSuccess < > targets.length. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
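The null-skip fix discussed above can be sketched with a small, self-contained model. The `Writer` class below is a stand-in for Hadoop's StripedBlockWriter, not the real implementation; only the loop structure mirrors StripedWriter#clearBuffers.

```java
import java.nio.ByteBuffer;

public class ClearBuffersDemo {

    // Stand-in for StripedBlockWriter: just holds a target buffer.
    static class Writer {
        private final ByteBuffer targetBuffer;
        Writer(ByteBuffer buf) { this.targetBuffer = buf; }
        ByteBuffer getTargetBuffer() { return targetBuffer; }
    }

    // Null-safe version of the clearBuffers() loop: a slot in writers[] can
    // be null when createWriter() threw during initTargetStreams(), so the
    // writer itself must be checked before its buffer is touched.
    static void clearBuffers(Writer[] writers) {
        for (Writer writer : writers) {
            if (writer == null) {
                continue; // skip targets that failed to initialize
            }
            ByteBuffer targetBuffer = writer.getTargetBuffer();
            if (targetBuffer != null) {
                targetBuffer.clear();
            }
        }
    }

    // Builds one healthy writer (buffer position 4) and one null slot, runs
    // clearBuffers(), and reports buffer state: position of the live buffer,
    // and -1 for the null slot.
    public static int[] demo() {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putInt(42); // position is now 4
        Writer[] writers = { new Writer(buf), null };
        clearBuffers(writers); // must not throw despite the null slot
        return new int[] {
            writers[0].getTargetBuffer().position(),
            writers[1] == null ? -1 : 0
        };
    }

    public static void main(String[] args) {
        int[] positions = demo();
        System.out.println(positions[0] + " " + positions[1]); // prints "0 -1"
    }
}
```

Without the `writer == null` guard, the loop would throw an NPE on the second slot exactly as in the stack trace above.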
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Attachment: HDFS-15779.001.patch
[jira] [Comment Edited] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267024#comment-17267024 ] Hongbing Wang edited comment on HDFS-15779 at 1/18/21, 5:26 AM: I have two issues to discuss: * Should the exception be thrown only when `initTargetStreams() == 0`, rather than when the result is `< targets.length`? {code:java} // StripedWriter#init if (initTargetStreams() == 0) { String error = "All targets are failed."; throw new IOException(error); }{code} * Is simply checking whether the writer is null the best fix? {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } {code}
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267024#comment-17267024 ] Hongbing Wang commented on HDFS-15779: -- I have two issues to discuss: # Should the exception be thrown only when `initTargetStreams() == 0`, rather than when the result is `< targets.length`? {code:java} // StripedWriter#init if (initTargetStreams() == 0) { String error = "All targets are failed."; throw new IOException(error); }{code} # Is simply checking whether the writer is null the best fix? {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } {code}
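The partial-success scenario described in the issue (createWriter() throws for some targets, so 0 < nSuccess < targets.length slips past the `== 0` check) can be sketched as a self-contained model. The names mirror StripedWriter#initTargetStreams, but this is a simplified stand-in, not the Hadoop code.

```java
import java.util.function.IntFunction;

public class InitTargetStreamsDemo {

    static final class Writer { }

    // Mirrors initTargetStreams(): per-target failures are swallowed, so a
    // partial success (0 < nSuccess < targets) leaves null slots in writers[]
    // while still passing the "all failed" guard in init().
    static Writer[] initTargetStreams(int targets, IntFunction<Writer> createWriter) {
        Writer[] writers = new Writer[targets];
        int nSuccess = 0;
        for (int i = 0; i < targets; i++) {
            try {
                writers[i] = createWriter.apply(i);
                nSuccess++;
            } catch (RuntimeException e) {
                // swallowed, like LOG.warn(e.getMessage()) in the real code
            }
        }
        if (nSuccess == 0) { // the only guard in StripedWriter#init
            throw new IllegalStateException("All targets are failed.");
        }
        return writers;
    }

    static long countNullSlots(Writer[] writers) {
        long n = 0;
        for (Writer w : writers) {
            if (w == null) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Target 1 fails to create; targets 0 and 2 succeed.
        Writer[] writers = initTargetStreams(3,
                i -> { if (i == 1) throw new RuntimeException("connect failed"); return new Writer(); });
        // init() did not throw (nSuccess == 2 > 0), yet a null slot remains --
        // exactly the state that later trips clearBuffers().
        System.out.println(countNullSlots(writers)); // prints 1
    }
}
```

This illustrates why both questions in the comment matter: with the `== 0` guard, any null slot surviving init must be tolerated by every later loop over `writers`.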
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } 
{code} So, why is the writer null? Let's track when the writer is initialized and when reconstruct() is called, as follows: {code:java} // StripedBlockReconstructor#run public void run() { try { initDecoderIfNecessary(); getStripedReader().init(); stripedWriter.init(); //① reconstruct(); //② stripedWriter.endTargetBlocks(); } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); // ...{code} They are called at ① and ② above respectively. `stripedWriter.init()` -> `initTargetStreams()`, as follows: {code:java} // StripedWriter#initTargetStreams int initTargetStreams() { int nSuccess = 0; for (short i = 0; i < targets.length; i++) { try { writers[i] = createWriter(i); nSuccess++; targetsStatus[i] = true; } catch (Throwable e) { LOG.warn(e.getMessage()); } } return nSuccess; } {code} NPE occurs when createWriter() gets an exception and 0 < nSuccess < targets.length. was: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 
2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBu
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } 
{code} So, why is the writer null? Let's track when the writer is initialized and when reconstruct() is called, as follows: {code:java} // StripedBlockReconstructor#run public void run() { try { initDecoderIfNecessary(); getStripedReader().init(); stripedWriter.init(); //① reconstruct(); //② stripedWriter.endTargetBlocks(); } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); // ...{code} They are called at ① and ② above respectively. `stripedWriter.init()` -> `initTargetStreams()`, as follows: {code:java} // StripedWriter#initTargetStreams int initTargetStreams() { int nSuccess = 0; for (short i = 0; i < targets.length; i++) { try { writers[i] = createWriter(i); nSuccess++; targetsStatus[i] = true; } catch (Throwable e) { LOG.warn(e.getMessage()); } } return nSuccess; } {code} NPE occurs when createWriter(i) gets an exception and 0 < nSuccess < targets.length. was: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 
2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } {code} So, why is the writer 
null? Let's track when the writer is initialized and when reconstruct() is called, as follows: {code:java} // StripedBlockReconstructor#run public void run() { try { initDecoderIfNecessary(); getStripedReader().init(); stripedWriter.init(); //① reconstruct(); //② stripedWriter.endTargetBlocks(); } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); // ...{code} They are called at ① and ② above respectively. `stripedWriter.init()` -> `initTargetStreams()`, as follows: and `writers[i] = createWriter(i)` ` was: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving 
BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs in writer.getTargetBuffer(); > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > The NullPointerException in DN log as follows: > {code:java} > 2020-12-28 15:49
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs in writer.getTargetBuffer(); > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 
>Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > The NullPointerException in DN log as follows: > > {code:java} > 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY > //... > 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Connection timed out > 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Failed to reconstruct striped block: > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Receiving > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 > src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 > 010 > {code} > NPE occurs in writer.getTargetBuffer(); > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
Hongbing Wang created HDFS-15779: Summary: EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block Key: HDFS-15779 URL: https://issues.apache.org/jira/browse/HDFS-15779 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.2.0 Reporter: Hongbing Wang Assignee: Hongbing Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236173#comment-17236173 ] Hongbing Wang commented on HDFS-15684: -- `TestDFSOutputStream` passes locally. The other tests that failed with OOM also pass locally when a random sample of them is rerun. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, > HDFS-15684.003.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236021#comment-17236021 ] Hongbing Wang commented on HDFS-15684: -- Thanks [~ferhui], [~hexiaoqiao]. Fix the checkstyle in 003.patch. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, > HDFS-15684.003.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Attachment: HDFS-15684.003.patch > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, > HDFS-15684.003.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232712#comment-17232712 ] Hongbing Wang commented on HDFS-15684: -- add Tests in v2 patch. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Attachment: HDFS-15684.002.patch > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Attachment: HDFS-15684.001.patch > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Description: -HDFS-14694- added a feature that calls the recoverLease operation automatically when DFSOutputStream close encounters an exception. When we wanted to apply this feature to our cluster, we found that it does not support EC files. I think this feature should take effect for both replicated files and EC files. This Jira proposes to make it effective in the case of EC files as well. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > -HDFS-14694- added a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. When we > wanted to apply this feature to our cluster, we found that it does not > support EC files. > I think this feature should take effect for both replicated files and EC > files. This Jira proposes to make it effective in the case of EC files as > well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
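The description boils down to a close-time pattern: if close() throws, trigger lease recovery before propagating the error, regardless of whether the stream is replicated or striped. A hedged, self-contained sketch of that pattern (LeaseRecoverer is a stand-in for DistributedFileSystem.recoverLease; this is an illustration of the idea, not the HDFS-15684 patch):

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseRecoverySketch {
    // Hypothetical stand-in for DistributedFileSystem.recoverLease(Path).
    interface LeaseRecoverer {
        boolean recoverLease();
    }

    // If close() throws, request lease recovery before rethrowing, so the
    // half-written file is not left open under the failed client's lease.
    static void closeWithRecovery(Closeable stream, LeaseRecoverer recoverer)
            throws IOException {
        try {
            stream.close();
        } catch (IOException e) {
            recoverer.recoverLease();  // best effort; callers may poll until true
            throw e;
        }
    }

    public static void main(String[] args) {
        final boolean[] recovered = {false};
        try {
            closeWithRecovery(
                () -> { throw new IOException("simulated close failure"); },
                () -> recovered[0] = true);
        } catch (IOException expected) {
            // the close failure still surfaces to the caller
        }
        System.out.println("recovery invoked: " + recovered[0]); // recovery invoked: true
    }
}
```

The point of the Jira is that this recovery path should be taken by DFSStripedOutputStream as well, not only by the replicated-file DFSOutputStream.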
[jira] [Created] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
Hongbing Wang created HDFS-15684: Summary: EC: Call recoverLease on DFSStripedOutputStream close exception Key: HDFS-15684 URL: https://issues.apache.org/jira/browse/HDFS-15684 Project: Hadoop HDFS Issue Type: Improvement Components: dfsclient, ec Reporter: Hongbing Wang Assignee: Hongbing Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15668) RBF: Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227128#comment-17227128 ] Hongbing Wang commented on HDFS-15668: -- Both hadoop.security.TestLdapGroupsMapping and hadoop.hdfs.server.federation.router.TestRouterRpc pass locally. > RBF: Fix RouterRPCMetrics annocation and document misplaced error > - > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15668) RBF: Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226536#comment-17226536 ] Hongbing Wang commented on HDFS-15668: -- [~ferhui] Could you help take a look? > RBF: Fix RouterRPCMetrics annocation and document misplaced error > - > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) RBF: Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Summary: RBF: Fix RouterRPCMetrics annocation and document misplaced error (was: Fix RouterRPCMetrics annocation and document misplaced error) > RBF: Fix RouterRPCMetrics annocation and document misplaced error > - > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Summary: Fix RouterRPCMetrics annocation and document misplaced error (was: Fix RouterRPCMetrics annocation and document error) > Fix RouterRPCMetrics annocation and document misplaced error > > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) Fix RouterRPCMetrics annocation and document error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Attachment: HDFS-15668.001.patch > Fix RouterRPCMetrics annocation and document error > -- > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) Fix RouterRPCMetrics annocation and document error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Description: I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ and +{{ProxyOpFailureCommunicate}}+ in the [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] may be misplaced. When I reviewed the code to see the meaning of the two fields, I found that their descriptions were indeed misplaced. _Origin code_: {code:java} @Metric("Number of operations to fail to reach NN") private MutableCounterLong proxyOpFailureStandby; @Metric("Number of operations to hit a standby NN") private MutableCounterLong proxyOpFailureCommunicate; {code} _They should be_: {code:java} @Metric("Number of operations to hit a standby NN") private MutableCounterLong proxyOpFailureStandby; @Metric("Number of operations to fail to reach NN") private MutableCounterLong proxyOpFailureCommunicate; {code} > Fix RouterRPCMetrics annocation and document error > -- > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. 
> _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15668) Fix RouterRPCMetrics annocation and document error
Hongbing Wang created HDFS-15668: Summary: Fix RouterRPCMetrics annocation and document error Key: HDFS-15668 URL: https://issues.apache.org/jira/browse/HDFS-15668 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Affects Versions: 3.2.0 Reporter: Hongbing Wang Assignee: Hongbing Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: HDFS-15641.addendum.patch) > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Comment: was deleted (was: Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one dn also works for this patch. So I improved UT with one dn. [^HDFS-15641.addendum.patch] is a addendum patch after v003. ) > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221188#comment-17221188 ] Hongbing Wang edited comment on HDFS-15641 at 10/27/20, 6:53 AM: - Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one dn also works for this patch. So I improved UT with one dn. [^HDFS-15641.addendum.patch] is a addendum patch after v003. was (Author: wanghongbing): Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one dn also works for this patch. So I improved UT with one dn. [^HDFS-15641.addendum.patch] is a addendum patch. > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221188#comment-17221188 ] Hongbing Wang commented on HDFS-15641: -- Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one DataNode also works for this patch, so I improved the UT to use one DataNode. [^HDFS-15641.addendum.patch] is an addendum patch. > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.addendum.patch > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220650#comment-17220650 ] Hongbing Wang commented on HDFS-15641: -- Thanks! Expect it to be merged!:D > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219713#comment-17219713 ] Hongbing Wang commented on HDFS-15641: -- I have provided two alternative patch versions, [^HDFS-15641.002.patch] and [^HDFS-15641.003.patch]. 003.patch simply moves the UT into TestRefreshNamenodes.java. > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.003.patch > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219672#comment-17219672 ] Hongbing Wang commented on HDFS-15641:
--
Thanks [~ferhui].
{quote}Is it right?{quote}
Yes, you are right.
{quote}could you please move your UT there?{quote}
I will resubmit a patch shortly.
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219003#comment-17219003 ] Hongbing Wang commented on HDFS-15641:
--
Thanks [~ferhui] for your reply. I will explain in two steps.
(a) *The occurrence of the deadlock*: see the figure below; the corresponding jstack is [^jstack.log].
!deadlock.png|width=973,height=214!
Related locks: the monitor of the `BlockPoolManager` instance and the `read-write lock in BPOfferService`.
(b) *The fix I proposed*: In [^HDFS-15641.002.patch], I made 3 changes:
# `+BPOfferService.java+`: I only injected a test fault that delays 1s. It takes effect only in tests and does not affect the production environment; it makes both threads wait a short while after acquiring their respective locks.
# `+BPServiceActor.java+`: This is the actual fix. It ensures that `bpThread` is started only after the read lock has been acquired and released.
# `+TestRefreshNamenodesFailure.java+`: the test itself.
Merging changes 1 and 3 reproduces the deadlock; merging 1, 2 and 3 fixes it. The process after the fix is as follows:
!deadlock_fixed.png|width=1027,height=222!
Thanks again!
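The lock cycle described in (a) can be sketched outside Hadoop. In the sketch below, all names are illustrative stand-ins (`managerLock` for the `BlockPoolManager` monitor, `serviceLock` for the `BPOfferService` read-write lock), and timed `tryLock` calls stand in for the real indefinite hang so the demo terminates:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockSketch {
    // Stand-ins: managerLock ~ BlockPoolManager monitor, serviceLock ~ BPOfferService rw-lock.
    static final ReentrantLock managerLock = new ReentrantLock();
    static final ReentrantReadWriteLock serviceLock = new ReentrantReadWriteLock();

    interface TimedTry { boolean tryLock(long t, TimeUnit u) throws InterruptedException; }

    /** Returns true when each thread blocks on the lock the other one holds. */
    public static boolean demonstrateCycle() throws InterruptedException {
        CountDownLatch bothHold = new CountDownLatch(2);      // both locks are held
        CountDownLatch writeTried = new CountDownLatch(1);    // refresh gave up on the write lock
        CountDownLatch managerTried = new CountDownLatch(1);  // bp gave up on the manager lock
        boolean[] stuck = new boolean[2];

        Thread refresh = new Thread(() -> {
            managerLock.lock();                 // step 1: refreshNamenodes takes the manager lock
            try {
                bothHold.countDown(); await(bothHold);
                stuck[0] = !timedTry(serviceLock.writeLock()::tryLock); // step 3: blocked by the read lock
                writeTried.countDown();
                await(managerTried);            // keep holding until bp has tried the manager lock
            } finally { managerLock.unlock(); }
        });
        Thread bp = new Thread(() -> {
            serviceLock.readLock().lock();      // step 2: getBlockPoolId takes the read lock
            try {
                bothHold.countDown(); await(bothHold);
                await(writeTried);              // keep holding until refresh has tried the write lock
                stuck[1] = !timedTry(managerLock::tryLock);             // step 4: blocked by the manager lock
                managerTried.countDown();
            } finally { serviceLock.readLock().unlock(); }
        });
        refresh.start(); bp.start();
        refresh.join(); bp.join();
        return stuck[0] && stuck[1];
    }

    static boolean timedTry(TimedTry l) {
        try { return l.tryLock(200, TimeUnit.MILLISECONDS); }
        catch (InterruptedException e) { return false; }
    }
    static void await(CountDownLatch l) {
        try { l.await(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demonstrateCycle() ? "lock cycle: both threads blocked" : "no cycle");
    }
}
```

The latches force the same interleaving as the figure: both locks held first, then each thread tries to take the lock the other holds.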
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: deadlock_fixed.png
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218116#comment-17218116 ] Hongbing Wang commented on HDFS-15641:
--
Thanks [~hexiaoqiao] for your attention. There may be a bit of confusion here. *lifelineSender.start()* does not refer to starting the thread directly; LifelineSender overrides the start() method, as follows:
{code:java}
// BPServiceActor$LifelineSender#start
public void start() {
  lifelineThread = new Thread(this,
      formatThreadName("lifeline", lifelineNnAddr)); // formatThreadName is where the deadlock occurs
  lifelineThread.setDaemon(true);
  //...
  lifelineThread.start(); // the thread itself starts here
}

// formatThreadName
private String formatThreadName(
    final String action,
    final InetSocketAddress addr) {
  String bpId = bpos.getBlockPoolId(true);
  //...
}

// getBlockPoolId
String getBlockPoolId(boolean quiet) {
  // avoid lock contention unless the registration hasn't completed.
  String id = bpId;
  if (id != null) {
    return id;
  }
  DataNodeFaultInjector.get().delayWhenOfferServiceHoldLock();
  readLock(); // deadlock occurs here
  //...
}
{code}
To be precise, the deadlock occurs between `refreshThread` and `bpThread`. It involves the chain *start -> formatThreadName -> getBlockPoolId -> readLock and readUnlock*. So I ensure that readLock and readUnlock are completely executed before `bpThread` is started. The test I provided reproduces the deadlock before the fix and passes after it. Thanks [~hexiaoqiao] again.
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217588#comment-17217588 ] Hongbing Wang commented on HDFS-15641:
--
Fixed some issues in the UT; see [^HDFS-15641.002.patch].
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.002.patch
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216886#comment-17216886 ] Hongbing Wang edited comment on HDFS-15641 at 10/20/20, 3:14 AM:
--
I adjusted the thread start sequence in BPServiceActor to ensure that the first thread (*lifelineSender*) has acquired and released the read lock before the second thread (*bpThread*) is started.
Original code:
{code:java}
void start() {
  // ...
  bpThread.start();
  if (lifelineSender != null) {
    lifelineSender.start();
  }
}
{code}
New code:
{code:java}
void start() {
  // ...
  if (lifelineSender != null) {
    lifelineSender.start();
  }
  bpThread.start();
}
{code}
(1) The *lifelineSender* call chain: _lifelineSender.start() -> BPServiceActor.formatThreadName() -> getBlockPoolId() -> readLock() and readUnlock()_
(2) Afterward, *bpThread* is started.
So the deadlock is avoided, I think.
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.001.patch
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: HDFS-15641.000.test.patch)
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: HDFS-15641.001.patch)
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216878#comment-17216878 ] Hongbing Wang edited comment on HDFS-15641 at 10/19/20, 4:50 PM:
--
{quote}just wonder if this issue is also in trunk{quote}
Yes, it reproduces in trunk. [^HDFS-15641.000.test.patch] uses CyclicBarrier to control the thread execution order to reproduce the deadlock.
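A CyclicBarrier is a natural tool for pinning down a racy interleaving deterministically: no thread passes the barrier until every party has arrived. The following is a minimal standalone sketch of that technique (illustrative class and event names, not the actual test patch):

```java
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;

public class BarrierOrdering {
    /** Both workers must record "acquired" before either records "proceeding". */
    public static List<String> run() throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(2);
        List<String> events = new CopyOnWriteArrayList<>();
        Runnable worker = () -> {
            events.add("acquired");        // e.g. "I now hold my lock"
            try {
                barrier.await();           // block until the other thread holds its lock too
            } catch (InterruptedException | BrokenBarrierException e) {
                throw new RuntimeException(e);
            }
            events.add("proceeding");      // e.g. "now try to take the other lock"
        };
        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        // The first two events are always "acquired", whatever the scheduler does.
        System.out.println(run());
    }
}
```

Applied to the deadlock, the barrier guarantees both threads hold their first lock before either tries to take the second, which is exactly the interleaving that reproduces the hang.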
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216886#comment-17216886 ] Hongbing Wang commented on HDFS-15641:
--
I adjusted the thread start sequence in BPServiceActor to ensure that the first thread (*lifelineSender*) has acquired and released the read lock before the second thread (*bpThread*) is started.
Original code:
{code:java}
void start() {
  if ((bpThread != null) && (bpThread.isAlive())) {
    // Thread is started already
    return;
  }
  bpThread = new Thread(this);
  bpThread.setDaemon(true); // needed for JUnit testing
  bpThread.start();
  if (lifelineSender != null) {
    lifelineSender.start();
  }
}
{code}
New code:
{code:java}
void start() {
  if ((bpThread != null) && (bpThread.isAlive())) {
    // Thread is started already
    return;
  }
  bpThread = new Thread(this);
  bpThread.setDaemon(true); // needed for JUnit testing
  if (lifelineSender != null) {
    lifelineSender.start();
  }
  bpThread.start();
}
{code}
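The reorder works because LifelineSender#start does its read-lock work (formatThreadName -> getBlockPoolId) synchronously on the calling thread; by the time it returns, the read lock has already been released, so the subsequently started bpThread can no longer form the lock cycle. A simplified, illustrative sketch of that property (names are stand-ins, not the real Hadoop classes):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class StartOrderSketch {
    static final ReentrantReadWriteLock serviceLock = new ReentrantReadWriteLock();

    // Stand-in for LifelineSender#start -> formatThreadName -> getBlockPoolId:
    // the read lock is taken and released on the CALLER's thread, before returning.
    static String startLifelineSender() {
        serviceLock.readLock().lock();
        try {
            String name = "lifeline-bp-1"; // would come from getBlockPoolId()
            Thread lifeline = new Thread(() -> {}, name);
            lifeline.setDaemon(true);
            lifeline.start();
            return name;
        } finally {
            serviceLock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String name = startLifelineSender();   // read lock fully released by this point
        // Only now is bpThread started, so it cannot race with the read lock above.
        Thread bpThread = new Thread(() ->
                System.out.println("bpThread started after " + name + " was set up"));
        bpThread.start();
        bpThread.join();
    }
}
```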
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216878#comment-17216878 ] Hongbing Wang edited comment on HDFS-15641 at 10/19/20, 4:38 PM:
--
{quote}just wonder if this issue is also in trunk{quote}
Yes, it reproduces in trunk. [^HDFS-15641.000.test.patch] uses CyclicBarrier to control the thread execution order to reproduce the deadlock.
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.001.patch
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216878#comment-17216878 ] Hongbing Wang commented on HDFS-15641:
--
{quote}just wonder if this issue is also in trunk{quote}
Yes, it reproduces in trunk. [^HDFS-15641.000.test.patch] uses CyclicBarrier to control the thread execution order to reproduce the deadlock.
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Description: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in jstack.log The specific process is shown in Figure deadlock.png was: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in jstack.log The specific process is shown in Figure RefreshNameNode_DeadLock.png
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: RefreshNameNode_DeadLock.png)
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: deadlock.png
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Description: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in jstack.log The specific process is shown in Figure RefreshNameNode_DeadLock.png was: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in RefreshNameNode_DeadLock.jstack. The specific process is shown in Figure RefreshNameNode_DeadLock.png
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: RefreshNameNode_DeadLock.jstack)
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: jstack.log
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216690#comment-17216690 ] Hongbing Wang commented on HDFS-15641:
--
I added a test, [^HDFS-15641.000.test.patch], to reproduce this deadlock. A patch solving the problem will be attached later. [~hexiaoqiao] Could you help take a look?
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.000.test.patch
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: RefreshNameNode_DeadLock.jstack
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: RefreshNameNode_DeadLock.png > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: RefreshNameNode_DeadLock.jstack, > RefreshNameNode_DeadLock.png > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in RefreshNameNode_DeadLock.jstack. > The specific process is shown in Figure RefreshNameNode_DeadLock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
Hongbing Wang created HDFS-15641: Summary: DataNode could meet deadlock if invoke refreshNameNode Key: HDFS-15641 URL: https://issues.apache.org/jira/browse/HDFS-15641 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.2.0 Reporter: Hongbing Wang Assignee: Hongbing Wang The DataNode can hit a deadlock when `hdfs dfsadmin -refreshNamenodes hostname:50020` is invoked to register a new namespace in a federation environment. The jstack is shown in RefreshNameNode_DeadLock.jstack. The specific process is shown in Figure RefreshNameNode_DeadLock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
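The actual lock objects involved are in the attached jstack and diagram, not inlined here. As a generic, hypothetical illustration of the failure mode — two threads acquiring the same pair of monitors in opposite order during a namespace refresh — the following minimal sketch deadlocks deterministically (the lock names are illustrative, not the real DataNode internals):

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical lock-ordering deadlock: t1 holds lockA and waits for lockB,
// while t2 holds lockB and waits for lockA. A latch guarantees both threads
// hold their first lock before trying the second, so the deadlock is certain.
public class DeadlockSketch {
    private static final Object lockA = new Object(); // illustrative: a datanode-wide lock
    private static final Object lockB = new Object(); // illustrative: a per-namespace lock

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        Thread t1 = new Thread(() -> {
            synchronized (lockA) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);   // wait until t2 holds lockB
                synchronized (lockB) { }           // blocks forever
            }
        });
        Thread t2 = new Thread(() -> {
            synchronized (lockB) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);   // wait until t1 holds lockA
                synchronized (lockA) { }           // blocks forever
            }
        });
        t1.setDaemon(true);                        // let the JVM exit despite the deadlock
        t2.setDaemon(true);
        t1.start();
        t2.start();
        t1.join(500);                              // give both threads time to tangle
        System.out.println("deadlocked=" + (t1.isAlive() && t2.isAlive())); // prints deadlocked=true
    }

    private static void awaitQuietly(CountDownLatch l) {
        try { l.await(); } catch (InterruptedException ignored) { }
    }
}
```

`jstack` on such a process reports the cycle directly ("Found one Java-level deadlock"), which is presumably how the attached RefreshNameNode_DeadLock.jstack was produced.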
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190115#comment-17190115 ] Hongbing Wang commented on HDFS-15556: -- BPServiceActor uses the `initialRegistrationComplete` variable, of type `CountDownLatch(1)`, to ensure that the sendLifeline thread runs only after registration has completed. This guard does not take effect on reRegister, because `initialRegistrationComplete` was already counted down during the first registration. > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode hits an NPE when processing lifeline messages > sent by the DataNode, which leads to an incorrect maxLoad being calculated by the NN. > Because the DataNode is then identified as busy and no available nodes can be > allocated when choosing DataNodes, the placement loop repeats, resulting in high CPU and reduced > processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... 
> for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
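The latch behavior the comment describes can be shown with a small self-contained sketch: a `CountDownLatch(1)` is a one-shot barrier, so once the first registration calls `countDown()`, every later `await()` — including one racing a re-registration — returns immediately. (The variable name mirrors the comment; this is an illustration, not the actual BPServiceActor code.)

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// A CountDownLatch(1) gates only the *first* registration: after countDown(),
// await() always returns immediately and the latch cannot be "re-armed", so a
// lifeline sender is no longer blocked while a re-registration is in progress.
public class LatchSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch initialRegistrationComplete = new CountDownLatch(1);

        initialRegistrationComplete.countDown();        // first registration finishes

        // Simulated re-registration starts here; the gate is already open.
        boolean passedImmediately =
            initialRegistrationComplete.await(0, TimeUnit.MILLISECONDS);
        System.out.println("lifelineGated=" + !passedImmediately); // prints lifelineGated=false
    }
}
```

This matches the NPE above: the lifeline can reach the NameNode while a re-registration has not yet repopulated the descriptor's `storageMap`, leaving `storage` null at `storage.receivedHeartbeat(report)`.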
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158991#comment-17158991 ] Hongbing Wang commented on HDFS-15240: -- [~marvelrock] We have the same problem. Alongside it, there are frequent Full GCs (every few seconds). A heap dump analyzed with MAT shows lots of ecWorker objects that almost fill the entire heap. !image-2020-07-16-15-56-38-608.png|width=722,height=591! Looking forward to this patch landing in trunk. > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, > HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch, > image-2020-07-16-15-56-38-608.png > > > When reading some lzo files, we found some blocks were broken. > I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from > the DN directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), > and found the longest common substring (LCS) between b6' (decoded) and > b6 (read from the DN), and likewise for b7'/b7 and b8'/b8. > After selecting 6 blocks of the block group per combination and > iterating through all cases, I found one case where the LCS length is the > block length - 64KB; 64KB is exactly the length of the ByteBuffer used by > StripedBlockReader. So the corrupt reconstruction block was produced by a dirty > buffer. > The following log snippet (showing only 2 of 28 cases) is my check program's > output. In my case, I knew the 3rd block was corrupt, so the other 5 blocks were needed > to decode another 3 blocks; I then found the 1st block's LCS is the block > length - 64KB. 
> It means (0,1,2,4,5,6)th blocks were used to reconstruct 3th block, and the > dirty buffer was used before read the 1th block. > Must be noted that StripedBlockReader read from the offset 0 of the 1th block > after used the dirty buffer. > {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] > Check Block(1) first 131072 bytes longest common substring length 4 > Check Block(6) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4 > decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] > Check Block(1) first 131072 bytes longest common substring length 65536 > CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length > 27197440 # this one > Check Block(7) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4{code} > Now I know the dirty buffer causes reconstruction block error, but how does > the dirty buffer come about? > After digging into the code and DN log, I found this following DN log is the > root reason. > {code:java} > [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel > java.nio.channels.SocketChannel[connected local=/:52586 > remote=/:50010]. 18 millis timeout left. 
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped > block: BP-714356632--1519726836856:blk_-YY_3472979393 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) {code} > Reading from a DN may time out (held by a future F) and output the INFO log, but > the futures collection that contained future F has been cleared, > {code:java} > return new StripingChunkReadResult(futures.remove(future), > StripingChunkReadResult.CANCELLED); {code} > so futures.remove(future) causes an NPE and the EC reconstruction fails. In the > finally phase, the code snippet in *getStripedReader().close()* frees the buffer > first, but the StripedBlockReader still holds the buffer and writes to it.
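The dirty-buffer mechanism the report describes — a pooled read buffer handed back for reuse while a reader can still write into it — can be reproduced with a minimal, hypothetical pool sketch (the pool and method names are illustrative, not the actual StripedReader API):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the close-ordering hazard: free(buf) returns a buffer to the pool
// without zeroing it, so the next borrower sees stale ("dirty") bytes. If the
// reader is closed *before* the buffer is freed, it can no longer write into
// a buffer someone else has borrowed, removing the race.
public class BufferPoolSketch {
    private final Deque<ByteBuffer> pool = new ArrayDeque<>();

    ByteBuffer borrow() {
        ByteBuffer b = pool.poll();
        return (b != null) ? b : ByteBuffer.allocate(64 * 1024); // 64KB, as in the report
    }

    void free(ByteBuffer b) {
        b.clear();     // resets position/limit only -- contents are NOT erased
        pool.push(b);
    }

    public static void main(String[] args) {
        BufferPoolSketch sketch = new BufferPoolSketch();
        ByteBuffer buf = sketch.borrow();
        buf.put("stale".getBytes());   // a still-live reader writes into the buffer

        // Unsafe order (what the report describes): the buffer is freed while
        // the reader could still hold it, then the next user borrows it back.
        sketch.free(buf);
        ByteBuffer reused = sketch.borrow();
        byte[] head = new byte[5];
        reused.get(head);
        System.out.println("dirty=" + new String(head)); // prints dirty=stale
    }
}
```

In this model the fix the description implies is simply to close the reader before `free()`, so no thread can still write into a buffer after it is recycled.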
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15240: - Attachment: image-2020-07-16-15-56-38-608.png > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, > HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch, > image-2020-07-16-15-56-38-608.png > > > When read some lzo files we found some blocks were broken. > I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from > DN directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') > blocks. And find the longest common sequenece(LCS) between b6'(decoded) and > b6(read from DN)(b7'/b7 and b8'/b8). > After selecting 6 blocks of the block group in combinations one time and > iterating through all cases, I find one case that the length of LCS is the > block length - 64KB, 64KB is just the length of ByteBuffer used by > StripedBlockReader. So the corrupt reconstruction block is made by a dirty > buffer. > The following log snippet(only show 2 of 28 cases) is my check program > output. In my case, I known the 3th block is corrupt, so need other 5 blocks > to decode another 3 blocks, then find the 1th block's LCS substring is block > length - 64kb. > It means (0,1,2,4,5,6)th blocks were used to reconstruct 3th block, and the > dirty buffer was used before read the 1th block. > Must be noted that StripedBlockReader read from the offset 0 of the 1th block > after used the dirty buffer. 
> {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] > Check Block(1) first 131072 bytes longest common substring length 4 > Check Block(6) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4 > decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] > Check Block(1) first 131072 bytes longest common substring length 65536 > CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length > 27197440 # this one > Check Block(7) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4{code} > Now I know the dirty buffer causes reconstruction block error, but how does > the dirty buffer come about? > After digging into the code and DN log, I found this following DN log is the > root reason. > {code:java} > [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel > java.nio.channels.SocketChannel[connected local=/:52586 > remote=/:50010]. 18 millis timeout left. 
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped > block: BP-714356632--1519726836856:blk_-YY_3472979393 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) {code} > Reading from DN may timeout(hold by a future(F)) and output the INFO log, but > the futures that contains the future(F) is cleared, > {code:java} > return new StripingChunkReadResult(futures.remove(future), > StripingChunkReadResult.CANCELLED); {code} > futures.remove(future) cause NPE. So the EC reconstruction is failed. In the > finally phase, the code snippet in *getStripedReader().close()* > {code:java} > reconstructor.freeBuffer(reader.getReadBuffer()); > reader.freeReadBuffer(); > reader.closeBlockReader(); {code} > free buffer firstly, but the StripedBlockReader still holds the buffer and > write it. -- This message was sent by Atlassian Jir
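The verification method described above — decode a block from the surviving internal blocks, then measure the longest common substring (LCS) between the decoded bytes and the bytes read from the DataNode — can be sketched as follows. An LCS of (block length - 64KB) is what implicated the single dirty 64KB reader buffer. This is an illustrative check on toy data, not the author's actual program:

```java
// Minimal sketch of the LCS check: a classic O(n*m) dynamic program over two
// byte arrays, adequate for a prefix check (e.g. the first 128KB of a block).
public class LcsCheckSketch {
    static int longestCommonSubstring(byte[] a, byte[] b) {
        int[] prev = new int[b.length + 1];
        int best = 0;
        for (int i = 1; i <= a.length; i++) {
            int[] cur = new int[b.length + 1];
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1] == b[j - 1]) {
                    cur[j] = prev[j - 1] + 1;      // extend the match ending at (i, j)
                    best = Math.max(best, cur[j]);
                }
            }
            prev = cur;
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] decoded = "AAAABBBBCCCC".getBytes();  // stand-in for b1' (decoded)
        byte[] fromDn  = "XXAABBBBCCXX".getBytes();  // stand-in for b1 (read from DN)
        // A corrupt span at head/tail shortens the LCS relative to full length.
        System.out.println("lcs=" + longestCommonSubstring(decoded, fromDn)); // prints lcs=8
    }
}
```

On real blocks, an LCS close to the full block length localizes the corruption to a single contiguous span — here, exactly one 64KB `StripedBlockReader` buffer.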
[jira] [Assigned] (HDFS-15425) Review Logging of DFSClient
[ https://issues.apache.org/jira/browse/HDFS-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang reassigned HDFS-15425: Assignee: Hongbing Wang > Review Logging of DFSClient > --- > > Key: HDFS-15425 > URL: https://issues.apache.org/jira/browse/HDFS-15425 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Fix For: 3.4.0 > > Attachments: HDFS-15425.001.patch, HDFS-15425.002.patch, > HDFS-15425.003.patch > > > Review use of SLF4J for DFSClient.LOG. > Make the code more concise and readable. > Less is more ! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org