[jira] [Commented] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
[ https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865983#comment-17865983 ] Hongbing Wang commented on HDFS-17535: -- [~ruilaing] We had similar problems without PR HDFS-15240 in previous years and there seemed to be no convenient tool to fix them. We also use structured data (orc/parquet) features for verifying data, and the overall idea is similar to yours. For RS-6-3, if, unfortunately, more than 3 blocks are broken, it will not be recoverable. > I have confirmed the EC corrupt file, can this corrupt file be restored? > > > Key: HDFS-17535 > URL: https://issues.apache.org/jira/browse/HDFS-17535 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Blocker > > I learned that EC does have a major bug with file corruption > https://issues.apache.org/jira/browse/HDFS-15759 > 1: I have confirmed the EC corrupt file, can this corrupt file be restored? > We have important data that is causing us production data loss issues. Is > there a way to recover it? > Checking EC block group: blk_-9223372036361352768 > Status: ERROR, message: EC compute result not match.:ip is 10.12.66.116 block > is : -9223372036361352765 > 2: [https://github.com/apache/orc/issues/1939] I was wondering, if I cherry > picked your current code (GitHub pull request #2869), can I skip patches > related to HDFS-14768, HDFS-15186, and HDFS-15240? > hdfs version 3.1.0 > thank you -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
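The RS-6-3 recoverability rule mentioned in the comment can be sketched as follows (a minimal illustration of the erasure-coding arithmetic, not Hadoop's actual decoder; the class and method names are hypothetical):

```java
// Minimal sketch of the Reed-Solomon recoverability rule discussed above.
// For an RS(d, p) policy, a block group survives the loss of at most p
// internal blocks; losing more than p makes the group unrecoverable.
public class EcRecoverability {
    static boolean isRecoverable(int dataBlocks, int parityBlocks, int lostBlocks) {
        // Any combination of up to `parityBlocks` missing internal blocks
        // (data or parity) can be reconstructed from the remaining ones.
        return lostBlocks <= parityBlocks;
    }

    public static void main(String[] args) {
        // RS-6-3: 6 data blocks + 3 parity blocks per block group.
        System.out.println(isRecoverable(6, 3, 3)); // true  - still recoverable
        System.out.println(isRecoverable(6, 3, 4)); // false - permanent data loss
    }
}
```

This is why the comment above says a file with more than 3 broken blocks per group cannot be restored; only the structured-data (ORC/Parquet) level checks can then tell you which records were lost.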
[jira] [Commented] (HDFS-16867) Exiting Mover due to an exception in MoverMetrics.create
[ https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696878#comment-17696878 ] Hongbing Wang commented on HDFS-16867: -- [~Happy-shi] Is this still being followed up? I had the same problem with balancer. {code:java} 2023-03-06 17:40:53,264 ERROR org.apache.hadoop.hdfs.server.balancer.Balancer: Exiting balancer due an exception org.apache.hadoop.metrics2.MetricsException: Metrics source Balancer-BP-332003681-10.196.164.22-1648632173322 already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:225) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:198) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:55) at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:344) at org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:809) at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:847) at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:952) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1102){code} > Exiting Mover due to an exception in MoverMetrics.create > > > Key: HDFS-16867 > URL: https://issues.apache.org/jira/browse/HDFS-16867 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ZhiWei Shi >Assignee: ZhiWei Shi >Priority: Major > Labels: pull-request-available > > After the Mover process is started for a period of time, the process exits > unexpectedly and an error is reported in the log > {code:java} > [hdfs@${hostname} hadoop-3.3.2-nn]$ nohup bin/hdfs mover -p > /test-mover-jira9534 > mover.log.jira9534.20221209.2 & > [hdfs@{hostname} hadoop-3.3.2-nn]$ tail -f mover.log.jira9534.20221209.2 > ... 
> 22/12/09 14:22:32 INFO balancer.Dispatcher: Start moving > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:32 INFO balancer.Dispatcher: Successfully moved > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Stopping Mover metrics > system... > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system stopped. > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system shutdown > complete. > Dec 9, 2022, 2:22:42 PM Mover took 13mins, 19sec > 22/12/09 14:22:42 ERROR mover.Mover: Exiting Mover due to an exception > org.apache.hadoop.metrics2.MetricsException: Metrics source > Mover-${BlockpoolID} already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.mover.MoverMetrics.create(MoverMetrics.java:49) > at org.apache.hadoop.hdfs.server.mover.Mover.<init>(Mover.java:162) > at org.apache.hadoop.hdfs.server.mover.Mover.run(Mover.java:684) > at org.apache.hadoop.hdfs.server.mover.Mover$Cli.run(Mover.java:826) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) > at org.apache.hadoop.hdfs.server.mover.Mover.main(Mover.java:908) > {code} > 1. “final ExitStatus r = m.run()” returns right after scheduling only one replica move > 2. When “r == ExitStatus.IN_PROGRESS”, iter.remove() won’t run > 3. “new Mover” and “this.metrics = MoverMetrics.create(this)” execute multiple > times for the same nnc, which leads to the error > {code:java} > //Mover.java > for (final StorageType t : diff.existing) { > for (final MLocation ml : locations) { > final Source source = storages.getSource(ml); > if 
(ml.storageType == t && source != null) { > // try to schedule one replica move. > if (scheduleMoveReplica(db, source, diff.expected)) { // 1. returns right > after scheduling only one replica move > return true; > } > } > } > } > while (connectors.size() > 0) { > Collections.shuffle(connectors); > Iterator iter = connectors.iterator(); > while (iter.hasNext()) { > NameNodeConnector nnc = iter.next(); > // 3. “new Mover” and “this.metrics = MoverMetrics.create(this)” execute > multiple times for the same nnc, which leads to the error > final Mover m = new Mover(nnc, co
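The root cause described in this thread (and in the balancer stack trace above) is re-registering the same metrics source name. A toy model, assuming nothing about Hadoop's real `DefaultMetricsSystem` beyond the duplicate-name check visible in the stack trace (the class and method names below are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of a metrics system that, like DefaultMetricsSystem, rejects
// re-registration of a source name. Creating a second Mover/Balancer for
// the same block pool registers the same name twice and throws.
public class ToyMetricsSystem {
    private final Set<String> sourceNames = new HashSet<>();

    public void register(String name) {
        if (!sourceNames.add(name)) {
            throw new IllegalStateException(
                "Metrics source " + name + " already exists!");
        }
    }

    public void unregister(String name) {
        sourceNames.remove(name);
    }

    public static void main(String[] args) {
        ToyMetricsSystem ms = new ToyMetricsSystem();
        ms.register("Mover-BP-1");      // first Mover for this block pool: fine
        try {
            ms.register("Mover-BP-1");  // second Mover for the same nnc: fails
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // One possible fix: unregister the source when an iteration finishes,
        // so the next `new Mover(nnc, ...)` for the same nnc can register again.
        ms.unregister("Mover-BP-1");
        ms.register("Mover-BP-1");      // now succeeds
    }
}
```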
[jira] [Created] (HDFS-16763) MoverTool: Make valid for the number of mover threads per DN
Hongbing Wang created HDFS-16763: Summary: MoverTool: Make valid for the number of mover threads per DN Key: HDFS-16763 URL: https://issues.apache.org/jira/browse/HDFS-16763 Project: Hadoop HDFS Issue Type: Bug Components: balancer & mover Reporter: Hongbing Wang When running the Mover tool, the number of mover threads per DN is always 1, resulting in very slow data movement. This JIRA fixes the problem that the current config does not take effect.
[jira] [Updated] (HDFS-16656) Fix some incorrect descriptions in SPS
[ https://issues.apache.org/jira/browse/HDFS-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-16656: - Summary: Fix some incorrect descriptions in SPS (was: Fixed some incorrect descriptions in SPS) > Fix some incorrect descriptions in SPS > -- > > Key: HDFS-16656 > URL: https://issues.apache.org/jira/browse/HDFS-16656 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Reporter: Hongbing Wang >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > There are some incorrect descriptions in the SPS module on the web site, as follows: > [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] > and > [hdfs-default.xml|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml] > Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.
[jira] [Updated] (HDFS-16656) Fixed some incorrect descriptions in SPS
[ https://issues.apache.org/jira/browse/HDFS-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-16656: - Description: There are some incorrect descriptions in the SPS module on the web site, as follows: [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] and [hdfs-default.xml|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml] Fix them in `ArchivalStorage.md` and `hdfs-default.xml`. (was: There are some incorrect descriptions in SPS module in web site, as follows: [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] and [hdfs-default.xml|[https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml].] Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.) > Fixed some incorrect descriptions in SPS > > > Key: HDFS-16656 > URL: https://issues.apache.org/jira/browse/HDFS-16656 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Reporter: Hongbing Wang >Priority: Minor > > There are some incorrect descriptions in the SPS module on the web site, as follows: > [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] > and > [hdfs-default.xml|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml] > Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.
[jira] [Created] (HDFS-16656) Fixed some incorrect descriptions in SPS
Hongbing Wang created HDFS-16656: Summary: Fixed some incorrect descriptions in SPS Key: HDFS-16656 URL: https://issues.apache.org/jira/browse/HDFS-16656 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Reporter: Hongbing Wang There are some incorrect descriptions in the SPS module on the web site, as follows: [ArchivalStorage.md|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html] and [hdfs-default.xml|[https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml].] Fix them in `ArchivalStorage.md` and `hdfs-default.xml`.
[jira] [Updated] (HDFS-16512) Improve oiv tool to parse fsimage file in parallel with XML format
[ https://issues.apache.org/jira/browse/HDFS-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-16512: - Parent: HDFS-14617 Issue Type: Sub-task (was: Improvement) > Improve oiv tool to parse fsimage file in parallel with XML format > -- > > Key: HDFS-16512 > URL: https://issues.apache.org/jira/browse/HDFS-16512 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major >
[jira] [Created] (HDFS-16512) Improve oiv tool to parse fsimage file in parallel with XML format
Hongbing Wang created HDFS-16512: Summary: Improve oiv tool to parse fsimage file in parallel with XML format Key: HDFS-16512 URL: https://issues.apache.org/jira/browse/HDFS-16512 Project: Hadoop HDFS Issue Type: Improvement Reporter: Hongbing Wang Assignee: Hongbing Wang
[jira] [Commented] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
[ https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405682#comment-17405682 ] Hongbing Wang commented on HDFS-15987: -- Report [^Improve_oiv_tool_001.pdf] is given, and the corresponding code is in [commit 66502f90.|https://github.com/apache/hadoop/pull/2918/commits/66502f901c3d5ec3410965ea5fdef2b31947d816] > Improve oiv tool to parse fsimage file in parallel with delimited format > > > Key: HDFS-15987 > URL: https://issues.apache.org/jira/browse/HDFS-15987 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Labels: pull-request-available > Attachments: Improve_oiv_tool_001.pdf > > Time Spent: 2h 20m > Remaining Estimate: 0h > > The purpose of this Jira is to improve oiv tool to parse fsimage file with > sub-sections (see -HDFS-14617-) in parallel with delimited format. > 1. Serial parsing is time-consuming > The time to serially parse a large fsimage with delimited format (e.g. `hdfs > oiv -p Delimited -t ...`) is as follows: > {code:java} > 1) Loading string table: -> Not time consuming. > 2) Loading inode references: -> Not time consuming > 3) Loading directories in INode section: -> Slightly time consuming (3%) > 4) Loading INode directory section: -> A bit time consuming (11%) > 5) Output: -> Very time consuming (86%){code} > Therefore, output is the stage most worth parallelizing. > 2. How to output in parallel > The sub-sections are grouped in order, and each thread processes a group and > outputs it to the file corresponding to each thread, and finally merges the > output files. > 3. 
The result of a test > {code:java} > input fsimage file info: > 3.4G, 12 sub-sections, 55976500 INodes > - > Threads TotalTime OutputTime MergeTime > 1 18m37s 16m18s – > 4 8m7s 4m49s 41s{code} > > >
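The grouped, per-thread output scheme described above can be sketched as follows (an illustrative skeleton only, not the actual oiv code; it writes arbitrary strings instead of parsed INode records):

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch of the parallel-output strategy described above: sub-sections are
// split into ordered groups, each thread writes its group to its own file,
// and the per-thread files are concatenated in order at the end.
public class ParallelOutputSketch {
    public static Path run(List<String> subSections, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Path> parts = new ArrayList<>();
        List<Future<?>> futures = new ArrayList<>();
        int groupSize = (subSections.size() + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            Path part = Files.createTempFile("oiv-part-" + t, ".txt");
            parts.add(part);
            int from = Math.min(t * groupSize, subSections.size());
            int to = Math.min(from + groupSize, subSections.size());
            List<String> group = subSections.subList(from, to);
            futures.add(pool.submit(() -> {
                Files.write(part, group);  // each thread writes only its own file
                return null;
            }));
        }
        for (Future<?> f : futures) f.get();   // wait for all output threads
        pool.shutdown();
        Path merged = Files.createTempFile("oiv-merged", ".txt");
        for (Path part : parts) {              // merge in group order
            Files.write(merged, Files.readAllLines(part), StandardOpenOption.APPEND);
            Files.delete(part);
        }
        return merged;
    }

    public static void main(String[] args) throws Exception {
        Path merged = run(Arrays.asList("sec-0", "sec-1", "sec-2", "sec-3"), 2);
        System.out.println(Files.readAllLines(merged));  // sections in original order
        Files.delete(merged);
    }
}
```

Because the groups are assigned in order and merged in the same order, the merged file preserves the serial output order, which matches the MergeTime column in the measurement above.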
[jira] [Updated] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
[ https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15987: - Attachment: Improve_oiv_tool_001.pdf > Improve oiv tool to parse fsimage file in parallel with delimited format > > > Key: HDFS-15987 > URL: https://issues.apache.org/jira/browse/HDFS-15987 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Labels: pull-request-available > Attachments: Improve_oiv_tool_001.pdf > > Time Spent: 2h 20m > Remaining Estimate: 0h > > The purpose of this Jira is to improve oiv tool to parse fsimage file with > sub-sections (see -HDFS-14617-) in parallel with delimited format. > 1. Serial parsing is time-consuming > The time to serially parse a large fsimage with delimited format (e.g. `hdfs > oiv -p Delimited -t ...`) is as follows: > {code:java} > 1) Loading string table: -> Not time consuming. > 2) Loading inode references: -> Not time consuming > 3) Loading directories in INode section: -> Slightly time consuming (3%) > 4) Loading INode directory section: -> A bit time consuming (11%) > 5) Output: -> Very time consuming (86%){code} > Therefore, output is the stage most worth parallelizing. > 2. How to output in parallel > The sub-sections are grouped in order, and each thread processes a group and > outputs it to the file corresponding to each thread, and finally merges the > output files. > 3. The result of a test > {code:java} > input fsimage file info: > 3.4G, 12 sub-sections, 55976500 INodes > - > Threads TotalTime OutputTime MergeTime > 1 18m37s 16m18s – > 4 8m7s 4m49s 41s{code} > > >
[jira] [Commented] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
[ https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402581#comment-17402581 ] Hongbing Wang commented on HDFS-15987: -- [~mofei] The PR works well in our cluster. I will give an online report in the next few days. Thank you for your attention. > Improve oiv tool to parse fsimage file in parallel with delimited format > > > Key: HDFS-15987 > URL: https://issues.apache.org/jira/browse/HDFS-15987 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Labels: pull-request-available > Time Spent: 2h 20m > Remaining Estimate: 0h > > The purpose of this Jira is to improve oiv tool to parse fsimage file with > sub-sections (see -HDFS-14617-) in parallel with delimited format. > 1. Serial parsing is time-consuming > The time to serially parse a large fsimage with delimited format (e.g. `hdfs > oiv -p Delimited -t ...`) is as follows: > {code:java} > 1) Loading string table: -> Not time consuming. > 2) Loading inode references: -> Not time consuming > 3) Loading directories in INode section: -> Slightly time consuming (3%) > 4) Loading INode directory section: -> A bit time consuming (11%) > 5) Output: -> Very time consuming (86%){code} > Therefore, output is the stage most worth parallelizing. > 2. How to output in parallel > The sub-sections are grouped in order, and each thread processes a group and > outputs it to the file corresponding to each thread, and finally merges the > output files. > 3. The result of a test > {code:java} > input fsimage file info: > 3.4G, 12 sub-sections, 55976500 INodes > - > Threads TotalTime OutputTime MergeTime > 1 18m37s 16m18s – > 4 8m7s 4m49s 41s{code} > > >
[jira] [Commented] (HDFS-14788) Use dynamic regex filter to ignore copy of source files in Distcp
[ https://issues.apache.org/jira/browse/HDFS-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372761#comment-17372761 ] Hongbing Wang commented on HDFS-14788: -- Is there a plan to filter files by modtime? In the scenario of incremental data synchronization, if files in certain time windows can be specified, efficiency can be greatly improved. > Use dynamic regex filter to ignore copy of source files in Distcp > - > > Key: HDFS-14788 > URL: https://issues.apache.org/jira/browse/HDFS-14788 > Project: Hadoop HDFS > Issue Type: Improvement > Components: distcp >Affects Versions: 3.2.1 >Reporter: Mukund Thakur >Assignee: Mukund Thakur >Priority: Major > Fix For: 3.3.0 > > > There is a feature in Distcp where we can ignore specific files to get copied > to the destination. This is currently based on a filter regex which is read > from a specific file. The process of creating different regex file for > different distcp jobs seems like a tedious task. What we are proposing is to > expose a regex_filter parameter which can be set during Distcp job creation > and use this filter in a new implementation CopyFilter class.
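The proposed `regex_filter` parameter could look roughly like this (a hypothetical sketch; the real Distcp `CopyFilter` works on Hadoop `Path` objects and configuration, which are omitted here):

```java
import java.util.regex.Pattern;

// Sketch of a regex-based copy filter along the lines described above:
// a single filter regex supplied as a job parameter decides which source
// paths are skipped. Class and method names here are illustrative, not
// the actual Distcp CopyFilter API.
public class RegexCopyFilter {
    private final Pattern pattern;

    public RegexCopyFilter(String filterRegex) {
        this.pattern = Pattern.compile(filterRegex);
    }

    /** Returns true if the path should be copied (i.e. does NOT match the filter). */
    public boolean shouldCopy(String path) {
        return !pattern.matcher(path).matches();
    }

    public static void main(String[] args) {
        // Skip in-flight and temporary files anywhere under the source tree.
        RegexCopyFilter filter = new RegexCopyFilter(".*\\._COPYING_$|.*/_temporary/.*");
        System.out.println(filter.shouldCopy("/data/part-00000"));        // true
        System.out.println(filter.shouldCopy("/data/_temporary/part-0")); // false
    }
}
```

A modtime-window filter, as asked in the comment, would be a natural second implementation of the same `shouldCopy` contract, comparing a file's modification time against a configured window instead of matching its path.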
[jira] [Commented] (HDFS-15650) Make the socket timeout for computing checksum of striped blocks configurable
[ https://issues.apache.org/jira/browse/HDFS-15650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371155#comment-17371155 ] Hongbing Wang commented on HDFS-15650: -- [~yhaya] [~weichiu] Hi! In our practice, when there are a large number of EC checksum operations (such as distcp with checksum), there are many socket timeouts, which generally succeed on retry. (Note: -HDFS-15709- has been merged). I think it makes sense to fix the hard-code. New config `dfs.checksum.ec.socket-timeout` looks good. Do you have any plan to fix this issue? Thanks! > Make the socket timeout for computing checksum of striped blocks configurable > - > > Key: HDFS-15650 > URL: https://issues.apache.org/jira/browse/HDFS-15650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, ec, erasure-coding >Reporter: Yushi Hayasaka >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When the DataNode tries to get the checksum of EC internal blocks from > another DataNode to compute the checksum of striped blocks, the timeout is > hard-coded now; it should be configurable.
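The proposed fix pattern, reading `dfs.checksum.ec.socket-timeout` with a fallback to a hard-coded value, might look like this (the default below and the toy configuration map are illustrative, not taken from the actual patch):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of making the striped-checksum socket timeout configurable, as
// proposed above. The plain Map stands in for Hadoop's Configuration class;
// the default value is illustrative only.
public class EcChecksumTimeout {
    static final String KEY = "dfs.checksum.ec.socket-timeout";
    static final int DEFAULT_MILLIS = 3000;  // stand-in for the old hard-coded value

    static int socketTimeout(Map<String, String> conf) {
        // Fall back to the former hard-coded value when the key is unset,
        // so existing deployments keep their current behavior.
        String v = conf.get(KEY);
        return v == null ? DEFAULT_MILLIS : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(socketTimeout(conf));  // default when unset
        conf.put(KEY, "10000");
        System.out.println(socketTimeout(conf));  // operator-tuned value
    }
}
```

With Hadoop's real `Configuration`, the equivalent call would be a `getInt(key, default)` lookup at the point where the checksum socket is created.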
[jira] [Created] (HDFS-16018) Optimize the display of hdfs "count -e" or "count -t" command
Hongbing Wang created HDFS-16018: Summary: Optimize the display of hdfs "count -e" or "count -t" command Key: HDFS-16018 URL: https://issues.apache.org/jira/browse/HDFS-16018 Project: Hadoop HDFS Issue Type: Improvement Components: dfsclient Reporter: Hongbing Wang Assignee: Hongbing Wang Attachments: fs_count_fixed.png, fs_count_origin.png The display of `fs -count -e` or `fs -count -t` is not aligned. *Current display:* *!fs_count_origin.png|width=1184,height=156!* *Fixed display:* *!fs_count_fixed.png|width=1217,height=157!*
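The alignment fix amounts to printing every column with a fixed width, e.g. via `String.format` (the widths and column names below are illustrative, not the actual `Count` command code):

```java
// Sketch of the fixed-width formatting fix implied above: pad each column
// to a constant width so `fs -count -t` output lines up regardless of how
// many digits each value has. Widths and column names are illustrative.
public class CountDisplaySketch {
    static String formatRow(String storageType, long quota, long used) {
        // %13s / %17d pick widths at least as wide as the header text,
        // right-aligning numbers under their headers.
        return String.format("%13s %17d %17d", storageType, quota, used);
    }

    public static void main(String[] args) {
        System.out.println(String.format("%13s %17s %17s", "STORAGE_TYPE", "QUOTA", "USED"));
        System.out.println(formatRow("DISK", 107374182400L, 9663676416L));
        System.out.println(formatRow("ARCHIVE", -1, 0));
    }
}
```

Every row then has the same length as the header line, which is exactly the misalignment the attached before/after screenshots illustrate.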
[jira] [Created] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format
Hongbing Wang created HDFS-15987: Summary: Improve oiv tool to parse fsimage file in parallel with delimited format Key: HDFS-15987 URL: https://issues.apache.org/jira/browse/HDFS-15987 Project: Hadoop HDFS Issue Type: Improvement Reporter: Hongbing Wang The purpose of this Jira is to improve oiv tool to parse fsimage file with sub-sections (see -HDFS-14617-) in parallel with delimited format. 1. Serial parsing is time-consuming The time to serially parse a large fsimage with delimited format (e.g. `hdfs oiv -p Delimited -t ...`) is as follows: {code:java} 1) Loading string table: -> Not time consuming. 2) Loading inode references: -> Not time consuming 3) Loading directories in INode section: -> Slightly time consuming (3%) 4) Loading INode directory section: -> A bit time consuming (11%) 5) Output: -> Very time consuming (86%){code} Therefore, output is the stage most worth parallelizing. 2. How to output in parallel The sub-sections are grouped in order, and each thread processes a group and outputs it to the file corresponding to each thread, and finally merges the output files. 3. The result of a test {code:java} input fsimage file info: 3.4G, 12 sub-sections, 55976500 INodes - Threads TotalTime OutputTime MergeTime 1 18m37s 16m18s – 4 8m7s 4m49s 41s{code}
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: (was: HDFS-15858-branch-3.1.002.patch) > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.1.002.patch > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.1.002.patch, > HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Commented] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292106#comment-17292106 ] Hongbing Wang commented on HDFS-15858: -- Resubmit [^HDFS-15858-branch-3.1.002.patch] to trigger UT. > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Commented] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291443#comment-17291443 ] Hongbing Wang commented on HDFS-15858: -- {{Note:}} * {{backport to branch-3.1: {color:#0747a6}use branch-3.1.xxx.patch{color}}} * {{backport to branch-3.2: {color:#0747a6}use branch-3.2.xxx.patch{color}}} * {{backport to branch-3.3: }}{{Directly use the -HDFS-14694- latest patch}} Considering that the lower-version PR in -HDFS-15684- depends on this Jira, we should complete this PR first. > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Summary: Backport HDFS-14694 to branch-3.1/3.2/3.3 (was: Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3) > Backport HDFS-14694 to branch-3.1/3.2/3.3 > - > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.2.002.patch > Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3 > > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.1.002.patch > Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3 > > > Key: HDFS-15858 > URL: https://issues.apache.org/jira/browse/HDFS-15858 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15858-branch-3.1.001.patch, > HDFS-15858-branch-3.1.002.patch, HDFS-15858-branch-3.2.002.patch > > > -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and > -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call > recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. > The original patches conflict with the lower version, so backport them.
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: (was: HDFS-15858-branch-3.2.001.patch)
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: (was: HDFS-15858-branch-3.3.001.patch)
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.2.001.patch
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.3.001.patch
[jira] [Updated] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
[ https://issues.apache.org/jira/browse/HDFS-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15858: - Attachment: HDFS-15858-branch-3.1.001.patch
[jira] [Created] (HDFS-15858) Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3
Hongbing Wang created HDFS-15858: Summary: Backport HDFS-14694 and HDFS-15684 to branch-3.1/3.2/3.3 Key: HDFS-15858 URL: https://issues.apache.org/jira/browse/HDFS-15858 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client Reporter: Hongbing Wang Assignee: Hongbing Wang -[HDFS-14694|https://issues.apache.org/jira/browse/HDFS-14694]- and -[HDFS-15684|https://issues.apache.org/jira/browse/HDFS-15684]- Call recoverLease on DFSOutputStream or DFSStripedOutputStream close exception. The original patches conflict with the lower-version branches, so backport them.
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290745#comment-17290745 ] Hongbing Wang commented on HDFS-15684: -- [~ferhui] ok. Because this PR depends on -HDFS-14694,- I will backport them in another Jira later. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, HDFS-15684.003.patch > > > -HDFS-14694- added a feature that calls the recoverLease operation automatically when DFSOutputStream close encounters an exception. When we wanted to apply this feature to our cluster, we found that it does not support EC files. > I think this feature should take effect for both replicated and EC files. > This Jira proposes to make it effective in the case of EC files.
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277006#comment-17277006 ] Hongbing Wang commented on HDFS-15779: -- [~ferhui] Thanks for the guidance. Fixed the code style in [^HDFS-15779.002.patch]. > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15779.001.patch, HDFS-15779.002.patch > > > The NullPointerException in the DN log is as follows: > {code:java} > 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY > //... > 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out > 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 > java.lang.NullPointerException > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) > at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50010 > {code} > The NPE occurs at `writer.getTargetBuffer()` in this code: > {code:java} > // StripedWriter#clearBuffers > void clearBuffers() { > for (StripedBlockWriter writer : writers) { > ByteBuffer targetBuffer = writer.getTargetBuffer(); > if (targetBuffer != null) { > targetBuffer.clear(); > } > } > } > {code} > So, why is the writer null? Let's track when the writer is initialized and when reconstruct() is called: > {code:java} > // StripedBlockReconstructor#run > public void run() { > try { > initDecoderIfNecessary(); > getStripedReader().init(); > stripedWriter.init(); //① > reconstruct(); //② > stripedWriter.endTargetBlocks(); > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > // ...{code} > They are called at ① and ② above, respectively. `stripedWriter.init()` -> `initTargetStreams()`: > {code:java} > // StripedWriter#initTargetStreams > int initTargetStreams() { > int nSuccess = 0; > for (short i = 0; i < targets.length; i++) { > try { > writers[i] = createWriter(i); > nSuccess++; > targetsStatus[i] = true; > } catch (Throwable e) { > LOG.warn(e.getMessage()); > } > } > return nSuccess; > } > {code} > The NPE occurs when createWriter() throws an exception and 0 < nSuccess < targets.length.
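The guard this description calls for can be sketched in isolation. The classes below are simplified stand-ins, not the real org.apache.hadoop.hdfs classes: `writers[i]` stays null when `createWriter(i)` threw, so `clearBuffers()` must skip null slots before dereferencing them.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for StripedBlockWriter: just owns a target buffer.
class StripedBlockWriterStub {
    private final ByteBuffer targetBuffer = ByteBuffer.allocate(8);
    ByteBuffer getTargetBuffer() { return targetBuffer; }
}

public class ClearBuffersDemo {
    // Sketch of the fixed clearBuffers(): returns how many buffers were cleared.
    static int clearBuffers(StripedBlockWriterStub[] writers) {
        int cleared = 0;
        for (StripedBlockWriterStub writer : writers) {
            if (writer == null) {
                // The missing guard: createWriter(i) failed, slot was never filled.
                continue;
            }
            ByteBuffer targetBuffer = writer.getTargetBuffer();
            if (targetBuffer != null) {
                targetBuffer.clear();
                cleared++;
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        // Simulate 0 < nSuccess < targets.length: the middle writer failed to init.
        StripedBlockWriterStub[] writers = {
            new StripedBlockWriterStub(), null, new StripedBlockWriterStub()
        };
        System.out.println(clearBuffers(writers)); // prints 2: null slot is skipped, no NPE
    }
}
```

Without the null check, the loop would throw the same NullPointerException as soon as it reached the failed slot.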
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Attachment: HDFS-15779.002.patch
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276167#comment-17276167 ] Hongbing Wang commented on HDFS-15779: -- [~ferhui] Thanks for the review! From a structural point of view, using *if (targetsStatus[i])* would be cleanest, but I was worried it would cause problems. Because targetsStatus[i] may be changed in _StripedWriter#transferData2Targets_, targetsStatus[i] and writers[i] would no longer correspond one to one. Note that they do correspond before that point. {code:java} // StripedWriter#transferData2Targets int transferData2Targets() { int nSuccess = 0; for (int i = 0; i < targets.length; i++) { if (targetsStatus[i]) { boolean success = false; try { writers[i].transferData2Target(packetBuf); nSuccess++; success = true; } catch (IOException e) { LOG.warn(e.getMessage()); } targetsStatus[i] = success; // may be false here } } return nSuccess; } {code} If _transferData2Target()_ throws an IOException, _writers[i]_ may still need clearBuffers() to be called, I think. Is that so? Thanks again.
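The one-to-one concern raised in the comment above can be made concrete with a toy model (plain arrays, not the actual Hadoop classes): once transferData2Targets() flips targetsStatus[i] to false for a writer that *was* successfully created, a status-based guard in clearBuffers() would skip that writer's live buffer, while a null check still reaches it.

```java
public class TargetsStatusDemo {
    // Returns { buffersReachedWithStatusGuard, buffersReachedWithNullGuard }.
    // Writers are plain marker objects here; what matters is null vs non-null.
    static int[] countCleared(boolean[] targetsStatus, Object[] writers) {
        int byStatus = 0, byNull = 0;
        for (int i = 0; i < writers.length; i++) {
            if (targetsStatus[i]) byStatus++; // skips a writer whose transfer failed
            if (writers[i] != null) byNull++; // reaches every writer that was created
        }
        return new int[]{ byStatus, byNull };
    }

    public static void main(String[] args) {
        // Writer 1 was created (non-null), but its transfer threw an IOException,
        // so transferData2Targets() left its status as false.
        boolean[] targetsStatus = { true, false, true };
        Object[] writers = { new Object(), new Object(), new Object() };
        int[] cleared = countCleared(targetsStatus, writers);
        System.out.println(cleared[0] + " vs " + cleared[1]); // prints "2 vs 3"
    }
}
```

The status guard misses one live buffer; the null guard in the posted patch clears all three, which matches the argument that a failed transfer should not exempt a writer from clearBuffers().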
[jira] [Comment Edited] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276167#comment-17276167 ] Hongbing Wang edited comment on HDFS-15779 at 2/1/21, 9:16 AM
[jira] [Resolved] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang resolved HDFS-15797. -- Resolution: Duplicate > EC: reconstruction threads limit parameter does not take effect > --- > > Key: HDFS-15797 > URL: https://issues.apache.org/jira/browse/HDFS-15797 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > -HDFS-12044- changed _SynchronousQueue_ in stripedReconstructionPool to > unbounded _LinkedBlockingQueue_, which caused the _maximumPoolSize_ to be > invalid. The parameter +dfs.datanode.ec.reconstruction.threads+ (defaults to > 8) is therefore invalid. This parameter is misleading here, or we need to > modify the code to make it effective. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
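The queue/maximumPoolSize interaction described in HDFS-15797 is standard java.util.concurrent behavior and can be checked directly: with an unbounded LinkedBlockingQueue, ThreadPoolExecutor only spawns threads up to corePoolSize and queues everything else, so the larger maximumPoolSize never takes effect. The pool sizes below are illustrative, not the DataNode's actual configuration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizeDemo {
    // Submit more blocking tasks than corePoolSize and report how many
    // threads the pool actually created before the tasks are released.
    static int observedPoolSize(int core, int max, int tasks) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            core, max, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try { release.await(); } catch (InterruptedException ignored) { }
            });
        }
        // Threads beyond corePoolSize are only created when the queue rejects
        // an offer; an unbounded LinkedBlockingQueue never rejects.
        int size = pool.getPoolSize();
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return size;
    }

    public static void main(String[] args) throws InterruptedException {
        // maximumPoolSize=8 is ignored: the unbounded queue absorbs the backlog.
        System.out.println(observedPoolSize(2, 8, 8)); // prints 2, not 8
    }
}
```

This is why a "threads" limit wired only into maximumPoolSize silently stops working after switching from a SynchronousQueue (which always forces new threads up to the maximum) to an unbounded queue.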
[jira] [Commented] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273487#comment-17273487 ] Hongbing Wang commented on HDFS-15797: -- Thanks [~sodonnell]! Yes, it should be closed.
[jira] [Commented] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273427#comment-17273427 ] Hongbing Wang commented on HDFS-15797: -- Sorry, HDFS-14367 has already solved this.
[jira] [Updated] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
[ https://issues.apache.org/jira/browse/HDFS-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15797: - Description: -HDFS-12044- changed _SynchronousQueue_ in stripedReconstructionPool to unbounded _LinkedBlockingQueue_, which caused the _maximumPoolSize_ to be invalid. The parameter +dfs.datanode.ec.reconstruction.threads+ (defaults to 8) is therefore invalid. This parameter is misleading here, or we need to modify the code to make it effective.
[jira] [Created] (HDFS-15797) EC: reconstruction threads limit parameter does not take effect
Hongbing Wang created HDFS-15797: Summary: EC: reconstruction threads limit parameter does not take effect Key: HDFS-15797 URL: https://issues.apache.org/jira/browse/HDFS-15797 Project: Hadoop HDFS Issue Type: Bug Reporter: Hongbing Wang Assignee: Hongbing Wang
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272020#comment-17272020 ] Hongbing Wang commented on HDFS-15779: -- just fix NPE in [^HDFS-15779.001.patch]. If the writer that is not involved in the reconstruction is null, the reconstruction can be also successful. So don’t care about writer which is null when clearBuffers(). > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15779.001.patch > > > The NullPointerException in DN log as follows: > {code:java} > 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY > //... > 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Connection timed out > 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Failed to reconstruct striped block: > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Receiving > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 > src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 > 010 > {code} > NPE occurs at `writer.getTargetBuffer()` in codes: > {code:java} > // StripedWriter#clearBuffers > void clearBuffers() { > for (StripedBlockWriter writer : writers) { > ByteBuffer targetBuffer = writer.getTargetBuffer(); > if (targetBuffer != null) { > targetBuffer.clear(); > } > } > } > {code} > So, why is the writer null? Let's track when the writer is initialized and > when reconstruct() is called, as follows: > {code:java} > // StripedBlockReconstructor#run > public void run() { > try { > initDecoderIfNecessary(); > getStripedReader().init(); > stripedWriter.init(); //① > reconstruct(); //② > stripedWriter.endTargetBlocks(); > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > // ...{code} > They are called at ① and ② above respectively. `stripedWriter.init()` -> > `initTargetStreams()`, as follows: > {code:java} > // StripedWriter#initTargetStreams > int initTargetStreams() { > int nSuccess = 0; > for (short i = 0; i < targets.length; i++) { > try { > writers[i] = createWriter(i); > nSuccess++; > targetsStatus[i] = true; > } catch (Throwable e) { > LOG.warn(e.getMessage()); > } > } > return nSuccess; > } > {code} > NPE occurs when createWriter() gets an exception and 0 < nSuccess < > targets.length. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
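The null-skip fix discussed above can be sketched with a small, self-contained model. The `Writer` class below is a stand-in for Hadoop's StripedBlockWriter, not the real implementation; only the loop structure mirrors StripedWriter#clearBuffers.

```java
import java.nio.ByteBuffer;

public class ClearBuffersDemo {

    // Stand-in for StripedBlockWriter: just holds a target buffer.
    static class Writer {
        private final ByteBuffer targetBuffer;
        Writer(ByteBuffer buf) { this.targetBuffer = buf; }
        ByteBuffer getTargetBuffer() { return targetBuffer; }
    }

    // Null-safe version of the clearBuffers() loop: a slot in writers[] can
    // be null when createWriter() threw during initTargetStreams(), so the
    // writer itself must be checked before its buffer is touched.
    static void clearBuffers(Writer[] writers) {
        for (Writer writer : writers) {
            if (writer == null) {
                continue; // skip targets that failed to initialize
            }
            ByteBuffer targetBuffer = writer.getTargetBuffer();
            if (targetBuffer != null) {
                targetBuffer.clear();
            }
        }
    }

    // Builds one healthy writer (buffer position 4) and one null slot, runs
    // clearBuffers(), and reports buffer state: position of the live buffer,
    // and -1 for the null slot.
    public static int[] demo() {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putInt(42); // position is now 4
        Writer[] writers = { new Writer(buf), null };
        clearBuffers(writers); // must not throw despite the null slot
        return new int[] {
            writers[0].getTargetBuffer().position(),
            writers[1] == null ? -1 : 0
        };
    }

    public static void main(String[] args) {
        int[] positions = demo();
        System.out.println(positions[0] + " " + positions[1]); // prints "0 -1"
    }
}
```

Without the `writer == null` guard, the loop would throw an NPE on the second slot exactly as in the stack trace above.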
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Attachment: HDFS-15779.001.patch
[jira] [Comment Edited] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267024#comment-17267024 ] Hongbing Wang edited comment on HDFS-15779 at 1/18/21, 5:26 AM: I have two issues to discuss: * Should the exception be thrown only when `initTargetStreams() == 0`, rather than when the result is `< targets.length`? {code:java} // StripedWriter#init if (initTargetStreams() == 0) { String error = "All targets are failed."; throw new IOException(error); }{code} * Is simply checking whether the writer is null the best fix? {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } {code}
[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267024#comment-17267024 ] Hongbing Wang commented on HDFS-15779: -- I have two issues to discuss: # Should the exception be thrown only when `initTargetStreams() == 0`, rather than when the result is `< targets.length`? {code:java} // StripedWriter#init if (initTargetStreams() == 0) { String error = "All targets are failed."; throw new IOException(error); }{code} # Is simply checking whether the writer is null the best fix? {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } {code}
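The partial-success scenario described in the issue (createWriter() throws for some targets, so 0 < nSuccess < targets.length slips past the `== 0` check) can be sketched as a self-contained model. The names mirror StripedWriter#initTargetStreams, but this is a simplified stand-in, not the Hadoop code.

```java
import java.util.function.IntFunction;

public class InitTargetStreamsDemo {

    static final class Writer { }

    // Mirrors initTargetStreams(): per-target failures are swallowed, so a
    // partial success (0 < nSuccess < targets) leaves null slots in writers[]
    // while still passing the "all failed" guard in init().
    static Writer[] initTargetStreams(int targets, IntFunction<Writer> createWriter) {
        Writer[] writers = new Writer[targets];
        int nSuccess = 0;
        for (int i = 0; i < targets; i++) {
            try {
                writers[i] = createWriter.apply(i);
                nSuccess++;
            } catch (RuntimeException e) {
                // swallowed, like LOG.warn(e.getMessage()) in the real code
            }
        }
        if (nSuccess == 0) { // the only guard in StripedWriter#init
            throw new IllegalStateException("All targets are failed.");
        }
        return writers;
    }

    static long countNullSlots(Writer[] writers) {
        long n = 0;
        for (Writer w : writers) {
            if (w == null) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Target 1 fails to create; targets 0 and 2 succeed.
        Writer[] writers = initTargetStreams(3,
                i -> { if (i == 1) throw new RuntimeException("connect failed"); return new Writer(); });
        // init() did not throw (nSuccess == 2 > 0), yet a null slot remains --
        // exactly the state that later trips clearBuffers().
        System.out.println(countNullSlots(writers)); // prints 1
    }
}
```

This illustrates why both questions in the comment matter: with the `== 0` guard, any null slot surviving init must be tolerated by every later loop over `writers`.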
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } 
{code} So, why is the writer null? Let's track when the writer is initialized and when reconstruct() is called, as follows: {code:java} // StripedBlockReconstructor#run public void run() { try { initDecoderIfNecessary(); getStripedReader().init(); stripedWriter.init(); //① reconstruct(); //② stripedWriter.endTargetBlocks(); } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); // ...{code} They are called at ① and ② above respectively. `stripedWriter.init()` -> `initTargetStreams()`, as follows: {code:java} // StripedWriter#initTargetStreams int initTargetStreams() { int nSuccess = 0; for (short i = 0; i < targets.length; i++) { try { writers[i] = createWriter(i); nSuccess++; targetsStatus[i] = true; } catch (Throwable e) { LOG.warn(e.getMessage()); } } return nSuccess; } {code} NPE occurs when createWriter() gets an exception and 0 < nSuccess < targets.length. was: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 
2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBu
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} // StripedWriter#clearBuffers void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } 
{code} So, why is the writer null? Let's track when the writer is initialized and when reconstruct() is called, as follows: {code:java} // StripedBlockReconstructor#run public void run() { try { initDecoderIfNecessary(); getStripedReader().init(); stripedWriter.init(); //① reconstruct(); //② stripedWriter.endTargetBlocks(); } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); // ...{code} They are called at ① and ② above respectively. `stripedWriter.init()` -> `initTargetStreams()`, as follows: {code:java} // StripedWriter#initTargetStreams int initTargetStreams() { int nSuccess = 0; for (short i = 0; i < targets.length; i++) { try { writers[i] = createWriter(i); nSuccess++; targetsStatus[i] = true; } catch (Throwable e) { LOG.warn(e.getMessage()); } } return nSuccess; } {code} NPE occurs when createWriter(i) gets an exception and 0 < nSuccess < targets.length. was: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 
2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs at `writer.getTargetBuffer()` in codes: {code:java} void clearBuffers() { for (StripedBlockWriter writer : writers) { ByteBuffer targetBuffer = writer.getTargetBuffer(); if (targetBuffer != null) { targetBuffer.clear(); } } } {code} So, why is the writer 
null? Let's track when the writer is initialized and when reconstruct() is called, as follows: {code:java} // StripedBlockReconstructor#run public void run() { try { initDecoderIfNecessary(); getStripedReader().init(); stripedWriter.init(); //① reconstruct(); //② stripedWriter.endTargetBlocks(); } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); // ...{code} They are called at ① and ② above respectively. `stripedWriter.init()` -> `initTargetStreams()`, as follows: and `writers[i] = createWriter(i)` ` was: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving 
BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs in writer.getTargetBuffer(); > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > The NullPointerException in DN log as follows: > {code:java} > 2020-12-28 15:49
[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15779: - Description: The NullPointerException in DN log as follows: {code:java} 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY //... 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 010 {code} NPE occurs in writer.getTargetBuffer(); > EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block > - > > Key: HDFS-15779 > URL: https://issues.apache.org/jira/browse/HDFS-15779 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 
>Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > The NullPointerException in DN log as follows: > > {code:java} > 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY > //... > 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Connection timed out > 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Failed to reconstruct striped block: > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Receiving > BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 > src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50 > 010 > {code} > NPE occurs in writer.getTargetBuffer(); > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
Hongbing Wang created HDFS-15779: Summary: EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block Key: HDFS-15779 URL: https://issues.apache.org/jira/browse/HDFS-15779 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.2.0 Reporter: Hongbing Wang Assignee: Hongbing Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236173#comment-17236173 ] Hongbing Wang commented on HDFS-15684: -- `TestDFSOutputStream` passes locally. The other tests that failed with OOM also pass locally when a random sample of them is rerun. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, > HDFS-15684.003.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236021#comment-17236021 ] Hongbing Wang commented on HDFS-15684: -- Thanks [~ferhui], [~hexiaoqiao]. Fix the checkstyle in 003.patch. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, > HDFS-15684.003.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Attachment: HDFS-15684.003.patch > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch, > HDFS-15684.003.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232712#comment-17232712 ] Hongbing Wang commented on HDFS-15684: -- add Tests in v2 patch. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Attachment: HDFS-15684.002.patch > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch, HDFS-15684.002.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Attachment: HDFS-15684.001.patch > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > Attachments: HDFS-15684.001.patch > > > -HDFS-14694- add a feature that call recoverLease operation automatically > when DFSOutputSteam close encounters exception. When we wanted to apply this > feature to our cluster, we found that it does not support EC files. > I think this feature should take effect whether replica files or EC files. > This Jira proposes to make it effective when in the case of EC files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15684: - Description: -HDFS-14694- added a feature that calls the recoverLease operation automatically when DFSOutputStream close encounters an exception. When we wanted to apply this feature to our cluster, we found that it does not support EC files. I think this feature should take effect for both replicated files and EC files. This Jira proposes to make it effective in the case of EC files as well. > EC: Call recoverLease on DFSStripedOutputStream close exception > --- > > Key: HDFS-15684 > URL: https://issues.apache.org/jira/browse/HDFS-15684 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient, ec >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Major > > -HDFS-14694- added a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. When we > wanted to apply this feature to our cluster, we found that it does not > support EC files. > I think this feature should take effect for both replicated files and EC > files. This Jira proposes to make it effective in the case of EC files as > well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
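The description boils down to a close-time pattern: if close() throws, trigger lease recovery before propagating the error, regardless of whether the stream is replicated or striped. A hedged, self-contained sketch of that pattern (LeaseRecoverer is a stand-in for DistributedFileSystem.recoverLease; this is an illustration of the idea, not the HDFS-15684 patch):

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseRecoverySketch {
    // Hypothetical stand-in for DistributedFileSystem.recoverLease(Path).
    interface LeaseRecoverer {
        boolean recoverLease();
    }

    // If close() throws, request lease recovery before rethrowing, so the
    // half-written file is not left open under the failed client's lease.
    static void closeWithRecovery(Closeable stream, LeaseRecoverer recoverer)
            throws IOException {
        try {
            stream.close();
        } catch (IOException e) {
            recoverer.recoverLease();  // best effort; callers may poll until true
            throw e;
        }
    }

    public static void main(String[] args) {
        final boolean[] recovered = {false};
        try {
            closeWithRecovery(
                () -> { throw new IOException("simulated close failure"); },
                () -> recovered[0] = true);
        } catch (IOException expected) {
            // the close failure still surfaces to the caller
        }
        System.out.println("recovery invoked: " + recovered[0]); // recovery invoked: true
    }
}
```

The point of the Jira is that this recovery path should be taken by DFSStripedOutputStream as well, not only by the replicated-file DFSOutputStream.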
[jira] [Created] (HDFS-15684) EC: Call recoverLease on DFSStripedOutputStream close exception
Hongbing Wang created HDFS-15684: Summary: EC: Call recoverLease on DFSStripedOutputStream close exception Key: HDFS-15684 URL: https://issues.apache.org/jira/browse/HDFS-15684 Project: Hadoop HDFS Issue Type: Improvement Components: dfsclient, ec Reporter: Hongbing Wang Assignee: Hongbing Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15668) RBF: Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227128#comment-17227128 ] Hongbing Wang commented on HDFS-15668: -- Both hadoop.security.TestLdapGroupsMapping and hadoop.hdfs.server.federation.router.TestRouterRpc pass locally. > RBF: Fix RouterRPCMetrics annocation and document misplaced error > - > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15668) RBF: Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226536#comment-17226536 ] Hongbing Wang commented on HDFS-15668: -- [~ferhui] Could you help take a look? > RBF: Fix RouterRPCMetrics annocation and document misplaced error > - > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) RBF: Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Summary: RBF: Fix RouterRPCMetrics annocation and document misplaced error (was: Fix RouterRPCMetrics annocation and document misplaced error) > RBF: Fix RouterRPCMetrics annocation and document misplaced error > - > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) Fix RouterRPCMetrics annocation and document misplaced error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Summary: Fix RouterRPCMetrics annocation and document misplaced error (was: Fix RouterRPCMetrics annocation and document error) > Fix RouterRPCMetrics annocation and document misplaced error > > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) Fix RouterRPCMetrics annocation and document error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Attachment: HDFS-15668.001.patch > Fix RouterRPCMetrics annocation and document error > -- > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Attachments: HDFS-15668.001.patch > > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. > _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15668) Fix RouterRPCMetrics annocation and document error
[ https://issues.apache.org/jira/browse/HDFS-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15668: - Description: I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ and +{{ProxyOpFailureCommunicate}}+ in the [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] may be misplaced. When I reviewed the code to see the meaning of the two fields, I found that their descriptions were indeed misplaced. _Origin code_: {code:java} @Metric("Number of operations to fail to reach NN") private MutableCounterLong proxyOpFailureStandby; @Metric("Number of operations to hit a standby NN") private MutableCounterLong proxyOpFailureCommunicate; {code} _They should be_: {code:java} @Metric("Number of operations to hit a standby NN") private MutableCounterLong proxyOpFailureStandby; @Metric("Number of operations to fail to reach NN") private MutableCounterLong proxyOpFailureCommunicate; {code} > Fix RouterRPCMetrics annocation and document error > -- > > Key: HDFS-15668 > URL: https://issues.apache.org/jira/browse/HDFS-15668 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > > I found that the description of the two fields: +{{ProxyOpFailureStandby}}+ > and +{{ProxyOpFailureCommunicate}}+ in the > [website|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html#RouterRPCMetrics] > may be misplaced. > When I reviewed the code to see the meaning of the two fields, I found that > their descriptions were indeed misplaced. 
> _Origin code_: > {code:java} > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} > _They should be_: > {code:java} > @Metric("Number of operations to hit a standby NN") > private MutableCounterLong proxyOpFailureStandby; > @Metric("Number of operations to fail to reach NN") > private MutableCounterLong proxyOpFailureCommunicate; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15668) Fix RouterRPCMetrics annocation and document error
Hongbing Wang created HDFS-15668: Summary: Fix RouterRPCMetrics annocation and document error Key: HDFS-15668 URL: https://issues.apache.org/jira/browse/HDFS-15668 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Affects Versions: 3.2.0 Reporter: Hongbing Wang Assignee: Hongbing Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: HDFS-15641.addendum.patch) > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Comment: was deleted (was: Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one dn also works for this patch. So I improved UT with one dn. [^HDFS-15641.addendum.patch] is a addendum patch after v003. ) > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221188#comment-17221188 ] Hongbing Wang edited comment on HDFS-15641 at 10/27/20, 6:53 AM: - Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one dn also works for this patch. So I improved UT with one dn. [^HDFS-15641.addendum.patch] is a addendum patch after v003. was (Author: wanghongbing): Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one dn also works for this patch. So I improved UT with one dn. [^HDFS-15641.addendum.patch] is a addendum patch. > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221188#comment-17221188 ] Hongbing Wang commented on HDFS-15641: -- Thanks [~ferhui] and [~hexiaoqiao] . {quote} is it OK with one datanode? {quote} Yes, one DataNode also works for this patch, so I improved the UT to use one DataNode. [^HDFS-15641.addendum.patch] is an addendum patch. > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.addendum.patch > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, HDFS-15641.addendum.patch, deadlock.png, > deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220650#comment-17220650 ] Hongbing Wang commented on HDFS-15641: -- Thanks! Expect it to be merged!:D > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219713#comment-17219713 ] Hongbing Wang commented on HDFS-15641: -- I have provided two alternative patch versions, [^HDFS-15641.002.patch] and [^HDFS-15641.003.patch]. 003.patch simply moves the UT into TestRefreshNamenodes.java. > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.003.patch > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, > HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in jstack.log > The specific process is shown in Figure deadlock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219672#comment-17219672 ] Hongbing Wang commented on HDFS-15641:
--
Thanks [~ferhui].
{quote}Is it right?{quote}
Yes, you are right.
{quote}could you please move your UT there?{quote}
I will resubmit a patch shortly.
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219003#comment-17219003 ] Hongbing Wang commented on HDFS-15641:
--
Thanks [~ferhui] for your reply. I will explain in two steps.
(a) *The occurrence of the deadlock*: see the figure below; the corresponding jstack is [^jstack.log].
!deadlock.png|width=973,height=214!
Related locks: the monitor of the `BlockPoolManager` instance and the `read-write lock in BPOfferService`.
(b) *The fix I proposed*: In [^HDFS-15641.002.patch], I made 3 changes:
# `+BPOfferService.java+`: I only injected a test fault that delays 1s. It takes effect only in tests and does not affect the production environment; it makes both threads wait a short while after acquiring their respective locks.
# `+BPServiceActor.java+`: This is the actual fix. It ensures that `bpThread` is started only after the read lock has been acquired and released.
# `+TestRefreshNamenodesFailure.java+`: the test itself.
Merging changes 1 and 3 reproduces the deadlock; merging 1, 2 and 3 fixes it. The process after the fix is as follows:
!deadlock_fixed.png|width=1027,height=222!
Thanks again!
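The lock cycle described in (a) can be sketched outside Hadoop. In the sketch below, all names are illustrative stand-ins (`managerLock` for the `BlockPoolManager` monitor, `serviceLock` for the `BPOfferService` read-write lock), and timed `tryLock` calls stand in for the real indefinite hang so the demo terminates:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockSketch {
    // Stand-ins: managerLock ~ BlockPoolManager monitor, serviceLock ~ BPOfferService rw-lock.
    static final ReentrantLock managerLock = new ReentrantLock();
    static final ReentrantReadWriteLock serviceLock = new ReentrantReadWriteLock();

    interface TimedTry { boolean tryLock(long t, TimeUnit u) throws InterruptedException; }

    /** Returns true when each thread blocks on the lock the other one holds. */
    public static boolean demonstrateCycle() throws InterruptedException {
        CountDownLatch bothHold = new CountDownLatch(2);      // both locks are held
        CountDownLatch writeTried = new CountDownLatch(1);    // refresh gave up on the write lock
        CountDownLatch managerTried = new CountDownLatch(1);  // bp gave up on the manager lock
        boolean[] stuck = new boolean[2];

        Thread refresh = new Thread(() -> {
            managerLock.lock();                 // step 1: refreshNamenodes takes the manager lock
            try {
                bothHold.countDown(); await(bothHold);
                stuck[0] = !timedTry(serviceLock.writeLock()::tryLock); // step 3: blocked by the read lock
                writeTried.countDown();
                await(managerTried);            // keep holding until bp has tried the manager lock
            } finally { managerLock.unlock(); }
        });
        Thread bp = new Thread(() -> {
            serviceLock.readLock().lock();      // step 2: getBlockPoolId takes the read lock
            try {
                bothHold.countDown(); await(bothHold);
                await(writeTried);              // keep holding until refresh has tried the write lock
                stuck[1] = !timedTry(managerLock::tryLock);             // step 4: blocked by the manager lock
                managerTried.countDown();
            } finally { serviceLock.readLock().unlock(); }
        });
        refresh.start(); bp.start();
        refresh.join(); bp.join();
        return stuck[0] && stuck[1];
    }

    static boolean timedTry(TimedTry l) {
        try { return l.tryLock(200, TimeUnit.MILLISECONDS); }
        catch (InterruptedException e) { return false; }
    }
    static void await(CountDownLatch l) {
        try { l.await(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demonstrateCycle() ? "lock cycle: both threads blocked" : "no cycle");
    }
}
```

The latches force the same interleaving as the figure: both locks held first, then each thread tries to take the lock the other holds.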
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: deadlock_fixed.png
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218116#comment-17218116 ] Hongbing Wang commented on HDFS-15641:
--
Thanks [~hexiaoqiao] for your attention. There may be a bit of confusion here. *lifelineSender.start()* does not refer to starting the thread directly; LifelineSender overrides the start() method, as follows:
{code:java}
// BPServiceActor$LifelineSender#start
public void start() {
  lifelineThread = new Thread(this,
      formatThreadName("lifeline", lifelineNnAddr)); // formatThreadName is where the deadlock occurs
  lifelineThread.setDaemon(true);
  //...
  lifelineThread.start(); // the thread itself starts here
}

// formatThreadName
private String formatThreadName(
    final String action,
    final InetSocketAddress addr) {
  String bpId = bpos.getBlockPoolId(true);
  //...
}

// getBlockPoolId
String getBlockPoolId(boolean quiet) {
  // avoid lock contention unless the registration hasn't completed.
  String id = bpId;
  if (id != null) {
    return id;
  }
  DataNodeFaultInjector.get().delayWhenOfferServiceHoldLock();
  readLock(); // deadlock occurs here
  //...
}
{code}
To be precise, the deadlock occurs between `refreshThread` and `bpThread`. It involves the chain *start -> formatThreadName -> getBlockPoolId -> readLock and readUnlock*. So I ensure that readLock and readUnlock are completely executed before `bpThread` is started. The test I provided reproduces the deadlock before the fix and passes after it. Thanks [~hexiaoqiao] again.
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217588#comment-17217588 ] Hongbing Wang commented on HDFS-15641:
--
Fixed some issues in the UT; see [^HDFS-15641.002.patch].
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.002.patch
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216886#comment-17216886 ] Hongbing Wang edited comment on HDFS-15641 at 10/20/20, 3:14 AM:
--
I adjusted the thread start sequence in BPServiceActor to ensure that the first thread (*lifelineSender*) has acquired and released the read lock before the second thread (*bpThread*) is started.
Original code:
{code:java}
void start() {
  // ...
  bpThread.start();
  if (lifelineSender != null) {
    lifelineSender.start();
  }
}
{code}
New code:
{code:java}
void start() {
  // ...
  if (lifelineSender != null) {
    lifelineSender.start();
  }
  bpThread.start();
}
{code}
(1) The *lifelineSender* call chain: _lifelineSender.start() -> BPServiceActor.formatThreadName() -> getBlockPoolId() -> readLock() and readUnlock()_
(2) Afterward, *bpThread* is started.
So the deadlock is avoided, I think.
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.001.patch
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: HDFS-15641.000.test.patch)
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: HDFS-15641.001.patch)
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216878#comment-17216878 ] Hongbing Wang edited comment on HDFS-15641 at 10/19/20, 4:50 PM:
--
{quote}just wonder if this issue is also in trunk{quote}
Yes, it reproduces in trunk. [^HDFS-15641.000.test.patch] uses CyclicBarrier to control the thread execution order to reproduce the deadlock.
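A CyclicBarrier is a natural tool for pinning down a racy interleaving deterministically: no thread passes the barrier until every party has arrived. The following is a minimal standalone sketch of that technique (illustrative class and event names, not the actual test patch):

```java
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;

public class BarrierOrdering {
    /** Both workers must record "acquired" before either records "proceeding". */
    public static List<String> run() throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(2);
        List<String> events = new CopyOnWriteArrayList<>();
        Runnable worker = () -> {
            events.add("acquired");        // e.g. "I now hold my lock"
            try {
                barrier.await();           // block until the other thread holds its lock too
            } catch (InterruptedException | BrokenBarrierException e) {
                throw new RuntimeException(e);
            }
            events.add("proceeding");      // e.g. "now try to take the other lock"
        };
        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        // The first two events are always "acquired", whatever the scheduler does.
        System.out.println(run());
    }
}
```

Applied to the deadlock, the barrier guarantees both threads hold their first lock before either tries to take the second, which is exactly the interleaving that reproduces the hang.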
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216886#comment-17216886 ] Hongbing Wang commented on HDFS-15641:
--
I adjusted the thread start sequence in BPServiceActor to ensure that the first thread (*lifelineSender*) has acquired and released the read lock before the second thread (*bpThread*) is started.
Original code:
{code:java}
void start() {
  if ((bpThread != null) && (bpThread.isAlive())) {
    // Thread is started already
    return;
  }
  bpThread = new Thread(this);
  bpThread.setDaemon(true); // needed for JUnit testing
  bpThread.start();
  if (lifelineSender != null) {
    lifelineSender.start();
  }
}
{code}
New code:
{code:java}
void start() {
  if ((bpThread != null) && (bpThread.isAlive())) {
    // Thread is started already
    return;
  }
  bpThread = new Thread(this);
  bpThread.setDaemon(true); // needed for JUnit testing
  if (lifelineSender != null) {
    lifelineSender.start();
  }
  bpThread.start();
}
{code}
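The reorder works because LifelineSender#start does its read-lock work (formatThreadName -> getBlockPoolId) synchronously on the calling thread; by the time it returns, the read lock has already been released, so the subsequently started bpThread can no longer form the lock cycle. A simplified, illustrative sketch of that property (names are stand-ins, not the real Hadoop classes):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class StartOrderSketch {
    static final ReentrantReadWriteLock serviceLock = new ReentrantReadWriteLock();

    // Stand-in for LifelineSender#start -> formatThreadName -> getBlockPoolId:
    // the read lock is taken and released on the CALLER's thread, before returning.
    static String startLifelineSender() {
        serviceLock.readLock().lock();
        try {
            String name = "lifeline-bp-1"; // would come from getBlockPoolId()
            Thread lifeline = new Thread(() -> {}, name);
            lifeline.setDaemon(true);
            lifeline.start();
            return name;
        } finally {
            serviceLock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String name = startLifelineSender();   // read lock fully released by this point
        // Only now is bpThread started, so it cannot race with the read lock above.
        Thread bpThread = new Thread(() ->
                System.out.println("bpThread started after " + name + " was set up"));
        bpThread.start();
        bpThread.join();
    }
}
```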
[jira] [Comment Edited] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216878#comment-17216878 ] Hongbing Wang edited comment on HDFS-15641 at 10/19/20, 4:38 PM:
--
{quote}just wonder if this issue is also in trunk{quote}
Yes, it reproduces in trunk. [^HDFS-15641.000.test.patch] uses CyclicBarrier to control the thread execution order to reproduce the deadlock.
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.001.patch
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216878#comment-17216878 ] Hongbing Wang commented on HDFS-15641:
--
{quote}just wonder if this issue is also in trunk{quote}
Yes, it reproduces in trunk. [^HDFS-15641.000.test.patch] uses CyclicBarrier to control the thread execution order to reproduce the deadlock.
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Description: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in jstack.log The specific process is shown in Figure deadlock.png was: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in jstack.log The specific process is shown in Figure RefreshNameNode_DeadLock.png
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: RefreshNameNode_DeadLock.png)
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: deadlock.png
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Description: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in jstack.log The specific process is shown in Figure RefreshNameNode_DeadLock.png was: DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes hostname:50020` to register a new namespace in federation env. The jstack is shown in RefreshNameNode_DeadLock.jstack. The specific process is shown in Figure RefreshNameNode_DeadLock.png
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: (was: RefreshNameNode_DeadLock.jstack)
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: jstack.log
[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216690#comment-17216690 ] Hongbing Wang commented on HDFS-15641:
--
I added a test, [^HDFS-15641.000.test.patch], to reproduce this deadlock. A patch solving the problem will be attached later. [~hexiaoqiao] Could you help take a look?
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: HDFS-15641.000.test.patch
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: RefreshNameNode_DeadLock.jstack
[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
[ https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15641: - Attachment: RefreshNameNode_DeadLock.png > DataNode could meet deadlock if invoke refreshNameNode > -- > > Key: HDFS-15641 > URL: https://issues.apache.org/jira/browse/HDFS-15641 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Critical > Attachments: RefreshNameNode_DeadLock.jstack, > RefreshNameNode_DeadLock.png > > > DataNode could meet deadlock when invoke `hdfs dfsadmin -refreshNamenodes > hostname:50020` to register a new namespace in federation env. > The jstack is shown in RefreshNameNode_DeadLock.jstack. > The specific process is shown in Figure RefreshNameNode_DeadLock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode
Hongbing Wang created HDFS-15641: Summary: DataNode could meet deadlock if invoke refreshNameNode Key: HDFS-15641 URL: https://issues.apache.org/jira/browse/HDFS-15641 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.2.0 Reporter: Hongbing Wang Assignee: Hongbing Wang The DataNode can hit a deadlock when `hdfs dfsadmin -refreshNamenodes hostname:50020` is invoked to register a new namespace in a federation environment. The jstack is shown in RefreshNameNode_DeadLock.jstack. The specific process is shown in Figure RefreshNameNode_DeadLock.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
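The actual lock objects involved are in the attached jstack and diagram, not inlined here. As a generic, hypothetical illustration of the failure mode — two threads acquiring the same pair of monitors in opposite order during a namespace refresh — the following minimal sketch deadlocks deterministically (the lock names are illustrative, not the real DataNode internals):

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical lock-ordering deadlock: t1 holds lockA and waits for lockB,
// while t2 holds lockB and waits for lockA. A latch guarantees both threads
// hold their first lock before trying the second, so the deadlock is certain.
public class DeadlockSketch {
    private static final Object lockA = new Object(); // illustrative: a datanode-wide lock
    private static final Object lockB = new Object(); // illustrative: a per-namespace lock

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        Thread t1 = new Thread(() -> {
            synchronized (lockA) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);   // wait until t2 holds lockB
                synchronized (lockB) { }           // blocks forever
            }
        });
        Thread t2 = new Thread(() -> {
            synchronized (lockB) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);   // wait until t1 holds lockA
                synchronized (lockA) { }           // blocks forever
            }
        });
        t1.setDaemon(true);                        // let the JVM exit despite the deadlock
        t2.setDaemon(true);
        t1.start();
        t2.start();
        t1.join(500);                              // give both threads time to tangle
        System.out.println("deadlocked=" + (t1.isAlive() && t2.isAlive())); // prints deadlocked=true
    }

    private static void awaitQuietly(CountDownLatch l) {
        try { l.await(); } catch (InterruptedException ignored) { }
    }
}
```

`jstack` on such a process reports the cycle directly ("Found one Java-level deadlock"), which is presumably how the attached RefreshNameNode_DeadLock.jstack was produced.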
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190115#comment-17190115 ] Hongbing Wang commented on HDFS-15556: -- BPServiceActor uses the `initialRegistrationComplete` variable, of type `CountDownLatch(1)`, to ensure that the sendLifeline thread runs only after registration has completed. This guard does not take effect on reRegister, because `initialRegistrationComplete` was already counted down during the first registration. > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode hits an NPE when processing lifeline messages > sent by the DataNode, which leads to an incorrect maxLoad being calculated by the NN. > Because the DataNode is then identified as busy and no available nodes can be > allocated when choosing DataNodes, the placement loop repeats, resulting in high CPU and reduced > processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... 
> for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
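The latch behavior the comment describes can be shown with a small self-contained sketch: a `CountDownLatch(1)` is a one-shot barrier, so once the first registration calls `countDown()`, every later `await()` — including one racing a re-registration — returns immediately. (The variable name mirrors the comment; this is an illustration, not the actual BPServiceActor code.)

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// A CountDownLatch(1) gates only the *first* registration: after countDown(),
// await() always returns immediately and the latch cannot be "re-armed", so a
// lifeline sender is no longer blocked while a re-registration is in progress.
public class LatchSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch initialRegistrationComplete = new CountDownLatch(1);

        initialRegistrationComplete.countDown();        // first registration finishes

        // Simulated re-registration starts here; the gate is already open.
        boolean passedImmediately =
            initialRegistrationComplete.await(0, TimeUnit.MILLISECONDS);
        System.out.println("lifelineGated=" + !passedImmediately); // prints lifelineGated=false
    }
}
```

This matches the NPE above: the lifeline can reach the NameNode while a re-registration has not yet repopulated the descriptor's `storageMap`, leaving `storage` null at `storage.receivedHeartbeat(report)`.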
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158991#comment-17158991 ] Hongbing Wang commented on HDFS-15240: -- [~marvelrock] We have the same problem. Alongside it, there are frequent Full GCs (every few seconds). A heap dump analyzed with MAT shows lots of ecWorker objects that almost fill the entire heap. !image-2020-07-16-15-56-38-608.png|width=722,height=591! Looking forward to this patch landing in trunk. > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, > HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch, > image-2020-07-16-15-56-38-608.png > > > When reading some lzo files, we found some blocks were broken. > I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from > the DN directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), > and found the longest common substring (LCS) between b6' (decoded) and > b6 (read from the DN), and likewise for b7'/b7 and b8'/b8. > After selecting 6 blocks of the block group per combination and > iterating through all cases, I found one case where the LCS length is the > block length - 64KB; 64KB is exactly the length of the ByteBuffer used by > StripedBlockReader. So the corrupt reconstruction block was produced by a dirty > buffer. > The following log snippet (showing only 2 of 28 cases) is my check program's > output. In my case, I knew the 3rd block was corrupt, so the other 5 blocks were needed > to decode another 3 blocks; I then found the 1st block's LCS is the block > length - 64KB. 
> It means (0,1,2,4,5,6)th blocks were used to reconstruct 3th block, and the > dirty buffer was used before read the 1th block. > Must be noted that StripedBlockReader read from the offset 0 of the 1th block > after used the dirty buffer. > {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] > Check Block(1) first 131072 bytes longest common substring length 4 > Check Block(6) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4 > decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] > Check Block(1) first 131072 bytes longest common substring length 65536 > CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length > 27197440 # this one > Check Block(7) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4{code} > Now I know the dirty buffer causes reconstruction block error, but how does > the dirty buffer come about? > After digging into the code and DN log, I found this following DN log is the > root reason. > {code:java} > [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel > java.nio.channels.SocketChannel[connected local=/:52586 > remote=/:50010]. 18 millis timeout left. 
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped > block: BP-714356632--1519726836856:blk_-YY_3472979393 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) {code} > Reading from a DN may time out (held by a future F) and output the INFO log, but > the futures collection that contained future F has been cleared, > {code:java} > return new StripingChunkReadResult(futures.remove(future), > StripingChunkReadResult.CANCELLED); {code} > so futures.remove(future) causes an NPE and the EC reconstruction fails. In the > finally phase, the code snippet in *getStripedReader().close()* frees the buffer > first, but the StripedBlockReader still holds the buffer and writes to it.
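The dirty-buffer mechanism the report describes — a pooled read buffer handed back for reuse while a reader can still write into it — can be reproduced with a minimal, hypothetical pool sketch (the pool and method names are illustrative, not the actual StripedReader API):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the close-ordering hazard: free(buf) returns a buffer to the pool
// without zeroing it, so the next borrower sees stale ("dirty") bytes. If the
// reader is closed *before* the buffer is freed, it can no longer write into
// a buffer someone else has borrowed, removing the race.
public class BufferPoolSketch {
    private final Deque<ByteBuffer> pool = new ArrayDeque<>();

    ByteBuffer borrow() {
        ByteBuffer b = pool.poll();
        return (b != null) ? b : ByteBuffer.allocate(64 * 1024); // 64KB, as in the report
    }

    void free(ByteBuffer b) {
        b.clear();     // resets position/limit only -- contents are NOT erased
        pool.push(b);
    }

    public static void main(String[] args) {
        BufferPoolSketch sketch = new BufferPoolSketch();
        ByteBuffer buf = sketch.borrow();
        buf.put("stale".getBytes());   // a still-live reader writes into the buffer

        // Unsafe order (what the report describes): the buffer is freed while
        // the reader could still hold it, then the next user borrows it back.
        sketch.free(buf);
        ByteBuffer reused = sketch.borrow();
        byte[] head = new byte[5];
        reused.get(head);
        System.out.println("dirty=" + new String(head)); // prints dirty=stale
    }
}
```

In this model the fix the description implies is simply to close the reader before `free()`, so no thread can still write into a buffer after it is recycled.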
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang updated HDFS-15240: - Attachment: image-2020-07-16-15-56-38-608.png > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, > HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch, > image-2020-07-16-15-56-38-608.png > > > When read some lzo files we found some blocks were broken. > I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from > DN directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') > blocks. And find the longest common sequenece(LCS) between b6'(decoded) and > b6(read from DN)(b7'/b7 and b8'/b8). > After selecting 6 blocks of the block group in combinations one time and > iterating through all cases, I find one case that the length of LCS is the > block length - 64KB, 64KB is just the length of ByteBuffer used by > StripedBlockReader. So the corrupt reconstruction block is made by a dirty > buffer. > The following log snippet(only show 2 of 28 cases) is my check program > output. In my case, I known the 3th block is corrupt, so need other 5 blocks > to decode another 3 blocks, then find the 1th block's LCS substring is block > length - 64kb. > It means (0,1,2,4,5,6)th blocks were used to reconstruct 3th block, and the > dirty buffer was used before read the 1th block. > Must be noted that StripedBlockReader read from the offset 0 of the 1th block > after used the dirty buffer. 
> {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] > Check Block(1) first 131072 bytes longest common substring length 4 > Check Block(6) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4 > decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] > Check Block(1) first 131072 bytes longest common substring length 65536 > CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length > 27197440 # this one > Check Block(7) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4{code} > Now I know the dirty buffer causes reconstruction block error, but how does > the dirty buffer come about? > After digging into the code and DN log, I found this following DN log is the > root reason. > {code:java} > [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel > java.nio.channels.SocketChannel[connected local=/:52586 > remote=/:50010]. 18 millis timeout left. 
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped > block: BP-714356632--1519726836856:blk_-YY_3472979393 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) {code} > Reading from DN may timeout(hold by a future(F)) and output the INFO log, but > the futures that contains the future(F) is cleared, > {code:java} > return new StripingChunkReadResult(futures.remove(future), > StripingChunkReadResult.CANCELLED); {code} > futures.remove(future) cause NPE. So the EC reconstruction is failed. In the > finally phase, the code snippet in *getStripedReader().close()* > {code:java} > reconstructor.freeBuffer(reader.getReadBuffer()); > reader.freeReadBuffer(); > reader.closeBlockReader(); {code} > free buffer firstly, but the StripedBlockReader still holds the buffer and > write it. -- This message was sent by Atlassian Jir
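The verification method described above — decode a block from the surviving internal blocks, then measure the longest common substring (LCS) between the decoded bytes and the bytes read from the DataNode — can be sketched as follows. An LCS of (block length - 64KB) is what implicated the single dirty 64KB reader buffer. This is an illustrative check on toy data, not the author's actual program:

```java
// Minimal sketch of the LCS check: a classic O(n*m) dynamic program over two
// byte arrays, adequate for a prefix check (e.g. the first 128KB of a block).
public class LcsCheckSketch {
    static int longestCommonSubstring(byte[] a, byte[] b) {
        int[] prev = new int[b.length + 1];
        int best = 0;
        for (int i = 1; i <= a.length; i++) {
            int[] cur = new int[b.length + 1];
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1] == b[j - 1]) {
                    cur[j] = prev[j - 1] + 1;      // extend the match ending at (i, j)
                    best = Math.max(best, cur[j]);
                }
            }
            prev = cur;
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] decoded = "AAAABBBBCCCC".getBytes();  // stand-in for b1' (decoded)
        byte[] fromDn  = "XXAABBBBCCXX".getBytes();  // stand-in for b1 (read from DN)
        // A corrupt span at head/tail shortens the LCS relative to full length.
        System.out.println("lcs=" + longestCommonSubstring(decoded, fromDn)); // prints lcs=8
    }
}
```

On real blocks, an LCS close to the full block length localizes the corruption to a single contiguous span — here, exactly one 64KB `StripedBlockReader` buffer.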
[jira] [Assigned] (HDFS-15425) Review Logging of DFSClient
[ https://issues.apache.org/jira/browse/HDFS-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbing Wang reassigned HDFS-15425: Assignee: Hongbing Wang > Review Logging of DFSClient > --- > > Key: HDFS-15425 > URL: https://issues.apache.org/jira/browse/HDFS-15425 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient >Reporter: Hongbing Wang >Assignee: Hongbing Wang >Priority: Minor > Fix For: 3.4.0 > > Attachments: HDFS-15425.001.patch, HDFS-15425.002.patch, > HDFS-15425.003.patch > > > Review use of SLF4J for DFSClient.LOG. > Make the code more concise and readable. > Less is more ! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org