[jira] [Commented] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor

2021-08-23 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403509#comment-17403509
 ] 

tomscut commented on HDFS-16112:


Thanks [~weichiu] for your help.

> Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor 
> 
>
> Key: HDFS-16112
> URL: https://issues.apache.org/jira/browse/HDFS-16112
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>
> The unit tests 
> TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and 
> TestDecommissioningStatus#testDecommissionStatus have recently seemed a 
> little flaky; we should fix them.






[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640933&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640933
 ]

ASF GitHub Bot logged work on HDFS-16182:
-

Author: ASF GitHub Bot
Created on: 24/Aug/21 04:06
Start Date: 24/Aug/21 04:06
Worklog Time Spent: 10m 
  Work Description: Neilxzn edited a comment on pull request #3320:
URL: https://github.com/apache/hadoop/pull/3320#issuecomment-904304994


   @jojochuang Agree with you. I think we should fix it.
   
   In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the 
number of SSD DNs is much smaller than the number of DISK DNs. This can cause 
blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD 
DNs are too busy or have no free space. Consider the following scenario (a 
sketch of the failing check follows the list).
   
   1. Create an empty file /foo_file.
   2. Set its storage policy to All_SSD.
   3. Put data to /foo_file.
   4. /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too 
busy at the beginning.
   5. While transferring data in the pipeline, one of the 3 DISK dns shuts down.
   6. The client needs to get one new dn for the existing pipeline in 
DataStreamer$addDatanode2ExistingPipeline.
   7. If SSD dns are available at that moment, the namenode will choose 3 SSD 
dns and return them to the client. However, the client needs just one new dn; 
the namenode returns 3 new SSD dns, and the client throws an exception in 
DataStreamer$findNewDatanode.
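
   As a minimal, self-contained sketch of the check that trips here 
(paraphrased from the DataStreamer$findNewDatanode snippet quoted in the issue 
description; the driver class and node labels are illustrative, not Hadoop 
source):

```java
import java.io.IOException;
import java.util.Arrays;

// Stand-in for the pipeline-recovery check: after asking the namenode for
// one additional datanode, the client expects exactly original + 1 nodes.
public class FindNewDatanodeSketch {
  static String findNewDatanode(String[] original, String[] nodes)
      throws IOException {
    if (nodes.length != original.length + 1) {
      throw new IOException("Failed to replace a bad datanode on the existing"
          + " pipeline (Nodes: current=" + Arrays.asList(nodes)
          + ", original=" + Arrays.asList(original) + ").");
    }
    for (String n : nodes) {
      if (!Arrays.asList(original).contains(n)) {
        return n; // the single node not in 'original' is the replacement
      }
    }
    throw new IOException("Failed to find the new datanode.");
  }

  public static void main(String[] args) throws IOException {
    String[] original = {"dn01:DISK", "dn02:DISK"};
    // Bug scenario: the namenode returns 3 new SSD nodes instead of 1, so
    // 'current' has original.length + 3 entries and the check above throws.
    String[] current =
        {"dn01:DISK", "dn02:DISK", "dn03:SSD", "dn04:SSD", "dn05:SSD"};
    findNewDatanode(original, current);
  }
}
```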




Issue Time Tracking
---

Worklog Id: (was: 640933)
Time Spent: 50m  (was: 40m)

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
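
A hedged sketch of the proposed direction (an illustration of the idea only; 
not the actual patch, and the real BlockPlacementPolicyDefault$chooseTarget 
signature is more involved): keep the caller's requested replica count instead 
of overwriting it from the storage-policy requirement.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: the suspected bug is the commented-out reassignment,
// which for an All_SSD file turns "give me 1 additional node" into
// "give me 3 SSD nodes" during pipeline recovery.
public class ChooseTargetSketch {
  static List<String> chooseTarget(int numOfReplicas,
      List<String> requiredStorageTypes, List<String> candidateNodes) {
    // Buggy variant (sketched): numOfReplicas = requiredStorageTypes.size();
    int toChoose = Math.min(numOfReplicas, candidateNodes.size());
    return candidateNodes.subList(0, toChoose);
  }

  public static void main(String[] args) {
    List<String> ssdNodes = Arrays.asList("dn03", "dn04", "dn05");
    // Pipeline recovery asks for exactly one additional node.
    System.out.println(chooseTarget(1,
        Arrays.asList("SSD", "SSD", "SSD"), ssdNodes)); // prints [dn03]
  }
}
```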

[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403006#comment-17403006
 ] 

Max  Xie edited comment on HDFS-16182 at 8/24/21, 4:06 AM:
---

In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. Consider the following scenario.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.
 # The client needs to get one new dn for the existing pipeline in 
DataStreamer$addDatanode2ExistingPipeline.
 # If SSD dns are available at that moment, the namenode will choose 3 SSD dns 
and return them to the client. However, the client needs just one new dn; the 
namenode returns 3 new SSD dns, and the client fails in 
DataStreamer$findNewDatanode.

 


was (Author: max2049):
In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. Consider the following scenario.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.
 # The client needs to get one new dn for the existing pipeline.
 # If SSD dns are available at that moment, the namenode will choose 3 SSD dns 
and return them to the client. However, the client needs just one new dn; the 
namenode returns 3 new SSD dns, and the client fails in DataStreamer.

 

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
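
For anyone hitting this in production before a fix lands, the exception 
message points at the client-side replacement settings; a minimal sketch using 
the standard HDFS client keys (the values shown are illustrative, not a 
recommendation):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: tune the client's datanode-replacement behavior. These are the
// standard dfs.client.block.write.replace-datanode-on-failure.* keys named
// in the exception message.
public class ReplaceDatanodeConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set(
        "dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
    // Best-effort lets the write continue with the surviving pipeline if a
    // replacement datanode cannot be obtained.
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
  }
}
```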

[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640932&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640932
 ]

ASF GitHub Bot logged work on HDFS-16182:
-

Author: ASF GitHub Bot
Created on: 24/Aug/21 04:04
Start Date: 24/Aug/21 04:04
Worklog Time Spent: 10m 
  Work Description: Neilxzn commented on pull request #3320:
URL: https://github.com/apache/hadoop/pull/3320#issuecomment-904304994


   Agreed. I think we should fix it.
   
   In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the 
number of SSD DNs is much smaller than the number of DISK DNs. This can cause 
blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD 
DNs are too busy or have no free space. Consider the following scenario.
   
   1. Create an empty file /foo_file.
   2. Set its storage policy to All_SSD.
   3. Put data to /foo_file.
   4. /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too 
busy at the beginning.
   5. While transferring data in the pipeline, one of the 3 DISK dns shuts down.
   6. The client needs to get one new dn for the existing pipeline in 
DataStreamer$addDatanode2ExistingPipeline.
   7. If SSD dns are available at that moment, the namenode will choose 3 SSD 
dns and return them to the client. However, the client needs just one new dn; 
the namenode returns 3 new SSD dns, and the client throws an exception in 
DataStreamer$findNewDatanode.




Issue Time Tracking
---

Worklog Id: (was: 640932)
Time Spent: 40m  (was: 0.5h)

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.

[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403006#comment-17403006
 ] 

Max  Xie edited comment on HDFS-16182 at 8/24/21, 4:00 AM:
---

In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. Consider the following scenario.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.
 # The client needs to get one new dn for the existing pipeline.
 # If SSD dns are available at that moment, the namenode will choose 3 SSD dns 
and return them to the client. However, the client needs just one new dn; the 
namenode returns 3 new SSD dns, and the client fails in DataStreamer.

 


was (Author: max2049):
In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. The steps are as follows.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.
 # The client needs to get one new dn for the existing pipeline.
 # If SSD dns are available at that moment, the namenode will choose 3 SSD dns 
and return them to the client. However, the client needs just one new dn; the 
namenode returns 3 new SSD dns, and the client fails.

 

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.

[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640930&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640930
 ]

ASF GitHub Bot logged work on HDFS-16182:
-

Author: ASF GitHub Bot
Created on: 24/Aug/21 03:49
Start Date: 24/Aug/21 03:49
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on pull request #3320:
URL: https://github.com/apache/hadoop/pull/3320#issuecomment-904299806


   Don't really understand the code, but this seems like an ancient regression 
from HDFS-6686. As general code advice, we should not update a parameter 
variable and pass it on (a small sketch follows).
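
   A tiny sketch of that advice (hypothetical method, not the Hadoop source): 
leave the parameter untouched and work on a local copy, so the caller's 
request stays recoverable.

```java
import java.util.Arrays;
import java.util.List;

public class ParameterCopySketch {
  // Anti-pattern (sketched): numOfReplicas = requiredStorageTypes.size();
  // silently overwrites what the caller asked for. Preferred: derive a
  // local value and leave the parameter as documentation of the request.
  static int replicasToChoose(int numOfReplicas,
      List<String> requiredStorageTypes) {
    int toChoose = Math.min(numOfReplicas, requiredStorageTypes.size());
    return toChoose;
  }

  public static void main(String[] args) {
    // Client asked for 1 extra node; an All_SSD policy requires 3 SSD slots.
    System.out.println(
        replicasToChoose(1, Arrays.asList("SSD", "SSD", "SSD"))); // prints 1
  }
}
```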




Issue Time Tracking
---

Worklog Id: (was: 640930)
Time Spent: 0.5h  (was: 20m)

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  






[jira] [Resolved] (HDFS-16180) FsVolumeImpl.nextBlock should consider that the block meta file has been deleted.

2021-08-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16180.

Fix Version/s: 3.4.0
   Resolution: Fixed

> FsVolumeImpl.nextBlock should consider that the block meta file has been 
> deleted.
> -
>
> Key: HDFS-16180
> URL: https://issues.apache.org/jira/browse/HDFS-16180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In my cluster, we found that when the VolumeScanner runs, the dn sometimes 
> logs errors like the ones below:
> ```
>  
> 2021-08-19 08:00:11,549 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1020175758-nnip-1597745872895 blk_1142977964_69237147 URI 
> file:/disk1/dfs/data/current/BP-1020175758- 
> nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964
> 2021-08-19 08:00:48,368 ERROR 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl: 
> nextBlock(DS-060c8e4c-1ef6-49f5-91ef-91957356891a, BP-1020175758- 
> nnip-1597745872895): I/O error
> java.io.IOException: Meta file not found, 
> blockFile=/disk1/dfs/data/current/BP-1020175758- 
> nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetUtil.findMetaFile(FsDatasetUtil.java:101)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.nextBlock(FsVolumeImpl.java:809)
> at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:528)
> at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:628)
> 2021-08-19 08:00:48,368 WARN 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner: 
> VolumeScanner(/disk1/dfs/data, DS-060c8e4c-1ef6-49f5-91ef-91957356891a): 
> nextBlock error on 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@7febc6b4
> ```
> When the VolumeScanner scans block blk_1142977964, the block has already been 
> deleted by the datanode, so the scanner cannot find the meta file of 
> blk_1142977964 and logs these errors.
>  
> Maybe we should handle FileNotFoundException during nextBlock to reduce the 
> error logging and the number of nextBlock retries (a sketch follows below).
>  
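
A minimal sketch of the proposed handling, assuming a simplified block 
iterator (the real method is FsVolumeImpl$BlockIteratorImpl.nextBlock; the 
names below are illustrative):

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Iterator;

// Sketch: if a block's meta file disappears between listing and scanning
// (because the datanode deleted the block), skip it instead of surfacing
// an I/O error to the VolumeScanner.
public class NextBlockSketch {
  static String nextBlock(Iterator<String> blockIds) throws IOException {
    while (blockIds.hasNext()) {
      String blockId = blockIds.next();
      try {
        findMetaFile(blockId); // the real code resolves the .meta file on disk
        return blockId;
      } catch (FileNotFoundException e) {
        // Block was deleted concurrently; log at debug level and move on.
      }
    }
    return null; // iterator exhausted
  }

  private static void findMetaFile(String blockId)
      throws FileNotFoundException {
    if (blockId.startsWith("deleted")) {
      throw new FileNotFoundException("Meta file not found, block=" + blockId);
    }
  }
}
```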






[jira] [Work logged] (HDFS-16180) FsVolumeImpl.nextBlock should consider that the block meta file has been deleted.

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16180?focusedWorklogId=640926&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640926
 ]

ASF GitHub Bot logged work on HDFS-16180:
-

Author: ASF GitHub Bot
Created on: 24/Aug/21 03:15
Start Date: 24/Aug/21 03:15
Worklog Time Spent: 10m 
  Work Description: jojochuang merged pull request #3315:
URL: https://github.com/apache/hadoop/pull/3315


   




Issue Time Tracking
---

Worklog Id: (was: 640926)
Time Spent: 1.5h  (was: 1h 20m)

> FsVolumeImpl.nextBlock should consider that the block meta file has been 
> deleted.
> -
>
> Key: HDFS-16180
> URL: https://issues.apache.org/jira/browse/HDFS-16180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In my cluster, we found that when the VolumeScanner runs, the dn sometimes 
> logs errors like the ones below:
> ```
>  
> 2021-08-19 08:00:11,549 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1020175758-nnip-1597745872895 blk_1142977964_69237147 URI 
> file:/disk1/dfs/data/current/BP-1020175758- 
> nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964
> 2021-08-19 08:00:48,368 ERROR 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl: 
> nextBlock(DS-060c8e4c-1ef6-49f5-91ef-91957356891a, BP-1020175758- 
> nnip-1597745872895): I/O error
> java.io.IOException: Meta file not found, 
> blockFile=/disk1/dfs/data/current/BP-1020175758- 
> nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetUtil.findMetaFile(FsDatasetUtil.java:101)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.nextBlock(FsVolumeImpl.java:809)
> at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:528)
> at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:628)
> 2021-08-19 08:00:48,368 WARN 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner: 
> VolumeScanner(/disk1/dfs/data, DS-060c8e4c-1ef6-49f5-91ef-91957356891a): 
> nextBlock error on 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@7febc6b4
> ```
> When the VolumeScanner scans block blk_1142977964, the block has already been 
> deleted by the datanode, so the scanner cannot find the meta file of 
> blk_1142977964 and logs these errors.
>  
> Maybe we should handle FileNotFoundException during nextBlock to reduce the 
> error logging and the number of nextBlock retries.
>  






[jira] [Resolved] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor

2021-08-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16112.

Resolution: Duplicate

Closed it for you. You have already been granted contributor privileges, so 
you should be able to close it yourself (that is my understanding).

> Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor 
> 
>
> Key: HDFS-16112
> URL: https://issues.apache.org/jira/browse/HDFS-16112
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>
> The unit tests 
> TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and 
> TestDecommissioningStatus#testDecommissionStatus have recently seemed a 
> little flaky; we should fix them.






[jira] [Assigned] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor

2021-08-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16112:
--

Assignee: tomscut

> Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor 
> 
>
> Key: HDFS-16112
> URL: https://issues.apache.org/jira/browse/HDFS-16112
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>
> The unit tests 
> TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and 
> TestDecommissioningStatus#testDecommissionStatus have recently seemed a 
> little flaky; we should fix them.






[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403006#comment-17403006
 ] 

Max  Xie edited comment on HDFS-16182 at 8/24/21, 2:57 AM:
---

In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. The steps are as follows.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.
 # The client needs to get one new dn for the existing pipeline.
 # If SSD dns are available at that moment, the namenode will choose 3 SSD dns 
and return them to the client. However, the client needs just one new dn; the 
namenode returns 3 new SSD dns, and the client fails.

 


was (Author: max2049):
In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. The steps are as follows.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.
 # The client needs to get one new dn for the existing pipeline.
 # If SSD dns are available at that moment, the namenode will choose 3 SSD dns 
and return them to the client. However, the client needs just one new dn; the 
namenode returns 3 new SSD dns, and the client fails.

 

 

 

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  

[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403006#comment-17403006
 ] 

Max  Xie edited comment on HDFS-16182 at 8/24/21, 2:55 AM:
---

In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. The steps are as follows.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.
 # The client needs to get one new dn for the existing pipeline.
 # If SSD dns are available at that moment, the namenode will choose 3 SSD dns 
and return them to the client. However, the client needs just one new dn; the 
namenode returns 3 new SSD dns, and the client fails.

 

 

 


was (Author: max2049):
In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. The steps are as follows.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.

[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403006#comment-17403006
 ] 

Max  Xie edited comment on HDFS-16182 at 8/24/21, 2:45 AM:
---

In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are 
too busy or have no free space. The steps are as follows.
 # Create an empty file /foo_file.
 # Set its storage policy to All_SSD.
 # Put data to /foo_file.
 # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy 
at the beginning.
 # While transferring data in the pipeline, one of the 3 DISK dns shuts down.


was (Author: max2049):
In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks 
that should be placed on SSD DNs to fall back to DISK DNs.

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  






[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403460#comment-17403460
 ] 

Max  Xie commented on HDFS-16182:
-

     hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl,

     hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped

These unit test failures seem unrelated; I ran them again locally in IDEA, and 
they pass.

 

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the hdfs client transfers data in a 
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a 
> replacement dn to transfer data, the client gets one additional dn from the 
> namenode and checks that the number of dns is the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  






[jira] [Commented] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor

2021-08-23 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403441#comment-17403441
 ] 

tomscut commented on HDFS-16112:


This was fixed by [HDFS-16171|https://issues.apache.org/jira/browse/HDFS-16171].

Hi [~weichiu] [~tasanuma] [~ferhui], may I ask how to close this issue? Thanks.

> Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor 
> 
>
> Key: HDFS-16112
> URL: https://issues.apache.org/jira/browse/HDFS-16112
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Priority: Minor
>
> The unit tests 
> TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and 
> TestDecommissioningStatus#testDecommissionStatus have recently seemed a 
> little flaky; we should fix them.






[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640792
 ]

ASF GitHub Bot logged work on HDFS-16143:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 17:58
Start Date: 23/Aug/21 17:58
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3235:
URL: https://github.com/apache/hadoop/pull/3235#issuecomment-903987999


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 52s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 27s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m  2s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  22m 55s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  19m 25s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   3m 50s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m  4s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m  7s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 10s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   5m 49s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  19m  3s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 24s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 16s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m  8s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  22m  8s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  19m 37s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |  19m 37s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   3m 53s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   2m 59s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m  4s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   3m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   6m 16s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  19m 25s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  16m 47s |  |  hadoop-common in the patch 
passed.  |
   | -1 :x: |  unit  | 333m  9s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3235/37/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 58s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 547m 19s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestBlockScanner |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3235/37/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3235 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 84206923342f 4.15.0-147-generic #151-Ubuntu SMP Fri Jun 18 
19:21:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / c18b3d3013658348a1c32b090ea7b8a6c06634ae |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |

[jira] [Work logged] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-6874?focusedWorklogId=640761&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640761
 ]

ASF GitHub Bot logged work on HDFS-6874:


Author: ASF GitHub Bot
Created on: 23/Aug/21 16:31
Start Date: 23/Aug/21 16:31
Worklog Time Spent: 10m 
  Work Description: amahussein commented on a change in pull request #3322:
URL: https://github.com/apache/hadoop/pull/3322#discussion_r694125842



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/web/WebHdfsFileSystem.java
##
@@ -1857,18 +1859,57 @@ public synchronized void cancelDelegationToken(final 
Token token
   }
 
   @Override
-  public BlockLocation[] getFileBlockLocations(final Path p,
-  final long offset, final long length) throws IOException {
+  public BlockLocation[] getFileBlockLocations(final Path p, final long offset,
+  final long length) throws IOException {
 statistics.incrementReadOps(1);
 storageStatistics.incrementOpCounter(OpType.GET_FILE_BLOCK_LOCATIONS);
+BlockLocation[] locations = null;
+try {
+  if (isServerHCFSCompatible) {
+locations =
+getFileBlockLocations(GetOpParam.Op.GETFILEBLOCKLOCATIONS, p, 
offset, length);
+  } else {
+locations = getFileBlockLocations(GetOpParam.Op.GET_BLOCK_LOCATIONS, p,
+offset, length);
+  }
+} catch (RemoteException e) {
+  if (isGetFileBlockLocationsException(e)) {

Review comment:
   ```suggestion
 // parsing the exception is needed only if the client thinks the 
service is compatible
 if (isServerHCFSCompatible && isGetFileBlockLocationsException(e)) {
   ```
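   With the suggested guard applied, the whole fallback reads as one compact
   path. The ops, the `isServerHCFSCompatible` flag, and the
   `isGetFileBlockLocationsException` helper are all from the diff above; only
   the early-return shape of this sketch is editorial:
   ```java
   // Sketch of the body of getFileBlockLocations(p, offset, length) above.
   BlockLocation[] locations;
   try {
     locations = isServerHCFSCompatible
         ? getFileBlockLocations(GetOpParam.Op.GETFILEBLOCKLOCATIONS, p, offset, length)
         : getFileBlockLocations(GetOpParam.Op.GET_BLOCK_LOCATIONS, p, offset, length);
   } catch (RemoteException e) {
     // Parse the exception only while the client still believes the server
     // supports the new op; otherwise rethrow unchanged.
     if (!isServerHCFSCompatible || !isGetFileBlockLocationsException(e)) {
       throw e;
     }
     isServerHCFSCompatible = false; // cache the verdict for subsequent calls
     locations = getFileBlockLocations(GetOpParam.Op.GET_BLOCK_LOCATIONS, p, offset, length);
   }
   return locations;
   ```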

##
File path: 
hadoop-hdfs-project/hadoop-hdfs-httpfs/src/test/java/org/apache/hadoop/fs/http/server/TestHttpFSServer.java
##
@@ -2002,4 +2003,38 @@ public void testContentType() throws Exception {
 () -> HttpFSUtils.jsonParse(conn));
 conn.disconnect();
   }
+
+  @Test
+  @TestDir
+  @TestJetty
+  @TestHdfs
+  public void testGetFileBlockLocations() throws Exception {
+createHttpFSServer(false, false);
+// Create a test directory
+String pathStr = "/tmp/tmp-snap-diff-test";
+createDirWithHttp(pathStr, "700", null);
+
+Path path = new Path(pathStr);
+DistributedFileSystem dfs = (DistributedFileSystem) FileSystem
+.get(path.toUri(), TestHdfsHelper.getHdfsConf());
+// Enable snapshot
+dfs.allowSnapshot(path);
+Assert.assertTrue(dfs.getFileStatus(path).isSnapshotEnabled());
+// Create a file
+String file1 = pathStr + "/file1";
+createWithHttp(file1, null);
+HttpURLConnection conn = sendRequestToHttpFSServer(file1,
+"GETFILEBLOCKLOCATIONS", "length=10");
+Assert.assertEquals(HttpURLConnection.HTTP_OK, conn.getResponseCode());
+BlockLocation[] locations1 =
+dfs.getFileBlockLocations(new Path(file1), 0, 1);
+Assert.assertNotNull(locations1);
+
+HttpURLConnection conn1 = sendRequestToHttpFSServer(file1,
+"GET_BLOCK_LOCATIONS", "length=10");
+Assert.assertEquals(HttpURLConnection.HTTP_OK, conn1.getResponseCode());
+BlockLocation[] locations2 =
+dfs.getFileBlockLocations(new Path(file1), 0, 1);
+Assert.assertNotNull(locations2);
+  }

Review comment:
   Falling back from `GETFILEBLOCKLOCATIONS` to `GET_BLOCK_LOCATIONS` and 
caching the boolean flag is not tested. Maybe we need another unit test that 
assumes the new operation is not supported and verifies the fallback to the 
old one; a rough sketch follows.
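   Such a test might look like this, assuming the op-taking overload of
   `getFileBlockLocations` is visible to tests and a mini-cluster fixture is
   available (the class name, fixture helper, and exception text are
   illustrative, not from the patch):
   ```java
   import static org.mockito.ArgumentMatchers.any;
   import static org.mockito.ArgumentMatchers.anyLong;
   import static org.mockito.ArgumentMatchers.eq;
   import static org.mockito.Mockito.doThrow;
   import static org.mockito.Mockito.spy;
   import static org.mockito.Mockito.times;
   import static org.mockito.Mockito.verify;

   import org.apache.hadoop.fs.BlockLocation;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.hdfs.web.WebHdfsFileSystem;
   import org.apache.hadoop.hdfs.web.resources.GetOpParam;
   import org.apache.hadoop.ipc.RemoteException;
   import org.junit.Assert;
   import org.junit.Test;

   public class TestWebHdfsBlockLocationsFallback {
     @Test
     public void testFallsBackToOldOpAndCachesFlag() throws Exception {
       // Assumed fixture: a WebHdfsFileSystem backed by a running mini cluster.
       WebHdfsFileSystem webhdfs = spy(createTestFileSystem());

       // Simulate a server that rejects the new op.
       doThrow(new RemoteException(UnsupportedOperationException.class.getName(),
           "GETFILEBLOCKLOCATIONS is not supported"))
           .when(webhdfs).getFileBlockLocations(
               eq(GetOpParam.Op.GETFILEBLOCKLOCATIONS),
               any(Path.class), anyLong(), anyLong());

       Path file = new Path("/test/file1");
       BlockLocation[] locations = webhdfs.getFileBlockLocations(file, 0, 1);
       Assert.assertNotNull(locations); // served via GET_BLOCK_LOCATIONS

       // The second call should skip the new op entirely: the flag is cached.
       webhdfs.getFileBlockLocations(file, 0, 1);
       verify(webhdfs, times(1)).getFileBlockLocations(
           eq(GetOpParam.Op.GETFILEBLOCKLOCATIONS),
           any(Path.class), anyLong(), anyLong());
     }

     private WebHdfsFileSystem createTestFileSystem() {
       // Mini-cluster setup elided in this sketch.
       throw new UnsupportedOperationException("fixture elided");
     }
   }
   ```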




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640761)
Time Spent: 0.5h  (was: 20m)

> Add GETFILEBLOCKLOCATIONS operation to HttpFS
> -
>
> Key: HDFS-6874
> URL: https://issues.apache.org/jira/browse/HDFS-6874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs
>Affects Versions: 2.4.1, 2.7.3
>Reporter: Gao Zhong Liang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: BB2015-05-TBR, pull-request-available
> Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, 
> HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, 
> HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, 
> HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, 
> HDFS-6874.10.patch, HDFS-6874.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> 

[jira] [Commented] (HDFS-12188) TestDecommissioningStatus#testDecommissionStatus fails intermittently

2021-08-23 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403275#comment-17403275
 ] 

Ahmed Hussein commented on HDFS-12188:
--

Thanks [~vjasani]. I marked this issue as fixed by HDFS-16171.


> TestDecommissioningStatus#testDecommissionStatus fails intermittently
> -
>
> Key: HDFS-12188
> URL: https://issues.apache.org/jira/browse/HDFS-12188
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Brahma Reddy Battula
>Assignee: Ajay Kumar
>Priority: Major
>  Labels: pull-request-available
> Attachments: TestFailure_Log.txt
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {noformat}
> java.lang.AssertionError: Unexpected num under-replicated blocks expected:<3> 
> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.checkDecommissionStatus(TestDecommissioningStatus.java:144)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:240)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-12188) TestDecommissioningStatus#testDecommissionStatus fails intermittently

2021-08-23 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved HDFS-12188.
--
Resolution: Fixed

> TestDecommissioningStatus#testDecommissionStatus fails intermittently
> -
>
> Key: HDFS-12188
> URL: https://issues.apache.org/jira/browse/HDFS-12188
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Brahma Reddy Battula
>Assignee: Ajay Kumar
>Priority: Major
>  Labels: pull-request-available
> Attachments: TestFailure_Log.txt
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {noformat}
> java.lang.AssertionError: Unexpected num under-replicated blocks expected:<3> 
> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.checkDecommissionStatus(TestDecommissioningStatus.java:144)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:240)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-6874?focusedWorklogId=640739=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640739
 ]

ASF GitHub Bot logged work on HDFS-6874:


Author: ASF GitHub Bot
Created on: 23/Aug/21 15:24
Start Date: 23/Aug/21 15:24
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3322:
URL: https://github.com/apache/hadoop/pull/3322#issuecomment-903872705


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 42s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  20m 10s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   4m 52s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   4m 38s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m 15s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m  1s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   2m 39s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   6m 23s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  14m  3s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 28s |  |  Maven dependency ordering for patch  |
   | -1 :x: |  mvninstall  |   0m 20s | 
[/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt)
 |  hadoop-hdfs-httpfs in the patch failed.  |
   | -1 :x: |  compile  |   4m 17s | 
[/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt)
 |  hadoop-hdfs-project in the patch failed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.  |
   | -1 :x: |  javac  |   4m 17s | 
[/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt)
 |  hadoop-hdfs-project in the patch failed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.  |
   | -1 :x: |  compile  |   4m  4s | 
[/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt)
 |  hadoop-hdfs-project in the patch failed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.  |
   | -1 :x: |  javac  |   4m  4s | 
[/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt)
 |  hadoop-hdfs-project in the patch failed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m  7s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 4 new + 462 unchanged - 1 fixed = 
466 total (was 463)  |
   | -1 :x: |  mvnsite  |   0m 22s | 
[/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt)
 |  hadoop-hdfs-httpfs in the patch failed.  |
   | +1 :green_heart: |  javadoc  |   1m 44s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 

[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640730&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640730
 ]

ASF GitHub Bot logged work on HDFS-16182:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 15:01
Start Date: 23/Aug/21 15:01
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3320:
URL: https://github.com/apache/hadoop/pull/3320#issuecomment-903852770


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m 18s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 36s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  1s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 28s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 18s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  19m 10s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 51s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3320/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 154 unchanged 
- 1 fixed = 155 total (was 155)  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 20s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  19m  3s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 356m  5s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3320/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 40s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 449m 41s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3320/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3320 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 5ca83a692447 4.15.0-143-generic #147-Ubuntu SMP Wed Apr 14 
16:10:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 74f75f9545382b06b07d18e9a657ecfb9ab115a9 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 

[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403180#comment-17403180
 ] 

Hadoop QA commented on HDFS-16182:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m  
6s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 
49s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
23s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
16s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 3s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
23s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 43s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
27s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 21m 
10s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  3m  
5s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
11s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
7s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 53s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/702/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt{color}
 | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 
154 unchanged - 1 fixed = 155 total (was 155) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 41s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
26s{color} | {color:green}{color} | 

[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640699&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640699
 ]

ASF GitHub Bot logged work on HDFS-16175:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 12:47
Start Date: 23/Aug/21 12:47
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on pull request #3307:
URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903732597


   Thanks @ayushtkn for the comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640699)
Time Spent: 4h  (was: 3h 50m)

> Improve the configurable value of Server #PURGE_INTERVAL_NANOS
> --
>
> Key: HDFS-16175
> URL: https://issues.apache.org/jira/browse/HDFS-16175
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ipc
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> In Server, Server#PURGE_INTERVAL_NANOS is a fixed value of 15 minutes.
> We can make Server#PURGE_INTERVAL_NANOS configurable,
> which will make RPC more flexible.
> private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert(
>   15, TimeUnit.MINUTES);
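
A minimal sketch of the configurable variant; the configuration key and default 
below are assumptions, not necessarily the names chosen by the patch:

{code:java}
// Sketch only: inside org.apache.hadoop.ipc.Server. The key name and default
// are assumptions, not the names chosen by the patch.
public static final String IPC_SERVER_PURGE_INTERVAL_MINUTES_KEY =
    "ipc.server.purge.interval.minutes";
public static final int IPC_SERVER_PURGE_INTERVAL_MINUTES_DEFAULT = 15;

private final long purgeIntervalNanos; // replaces the static final constant

protected Server(/* existing args elided, */ Configuration conf) {
  // Preserve the old 15-minute behavior as the default.
  this.purgeIntervalNanos = TimeUnit.NANOSECONDS.convert(
      conf.getInt(IPC_SERVER_PURGE_INTERVAL_MINUTES_KEY,
          IPC_SERVER_PURGE_INTERVAL_MINUTES_DEFAULT),
      TimeUnit.MINUTES);
}
{code}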



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations

2021-08-23 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403140#comment-17403140
 ] 

Bryan Beaudreault commented on HDFS-16155:
--

[~hexiaoqiao] any chance you could review this?

> Allow configurable exponential backoff in DFSInputStream refetchLocations
> -
>
> Key: HDFS-16155
> URL: https://issues.apache.org/jira/browse/HDFS-16155
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The retry policy in 
> [DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040]
>  was first written many years ago. It allows configuration of the base time 
> window, but subsequent retries double in an un-configurable way. This retry 
> strategy makes sense in some clusters as it's very conservative and will 
> avoid DDOSing the namenode in certain systemic failure modes – for example, 
> if a  file is being read by a large hadoop job and the underlying blocks are 
> moved by the balancer. In this case, enough datanodes would be added to the 
> deadNodes list and all hadoop tasks would simultaneously try to refetch the 
> blocks. The 3s doubling with random factor helps break up that stampeding 
> herd.
> However, not all cluster use-cases are created equal, so there are other 
> cases where a more aggressive initial backoff is preferred. For example in a 
> low-latency single reader scenario. In this case, if the balancer moves 
> enough blocks, the reader hits this 3s backoff which is way too long for a 
> low latency use-case.
> One could configure the the window very low (10ms), but then you can hit 
> other systemic failure modes which would result in readers DDOSing the 
> namenode again. For example, if blocks went missing due to truly dead 
> datanodes. In this case, many readers might be refetching locations for 
> different files with retry backoffs like 10ms, 20ms, 40ms, etc. It takes a 
> while to backoff enough to avoid impacting the namenode with that strategy.
> I suggest adding a configurable multiplier to the backoff strategy so that 
> operators can tune this as they see fit for their use-case. In the above low 
> latency case, one could set the base very low (say 2ms) and the multiplier 
> very high (say 50). This gives an aggressive first retry that very quickly 
> backs off.
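
As a sketch of the proposal: only the existing base window and the proposed 
multiplier come from the description above; the class and field names are 
illustrative assumptions:

{code:java}
import java.util.concurrent.ThreadLocalRandom;

/** Sketch only; names are illustrative, not from the patch. */
class RefetchBackoff {
  private final long baseWindowMs;  // existing knob: base retry window
  private final double multiplier;  // proposed knob: per-retry multiplier

  RefetchBackoff(long baseWindowMs, double multiplier) {
    this.baseWindowMs = baseWindowMs;
    this.multiplier = multiplier;
  }

  /** Wait before the (failures + 1)-th refetch; jitter breaks up stampedes. */
  long nextBackoffMs(int failures) {
    double backoff = baseWindowMs * Math.pow(multiplier, failures);
    double jitter = backoff * ThreadLocalRandom.current().nextDouble();
    return (long) (backoff + jitter);
  }
}
{code}

With baseWindowMs = 2 and multiplier = 50, the waits are roughly 2 ms, 100 ms, 
5 s, and so on: an aggressive first retry that backs off very quickly, matching 
the low-latency example in the description.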



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-6874:
-
Labels: BB2015-05-TBR pull-request-available  (was: BB2015-05-TBR)

> Add GETFILEBLOCKLOCATIONS operation to HttpFS
> -
>
> Key: HDFS-6874
> URL: https://issues.apache.org/jira/browse/HDFS-6874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs
>Affects Versions: 2.4.1, 2.7.3
>Reporter: Gao Zhong Liang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: BB2015-05-TBR, pull-request-available
> Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, 
> HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, 
> HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, 
> HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, 
> HDFS-6874.10.patch, HDFS-6874.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The GETFILEBLOCKLOCATIONS operation is missing in HttpFS, although it is 
> already supported in WebHDFS. For a GETFILEBLOCKLOCATIONS request, 
> org.apache.hadoop.fs.http.server.HttpFSServer currently returns BAD_REQUEST:
> ...
>  case GETFILEBLOCKLOCATIONS: {
> response = Response.status(Response.Status.BAD_REQUEST).build();
> break;
>   }
>  
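
A hypothetical sketch of what the handler could return instead of BAD_REQUEST; 
fs.getFileBlockLocations(...) is the public FileSystem API, while the parameter 
classes and the JSON helper below are assumptions:

{code:java}
case GETFILEBLOCKLOCATIONS: {
  // Assumed HttpFS parameter classes; the real patch may differ.
  long offset = params.get(OffsetParam.NAME, OffsetParam.class);
  long len = params.get(LenParam.NAME, LenParam.class);
  BlockLocation[] locations = fs.getFileBlockLocations(path, offset, len);
  String json = toJson(locations); // hypothetical serializer to the WebHDFS shape
  response = Response.ok(json, MediaType.APPLICATION_JSON).build();
  break;
}
{code}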



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-6874?focusedWorklogId=640645&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640645
 ]

ASF GitHub Bot logged work on HDFS-6874:


Author: ASF GitHub Bot
Created on: 23/Aug/21 09:33
Start Date: 23/Aug/21 09:33
Worklog Time Spent: 10m 
  Work Description: jojochuang opened a new pull request #3322:
URL: https://github.com/apache/hadoop/pull/3322


   ### Description of PR
   This is a rebase of the patch file 11 attached to HDFS-6874.
   
   The GETFILEBLOCKLOCATIONS operation is HCFS compatible. Adding support for it 
to HttpFS makes it possible for more applications to run directly against the 
HttpFS server.
   
   Add GETFILEBLOCKLOCATIONS op support to the httpfs server (HttpFSServer) and 
the same to the httpfs client (HttpFSFileSystem).
   Let the webhdfs client (WebHdfsFileSystem) try the new GETFILEBLOCKLOCATIONS 
op if the server supports it. Otherwise, fall back to the old 
GET_BLOCK_LOCATIONS op. The selection is cached, so the second invocation 
does not need to repeat the trial and error.
   
   ### How was this patch tested?
   Unit tests.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640645)
Remaining Estimate: 0h
Time Spent: 10m

> Add GETFILEBLOCKLOCATIONS operation to HttpFS
> -
>
> Key: HDFS-6874
> URL: https://issues.apache.org/jira/browse/HDFS-6874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs
>Affects Versions: 2.4.1, 2.7.3
>Reporter: Gao Zhong Liang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: BB2015-05-TBR
> Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, 
> HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, 
> HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, 
> HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, 
> HDFS-6874.10.patch, HDFS-6874.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The GETFILEBLOCKLOCATIONS operation is missing in HttpFS, although it is 
> already supported in WebHDFS. For a GETFILEBLOCKLOCATIONS request, 
> org.apache.hadoop.fs.http.server.HttpFSServer currently returns BAD_REQUEST:
> ...
>  case GETFILEBLOCKLOCATIONS: {
> response = Response.status(Response.Status.BAD_REQUEST).build();
> break;
>   }
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640643
 ]

ASF GitHub Bot logged work on HDFS-16175:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 09:25
Start Date: 23/Aug/21 09:25
Worklog Time Spent: 10m 
  Work Description: jianghuazhu edited a comment on pull request #3307:
URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903579635


   Some exceptions occurred here in Jenkins,
   but they seem to have nothing to do with the code I submitted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640643)
Time Spent: 3h 50m  (was: 3h 40m)

> Improve the configurable value of Server #PURGE_INTERVAL_NANOS
> --
>
> Key: HDFS-16175
> URL: https://issues.apache.org/jira/browse/HDFS-16175
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ipc
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> In Server, Server#PURGE_INTERVAL_NANOS is a fixed value of 15 minutes.
> We can make Server#PURGE_INTERVAL_NANOS configurable,
> which will make RPC more flexible.
> private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert(
>   15, TimeUnit.MINUTES);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS

2021-08-23 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403069#comment-17403069
 ] 

Wei-Chiu Chuang commented on HDFS-6874:
---

I rebased the patch and addressed Inigo's comments. Will raise a PR.

> Add GETFILEBLOCKLOCATIONS operation to HttpFS
> -
>
> Key: HDFS-6874
> URL: https://issues.apache.org/jira/browse/HDFS-6874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs
>Affects Versions: 2.4.1, 2.7.3
>Reporter: Gao Zhong Liang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: BB2015-05-TBR
> Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, 
> HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, 
> HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, 
> HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, 
> HDFS-6874.10.patch, HDFS-6874.patch
>
>
> The GETFILEBLOCKLOCATIONS operation is missing in HttpFS, although it is 
> already supported in WebHDFS. For a GETFILEBLOCKLOCATIONS request, 
> org.apache.hadoop.fs.http.server.HttpFSServer currently returns BAD_REQUEST:
> ...
>  case GETFILEBLOCKLOCATIONS: {
> response = Response.status(Response.Status.BAD_REQUEST).build();
> break;
>   }
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640632
 ]

ASF GitHub Bot logged work on HDFS-16175:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 09:06
Start Date: 23/Aug/21 09:06
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on pull request #3307:
URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903580807


   @ayushtkn, can you help review the code?
   Thank you very much.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640632)
Time Spent: 3h 40m  (was: 3.5h)

> Improve the configurable value of Server #PURGE_INTERVAL_NANOS
> --
>
> Key: HDFS-16175
> URL: https://issues.apache.org/jira/browse/HDFS-16175
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ipc
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> In Server, Server#PURGE_INTERVAL_NANOS is a fixed value of 15 minutes.
> We can make Server#PURGE_INTERVAL_NANOS configurable,
> which will make RPC more flexible.
> private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert(
>   15, TimeUnit.MINUTES);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640631&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640631
 ]

ASF GitHub Bot logged work on HDFS-16175:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 09:04
Start Date: 23/Aug/21 09:04
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on pull request #3307:
URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903579635


   Some anomalies happened here in Jenkins,
   but they do not seem to be related to the code I submitted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640631)
Time Spent: 3.5h  (was: 3h 20m)

> Improve the configurable value of Server #PURGE_INTERVAL_NANOS
> --
>
> Key: HDFS-16175
> URL: https://issues.apache.org/jira/browse/HDFS-16175
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ipc
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> In Server, Server#PURGE_INTERVAL_NANOS is a fixed value of 15 minutes.
> We can make Server#PURGE_INTERVAL_NANOS configurable,
> which will make RPC more flexible.
> private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert(
>   15, TimeUnit.MINUTES);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640627
 ]

ASF GitHub Bot logged work on HDFS-16143:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 09:00
Start Date: 23/Aug/21 09:00
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on a change in pull request #3235:
URL: https://github.com/apache/hadoop/pull/3235#discussion_r693790700



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
##
@@ -429,19 +432,29 @@ public void 
testStandbyTriggersLogRollsWhenTailInProgressEdits()
   waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId,
   standbyCatchupWaitTime);
 
+  long curTime = standby.getNamesystem().getEditLogTailer().getTimer()
+  .monotonicNow();
+  long inSufficientTimeForLogRoll = logRollPeriodMs / 3;
+  final FakeTimer testTimer =
+  new FakeTimer(curTime + inSufficientTimeForLogRoll);
+  standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer);
+  Thread.sleep(2000);
+
   for (int i = DIRS_TO_MAKE / 2; i < DIRS_TO_MAKE; i++) {
 NameNodeAdapter.mkdirs(active, getDirPath(i),
 new PermissionStatus("test", "test",
 new FsPermission((short)00755)), true);
   }
 
-  boolean exceptionThrown = false;
   try {
 checkForLogRoll(active, origTxId, noLogRollWaitTime);
+fail("Expected to timeout");
   } catch (TimeoutException e) {
-exceptionThrown = true;
+// expected
   }
-  assertTrue(exceptionThrown);
+
+  long sufficientTimeForLogRoll = logRollPeriodMs * 3;

Review comment:
   I understood. Thanks for your detailed explanation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640627)
Time Spent: 10h  (was: 9h 50m)

> TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
> -
>
> Key: HDFS-16143
> URL: https://issues.apache.org/jira/browse/HDFS-16143
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
> {quote}
> [ERROR] 
> testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer)
>   Time elapsed: 6.862 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640620
 ]

ASF GitHub Bot logged work on HDFS-16143:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 08:46
Start Date: 23/Aug/21 08:46
Worklog Time Spent: 10m 
  Work Description: virajjasani commented on a change in pull request #3235:
URL: https://github.com/apache/hadoop/pull/3235#discussion_r693780174



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
##
@@ -429,19 +432,29 @@ public void 
testStandbyTriggersLogRollsWhenTailInProgressEdits()
   waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId,
   standbyCatchupWaitTime);
 
+  long curTime = standby.getNamesystem().getEditLogTailer().getTimer()
+  .monotonicNow();
+  long inSufficientTimeForLogRoll = logRollPeriodMs / 3;
+  final FakeTimer testTimer =
+  new FakeTimer(curTime + inSufficientTimeForLogRoll);
+  standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer);

Review comment:
   Nice idea, I think we can target this as follow-up work. Similar to 
EditLogTailer, we should introduce a `Timer` instance so that production code 
keeps using Timer's default `now`, `monotonicNow`, etc. utilities, while tests 
get a way to inject a FakeTimer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640620)
Time Spent: 9h 50m  (was: 9h 40m)

> TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
> -
>
> Key: HDFS-16143
> URL: https://issues.apache.org/jira/browse/HDFS-16143
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
> {quote}
> [ERROR] 
> testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer)
>   Time elapsed: 6.862 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640619&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640619
 ]

ASF GitHub Bot logged work on HDFS-16143:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 08:42
Start Date: 23/Aug/21 08:42
Worklog Time Spent: 10m 
  Work Description: virajjasani commented on a change in pull request #3235:
URL: https://github.com/apache/hadoop/pull/3235#discussion_r693776368



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
##
@@ -429,19 +432,29 @@ public void 
testStandbyTriggersLogRollsWhenTailInProgressEdits()
   waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId,
   standbyCatchupWaitTime);
 
+  long curTime = standby.getNamesystem().getEditLogTailer().getTimer()
+  .monotonicNow();
+  long inSufficientTimeForLogRoll = logRollPeriodMs / 3;
+  final FakeTimer testTimer =
+  new FakeTimer(curTime + inSufficientTimeForLogRoll);
+  standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer);
+  Thread.sleep(2000);
+
   for (int i = DIRS_TO_MAKE / 2; i < DIRS_TO_MAKE; i++) {
 NameNodeAdapter.mkdirs(active, getDirPath(i),
 new PermissionStatus("test", "test",
 new FsPermission((short)00755)), true);
   }
 
-  boolean exceptionThrown = false;
   try {
 checkForLogRoll(active, origTxId, noLogRollWaitTime);
+fail("Expected to timeout");
   } catch (TimeoutException e) {
-exceptionThrown = true;
+// expected
   }
-  assertTrue(exceptionThrown);
+
+  long sufficientTimeForLogRoll = logRollPeriodMs * 3;

Review comment:
   We multiply by 3 to advance timer.monotonicNow() by `logRollPeriodMs * 
3`, which would be `15` seconds here; that is quite sufficient for a log roll 
as per this check in EditLogTailer: 
   
   ```
 /**
  * @return true if the configured log roll period has elapsed.
  */
 private boolean tooLongSinceLastLoad() {
   return logRollPeriodMs >= 0 && 
 (timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs;
 }
   ```
   
   With `logRollPeriodMs / 3` worth of duration, `tooLongSinceLastLoad()` 
returns false whereas with `logRollPeriodMs * 3` duration, 
`tooLongSinceLastLoad()` will return true.
   e.g.
   logRollPeriodMs = 5 sec;
   With logRollPeriodMs/3, timer.monotonicNow() = lastRollTimeMs + 5/3 = 
lastRollTimeMs + 1;
   So, timer.monotonicNow() - lastRollTimeMs = 1;
   And hence, `(timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs` is 
false (1<5).
   
   Now with `logRollPeriodMs*3`, timer.monotonicNow() = lastRollTimeMs + 5*3 = 
lastRollTimeMs + 15;
   So, timer.monotonicNow() - lastRollTimeMs = 15;
   And hence, `(timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs` is 
true (15>5).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640619)
Time Spent: 9h 40m  (was: 9.5h)

> TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
> -
>
> Key: HDFS-16143
> URL: https://issues.apache.org/jira/browse/HDFS-16143
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
>
>  Time Spent: 9h 40m
>  Remaining Estimate: 0h
>
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
> {quote}
> [ERROR] 
> testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer)
>   Time elapsed: 6.862 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640618&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640618
 ]

ASF GitHub Bot logged work on HDFS-16143:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 08:41
Start Date: 23/Aug/21 08:41
Worklog Time Spent: 10m 
  Work Description: virajjasani commented on a change in pull request #3235:
URL: https://github.com/apache/hadoop/pull/3235#discussion_r693776368



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
##
@@ -429,19 +432,29 @@ public void 
testStandbyTriggersLogRollsWhenTailInProgressEdits()
   waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId,
   standbyCatchupWaitTime);
 
+  long curTime = standby.getNamesystem().getEditLogTailer().getTimer()
+  .monotonicNow();
+  long inSufficientTimeForLogRoll = logRollPeriodMs / 3;
+  final FakeTimer testTimer =
+  new FakeTimer(curTime + inSufficientTimeForLogRoll);
+  standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer);
+  Thread.sleep(2000);
+
   for (int i = DIRS_TO_MAKE / 2; i < DIRS_TO_MAKE; i++) {
 NameNodeAdapter.mkdirs(active, getDirPath(i),
 new PermissionStatus("test", "test",
 new FsPermission((short)00755)), true);
   }
 
-  boolean exceptionThrown = false;
   try {
 checkForLogRoll(active, origTxId, noLogRollWaitTime);
+fail("Expected to timeout");
   } catch (TimeoutException e) {
-exceptionThrown = true;
+// expected
   }
-  assertTrue(exceptionThrown);
+
+  long sufficientTimeForLogRoll = logRollPeriodMs * 3;

Review comment:
   We multiply by 3 to advance timer.monotonicNow() by `logRollPeriodMs * 
3`, which would be `15` seconds here; that is quite sufficient for a log roll 
as per this check in EditLogTailer: 
   
   ```
 /**
  * @return true if the configured log roll period has elapsed.
  */
 private boolean tooLongSinceLastLoad() {
   return logRollPeriodMs >= 0 && 
 (timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs;
 }
   ```
   
   With `logRollPeriodMs / 3` worth of duration, `tooLongSinceLastLoad()` 
returns false whereas with `logRollPeriodMs * 3` duration, 
`tooLongSinceLastLoad()` will return true.
   e.g.
   logRollPeriodMs = 5 sec;
   With logRollPeriodMs/3, timer.monotonicNow() = lastRollTimeMs + 5/3 = 
lastRollTimeMs + 1;
   So, timer.monotonicNow() - lastRollTimeMs = 1;
   And hence, `(timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs` is 
false (1<5).
   
   Now with logRollPeriodMs*3, timer.monotonicNow() = lastRollTimeMs + 5*3 = 
lastRollTimeMs + 15;
   So, timer.monotonicNow() - lastRollTimeMs = 15;
   And hence, `(timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs` is 
true (15>5).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640618)
Time Spent: 9.5h  (was: 9h 20m)

> TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
> -
>
> Key: HDFS-16143
> URL: https://issues.apache.org/jira/browse/HDFS-16143
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
> {quote}
> [ERROR] 
> testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer)
>   Time elapsed: 6.862 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640613&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640613
 ]

ASF GitHub Bot logged work on HDFS-16143:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 08:29
Start Date: 23/Aug/21 08:29
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on a change in pull request #3235:
URL: https://github.com/apache/hadoop/pull/3235#discussion_r693765302



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
##
@@ -429,19 +432,29 @@ public void 
testStandbyTriggersLogRollsWhenTailInProgressEdits()
   waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId,
   standbyCatchupWaitTime);
 
+  long curTime = standby.getNamesystem().getEditLogTailer().getTimer()
+  .monotonicNow();
+  long inSufficientTimeForLogRoll = logRollPeriodMs / 3;
+  final FakeTimer testTimer =
+  new FakeTimer(curTime + inSufficientTimeForLogRoll);
+  standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer);

Review comment:
   Just a thought: it would be great if we could refactor MiniDFSCluster, 
NameNode, FSNamesystem, and EditLogTailer so that they take a FakeTimer as a 
parameter during initialization. If all the tests adopted the FakeTimer 
approach, we wouldn't have so many flaky tests. But I reckon that's out of 
scope for this change.
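
   A rough sketch of that injection idea, with hypothetical constructor 
signatures (the real classes do not take a timer parameter at construction 
today; only Timer/FakeTimer are existing utility classes):

```java
import org.apache.hadoop.util.FakeTimer;
import org.apache.hadoop.util.Timer;

// Hypothetical component that receives its Timer at construction time, so a
// test can pass a FakeTimer instead of patching the timer in afterwards.
class TailerWithInjectedTimer {
  private final Timer timer;

  TailerWithInjectedTimer(Timer timer) {
    this.timer = timer;
  }

  boolean tooLongSinceLastLoad(long lastRollTimeMs, long logRollPeriodMs) {
    return timer.monotonicNow() - lastRollTimeMs > logRollPeriodMs;
  }
}
```

   In a test one would construct it with `new FakeTimer()` and advance the 
fake clock deterministically instead of sleeping.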




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640613)
Time Spent: 9h 20m  (was: 9h 10m)

> TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
> -
>
> Key: HDFS-16143
> URL: https://issues.apache.org/jira/browse/HDFS-16143
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
> {quote}
> [ERROR] 
> testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer)
>   Time elapsed: 6.862 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640612=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640612
 ]

ASF GitHub Bot logged work on HDFS-16143:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 08:28
Start Date: 23/Aug/21 08:28
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on a change in pull request #3235:
URL: https://github.com/apache/hadoop/pull/3235#discussion_r693746433



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
##
@@ -429,19 +432,29 @@ public void 
testStandbyTriggersLogRollsWhenTailInProgressEdits()
   waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId,
       standbyCatchupWaitTime);
 
+  long curTime = standby.getNamesystem().getEditLogTailer().getTimer()
+      .monotonicNow();
+  long inSufficientTimeForLogRoll = logRollPeriodMs / 3;

Review comment:
   I feel `inSufficient` reads as `in sufficient`. I prefer 
`insufficientTimeForLogRoll` to `inSufficientTimeForLogRoll`.

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
##
@@ -429,19 +432,29 @@ public void 
testStandbyTriggersLogRollsWhenTailInProgressEdits()
   waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId,
       standbyCatchupWaitTime);
 
+  long curTime = standby.getNamesystem().getEditLogTailer().getTimer()
+      .monotonicNow();
+  long inSufficientTimeForLogRoll = logRollPeriodMs / 3;
+  final FakeTimer testTimer =
+      new FakeTimer(curTime + inSufficientTimeForLogRoll);
+  standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer);
+  Thread.sleep(2000);
+
   for (int i = DIRS_TO_MAKE / 2; i < DIRS_TO_MAKE; i++) {
     NameNodeAdapter.mkdirs(active, getDirPath(i),
         new PermissionStatus("test", "test",
             new FsPermission((short) 00755)), true);
   }
 
-  boolean exceptionThrown = false;
   try {
     checkForLogRoll(active, origTxId, noLogRollWaitTime);
+    fail("Expected to timeout");
   } catch (TimeoutException e) {
-    exceptionThrown = true;
+    // expected
   }
-  assertTrue(exceptionThrown);
+
+  long sufficientTimeForLogRoll = logRollPeriodMs * 3;

Review comment:
   Why do we multiply by 3 here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640612)
Time Spent: 9h 10m  (was: 9h)

> TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
> -
>
> Key: HDFS-16143
> URL: https://issues.apache.org/jira/browse/HDFS-16143
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
> {quote}
> [ERROR] 
> testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer)
>   Time elapsed: 6.862 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16180) FsVolumeImpl.nextBlock should consider that the block meta file has been deleted.

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16180?focusedWorklogId=640607=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640607
 ]

ASF GitHub Bot logged work on HDFS-16180:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 08:03
Start Date: 23/Aug/21 08:03
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on a change in pull request #3315:
URL: https://github.com/apache/hadoop/pull/3315#discussion_r693287224



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java
##
@@ -865,7 +866,15 @@ public ExtendedBlock nextBlock() throws IOException {
   }
 
   File blkFile = getBlockFile(bpid, block);
-  File metaFile = FsDatasetUtil.findMetaFile(blkFile);
+  File metaFile;
+  try {
+    metaFile = FsDatasetUtil.findMetaFile(blkFile);
+  } catch (FileNotFoundException e) {
+    LOG.warn("nextBlock({}, {}): {}", storageID, bpid,

Review comment:
   Can you make the log message more explicit? Something like "Metadata file 
for block file is missing. Skip it".
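
   For context, a self-contained sketch of the skip-on-missing-meta pattern 
being discussed (the loop framing, the stubbed findMetaFile, and the 
`continue` are illustrative assumptions, not the merged patch):

```java
import java.io.File;
import java.io.FileNotFoundException;

class SkipMissingMetaSketch {
  // Stand-in for FsDatasetUtil.findMetaFile: always fails here, to simulate
  // a block whose meta file was deleted between listing and lookup.
  static File findMetaFile(File blkFile) throws FileNotFoundException {
    throw new FileNotFoundException("Meta file not found, blockFile=" + blkFile);
  }

  public static void main(String[] args) {
    File[] blockFiles = {new File("blk_1142977964"), new File("blk_1142977965")};
    for (File blkFile : blockFiles) {
      File metaFile;
      try {
        metaFile = findMetaFile(blkFile);
      } catch (FileNotFoundException e) {
        // Log and move on rather than aborting the whole iteration.
        System.out.println("Metadata file for block file " + blkFile
            + " is missing. Skip it.");
        continue;
      }
      System.out.println("scanning " + blkFile + " with " + metaFile);
    }
  }
}
```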




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640607)
Time Spent: 1h 20m  (was: 1h 10m)

> FsVolumeImpl.nextBlock should consider that the block meta file has been 
> deleted.
> -
>
> Key: HDFS-16180
> URL: https://issues.apache.org/jira/browse/HDFS-16180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In my cluster, we found that when the VolumeScanner runs, the DN sometimes 
> logs errors like the following
> ```
>  
> 2021-08-19 08:00:11,549 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1020175758-nnip-1597745872895 blk_1142977964_69237147 URI 
> file:/disk1/dfs/data/current/BP-1020175758- 
> nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964
> 2021-08-19 08:00:48,368 ERROR 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl: 
> nextBlock(DS-060c8e4c-1ef6-49f5-91ef-91957356891a, BP-1020175758- 
> nnip-1597745872895): I/O error
> java.io.IOException: Meta file not found, 
> blockFile=/disk1/dfs/data/current/BP-1020175758- 
> nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetUtil.findMetaFile(FsDatasetUtil.java:101)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.nextBlock(FsVolumeImpl.java:809)
> at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:528)
> at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:628)
> 2021-08-19 08:00:48,368 WARN 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner: 
> VolumeScanner(/disk1/dfs/data, DS-060c8e4c-1ef6-49f5-91ef-91957356891a): 
> nextBlock error on 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@7febc6b4
> ```
> When the VolumeScanner scanned block blk_1142977964, the block had already 
> been deleted by the datanode, so the scanner could not find its meta file 
> and logged these errors.
>  
> Maybe we should handle FileNotFoundException during nextBlock to reduce 
> error logging and nextBlock retries.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403009#comment-17403009
 ] 

Max  Xie commented on HDFS-16182:
-

[~weichiu] [~sodonnell] Any thoughts about this? Thanks for the reviews.

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  
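
A toy model of the client-side invariant quoted above, with datanodes stubbed 
as strings (purely illustrative; only the length check mirrors 
DataStreamer$findNewDatanode):

```java
import java.io.IOException;
import java.util.Arrays;

class FindNewDatanodeSketch {
  // Exactly one extra node is expected after a pipeline-recovery request,
  // so any other count is treated as a failure, as in the stack trace above.
  static void check(String[] original, String[] nodes) throws IOException {
    if (nodes.length != original.length + 1) {
      throw new IOException("Failed to replace a bad datanode ... current="
          + Arrays.asList(nodes) + ", original=" + Arrays.asList(original));
    }
  }

  public static void main(String[] args) throws IOException {
    String[] original = {"dn01:DISK", "dn02:DISK"};
    // Namenode returns one additional node: the check passes.
    check(original, new String[] {"dn01:DISK", "dn02:DISK", "dn03:DISK"});
    // Namenode returns three additional SSD nodes, as in the report: throws.
    check(original, new String[] {"dn01:DISK", "dn02:DISK",
        "dn03:SSD", "dn04:SSD", "dn05:SSD"});
  }
}
```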



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max  Xie updated HDFS-16182:

Attachment: HDFS-16182.patch
Status: Patch Available  (was: In Progress)

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max  Xie updated HDFS-16182:

Attachment: (was: HDFS-16182.patch)

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403006#comment-17403006
 ] 

Max  Xie commented on HDFS-16182:
-

In my cluster, we use BlockPlacementPolicyDefault to choose DNs, and the number 
of SSD DNs is much smaller than the number of DISK DNs. This can cause some 
blocks that should be placed on SSD DNs to fall back to DISK DNs.

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max  Xie updated HDFS-16182:

Attachment: HDFS-16182.patch

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-16182.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max  Xie reassigned HDFS-16182:
---

Assignee: Max  Xie

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16182 started by Max  Xie.
---
> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16182:
--
Labels: pull-request-available  (was: )

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640600=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640600
 ]

ASF GitHub Bot logged work on HDFS-16182:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 07:30
Start Date: 23/Aug/21 07:30
Worklog Time Spent: 10m 
  Work Description: Neilxzn opened a new pull request #3320:
URL: https://github.com/apache/hadoop/pull/3320


   
   
   ### Description of PR
   https://issues.apache.org/jira/browse/HDFS-16182
   
   ### How was this patch tested?
   add  TestBlockStoragePolicy.testAddDatanode2ExistingPipelineInSsd
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640600)
Remaining Estimate: 0h
Time Spent: 10m

> numOfReplicas is given the wrong value in  
> BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
> Heterogeneous Storage  
> ---
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
> better performance. Sometimes, when the HDFS client transfers data in a 
> pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
>  
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
>  
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
>  
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
>  
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
>  
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
> ```
> After investigating, I found that when an existing pipeline needs a new DN 
> to replace a failed one, the client gets one additional DN from the namenode 
> and checks that the number of DNs is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>  throw new IOException(
>  "Failed to replace a bad datanode on the existing pipeline "
>  + "due to no more good datanodes being available to try. "
>  + "(Nodes: current=" + Arrays.asList(nodes)
>  + ", original=" + Arrays.asList(original) + "). "
>  + "The current failed datanode replacement policy is "
>  + dfsClient.dtpReplaceDatanodeOnFailure
>  + ", and a client may configure this via '"
>  + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>  + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple 
> datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.
>  
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
> numOfReplicas should not be assigned by requiredStorageTypes.
>  
>    
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Created] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage

2021-08-23 Thread Max Xie (Jira)
Max  Xie created HDFS-16182:
---

 Summary: numOfReplicas is given the wrong value in  
BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with 
Heterogeneous Storage  
 Key: HDFS-16182
 URL: https://issues.apache.org/jira/browse/HDFS-16182
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.4.0
Reporter: Max  Xie


In our HDFS cluster, we use heterogeneous storage to store data on SSD for 
better performance. Sometimes, when the HDFS client transfers data in a 
pipeline, it throws an IOException and exits. Exception logs are below:

```
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
 
DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
 
DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
 
DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
 
DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
 
original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
 
DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
```

After investigating, I found that when an existing pipeline needs a new DN to 
replace a failed one, the client gets one additional DN from the namenode and 
checks that the number of DNs is the original number + 1.

```

## DataStreamer$findNewDatanode

if (nodes.length != original.length + 1) {
 throw new IOException(
 "Failed to replace a bad datanode on the existing pipeline "
 + "due to no more good datanodes being available to try. "
 + "(Nodes: current=" + Arrays.asList(nodes)
 + ", original=" + Arrays.asList(original) + "). "
 + "The current failed datanode replacement policy is "
 + dfsClient.dtpReplaceDatanodeOnFailure
 + ", and a client may configure this via '"
 + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
 + "' in its configuration.");
}

```

The root cause is that Namenode$getAdditionalDatanode returns multiple 
datanodes, not just one, in DataStreamer.addDatanode2ExistingPipeline.

 

Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget.  I think 
numOfReplicas should not be assigned by requiredStorageTypes.

 

   

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16181) [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when tailEditLog from JN

2021-08-23 Thread wangzhaohui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhaohui updated HDFS-16181:
---
Summary: [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display 
when tailEditLog from JN  (was: [SBN] Fix metric of RpcRequestCacheMissAmount 
can't display when tailEditLog from JN)

> [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when 
> tailEditLog from JN
> -
>
> Key: HDFS-16181
> URL: https://issues.apache.org/jira/browse/HDFS-16181
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Critical
>  Labels: pull-request-available
> Attachments: after.jpg, before.jpg
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I found that the JN has the edit cache turned on, but the metric 
> rpcRequestCacheMissAmount is not displayed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16179) Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too many logs

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16179?focusedWorklogId=640591=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640591
 ]

ASF GitHub Bot logged work on HDFS-16179:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 06:26
Start Date: 23/Aug/21 06:26
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3313:
URL: https://github.com/apache/hadoop/pull/3313#issuecomment-903480361


   > > @tomscut Hi, is this WARN log printed only when writing EC files? 
These WARN logs also appeared in our cluster without writing any files, but not 
as many as you said.
   > > I found that the block in the WARN log belongs to a file written a 
long time ago. So, I have some guesses:
   > > 
   > > * is there a daemon thread calling this method?
   > > * or other conditions trigger this method?
   > > 
   > > Here is our 3-hour running log.
   > > 
![image](https://user-images.githubusercontent.com/18388154/130396631-14db5ce7-0e35-442d-b0d8-f38486ab5496.png)
   > 
   > Thanks @whbing for your comments. I found those logs were printed after 
completeFile, triggered by FSDirWriteFileOp#completeFileInternal().
   > 
   > ```
   >   private static boolean completeFileInternal(
   >   FSNamesystem fsn, INodesInPath iip,
   >   String holder, Block last, long fileId)
   >   throws IOException {
   > (...)
   > fsn.finalizeINodeFileUnderConstruction(src, pendingFile,
   > Snapshot.CURRENT_STATE_ID, true);
   > (...)
   > return true;
   >   }
   > ```
   > 
   > 
![image](https://user-images.githubusercontent.com/55134131/130398711-1fa0d1dc-8c46-4f8f-b7f1-2459dca3c5c4.png)
   
   
   
   > @tomscut Hi, is this WARN log printed only when writing EC files? 
These WARN logs also appeared in our cluster without writing any files, but not 
as many as you said.
   > I found that the block in the WARN log belongs to a file written a long 
time ago. So, I have some guesses:
   > 
   > * is there a daemon thread calling this method?
   > * or other conditions trigger this method?
   > 
   > Here is our 3-hour running log.
   > 
![image](https://user-images.githubusercontent.com/18388154/130396631-14db5ce7-0e35-442d-b0d8-f38486ab5496.png)
   
   Yes, those WARN logs were printed only when writing EC files, because the 
logs are printed in BlockManager#chooseExcessRedundancyStriped().


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640591)
Time Spent: 2h 50m  (was: 2h 40m)

> Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too 
> many logs
> -
>
> Key: HDFS-16179
> URL: https://issues.apache.org/jira/browse/HDFS-16179
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Attachments: log-count.jpg, logs.jpg
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> {code:java}
> private void chooseExcessRedundancyStriped(BlockCollection bc,
> final Collection<DatanodeStorageInfo> nonExcess,
> BlockInfo storedBlock,
> DatanodeDescriptor delNodeHint) {
>   ...
>   // cardinality of found indicates the expected number of internal blocks
>   final int numOfTarget = found.cardinality();
>   final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(
>   bc.getStoragePolicyID());
> final List<StorageType> excessTypes = storagePolicy.chooseExcess(
>   (short) numOfTarget, DatanodeStorageInfo.toStorageTypes(nonExcess));
>   if (excessTypes.isEmpty()) {
> LOG.warn("excess types chosen for block {} among storages {} is empty",
> storedBlock, nonExcess);
> return;
>   }
>   ...
> }
> {code}
>  
> IMO, this code is just detecting excess StorageTypes here, so lowering the 
> log level to debug has no adverse effect.
>  
> We have a cluster that uses the EC policy to store data. The current log 
> level here is WARN, and in about 50 minutes 286,093 such logs were printed, 
> which can drown out other important logs.
>  
> !logs.jpg|width=1167,height=62!
>  
> !log-count.jpg|width=760,height=30!
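
A minimal sketch of the change the summary implies, demoting the message from 
WARN to DEBUG (the wrapper class is illustrative; only the log call mirrors 
the snippet above):

```java
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ExcessTypesLogSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ExcessTypesLogSketch.class);

  static void reportEmptyExcess(List<?> excessTypes, Object storedBlock,
      Object nonExcess) {
    if (excessTypes.isEmpty()) {
      // An empty result is frequent and harmless when writing EC files, so
      // DEBUG keeps it out of production logs while staying available for
      // diagnosis.
      LOG.debug("excess types chosen for block {} among storages {} is empty",
          storedBlock, nonExcess);
    }
  }
}
```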



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16179) Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too many logs

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16179?focusedWorklogId=640590=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640590
 ]

ASF GitHub Bot logged work on HDFS-16179:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 06:23
Start Date: 23/Aug/21 06:23
Worklog Time Spent: 10m 
  Work Description: tomscut edited a comment on pull request #3313:
URL: https://github.com/apache/hadoop/pull/3313#issuecomment-903476259


   > @tomscut Hi, is this WARN log printed only when writing EC files? 
These WARN logs also appeared in our cluster without writing any files, but not 
as many as you said.
   > I found that the block in the WARN log belongs to a file written a long 
time ago. So, I have some guesses:
   > 
   > * is there a daemon thread calling this method?
   > * or other conditions trigger this method?
   > 
   > Here is our 3-hour running log.
   > 
![image](https://user-images.githubusercontent.com/18388154/130396631-14db5ce7-0e35-442d-b0d8-f38486ab5496.png)
   
   Thanks @whbing for your comments. I found those logs were printed after 
completeFile, triggered by FSDirWriteFileOp#completeFileInternal(). Our Hadoop 
version is 3.1.0.
   
   ```
 private static boolean completeFileInternal(
 FSNamesystem fsn, INodesInPath iip,
 String holder, Block last, long fileId)
 throws IOException {
   (...)
   fsn.finalizeINodeFileUnderConstruction(src, pendingFile,
   Snapshot.CURRENT_STATE_ID, true);
   (...)
   return true;
 }
   ```
   
   
![image](https://user-images.githubusercontent.com/55134131/130398711-1fa0d1dc-8c46-4f8f-b7f1-2459dca3c5c4.png)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640590)
Time Spent: 2h 40m  (was: 2.5h)

> Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too 
> many logs
> -
>
> Key: HDFS-16179
> URL: https://issues.apache.org/jira/browse/HDFS-16179
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Attachments: log-count.jpg, logs.jpg
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {code:java}
> private void chooseExcessRedundancyStriped(BlockCollection bc,
> final Collection<DatanodeStorageInfo> nonExcess,
> BlockInfo storedBlock,
> DatanodeDescriptor delNodeHint) {
>   ...
>   // cardinality of found indicates the expected number of internal blocks
>   final int numOfTarget = found.cardinality();
>   final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(
>   bc.getStoragePolicyID());
> final List<StorageType> excessTypes = storagePolicy.chooseExcess(
>   (short) numOfTarget, DatanodeStorageInfo.toStorageTypes(nonExcess));
>   if (excessTypes.isEmpty()) {
> LOG.warn("excess types chosen for block {} among storages {} is empty",
> storedBlock, nonExcess);
> return;
>   }
>   ...
> }
> {code}
>  
> IMO, this code is just detecting excess StorageTypes here, so lowering the 
> log level to debug has no adverse effect.
>  
> We have a cluster that uses the EC policy to store data. The current log 
> level here is WARN, and in about 50 minutes 286,093 such logs were printed, 
> which can drown out other important logs.
>  
> !logs.jpg|width=1167,height=62!
>  
> !log-count.jpg|width=760,height=30!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640589=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640589
 ]

ASF GitHub Bot logged work on HDFS-16175:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 06:21
Start Date: 23/Aug/21 06:21
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3307:
URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903477996


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  31m  5s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  22m 38s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  19m  3s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  2s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 35s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 34s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  15m 52s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 55s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m 23s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  22m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  19m 42s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |  19m 42s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m  3s | 
[/results-checkstyle-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3307/5/artifact/out/results-checkstyle-hadoop-common-project_hadoop-common.txt)
 |  hadoop-common-project/hadoop-common: The patch generated 1 new + 282 
unchanged - 0 fixed = 283 total (was 282)  |
   | +1 :green_heart: |  mvnsite  |   1m 36s |  |  the patch passed  |
   | +1 :green_heart: |  xml  |   0m  1s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  javadoc  |   1m  6s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 38s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 54s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 26s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  41m 59s | 
[/patch-unit-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3307/5/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt)
 |  hadoop-common in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m  1s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 210m  1s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.ha.TestZKFailoverControllerStress |
   |   | hadoop.ipc.TestRetryCache |
   |   | hadoop.ipc.TestCallQueueManager |
   |   | hadoop.metrics2.source.TestJvmMetrics |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3307/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3307 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell xml |
   | uname | Linux 1b8547543f35 4.15.0-151-generic #157-Ubuntu SMP Fri Jul 9 
23:07:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 

[jira] [Work logged] (HDFS-16179) Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too much logs

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16179?focusedWorklogId=640588=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640588
 ]

ASF GitHub Bot logged work on HDFS-16179:
-

Author: ASF GitHub Bot
Created on: 23/Aug/21 06:17
Start Date: 23/Aug/21 06:17
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3313:
URL: https://github.com/apache/hadoop/pull/3313#issuecomment-903476259


   > @tomscut Hi, is this WARN log printed only when writing EC files? These 
WARN logs also appeared in our cluster without writing any files, but not as 
many as you said.
   > I found that the block in the WARN log belongs to a file written a long 
time ago. So, I have some guesses:
   > 
   > * is there a daemon thread calling this method?
   > * or do other conditions trigger this method?
   > 
   > Here is our 3-hour running log.
   > 
![image](https://user-images.githubusercontent.com/18388154/130396631-14db5ce7-0e35-442d-b0d8-f38486ab5496.png)
   
   Thanks @whbing for your comments. I found those logs were printed after 
completeFile, triggered by FSDirWriteFileOp#completeFileInternal().
   
   ```java
   private static boolean completeFileInternal(
       FSNamesystem fsn, INodesInPath iip,
       String holder, Block last, long fileId)
       throws IOException {
     (...)
     fsn.finalizeINodeFileUnderConstruction(src, pendingFile,
         Snapshot.CURRENT_STATE_ID, true);
     (...)
     return true;
   }
   ```
   
   
![image](https://user-images.githubusercontent.com/55134131/130398711-1fa0d1dc-8c46-4f8f-b7f1-2459dca3c5c4.png)
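   
   A hedged client-side repro sketch (the directory, policy name, and payload 
size below are assumptions, not from this thread): closing a file written 
under an EC policy drives completeFile on the NameNode, which is the path 
traced above into chooseExcessRedundancyStriped.
   
   ```java
   // Repro sketch: write and close a small file under an EC policy, then
   // check the NameNode log for "excess types chosen for block ... is empty".
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class EcCompleteFileRepro {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       try (FileSystem fs = FileSystem.get(conf)) {
         // Assumes an EC policy was set on /ec-dir beforehand, e.g.:
         //   hdfs ec -setPolicy -path /ec-dir -policy RS-6-3-1024k
         Path file = new Path("/ec-dir/repro.bin");
         try (FSDataOutputStream out = fs.create(file)) {
           out.write(new byte[1024 * 1024]); // 1 MiB payload
         } // close() issues completeFile to the NameNode
       }
     }
   }
   ```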
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 640588)
Time Spent: 2.5h  (was: 2h 20m)

> Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too 
> much logs
> -
>
> Key: HDFS-16179
> URL: https://issues.apache.org/jira/browse/HDFS-16179
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Attachments: log-count.jpg, logs.jpg
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> {code:java}
> private void chooseExcessRedundancyStriped(BlockCollection bc,
>     final Collection<DatanodeStorageInfo> nonExcess,
>     BlockInfo storedBlock,
>     DatanodeDescriptor delNodeHint) {
>   ...
>   // cardinality of found indicates the expected number of internal blocks
>   final int numOfTarget = found.cardinality();
>   final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(
>       bc.getStoragePolicyID());
>   final List<StorageType> excessTypes = storagePolicy.chooseExcess(
>       (short) numOfTarget, DatanodeStorageInfo.toStorageTypes(nonExcess));
>   if (excessTypes.isEmpty()) {
>     LOG.warn("excess types chosen for block {} among storages {} is empty",
>         storedBlock, nonExcess);
>     return;
>   }
>   ...
> }
> {code}
>  
> IMO, this code only detects excess StorageTypes, so lowering the log level 
> to DEBUG here has no functional impact.
>  
> We have a cluster that uses the EC policy to store data. The log level here 
> is currently WARN, and about 286,093 of these messages were printed in 50 
> minutes, which can drown out other important logs.
>  
> !logs.jpg|width=1167,height=62!
>  
> !log-count.jpg|width=760,height=30!
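
Until the level change is deployed, one possible operational stop-gap (an 
assumption, not something discussed in this issue) is to raise the logger 
threshold for BlockManager in the NameNode's log4j configuration:
{code}
# Hypothetical workaround in log4j.properties: suppress WARNs from
# BlockManager until the DEBUG-level fix lands. Note this also hides every
# other WARN emitted by the same class, so use with care.
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockManager=ERROR
{code}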



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org