[jira] [Commented] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
[ https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403509#comment-17403509 ] tomscut commented on HDFS-16112: Thanks [~weichiu] for your help. > Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor > > > Key: HDFS-16112 > URL: https://issues.apache.org/jira/browse/HDFS-16112 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > > The unit tests > TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and > TestDecommissioningStatus#testDecommissionStatus have recently seemed a little > flaky; we should fix them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640933=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640933 ] ASF GitHub Bot logged work on HDFS-16182: - Author: ASF GitHub Bot Created on: 24/Aug/21 04:06 Start Date: 24/Aug/21 04:06 Worklog Time Spent: 10m Work Description: Neilxzn edited a comment on pull request #3320: URL: https://github.com/apache/hadoop/pull/3320#issuecomment-904304994 @jojochuang Agree with you; I think we should fix it. In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. Consider the following scenario. 1. Create empty file /foo_file. 2. Set its storage policy to All_SSD. 3. Put data to /foo_file. 4. /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. 5. While it transfers data in the pipeline, one of the 3 DISK dns shuts down. 6. The client needs to get one new dn for the existing pipeline in DataStreamer$addDatanode2ExistingPipeline. 7. If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client throws an exception in DataStreamer$findNewDatanode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640933) Time Spent: 50m (was: 40m) > numOfReplicas is given the wrong value in > BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with > Heterogeneous Storage > --- > > Key: HDFS-16182 > URL: https://issues.apache.org/jira/browse/HDFS-16182 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16182.patch > > Time Spent: 50m > Remaining Estimate: 0h > > In our hdfs cluster, we use heterogeneous storage to store data on SSD for > better performance. Sometimes when the hdfs client transfers data in a pipeline, it will > throw an IOException and exit. Exception logs are below: > ``` > java.io.IOException: Failed to replace a bad datanode on the existing > pipeline due to no more good datanodes being available to try. (Nodes: > current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK], > > DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD], > > DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD], > > DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]], > > original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]). > The current failed datanode replacement policy is DEFAULT, and a client may > configure this via > 'dfs.client.block.write.replace-datanode-on-failure.policy' in its > configuration.
> ``` > After investigating, I found that when an existing pipeline needs a replacement dn to > transfer data, the client gets one additional dn from the namenode and checks > that the number of dns is the original number + 1. > ``` > ## DataStreamer$findNewDatanode > if (nodes.length != original.length + 1) { > throw new IOException( > "Failed to replace a bad datanode on the existing pipeline " > + "due to no more good datanodes being available to try. " > + "(Nodes: current=" + Arrays.asList(nodes) > + ", original=" + Arrays.asList(original) + "). " > + "The current failed datanode replacement policy is " > + dfsClient.dtpReplaceDatanodeOnFailure > + ", and a client may configure this via '" > + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY > + "' in its configuration."); > } > ``` > The root cause is that Namenode$getAdditionalDatanode returns multiple datanodes, > not one, in DataStreamer.addDatanode2ExistingPipeline. > > Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think > numOfReplicas should not be assigned by requiredStorageTypes.
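The suggested fix direction (not letting the size of requiredStorageTypes override the caller's replica count) can be illustrated with a small self-contained sketch. All names here are hypothetical; this is a toy model of the reported behavior, not the actual BlockPlacementPolicyDefault internals:

```java
import java.util.List;

// Hypothetical sketch of the reported problem: the client asks for ONE
// additional node for an existing pipeline, but under an All_SSD policy with
// 2 DISK replicas already in the pipeline, the policy still "requires" 3 SSD
// storages, and (per this report) that count overrides the caller's request.
public class ChooseTargetSketch {

    // numOfReplicas: how many nodes the caller actually asked for (1 here).
    // requiredStorageTypes: storages still needed to satisfy the policy.
    static int replicasToChoose(int numOfReplicas,
                                List<String> requiredStorageTypes,
                                boolean applyFix) {
        if (!applyFix && requiredStorageTypes.size() > numOfReplicas) {
            // Buggy behavior: the request is inflated to the policy shortfall,
            // so the namenode hands back 3 SSD nodes instead of 1.
            return requiredStorageTypes.size();
        }
        // Suggested direction: honor the caller's count.
        return numOfReplicas;
    }

    public static void main(String[] args) {
        List<String> required = List.of("SSD", "SSD", "SSD");
        System.out.println(replicasToChoose(1, required, false)); // 3 (bug)
        System.out.println(replicasToChoose(1, required, true));  // 1 (fix)
    }
}
```

With the fix applied, the namenode would return exactly one node, satisfying the client's `original.length + 1` check.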
[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403006#comment-17403006 ] Max Xie edited comment on HDFS-16182 at 8/24/21, 4:06 AM: --- In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This may cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. Consider the following scenario. # Create empty file /foo_file # Set its storage policy to All_SSD # Put data to /foo_file # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. # While it transfers data in the pipeline, one of the 3 DISK dns shuts down. # The client needs to get one new dn for the existing pipeline in DataStreamer$addDatanode2ExistingPipeline. # If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client fails in DataStreamer$findNewDatanode. was (Author: max2049): In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This may cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. Consider the following scenario. # Create empty file /foo_file # Set its storage policy to All_SSD # Put data to /foo_file # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. # While it transfers data in the pipeline, one of the 3 DISK dns shuts down. # The client needs to get one new dn for the existing pipeline. # If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client fails in DataStreamer. 
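The invariant the scenario trips over is the `nodes.length != original.length + 1` check quoted in the issue description. A minimal, self-contained mirror of that check (a hypothetical toy class, not the real org.apache.hadoop.hdfs.DataStreamer):

```java
import java.util.Arrays;

// Toy mirror of the client-side invariant: after asking the namenode for a
// replacement, the client expects exactly original.length + 1 nodes back.
public class PipelineCheck {

    // Returns the single replacement node, or throws when the namenode
    // returned more (or fewer) nodes than the one that was requested.
    static String findNewDatanode(String[] original, String[] nodes) {
        if (nodes.length != original.length + 1) {
            throw new IllegalStateException(
                "Failed to replace a bad datanode on the existing pipeline: "
                + "current=" + Arrays.asList(nodes)
                + ", original=" + Arrays.asList(original));
        }
        for (String n : nodes) {
            if (!Arrays.asList(original).contains(n)) {
                return n; // the one node not already in the pipeline
            }
        }
        throw new IllegalStateException("replacement node not found");
    }

    public static void main(String[] args) {
        String[] original = {"dn01", "dn02"};
        // Step 7 of the scenario: 3 new SSD nodes come back -> 5 total, not 3.
        String[] returned = {"dn01", "dn02", "dn03", "dn04", "dn05"};
        try {
            findNewDatanode(original, returned);
        } catch (IllegalStateException e) {
            System.out.println("pipeline replacement failed: " + e.getMessage());
        }
    }
}
```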
[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640932=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640932 ] ASF GitHub Bot logged work on HDFS-16182: - Author: ASF GitHub Bot Created on: 24/Aug/21 04:04 Start Date: 24/Aug/21 04:04 Worklog Time Spent: 10m Work Description: Neilxzn commented on pull request #3320: URL: https://github.com/apache/hadoop/pull/3320#issuecomment-904304994 Agreed; I think we should fix it. In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. Consider the following scenario. 1. Create empty file /foo_file. 2. Set its storage policy to All_SSD. 3. Put data to /foo_file. 4. /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. 5. While it transfers data in the pipeline, one of the 3 DISK dns shuts down. 6. The client needs to get one new dn for the existing pipeline in DataStreamer$addDatanode2ExistingPipeline. 7. If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client throws an exception in DataStreamer$findNewDatanode. 
Issue Time Tracking --- Worklog Id: (was: 640932) Time Spent: 40m (was: 0.5h)
[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403006#comment-17403006 ] Max Xie edited comment on HDFS-16182 at 8/24/21, 4:00 AM: --- In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This may cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. Consider the following scenario. # Create empty file /foo_file # Set its storage policy to All_SSD # Put data to /foo_file # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. # While it transfers data in the pipeline, one of the 3 DISK dns shuts down. # The client needs to get one new dn for the existing pipeline. # If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client fails in DataStreamer. was (Author: max2049): In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This may cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. The steps are as follows. # Create empty file /foo_file # Set its storage policy to All_SSD # Put data to /foo_file # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. # While it transfers data in the pipeline, one of the 3 DISK dns shuts down. # The client needs to get one new dn for the existing pipeline. # If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client fails. 
[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640930=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640930 ] ASF GitHub Bot logged work on HDFS-16182: - Author: ASF GitHub Bot Created on: 24/Aug/21 03:49 Start Date: 24/Aug/21 03:49 Worklog Time Spent: 10m Work Description: jojochuang commented on pull request #3320: URL: https://github.com/apache/hadoop/pull/3320#issuecomment-904299806 I don't really understand the code, but this seems like an ancient regression from HDFS-6686. As general code advice, we should not update a parameter variable and pass it on. Issue Time Tracking --- Worklog Id: (was: 640930) Time Spent: 0.5h (was: 20m)
[jira] [Resolved] (HDFS-16180) FsVolumeImpl.nextBlock should consider that the block meta file has been deleted.
[ https://issues.apache.org/jira/browse/HDFS-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang resolved HDFS-16180. Fix Version/s: 3.4.0 Resolution: Fixed > FsVolumeImpl.nextBlock should consider that the block meta file has been > deleted. > - > > Key: HDFS-16180 > URL: https://issues.apache.org/jira/browse/HDFS-16180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > In my cluster, we found that when the VolumeScanner runs, the dn will sometimes > log errors like the ones below > ``` > > 2021-08-19 08:00:11,549 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: > Deleted BP-1020175758-nnip-1597745872895 blk_1142977964_69237147 URI > file:/disk1/dfs/data/current/BP-1020175758- > nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964 > 2021-08-19 08:00:48,368 ERROR > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl: > nextBlock(DS-060c8e4c-1ef6-49f5-91ef-91957356891a, BP-1020175758- > nnip-1597745872895): I/O error > java.io.IOException: Meta file not found, > blockFile=/disk1/dfs/data/current/BP-1020175758- > nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964 > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetUtil.findMetaFile(FsDatasetUtil.java:101) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.nextBlock(FsVolumeImpl.java:809) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:528) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:628) > 2021-08-19 08:00:48,368 WARN > org.apache.hadoop.hdfs.server.datanode.VolumeScanner: > VolumeScanner(/disk1/dfs/data, DS-060c8e4c-1ef6-49f5-91ef-91957356891a): > nextBlock error on > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@7febc6b4 > ``` > When the VolumeScanner scans block blk_1142977964 after it has been deleted by the > datanode, the scanner cannot find the meta file of blk_1142977964, so it logs > these errors. > > Maybe we should handle FileNotFoundException during nextBlock to reduce the error > logging and the number of nextBlock retries.
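The suggested handling can be sketched as a self-contained toy iterator that skips entries whose meta file vanished between directory listing and scanning. All names here are hypothetical; this is not the real FsVolumeImpl:

```java
import java.io.FileNotFoundException;
import java.util.Iterator;
import java.util.List;

// Hedged sketch of the suggested fix: treat a missing meta file as "block was
// deleted on purpose, move on" instead of surfacing an ERROR from the scan.
public class SkipDeletedBlocks {

    interface MetaLookup {
        String findMetaFile(String blockFile) throws FileNotFoundException;
    }

    // Returns the next block that still has a meta file, or null when the
    // iterator is exhausted.
    static String nextBlock(Iterator<String> blockFiles, MetaLookup lookup) {
        while (blockFiles.hasNext()) {
            String blockFile = blockFiles.next();
            try {
                lookup.findMetaFile(blockFile);
                return blockFile;
            } catch (FileNotFoundException e) {
                // The block was deleted after it was listed: skip it quietly
                // (a real implementation might log once at INFO level).
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Iterator<String> blocks = List.of("blk_1142977964", "blk_1142977965").iterator();
        MetaLookup lookup = blockFile -> {
            if (blockFile.equals("blk_1142977964")) {
                throw new FileNotFoundException("Meta file not found, blockFile=" + blockFile);
            }
            return blockFile + ".meta";
        };
        // The deleted block is skipped; the scan continues with the next one.
        System.out.println(nextBlock(blocks, lookup));
    }
}
```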
[jira] [Work logged] (HDFS-16180) FsVolumeImpl.nextBlock should consider that the block meta file has been deleted.
[ https://issues.apache.org/jira/browse/HDFS-16180?focusedWorklogId=640926=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640926 ] ASF GitHub Bot logged work on HDFS-16180: - Author: ASF GitHub Bot Created on: 24/Aug/21 03:15 Start Date: 24/Aug/21 03:15 Worklog Time Spent: 10m Work Description: jojochuang merged pull request #3315: URL: https://github.com/apache/hadoop/pull/3315 Issue Time Tracking --- Worklog Id: (was: 640926) Time Spent: 1.5h (was: 1h 20m)
[jira] [Resolved] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
[ https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang resolved HDFS-16112. Resolution: Duplicate Closed it for you. You have already been granted contributor privileges and should be able to close it yourself (that is my understanding).
[jira] [Assigned] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
[ https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang reassigned HDFS-16112: -- Assignee: tomscut
[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403006#comment-17403006 ] Max Xie edited comment on HDFS-16182 at 8/24/21, 2:57 AM: --- In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This may cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. The steps are as follows. # Create empty file /foo_file # Set its storage policy to All_SSD # Put data to /foo_file # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. # While it transfers data in the pipeline, one of the 3 DISK dns shuts down. # The client needs to get one new dn for the existing pipeline. # If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client fails. was (Author: max2049): In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This may cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or lack space. The steps are as follows. # Create empty file /foo_file # Set its storage policy to All_SSD # Put data to /foo_file # /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning. # While it transfers data in the pipeline, one of the 3 DISK dns shuts down. # The client needs to get one new dn for the existing pipeline. # If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns and the client fails. 
> numOfReplicas is given the wrong value in > BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with > Heterogeneous Storage > --- > > Key: HDFS-16182 > URL: https://issues.apache.org/jira/browse/HDFS-16182 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16182.patch > > Time Spent: 20m > Remaining Estimate: 0h > > In our hdfs cluster, we use heterogeneous storage to store data on SSD for better performance. Sometimes when the hdfs client transfers data in a pipeline, it throws an IOException and exits. Exception logs are below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing
> pipeline due to no more good datanodes being available to try. (Nodes:
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
> The current failed datanode replacement policy is DEFAULT, and a client may
> configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy'
> in its configuration.
> ```
> After investigating, I found that when an existing pipeline needs a replacement dn to transfer data, the client gets one additional dn from the namenode and checks that the number of dns is the original number + 1.
> ```
> ## DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>       + "due to no more good datanodes being available to try. "
>       + "(Nodes: current=" + Arrays.asList(nodes)
>       + ", original=" + Arrays.asList(original) + "). "
>       + "The current failed datanode replacement policy is "
>       + dfsClient.dtpReplaceDatanodeOnFailure
>       + ", and a client may configure this via '"
>       + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>       + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple datanodes, not one, in DataStreamer.addDatanode2ExistingPipeline.
>
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think numOfReplicas should not be assigned by requiredStorageTypes.
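The suspected assignment described above can be illustrated with a toy model. This is a hedged sketch, not the real BlockPlacementPolicyDefault code: the method names and the exact shape of the assignment are illustrative, and only the idea that the requested replica count gets overwritten by the required-storage-type count comes from the report:

```java
public class NumOfReplicasSketch {
    // Buggy path described in the report: numOfReplicas is overwritten by the
    // number of storage types the policy still requires (3 SSDs for All_SSD),
    // so three targets come back even though the client asked for one.
    static int targetsReturnedBuggy(int numOfReplicas, int requiredStorageTypes) {
        numOfReplicas = requiredStorageTypes; // the suspect assignment
        return numOfReplicas;
    }

    // Proposed behavior: honor the caller's request for additional replicas.
    static int targetsReturnedFixed(int numOfReplicas, int requiredStorageTypes) {
        return numOfReplicas;
    }

    public static void main(String[] args) {
        // Client asked for 1 extra node; the All_SSD policy still "requires" 3 SSD replicas.
        System.out.println(targetsReturnedBuggy(1, 3)); // 3 -> the client-side length check fails
        System.out.println(targetsReturnedFixed(1, 3)); // 1 -> pipeline recovery can succeed
    }
}
```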
[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403006#comment-17403006 ] Max Xie edited comment on HDFS-16182 at 8/24/21, 2:55 AM: --- In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or have no free space. The steps are as follows.
# Create empty file /foo_file
# Set its storagepolicy to All_SSD
# Put data to /foo_file
# /foo_file gets 3 DISK dns for the pipeline because the SSD dns are too busy at the beginning.
# While it transfers data in the pipeline, one of the 3 DISK dns shuts down.
# The client needs to get one new dn for the existing pipeline.
# If SSD dns are available at that moment, the namenode will choose 3 SSD dns and return them to the client. However, the client needs just one new dn; the namenode returns 3 new SSD dns, and the client fails.
was (Author: max2049): In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or have no free space.
# create empty file /foo_file
# set its storagepolicy to All_SSD
# put data to /foo_file
# /foo_file gets 3 DISK dns for pipeline because SSD dns are too busy at the beginning.
# when it transfers data in pipeline, one of 3 DISK dns shuts down.
[jira] [Comment Edited] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403006#comment-17403006 ] Max Xie edited comment on HDFS-16182 at 8/24/21, 2:45 AM: --- In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks that should be placed on SSD DNs to fall back to DISK DNs when the SSD DNs are too busy or have no free space.
# create empty file /foo_file
# set its storagepolicy to All_SSD
# put data to /foo_file
# /foo_file gets 3 DISK dns for pipeline because SSD dns are too busy at the beginning.
# when it transfers data in pipeline, one of 3 DISK dns shuts down.
was (Author: max2049): In my cluster, we use BlockPlacementPolicyDefault to choose dns, and the number of SSD DNs is much smaller than the number of DISK DNs. This can cause blocks that should be placed on SSD DNs to fall back to DISK DNs.
[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403460#comment-17403460 ] Max Xie commented on HDFS-16182: - The unit test failures (hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl, hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped) seem unrelated. I re-ran these tests locally in IDEA and they pass.
[jira] [Commented] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
[ https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403441#comment-17403441 ] tomscut commented on HDFS-16112: This was fixed by [HDFS-16171|https://issues.apache.org/jira/browse/HDFS-16171]. Hi [~weichiu] [~tasanuma] [~ferhui] , may I ask how to close this issue? Thanks.
[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
[ https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640792=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640792 ] ASF GitHub Bot logged work on HDFS-16143: - Author: ASF GitHub Bot Created on: 23/Aug/21 17:58 Start Date: 23/Aug/21 17:58 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #3235: URL: https://github.com/apache/hadoop/pull/3235#issuecomment-903987999 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 52s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 12m 27s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 23m 2s | | trunk passed | | +1 :green_heart: | compile | 22m 55s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | compile | 19m 25s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | checkstyle | 3m 50s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 4s | | trunk passed | | +1 :green_heart: | javadoc | 2m 7s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 3m 10s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 5m 49s | | trunk passed | | +1 :green_heart: | shadedclient | 19m 3s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 24s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 16s | | the patch passed | | +1 :green_heart: | compile | 22m 8s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javac | 22m 8s | | the patch passed | | +1 :green_heart: | compile | 19m 37s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | javac | 19m 37s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 3m 53s | | the patch passed | | +1 :green_heart: | mvnsite | 2m 59s | | the patch passed | | +1 :green_heart: | javadoc | 2m 4s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 3m 9s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 6m 16s | | the patch passed | | +1 :green_heart: | shadedclient | 19m 25s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 16m 47s | | hadoop-common in the patch passed. | | -1 :x: | unit | 333m 9s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3235/37/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 58s | | The patch does not generate ASF License warnings. 
| | | | 547m 19s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.datanode.TestBlockScanner | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3235/37/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/3235 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell | | uname | Linux 84206923342f 4.15.0-147-generic #151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / c18b3d3013658348a1c32b090ea7b8a6c06634ae | | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private
[jira] [Work logged] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS
[ https://issues.apache.org/jira/browse/HDFS-6874?focusedWorklogId=640761=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640761 ] ASF GitHub Bot logged work on HDFS-6874: Author: ASF GitHub Bot Created on: 23/Aug/21 16:31 Start Date: 23/Aug/21 16:31 Worklog Time Spent: 10m Work Description: amahussein commented on a change in pull request #3322: URL: https://github.com/apache/hadoop/pull/3322#discussion_r694125842 ## File path: hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/web/WebHdfsFileSystem.java ## @@ -1857,18 +1859,57 @@ public synchronized void cancelDelegationToken(final Token token } @Override - public BlockLocation[] getFileBlockLocations(final Path p, - final long offset, final long length) throws IOException { + public BlockLocation[] getFileBlockLocations(final Path p, final long offset, + final long length) throws IOException { statistics.incrementReadOps(1); storageStatistics.incrementOpCounter(OpType.GET_FILE_BLOCK_LOCATIONS); +BlockLocation[] locations = null; +try { + if (isServerHCFSCompatible) { +locations = +getFileBlockLocations(GetOpParam.Op.GETFILEBLOCKLOCATIONS, p, offset, length); + } else { +locations = getFileBlockLocations(GetOpParam.Op.GET_BLOCK_LOCATIONS, p, +offset, length); + } +} catch (RemoteException e) { + if (isGetFileBlockLocationsException(e)) { Review comment: ```suggestion // parsing the exception is needed only if the client thinks the service is compatible if (isServerHCFSCompatible && isGetFileBlockLocationsException(e)) { ``` ## File path: hadoop-hdfs-project/hadoop-hdfs-httpfs/src/test/java/org/apache/hadoop/fs/http/server/TestHttpFSServer.java ## @@ -2002,4 +2003,38 @@ public void testContentType() throws Exception { () -> HttpFSUtils.jsonParse(conn)); conn.disconnect(); } + + @Test + @TestDir + @TestJetty + @TestHdfs + public void testGetFileBlockLocations() throws Exception { +createHttpFSServer(false, false); +// Create a test directory +String pathStr = 
"/tmp/tmp-snap-diff-test"; +createDirWithHttp(pathStr, "700", null); + +Path path = new Path(pathStr); +DistributedFileSystem dfs = (DistributedFileSystem) FileSystem +.get(path.toUri(), TestHdfsHelper.getHdfsConf()); +// Enable snapshot +dfs.allowSnapshot(path); +Assert.assertTrue(dfs.getFileStatus(path).isSnapshotEnabled()); +// Create a file and take a snapshot +String file1 = pathStr + "/file1"; +createWithHttp(file1, null); +HttpURLConnection conn = sendRequestToHttpFSServer(file1, +"GETFILEBLOCKLOCATIONS", "length=10"); +Assert.assertEquals(HttpURLConnection.HTTP_OK, conn.getResponseCode()); +BlockLocation[] locations1 = +dfs.getFileBlockLocations(new Path(file1), 0, 1); +Assert.assertNotNull(locations1); + +HttpURLConnection conn1 = sendRequestToHttpFSServer(file1, +"GET_BLOCK_LOCATIONS", "length=10"); +Assert.assertEquals(HttpURLConnection.HTTP_OK, conn1.getResponseCode()); +BlockLocation[] locations2 = +dfs.getFileBlockLocations(new Path(file1), 0, 1); +Assert.assertNotNull(locations2); + } Review comment: Falling back from `GETFILEBLOCKLOCATIONS` to `GET_FILE_BLOCK_LOCATIONS` and caching the boolean flag is not tested. Maybe we need another unit test that assumes that the operation is not supported and falls back to the old. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640761) Time Spent: 0.5h (was: 20m) > Add GETFILEBLOCKLOCATIONS operation to HttpFS > - > > Key: HDFS-6874 > URL: https://issues.apache.org/jira/browse/HDFS-6874 > Project: Hadoop HDFS > Issue Type: Improvement > Components: httpfs >Affects Versions: 2.4.1, 2.7.3 >Reporter: Gao Zhong Liang >Assignee: Weiwei Yang >Priority: Major > Labels: BB2015-05-TBR, pull-request-available > Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, > HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, > HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, > HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, > HDFS-6874.10.patch, HDFS-6874.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > >
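The fallback-and-cache behavior the reviewer wants covered can be sketched independently of HttpFS. The class, field, and method names below are hypothetical; only the try-the-new-op-then-remember pattern mirrors the WebHdfsFileSystem change under review, where a compatibility flag is cached after the first unrecognized-operation failure:

```java
public class FallbackSketch {
    // Optimistic until the server proves it does not support the new op.
    private boolean serverCompatible = true;
    int compatibleCalls = 0, legacyCalls = 0;

    // Returns the name of the operation actually used, standing in for the
    // real getFileBlockLocations() call paths.
    String getBlockLocations(boolean serverSupportsNewOp) {
        if (serverCompatible) {
            compatibleCalls++;
            if (serverSupportsNewOp) {
                return "GETFILEBLOCKLOCATIONS";
            }
            serverCompatible = false; // cache the failure; never retry the new op
        }
        legacyCalls++;
        return "GET_BLOCK_LOCATIONS";
    }

    public static void main(String[] args) {
        FallbackSketch fs = new FallbackSketch();
        System.out.println(fs.getBlockLocations(false)); // GET_BLOCK_LOCATIONS (after one failed attempt)
        System.out.println(fs.getBlockLocations(false)); // GET_BLOCK_LOCATIONS (new op not retried)
        System.out.println(fs.compatibleCalls);          // 1
    }
}
```

A unit test along these lines, asserting that the new operation is attempted exactly once against an incompatible server, is what the review comment asks for.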
[jira] [Commented] (HDFS-12188) TestDecommissioningStatus#testDecommissionStatus fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-12188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403275#comment-17403275 ] Ahmed Hussein commented on HDFS-12188: -- Thanks [~vjasani]. I marked this issue as fixed by HDFS-16171. > TestDecommissioningStatus#testDecommissionStatus fails intermittently > - > > Key: HDFS-12188 > URL: https://issues.apache.org/jira/browse/HDFS-12188 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: test >Reporter: Brahma Reddy Battula >Assignee: Ajay Kumar >Priority: Major > Labels: pull-request-available > Attachments: TestFailure_Log.txt > > Time Spent: 40m > Remaining Estimate: 0h > > {noformat} > java.lang.AssertionError: Unexpected num under-replicated blocks expected:<3> > but was:<4> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.checkDecommissionStatus(TestDecommissioningStatus.java:144) > at > org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:240) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
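One common way to de-flake an assertion like the one in this stack trace is to poll until the cluster state converges instead of asserting once. This is a generic sketch with illustrative names, not necessarily what HDFS-16171 actually changed (Hadoop tests typically use a helper in the GenericTestUtils family for this):

```java
public class WaitForSketch {
    interface IntSupplierEx { int get() throws Exception; }

    // Poll `actual` every intervalMs until it equals `expected` or timeoutMs elapses.
    static void waitForValue(IntSupplierEx actual, int expected,
                             long intervalMs, long timeoutMs) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (actual.get() == expected) {
                return;
            }
            Thread.sleep(intervalMs);
        }
        throw new AssertionError("timed out waiting for value " + expected);
    }

    public static void main(String[] args) throws Exception {
        // Simulate an under-replicated-block count that converges from 4 to 3
        // once background replication work finishes.
        final int[] underReplicated = {4};
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            underReplicated[0] = 3;
        }).start();
        waitForValue(() -> underReplicated[0], 3, 20, 5000);
        System.out.println("converged: " + underReplicated[0]);
    }
}
```

A one-shot `assertEquals(3, ...)` races against the replication monitor; polling with a generous timeout removes the race without masking real failures.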
[jira] [Resolved] (HDFS-12188) TestDecommissioningStatus#testDecommissionStatus fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-12188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein resolved HDFS-12188. -- Resolution: Fixed
[jira] [Work logged] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS
[ https://issues.apache.org/jira/browse/HDFS-6874?focusedWorklogId=640739=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640739 ] ASF GitHub Bot logged work on HDFS-6874: Author: ASF GitHub Bot Created on: 23/Aug/21 15:24 Start Date: 23/Aug/21 15:24 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #3322: URL: https://github.com/apache/hadoop/pull/3322#issuecomment-903872705 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 42s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 12m 44s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 20m 10s | | trunk passed | | +1 :green_heart: | compile | 4m 52s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | compile | 4m 38s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | checkstyle | 1m 15s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 1s | | trunk passed | | +1 :green_heart: | javadoc | 2m 10s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 2m 39s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 6m 23s | | trunk passed | | +1 :green_heart: | shadedclient | 14m 3s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 28s | | Maven dependency ordering for patch | | -1 :x: | mvninstall | 0m 20s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt) | hadoop-hdfs-httpfs in the patch failed. | | -1 :x: | compile | 4m 17s | [/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt) | hadoop-hdfs-project in the patch failed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04. | | -1 :x: | javac | 4m 17s | [/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt) | hadoop-hdfs-project in the patch failed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04. | | -1 :x: | compile | 4m 4s | [/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt) | hadoop-hdfs-project in the patch failed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10. | | -1 :x: | javac | 4m 4s | [/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt) | hadoop-hdfs-project in the patch failed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10. | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. 
| | -0 :warning: | checkstyle | 1m 7s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 4 new + 462 unchanged - 1 fixed = 466 total (was 463) | | -1 :x: | mvnsite | 0m 22s | [/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3322/1/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-httpfs.txt) | hadoop-hdfs-httpfs in the patch failed. | | +1 :green_heart: | javadoc | 1m 44s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1
[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640730=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640730 ] ASF GitHub Bot logged work on HDFS-16182: - Author: ASF GitHub Bot Created on: 23/Aug/21 15:01 Start Date: 23/Aug/21 15:01 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #3320: URL: https://github.com/apache/hadoop/pull/3320#issuecomment-903852770 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 18s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 33m 36s | | trunk passed | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | compile | 1m 15s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | checkstyle | 1m 1s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 28s | | trunk passed | | +1 :green_heart: | javadoc | 0m 58s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 1m 23s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 3m 18s | | trunk passed | | +1 :green_heart: | shadedclient | 19m 10s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 18s | | the patch passed | | +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javac | 1m 19s | | the patch passed | | +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | javac | 1m 12s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 51s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3320/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 154 unchanged - 1 fixed = 155 total (was 155) | | +1 :green_heart: | mvnsite | 1m 18s | | the patch passed | | +1 :green_heart: | javadoc | 0m 50s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 1m 20s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 3m 22s | | the patch passed | | +1 :green_heart: | shadedclient | 19m 3s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 356m 5s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3320/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. 
| | | | 449m 41s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3320/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/3320 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell | | uname | Linux 5ca83a692447 4.15.0-143-generic #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 74f75f9545382b06b07d18e9a657ecfb9ab115a9 | | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403180#comment-17403180 ] Hadoop QA commented on HDFS-16182: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m 6s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 49s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 23s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 16s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 3s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 23s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 43s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 27s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 21m 10s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. {color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 3m 5s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 11s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 13s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 13s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 7s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 53s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/702/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 
new + 154 unchanged - 1 fixed = 155 total (was 155) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 14s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 41s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 26s{color} | {color:green}{color} |
[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS
[ https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640699=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640699 ] ASF GitHub Bot logged work on HDFS-16175: - Author: ASF GitHub Bot Created on: 23/Aug/21 12:47 Start Date: 23/Aug/21 12:47 Worklog Time Spent: 10m Work Description: jianghuazhu commented on pull request #3307: URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903732597 Thanks @ayushtkn for the comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640699) Time Spent: 4h (was: 3h 50m) > Improve the configurable value of Server #PURGE_INTERVAL_NANOS > -- > > Key: HDFS-16175 > URL: https://issues.apache.org/jira/browse/HDFS-16175 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ipc >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 4h > Remaining Estimate: 0h > > In Server, PURGE_INTERVAL_NANOS is a fixed value of 15 minutes. > We can make Server#PURGE_INTERVAL_NANOS configurable, > which will make RPC more flexible. > private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert( > 15, TimeUnit.MINUTES); -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
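As a rough sketch of the change proposed in HDFS-16175, the hard-coded 15-minute interval can be replaced by a value read from configuration, falling back to the current default. The key name "ipc.server.purge.interval" and the plain Map standing in for Hadoop's Configuration are assumptions for illustration only, not the actual patch:

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class PurgeInterval {
    // Current behavior in org.apache.hadoop.ipc.Server: a hard-coded
    // 15-minute purge interval, expressed in nanoseconds.
    static final long DEFAULT_PURGE_INTERVAL_NANOS =
        TimeUnit.NANOSECONDS.convert(15, TimeUnit.MINUTES);

    // Sketch of the proposed change: read the interval (in minutes) from
    // configuration and fall back to the current default. The key name
    // "ipc.server.purge.interval" is an assumed placeholder.
    static long purgeIntervalNanos(Map<String, String> conf) {
        String v = conf.get("ipc.server.purge.interval");
        long minutes = (v == null) ? 15 : Long.parseLong(v);
        return TimeUnit.NANOSECONDS.convert(minutes, TimeUnit.MINUTES);
    }
}
```

In the real patch the value would come from Hadoop's `Configuration` (e.g. via `getTimeDuration`), which also handles unit suffixes; the map here just keeps the sketch self-contained.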
[jira] [Commented] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations
[ https://issues.apache.org/jira/browse/HDFS-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403140#comment-17403140 ] Bryan Beaudreault commented on HDFS-16155: -- [~hexiaoqiao] any chance you could review this? > Allow configurable exponential backoff in DFSInputStream refetchLocations > - > > Key: HDFS-16155 > URL: https://issues.apache.org/jira/browse/HDFS-16155 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Minor > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > The retry policy in > [DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040] > was first written many years ago. It allows configuration of the base time > window, but subsequent retries double in an un-configurable way. This retry > strategy makes sense in some clusters as it's very conservative and will > avoid DDOSing the namenode in certain systemic failure modes – for example, > if a file is being read by a large hadoop job and the underlying blocks are > moved by the balancer. In this case, enough datanodes would be added to the > deadNodes list and all hadoop tasks would simultaneously try to refetch the > blocks. The 3s doubling with random factor helps break up that stampeding > herd. > However, not all cluster use-cases are created equal, so there are other > cases where a more aggressive initial backoff is preferred. For example in a > low-latency single reader scenario. In this case, if the balancer moves > enough blocks, the reader hits this 3s backoff which is way too long for a > low latency use-case. > One could configure the the window very low (10ms), but then you can hit > other systemic failure modes which would result in readers DDOSing the > namenode again. 
For example, if blocks went missing due to truly dead > datanodes. In this case, many readers might be refetching locations for > different files with retry backoffs like 10ms, 20ms, 40ms, etc. It takes a > while to backoff enough to avoid impacting the namenode with that strategy. > I suggest adding a configurable multiplier to the backoff strategy so that > operators can tune this as they see fit for their use-case. In the above low > latency case, one could set the base very low (say 2ms) and the multiplier > very high (say 50). This gives an aggressive first retry that very quickly > backs off. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
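The backoff proposal above can be sketched as follows. The deterministic part of the schedule is base * multiplier^attempt; a multiplier of 2 models the current hard-coded doubling, while the random jitter the issue mentions (to break up stampeding herds) is omitted for clarity. Method and parameter names are illustrative, not actual DFSClient configuration keys:

```java
public class RefetchBackoff {
    // Deterministic part of an exponential backoff: base * multiplier^attempt.
    // attempt 0 is the first retry. A random factor would normally be added
    // on top to desynchronize many readers retrying at once.
    static long delayMs(long baseMs, long multiplier, int attempt) {
        long d = baseMs;
        for (int i = 0; i < attempt; i++) {
            d *= multiplier;
        }
        return d;
    }
}
```

With today's defaults (3s base, doubling) the first retries are 3000ms and 6000ms; with the low-latency setting suggested above (2ms base, multiplier 50) they are 2ms, 100ms, then 5000ms, i.e. an aggressive first retry that still backs off quickly.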
[jira] [Updated] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS
[ https://issues.apache.org/jira/browse/HDFS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-6874: - Labels: BB2015-05-TBR pull-request-available (was: BB2015-05-TBR) > Add GETFILEBLOCKLOCATIONS operation to HttpFS > - > > Key: HDFS-6874 > URL: https://issues.apache.org/jira/browse/HDFS-6874 > Project: Hadoop HDFS > Issue Type: Improvement > Components: httpfs >Affects Versions: 2.4.1, 2.7.3 >Reporter: Gao Zhong Liang >Assignee: Weiwei Yang >Priority: Major > Labels: BB2015-05-TBR, pull-request-available > Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, > HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, > HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, > HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, > HDFS-6874.10.patch, HDFS-6874.patch > > Time Spent: 10m > Remaining Estimate: 0h > > GETFILEBLOCKLOCATIONS operation is missing in HttpFS, which is already > supported in WebHDFS. For the request of GETFILEBLOCKLOCATIONS in > org.apache.hadoop.fs.http.server.HttpFSServer, BAD_REQUEST is returned so far: > ... > case GETFILEBLOCKLOCATIONS: { > response = Response.status(Response.Status.BAD_REQUEST).build(); > break; > } > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS
[ https://issues.apache.org/jira/browse/HDFS-6874?focusedWorklogId=640645=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640645 ] ASF GitHub Bot logged work on HDFS-6874: Author: ASF GitHub Bot Created on: 23/Aug/21 09:33 Start Date: 23/Aug/21 09:33 Worklog Time Spent: 10m Work Description: jojochuang opened a new pull request #3322: URL: https://github.com/apache/hadoop/pull/3322 ### Description of PR This is a rebase of the patch file 11 attached to HDFS-6874. GETFILEBLOCKLOCATIONS is HCFS compatible; adding support for it to HttpFS makes it possible for more applications to run directly against the HttpFS server. Add GETFILEBLOCKLOCATIONS op support for the httpfs server (HttpFSServer). Add the same for the httpfs client (HttpFSFileSystem). Let the webhdfs client (WebHdfsFileSystem) try the new GETFILEBLOCKLOCATIONS op if the server supports it; otherwise, fall back to the old GET_FILE_BLOCK_LOCATIONS op. The selection is cached so the second invocation doesn't need to go through trial and error again. ### How was this patch tested? Unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640645) Remaining Estimate: 0h Time Spent: 10m > Add GETFILEBLOCKLOCATIONS operation to HttpFS > - > > Key: HDFS-6874 > URL: https://issues.apache.org/jira/browse/HDFS-6874 > Project: Hadoop HDFS > Issue Type: Improvement > Components: httpfs >Affects Versions: 2.4.1, 2.7.3 >Reporter: Gao Zhong Liang >Assignee: Weiwei Yang >Priority: Major > Labels: BB2015-05-TBR > Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, > HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, > HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, > HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, > HDFS-6874.10.patch, HDFS-6874.patch > > Time Spent: 10m > Remaining Estimate: 0h > > GETFILEBLOCKLOCATIONS operation is missing in HttpFS, which is already > supported in WebHDFS. For the request of GETFILEBLOCKLOCATIONS in > org.apache.hadoop.fs.http.server.HttpFSServer, BAD_REQUEST is returned so far: > ... > case GETFILEBLOCKLOCATIONS: { > response = Response.status(Response.Status.BAD_REQUEST).build(); > break; > } > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS
[ https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640643=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640643 ] ASF GitHub Bot logged work on HDFS-16175: - Author: ASF GitHub Bot Created on: 23/Aug/21 09:25 Start Date: 23/Aug/21 09:25 Worklog Time Spent: 10m Work Description: jianghuazhu edited a comment on pull request #3307: URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903579635 Some exceptions occurred here in jenkins. But it seems to have nothing to do with the code I submitted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640643) Time Spent: 3h 50m (was: 3h 40m) > Improve the configurable value of Server #PURGE_INTERVAL_NANOS > -- > > Key: HDFS-16175 > URL: https://issues.apache.org/jira/browse/HDFS-16175 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ipc >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 3h 50m > Remaining Estimate: 0h > > In Server, Server #PURGE_INTERVAL_NANOS is a fixed value, 15. > We can try to improve the configurable value of Server #PURGE_INTERVAL_NANOS, > which will make RPC more flexible. > private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert( > 15, TimeUnit.MINUTES); -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS
[ https://issues.apache.org/jira/browse/HDFS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403069#comment-17403069 ] Wei-Chiu Chuang commented on HDFS-6874: --- I rebased the patch and addressed Inigo's comments. Will raise a PR. > Add GETFILEBLOCKLOCATIONS operation to HttpFS > - > > Key: HDFS-6874 > URL: https://issues.apache.org/jira/browse/HDFS-6874 > Project: Hadoop HDFS > Issue Type: Improvement > Components: httpfs >Affects Versions: 2.4.1, 2.7.3 >Reporter: Gao Zhong Liang >Assignee: Weiwei Yang >Priority: Major > Labels: BB2015-05-TBR > Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, > HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, > HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, > HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, > HDFS-6874.10.patch, HDFS-6874.patch > > > GETFILEBLOCKLOCATIONS operation is missing in HttpFS, which is already > supported in WebHDFS. For the request of GETFILEBLOCKLOCATIONS in > org.apache.hadoop.fs.http.server.HttpFSServer, BAD_REQUEST is returned so far: > ... > case GETFILEBLOCKLOCATIONS: { > response = Response.status(Response.Status.BAD_REQUEST).build(); > break; > } > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
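The trial-and-cache fallback the pull request describes (try the new op once, and remember whether the server supports it) can be sketched as below. The `Supplier` stand-ins and `UnsupportedOperationException` are assumptions for illustration; the real client distinguishes server responses differently:

```java
import java.util.function.Supplier;

public class OpFallback {
    // null = unknown until the first call; thereafter cached, so later
    // invocations never repeat the trial against an old server.
    private Boolean serverSupportsNewOp = null;

    // The two suppliers stand in for the new and legacy RPCs.
    String fetch(Supplier<String> newOp, Supplier<String> legacyOp) {
        if (serverSupportsNewOp == null || serverSupportsNewOp) {
            try {
                String r = newOp.get();
                serverSupportsNewOp = true;   // cache the positive result
                return r;
            } catch (UnsupportedOperationException e) {
                serverSupportsNewOp = false;  // cache the negative result too
            }
        }
        return legacyOp.get();
    }
}
```

Caching both outcomes is the key design point: against an old server, only the very first call pays for the failed trial.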
[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS
[ https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640632=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640632 ] ASF GitHub Bot logged work on HDFS-16175: - Author: ASF GitHub Bot Created on: 23/Aug/21 09:06 Start Date: 23/Aug/21 09:06 Worklog Time Spent: 10m Work Description: jianghuazhu commented on pull request #3307: URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903580807 @ayushtkn , can you help review the code. Thank you very much. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640632) Time Spent: 3h 40m (was: 3.5h) > Improve the configurable value of Server #PURGE_INTERVAL_NANOS > -- > > Key: HDFS-16175 > URL: https://issues.apache.org/jira/browse/HDFS-16175 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ipc >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 3h 40m > Remaining Estimate: 0h > > In Server, Server #PURGE_INTERVAL_NANOS is a fixed value, 15. > We can try to improve the configurable value of Server #PURGE_INTERVAL_NANOS, > which will make RPC more flexible. > private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert( > 15, TimeUnit.MINUTES); -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS
[ https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640631=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640631 ] ASF GitHub Bot logged work on HDFS-16175: - Author: ASF GitHub Bot Created on: 23/Aug/21 09:04 Start Date: 23/Aug/21 09:04 Worklog Time Spent: 10m Work Description: jianghuazhu commented on pull request #3307: URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903579635 Some anomalies happened here in jenkins. But it does not seem to be related to the code I submitted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640631) Time Spent: 3.5h (was: 3h 20m) > Improve the configurable value of Server #PURGE_INTERVAL_NANOS > -- > > Key: HDFS-16175 > URL: https://issues.apache.org/jira/browse/HDFS-16175 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ipc >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 3.5h > Remaining Estimate: 0h > > In Server, Server #PURGE_INTERVAL_NANOS is a fixed value, 15. > We can try to improve the configurable value of Server #PURGE_INTERVAL_NANOS, > which will make RPC more flexible. > private final static long PURGE_INTERVAL_NANOS = TimeUnit.NANOSECONDS.convert( > 15, TimeUnit.MINUTES); -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
[ https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640627=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640627 ] ASF GitHub Bot logged work on HDFS-16143: - Author: ASF GitHub Bot Created on: 23/Aug/21 09:00 Start Date: 23/Aug/21 09:00 Worklog Time Spent: 10m Work Description: tasanuma commented on a change in pull request #3235: URL: https://github.com/apache/hadoop/pull/3235#discussion_r693790700 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java ## @@ -429,19 +432,29 @@ public void testStandbyTriggersLogRollsWhenTailInProgressEdits() waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId, standbyCatchupWaitTime); + long curTime = standby.getNamesystem().getEditLogTailer().getTimer() + .monotonicNow(); + long inSufficientTimeForLogRoll = logRollPeriodMs / 3; + final FakeTimer testTimer = + new FakeTimer(curTime + inSufficientTimeForLogRoll); + standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer); + Thread.sleep(2000); + for (int i = DIRS_TO_MAKE / 2; i < DIRS_TO_MAKE; i++) { NameNodeAdapter.mkdirs(active, getDirPath(i), new PermissionStatus("test", "test", new FsPermission((short)00755)), true); } - boolean exceptionThrown = false; try { checkForLogRoll(active, origTxId, noLogRollWaitTime); +fail("Expected to timeout"); } catch (TimeoutException e) { -exceptionThrown = true; +// expected } - assertTrue(exceptionThrown); + + long sufficientTimeForLogRoll = logRollPeriodMs * 3; Review comment: I understood. Thanks for your detailed explanation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640627) Time Spent: 10h (was: 9h 50m) > TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky > - > > Key: HDFS-16143 > URL: https://issues.apache.org/jira/browse/HDFS-16143 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: test >Reporter: Akira Ajisaka >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > > Time Spent: 10h > Remaining Estimate: 0h > > https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > {quote} > [ERROR] > testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer) > Time elapsed: 6.862 s <<< FAILURE! > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:87) > at org.junit.Assert.assertTrue(Assert.java:42) > at org.junit.Assert.assertTrue(Assert.java:53) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
[ https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640620=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640620 ] ASF GitHub Bot logged work on HDFS-16143: - Author: ASF GitHub Bot Created on: 23/Aug/21 08:46 Start Date: 23/Aug/21 08:46 Worklog Time Spent: 10m Work Description: virajjasani commented on a change in pull request #3235: URL: https://github.com/apache/hadoop/pull/3235#discussion_r693780174 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java ## @@ -429,19 +432,29 @@ public void testStandbyTriggersLogRollsWhenTailInProgressEdits() waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId, standbyCatchupWaitTime); + long curTime = standby.getNamesystem().getEditLogTailer().getTimer() + .monotonicNow(); + long inSufficientTimeForLogRoll = logRollPeriodMs / 3; + final FakeTimer testTimer = + new FakeTimer(curTime + inSufficientTimeForLogRoll); + standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer); Review comment: Nice idea, I think we can target this as follow up work. Similar to EditLogTailer, we should introduce `Timer` instance such that we keep using Timer's default version of `now`, `monotonicNow` etc utilities but tests would get a way to inject FakeTimer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640620) Time Spent: 9h 50m (was: 9h 40m) > TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky > - > > Key: HDFS-16143 > URL: https://issues.apache.org/jira/browse/HDFS-16143 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: test >Reporter: Akira Ajisaka >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > > Time Spent: 9h 50m > Remaining Estimate: 0h > > https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > {quote} > [ERROR] > testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer) > Time elapsed: 6.862 s <<< FAILURE! > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:87) > at org.junit.Assert.assertTrue(Assert.java:42) > at org.junit.Assert.assertTrue(Assert.java:53) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
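The log-roll check this test exercises reduces to a small, timer-injectable predicate. This sketch mirrors the `tooLongSinceLastLoad()` logic quoted in the review thread, with a plain `nowMs` parameter standing in for Hadoop's injectable `Timer`/`FakeTimer`:

```java
public class LogRollCheck {
    // Roll when more than logRollPeriodMs has elapsed since the last roll;
    // a negative period disables rolling entirely.
    static boolean tooLongSinceLastLoad(long logRollPeriodMs,
                                        long nowMs, long lastRollTimeMs) {
        return logRollPeriodMs >= 0
            && (nowMs - lastRollTimeMs) > logRollPeriodMs;
    }
}
```

Plugging in the numbers from the discussion: advancing the fake clock by logRollPeriodMs / 3 keeps the predicate false (no roll expected), while advancing it by logRollPeriodMs * 3 makes it true, which is what the test's two phases rely on.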
[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
[ https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640619=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640619 ] ASF GitHub Bot logged work on HDFS-16143: - Author: ASF GitHub Bot Created on: 23/Aug/21 08:42 Start Date: 23/Aug/21 08:42 Worklog Time Spent: 10m Work Description: virajjasani commented on a change in pull request #3235: URL: https://github.com/apache/hadoop/pull/3235#discussion_r693776368 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java ## @@ -429,19 +432,29 @@ public void testStandbyTriggersLogRollsWhenTailInProgressEdits() waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId, standbyCatchupWaitTime); + long curTime = standby.getNamesystem().getEditLogTailer().getTimer() + .monotonicNow(); + long inSufficientTimeForLogRoll = logRollPeriodMs / 3; + final FakeTimer testTimer = + new FakeTimer(curTime + inSufficientTimeForLogRoll); + standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer); + Thread.sleep(2000); + for (int i = DIRS_TO_MAKE / 2; i < DIRS_TO_MAKE; i++) { NameNodeAdapter.mkdirs(active, getDirPath(i), new PermissionStatus("test", "test", new FsPermission((short)00755)), true); } - boolean exceptionThrown = false; try { checkForLogRoll(active, origTxId, noLogRollWaitTime); +fail("Expected to timeout"); } catch (TimeoutException e) { -exceptionThrown = true; +// expected } - assertTrue(exceptionThrown); + + long sufficientTimeForLogRoll = logRollPeriodMs * 3; Review comment: We multiply by 3 to advance timer.monotonicNow() by `logRollPeriodMs * 3` which would be `15` here, and that is quite sufficient for log roll as per this equation in EditLogTailer: ``` /** * @return true if the configured log roll period has elapsed. 
*/ private boolean tooLongSinceLastLoad() { return logRollPeriodMs >= 0 && (timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs; } ``` With `logRollPeriodMs / 3` worth of duration, `tooLongSinceLastLoad()` returns false, whereas with a `logRollPeriodMs * 3` duration it returns true. E.g. with logRollPeriodMs = 5 sec: with logRollPeriodMs/3, timer.monotonicNow() = lastRollTimeMs + 5/3 = lastRollTimeMs + 1, so timer.monotonicNow() - lastRollTimeMs = 1, and hence `(timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs` is false (1 < 5). With `logRollPeriodMs * 3`, timer.monotonicNow() = lastRollTimeMs + 5*3 = lastRollTimeMs + 15, so timer.monotonicNow() - lastRollTimeMs = 15, and hence the condition is true (15 > 5). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640619) Time Spent: 9h 40m (was: 9.5h) > TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky > - > > Key: HDFS-16143 > URL: https://issues.apache.org/jira/browse/HDFS-16143 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: test >Reporter: Akira Ajisaka >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > > Time Spent: 9h 40m > Remaining Estimate: 0h > > https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > {quote} > [ERROR] > testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer) > Time elapsed: 6.862 s <<< FAILURE! 
> java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:87) > at org.junit.Assert.assertTrue(Assert.java:42) > at org.junit.Assert.assertTrue(Assert.java:53) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
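The `logRollPeriodMs / 3` versus `logRollPeriodMs * 3` reasoning above boils down to a single predicate. `rollDue` below is a hypothetical stand-in that mirrors the quoted `tooLongSinceLastLoad()` condition; it is not Hadoop's actual method:

```java
// Worked check of the roll predicate: a roll is due only when the elapsed
// time since the last roll strictly exceeds logRollPeriodMs.
public class LogRollMath {

    static boolean rollDue(long elapsedMs, long logRollPeriodMs) {
        // A negative period disables rolling entirely.
        return logRollPeriodMs >= 0 && elapsedMs > logRollPeriodMs;
    }

    public static void main(String[] args) {
        long periodMs = 5000; // e.g. a 5-second roll period
        // Advancing the fake timer by period / 3 keeps the elapsed time
        // below the period, so no roll is triggered.
        System.out.println(rollDue(periodMs / 3, periodMs)); // prints false (1666 < 5000)
        // Advancing by period * 3 is comfortably past the period.
        System.out.println(rollDue(periodMs * 3, periodMs)); // prints true (15000 > 5000)
    }
}
```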
[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
[ https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640613=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640613 ] ASF GitHub Bot logged work on HDFS-16143: - Author: ASF GitHub Bot Created on: 23/Aug/21 08:29 Start Date: 23/Aug/21 08:29 Worklog Time Spent: 10m Work Description: jojochuang commented on a change in pull request #3235: URL: https://github.com/apache/hadoop/pull/3235#discussion_r693765302 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java ## @@ -429,19 +432,29 @@ public void testStandbyTriggersLogRollsWhenTailInProgressEdits() waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId, standbyCatchupWaitTime); + long curTime = standby.getNamesystem().getEditLogTailer().getTimer() + .monotonicNow(); + long inSufficientTimeForLogRoll = logRollPeriodMs / 3; + final FakeTimer testTimer = + new FakeTimer(curTime + inSufficientTimeForLogRoll); + standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer); Review comment: just a thought. it would be great if we can refactor the MiniDfsCluster, the NameNode, FSNamesystem and EditLogTailer such that they take a FakeTimer as a parameter during initialization. If all the tests adopt the way of FakeTimer we wouldn't have so many flaky tests. But I reckon it's out of scope of this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640613) Time Spent: 9h 20m (was: 9h 10m) > TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky > - > > Key: HDFS-16143 > URL: https://issues.apache.org/jira/browse/HDFS-16143 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: test >Reporter: Akira Ajisaka >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > > Time Spent: 9h 20m > Remaining Estimate: 0h > > https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > {quote} > [ERROR] > testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer) > Time elapsed: 6.862 s <<< FAILURE! > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:87) > at org.junit.Assert.assertTrue(Assert.java:42) > at org.junit.Assert.assertTrue(Assert.java:53) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16143) TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky
[ https://issues.apache.org/jira/browse/HDFS-16143?focusedWorklogId=640612=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640612 ] ASF GitHub Bot logged work on HDFS-16143: - Author: ASF GitHub Bot Created on: 23/Aug/21 08:28 Start Date: 23/Aug/21 08:28 Worklog Time Spent: 10m Work Description: tasanuma commented on a change in pull request #3235: URL: https://github.com/apache/hadoop/pull/3235#discussion_r693746433 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java ## @@ -429,19 +432,29 @@ public void testStandbyTriggersLogRollsWhenTailInProgressEdits() waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId, standbyCatchupWaitTime); + long curTime = standby.getNamesystem().getEditLogTailer().getTimer() + .monotonicNow(); + long inSufficientTimeForLogRoll = logRollPeriodMs / 3; Review comment: I feel `inSufficient` means `in sufficient`. I prefer `insufficientTimeForLogRoll` to `inSufficientTimeForLogRoll`. 
## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java ## @@ -429,19 +432,29 @@ public void testStandbyTriggersLogRollsWhenTailInProgressEdits() waitForStandbyToCatchUpWithInProgressEdits(standby, activeTxId, standbyCatchupWaitTime); + long curTime = standby.getNamesystem().getEditLogTailer().getTimer() + .monotonicNow(); + long inSufficientTimeForLogRoll = logRollPeriodMs / 3; + final FakeTimer testTimer = + new FakeTimer(curTime + inSufficientTimeForLogRoll); + standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer); + Thread.sleep(2000); + for (int i = DIRS_TO_MAKE / 2; i < DIRS_TO_MAKE; i++) { NameNodeAdapter.mkdirs(active, getDirPath(i), new PermissionStatus("test", "test", new FsPermission((short)00755)), true); } - boolean exceptionThrown = false; try { checkForLogRoll(active, origTxId, noLogRollWaitTime); +fail("Expected to timeout"); } catch (TimeoutException e) { -exceptionThrown = true; +// expected } - assertTrue(exceptionThrown); + + long sufficientTimeForLogRoll = logRollPeriodMs * 3; Review comment: Why do we multiply by 3 here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640612) Time Spent: 9h 10m (was: 9h) > TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky > - > > Key: HDFS-16143 > URL: https://issues.apache.org/jira/browse/HDFS-16143 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: test >Reporter: Akira Ajisaka >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > > Time Spent: 9h 10m > Remaining Estimate: 0h > > https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3229/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt > {quote} > [ERROR] > testStandbyTriggersLogRollsWhenTailInProgressEdits[0](org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer) > Time elapsed: 6.862 s <<< FAILURE! > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:87) > at org.junit.Assert.assertTrue(Assert.java:42) > at org.junit.Assert.assertTrue(Assert.java:53) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer.testStandbyTriggersLogRollsWhenTailInProgressEdits(TestEditLogTailer.java:444) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16180) FsVolumeImpl.nextBlock should consider that the block meta file has been deleted.
[ https://issues.apache.org/jira/browse/HDFS-16180?focusedWorklogId=640607=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640607 ] ASF GitHub Bot logged work on HDFS-16180: - Author: ASF GitHub Bot Created on: 23/Aug/21 08:03 Start Date: 23/Aug/21 08:03 Worklog Time Spent: 10m Work Description: jojochuang commented on a change in pull request #3315: URL: https://github.com/apache/hadoop/pull/3315#discussion_r693287224 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java ## @@ -865,7 +866,15 @@ public ExtendedBlock nextBlock() throws IOException { } File blkFile = getBlockFile(bpid, block); - File metaFile = FsDatasetUtil.findMetaFile(blkFile); + File metaFile ; + try { + metaFile = FsDatasetUtil.findMetaFile(blkFile); + } catch (FileNotFoundException e){ +LOG.warn("nextBlock({}, {}): {}", storageID, bpid, Review comment: can you make the log message more explicit? Like "Metadata file for block file is missing. Skip it" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640607) Time Spent: 1h 20m (was: 1h 10m) > FsVolumeImpl.nextBlock should consider that the block meta file has been > deleted. 
> - > > Key: HDFS-16180 > URL: https://issues.apache.org/jira/browse/HDFS-16180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Minor > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > In my cluster, we found that when VolumeScanner run, sometime dn will throw > some error log below > ``` > > 2021-08-19 08:00:11,549 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: > Deleted BP-1020175758-nnip-1597745872895 blk_1142977964_69237147 URI > file:/disk1/dfs/data/current/BP-1020175758- > nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964 > 2021-08-19 08:00:48,368 ERROR > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl: > nextBlock(DS-060c8e4c-1ef6-49f5-91ef-91957356891a, BP-1020175758- > nnip-1597745872895): I/O error > java.io.IOException: Meta file not found, > blockFile=/disk1/dfs/data/current/BP-1020175758- > nnip-1597745872895/current/finalized/subdir0/subdir21/blk_1142977964 > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetUtil.findMetaFile(FsDatasetUtil.java:101) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.nextBlock(FsVolumeImpl.java:809) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:528) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:628) > 2021-08-19 08:00:48,368 WARN > org.apache.hadoop.hdfs.server.datanode.VolumeScanner: > VolumeScanner(/disk1/dfs/data, DS-060c8e4c-1ef6-49f5-91ef-91957356891a): > nextBlock error on > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@7febc6b4 > ``` > When VolumeScanner scan block blk_1142977964, it has been deleted by > datanode, scanner can not find the meta file of blk_1142977964, so it throw > these error log. 
> > Maybe we should handle FileNotFoundException during nextblock to reduce error > log and nextblock retry times. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
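The skip-and-continue handling discussed in this patch can be sketched in isolation. `findMetaFile` and `scanBlocks` below are simplified stand-ins for `FsDatasetUtil.findMetaFile` and the volume's block iterator, not the real HDFS code:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: when a block's meta file was deleted between listing
// and scanning, log a warning and skip the block instead of aborting.
public class NextBlockSketch {

    static File findMetaFile(File blkFile, List<String> existingMeta)
            throws FileNotFoundException {
        String metaName = blkFile.getName() + ".meta";
        if (!existingMeta.contains(metaName)) {
            throw new FileNotFoundException("Meta file not found, blockFile=" + blkFile);
        }
        return new File(blkFile.getParentFile(), metaName);
    }

    /** Returns the meta file names of blocks that still exist, skipping the rest. */
    static List<String> scanBlocks(List<File> blockFiles, List<String> existingMeta) {
        List<String> scanned = new ArrayList<>();
        for (File blk : blockFiles) {
            File metaFile;
            try {
                metaFile = findMetaFile(blk, existingMeta);
            } catch (FileNotFoundException e) {
                // Block was deleted concurrently: warn and move on rather than
                // failing the whole iteration with an I/O error.
                System.out.println("Metadata file for " + blk.getName()
                    + " is missing; skipping it");
                continue;
            }
            scanned.add(metaFile.getName());
        }
        return scanned;
    }

    public static void main(String[] args) {
        List<File> blocks = List.of(new File("/data/blk_1"), new File("/data/blk_2"));
        List<String> meta = List.of("blk_1.meta"); // blk_2's meta was deleted
        System.out.println(scanBlocks(blocks, meta)); // prints [blk_1.meta]
    }
}
```

This is the same shape as the patch: catch `FileNotFoundException` around the meta-file lookup and continue to the next block, so a concurrent deletion no longer surfaces as an ERROR-level I/O failure in the scanner.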
[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403009#comment-17403009 ] Max Xie commented on HDFS-16182: - [~weichiu] [~sodonnell] Any thoughts about it? Thanks for the reviews. > numOfReplicas is given the wrong value in > BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with > Heterogeneous Storage > --- > > Key: HDFS-16182 > URL: https://issues.apache.org/jira/browse/HDFS-16182 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16182.patch > > Time Spent: 10m > Remaining Estimate: 0h > > In our HDFS cluster, we use heterogeneous storage to store data on SSD for > better performance. Sometimes, when the HDFS client transfers data through a > pipeline, it throws an IOException and exits. Exception logs are below: > ``` > java.io.IOException: Failed to replace a bad datanode on the existing > pipeline due to no more good datanodes being available to try. (Nodes: > current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK], > > DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD], > > DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD], > > DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]], > > original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]). > The current failed datanode replacement policy is DEFAULT, and a client may > configure this via > 'dfs.client.block.write.replace-datanode-on-failure.policy' in its > configuration. 
> ``` > After investigating, I found that when an existing pipeline needs a new dn to > continue transferring data, the client gets one additional dn from the namenode and checks > that the number of dns is the original number + 1. > ``` > ## DataStreamer$findNewDatanode > if (nodes.length != original.length + 1) { > throw new IOException( > "Failed to replace a bad datanode on the existing pipeline " > + "due to no more good datanodes being available to try. " > + "(Nodes: current=" + Arrays.asList(nodes) > + ", original=" + Arrays.asList(original) + "). " > + "The current failed datanode replacement policy is " > + dfsClient.dtpReplaceDatanodeOnFailure > + ", and a client may configure this via '" > + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY > + "' in its configuration."); > } > ``` > The root cause is that Namenode$getAdditionalDatanode returns multiple > datanodes, not just one, to DataStreamer.addDatanode2ExistingPipeline. > > Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think > numOfReplicas should not be assigned from requiredStorageTypes. > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
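The `findNewDatanode` invariant quoted above can be modeled as a self-contained sketch. Node names and the `String[]` signature are illustrative (the real method operates on `DatanodeInfo` arrays), but the length check is the one shown in the description:

```java
import java.io.IOException;
import java.util.Arrays;

// Sketch of the client-side invariant: after asking the namenode for ONE
// replacement datanode, the new pipeline must be exactly one node longer
// than the original, or the write fails.
public class FindNewDatanodeSketch {

    static String findNewDatanode(String[] original, String[] nodes) throws IOException {
        if (nodes.length != original.length + 1) {
            throw new IOException("Failed to replace a bad datanode on the existing "
                + "pipeline: current=" + Arrays.asList(nodes)
                + ", original=" + Arrays.asList(original));
        }
        // The single node present in `nodes` but absent from `original` is the new one.
        for (String n : nodes) {
            if (!Arrays.asList(original).contains(n)) {
                return n;
            }
        }
        throw new IOException("Failed to find the new datanode");
    }

    public static void main(String[] args) throws IOException {
        String[] original = {"dn01", "dn02"};
        // Expected case: the namenode returned exactly one extra node.
        System.out.println(findNewDatanode(original,
            new String[]{"dn01", "dn02", "dn03"})); // prints dn03
        // Failure mode from this issue: the namenode hands back three fresh
        // SSD nodes, so the length check trips and the client aborts.
        try {
            findNewDatanode(original,
                new String[]{"dn01", "dn02", "ssd1", "ssd2", "ssd3"});
        } catch (IOException e) {
            System.out.println("rejected: pipeline grew by more than one node");
        }
    }
}
```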
[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Xie updated HDFS-16182: Attachment: HDFS-16182.patch Status: Patch Available (was: In Progress) > numOfReplicas is given the wrong value in > BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with > Heterogeneous Storage > --- > > Key: HDFS-16182 > URL: https://issues.apache.org/jira/browse/HDFS-16182 > Project: Hadoop HDFS > Issue Type: Bug > Components: namanode >Affects Versions: 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16182.patch > > Time Spent: 10m > Remaining Estimate: 0h > > In our hdfs cluster, we use heterogeneous storage to store data in SSD for a > better performance. Sometimes hdfs client transfer data in pipline, it will > throw IOException and exit. Exception logs are below: > ``` > java.io.IOException: Failed to replace a bad datanode on the existing > pipeline due to no more good datanodes being available to try. (Nodes: > current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK], > > DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD], > > DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD], > > DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]], > > original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]). > The current failed datanode replacement policy is DEFAULT, and a client may > configure this via > 'dfs.client.block.write.replace-datanode-on-failure.policy' in its > configuration. 
> ``` > After search it, I found when existing pipline need replace new dn to > transfer data, the client will get one additional dn from namenode and check > that the number of dn is the original number + 1. > ``` > ## DataStreamer$findNewDatanode > if (nodes.length != original.length + 1) { > throw new IOException( > "Failed to replace a bad datanode on the existing pipeline " > + "due to no more good datanodes being available to try. " > + "(Nodes: current=" + Arrays.asList(nodes) > + ", original=" + Arrays.asList(original) + "). " > + "The current failed datanode replacement policy is " > + dfsClient.dtpReplaceDatanodeOnFailure > + ", and a client may configure this via '" > + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY > + "' in its configuration."); > } > ``` > The root cause is that Namenode$getAdditionalDatanode returns multi datanodes > , not one in DataStreamer.addDatanode2ExistingPipeline. > > Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think > numOfReplicas should not be assigned by requiredStorageTypes. > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Xie updated HDFS-16182: Attachment: (was: HDFS-16182.patch) > numOfReplicas is given the wrong value in > BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with > Heterogeneous Storage > --- > > Key: HDFS-16182 > URL: https://issues.apache.org/jira/browse/HDFS-16182 > Project: Hadoop HDFS > Issue Type: Bug > Components: namanode >Affects Versions: 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In our hdfs cluster, we use heterogeneous storage to store data in SSD for a > better performance. Sometimes hdfs client transfer data in pipline, it will > throw IOException and exit. Exception logs are below: > ``` > java.io.IOException: Failed to replace a bad datanode on the existing > pipeline due to no more good datanodes being available to try. (Nodes: > current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK], > > DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD], > > DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD], > > DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]], > > original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]). > The current failed datanode replacement policy is DEFAULT, and a client may > configure this via > 'dfs.client.block.write.replace-datanode-on-failure.policy' in its > configuration. 
> ``` > After search it, I found when existing pipline need replace new dn to > transfer data, the client will get one additional dn from namenode and check > that the number of dn is the original number + 1. > ``` > ## DataStreamer$findNewDatanode > if (nodes.length != original.length + 1) { > throw new IOException( > "Failed to replace a bad datanode on the existing pipeline " > + "due to no more good datanodes being available to try. " > + "(Nodes: current=" + Arrays.asList(nodes) > + ", original=" + Arrays.asList(original) + "). " > + "The current failed datanode replacement policy is " > + dfsClient.dtpReplaceDatanodeOnFailure > + ", and a client may configure this via '" > + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY > + "' in its configuration."); > } > ``` > The root cause is that Namenode$getAdditionalDatanode returns multi datanodes > , not one in DataStreamer.addDatanode2ExistingPipeline. > > Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think > numOfReplicas should not be assigned by requiredStorageTypes. > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403006#comment-17403006 ] Max Xie commented on HDFS-16182: - In my cluster, we use BlockPlacementPolicyDefault to choose dn and the number of SSD DN is much less than DISK DN. It may cause to some block that should be placed to SSD DNs fallback to place DISK DNs. > numOfReplicas is given the wrong value in > BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with > Heterogeneous Storage > --- > > Key: HDFS-16182 > URL: https://issues.apache.org/jira/browse/HDFS-16182 > Project: Hadoop HDFS > Issue Type: Bug > Components: namanode >Affects Versions: 3.4.0 >Reporter: Max Xie >Assignee: Max Xie >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16182.patch > > Time Spent: 10m > Remaining Estimate: 0h > > In our hdfs cluster, we use heterogeneous storage to store data in SSD for a > better performance. Sometimes hdfs client transfer data in pipline, it will > throw IOException and exit. Exception logs are below: > ``` > java.io.IOException: Failed to replace a bad datanode on the existing > pipeline due to no more good datanodes being available to try. (Nodes: > current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK], > > DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD], > > DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD], > > DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]], > > original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK], > > DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]). 
> The current failed datanode replacement policy is DEFAULT, and a client may > configure this via > 'dfs.client.block.write.replace-datanode-on-failure.policy' in its > configuration. > ``` > After search it, I found when existing pipline need replace new dn to > transfer data, the client will get one additional dn from namenode and check > that the number of dn is the original number + 1. > ``` > ## DataStreamer$findNewDatanode > if (nodes.length != original.length + 1) { > throw new IOException( > "Failed to replace a bad datanode on the existing pipeline " > + "due to no more good datanodes being available to try. " > + "(Nodes: current=" + Arrays.asList(nodes) > + ", original=" + Arrays.asList(original) + "). " > + "The current failed datanode replacement policy is " > + dfsClient.dtpReplaceDatanodeOnFailure > + ", and a client may configure this via '" > + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY > + "' in its configuration."); > } > ``` > The root cause is that Namenode$getAdditionalDatanode returns multi datanodes > , not one in DataStreamer.addDatanode2ExistingPipeline. > > Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think > numOfReplicas should not be assigned by requiredStorageTypes. > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
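The suspected root cause, numOfReplicas being overwritten from the required storage types, can be illustrated with back-of-the-envelope arithmetic. This models the behavior described in the report under stated assumptions; it is not the real `chooseTarget` code, and `requiredSsdReplicas` is a hypothetical helper:

```java
import java.util.List;

// Illustrative arithmetic: the client asks for ONE additional datanode for an
// All_SSD file whose surviving pipeline sits on DISK nodes. If numOfReplicas
// is overwritten with the count of still-required SSD storages, the namenode
// returns 3 nodes instead of the 1 the client expected.
public class NumOfReplicasSketch {

    /** SSD replicas still missing for an All_SSD policy, given current storages. */
    static int requiredSsdReplicas(int totalReplication, List<String> chosenStorages) {
        long ssdAlready = chosenStorages.stream().filter("SSD"::equals).count();
        return (int) (totalReplication - ssdAlready);
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("DISK", "DISK"); // surviving pipeline nodes
        int requested = 1; // the client asked for one replacement datanode

        int required = requiredSsdReplicas(3, pipeline);
        System.out.println(required); // prints 3: all 3 SSD replicas still "needed"

        // Buggy assignment: numOfReplicas derived from required storage types.
        int numOfReplicasBuggy = required;
        // What the report argues for: honor the caller's request.
        int numOfReplicasFixed = requested;
        System.out.println(numOfReplicasBuggy + " vs " + numOfReplicasFixed); // prints 3 vs 1
    }
}
```

The mismatch (3 returned versus original + 1 expected) is exactly what trips the `findNewDatanode` length check and produces the "Failed to replace a bad datanode" IOException.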
[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Xie updated HDFS-16182: Attachment: HDFS-16182.patch
[jira] [Assigned] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Xie reassigned HDFS-16182: --- Assignee: Max Xie
[jira] [Work started] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16182 started by Max Xie. ---
[jira] [Updated] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16182: -- Labels: pull-request-available (was: )
[jira] [Work logged] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
[ https://issues.apache.org/jira/browse/HDFS-16182?focusedWorklogId=640600=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640600 ] ASF GitHub Bot logged work on HDFS-16182: - Author: ASF GitHub Bot Created on: 23/Aug/21 07:30 Start Date: 23/Aug/21 07:30 Worklog Time Spent: 10m Work Description: Neilxzn opened a new pull request #3320: URL: https://github.com/apache/hadoop/pull/3320 ### Description of PR https://issues.apache.org/jira/browse/HDFS-16182 ### How was this patch tested? add TestBlockStoragePolicy.testAddDatanode2ExistingPipelineInSsd ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640600) Remaining Estimate: 0h Time Spent: 10m
[jira] [Created] (HDFS-16182) numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage
Max Xie created HDFS-16182: --- Summary: numOfReplicas is given the wrong value in BlockPlacementPolicyDefault$chooseTarget can cause DataStreamer to fail with Heterogeneous Storage Key: HDFS-16182 URL: https://issues.apache.org/jira/browse/HDFS-16182 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.4.0 Reporter: Max Xie
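The reporter's suggested direction for HDFS-16182 (do not let the required storage-type count overwrite numOfReplicas) can be illustrated with a small, hypothetical model. ChooseTargetSketch and its parameters are invented for illustration and are not the actual BlockPlacementPolicyDefault code:

```java
import java.util.Arrays;
import java.util.List;

public class ChooseTargetSketch {
    // Models the suspected bug: the caller asks for `numOfReplicas`
    // additional targets (1, for a pipeline replacement), but the count is
    // overwritten by the number of storages the policy still requires
    // (3 SSDs under ALL_SSD when the pipeline holds only DISK replicas).
    static int targetsToChoose(int numOfReplicas, List<String> requiredStorageTypes,
                               boolean overwriteFromRequired) {
        if (overwriteFromRequired) {
            numOfReplicas = requiredStorageTypes.size(); // suspected bug
        }
        return numOfReplicas; // fix direction: keep the caller's request
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("SSD", "SSD", "SSD");
        // Buggy path: 3 targets come back, but DataStreamer expects exactly 1.
        System.out.println(targetsToChoose(1, required, true));
        // Suggested behavior: honor the requested count.
        System.out.println(targetsToChoose(1, required, false));
    }
}
```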
[jira] [Updated] (HDFS-16181) [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when tailEditLog form JN
[ https://issues.apache.org/jira/browse/HDFS-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhaohui updated HDFS-16181: --- Summary: [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when tailEditLog form JN (was: [SBN] Fix metric of RpcRequestCacheMissAmount can't display when tailEditLog form JN) > [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when > tailEditLog form JN > - > > Key: HDFS-16181 > URL: https://issues.apache.org/jira/browse/HDFS-16181 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: wangzhaohui >Assignee: wangzhaohui >Priority: Critical > Labels: pull-request-available > Attachments: after.jpg, before.jpg > > Time Spent: 20m > Remaining Estimate: 0h > > I found the JN turn on edit cache, but the metric of > rpcRequestCacheMissAmount can not display. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16179) Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too much logs
[ https://issues.apache.org/jira/browse/HDFS-16179?focusedWorklogId=640591=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640591 ] ASF GitHub Bot logged work on HDFS-16179: - Author: ASF GitHub Bot Created on: 23/Aug/21 07:30 Start Date: 23/Aug/21 07:30 Worklog Time Spent: 10m Work Description: tomscut commented on pull request #3313: URL: https://github.com/apache/hadoop/pull/3313#issuecomment-903480361 > @tomscut Hi, does this WARN log get printed only when writing EC files? These WARN logs also appeared in our cluster without writing any files, but not as many as you said. > I found that the block in the WARN log belongs to a file written a long time ago. So, I have some guesses: > * is there a daemon thread calling this method? > * or do other conditions trigger this method? > Here is our 3-hour running log. > ![image](https://user-images.githubusercontent.com/18388154/130396631-14db5ce7-0e35-442d-b0d8-f38486ab5496.png) Yes, those WARN logs were printed only when writing EC files, because the logs are printed in BlockManager#chooseExcessRedundancyStriped(). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640591) Time Spent: 2h 50m (was: 2h 40m) > Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too > much logs > - > > Key: HDFS-16179 > URL: https://issues.apache.org/jira/browse/HDFS-16179 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Attachments: log-count.jpg, logs.jpg > > Time Spent: 2h 50m > Remaining Estimate: 0h > > {code:java} > private void chooseExcessRedundancyStriped(BlockCollection bc, > final Collection nonExcess, > BlockInfo storedBlock, > DatanodeDescriptor delNodeHint) { > ... > // cardinality of found indicates the expected number of internal blocks > final int numOfTarget = found.cardinality(); > final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy( > bc.getStoragePolicyID()); > final List excessTypes = storagePolicy.chooseExcess( > (short) numOfTarget, DatanodeStorageInfo.toStorageTypes(nonExcess)); > if (excessTypes.isEmpty()) { > LOG.warn("excess types chosen for block {} among storages {} is empty", > storedBlock, nonExcess); > return; > } > ... > } > {code} > > IMO, this code path only detects excess StorageTypes, so lowering this log > to debug level has no adverse effect. > > We have a cluster that uses the EC policy to store data.
The current log > level here is WARN, and in about 50 minutes 286,093 of these messages were > printed, which can drown out other important logs. > > !logs.jpg|width=1167,height=62! > > !log-count.jpg|width=760,height=30!
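The change proposed in HDFS-16179 is simply to demote this message from warn to debug. Below is a sketch of the idea using java.util.logging (Hadoop itself uses slf4j; the class and method names here are invented for illustration, not the actual BlockManager code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ExcessRedundancyLogging {
    private static final Logger LOG = Logger.getLogger("BlockManagerSketch");

    // Emits the "excess types ... is empty" message at FINE (roughly
    // slf4j's debug), so it is suppressed under the default INFO level
    // instead of flooding the log on every completed EC file.
    static void reportEmptyExcessTypes(String block, String storages) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("excess types chosen for block " + block
                + " among storages " + storages + " is empty");
        }
    }

    public static void main(String[] args) {
        // With the JVM's default INFO level this prints nothing; flip the
        // logger to Level.FINE while debugging to see the message again.
        reportEmptyExcessTypes("blk_1073741825", "[dn01:DISK, dn02:DISK]");
        System.out.println("message routed to debug level (suppressed by default)");
    }
}
```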
[jira] [Work logged] (HDFS-16179) Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too much logs
[ https://issues.apache.org/jira/browse/HDFS-16179?focusedWorklogId=640590=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640590 ] ASF GitHub Bot logged work on HDFS-16179: - Author: ASF GitHub Bot Created on: 23/Aug/21 06:23 Start Date: 23/Aug/21 06:23 Worklog Time Spent: 10m Work Description: tomscut edited a comment on pull request #3313: URL: https://github.com/apache/hadoop/pull/3313#issuecomment-903476259 > @tomscut Hi, does this WARN log be printed when only writing EC files ? This WARN logs also appeared in our cluster without writting any files, but not as many as you said. > I found that the block in the WARN log belongs to the file written a long time ago. So, I have some guesses: > > * is there a daemon thread calling this method? > * or other conditions trigger this method? > > Here is our 3-hour running log. > ![image](https://user-images.githubusercontent.com/18388154/130396631-14db5ce7-0e35-442d-b0d8-f38486ab5496.png) Thanks @whbing for your comments. I found those logs were printed after completeFile. Triggered by FSDirWriteFileOp#completeFileInternal(). Our hadoop version is 3.1.0. ``` private static boolean completeFileInternal( FSNamesystem fsn, INodesInPath iip, String holder, Block last, long fileId) throws IOException { (...) fsn.finalizeINodeFileUnderConstruction(src, pendingFile, Snapshot.CURRENT_STATE_ID, true); (...) return true; } ``` ![image](https://user-images.githubusercontent.com/55134131/130398711-1fa0d1dc-8c46-4f8f-b7f1-2459dca3c5c4.png) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 640590) Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (HDFS-16175) Improve the configurable value of Server #PURGE_INTERVAL_NANOS
[ https://issues.apache.org/jira/browse/HDFS-16175?focusedWorklogId=640589=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640589 ] ASF GitHub Bot logged work on HDFS-16175: - Author: ASF GitHub Bot Created on: 23/Aug/21 06:21 Start Date: 23/Aug/21 06:21 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #3307: URL: https://github.com/apache/hadoop/pull/3307#issuecomment-903477996 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 49s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 31m 5s | | trunk passed | | +1 :green_heart: | compile | 22m 38s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | compile | 19m 3s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | checkstyle | 1m 2s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 35s | | trunk passed | | +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 2m 34s | | trunk passed | | +1 :green_heart: | shadedclient | 15m 52s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 55s | | the patch passed | | +1 :green_heart: | compile | 22m 23s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javac | 22m 23s | | the patch passed | | +1 :green_heart: | compile | 19m 42s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | javac | 19m 42s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 3s | [/results-checkstyle-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3307/5/artifact/out/results-checkstyle-hadoop-common-project_hadoop-common.txt) | hadoop-common-project/hadoop-common: The patch generated 1 new + 282 unchanged - 0 fixed = 283 total (was 282) | | +1 :green_heart: | mvnsite | 1m 36s | | the patch passed | | +1 :green_heart: | xml | 0m 1s | | The patch has no ill-formed XML file. | | +1 :green_heart: | javadoc | 1m 6s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 1m 38s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 2m 54s | | the patch passed | | +1 :green_heart: | shadedclient | 18m 26s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 41m 59s | [/patch-unit-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3307/5/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt) | hadoop-common in the patch passed. | | +1 :green_heart: | asflicense | 1m 1s | | The patch does not generate ASF License warnings. 
| | | | 210m 1s | |  |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.ha.TestZKFailoverControllerStress |
| | hadoop.ipc.TestRetryCache |
| | hadoop.ipc.TestCallQueueManager |
| | hadoop.metrics2.source.TestJvmMetrics |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3307/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/3307 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell xml |
| uname | Linux 1b8547543f35 4.15.0-151-generic #157-Ubuntu SMP Fri Jul 9 23:07:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk /
[jira] [Work logged] (HDFS-16179) Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too much logs
[ https://issues.apache.org/jira/browse/HDFS-16179?focusedWorklogId=640588=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640588 ] ASF GitHub Bot logged work on HDFS-16179: - Author: ASF GitHub Bot Created on: 23/Aug/21 06:17 Start Date: 23/Aug/21 06:17 Worklog Time Spent: 10m Work Description: tomscut commented on pull request #3313: URL: https://github.com/apache/hadoop/pull/3313#issuecomment-903476259

> @tomscut Hi, is this WARN log printed only when writing EC files? This WARN log also appeared in our cluster without writing any files, but not as many as you said.
> I found that the block in the WARN log belongs to a file written a long time ago. So, I have some guesses:
>
> * is there a daemon thread calling this method?
> * or do other conditions trigger this method?
>
> Here is our 3-hour running log.
> ![image](https://user-images.githubusercontent.com/18388154/130396631-14db5ce7-0e35-442d-b0d8-f38486ab5496.png)

Thanks @whbing for your comments. I found those logs were printed after completeFile, triggered by FSDirWriteFileOp#completeFileInternal().
```
private static boolean completeFileInternal(
    FSNamesystem fsn, INodesInPath iip,
    String holder, Block last, long fileId)
    throws IOException {
  (...)
  fsn.finalizeINodeFileUnderConstruction(src, pendingFile,
      Snapshot.CURRENT_STATE_ID, true);
  (...)
  return true;
}
```
![image](https://user-images.githubusercontent.com/55134131/130398711-1fa0d1dc-8c46-4f8f-b7f1-2459dca3c5c4.png)
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
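[Editor's note] The fix discussed in this thread is to lower the log level of this message from WARN to DEBUG so it is suppressed under the NameNode's default log level. As a rough standalone illustration of why that silences the flood — using `java.util.logging` rather than Hadoop's SLF4J logger, so the logger name and levels here are only an analogy — a record below the logger's configured level never reaches any handler:

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogLevelDemo {
    public static void main(String[] args) {
        Logger log = Logger.getLogger("BlockManagerDemo"); // hypothetical logger name
        log.setLevel(Level.INFO);          // typical production level
        log.setUseParentHandlers(false);   // keep the demo's output clean

        final int[] emitted = {0};
        Handler counter = new Handler() {  // handler that just counts delivered records
            @Override public void publish(LogRecord r) { emitted[0]++; }
            @Override public void flush() {}
            @Override public void close() {}
        };
        counter.setLevel(Level.ALL);
        log.addHandler(counter);

        // FINE (the analogue of DEBUG) is below INFO, so it is dropped.
        log.fine("excess types chosen for block ... is empty");
        // WARNING is above INFO, so it is delivered.
        log.warning("important message");

        System.out.println(emitted[0]); // 1: only the WARNING got through
    }
}
```

The same reasoning applies to SLF4J/Log4j in Hadoop: changing `LOG.warn(...)` to `LOG.debug(...)` means the 286,093 messages simply never reach the appenders unless DEBUG is explicitly enabled.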
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640588) Time Spent: 2.5h (was: 2h 20m) > Update loglevel for BlockManager#chooseExcessRedundancyStriped to avoid too > much logs > - > > Key: HDFS-16179 > URL: https://issues.apache.org/jira/browse/HDFS-16179 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Attachments: log-count.jpg, logs.jpg > > Time Spent: 2.5h > Remaining Estimate: 0h >
> {code:java}
> private void chooseExcessRedundancyStriped(BlockCollection bc,
>     final Collection<DatanodeStorageInfo> nonExcess,
>     BlockInfo storedBlock,
>     DatanodeDescriptor delNodeHint) {
>   ...
>   // cardinality of found indicates the expected number of internal blocks
>   final int numOfTarget = found.cardinality();
>   final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(
>       bc.getStoragePolicyID());
>   final List<StorageType> excessTypes = storagePolicy.chooseExcess(
>       (short) numOfTarget, DatanodeStorageInfo.toStorageTypes(nonExcess));
>   if (excessTypes.isEmpty()) {
>     LOG.warn("excess types chosen for block {} among storages {} is empty",
>         storedBlock, nonExcess);
>     return;
>   }
>   ...
> }
> {code}
>
> IMO, this code only detects that there is no excess StorageType, so lowering the log level to DEBUG here has no adverse effect.
>
> We have a cluster that uses the EC policy to store data. The log level here is currently WARN, and in about 50 minutes 286,093 such logs were printed, which can drown out other important logs.
>
> !logs.jpg|width=1167,height=62!
>
> !log-count.jpg|width=760,height=30! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
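[Editor's note] In the snippet quoted in the issue description, `found` is a `BitSet` over internal block indices of the striped group, so `found.cardinality()` counts how many distinct internal blocks were actually located; that count becomes `numOfTarget`. A minimal standalone sketch (the group width of 9 — as in an RS-6-3 policy — and the particular set indices are hypothetical, chosen only to illustrate the counting):

```java
import java.util.BitSet;

public class CardinalityDemo {
    public static void main(String[] args) {
        // One bit per internal block index of a striped block group
        // (e.g. 6 data + 3 parity = 9 for RS-6-3).
        BitSet found = new BitSet(9);

        // Suppose replicas for internal blocks 0, 2 and 5 were found.
        found.set(0);
        found.set(2);
        found.set(5);

        // cardinality() counts the set bits, i.e. the number of distinct
        // internal blocks located — the value used as numOfTarget above.
        short numOfTarget = (short) found.cardinality();
        System.out.println(numOfTarget); // 3
    }
}
```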