[jira] [Commented] (HDFS-15346) DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136267#comment-17136267 ] Jinglun commented on HDFS-15346: Hi [~linyiqun], thanks for your great comments! I have addressed the comments in v12, pending Jenkins. > DistCpFedBalance implementation > --- > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, > HDFS-15346.012.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15346) DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15346: --- Attachment: HDFS-15346.012.patch > DistCpFedBalance implementation > --- > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, > HDFS-15346.012.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294.
[jira] [Commented] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136263#comment-17136263 ] Yiqun Lin commented on HDFS-15294: -- As this feature tool is designed as a common tool like distcp, I removed the RBF label from all uncommitted subtasks. > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new balance command 'fedbalance' that is run by the > administrator. The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router. > 3. Delete the src to trash. > > The patch is too big to review, so I split it into 2 patches: > Phase 1 / The State Machine(BalanceProcedureScheduler): Including the > abstraction of job and scheduler model. > {code:java} > org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; > org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; > org.apache.hadoop.hdfs.procedure.BalanceProcedure; > org.apache.hadoop.hdfs.procedure.BalanceJob; > org.apache.hadoop.hdfs.procedure.BalanceJournal; > org.apache.hadoop.hdfs.procedure.HDFSJournal; > {code} > Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. 
HDFS-15346> > {code:java} > org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; > org.apache.hadoop.tools.DistCpFedBalance; > org.apache.hadoop.tools.DistCpProcedure; > org.apache.hadoop.tools.FedBalance; > org.apache.hadoop.tools.FedBalanceConfigs; > org.apache.hadoop.tools.FedBalanceContext; > org.apache.hadoop.tools.TrashProcedure; > {code}
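The three-step process described in HDFS-15294 (sync until src and dst match, update the mount table, move src to trash) can be pictured as a small job that runs its steps in order and records each finished step. The following is only a minimal sketch under stated assumptions: `Step`, `SyncStep`, `SimpleStep`, and `run` are hypothetical names invented for illustration, the in-memory list stands in for a journal, and none of this reproduces the `BalanceProcedureScheduler` or `DistCpFedBalance` classes in the patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FedBalanceSketch {

  /** Hypothetical stand-in for one step of a balance job. */
  interface Step {
    String name();

    /** Returns true once the step is complete, false if it should run again. */
    boolean execute();
  }

  /** Simulated sync step: each round applies one pending diff; done when none remain. */
  static final class SyncStep implements Step {
    private int pendingDiffs;

    SyncStep(int pendingDiffs) {
      this.pendingDiffs = pendingDiffs;
    }

    public String name() {
      return "sync";
    }

    public boolean execute() {
      if (pendingDiffs > 0) {
        pendingDiffs--; // pretend one copy round applied one snapshot diff
      }
      return pendingDiffs == 0;
    }
  }

  /** Simulated step that always finishes in a single round. */
  static final class SimpleStep implements Step {
    private final String name;

    SimpleStep(String name) {
      this.name = name;
    }

    public String name() {
      return name;
    }

    public boolean execute() {
      return true;
    }
  }

  /** Runs the steps in order and records the name of each completed step. */
  static List<String> run(List<Step> steps) {
    List<String> journal = new ArrayList<>();
    for (Step step : steps) {
      while (!step.execute()) {
        // The step asked to be run again, e.g. src and dst still differ.
      }
      journal.add(step.name());
    }
    return journal;
  }

  public static void main(String[] args) {
    List<String> journal = run(Arrays.asList(
        new SyncStep(3),
        new SimpleStep("update-mount-table"),
        new SimpleStep("trash-src")));
    System.out.println(journal);
  }
}
```

Recording each completed step is what would let a real scheduler resume a job after a restart instead of starting over; the sketch keeps the record in memory only.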
[jira] [Updated] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15410: - Summary: Add separated config file fedbalance-default.xml for fedbalance tool (was: RBF: Add separated config file fedbalance-default.xml for fedbalance tool) > Add separated config file fedbalance-default.xml for fedbalance tool > > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > > Add a separate config file named fedbalance-default.xml for fedbalance tool > configs. It's like the distcp-default.xml for the distcp tool.
[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15374: - Summary: Add documentation for fedbalance tool (was: RBF: Add documentation for fedbalance tool) > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15374.001.patch > >
[jira] [Updated] (HDFS-15346) DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15346: - Summary: DistCpFedBalance implementation (was: RBF: DistCpFedBalance implementation) > DistCpFedBalance implementation > --- > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294.
[jira] [Updated] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15294: - Summary: Federation balance tool (was: RBF: Balance data across federation namespaces with DistCp and snapshot diff) > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new balance command 'fedbalance' that is run by the > administrator. The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router. > 3. Delete the src to trash. > > The patch is too big to review, so I split it into 2 patches: > Phase 1 / The State Machine(BalanceProcedureScheduler): Including the > abstraction of job and scheduler model. > {code:java} > org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; > org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; > org.apache.hadoop.hdfs.procedure.BalanceProcedure; > org.apache.hadoop.hdfs.procedure.BalanceJob; > org.apache.hadoop.hdfs.procedure.BalanceJournal; > org.apache.hadoop.hdfs.procedure.HDFSJournal; > {code} > Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. 
HDFS-15346> > {code:java} > org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; > org.apache.hadoop.tools.DistCpFedBalance; > org.apache.hadoop.tools.DistCpProcedure; > org.apache.hadoop.tools.FedBalance; > org.apache.hadoop.tools.FedBalanceConfigs; > org.apache.hadoop.tools.FedBalanceContext; > org.apache.hadoop.tools.TrashProcedure; > {code}
[jira] [Updated] (HDFS-15414) java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
[ https://issues.apache.org/jira/browse/HDFS-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YCozy updated HDFS-15414: - Description: We observed this exception in a DataNode's log while we are not shutting down any nodes in the cluster. Specifically, we have a cluster with 3 DataNodes (DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception occurs in DN3's log: {noformat} 2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 2020-06-08 21:53:03,381 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /app/dn3/current 2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 got java.net.SocketException: Original Exception : java.io.IOException: Broken pipe at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at 
sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223) at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620) at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Broken pipe ... 11 more{noformat} Port 9766 is DN2's address. Around the same time, we observe the following exceptions in DN2's log: {noformat} 2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: /127.0.0.1:47618 dest: /127.0.0.1:9766 2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created. 
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47618 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.{noformat} However, this exception doesn't look like the cause of the broken pipe because earlier DN2 has another occurrence of a ReplicaAlreadyExistsException, but DN3 only has one occurrence of broken pipe. Here's the other occurrence of ReplicaAlreadyExistsException on DN2: {noformat} 2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: /127.0.0.1:47462 dest: /127.0.0.1:9766 2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created. 2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src
[jira] [Updated] (HDFS-15414) java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
[ https://issues.apache.org/jira/browse/HDFS-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YCozy updated HDFS-15414: - Description: We observed this exception in a DataNode's log while we are not shutting down any nodes in the cluster. Specifically, we have a cluster with 3 DataNodes (DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception occurs in DN3's log: {noformat} 2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 2020-06-08 21:53:03,381 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /app/dn3/current 2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 got java.net.SocketException: Original Exception : java.io.IOException: Broken pipe at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at 
sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223) at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620) at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Broken pipe ... 11 more{noformat} Port 9766 is DN2's address. Around the same time, we observe the following exceptions in DN2's log: {noformat} 2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: /127.0.0.1:47618 dest: /127.0.0.1:9766 2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created. 
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47618 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.{noformat} However, this exception doesn't look like the cause of the broken pipe because earlier DN2 has another occurrence of a ReplicaAlreadyExistsException, but DN3 only has one occurrence of broken pipe. Here's the other occurrence of ReplicaAlreadyExistsException on DN2: {noformat} 2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: /127.0.0.1:47462 dest: /127.0.0.1:9766 2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created. 2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src:
[jira] [Created] (HDFS-15414) java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
YCozy created HDFS-15414: Summary: java.net.SocketException: Original Exception : java.io.IOException: Broken pipe Key: HDFS-15414 URL: https://issues.apache.org/jira/browse/HDFS-15414 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.10.0 Reporter: YCozy We observed this exception in a DataNode's log while we are not shutting down any nodes in the cluster. Specifically, we have a cluster with 3 DataNodes (DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception occurs in DN3's log: {noformat} 2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 2020-06-08 21:53:03,381 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /app/dn3/current 2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 got java.net.SocketException: Original Exception : java.io.IOException: Broken pipe at 
sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223) at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620) at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Broken pipe ... 11 more{noformat} Port 9766 is DN2's address. Around the same time, we observe the following exceptions in DN2's log: {noformat} 2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: /127.0.0.1:47618 dest: /127.0.0.1:9766 2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created. 
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47618 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.{noformat} However, this exception doesn't look like the cause of the broken pipe because earlier DN2 has another occurrence of a ReplicaAlreadyExistsException, but DN3 only has one occurrence of broken pipe. Here's the other occurrence of ReplicaAlreadyExistsException on DN2: {noformat} 2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: /127.0.0.1:47462 dest: /127.0.0.1:9766 2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_10737418
[jira] [Created] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
Andrey Elenskiy created HDFS-15413: -- Summary: DFSStripedInputStream throws exception when datanodes close idle connections Key: HDFS-15413 URL: https://issues.apache.org/jira/browse/HDFS-15413 Project: Hadoop HDFS Issue Type: Bug Components: ec, erasure-coding, hdfs-client Affects Versions: 3.1.3 Environment: - Hadoop 3.1.3 - erasure coding with ISA-L and RS-3-2-1024k scheme - running in kubernetes - dfs.client.socket-timeout = 1 - dfs.datanode.socket.write.timeout = 1 Reporter: Andrey Elenskiy Attachments: out.log

We've run into an issue with compactions failing in HBase when erasure coding is enabled on a table directory. After digging further I was able to narrow it down to a seek + read logic and able to reproduce the issue with hdfs client only:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;

public class ReaderRaw {
  public static void main(final String[] args) throws Exception {
    Path p = new Path(args[0]);
    int bufLen = Integer.parseInt(args[1]);
    int sleepDuration = Integer.parseInt(args[2]);
    int countBeforeSleep = Integer.parseInt(args[3]);
    int countAfterSleep = Integer.parseInt(args[4]);
    Configuration conf = new Configuration();
    FSDataInputStream istream = FileSystem.get(conf).open(p);
    byte[] buf = new byte[bufLen];
    int readTotal = 0;
    int count = 0;
    try {
      while (true) {
        istream.seek(readTotal);
        int bytesRemaining = bufLen;
        int bufOffset = 0;
        while (bytesRemaining > 0) {
          int nread = istream.read(buf, 0, bufLen);
          if (nread < 0) {
            throw new Exception("nread is less than zero");
          }
          readTotal += nread;
          bufOffset += nread;
          bytesRemaining -= nread;
        }
        count++;
        if (count == countBeforeSleep) {
          System.out.println("sleeping for " + sleepDuration + " milliseconds");
          Thread.sleep(sleepDuration);
          System.out.println("resuming");
        }
        if (count == countBeforeSleep + countAfterSleep) {
          System.out.println("done");
          break;
        }
      }
    } catch (Exception e) {
      System.out.println("exception on read " + count + " read total " + readTotal);
      throw e;
    }
  }
}
{code}
The issue appears to be due to the fact that datanodes close the connection of EC client if it doesn't fetch next packet for longer than dfs.client.socket-timeout. The EC client doesn't retry and instead assumes that those datanodes went away resulting in "missing blocks" exception. I was able to consistently reproduce with the following arguments:
{noformat}
bufLen = 100 (just below 1MB which is the size of the stripe)
sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000)
countBeforeSleep = 1
countAfterSleep = 7
{noformat}
I've attached the entire log output of running the snippet above against erasure coded file with RS-3-2-1024k policy. And here are the logs from datanodes of disconnecting the client:

datanode 1:
{noformat}
2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation src: /10.128.23.40:53748 dst: /10.128.14.46:9866); java.net.SocketTimeoutException: 1 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 remote=/10.128.23.40:53748]
{noformat}
datanode 2:
{noformat}
2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation src: /10.128.23.40:48772 dst: /10.128.9.42:9866); java.net.SocketTimeoutException: 1 millis timeout while waiting for channel to be ready for write.
ch : java.nio.channels.SocketChannel[connected local=/10.128.9.42:9866 remote=/10.128.23.40:48772]
{noformat}
datanode 3:
{noformat}
2020-06-15 19:06:20,467 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-3-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation src: /10.128.23.40:57184 dst: /10.128.16.13:9866); java.net.SocketTimeoutException: 1 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.16.13:9866 remote=/10.128.23.40:57184]
{noformat}
I've tried running the same code against non-EC files with replication
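
A side note on the reproducer's inner loop: it calls `read(buf, 0, bufLen)` on every pass, so a partial read is followed by another request for a full buffer written at offset 0, and `bufOffset` is never used. The sketch below shows the more usual form of such a loop, resuming at the current offset and asking only for the bytes still missing. It is a tidier read loop only, not a fix for the reported behaviour, and it is written against plain `java.io.InputStream` so it runs without Hadoop; `readUpTo` and `trickle` are hypothetical helper names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadLoopSketch {

  /**
   * Reads until len bytes are in buf or the stream ends.
   * Each call resumes at the current offset and asks only for what is left.
   */
  static int readUpTo(InputStream in, byte[] buf, int len) throws IOException {
    int offset = 0;
    while (offset < len) {
      int n = in.read(buf, offset, len - offset);
      if (n < 0) {
        break; // end of stream
      }
      offset += n;
    }
    return offset;
  }

  /** Demo helper that hands out at most two bytes per read call. */
  static InputStream trickle(byte[] data) {
    return new ByteArrayInputStream(data) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 2));
      }
    };
  }

  public static void main(String[] args) throws IOException {
    byte[] buf = new byte[4];
    int n = readUpTo(trickle(new byte[] {1, 2, 3, 4, 5, 6}), buf, buf.length);
    System.out.println(n + " bytes read, last byte = " + buf[3]);
  }
}
```

With the short-read stream above, the buffer is filled over two calls and never receives more than `len` bytes, which is the property the original inner loop does not guarantee.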
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136060#comment-17136060 ] Hadoop QA commented on HDFS-15406: --
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 27s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 2s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 5s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}109m 29s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 36s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}179m 51s{color} | {color:black} {color} |
\\ \\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
| | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
| | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
| | hadoop.hdfs.tools.TestDFSAdminWithHA |
| | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
| | hadoop.hdfs.TestReconstructStripedFile |
| | hadoop.hdfs.TestStripedFileAppend |
| | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
| | hadoop.hdfs.TestRollingUpgrade |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-HDFS-Build/29430/artifact/out/Dockerfile |
| JIRA Issue | HDFS-15406 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13005719/HDFS-15406.001.patch |
| Optional Tests | dupn
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hemanthboyina updated HDFS-15406:
    Attachment: HDFS-15406.001.patch
    Status: Patch Available (was: Open)

> Improve the speed of Datanode Block Scan
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: hemanthboyina
> Assignee: hemanthboyina
> Priority: Major
> Attachments: HDFS-15406.001.patch
>
> In our customer cluster, one DataNode holds approximately 10M blocks.
> Scanning all of these blocks took the DataNode nearly 5 minutes:
> {code:java}
> 2020-06-10 12:17:06,869 | INFO |
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty
> queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks:
> 11149530, missing metadata files:472, missing block files:472, missing blocks
> in memory:0, mismatched blocks:0 | DirectoryScanner.java:473
> 2020-06-10 12:17:06,869 | WARN |
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty
> queue] | Lock held time above threshold: lock identifier:
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl
> lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. The stack trace is:
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
> org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
> org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
> org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
> org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> | InstrumentedLock.java:143 {code}
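Editor's note: the WARN entry in the log above reports how long a lock was held before it was released. The following is a minimal sketch of that idea, assuming a simple wrapper around `ReentrantLock`; the names `TimedLockSketch` and `TimedLockDemo` are hypothetical and this is not Hadoop's `InstrumentedLock` implementation.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; this is not Hadoop's InstrumentedLock.
class TimedLockSketch implements AutoCloseable {
  private final ReentrantLock lock = new ReentrantLock();
  private final long warnThresholdMs;
  private long acquiredAtNanos;
  private long lastHeldMs;
  private int warnings;

  TimedLockSketch(long warnThresholdMs) {
    this.warnThresholdMs = warnThresholdMs;
  }

  // Takes the lock and records when the hold started.
  // Simplification: nested (reentrant) acquisition is not tracked.
  TimedLockSketch acquire() {
    lock.lock();
    acquiredAtNanos = System.nanoTime();
    return this;
  }

  // Measures the hold time while still holding the lock, releases it,
  // and only then prints the warning so logging does not extend the hold.
  @Override
  public void close() {
    long heldMs = (System.nanoTime() - acquiredAtNanos) / 1_000_000L;
    lastHeldMs = heldMs;
    boolean warn = heldMs > warnThresholdMs;
    if (warn) {
      warnings++;
    }
    lock.unlock();
    if (warn) {
      System.out.println(
          "Lock held time above threshold: lockHeldTimeMs=" + heldMs + " ms");
    }
  }

  long lastHeldMs() { return lastHeldMs; }

  int warnings() { return warnings; }
}

class TimedLockDemo {
  public static void main(String[] args) throws InterruptedException {
    TimedLockSketch datasetLock = new TimedLockSketch(20);
    try (TimedLockSketch held = datasetLock.acquire()) {
      Thread.sleep(50); // stand-in for a long scan performed under the lock
    }
    System.out.println("warnings: " + datasetLock.warnings());
  }
}
```

With try-with-resources the measurement happens on every release, which matches the shape of the stack trace above, where the warning is raised from the `close()`/`unlock()` path.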
[jira] [Commented] (HDFS-15412) Add options to set different block scan period for different StorageType
[ https://issues.apache.org/jira/browse/HDFS-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135895#comment-17135895 ]

Hadoop QA commented on HDFS-15412:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 27m 41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 4s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 18m 18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 50s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 46s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 13s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 17m 27s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 53s{color} | {color:orange} root: The patch generated 2 new + 41 unchanged - 0 fixed = 43 total (was 41) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 54s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 35s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}126m 35s{color} | {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 55s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}286m 13s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
| | hadoop.hdfs.TestGetFileChecksum |
| | hadoop.hdfs.TestReconstructStripedFile |
| | hadoop.hdfs.TestMultipleNNPortQOP |
| | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
| | hadoop.hdfs.TestSafeModeWithStripedFileWithRandomECPolicy |
| | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
| | hadoop.hdfs.TestReconstructStripedFil
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135841#comment-17135841 ]

Hadoop QA commented on HDFS-15346:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 49s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 15 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 8s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 21m 12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 6m 35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 39s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 53s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 6m 30s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 31s{color} | {color:blue} branch/hadoop-project no findbugs output file (findbugsXml.xml) {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 34s{color} | {color:blue} branch/hadoop-assemblies no findbugs output file (findbugsXml.xml) {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 33s{color} | {color:blue} branch/hadoop-tools/hadoop-tools-dist no findbugs output file (findbugsXml.xml) {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 32s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 20m 58s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 3m 19s{color} | {color:orange} root: The patch generated 5 new + 2 unchanged - 0 fixed = 7 total (was 2) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 7m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green} 0m 0s{color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green} 0m 32s{color} | {color:green} There were no new shelldocs issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 6s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 29s{color} | {color:green} the patch passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 30s{color} | {color:blue} hadoop-project has no data from findbugs {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 34s{color} | {color:blue} hadoop-assemblies has no data from findbugs {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {colo
[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135832#comment-17135832 ]

Hadoop QA commented on HDFS-15175:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 10s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 4m 10s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 7s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 41s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}105m 4s{color} | {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}189m 16s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot |
| | hadoop.hdfs.TestReconstructStripedFile |
| | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
| | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
| | hadoop.hdfs.server.datanode.TestBPOfferService |
| | hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics |
| | hadoop.hdfs.TestStripedFileAppend |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-HDFS-Build/29429/artifact/out/Dockerfile |
| JIRA Issue | HDFS-15175 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13005703/HDFS-15175-trunk.1.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
[jira] [Updated] (HDFS-15410) RBF: Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yiqun Lin updated HDFS-15410:
    Summary: RBF: Add separated config file fedbalance-default.xml for fedbalance tool (was: Add separated config file fedbalance-default.xml for fedbalance tool.)

> RBF: Add separated config file fedbalance-default.xml for fedbalance tool
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Major
>
> Add a separate config file named fedbalance-default.xml for the fedbalance
> tool's configs, similar to distcp-default.xml for the distcp tool.
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135819#comment-17135819 ]

Yiqun Lin commented on HDFS-15346:

[~LiJinglun], the refactor looks great. I see you decreased the timeout value; the new value seems too small and will lead to timeout errors. Can you adjust all of these timeout values to 3(@Test(timeout = 3) in TestDistCpProcedure? This value works well in my local environment.

Finally, can we add 'fedbalance' to the current package names in the fedbalance module? Under the module paths src/test/java and src/main/java, update
{noformat}
org.apache.hadoop.tools
org.apache.hadoop.tools.procedure
{noformat}
to
{noformat}
org.apache.hadoop.tools.fedbalance
org.apache.hadoop.tools.fedbalance.procedure
{noformat}
Then please check and update any old class paths used in the module, such as in hadoop-federation-balance.sh, pom.xml, or other places.

Everything else looks good to me now. Thanks [~LiJinglun] for your patient work on this. Once the above are addressed, I will hold off the commit for a few days in case there are comments from others.

> RBF: DistCpFedBalance implementation
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch,
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch,
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch,
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
> The patch in HDFS-15294 is too big to review, so we split it into 2 patches. This
> is the second one. Details can be found at HDFS-15294.
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135767#comment-17135767 ]

hemanthboyina commented on HDFS-15406:

Thanks [~brahmareddy] for the comment.
{quote}Not sure, whether HDFS-9668 will address the same.
{quote}
The lock contention is being handled through HDFS-15150 and HDFS-15160 by introducing read and write locks, though these do not reduce the time for which the lock is held, which is what this Jira aims to solve. By caching getBaseURI(), the lock held time was reduced to 52 seconds.

> Improve the speed of Datanode Block Scan
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: hemanthboyina
> Assignee: hemanthboyina
> Priority: Major
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135756#comment-17135756 ]

hemanthboyina commented on HDFS-15406:

I discussed this offline with [~pilchard]. The major drawback in his report was the configuration: "dfs.datanode.directoryscan.threads" was set to 1.

Two major points here:

*) If we have more volumes, the thread count (dfs.datanode.directoryscan.threads) affects the time taken by getDiskReport(), aka getVolumeReports(), because the scan of each volume is launched on a thread. If we increase the thread count, getDiskReport() takes less time.

*) Next, we acquire the lock and compare the report to the in-memory data.

To create a ScanInfo object we use vol.getBaseURI():
{code:java}
// FsVolumeSpi#ScanInfo (excerpt)
public ScanInfo(long blockId, File blockFile, File metaFile, FsVolumeSpi vol) {
  String condensedVolPath = (vol == null || vol.getBaseURI() == null) ? null :
      getCondensedPath(new File(vol.getBaseURI()).getAbsolutePath());
{code}
We call addDifference if there is any mismatch in blockId or blockLength. For that we call getMetaFile() and getBlockFile(), which again use vol.getBaseURI():
{code:java}
public File getMetaFile() {
  return new File(new File(volume.getBaseURI()).getAbsolutePath(), metaSuffix);
{code}
So the more blocks a DN has, the more calls are made to getBaseURI(). Each call converts currentDir.getParent() to a URI, which takes time, and we can cache the result here:
{code:java}
public URI getBaseURI() {
  return new File(currentDir.getParent()).toURI();
}
{code}
After caching this value, the lock held time was reduced to 52 seconds.

> Improve the speed of Datanode Block Scan
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: hemanthboyina
> Assignee: hemanthboyina
> Priority: Major
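Editor's note: the getBaseURI() caching proposed in the comment above can be sketched as follows. This is a minimal illustration under the assumption that the volume's directory never changes after construction; `VolumeSketch`, `BaseUriCacheDemo`, the `conversions` counter, and the path `/data/dn1/current` are hypothetical and are not taken from the actual patch.

```java
import java.io.File;
import java.net.URI;

// Hypothetical stand-in for a DataNode volume; not the real Hadoop volume class.
class VolumeSketch {
  private final File currentDir;
  // Computed once, on the assumption that currentDir is immutable.
  private final URI baseURI;
  // Counts how often the File -> URI conversion actually runs.
  private int conversions = 0;

  VolumeSketch(File currentDir) {
    this.currentDir = currentDir;
    this.baseURI = computeBaseURI();
  }

  private URI computeBaseURI() {
    conversions++;
    return new File(currentDir.getParent()).toURI();
  }

  // Uncached variant, equivalent to the getBaseURI() quoted in the comment:
  // every call repeats the conversion.
  URI getBaseURIUncached() {
    return computeBaseURI();
  }

  // Cached variant: returns the value computed in the constructor.
  URI getBaseURI() {
    return baseURI;
  }

  int conversions() {
    return conversions;
  }
}

class BaseUriCacheDemo {
  public static void main(String[] args) {
    VolumeSketch vol = new VolumeSketch(new File("/data/dn1/current"));
    // Simulate one lookup per block while the dataset lock would be held.
    for (int i = 0; i < 1_000_000; i++) {
      vol.getBaseURI();
    }
    System.out.println("conversions with cache: " + vol.conversions());
  }
}
```

Because the cached field is final and set in the constructor, it needs no extra synchronization; a lazily initialized cache would need more care.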
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135742#comment-17135742 ]

Brahma Reddy Battula commented on HDFS-15406:

{quote}we get the datanode jstack, with 11M block , found that getDiskReport run nearly 23 min,then hold lock to process scan about 6 min.
{quote}
getDiskReport() (getVolumeReports() after HDFS-13947) can be improved by configuring a higher value for "dfs.datanode.directoryscan.threads".
{quote} hold lock to process scan about 6 min
{quote}
Not sure, whether HDFS-9668 will address the same.

> Improve the speed of Datanode Block Scan
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: hemanthboyina
> Assignee: hemanthboyina
> Priority: Major
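Editor's note: the comment above suggests raising "dfs.datanode.directoryscan.threads" so that per-volume reports are gathered in parallel. The sketch below illustrates only the general pattern of submitting one task per volume to a fixed-size pool; `VolumeScanSketch` and `scanVolumes` are hypothetical names, each volume is modelled as an `int[]` of block IDs, and none of this is the DirectoryScanner code itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class VolumeScanSketch {
  // Submits one scan task per volume to a pool of `threads` workers and
  // returns the per-volume block counts in the same order as `volumes`.
  static List<Integer> scanVolumes(List<int[]> volumes, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (int[] blocks : volumes) {
        // Stand-in for walking the volume's directories on disk.
        Callable<Integer> task = () -> blocks.length;
        futures.add(pool.submit(task));
      }
      List<Integer> counts = new ArrayList<>();
      for (Future<Integer> future : futures) {
        counts.add(future.get());
      }
      return counts;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("scan interrupted", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("scan failed", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<int[]> volumes = Arrays.asList(new int[3], new int[5], new int[2]);
    // With threads = 1 the volumes are scanned one after another;
    // with more threads, several volumes can be scanned at the same time.
    System.out.println(scanVolumes(volumes, 2));
  }
}
```

With a pool of size 1 the tasks run strictly in sequence, which corresponds to the configuration described as the main drawback in the reported case.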
[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135722#comment-17135722 ]

Xiaoqiao He commented on HDFS-15175:
------------------------------------

Thanks [~wanchang] for the patch. It almost LGTM; adding a unit test would help push it forward. In my opinion, doing a deep copy on every such request is a bit expensive, but I cannot find a more graceful solution here.
cc [~ayushtkn], [~liuml07], [~weichiu] Any more suggestions?

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
> Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
>
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
> java.io.IOException: File is not under construction: ..
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:360)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>
> {panel:title=Editlog}
> <RECORD>
>   <OPCODE>OP_REASSIGN_LEASE</OPCODE>
>   <DATA>
>     <TXID>32625021150</TXID>
>     <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
>     <PATH>..</PATH>
>     <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625023743</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>..</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816135883</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_TRUNCATE</OPCODE>
>   <DATA>
>     <TXID>32625024049</TXID>
>     <SRC>..</SRC>
>     <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
>     <CLIENTMACHINE>..</CLIENTMACHINE>
>     <NEWLENGTH>185818644</NEWLENGTH>
>     <TIMESTAMP>1581816136336</TIMESTAMP>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818648</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625024993</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>..</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816138774</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> {panel}
>
> The block size in the first CloseOp should be 185818648. After the truncate,
> the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence is
> synced to the JournalNode in the same batch. Both CloseOps use the same block
> instance, so the first CloseOp is written with the wrong block size. When the
> SNN rolls the edit log and replays it, the TruncateOp does not put the file
> into the UnderConstruction state. Then, when the second CloseOp is applied,
> the file is not in the UnderConstruction state, and the SNN crashes.
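The root cause described above (two buffered CloseOps pointing at one mutable block) and the deep-copy remedy discussed in the comment can be sketched in a few lines. This is an illustration only: Block, CloseOp, and run() are hypothetical stand-ins rather than the HDFS classes or the attached patch, and the sketch assumes that ops are buffered and serialized only when the batch is flushed, which is what the description implies. The two lengths are the ones from the edit log above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins; not the HDFS Block or edit log classes. */
class SharedBlockDemo {
    /** Mutable block whose length changes when the file is truncated. */
    static final class Block {
        final long id;
        long numBytes;

        Block(long id, long numBytes) {
            this.id = id;
            this.numBytes = numBytes;
        }

        Block copy() {
            return new Block(id, numBytes);
        }
    }

    /** A buffered op that is serialized only when the batch is flushed. */
    static final class CloseOp {
        final Block block;

        CloseOp(Block block) {
            this.block = block;
        }

        String serialize() {
            return "OP_CLOSE numBytes=" + block.numBytes;
        }
    }

    /** Buffers close, truncate, close, then flushes them as one batch. */
    static List<String> run(boolean deepCopy) {
        Block live = new Block(5568434562L, 185818648L);
        List<CloseOp> batch = new ArrayList<>();
        batch.add(new CloseOp(deepCopy ? live.copy() : live)); // first close
        live.numBytes = 185818644L;                            // truncate mutates the live block
        batch.add(new CloseOp(deepCopy ? live.copy() : live)); // second close

        List<String> flushed = new ArrayList<>();
        for (CloseOp op : batch) {
            flushed.add(op.serialize());
        }
        return flushed;
    }

    public static void main(String[] args) {
        System.out.println("shared:    " + run(false));
        System.out.println("deep copy: " + run(true));
    }
}
```

In the shared case both serialized ops carry the post-truncate length, which matches the first OP_CLOSE record in the edit log. Copying the block when the op is created preserves the length at that moment, at the cost of one extra allocation per op, which is the expense raised in the comment.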
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135712#comment-17135712 ]

huhaiyang commented on HDFS-15391:
----------------------------------

Thanks [~hexiaoqiao] for helping to solve this.

> Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
> ---------------------------------------------------------------------------------------------
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In a production cluster running version 3.2.0, we found that, due to edit
> log corruption, the Standby NameNode could not properly load the edit log,
> which caused the service to exit abnormally and fail to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (replicated file), and
> when a write to the file fails, the following operations are performed:
> 1. close file
> 2. open file
> 3. truncate file
> 4. append file
> {noformat}
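This report is tracked under HDFS-15175, where the standby fails while replaying a close, truncate, close sequence. The toy model below sketches why such a replay can fail when the first close was logged with the post-truncate length. Everything in it is an assumption made for illustration: FileState, the Op helpers, and in particular the rule that only a truncate which actually shortens the file puts it back under construction. It is not the NameNode's edit log loader.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model with hypothetical names; not the HDFS edit log loader. */
class ReplayDemo {
    static final class FileState {
        long length;
        boolean underConstruction = true; // open for write before the first close

        FileState(long length) {
            this.length = length;
        }
    }

    interface Op {
        void apply(FileState f);
    }

    /** Close finalizes the file and requires it to be under construction. */
    static Op close(long length) {
        return f -> {
            if (!f.underConstruction) {
                throw new IllegalStateException("File is not under construction");
            }
            f.length = length;
            f.underConstruction = false;
        };
    }

    /** Assumed rule: only a truncate that shortens the file reopens it. */
    static Op truncate(long newLength) {
        return f -> {
            if (newLength < f.length) {
                f.length = newLength;
                f.underConstruction = true;
            }
        };
    }

    /** Replays close, truncate, close; returns false if the replay fails. */
    static boolean replays(long firstCloseLength) {
        FileState f = new FileState(185818648L);
        List<Op> log = Arrays.asList(
            close(firstCloseLength), truncate(185818644L), close(185818644L));
        try {
            for (Op op : log) {
                op.apply(f);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("first close logged as 185818648: " + replays(185818648L));
        System.out.println("first close logged as 185818644: " + replays(185818644L));
    }
}
```

Under these assumptions, a first close logged with the correct length leaves room for the truncate to reopen the file, while a first close logged with the already-truncated length turns the truncate into a no-op and the second close is rejected.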
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135698#comment-17135698 ]

huhaiyang commented on HDFS-15391:
----------------------------------

[~liuml07] Thank you for the reply!
The current issue is the same as [HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175], which already has a patch submitted and is ready to be fixed.
[jira] [Updated] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wan Chang updated HDFS-15175:
-----------------------------
    Attachment: HDFS-15175-trunk.1.patch
        Labels: NameNode  (was: )
        Status: Patch Available  (was: Open)
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135675#comment-17135675 ]

Xiaoqiao He edited comment on HDFS-15175 at 6/15/20, 9:43 AM:
--------------------------------------------------------------

Added [~wanchang] as a contributor and assigned this issue to him. Please feel free to assign it back if you are interested in it, [~caiyicong].

was (Author: hexiaoqiao):
Added [~wanchang] as a contributor and assigned this issue to him. Please assign it back if you are interested in it, [~caiyicong].
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135675#comment-17135675 ]

Xiaoqiao He edited comment on HDFS-15175 at 6/15/20, 9:42 AM:
--------------------------------------------------------------

Added [~wanchang] as a contributor and assigned this issue to him. Please assign it back if you are interested in it, [~caiyicong].

was (Author: hexiaoqiao):
Added [~wanchang] as a contributor and assigned this issue to him.
[jira] [Resolved] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoqiao He resolved HDFS-15391.
--------------------------------
    Resolution: Duplicate
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135683#comment-17135683 ]

Xiaoqiao He commented on HDFS-15391:
------------------------------------

This issue is a duplicate of HDFS-15175, so I will close this one. Please track it at HDFS-15175.
[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135675#comment-17135675 ]

Xiaoqiao He commented on HDFS-15175:
------------------------------------

Added [~wanchang] as a contributor and assigned this issue to him.
[jira] [Assigned] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoqiao He reassigned HDFS-15175:
----------------------------------
    Assignee: Wan Chang  (was: Yicong Cai)
[jira] [Updated] (HDFS-15412) Add options to set different block scan period for different StorageType
[ https://issues.apache.org/jira/browse/HDFS-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Yun updated HDFS-15412:
----------------------------
    Summary: Add options to set different block scan period for different StorageType  (was: Add options to set different scan period for different StorageType)

> Add options to set different block scan period for different StorageType
> -------------------------------------------------------------------------
>
> Key: HDFS-15412
> URL: https://issues.apache.org/jira/browse/HDFS-15412
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Yang Yun
> Assignee: Yang Yun
> Priority: Minor
> Attachments: HDFS-15412.001.patch
>
> Sometimes we don't want to scan cold data as often as hot data. Add options
> so that the scan period can be set according to StorageType.
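The request above amounts to a per-StorageType override with a fallback to the existing global period. A minimal sketch of that lookup follows. The option names and format used by the attached patch are not shown in this thread, so the "TYPE=hours" spec string, the ScanPeriods class, and the local StorageType enum are hypothetical; the value 504 mirrors the stock default of dfs.datanode.scan.period.hours and is used here only as a demo value.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of per-storage-type scan periods with a default. */
class ScanPeriods {
    /** Defined locally to keep the example self-contained. */
    enum StorageType { RAM_DISK, SSD, DISK, ARCHIVE }

    private final long defaultPeriodHours;
    private final Map<StorageType, Long> overrides = new EnumMap<>(StorageType.class);

    ScanPeriods(long defaultPeriodHours) {
        this.defaultPeriodHours = defaultPeriodHours;
    }

    /** Parses entries such as "ARCHIVE=2016,SSD=168"; the format is an assumption. */
    static ScanPeriods parse(long defaultPeriodHours, String spec) {
        ScanPeriods periods = new ScanPeriods(defaultPeriodHours);
        if (spec == null || spec.trim().isEmpty()) {
            return periods;
        }
        for (String entry : spec.split(",")) {
            String[] kv = entry.trim().split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Bad entry: " + entry);
            }
            StorageType type = StorageType.valueOf(kv[0].trim().toUpperCase(Locale.ROOT));
            periods.overrides.put(type, Long.parseLong(kv[1].trim()));
        }
        return periods;
    }

    /** Returns the override for the type, or the global default if none is set. */
    long periodHoursFor(StorageType type) {
        return overrides.getOrDefault(type, defaultPeriodHours);
    }

    public static void main(String[] args) {
        ScanPeriods periods = parse(504L, "ARCHIVE=2016");
        System.out.println("ARCHIVE: " + periods.periodHoursFor(StorageType.ARCHIVE));
        System.out.println("DISK:    " + periods.periodHoursFor(StorageType.DISK));
    }
}
```

Keeping the global value as the fallback means existing deployments behave the same until an override is configured for a specific storage type.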
[jira] [Updated] (HDFS-15412) Add options to set different scan period for different StorageType
[ https://issues.apache.org/jira/browse/HDFS-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Yun updated HDFS-15412:
----------------------------
    Attachment: HDFS-15412.001.patch
        Status: Patch Available  (was: Open)
[jira] [Created] (HDFS-15412) Add options to set different scan period for diffrent StorageType
Yang Yun created HDFS-15412:
---
             Summary: Add options to set different scan period for diffrent StorageType
                 Key: HDFS-15412
                 URL: https://issues.apache.org/jira/browse/HDFS-15412
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode
            Reporter: Yang Yun
            Assignee: Yang Yun

For cold data, we sometimes don't want to scan it as often as hot data. Add options so that the scan period can be set according to StorageType.
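The request above (a scan period chosen per StorageType) could look roughly like the following. This is a minimal sketch, not the code in HDFS-15412.001.patch: the per-type keys (the existing `dfs.datanode.scan.period.hours` key with a lowercase storage type appended) are hypothetical, the `StorageType` enum is a local stand-in for Hadoop's own, and `java.util.Properties` stands in for Hadoop's `Configuration` so the example has no dependencies.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

/** Illustrative only; names other than the base key are assumptions. */
class ScanPeriodByStorageType {

  /** Local stand-in for org.apache.hadoop.fs.StorageType (subset). */
  enum StorageType { RAM_DISK, SSD, DISK, ARCHIVE }

  /** Existing DataNode block scanner period setting, in hours. */
  static final String DEFAULT_KEY = "dfs.datanode.scan.period.hours";

  /** Fallback used here when the base key is unset (three weeks). */
  static final String DEFAULT_HOURS = "504";

  /**
   * Resolves a scan period for every storage type. A hypothetical
   * per-type key such as "dfs.datanode.scan.period.hours.archive"
   * overrides the base key; types without an override inherit it.
   */
  static Map<StorageType, Long> load(Properties conf) {
    long fallback = Long.parseLong(conf.getProperty(DEFAULT_KEY, DEFAULT_HOURS).trim());
    Map<StorageType, Long> periods = new EnumMap<>(StorageType.class);
    for (StorageType type : StorageType.values()) {
      String key = DEFAULT_KEY + "." + type.name().toLowerCase(Locale.ROOT);
      String value = conf.getProperty(key);
      periods.put(type, value == null ? fallback : Long.parseLong(value.trim()));
    }
    return periods;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(DEFAULT_KEY, "504");
    // Scan cold (ARCHIVE) volumes four times less often than the rest.
    conf.setProperty(DEFAULT_KEY + ".archive", "2016");
    System.out.println(load(conf));
  }
}
```

Falling back to the base key keeps existing deployments unchanged unless an operator sets a per-type override; whether the actual patch takes this approach is not stated in the issue.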
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135615#comment-17135615 ]

Jinglun commented on HDFS-15346:

You are a genius, [~linyiqun]! Thanks for your brilliant comments; the improvement is great! The unit tests run very fast now. I have followed up on all the changes, and I did a small refactor based on your improvement. The logic is the same as you suggested; I only extracted a method and refactored the class RunningJobStatus to make it easier to read. Please let me know your thoughts. I'm also OK with keeping it exactly as you suggested.

{quote}Can you update following description in router option? I update this content as well but seems this was not addressed in the latest patch. {quote}
Sorry, I missed this. Updated in v11.

{quote}Method name cleanUpBeforeInitDistcp can be renamed to pathCheckBeforeInitDistcp since we don't do any cleanup operation now. {quote}
Addressed in v11.

> RBF: DistCpFedBalance implementation
> 
>
>                 Key: HDFS-15346
>                 URL: https://issues.apache.org/jira/browse/HDFS-15346
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Jinglun
>            Assignee: Jinglun
>            Priority: Major
>         Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch,
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch,
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch,
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
> The patch in HDFS-15294 is too big to review, so we split it into two patches.
> This is the second one. Details can be found at HDFS-15294.
[jira] [Updated] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinglun updated HDFS-15346:
---
    Attachment: HDFS-15346.011.patch

> RBF: DistCpFedBalance implementation
> 
>
>                 Key: HDFS-15346
>                 URL: https://issues.apache.org/jira/browse/HDFS-15346
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Jinglun
>            Assignee: Jinglun
>            Priority: Major
>         Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch,
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch,
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch,
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
> The patch in HDFS-15294 is too big to review, so we split it into two patches.
> This is the second one. Details can be found at HDFS-15294.
[jira] [Commented] (HDFS-15411) TestNameNodeMXBean.testDecommissioningNodes fails
[ https://issues.apache.org/jira/browse/HDFS-15411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135610#comment-17135610 ]

Ayush Saxena commented on HDFS-15411:
-
Thanks [~aajisaka] for reporting. I tried to reproduce this by putting a sleep at L367. The check failed, but with a different mismatch: this time it was due to nonDFSUsed being different. We should replace this check with a relaxed check.
{noformat}
Expected :{"127.0.0.1:39745":{"infoAddr":"127.0.0.1:35813","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:39745","lastContact":0,"usedSpace":49152,"adminState":"In Service","nonDfsUsedSpace":303558705152,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219295744,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:33467":{"infoAddr":"127.0.0.1:40379","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:33467","lastContact":0,"usedSpace":49152,"adminState":"In Service","nonDfsUsedSpace":303558705152,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219295744,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:44865":{"infoAddr":"127.0.0.1:44315","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:44865","lastContact":0,"usedSpace":49152,"adminState":"In Service","nonDfsUsedSpace":303558705152,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219295744,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0}}
Actual :{"127.0.0.1:39745":{"infoAddr":"127.0.0.1:35813","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:39745","lastContact":0,"usedSpace":49152,"adminState":"In 
Service","nonDfsUsedSpace":303558696960,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219303936,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:33467":{"infoAddr":"127.0.0.1:40379","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:33467","lastContact":0,"usedSpace":49152,"adminState":"In Service","nonDfsUsedSpace":303558696960,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219303936,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:44865":{"infoAddr":"127.0.0.1:44315","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:44865","lastContact":0,"usedSpace":49152,"adminState":"In Service","nonDfsUsedSpace":303558696960,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219303936,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0}} {noformat} > TestNameNodeMXBean.testDecommissioningNodes fails > - > > Key: HDFS-15411 > URL: https://issues.apache.org/jira/browse/HDFS-15411 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Akira Ajisaka >Priority: Major > > https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/172/testReport/org.apache.hadoop.hdfs.server.namenode/TestNameNodeMXBean/testDecommissioningNodes/ > {noformat} > org.junit.ComparisonFailure: > expected:<...0,"lastBlockReport":[0},"127.0.0.1:35473":{"infoAddr":"127.0.0.1:38399","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:35473","lastContact":0,"usedSpace":49152,"adminState":"In > > 
Service","nonDfsUsedSpace":325285158912,"capacity":7871746113536,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":7146454327296,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":6.244104E-7,"volfails":0,"lastBlockReport":0},"127.0.0.1:44811":{"infoAddr":"127.0.0.1:39743","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:44811","lastContact":0,"usedSpace":49152,"adminState":"In > > Service","nonDfsUsedSpace":325285158912,"capacity":7871746113536,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":7146454327296,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":6.244104E-7,"volfails":0,"lastBlockReport":0]}}> > but > was:<...0,"lastBlockReport":[362603},"127.0.0.1:35473":{"infoAddr":"127.0.0.1:38399","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:35473","lastContact":0,"usedSpace":49152,"adminState":"In > > Service","nonDfsUsedSpace":325285158912,"capacity":7871746113536,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":7146454327296,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":6.244104E-7,"volfails":0,"lastBlockReport":362603},"127.0.0.1:448