[jira] [Commented] (HDFS-15346) DistCpFedBalance implementation

2020-06-15 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136267#comment-17136267
 ] 

Jinglun commented on HDFS-15346:


Hi [~linyiqun], thanks for your great comments! The comments are addressed in v12,
pending Jenkins.

> DistCpFedBalance implementation
> ---
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, 
> HDFS-15346.012.patch
>
>
> The patch in HDFS-15294 is too big to review, so we split it into 2 patches. This
> is the second one. Details can be found at HDFS-15294.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15346) DistCpFedBalance implementation

2020-06-15 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15346:
---
Attachment: HDFS-15346.012.patch

> DistCpFedBalance implementation
> ---
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, 
> HDFS-15346.012.patch
>
>
> The patch in HDFS-15294 is too big to review, so we split it into 2 patches. This
> is the second one. Details can be found at HDFS-15294.






[jira] [Commented] (HDFS-15294) Federation balance tool

2020-06-15 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136263#comment-17136263
 ] 

Yiqun Lin commented on HDFS-15294:
--

As this tool is designed as a common tool like distcp, I removed the RBF label
from all uncommitted subtasks.

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new balance command 'fedbalance' that is run by the
> administrator. The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update mount table in Router.
>  3. Delete the src to trash.
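> A rough sketch of the expected usage (the CLI is still under review in this
> jira, so the exact syntax may change):
> {noformat}
> hadoop fedbalance submit hdfs://ns-src/foo/src hdfs://ns-dst/foo/dst
> {noformat}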
>  
> The patch is too big to review, so I split it into 2 patches:
> Phase 1 / The State Machine (BalanceProcedureScheduler): includes the
> abstraction of the job and scheduler model.
> {code:java}
> org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
> org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
> org.apache.hadoop.hdfs.procedure.BalanceProcedure;
> org.apache.hadoop.hdfs.procedure.BalanceJob;
> org.apache.hadoop.hdfs.procedure.BalanceJournal;
> org.apache.hadoop.hdfs.procedure.HDFSJournal;
> {code}
> Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob (HDFS-15346).
> {code:java}
> org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
> org.apache.hadoop.tools.DistCpFedBalance;
> org.apache.hadoop.tools.DistCpProcedure;
> org.apache.hadoop.tools.FedBalance;
> org.apache.hadoop.tools.FedBalanceConfigs;
> org.apache.hadoop.tools.FedBalanceContext;
> org.apache.hadoop.tools.TrashProcedure;
> {code}
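> A minimal illustration of how these procedures compose into a job (indicative
> only; the method names here are rough and the real API is defined by the patch):
> {code:java}
> // Chain the three procedures and hand the job to the scheduler.
> BalanceJob.Builder<BalanceProcedure> builder = new BalanceJob.Builder<>();
> builder.nextProcedure(distCpProcedure)      // 1. sync src -> dst with distcp + snapshot diff
>        .nextProcedure(mountTableProcedure)  // 2. update the mount table in the Router
>        .nextProcedure(trashProcedure);      // 3. move src to trash
> BalanceJob job = builder.build();
> scheduler.submit(job);  // the BalanceProcedureScheduler drives and journals the job
> {code}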






[jira] [Updated] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15410:
-
Summary: Add separated config file fedbalance-default.xml for fedbalance 
tool  (was: RBF: Add separated config file fedbalance-default.xml for 
fedbalance tool)

> Add separated config file fedbalance-default.xml for fedbalance tool
> 
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool
> configs. It's like the distcp-default.xml for the distcp tool.
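> By analogy with distcp, wiring such a file in is typically just registering it
> as a default resource (a sketch; the patch defines the actual mechanism):
> {code:java}
> // Values from fedbalance-default.xml then back every new Configuration,
> // the same way DistCp registers distcp-default.xml.
> Configuration.addDefaultResource("fedbalance-default.xml");
> Configuration conf = new Configuration();
> {code}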






[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15374:
-
Summary: Add documentation for fedbalance tool  (was: RBF: Add 
documentation for fedbalance tool)

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15374.001.patch
>
>







[jira] [Updated] (HDFS-15346) DistCpFedBalance implementation

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15346:
-
Summary: DistCpFedBalance implementation  (was: RBF: DistCpFedBalance 
implementation)

> DistCpFedBalance implementation
> ---
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
>
> The patch in HDFS-15294 is too big to review, so we split it into 2 patches. This
> is the second one. Details can be found at HDFS-15294.






[jira] [Updated] (HDFS-15294) Federation balance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15294:
-
Summary: Federation balance tool  (was: RBF: Balance data across federation 
namespaces with DistCp and snapshot diff)

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new balance command 'fedbalance' that is run by the
> administrator. The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update mount table in Router.
>  3. Delete the src to trash.
>  
> The patch is too big to review, so I split it into 2 patches:
> Phase 1 / The State Machine (BalanceProcedureScheduler): includes the
> abstraction of the job and scheduler model.
> {code:java}
> org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
> org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
> org.apache.hadoop.hdfs.procedure.BalanceProcedure;
> org.apache.hadoop.hdfs.procedure.BalanceJob;
> org.apache.hadoop.hdfs.procedure.BalanceJournal;
> org.apache.hadoop.hdfs.procedure.HDFSJournal;
> {code}
> Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob (HDFS-15346).
> {code:java}
> org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
> org.apache.hadoop.tools.DistCpFedBalance;
> org.apache.hadoop.tools.DistCpProcedure;
> org.apache.hadoop.tools.FedBalance;
> org.apache.hadoop.tools.FedBalanceConfigs;
> org.apache.hadoop.tools.FedBalanceContext;
> org.apache.hadoop.tools.TrashProcedure;
> {code}






[jira] [Updated] (HDFS-15414) java.net.SocketException: Original Exception : java.io.IOException: Broken pipe

2020-06-15 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated HDFS-15414:
-
Description: 
We observed this exception in a DataNode's log while we were not shutting down
any nodes in the cluster. Specifically, we have a cluster with 3 DataNodes
(DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception
occurs in DN3's log:
{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
 Starting thread to transfer 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
 Starting thread to transfer 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO 
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: 
Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed
 to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 
127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
at 
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
 
at 
org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
 
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
 
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe 
... 11 more{noformat}
Port 9766 is DN2's address. 

Around the same time, we observe the following exceptions in DN2's log:
{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: 
/127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received 
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: 
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists 
in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation  src: 
/127.0.0.1:47618 dst: /127.0.0.1:9766; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in 
state FINALIZED and thus cannot be created.{noformat}
However, this exception doesn't look like the cause of the broken pipe because 
earlier DN2 has another occurrence of a ReplicaAlreadyExistsException, but DN3 
only has one occurrence of broken pipe. Here's the other occurrence of 
ReplicaAlreadyExistsException on DN2:
{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: 
/127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received 
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: 
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists 
in state FINALIZED and thus cannot be created.
2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation  src

[jira] [Updated] (HDFS-15414) java.net.SocketException: Original Exception : java.io.IOException: Broken pipe

2020-06-15 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated HDFS-15414:
-
Description: 
We observed this exception in a DataNode's log while we were not shutting down
any nodes in the cluster. Specifically, we have a cluster with 3 DataNodes
(DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception
occurs in DN3's log:
{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
 Starting thread to transfer 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
 Starting thread to transfer 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO 
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: 
Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed
 to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 
127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
at 
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
 
at 
org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
 
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
 
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe 
... 11 more{noformat}
Port 9766 is DN2's address. 

Around the same time, we observe the following exceptions in DN2's log:
{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: 
/127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received 
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: 
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists 
in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation  src: 
/127.0.0.1:47618 dst: /127.0.0.1:9766; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in 
state FINALIZED and thus cannot be created.{noformat}
However, this exception doesn't look like the cause of the broken pipe because 
earlier DN2 has another occurrence of a ReplicaAlreadyExistsException, but DN3 
only has one occurrence of broken pipe. Here's the other occurrence of 
ReplicaAlreadyExistsException on DN2:
{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: 
/127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received 
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: 
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists 
in state FINALIZED and thus cannot be created.
2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation  src: 

[jira] [Created] (HDFS-15414) java.net.SocketException: Original Exception : java.io.IOException: Broken pipe

2020-06-15 Thread YCozy (Jira)
YCozy created HDFS-15414:


 Summary: java.net.SocketException: Original Exception : 
java.io.IOException: Broken pipe
 Key: HDFS-15414
 URL: https://issues.apache.org/jira/browse/HDFS-15414
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.10.0
Reporter: YCozy


We observed this exception in a DataNode's log while we were not shutting down
any nodes in the cluster. Specifically, we have a cluster with 3 DataNodes
(DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception
occurs in DN3's log:
{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
 Starting thread to transfer 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
 Starting thread to transfer 
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO 
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: 
Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:9666, 
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, 
infoSecurePort=0, ipcPort=9667, 
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed
 to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 
127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
at 
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
 
at 
org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
 
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
 
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
 
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe 
... 11 more{noformat}
Port 9766 is DN2's address. 

Around the same time, we observe the following exceptions in DN2's log:
{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: 
/127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received 
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: 
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already 
exists in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation  src: 
/127.0.0.1:47618 dst: /127.0.0.1:9766; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block     
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in 
state FINALIZED and thus cannot be created.{noformat}
However, this exception doesn't look like the cause of the broken pipe because 
earlier DN2 has another occurrence of a ReplicaAlreadyExistsException, but DN3 
only has one occurrence of broken pipe. Here's the other occurrence of 
ReplicaAlreadyExistsException on DN2:
{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: 
/127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received 
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: 
Block BP-553302063-172.17.0.3-1591653120007:blk_10737418

[jira] [Created] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2020-06-15 Thread Andrey Elenskiy (Jira)
Andrey Elenskiy created HDFS-15413:
--

 Summary: DFSStripedInputStream throws exception when datanodes 
close idle connections
 Key: HDFS-15413
 URL: https://issues.apache.org/jira/browse/HDFS-15413
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ec, erasure-coding, hdfs-client
Affects Versions: 3.1.3
 Environment: - Hadoop 3.1.3
- erasure coding with ISA-L and RS-3-2-1024k scheme
- running in kubernetes
- dfs.client.socket-timeout = 10000
- dfs.datanode.socket.write.timeout = 10000
Reporter: Andrey Elenskiy
 Attachments: out.log

We've run into an issue with compactions failing in HBase when erasure coding 
is enabled on a table directory. After digging further I was able to narrow it 
down to the seek + read logic and was able to reproduce the issue with the HDFS 
client only:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;

public class ReaderRaw {
    public static void main(final String[] args) throws Exception {
        Path p = new Path(args[0]);
        int bufLen = Integer.parseInt(args[1]);
        int sleepDuration = Integer.parseInt(args[2]);
        int countBeforeSleep = Integer.parseInt(args[3]);
        int countAfterSleep = Integer.parseInt(args[4]);
        Configuration conf = new Configuration();

        FSDataInputStream istream = FileSystem.get(conf).open(p);

        byte[] buf = new byte[bufLen];
        int readTotal = 0;
        int count = 0;
        try {
            while (true) {
                istream.seek(readTotal);

                int bytesRemaining = bufLen;
                int bufOffset = 0;
                while (bytesRemaining > 0) {
                    int nread = istream.read(buf, 0, bufLen);
                    if (nread < 0) {
                        throw new Exception("nread is less than zero");
                    }
                    readTotal += nread;
                    bufOffset += nread;
                    bytesRemaining -= nread;
                }

                count++;
                if (count == countBeforeSleep) {
                    System.out.println("sleeping for " + sleepDuration + " milliseconds");
                    Thread.sleep(sleepDuration);
                    System.out.println("resuming");
                }
                if (count == countBeforeSleep + countAfterSleep) {
                    System.out.println("done");
                    break;
                }
            }
        } catch (Exception e) {
            System.out.println("exception on read " + count + " read total " + readTotal);
            throw e;
        }
    }
}
{code}
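For reference, one way the snippet can be compiled and run (the path and argument
values here are illustrative, matching the reproduction parameters below):
{noformat}
javac -cp $(hadoop classpath) ReaderRaw.java
java -cp $(hadoop classpath):. ReaderRaw /ec/testfile 1000000 11000 1 7
{noformat}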


The issue appears to be due to the fact that datanodes close the connection to 
the EC client if it doesn't fetch the next packet for longer than 
dfs.client.socket-timeout. The EC client doesn't retry and instead assumes that 
those datanodes went away, resulting in a "missing blocks" exception.
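A client-side mitigation sketch (my assumption, not a fix inside
DFSStripedInputStream): reopen the stream and continue from the last good offset
once the datanodes have dropped the idle connections. p and bufLen are as in the
snippet above; fs is the FileSystem and fileLength the file's length, both
assumed available, with error handling elided:
{code:java}
// Hedged workaround: reopen and re-seek on failure instead of expecting
// the EC client to retry internally.
FSDataInputStream in = fs.open(p);
long pos = 0;
byte[] buf = new byte[bufLen];
while (pos < fileLength) {
  try {
    in.seek(pos);
    int n = in.read(buf, 0, buf.length);
    if (n < 0) {
      break; // end of stream
    }
    pos += n;
  } catch (IOException e) {
    in.close();       // drop the stale stream whose peer sockets were closed
    in = fs.open(p);  // a fresh stream re-establishes the datanode connections
  }
}
in.close();
{code}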

I was able to consistently reproduce with the following arguments:
{noformat}
bufLen = 1000000 (just below 1MB which is the size of the stripe) 
sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000)
countBeforeSleep = 1
countAfterSleep = 7
{noformat}

I've attached the entire log output of running the snippet above against 
erasure coded file with RS-3-2-1024k policy. And here are the logs from 
datanodes of disconnecting the client:

datanode 1:
{noformat}
2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped 
reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver error 
processing READ_BLOCK operation  src: /10.128.23.40:53748 dst: 
/10.128.14.46:9866); java.net.SocketTimeoutException: 10000 millis timeout 
while waiting for channel to be ready for write. ch : 
java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 
remote=/10.128.23.40:53748]
{noformat}

datanode 2:
{noformat}
2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped 
reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver error 
processing READ_BLOCK operation  src: /10.128.23.40:48772 dst: 
/10.128.9.42:9866); java.net.SocketTimeoutException: 10000 millis timeout while 
waiting for channel to be ready for write. ch : 
java.nio.channels.SocketChannel[connected local=/10.128.9.42:9866 
remote=/10.128.23.40:48772]
{noformat}

datanode 3:
{noformat}
2020-06-15 19:06:20,467 INFO datanode.DataNode: Likely the client has stopped 
reading, disconnecting it (datanode-v11-3-hadoop.hadoop:9866:DataXceiver error 
processing READ_BLOCK operation  src: /10.128.23.40:57184 dst: 
/10.128.16.13:9866); java.net.SocketTimeoutException: 10000 millis timeout 
while waiting for channel to be ready for write. ch : 
java.nio.channels.SocketChannel[connected local=/10.128.16.13:9866 
remote=/10.128.23.40:57184]
{noformat}

I've tried running the same code against non-EC files with replication

[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan

2020-06-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136060#comment-17136060
 ] 

Hadoop QA commented on HDFS-15406:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 27s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m  
2s{color} | {color:blue} Used deprecated FindBugs config; considering switching 
to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  5s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}109m 29s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
36s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}179m 51s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
|   | hadoop.hdfs.tools.TestDFSAdminWithHA |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.TestStripedFileAppend |
|   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-HDFS-Build/29430/artifact/out/Dockerfile
 |
| JIRA Issue | HDFS-15406 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13005719/HDFS-15406.001.patch |
| Optional Tests | dupn

[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan

2020-06-15 Thread hemanthboyina (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemanthboyina updated HDFS-15406:
-
Attachment: HDFS-15406.001.patch
Status: Patch Available  (was: Open)

> Improve the speed of Datanode Block Scan
> 
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15406.001.patch
>
>
> In our customer cluster we have approx 10M blocks in one datanode. For the
> Datanode to scan all the blocks, it has taken nearly 5 mins.
> {code:java}
> 2020-06-10 12:17:06,869 | INFO  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: 
> 11149530, missing metadata files:472, missing block files:472, missing blocks 
> in memory:0, mismatched blocks:0 | DirectoryScanner.java:473
> 2020-06-10 12:17:06,869 | WARN  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | Lock held time above threshold: lock identifier: 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
> lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. The stack trace is: 
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
> org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
> org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
> org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
> org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>  | InstrumentedLock.java:143 {code}






[jira] [Commented] (HDFS-15412) Add options to set different block scan period for diffrent StorageType

2020-06-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135895#comment-17135895
 ] 

Hadoop QA commented on HDFS-15412:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 27m 
41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
4s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 18m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
21m 50s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
46s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m 
13s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 17m 
27s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
2m 53s{color} | {color:orange} root: The patch generated 2 new + 41 unchanged - 
0 fixed = 43 total (was 41) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 54s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  9m 
35s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}126m 35s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
55s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}286m 13s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
|   | hadoop.hdfs.TestGetFileChecksum |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.TestMultipleNNPortQOP |
|   | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
|   | hadoop.hdfs.TestSafeModeWithStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
|   | hadoop.hdfs.TestReconstructStripedFil

[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135841#comment-17135841
 ] 

Hadoop QA commented on HDFS-15346:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
49s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 15 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
8s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 21m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  6m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 39s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  4m 
53s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  6m 
30s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
31s{color} | {color:blue} branch/hadoop-project no findbugs output file 
(findbugsXml.xml) {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
34s{color} | {color:blue} branch/hadoop-assemblies no findbugs output file 
(findbugsXml.xml) {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
33s{color} | {color:blue} branch/hadoop-tools/hadoop-tools-dist no findbugs 
output file (findbugsXml.xml) {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
32s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 20m 
58s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
3m 19s{color} | {color:orange} root: The patch generated 5 new + 2 unchanged - 
0 fixed = 7 total (was 2) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  7m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green}  0m 
 0s{color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 
32s{color} | {color:green} There were no new shelldocs issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
6s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 25s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  5m 
29s{color} | {color:green} the patch passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
30s{color} | {color:blue} hadoop-project has no data from findbugs {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
34s{color} | {color:blue} hadoop-assemblies has no data from findbugs {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {colo

[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135832#comment-17135832
 ] 

Hadoop QA commented on HDFS-15175:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  4m 
10s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
7s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 41s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}105m  4s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}189m 16s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.server.datanode.TestBPOfferService |
|   | hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics |
|   | hadoop.hdfs.TestStripedFileAppend |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-HDFS-Build/29429/artifact/out/Dockerfile
 |
| JIRA Issue | HDFS-15175 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13005703/HDFS-15175-trunk.1.patch
 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |

[jira] [Updated] (HDFS-15410) RBF: Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15410:
-
Summary: RBF: Add separated config file fedbalance-default.xml for 
fedbalance tool  (was: Add separated config file fedbalance-default.xml for 
fedbalance tool.)

> RBF: Add separated config file fedbalance-default.xml for fedbalance tool
> -
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool
> configs. It's like the distcp-default.xml for the distcp tool.






[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-15 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135819#comment-17135819
 ] 

Yiqun Lin commented on HDFS-15346:
--

[~LiJinglun], the refactor looks great. I see you decreased the timeout value; 
the new value seems too small and it will lead to timeout errors.

Can you adjust all these timeout values to 30000 (@Test(timeout = 30000)) in 
TestDistCpProcedure? This value works well in my local environment.

Finally, can we add 'fedbalance' to the current package name under the 
fedbalance module?

Under the module paths src/test/java and src/main/java, update
{noformat}
org.apache.hadoop.tools
org.apache.hadoop.tools.procedure
{noformat}
to
{noformat}
org.apache.hadoop.tools.fedbalance
org.apache.hadoop.tools.fedbalance.procedure
{noformat}
Then please check and update any old class paths used in the module, e.g. in 
hadoop-federation-balance.sh, pom.xml or other places.

Everything else looks good to me now. Thanks [~LiJinglun] for the patient work 
on this.
Once the above are addressed, I will hold off the commit for a few days in case 
there are other comments from others.

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
>
> The patch in HDFS-15294 is too big to review, so we split it into 2 patches. This
> is the second one. Details can be found at HDFS-15294.






[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan

2020-06-15 Thread hemanthboyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135767#comment-17135767
 ] 

hemanthboyina commented on HDFS-15406:
--

thanks [~brahmareddy] for the comment
{quote}Not sure, whether HDFS-9668 will address the same.
{quote}
The locking contention was being handled through HDFS-15150 and HDFS-15160 by 
introducing a read lock and a write lock, though these don't improve the time 
held under the lock, which is what this Jira aims to solve.

By caching getBaseURI(), the lock time was reduced to 52 sec.

> Improve the speed of Datanode Block Scan
> 
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
>
> In our customer cluster we have approx. 10M blocks on one datanode.
> For the Datanode to scan all the blocks, it has taken nearly 5 mins.
> {code:java}
> 2020-06-10 12:17:06,869 | INFO  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: 
> 11149530, missing metadata files:472, missing block files:472, missing blocks 
> in memory:0, mismatched blocks:0 | DirectoryScanner.java:473
> 2020-06-10 12:17:06,869 | WARN  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | Lock held time above threshold: lock identifier: 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
> lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. The stack trace is: 
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
> org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
> org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
> org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
> org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>  | InstrumentedLock.java:143 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan

2020-06-15 Thread hemanthboyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135756#comment-17135756
 ] 

hemanthboyina commented on HDFS-15406:
--

Discussed with [~pilchard] offline; the major drawback in his report was the 
configuration: "dfs.datanode.directoryscan.threads" was set to 1.

Two major points here:

*) If we have more volumes, the thread count 
(dfs.datanode.directoryscan.threads) impacts the time taken by 
getDiskReport(), aka getVolumeReports(), as each volume is handled by a thread 
here. If we increase the thread count, the time taken by getDiskReport() will 
be less.

*) Next we acquire the lock and compare the report to the in-memory data.

For creating the ScanInfo object we use vol.getBaseURI():
{code:java}
// FsVolumeSpi.ScanInfo
public ScanInfo(long blockId, File blockFile, File metaFile,
    FsVolumeSpi vol) {
  String condensedVolPath =
      (vol == null || vol.getBaseURI() == null) ? null :
      getCondensedPath(new File(vol.getBaseURI()).getAbsolutePath()); {code}
We addDifference() if there is any mismatch in blockId or blockLength; for 
that we call getMetaFile() and getBlockFile(), where we again use 
vol.getBaseURI():
{code:java}
public File getMetaFile() {
  return new File(new File(volume.getBaseURI()).getAbsolutePath(),
      metaSuffix); {code}
So if a DN has more blocks, there are more calls to getBaseURI(), and each 
time we call getBaseURI() we are converting currentDir.getParent() to a URI, 
which takes time; we can cache this here:
{code:java}
public URI getBaseURI() {
  return new File(currentDir.getParent()).toURI();
} {code}
On caching this value, the lock hold time reduced to 52 sec.
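
A minimal sketch of what such caching could look like (the field name is hypothetical, and it assumes currentDir does not change for the lifetime of the volume; this sketches the idea, not the actual patch):
{code:java}
// Illustrative sketch: compute the base URI once instead of on every call.
private volatile URI cachedBaseURI;

public URI getBaseURI() {
  if (cachedBaseURI == null) {
    // currentDir is fixed for the volume's lifetime, so the converted
    // URI can be computed once and then reused by every ScanInfo.
    cachedBaseURI = new File(currentDir.getParent()).toURI();
  }
  return cachedBaseURI;
}
{code}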

> Improve the speed of Datanode Block Scan
> 
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
>
> In our customer cluster we have approx. 10M blocks on one datanode.
> For the Datanode to scan all the blocks, it has taken nearly 5 mins.
> {code:java}
> 2020-06-10 12:17:06,869 | INFO  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: 
> 11149530, missing metadata files:472, missing block files:472, missing blocks 
> in memory:0, mismatched blocks:0 | DirectoryScanner.java:473
> 2020-06-10 12:17:06,869 | WARN  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | Lock held time above threshold: lock identifier: 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
> lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. The stack trace is: 
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
> org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
> org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
> org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
> org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>  | InstrumentedLock.java:143 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan

2020-06-15 Thread Brahma Reddy Battula (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135742#comment-17135742
 ] 

Brahma Reddy Battula commented on HDFS-15406:
-

{quote}we get the datanode jstack, with 11M block , found that getDiskReport 
run nearly 23 min,then hold lock to process scan about 6 min.
{quote}
getDiskReport() (getVolumeReports() after HDFS-13947) can be improved by 
configuring "dfs.datanode.directoryscan.threads" higher.
{quote} hold lock to process scan about 6 min
{quote}
Not sure whether HDFS-9668 will address the same.
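
For illustration, that knob can be raised like this (normally it is set in hdfs-site.xml; the property name is real, while the value below is only an example and should roughly match the number of volumes):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class DirScannerThreadsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The default is 1; a datanode with many volumes benefits from more
    // threads, since volume reports are compiled by a pool of this size.
    conf.setInt("dfs.datanode.directoryscan.threads", 8);
    System.out.println(conf.getInt("dfs.datanode.directoryscan.threads", 1));
  }
}
{code}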

> Improve the speed of Datanode Block Scan
> 
>
> Key: HDFS-15406
> URL: https://issues.apache.org/jira/browse/HDFS-15406
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
>
> In our customer cluster we have approx. 10M blocks on one datanode.
> For the Datanode to scan all the blocks, it has taken nearly 5 mins.
> {code:java}
> 2020-06-10 12:17:06,869 | INFO  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: 
> 11149530, missing metadata files:472, missing block files:472, missing blocks 
> in memory:0, mismatched blocks:0 | DirectoryScanner.java:473
> 2020-06-10 12:17:06,869 | WARN  | 
> java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty 
> queue] | Lock held time above threshold: lock identifier: 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
> lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. The stack trace is: 
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
> org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
> org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
> org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
> org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375)
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>  | InstrumentedLock.java:143 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135722#comment-17135722
 ] 

Xiaoqiao He commented on HDFS-15175:


Thanks [~wanchang] for the patch. It is almost LGTM; adding a unit test would 
be better to push it forward. A deep copy operation for some requests every 
time is a bit expensive in my opinion, however I do not find any more graceful 
solution here. cc [~ayushtkn],[~liuml07],[~weichiu] Any more suggestions?
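
For illustration, a minimal sketch of the deep-copy idea being weighed (assuming the fix copies the block before it is captured by the logged op, so a later TruncateOp in the same batch cannot mutate it; the numbers come from this issue's report, and this sketches the idea rather than the exact patch):
{code:java}
import org.apache.hadoop.hdfs.protocol.Block;

public class BlockCopyExample {
  public static void main(String[] args) {
    Block shared = new Block(5568434562L, 185818644L, 4495417845L);
    // Copy the state instead of sharing the live instance, so the logged
    // CloseOp keeps its own numBytes even if the block is changed later.
    Block logged = new Block(shared);
    shared.setNumBytes(185818648L);  // simulates the in-batch truncate update
    System.out.println(logged.getNumBytes());  // still prints 185818644
  }
}
{code}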

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Wan Chang
>Priority: Critical
>  Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
>
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp are 
> synchronized to the JournalNode in the same batch. The block used by the 
> two CloseOps is the same instance, which causes the first CloseOp to have 
> the wrong block size. When the SNN rolls the editlog, TruncateOp does not 
> put the file into the UnderConstruction state. Then, when the second 
> CloseOp is executed, the file is not in the UnderConstruction state, and 
> the SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Standby NameNode loads the corrupted edit log, the service exits and cannot be restarted

2020-06-15 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135712#comment-17135712
 ] 

huhaiyang commented on HDFS-15391:
--

Thanks [~hexiaoqiao] for helping to solve this.

> Standby NameNode loads the corrupted edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replication file), 
> and when an exception occurs while writing the file, the following 
> operations are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Standby NameNode loads the corrupted edit log, the service exits and cannot be restarted

2020-06-15 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135698#comment-17135698
 ] 

huhaiyang commented on HDFS-15391:
--

[~liuml07] Thank you for the reply!
 The current issue is the same as 
[HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175], and 
[HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175] has a submitted 
patch ready for the fix.

> Standby NameNode loads the corrupted edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replication file), 
> and when an exception occurs while writing the file, the following 
> operations are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-15 Thread Wan Chang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Chang updated HDFS-15175:
-
Attachment: HDFS-15175-trunk.1.patch
Labels: NameNode  (was: )
Status: Patch Available  (was: Open)

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Wan Chang
>Priority: Critical
>  Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
>
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp are 
> synchronized to the JournalNode in the same batch. The block used by the 
> two CloseOps is the same instance, which causes the first CloseOp to have 
> the wrong block size. When the SNN rolls the editlog, TruncateOp does not 
> put the file into the UnderConstruction state. Then, when the second 
> CloseOp is executed, the file is not in the UnderConstruction state, and 
> the SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135675#comment-17135675
 ] 

Xiaoqiao He edited comment on HDFS-15175 at 6/15/20, 9:43 AM:
--

Add [~wanchang] as contributor and assign this issue to him. Please feel free 
to assign it back if you are interested. [~caiyicong]


was (Author: hexiaoqiao):
Add [~wanchang] as contributor and assign this issue to him. Please assign it 
back if you are interested. [~caiyicong]

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Wan Chang
>Priority: Critical
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp are 
> synchronized to the JournalNode in the same batch. The block used by the 
> two CloseOps is the same instance, which causes the first CloseOp to have 
> the wrong block size. When the SNN rolls the editlog, TruncateOp does not 
> put the file into the UnderConstruction state. Then, when the second 
> CloseOp is executed, the file is not in the UnderConstruction state, and 
> the SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135675#comment-17135675
 ] 

Xiaoqiao He edited comment on HDFS-15175 at 6/15/20, 9:42 AM:
--

Add [~wanchang] as contributor and assign this issue to him. Please assign it 
back if you are interested. [~caiyicong]


was (Author: hexiaoqiao):
Add [~wanchang] as contributor and assign this issue to him.

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Wan Chang
>Priority: Critical
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp are 
> synchronized to the JournalNode in the same batch. The block used by the 
> two CloseOps is the same instance, which causes the first CloseOp to have 
> the wrong block size. When the SNN rolls the editlog, TruncateOp does not 
> put the file into the UnderConstruction state. Then, when the second 
> CloseOp is executed, the file is not in the UnderConstruction state, and 
> the SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15391) Standby NameNode loads the corrupted edit log, the service exits and cannot be restarted

2020-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15391.

Resolution: Duplicate

> Standby NameNode loads the corrupted edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replication file), 
> and when an exception occurs while writing the file, the following 
> operations are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Standby NameNode loads the corrupted edit log, the service exits and cannot be restarted

2020-06-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135683#comment-17135683
 ] 

Xiaoqiao He commented on HDFS-15391:


This issue is a duplicate of HDFS-15175; I will close this one. Please track 
it at HDFS-15175.

> Standby NameNode loads the corrupted edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replication file), 
> and when an exception occurs while writing the file, the following 
> operations are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135675#comment-17135675
 ] 

Xiaoqiao He commented on HDFS-15175:


Add [~wanchang] as contributor and assign this issue to him.

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Wan Chang
>Priority: Critical
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp are 
> synchronized to the JournalNode in the same batch. The block used by the 
> two CloseOps is the same instance, which causes the first CloseOp to have 
> the wrong block size. When the SNN rolls the editlog, TruncateOp does not 
> put the file into the UnderConstruction state. Then, when the second 
> CloseOp is executed, the file is not in the UnderConstruction state, and 
> the SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15175:
--

Assignee: Wan Chang  (was: Yicong Cai)

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Wan Chang
>Priority: Critical
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp are 
> synchronized to the JournalNode in the same batch. The block used by the 
> two CloseOps is the same instance, which causes the first CloseOp to have 
> the wrong block size. When the SNN rolls the editlog, TruncateOp does not 
> put the file into the UnderConstruction state. Then, when the second 
> CloseOp is executed, the file is not in the UnderConstruction state, and 
> the SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15412) Add options to set different block scan period for different StorageType

2020-06-15 Thread Yang Yun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yun updated HDFS-15412:

Summary: Add options to set different block scan period for different 
StorageType  (was: Add options to set different scan period for different 
StorageType)

> Add options to set different block scan period for different StorageType
> ---
>
> Key: HDFS-15412
> URL: https://issues.apache.org/jira/browse/HDFS-15412
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15412.001.patch
>
>
> For some cold data, sometimes we don't want to scan it as often as hot 
> data. Add options so that we can set the scan period according to 
> StorageType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15412) Add options to set different scan period for different StorageType

2020-06-15 Thread Yang Yun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yun updated HDFS-15412:

Attachment: HDFS-15412.001.patch
Status: Patch Available  (was: Open)

> Add options to set different scan period for different StorageType
> -
>
> Key: HDFS-15412
> URL: https://issues.apache.org/jira/browse/HDFS-15412
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15412.001.patch
>
>
> For some cold data, sometimes we don't want to scan it as often as hot 
> data. Add options so that we can set the scan period according to 
> StorageType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15412) Add options to set different scan period for different StorageType

2020-06-15 Thread Yang Yun (Jira)
Yang Yun created HDFS-15412:
---

 Summary: Add options to set different scan period for different 
StorageType
 Key: HDFS-15412
 URL: https://issues.apache.org/jira/browse/HDFS-15412
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: Yang Yun
Assignee: Yang Yun


For some cold data, sometimes we don't want to scan it as often as hot data. 
Add options so that we can set the scan period according to StorageType.
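
A minimal sketch of how such a per-StorageType option could be consumed (the per-type key pattern below is hypothetical, not the key proposed in the attached patch; the global dfs.datanode.scan.period.hours key and its 504-hour default already exist):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.StorageType;

public class ScanPeriodByStorageType {
  // Hypothetical key pattern, e.g. dfs.datanode.scan.period.hours.ARCHIVE,
  // falling back to the existing global scan period when unset.
  static long getScanPeriodHours(Configuration conf, StorageType type) {
    long defaultHours = conf.getLong("dfs.datanode.scan.period.hours", 504);
    return conf.getLong("dfs.datanode.scan.period.hours." + type.name(),
        defaultHours);
  }
}
{code}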



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-15 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135615#comment-17135615
 ] 

Jinglun commented on HDFS-15346:


You are a genius, [~linyiqun]! Thanks for your brilliant comments; the 
improvement is great! The unit tests run very fast now. I followed up on all 
the changes, and I did a little refactoring based on your improvement. The 
logic of the improvement is the same as you suggested; I only extracted a 
method and refactored the class RunningJobStatus to make it easier to read. 
Please let me know your thoughts; I'm also OK with keeping it just the same 
as you suggested.

 
{quote}Can you update following description in router option? I update this 
content as well but seems this was not addressed in the latest patch.
{quote}
Sorry, I missed this. Updated in v11.
{quote}Method name cleanUpBeforeInitDistcp can be renamed to 
pathCheckBeforeInitDistcp since we don't do any cleanup operation now.
{quote}
Addressed in v11.

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-15 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15346:
---
Attachment: HDFS-15346.011.patch

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15411) TestNameNodeMXBean.testDecommissioningNodes fails

2020-06-15 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135610#comment-17135610
 ] 

Ayush Saxena commented on HDFS-15411:
-

Thanks [~aajisaka] for reporting. I tried to reproduce this by putting a sleep 
at L367. The check failed, but with a different diff, this time due to 
nonDfsUsed being different. We should replace this check with a relaxed check.
{noformat}
Expected 
:{"127.0.0.1:39745":{"infoAddr":"127.0.0.1:35813","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:39745","lastContact":0,"usedSpace":49152,"adminState":"In
 
Service","nonDfsUsedSpace":303558705152,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219295744,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:33467":{"infoAddr":"127.0.0.1:40379","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:33467","lastContact":0,"usedSpace":49152,"adminState":"In
 
Service","nonDfsUsedSpace":303558705152,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219295744,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:44865":{"infoAddr":"127.0.0.1:44315","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:44865","lastContact":0,"usedSpace":49152,"adminState":"In
 
Service","nonDfsUsedSpace":303558705152,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219295744,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0}}
Actual   
:{"127.0.0.1:39745":{"infoAddr":"127.0.0.1:35813","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:39745","lastContact":0,"usedSpace":49152,"adminState":"In
 
Service","nonDfsUsedSpace":303558696960,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219303936,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:33467":{"infoAddr":"127.0.0.1:40379","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:33467","lastContact":0,"usedSpace":49152,"adminState":"In
 
Service","nonDfsUsedSpace":303558696960,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219303936,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0},"127.0.0.1:44865":{"infoAddr":"127.0.0.1:44315","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:44865","lastContact":0,"usedSpace":49152,"adminState":"In
 
Service","nonDfsUsedSpace":303558696960,"capacity":3935851380736,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":3432219303936,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":1.2488276E-6,"volfails":0,"lastBlockReport":0}}

{noformat}
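
For illustration, one shape such a relaxed check could take (parsing the JSON and comparing only fields that are stable across samples; the helper and the field list are hypothetical):
{code:java}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RelaxedLiveNodesCheck {
  static void assertLiveNodesMatch(String expected, String actual)
      throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode exp = mapper.readTree(expected);
    JsonNode act = mapper.readTree(actual);
    // Skip volatile fields such as nonDfsUsedSpace, remaining and
    // lastBlockReport; compare only the stable per-datanode fields.
    exp.fieldNames().forEachRemaining(dn -> {
      for (String f : new String[] {"xferaddr", "adminState", "numBlocks"}) {
        org.junit.Assert.assertEquals(dn + "/" + f,
            exp.get(dn).get(f), act.get(dn).get(f));
      }
    });
  }
}
{code}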

> TestNameNodeMXBean.testDecommissioningNodes fails
> -
>
> Key: HDFS-15411
> URL: https://issues.apache.org/jira/browse/HDFS-15411
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Priority: Major
>
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/172/testReport/org.apache.hadoop.hdfs.server.namenode/TestNameNodeMXBean/testDecommissioningNodes/
> {noformat}
> org.junit.ComparisonFailure: 
> expected:<...0,"lastBlockReport":[0},"127.0.0.1:35473":{"infoAddr":"127.0.0.1:38399","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:35473","lastContact":0,"usedSpace":49152,"adminState":"In
>  
> Service","nonDfsUsedSpace":325285158912,"capacity":7871746113536,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":7146454327296,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":6.244104E-7,"volfails":0,"lastBlockReport":0},"127.0.0.1:44811":{"infoAddr":"127.0.0.1:39743","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:44811","lastContact":0,"usedSpace":49152,"adminState":"In
>  
> Service","nonDfsUsedSpace":325285158912,"capacity":7871746113536,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":7146454327296,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":6.244104E-7,"volfails":0,"lastBlockReport":0]}}>
>  but 
> was:<...0,"lastBlockReport":[362603},"127.0.0.1:35473":{"infoAddr":"127.0.0.1:38399","infoSecureAddr":"127.0.0.1:0","xferaddr":"127.0.0.1:35473","lastContact":0,"usedSpace":49152,"adminState":"In
>  
> Service","nonDfsUsedSpace":325285158912,"capacity":7871746113536,"numBlocks":0,"version":"3.4.0-SNAPSHOT","used":49152,"remaining":7146454327296,"blockScheduled":0,"blockPoolUsed":49152,"blockPoolUsedPercent":6.244104E-7,"volfails":0,"lastBlockReport":362603},"127.0.0.1:448