[ 
https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026215#comment-14026215
 ] 

Binglin Chang commented on HDFS-6506:
-------------------------------------

The Balancer already sleeps 2*DFS_HEARTBEAT_INTERVAL seconds between rounds, 
but TestBalancer.java sets:
{code}
    conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 1L);
{code}
The speed of replica state updates also depends on 
DFS_NAMENODE_REPLICATION_INTERVAL, which is 3 seconds by default.
TestBalancer only changes the heartbeat interval (which affects both the 
heartbeat and the balancer's sleep time between iterations), but it doesn't 
change the ReplicationMonitor check interval, so the sleep time is too short 
for the movements to be committed.
Separately, 2*DFS_HEARTBEAT_INTERVAL still seems a little risky. Maybe change 
it to 2*DFS_HEARTBEAT_INTERVAL + DFS_NAMENODE_REPLICATION_INTERVAL.
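
To make the timing argument concrete, here is a minimal sketch of the arithmetic only. It does not use the Hadoop API: the class and method names are hypothetical, and the values (heartbeat interval of 1 second from TestBalancer, replication interval of 3 seconds by default) are the ones mentioned above.

```java
public class BalancerSleepSketch {
    // Values from the comment above: TestBalancer sets the heartbeat interval
    // to 1 second; the replication interval is 3 seconds by default.
    public static final long TEST_HEARTBEAT_SEC = 1L;
    public static final long DEFAULT_REPLICATION_INTERVAL_SEC = 3L;

    /** Sleep between balancer rounds as described above: 2 * heartbeat. */
    public static long currentSleepSec(long heartbeatSec) {
        return 2 * heartbeatSec;
    }

    /** Proposed sleep: 2 * heartbeat + replication interval. */
    public static long proposedSleepSec(long heartbeatSec, long replicationIntervalSec) {
        return 2 * heartbeatSec + replicationIntervalSec;
    }

    public static void main(String[] args) {
        long current = currentSleepSec(TEST_HEARTBEAT_SEC);
        long proposed = proposedSleepSec(TEST_HEARTBEAT_SEC, DEFAULT_REPLICATION_INTERVAL_SEC);

        System.out.println("current sleep  = " + current + "s");
        System.out.println("proposed sleep = " + proposed + "s");
        // With the test settings the current sleep is shorter than one
        // ReplicationMonitor interval, which is the gap described above.
        System.out.println("current sleep covers one replication interval: "
            + (current >= DEFAULT_REPLICATION_INTERVAL_SEC));
    }
}
```

With the test settings the current sleep works out to 2 seconds against a 3 second ReplicationMonitor interval, while the proposed formula gives 5 seconds. Lowering the replication interval in the test configuration would presumably be another way to close the gap.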


> Newly moved block replica been invalidated and deleted
> ------------------------------------------------------
>
>                 Key: HDFS-6506
>                 URL: https://issues.apache.org/jira/browse/HDFS-6506
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>
> TestBalancerWithNodeGroup#testBalancerWithNodeGroup has been failing recently:
> https://builds.apache.org/job/PreCommit-HDFS-Build/7045//testReport/
> From the error log, the reason seems to be that newly moved block replicas 
> are invalidated and deleted, so some of the balancer's work is reversed.
> {noformat}
> 2014-06-06 18:15:51,681 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741834_1010 with size=100 from 127.0.0.1:49159 
> to 127.0.0.1:55468 through 127.0.0.1:49159
> 2014-06-06 18:15:51,683 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741833_1009 with size=100 from 127.0.0.1:49159 
> to 127.0.0.1:55468 through 127.0.0.1:49159
> 2014-06-06 18:15:51,683 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741830_1006 with size=100 from 127.0.0.1:49159 
> to 127.0.0.1:55468 through 127.0.0.1:49159
> 2014-06-06 18:15:51,683 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741831_1007 with size=100 from 127.0.0.1:49159 
> to 127.0.0.1:55468 through 127.0.0.1:49159
> 2014-06-06 18:15:51,682 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741832_1008 with size=100 from 127.0.0.1:49159 
> to 127.0.0.1:55468 through 127.0.0.1:49159
> 2014-06-06 18:15:54,702 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741827_1003 with size=100 from 127.0.0.1:49159 
> to 127.0.0.1:55468 through 127.0.0.1:49159
> 2014-06-06 18:15:54,702 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741828_1004 with size=100 from 127.0.0.1:49159 
> to 127.0.0.1:55468 through 127.0.0.1:49159
> 2014-06-06 18:15:54,701 INFO  balancer.Balancer (Balancer.java:dispatch(370)) 
> - Successfully moved blk_1073741829_1005 with size=100 fr
> 2014-06-06 18:15:54,706 INFO  BlockStateChange 
> (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* 
> chooseExcessReplicates: (127.0.0.1:55468, blk_1073741833_1009) is added to 
> invalidated blocks set
> 2014-06-06 18:15:54,709 INFO  BlockStateChange 
> (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* 
> chooseExcessReplicates: (127.0.0.1:55468, blk_1073741834_1010) is added to 
> invalidated blocks set
> 2014-06-06 18:15:56,421 INFO  BlockStateChange 
> (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 
> 127.0.0.1:55468 to delete [blk_1073741833_1009, blk_1073741834_1010]
> 2014-06-06 18:15:57,717 INFO  BlockStateChange 
> (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* 
> chooseExcessReplicates: (127.0.0.1:55468, blk_1073741832_1008) is added to 
> invalidated blocks set
> 2014-06-06 18:15:57,720 INFO  BlockStateChange 
> (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* 
> chooseExcessReplicates: (127.0.0.1:55468, blk_1073741827_1003) is added to 
> invalidated blocks set
> 2014-06-06 18:15:57,721 INFO  BlockStateChange 
> (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* 
> chooseExcessReplicates: (127.0.0.1:55468, blk_1073741830_1006) is added to 
> invalidated blocks set
> 2014-06-06 18:15:57,722 INFO  BlockStateChange 
> (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* 
> chooseExcessReplicates: (127.0.0.1:55468, blk_1073741831_1007) is added to 
> invalidated blocks set
> 2014-06-06 18:15:57,723 INFO  BlockStateChange 
> (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* 
> chooseExcessReplicates: (127.0.0.1:55468, blk_1073741829_1005) is added to 
> invalidated blocks set
> 2014-06-06 18:15:59,422 INFO  BlockStateChange 
> (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 
> 127.0.0.1:55468 to delete [blk_1073741827_1003, blk_1073741829_1005, 
> blk_1073741830_1006, blk_1073741831_1007, blk_1073741832_1008]
> 2014-06-06 18:16:02,423 INFO  BlockStateChange 
> (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 
> 127.0.0.1:55468 to delete [blk_1073741845_1021]
> {noformat}
> Normally this should not happen: when moving a block from src to dest, the 
> replica on src should be invalidated, not the one on dest. There must be a 
> bug in the related logic. 
> I don't think TestBalancerWithNodeGroup#testBalancerWithNodeGroup caused 
> this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
