[ https://issues.apache.org/jira/browse/HDFS-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manoj Govindassamy updated HDFS-10530:
--------------------------------------
    Attachment: HDFS-10530.5.patch

Thanks for the detailed review comments, [~tasanuma0829] and [~andrew.wang]. 
Much appreciated. Attaching v5 patch with the comments addressed. Please take a 
look.

bq. It would be more readable if the names of the additional DNs are different 
from the first DNs.
Sure, the hosts added to the same racks now have distinct names, as per your suggestion.

bq. We can use DFSTestUtil.waitForReplication instead of the 
GenericTestUtils.waitFor.
Good idea. Replaced GenericTestUtils.waitFor() with DFSTestUtil.waitForReplication().
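
For context, the swap inside the test is roughly along these lines (a minimal sketch; the file, block counts, and timeout are illustrative, not the exact values in the patch):
{code}
// Sketch only: wait until all internal blocks of the striped file are
// reported, instead of hand-rolling the polling with GenericTestUtils.waitFor().
// dfs, ecFile, dataBlocks and parityBlocks are assumed to exist in the test.
DFSTestUtil.waitForReplication(dfs, ecFile,
    (short) (dataBlocks + parityBlocks), 30 /* max wait in seconds */);
{code}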

bq. Are these necessarily the parity blocks, or could they be any of the blocks 
that are co-located on the first 6 racks?
DFSStripedOutputStream verifies that the number of allocated block locations is 
at least numDataBlocks; otherwise it throws an IOException and the client 
halts. So the relaxation applies only to the parity blocks.
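
Paraphrasing the check in DFSStripedOutputStream#allocateNewBlock (a simplified sketch, not the exact code):
{code}
// Simplified paraphrase of the allocation check, not verbatim from the source:
DatanodeInfo[] locs = locatedBlock.getLocations();
if (locs.length < numDataBlocks) {
  // Not even enough nodes for the data blocks: the writer cannot continue.
  throw new IOException("Failed to get " + numDataBlocks + " nodes from namenode");
}
// Missing locations beyond numDataBlocks affect only parity blocks; those
// internal blocks are skipped with the warnings shown below.
{code}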

{code}
[Thread-5] WARN  hdfs.DFSOutputStream (DFSStripedOutputStream.java:allocateNewBlock(497)) - Failed to get block location for parity block, index=6
[Thread-5] WARN  hdfs.DFSOutputStream (DFSStripedOutputStream.java:allocateNewBlock(497)) - Failed to get block location for parity block, index=7
[Thread-5] WARN  hdfs.DFSOutputStream (DFSStripedOutputStream.java:allocateNewBlock(497)) - Failed to get block location for parity block, index=8
{code}

So, upon file stream close we get the following messages (the final warning is 
not strictly accurate) when the parity blocks have not been written out.

{code}
INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(2726)) - BLOCK* blk_-9223372036854775792_1002 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 6) in file /ec/test1
INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2679)) - DIR* completeFile: /ec/test1 is closed by DFSClient_NONMAPREDUCE_-1900076771_17
WARN  hdfs.DFSOutputStream (DFSStripedOutputStream.java:logCorruptBlocks(1117)) - Block group <1> has 3 corrupt blocks. It's at high risk of losing data.
{code}

bq. Also, does this happen via EC reconstruction, or do we simply copy the 
blocks over to the new racks?
Upon addition of the 3 new hosts to the existing racks, and after the 
heartbeat, we get a follow-up command {{DNA_ERASURE_CODING_RECOVERY}}, and I 
see the following, which looks like a copy of the blocks from the existing 
DataNodes.

{code}
INFO  datanode.DataNode (DataXceiver.java:writeBlock(717)) - Receiving BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775786_1002 src: /127.0.0.1:63711 dest: /127.0.0.1:63701
INFO  datanode.DataNode (DataXceiver.java:writeBlock(717)) - Receiving BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775785_1002 src: /127.0.0.1:63712 dest: /127.0.0.1:63697
INFO  datanode.DataNode (DataXceiver.java:writeBlock(717)) - Receiving BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775784_1002 src: /127.0.0.1:63713 dest: /127.0.0.1:63693
INFO  datanode.DataNode (DataXceiver.java:writeBlock(893)) - Received BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775786_1002 src: /127.0.0.1:63711 dest: /127.0.0.1:63701 of size 65536
INFO  datanode.DataNode (DataXceiver.java:writeBlock(893)) - Received BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775785_1002 src: /127.0.0.1:63712 dest: /127.0.0.1:63697 of size 65536
{code}

bq. Is the BPP violated before entering the waitFor? If so we should assert 
that. This may require pausing reconstruction work and resuming later.
BPP is not violated before or after the addition of the 3 new hosts to the 
existing racks, since there are only 6 racks, which is fewer than the optimal 
9 racks. One more assert after waitFor() has been added now.
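
The added assertion is roughly along these lines (an illustrative sketch; the variable names and the way the stored block is obtained are assumptions, not the exact patch):
{code}
// Sketch only: assert that the block group satisfies the EC placement policy
// once reconstruction has caught up. Assumes the test can reach BlockManager
// (e.g. it lives in the blockmanagement test package).
ExtendedBlock firstBlock = DFSTestUtil.getFirstBlock(dfs, ecFile);
BlockManager bm = cluster.getNamesystem().getBlockManager();
BlockInfo storedBlock = bm.getStoredBlock(firstBlock.getLocalBlock());
assertTrue("EC block group should satisfy the placement policy",
    bm.isPlacementPolicySatisfied(storedBlock));
{code}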

bq. Do you think TestBPPRackFaultTolerant needs any additional unit tests along 
these lines?
Sure, I will discuss this with you.

bq. Looks like these have the same names as the initial DNs as Takanobu noted. 
Might be nice to specify the racks too to be explicit.
Done.
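
For the record, the added DataNodes now spell out both hostnames and racks, roughly like this (a sketch; the actual names used in the patch may differ):
{code}
// Sketch only: start 3 more DataNodes with explicit hostnames and racks so the
// test reads unambiguously. The host and rack names here are illustrative.
cluster.startDataNodes(conf, 3, true, null,
    new String[] {"/rack1", "/rack2", "/rack3"},
    new String[] {"host7", "host8", "host9"},
    null);
cluster.waitActive();
{code}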

bq. If we later enhance the NN to automatically fix up misplaced EC blocks, 
this assert will be flaky. Maybe add a comment?
That's right; my intention is to use this test to also verify the other 
proposed fix of automatically correcting misplaced EC blocks. Sure, added a 
comment on this verification and a TODO.


> BlockManager reconstruction work scheduling should correctly adhere to EC 
> block placement policy
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10530
>                 URL: https://issues.apache.org/jira/browse/HDFS-10530
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Rui Gao
>            Assignee: Manoj Govindassamy
>              Labels: hdfs-ec-3.0-nice-to-have
>         Attachments: HDFS-10530.1.patch, HDFS-10530.2.patch, 
> HDFS-10530.3.patch, HDFS-10530.4.patch, HDFS-10530.5.patch
>
>
> This issue was found by [~tfukudom].
> Under the RS-DEFAULT-6-3-64k EC policy:
> 1. Create an EC file; the file was written to all 5 racks (2 DNs each) of the cluster.
> 2. Reconstruction work would be scheduled if a 6th rack is added.
> 3. Adding a 7th or more racks does not trigger reconstruction work.
> Based on the default EC block placement policy defined in 
> “BlockPlacementPolicyRackFaultTolerant.java”, an EC file should be able to be 
> distributed across 9 racks if possible.
> In *BlockManager#isPlacementPolicySatisfied(BlockInfo storedBlock)*, 
> *numReplicas* of striped blocks should probably be *getRealTotalBlockNum()* 
> instead of *getRealDataBlockNum()*.


