[ https://issues.apache.org/jira/browse/HDFS-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manoj Govindassamy updated HDFS-10530:
--------------------------------------
    Attachment: HDFS-10530.5.patch

Thanks for the detailed review comments [~tasanuma0829] and [~andrew.wang]. Much appreciated. Attaching v5 patch with the comments addressed. Please take a look.

bq. It would be more readable if the names of the additional DNs are different from the first DNs.

Sure, modified the same-rack hosts as per your suggestion.

bq. We can use DFSTestUtil.waitForReplication instead of the GenericTestUtils.waitFor.

Good idea. Replaced GenericTestUtils.waitFor() with DFSTestUtil.waitForReplication().

bq. Are these necessarily the parity blocks, or could they be any of the blocks that are co-located on the first 6 racks?

DFSStripedOutputStream verifies that the allocated block locations length is at least equal to numDataBlocks; otherwise it throws an IOException and the client halts. So the relaxation applies only to the parity blocks.

{code}
[Thread-5] WARN hdfs.DFSOutputStream (DFSStripedOutputStream.java:allocateNewBlock(497)) - Failed to get block location for parity block, index=6
[Thread-5] WARN hdfs.DFSOutputStream (DFSStripedOutputStream.java:allocateNewBlock(497)) - Failed to get block location for parity block, index=7
[Thread-5] WARN hdfs.DFSOutputStream (DFSStripedOutputStream.java:allocateNewBlock(497)) - Failed to get block location for parity block, index=8
{code}

So, upon file stream close, we get the following warning message (though not accurate) when the parity blocks are not yet written out.

{code}
INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(2726)) - BLOCK* blk_-9223372036854775792_1002 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 6) in file /ec/test1
INFO hdfs.StateChange (FSNamesystem.java:completeFile(2679)) - DIR* completeFile: /ec/test1 is closed by DFSClient_NONMAPREDUCE_-1900076771_17
WARN hdfs.DFSOutputStream (DFSStripedOutputStream.java:logCorruptBlocks(1117)) - Block group <1> has 3 corrupt blocks.
It's at high risk of losing data.
{code}

bq. Also, does this happen via EC reconstruction, or do we simply copy the blocks over to the new racks?

Upon addition of 3 new hosts to the existing racks, and after the heartbeat, we get a follow-up {{DNA_ERASURE_CODING_RECOVERY}} command, and I see the following, which looks like a copy of the blocks from existing DataNodes.

{code}
INFO datanode.DataNode (DataXceiver.java:writeBlock(717)) - Receiving BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775786_1002 src: /127.0.0.1:63711 dest: /127.0.0.1:63701
INFO datanode.DataNode (DataXceiver.java:writeBlock(717)) - Receiving BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775785_1002 src: /127.0.0.1:63712 dest: /127.0.0.1:63697
INFO datanode.DataNode (DataXceiver.java:writeBlock(717)) - Receiving BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775784_1002 src: /127.0.0.1:63713 dest: /127.0.0.1:63693
INFO datanode.DataNode (DataXceiver.java:writeBlock(893)) - Received BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775786_1002 src: /127.0.0.1:63711 dest: /127.0.0.1:63701 of size 65536
INFO datanode.DataNode (DataXceiver.java:writeBlock(893)) - Received BP-1357293931-172.16.3.66-1489688993295:blk_-9223372036854775785_1002 src: /127.0.0.1:63712 dest: /127.0.0.1:63697 of size 65536
{code}

bq. Is the BPP violated before entering the waitFor? If so we should assert that. This may require pausing reconstruction work and resuming later.

BPP is not violated before or after the addition of the 3 new hosts in the existing racks, as there are only 6 racks, which is fewer than the optimal 9 racks. One more assert after waitFor() is added now.

bq. Do you think TestBPPRackFaultTolerant needs any additional unit tests along these lines?

Sure, will discuss this with you.

bq. Looks like these have the same names as the initial DNs as Takanobu noted. Might be nice to specify the racks too to be explicit.

Done.

bq.
If we later enhance the NN to automatically fix up misplaced EC blocks, this assert will be flaky. Maybe add a comment?

That's right; my intention is to verify the other proposed fix (automatic correction of misplaced EC blocks) via this test. Sure, added a comment on this verification and a TODO.

> BlockManager reconstruction work scheduling should correctly adhere to EC
> block placement policy
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10530
>                 URL: https://issues.apache.org/jira/browse/HDFS-10530
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Rui Gao
>            Assignee: Manoj Govindassamy
>              Labels: hdfs-ec-3.0-nice-to-have
>         Attachments: HDFS-10530.1.patch, HDFS-10530.2.patch, HDFS-10530.3.patch, HDFS-10530.4.patch, HDFS-10530.5.patch
>
> This issue was found by [~tfukudom].
> Under the RS-DEFAULT-6-3-64k EC policy:
> 1. Create an EC file; the file was written to all 5 racks (2 DNs each) of the cluster.
> 2. Reconstruction work would be scheduled if a 6th rack is added.
> 3. Adding a 7th or more racks will not trigger reconstruction work.
> Based on the default EC block placement policy defined in "BlockPlacementPolicyRackFaultTolerant.java", an EC file should be able to be scheduled to distribute to 9 racks if possible.
> In *BlockManager#isPlacementPolicySatisfied(BlockInfo storedBlock)*, *numReplicas* of striped blocks should perhaps be *getRealTotalBlockNum()* instead of *getRealDataBlockNum()*.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
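The effect of switching from getRealDataBlockNum() to getRealTotalBlockNum() can be illustrated with a small self-contained sketch. This is plain Java, not Hadoop source; the class and method names are simplified stand-ins, and the rack-counting rule is a minimal model of what BlockPlacementPolicyRackFaultTolerant aims for (one block per rack where possible). With RS-6-3, counting only the 6 data blocks lets a block group spread over just 7 racks pass the placement check, so no reconstruction work gets scheduled for the remaining racks; counting all 9 blocks correctly flags the layout as violating the policy.

{code}
import java.util.Arrays;
import java.util.HashSet;

public class EcPlacementSketch {
    // RS-6-3 schema: 6 data blocks + 3 parity blocks per block group.
    static final int DATA_BLOCKS = 6;
    static final int PARITY_BLOCKS = 3;

    // Simplified model of the placement check: the blocks of a group
    // should span min(blockCount, racksInCluster) distinct racks.
    static boolean isPlacementSatisfied(String[] blockRacks, int clusterRacks,
                                        boolean countParity) {
        int blocks = countParity ? DATA_BLOCKS + PARITY_BLOCKS : DATA_BLOCKS;
        int distinctRacks = new HashSet<>(Arrays.asList(blockRacks)).size();
        return distinctRacks >= Math.min(blocks, clusterRacks);
    }

    public static void main(String[] args) {
        // 9 blocks of one group squeezed onto only 7 racks of a 9-rack cluster.
        String[] racks = {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r7", "r7"};
        // Counting only data blocks (the reported behavior): 7 >= 6,
        // so the placement looks satisfied and nothing is rescheduled.
        System.out.println(isPlacementSatisfied(racks, 9, false)); // true
        // Counting data + parity (the proposed fix): 7 < 9, violation detected.
        System.out.println(isPlacementSatisfied(racks, 9, true));  // false
    }
}
{code}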