Rushabh Shah created HDFS-16829:
-----------------------------------

             Summary: Delay deleting blocks with older generation stamp until 
the block is fully replicated.
                 Key: HDFS-16829
                 URL: https://issues.apache.org/jira/browse/HDFS-16829
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, namenode
    Affects Versions: 2.10.1
            Reporter: Rushabh Shah


We encountered this data loss issue in one of our production clusters, which 
runs the HBase service. We received a missing block alert in this cluster, and 
the following error was logged on the datanode holding the block.
{noformat}
2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
/nodeA:31722 dst: <destination>
java.io.IOException:  Offset 64410559 and length 4096 don't match block 
BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
        at java.lang.Thread.run(Thread.java:750)
{noformat}

The node +nodeA+ has this block blk_3317546151_2244173222 with an on-disk file 
length of 59158528, but the length of this block according to the namenode 
(per fsck) is 64414655.
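
For context, the READ_BLOCK failure above comes from a bounds check of the 
requested read range against the replica length the datanode actually has on 
disk. Below is a minimal, self-contained sketch of that kind of check using 
the numbers from the log; the class and method names are illustrative, not 
the exact BlockSender code.
{code:java}
import java.io.IOException;

// Minimal model of the read-range check that produces the logged error.
// Names are illustrative; the real check lives in BlockSender's constructor.
public class BlockLengthCheck {
  static void checkReadRange(long startOffset, long length, long replicaLen)
      throws IOException {
    // Reject reads that start past, or run past, the end of the local replica.
    if (startOffset < 0 || startOffset > replicaLen
        || startOffset + length > replicaLen) {
      throw new IOException("Offset " + startOffset + " and length " + length
          + " don't match block ( blockLen " + replicaLen + " )");
    }
  }

  public static void main(String[] args) throws IOException {
    // The namenode believes the block is 64414655 bytes (64410559 + 4096), so
    // readers ask for ranges beyond nodeA's 59158528-byte replica and fail.
    checkReadRange(64410559L, 4096L, 59158528L); // throws IOException
  }
}
{code}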

This is the sequence of events for this block.

1. The namenode created a file with 3 replicas with block id blk_3317546151 and 
genstamp 2244173147.
2. The first datanode in the pipeline (this physical host was also running the 
region server process, which was the hdfs client) was restarting at the same 
time. Unfortunately this node was sick and logged nothing in either the 
datanode process or the regionserver process during the time of block creation.
3. The namenode updated the pipeline with just the first node.
4. The namenode logged updatePipeline success with just the 1st node nodeA, 
with block size 64414655 and new generation stamp 2244173222.
5. The namenode asked nodeB and nodeC to delete the block since it had the old 
generation stamp.
6. All the reads (client reads and data transfer reads) from nodeA are failing 
with the above stack trace.

See logs below from the namenode, nodeB and nodeC; a simplified model of the 
deletion in step 5 follows the logs.


{noformat}
---------------- Logs from namenode  -----------------
2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - BLOCK* 
allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 , 
nodeC:51010 for <file>
2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
updatePipeline(blk_3317546151_2244173147, newGS=2244173222, newLength=64414655, 
newNodes=[nodeA:51010], client=DFSClient_NONMAPREDUCE_1038417265_1)
2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
completeFile: <file> is closed by DFSClient_NONMAPREDUCE_1038417265_1
{noformat}

{noformat}
-------------  Logs from nodeB -------------
2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
from nodeA:30302
2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
blk_3317546151_2244173147 file 
/data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
 
{noformat}
-------------  Logs from nodeC -------------
2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
from nodeB:56486
2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
type=LAST_IN_PIPELINE terminating
2022-10-23 12:36:38,891 INFO  [/data-1/hdfs/current] 
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
blk_3317546151_2244173147 file 
/data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
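
To make step 5 above concrete, here is a simplified, self-contained model of 
the observed behaviour: once updatePipeline bumps the generation stamp, 
replicas still carrying the old genstamp are deleted right away, regardless of 
how many copies of the new-genstamp block exist. This is only an illustration 
of the behaviour, not the actual BlockManager code.
{code:java}
import java.util.Arrays;
import java.util.List;

// Simplified model of the current behaviour (illustrative names only):
// any replica whose genstamp is older than the committed one is deleted
// immediately, even if the single up-to-date replica later turns out bad.
public class GenStampInvalidationModel {
  static class Replica {
    final String node;
    final long genStamp;
    Replica(String node, long genStamp) { this.node = node; this.genStamp = genStamp; }
  }

  public static void main(String[] args) {
    long newGs = 2244173222L;
    long oldGs = 2244173147L;
    List<Replica> replicas = Arrays.asList(
        new Replica("nodeA", newGs),  // only node left in the pipeline
        new Replica("nodeB", oldGs),  // had all 64414655 bytes on disk
        new Replica("nodeC", oldGs)); // had all 64414655 bytes on disk

    for (Replica r : replicas) {
      if (r.genStamp < newGs) {
        // In the incident this delete happened within seconds on nodeB/nodeC,
        // leaving nodeA's short 59158528-byte copy as the only replica.
        System.out.println("delete old-genstamp replica on " + r.node);
      }
    }
  }
}
{code}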

This is my observation, but I can't confirm anything since I don't have logs 
from the first node in the pipeline (nodeA) or from the hdfs client:
Due to disk write errors or a bug in BlockReceiver, nodeA didn't flush/sync the 
last ~5 MB (64414655 - 59158528 bytes) of data to disk. I assume the data was 
buffered in memory, since the namenode got an acknowledgement from the client 
that updatePipeline succeeded.


*Improvement:*
One suggested improvement is to delay deleting replicas with an old generation 
stamp until the block is sufficiently replicated to satisfy its replication 
factor.
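
A rough sketch of the intended check, under the assumption that the namenode 
can see how many live replicas carry the new generation stamp; the method and 
parameter names below are hypothetical, not an existing BlockManager API.
{code:java}
// Hypothetical sketch of the proposal (illustrative names, not real HDFS APIs):
// keep old-genstamp replicas as a last-resort recovery source until the block
// with the new genstamp has enough live replicas to satisfy the replication factor.
public class DelayedInvalidationSketch {
  static boolean mayInvalidateOldReplica(int liveReplicasWithNewGenStamp,
                                         int replicationFactor) {
    return liveReplicasWithNewGenStamp >= replicationFactor;
  }

  public static void main(String[] args) {
    // In this incident: replication factor 3, but only nodeA held the new
    // genstamp, so nodeB's and nodeC's copies would have been retained.
    System.out.println(mayInvalidateOldReplica(1, 3)); // false -> keep old copies
  }
}
{code}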








