Rushabh Shah created HDFS-16829:
-----------------------------------
Summary: Delay deleting blocks with older generation stamp until
the block is fully replicated.
Key: HDFS-16829
URL: https://issues.apache.org/jira/browse/HDFS-16829
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode, namenode
Affects Versions: 2.10.1
Reporter: Rushabh Shah
We encountered this data loss issue in one of our production clusters, which
runs the HBase service. We received a missing-block alert in this cluster. The
following error was logged on the datanode holding the block.
{noformat}
2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode -
nodeA:51010:DataXceiver error processing READ_BLOCK operation src:
/nodeA:31722 dst: <destination>
java.io.IOException: Offset 64410559 and length 4096 don't match block
BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
at
org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
at java.lang.Thread.run(Thread.java:750)
{noformat}
The node +nodeA+ has this block blk_3317546151_2244173222 with file length
59158528, but the length of this block according to the namenode is 64414655
(according to fsck).
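The stack trace above comes from a range check in BlockSender: a read whose offset plus length extends past the replica's on-disk length is rejected. A simplified, self-contained sketch of that check (not the actual Hadoop code), using the values from the trace:

```java
// Simplified illustration (NOT the actual BlockSender code) of the
// length-consistency check that fails above: a read request whose
// offset + length extends past the on-disk replica length is rejected.
import java.io.IOException;

public class BlockLengthCheck {

    // Throws if the requested byte range is not fully covered by the replica.
    static void checkReadRange(long offset, long length, long blockLen)
            throws IOException {
        if (offset + length > blockLen) {
            throw new IOException("Offset " + offset + " and length " + length
                    + " don't match block ( blockLen " + blockLen + " )");
        }
    }

    public static void main(String[] args) {
        // Values from the stack trace: the NameNode believes the block is
        // 64414655 bytes (64410559 + 4096), but nodeA only has 59158528
        // bytes on disk, so any read of the block's tail fails.
        try {
            checkReadRange(64410559L, 4096L, 59158528L);
            System.out.println("read allowed");
        } catch (IOException e) {
            System.out.println("read rejected: " + e.getMessage());
        }
    }
}
```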
This is the sequence of events for this block:
1. Namenode created a file with 3 replicas with block id: blk_3317546151 and
genstamp: 2244173147.
2. The first datanode in the pipeline (this physical host was also running the
region server process, which was the hdfs client) was restarting at the same
time. Unfortunately this node was unhealthy and didn't log anything in either
the datanode process or the regionserver process during the time of block
creation.
3. Namenode updated the pipeline just with the first node.
4. Namenode logged updatePipeline success with just 1st node nodeA with block
size: 64414655 and new generation stamp: 2244173222
5. Namenode asked nodeB and nodeC to delete the block since they had the old
generation stamp.
6. All the reads (client reads and data transfer reads) from nodeA are failing
with the above stack trace.
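The staleness decision in step 5 can be sketched as follows; this is an illustrative simplification, not the NameNode's actual code. Any replica carrying a generation stamp older than the block's current stamp is treated as stale and scheduled for deletion, regardless of how many up-to-date replicas actually exist:

```java
// Hypothetical sketch (not Hadoop's actual code) of why nodeB and nodeC
// were told to delete their copies: after updatePipeline bumped the
// generation stamp to 2244173222, replicas still carrying the old stamp
// 2244173147 look stale to the NameNode, even though nodeA's sole
// up-to-date replica may be incomplete on disk.
public class GenStampCheck {

    // A replica is considered stale if its generation stamp is older
    // than the block's current generation stamp.
    static boolean isStale(long replicaGenStamp, long blockGenStamp) {
        return replicaGenStamp < blockGenStamp;
    }

    public static void main(String[] args) {
        long newGenStamp = 2244173222L; // after updatePipeline
        long oldGenStamp = 2244173147L; // what nodeB and nodeC still have
        System.out.println("nodeB/nodeC replica stale: "
                + isStale(oldGenStamp, newGenStamp)); // true -> deleted
        System.out.println("nodeA replica stale: "
                + isStale(newGenStamp, newGenStamp)); // false -> kept
    }
}
```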
See the logs below from the namenode, nodeB, and nodeC.
{noformat}
---------------- Logs from namenode -----------------
2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - BLOCK*
allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 ,
nodeC:51010 for <file>
2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem -
updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem -
updatePipeline(blk_3317546151_2244173147, newGS=2244173222, newLength=64414655,
newNodes=[nodeA:51010], client=DFSClient_NONMAPREDUCE_1038417265_1)
2022-10-23 12:36:35,004 INFO [on default port 8020] hdfs.StateChange - DIR*
completeFile: <file> is closed by DFSClient_NONMAPREDUCE_1038417265_1
{noformat}
{noformat}
------------- Logs from nodeB -------------
2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode -
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655
from nodeA:30302
2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode -
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147,
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
2022-10-23 12:36:39,738 INFO [/data-2/hdfs/current]
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029
blk_3317546151_2244173147 file
/data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
{noformat}
------------- Logs from nodeC -------------
2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode -
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655
from nodeB:56486
2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode -
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147,
type=LAST_IN_PIPELINE terminating
2022-10-23 12:36:38,891 INFO [/data-1/hdfs/current]
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029
blk_3317546151_2244173147 file
/data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
This is my observation, but I can't confirm anything since I don't have logs
from the first node in the pipeline (nodeA) or from the hdfs client.
Due to disk write errors or a bug in BlockReceiver, nodeA didn't flush/sync the
last ~5 MB (64414655 - 59158528 bytes) of data to disk. I assume it was
buffered in memory, since the namenode got an acknowledgement from the client
that updatePipeline succeeded.
*Improvement:*
One suggested improvement is to delay deleting blocks with the old generation
stamp until the block is sufficiently replicated to satisfy the replication
factor.
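A minimal sketch of the suggested check, assuming a hypothetical Replica type and shouldInvalidate helper (these names are illustrative, not Hadoop APIs): an old-genstamp replica is invalidated only once the block already has enough replicas at the latest generation stamp.

```java
// Hedged sketch of the suggested improvement: before invalidating a
// replica with an outdated generation stamp, check that enough replicas
// with the current stamp are live. Replica and shouldInvalidate are
// illustrative names, not actual Hadoop types.
import java.util.List;

public class DelayedInvalidation {

    record Replica(String node, long genStamp) {}

    // Only allow deletion of an old-genstamp replica once the block
    // already has at least `replicationFactor` replicas at the latest
    // generation stamp.
    static boolean shouldInvalidate(Replica r, List<Replica> all,
                                    long latestGenStamp, int replicationFactor) {
        if (r.genStamp() >= latestGenStamp) {
            return false; // replica is up to date; never invalidate here
        }
        long upToDate = all.stream()
                .filter(x -> x.genStamp() >= latestGenStamp)
                .count();
        return upToDate >= replicationFactor;
    }

    public static void main(String[] args) {
        long latest = 2244173222L;
        List<Replica> replicas = List.of(
                new Replica("nodeA", latest),
                new Replica("nodeB", 2244173147L),
                new Replica("nodeC", 2244173147L));
        // With replication factor 3 and only one up-to-date replica,
        // the stale copies on nodeB/nodeC would be kept for now.
        System.out.println(
                shouldInvalidate(replicas.get(1), replicas, latest, 3));
    }
}
```

Under this policy, the incident above would have left the (old but complete) replicas on nodeB and nodeC intact until nodeA's replica was re-replicated.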