Dimas Shidqi Parikesit created HDFS-17768:
---------------------------------------------
Summary: Observer namenode network delay causing empty block
location for getBatchedListing
Key: HDFS-17768
URL: https://issues.apache.org/jira/browse/HDFS-17768
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.4.1
Reporter: Dimas Shidqi Parikesit
In our testing with the latest HDFS version (e8a64d0), we found a case similar
to HDFS-16732 in getBatchedListing. During a getBatchedListing call, if
the block report of the observer namenode is delayed, one or more of the
listing results can contain blocks without locations.
Steps to reproduce this bug:
# Start a cluster with 1 observer namenode
# Create an empty file
# Inject network delay between observer nn and active nn to delay block report
(or add sleep to the BlockReportProcessingThread of the observer).
# Append to the file to add a block
# Send a batchedListPaths request using the client API
# Check that the result contains a block without a location
In HDFS-16732 and HDFS-13924, a check was added in getBlockLocations,
getFileInfo, and getListing that checks whether the found blocks have valid
locations. Missing locations indicate that the observer namenode is not
up-to-date compared to the active namenode.
We propose adding the same check to getBatchedListing. If any of the
sub-listings returns blocks without locations, it will throw
ObserverRetryOnActiveException and exit the function early. The entire
batchedListing request will then be retried on the active namenode.
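The proposed check could look roughly like the sketch below. It is not the
actual patch: Block, FileEntry, checkBatchedListing and the local
ObserverRetryOnActiveException class are simplified stand-ins written for this
illustration, so that the example compiles without Hadoop on the classpath. It
assumes the check only applies when the request is served by an observer, and
that a file with no blocks is not treated as stale.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BatchedListingCheck {

    /** Stand-in for a located block; only its datanode locations matter here. */
    static final class Block {
        final List<String> locations;
        Block(List<String> locations) { this.locations = locations; }
    }

    /** Stand-in for one file entry inside a sub-listing. */
    static final class FileEntry {
        final String path;
        final List<Block> blocks;
        FileEntry(String path, List<Block> blocks) {
            this.path = path;
            this.blocks = blocks;
        }
    }

    /** Local stand-in for Hadoop's exception of the same name. */
    static final class ObserverRetryOnActiveException extends Exception {
        ObserverRetryOnActiveException(String msg) { super(msg); }
    }

    /** True if at least one block of the file has no known location. */
    static boolean hasBlockWithoutLocation(FileEntry file) {
        for (Block b : file.blocks) {
            if (b.locations.isEmpty()) {
                return true;
            }
        }
        return false;
    }

    /**
     * Scans every sub-listing of a batched listing. On an observer, the first
     * block without a location aborts the whole request so that the caller
     * can retry it on the active namenode.
     */
    static void checkBatchedListing(List<List<FileEntry>> subListings,
                                    boolean servedByObserver)
            throws ObserverRetryOnActiveException {
        if (!servedByObserver) {
            return; // the active namenode is authoritative, nothing to check
        }
        for (List<FileEntry> subListing : subListings) {
            for (FileEntry file : subListing) {
                if (hasBlockWithoutLocation(file)) {
                    throw new ObserverRetryOnActiveException(
                        "Block without location for " + file.path);
                }
            }
        }
    }

    public static void main(String[] args) {
        // One healthy sub-listing and one whose appended block has not been
        // reported to the observer yet (empty location list).
        FileEntry healthy = new FileEntry("/dir1/a",
            Arrays.asList(new Block(Arrays.asList("dn1", "dn2"))));
        FileEntry stale = new FileEntry("/dir2/b",
            Arrays.asList(new Block(Collections.<String>emptyList())));
        List<List<FileEntry>> batch = Arrays.asList(
            Arrays.asList(healthy), Arrays.asList(stale));
        try {
            checkBatchedListing(batch, true);
            System.out.println("listing served by observer");
        } catch (ObserverRetryOnActiveException e) {
            System.out.println("retry on active: " + e.getMessage());
        }
    }
}
```

Failing the whole batch on the first stale entry keeps the behaviour in line
with the description above: the client never sees a mix of up-to-date and
stale sub-listings, at the cost of redoing the full request on the active
namenode.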
Your insights are very much appreciated. We will continue following up on this
issue until it is resolved.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)