Garret Wilson created HADOOP-18525:
--------------------------------------
Summary: ViewFileSystem major bug can cause entire subtrees to
effectively disappear
Key: HADOOP-18525
URL: https://issues.apache.org/jira/browse/HADOOP-18525
Project: Hadoop Common
Issue Type: Bug
Components: viewfs
Affects Versions: 3.3.4
Reporter: Garret Wilson
{{ViewFileSystem}} allows a federated view of a file system, so that for
example under the path {{foo/}} I might have {{foo/bar1}} mapped to some other
file system, {{foo/bar2}} mapped to some different file system, etc. using the
ViewFS mount table.
Consider a situation where I have 1,000 subdirectories {{foo/bar000}} to
{{foo/bar999}} mapped to 1,000 different cloud providers (e.g. AWS S3 buckets
or whatever). Let's say that for whatever reason the mapping for {{foo/bar123}}
was incorrect (maybe there was a corrupted mount table or a race condition in
creating the destination cloud storage), so that when we try to get the
status of {{foo/bar123}} it returns an HTTP {{404}}, throwing an exception.
But let's say that we were instead _listing the status of {{foo/}} itself_, in
order to return all 1,000 children. Look what would happen in the
{{ViewFileSystem.listStatus(Path f)}} code when we call
{{ViewFileSystem.listStatus(new Path("…/foo"))}}. We would expect it to return
999 child paths instead of 1,000 (because one of the mounted paths is
misconfigured and returns {{404}}):
{code:java}
for (Entry<String, INode<FileSystem>> iEntry :
    theInternalDir.getChildren().entrySet()) {
  …
  try {
    FileStatus status =
        ((ChRootedFileSystem)link.getTargetFileSystem())
            .getMyFs().getFileStatus(new Path(linkedPath));
    linkStatuses.add(
        new FileStatus(status.getLen(), status.isDirectory(),
            status.getReplication(), status.getBlockSize(),
            status.getModificationTime(), status.getAccessTime(),
            status.getPermission(), status.getOwner(),
            status.getGroup(), null, path));
  } catch (FileNotFoundException ex) {
    LOG.warn("Cannot get one of the children's(" + path
        + ") target path(" + link.getTargetFileSystem().getUri()
        + ") file status.", ex);
    throw ex;
  }
{code}
For each child that is mapped in the mount table, a
{{((ChRootedFileSystem)link.getTargetFileSystem()).getMyFs().getFileStatus(new
Path(linkedPath))}} call is performed on the underlying federated file system
and the resulting {{FileStatus}} is added to the list. But in the case of
{{foo/bar123}}, it throws an exception. The code above appropriately catches
the exception and warns, "Cannot get one of the children's … file status."
That part is perfectly fine. *But then the code rethrows the exception, which
is incorrect.*
Rethrowing the exception with {{throw ex}} breaks the directory listing: it
results in an exception for the entire directory listing of {{foo/}}, not just
for the child. If the child mapping for {{foo/bar123}} has somehow disappeared
(maybe it's just a race condition, and the mount table was already stale when
the directory listing started, so that the mapping was never current) and
{{foo/bar123}} returns a {{404}}, then the entire directory listing, instead of
returning 999 entries as expected, returns no entries at all, because the
status listing of {{foo/}} itself fails with the {{404}}!
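To make the failure mode concrete, here is a minimal, self-contained sketch that models the shape of the loop above without using any Hadoop classes. Everything in it ({{RethrowListingDemo}}, {{StatusLookup}}, {{buildMounts()}}, the string "statuses") is a hypothetical stand-in invented for illustration, not the actual {{ViewFileSystem}} code; it only assumes, as described above, that a single child lookup throws {{FileNotFoundException}} and that the catch block rethrows it.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RethrowListingDemo {

    /** Hypothetical stand-in for a status lookup on a mounted target file system. */
    public interface StatusLookup {
        String getFileStatus(String path) throws FileNotFoundException;
    }

    /** Same shape as the loop quoted above: one failing child aborts the whole listing. */
    public static List<String> listStatusRethrowing(Map<String, StatusLookup> children)
            throws FileNotFoundException {
        List<String> statuses = new ArrayList<>();
        for (Map.Entry<String, StatusLookup> entry : children.entrySet()) {
            try {
                statuses.add(entry.getValue().getFileStatus(entry.getKey()));
            } catch (FileNotFoundException ex) {
                System.err.println("Cannot get child status: " + entry.getKey());
                throw ex; // the statuses collected so far are never returned
            }
        }
        return statuses;
    }

    /** Builds 1,000 hypothetical mounts foo/bar000..foo/bar999; only foo/bar123 is broken. */
    public static Map<String, StatusLookup> buildMounts() {
        Map<String, StatusLookup> mounts = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++) {
            String name = String.format("foo/bar%03d", i);
            if (i == 123) {
                mounts.put(name, path -> {
                    throw new FileNotFoundException("404 for " + path);
                });
            } else {
                mounts.put(name, path -> "status(" + path + ")");
            }
        }
        return mounts;
    }

    public static void main(String[] args) {
        try {
            List<String> statuses = listStatusRethrowing(buildMounts());
            System.out.println("listed " + statuses.size() + " children");
        } catch (FileNotFoundException ex) {
            // The caller cannot tell a missing child from a missing parent.
            System.out.println("listing of foo/ failed: " + ex.getMessage());
        }
    }
}
```

In this model the caller of the listing receives only the exception raised for {{foo/bar123}}; none of the other 999 entries are returned.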
This bug essentially causes an entire subtree to disappear merely because of a
problem accessing one of the _children_. In a distributed environment (which is
what ViewFs was intended for), with thousands of mappings to various HTTP-based
cloud storage accounts, it's not unexpected that one of them might be
temporarily unavailable. But this bug would cause the _parent_ directory to
seem unavailable, essentially making it appear that e.g. {{/users}} did not
exist simply because {{/users/fulano}} happened to be missing.
And if we happen to have {{/missing-mount}} mounted under the root and it was
temporarily unavailable, and we did a {{listStatus()}} on the root directory
{{/}} itself? Yes, it would _appear as if the root directory itself was
missing_, i.e. the entire federated file system.
I have seen this bug in practice. In fact I had thought I had already filed a
ticket for this, but maybe it was in some organization's internal bug tracking
system instead of on the public Apache Hadoop bug tracking system.
You can verify this bug simply by adding a unit/integration test that mocks
{{foo/bar1}}, {{foo/bar2}}, and {{foo/bar3}} as {{ChRootedFileSystem}} in a
{{ViewFileSystem}} via {{ViewFileSystem.getMyFs()}}. Perform a
{{ViewFileSystem.listStatus()}} on {{foo/}} and see that it returns 3 children.
Then have {{getMyFs().getFileStatus()}} return a {{404}} error only for
{{foo/bar2}}. Do a {{ViewFileSystem.listStatus()}} on {{foo/}} again, and
instead of returning 2 children, it will claim that {{foo/}} itself does not
exist.
Fixing this bug is very simple: remove the {{throw ex}} altogether on line 1449.
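As a rough illustration of the proposed change and of the three-child check described above, the following self-contained sketch models a listing loop that warns and continues instead of rethrowing. It does not use Hadoop classes; {{SkipFailedChildDemo}}, {{StatusLookup}}, {{buildMounts()}} and the string "statuses" are hypothetical names made up for this example, and a real test would need to mock the target file systems as described above.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkipFailedChildDemo {

    /** Hypothetical stand-in for a status lookup on a mounted target file system. */
    public interface StatusLookup {
        String getFileStatus(String path) throws FileNotFoundException;
    }

    /** Listing loop with the rethrow removed: a failing child is logged and skipped. */
    public static List<String> listStatusSkipping(Map<String, StatusLookup> children) {
        List<String> statuses = new ArrayList<>();
        for (Map.Entry<String, StatusLookup> entry : children.entrySet()) {
            try {
                statuses.add(entry.getValue().getFileStatus(entry.getKey()));
            } catch (FileNotFoundException ex) {
                // Warn only; the remaining children are still listed.
                System.err.println("Cannot get child status: " + entry.getKey());
            }
        }
        return statuses;
    }

    /** Three hypothetical mounts; foo/bar2 simulates the 404. */
    public static Map<String, StatusLookup> buildMounts() {
        Map<String, StatusLookup> mounts = new LinkedHashMap<>();
        mounts.put("foo/bar1", path -> "status(" + path + ")");
        mounts.put("foo/bar2", path -> {
            throw new FileNotFoundException("404 for " + path);
        });
        mounts.put("foo/bar3", path -> "status(" + path + ")");
        return mounts;
    }

    public static void main(String[] args) {
        List<String> statuses = listStatusSkipping(buildMounts());
        // prints [status(foo/bar1), status(foo/bar3)]
        System.out.println(statuses);
    }
}
```

With the rethrow gone, the model returns the two healthy children and only logs a warning for {{foo/bar2}}, which is the behavior the test described above would assert.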