Garret Wilson created HADOOP-18525:
--------------------------------------

             Summary: ViewFileSystem major bug can cause entire subtrees to 
effectively disappear
                 Key: HADOOP-18525
                 URL: https://issues.apache.org/jira/browse/HADOOP-18525
             Project: Hadoop Common
          Issue Type: Bug
          Components: viewfs
    Affects Versions: 3.3.4
            Reporter: Garret Wilson


{{ViewFileSystem}} allows a federated view of a file system, so that for 
example under the path {{foo/}} I might have {{foo/bar1}} mapped to some other 
file system, {{foo/bar2}} mapped to some different file system, etc. using the 
ViewFS mount table.

Consider a situation where I have 1,000 subdirectories {{foo/bar000}} to 
{{foo/bar999}} mapped to 1,000 different cloud providers (e.g. AWS S3 buckets 
or whatever). Let's say that for whatever reason the mapping for {{foo/bar123}} 
was incorrect (maybe there was a corrupted mount table or a race condition in 
creating the destination cloud storage), so that when we try to get the 
status of {{foo/bar123}} it returns an HTTP {{404}}, throwing an exception.

But let's say that we were instead _listing the status of {{foo/}} itself_, in 
order to return all 1,000 children. Look what would happen in the 
{{ViewFileSystem.listStatus(Path f)}} code when we call 
{{ViewFileSystem.listStatus(new Path("…/foo"))}}. We would expect it to return 
999 child paths instead of 1,000 (because one of the mounted paths is 
misconfigured and returns {{404}}):

{code:java}
      for (Entry<String, INode<FileSystem>> iEntry :
          theInternalDir.getChildren().entrySet()) {
…
          try {
            FileStatus status =
                ((ChRootedFileSystem)link.getTargetFileSystem())
                .getMyFs().getFileStatus(new Path(linkedPath));
            linkStatuses.add(
                new FileStatus(status.getLen(), status.isDirectory(),
                    status.getReplication(), status.getBlockSize(),
                    status.getModificationTime(), status.getAccessTime(),
                    status.getPermission(), status.getOwner(),
                    status.getGroup(), null, path));
          } catch (FileNotFoundException ex) {
            LOG.warn("Cannot get one of the children's(" + path
                + ")  target path(" + link.getTargetFileSystem().getUri()
                + ") file status.", ex);
            throw ex;
          }
{code}

For each child that is mapped in the mount table, a 
{{((ChRootedFileSystem)link.getTargetFileSystem()).getMyFs().getFileStatus(new 
Path(linkedPath))}} is performed on the underlying federated file system and 
the resulting {{FileStatus}} is added to the list. But in the case of 
{{foo/bar123}}, it throws an exception. The code above appropriately catches 
the exception and warns, "Cannot get one of the children's … file status." That 
part is perfectly fine. *But then the code rethrows the exception, which is 
incorrect.*

Rethrowing the exception with {{throw ex}} breaks the directory listing; it 
will result in an exception for the entire directory listing of {{foo/}}, not 
just the child. If the child mapping for {{foo/bar123}} has somehow disappeared 
(maybe it's just a race condition, and the mount table was already stale when 
the directory listing started) and {{foo/bar123}} returns a {{404}}, suddenly 
the entire directory listing, instead of returning 999 entries as expected, 
doesn't return any entries, because the listing of {{foo/}} itself fails with 
the child's {{404}}!
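
To make the failure mode concrete, here is a minimal stand-alone sketch of the same loop shape. It is not Hadoop code: {{RethrowListingSketch}}, {{StatusLookup}}, {{listStatusRethrowing}}, and {{sampleChildren}} are hypothetical names, and each "status" is just a string. It only assumes that one child lookup throws {{FileNotFoundException}}, as described above.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of a mount-point listing; not Hadoop code. */
public class RethrowListingSketch {

  /** Stand-in for resolving one mount link's status on its target file system. */
  interface StatusLookup {
    String getStatus() throws FileNotFoundException;
  }

  /** Same shape as the quoted loop: warn, then rethrow on the first missing child. */
  static List<String> listStatusRethrowing(Map<String, StatusLookup> children)
      throws FileNotFoundException {
    List<String> statuses = new ArrayList<>();
    for (Map.Entry<String, StatusLookup> child : children.entrySet()) {
      try {
        statuses.add(child.getValue().getStatus());
      } catch (FileNotFoundException ex) {
        System.err.println("Cannot get file status for child " + child.getKey());
        throw ex; // aborts the whole listing; statuses collected so far are lost
      }
    }
    return statuses;
  }

  /** Three hypothetical children; only the middle one is "missing". */
  static Map<String, StatusLookup> sampleChildren() {
    Map<String, StatusLookup> children = new LinkedHashMap<>();
    children.put("foo/bar1", () -> "status of foo/bar1");
    children.put("foo/bar2", () -> {
      throw new FileNotFoundException("404 for foo/bar2");
    });
    children.put("foo/bar3", () -> "status of foo/bar3");
    return children;
  }

  public static void main(String[] args) {
    try {
      System.out.println(listStatusRethrowing(sampleChildren()));
    } catch (FileNotFoundException ex) {
      // The caller cannot tell this apart from foo/ itself being absent.
      System.out.println("Listing of foo/ failed entirely: " + ex.getMessage());
    }
  }
}
```

In this model the caller never sees {{foo/bar1}} or {{foo/bar3}}; the only thing that escapes is the exception raised for {{foo/bar2}}.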

This bug essentially causes an entire subtree to disappear merely because of a 
problem accessing one of the _children_. In a distributed environment (which is 
what ViewFs was intended for), with thousands of mappings to various HTTP-based 
cloud storage accounts, it's not unexpected that one of them might be 
temporarily unavailable. But this bug would cause the _parent_ directory to 
seem unavailable, essentially making it appear that e.g. {{/users}} did 
not exist simply because {{/users/fulano}} happened to be missing.

And if we happen to have {{/missing-mount}} mounted under the root and it was 
temporarily unavailable, and we did a {{listStatus()}} on the root directory 
{{/}} itself? Yes, it would _appear as if the root directory itself was 
missing_, i.e. the entire federated file system.

I have seen this bug in practice. In fact I had thought I had already filed a 
ticket for this, but maybe it was in some organization's internal bug tracking 
system instead of the public Apache Hadoop bug tracking system.

You can verify this bug simply by adding a unit/integration test that mocks 
{{foo/bar1}}, {{foo/bar2}}, and {{foo/bar3}} as {{ChRootedFileSystem}} in a 
{{ViewFileSystem}} via {{ChRootedFileSystem.getMyFs()}}. Perform a 
{{ViewFileSystem.listStatus()}} on {{foo/}} and see that it returns 3 children. 
Then have {{getMyFs().getFileStatus()}} return a {{404}} error only for 
{{foo/bar2}}. Do a {{ViewFileSystem.listStatus()}} on {{foo/}} again, and 
instead of returning 2 children, it will claim that {{foo/}} itself does not 
exist.

Fixing this bug is very simple: remove the {{throw ex}} altogether on line 1449.
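
As a rough illustration of the proposed behavior (warn and continue instead of rethrowing), here is a stand-alone sketch. It is a simplified model and not the actual Hadoop patch: {{TolerantListingSketch}}, {{StatusLookup}}, {{listStatusSkippingMissing}}, and {{sampleChildren}} are hypothetical names, and each "status" is just a string.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of a tolerant mount-point listing; not Hadoop code. */
public class TolerantListingSketch {

  /** Stand-in for resolving one mount link's status on its target file system. */
  interface StatusLookup {
    String getStatus() throws FileNotFoundException;
  }

  /** Warns about a missing child and keeps going, so the siblings are still listed. */
  static List<String> listStatusSkippingMissing(Map<String, StatusLookup> children) {
    List<String> statuses = new ArrayList<>();
    for (Map.Entry<String, StatusLookup> child : children.entrySet()) {
      try {
        statuses.add(child.getValue().getStatus());
      } catch (FileNotFoundException ex) {
        // No rethrow here: the unreachable child is skipped, not fatal.
        System.err.println("Cannot get file status for child " + child.getKey()
            + ": " + ex.getMessage());
      }
    }
    return statuses;
  }

  /** Three hypothetical children; only the middle one is "missing". */
  static Map<String, StatusLookup> sampleChildren() {
    Map<String, StatusLookup> children = new LinkedHashMap<>();
    children.put("foo/bar1", () -> "status of foo/bar1");
    children.put("foo/bar2", () -> {
      throw new FileNotFoundException("404 for foo/bar2");
    });
    children.put("foo/bar3", () -> "status of foo/bar3");
    return children;
  }

  public static void main(String[] args) {
    // Lists the two reachable children; foo/bar2 only produces a warning on stderr.
    System.out.println(listStatusSkippingMissing(sampleChildren()));
  }
}
```

With the rethrow gone, the listing in this model contains the two reachable children, and the problem with {{foo/bar2}} is still visible in the warning output.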


