Garret Wilson created HADOOP-18525:
--------------------------------------

Summary: ViewFileSystem major bug can cause entire subtrees to effectively disappear
Key: HADOOP-18525
URL: https://issues.apache.org/jira/browse/HADOOP-18525
Project: Hadoop Common
Issue Type: Bug
Components: viewfs
Affects Versions: 3.3.4
Reporter: Garret Wilson

{{ViewFileSystem}} allows a federated view of a file system, so that for example under the path {{foo/}} I might have {{foo/bar1}} mapped to some other file system, {{foo/bar2}} mapped to some different file system, etc. using the ViewFS mount table.

Consider a situation where I have 1,000 subdirectories {{foo/bar000}} to {{foo/bar999}} mapped to 1,000 different cloud storage locations (e.g. AWS S3 buckets). Let's say that for whatever reason the mapping for {{foo/bar123}} was incorrect (maybe there was a corrupted mount table or a race condition in creating the destination cloud storage), so that when we try to get the status of {{foo/bar123}} it returns an HTTP {{404}}, throwing an exception.

But let's say that we were instead _listing the status of {{foo/}} itself_, in order to return all 1,000 children. Look what would happen in the {{ViewFileSystem.listStatus(Path f)}} code when we call {{ViewFileSystem.listStatus(new Path("…/foo"))}}. We would expect it to return 999 child paths instead of 1,000 (because one of the mounted paths is misconfigured and returns {{404}}):

{code:java}
for (Entry<String, INode<FileSystem>> iEntry :
    theInternalDir.getChildren().entrySet()) {
  …
  try {
    FileStatus status =
        ((ChRootedFileSystem)link.getTargetFileSystem())
            .getMyFs().getFileStatus(new Path(linkedPath));
    linkStatuses.add(
        new FileStatus(status.getLen(), status.isDirectory(),
            status.getReplication(), status.getBlockSize(),
            status.getModificationTime(), status.getAccessTime(),
            status.getPermission(), status.getOwner(),
            status.getGroup(), null, path));
  } catch (FileNotFoundException ex) {
    LOG.warn("Cannot get one of the children's(" + path
        + ") target path(" + link.getTargetFileSystem().getUri()
        + ") file status.", ex);
    throw ex;
  }
{code}

For each particular child that is mapped in the mount table, a {{((ChRootedFileSystem)link.getTargetFileSystem()).getMyFs().getFileStatus(new Path(linkedPath))}} is performed on the underlying federated file system and the resulting
{{FileStatus}} is added to the list. But in the case of {{foo/bar123}}, it throws an exception. The code above appropriately catches the exception and warns, "Cannot get one of the children's … file status." That part is perfectly fine. *But then the code rethrows the exception, which is incorrect.*

Rethrowing the exception with {{throw ex}} breaks the directory listing; it results in an exception for the entire directory listing of {{foo/}}, not just for the child. If the child mapping for {{foo/bar123}} has somehow disappeared (maybe it's just a race condition, and the mount table was already stale when the directory listing started) and {{foo/bar123}} returns a {{404}}, suddenly the entire directory listing, instead of returning 999 entries as expected, doesn't return any entries, because the file status listing of {{foo/}} itself returns {{404}}!

This bug essentially causes an entire subtree to disappear merely because of a problem accessing one of the _children_. In a distributed environment (which is what ViewFs was intended for), with thousands of mappings to various HTTP-based cloud storage accounts, it's not unexpected that one of them might be temporarily unavailable. But this bug would cause the _parent_ directory to seem unavailable, making it appear that e.g. {{/users}} did not exist simply because {{/users/fulano}} happened to be missing.

And if we happen to have {{/missing-mount}} mounted under the root and it was temporarily unavailable, and we did a {{listStatus()}} on the root directory {{/}} itself? Yes, it would _appear as if the root directory itself was missing_, i.e. the entire federated file system.

I have seen this bug in practice. In fact I had thought I had already filed a ticket for this, but maybe it was in some organization's internal bug tracking system instead of the public Apache Hadoop bug tracking system.

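The difference between rethrowing and skipping can be illustrated with a stripped-down model that does not depend on Hadoop at all. This is only a sketch under stated assumptions: {{MountListingSketch}}, {{Target}}, {{listRethrowing}}, {{listSkipping}} and {{sampleMounts}} are hypothetical names invented for illustration, and the "status" is just a string rather than a real {{FileStatus}}. It is not the actual {{ViewFileSystem}} code.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a mount-table listing; hypothetical, not Hadoop code. */
public class MountListingSketch {

    /** Hypothetical stand-in for a mounted target file system. */
    interface Target {
        String status(String path) throws FileNotFoundException;
    }

    /** Models the reported behavior: one failing child aborts the whole listing. */
    static List<String> listRethrowing(Map<String, Target> mounts)
            throws FileNotFoundException {
        List<String> statuses = new ArrayList<>();
        for (Map.Entry<String, Target> e : mounts.entrySet()) {
            // Any FileNotFoundException from a child propagates to the caller.
            statuses.add(e.getValue().status(e.getKey()));
        }
        return statuses;
    }

    /** Models the proposed behavior: warn about the failing child and keep going. */
    static List<String> listSkipping(Map<String, Target> mounts) {
        List<String> statuses = new ArrayList<>();
        for (Map.Entry<String, Target> e : mounts.entrySet()) {
            try {
                statuses.add(e.getValue().status(e.getKey()));
            } catch (FileNotFoundException ex) {
                // Warn, but do not rethrow: the remaining children are still listed.
                System.err.println("Cannot get file status for child "
                        + e.getKey() + ": " + ex.getMessage());
            }
        }
        return statuses;
    }

    /** Three mounted children, the middle one of which behaves like a 404. */
    static Map<String, Target> sampleMounts() {
        Map<String, Target> mounts = new LinkedHashMap<>();
        mounts.put("foo/bar1", path -> "status:" + path);
        mounts.put("foo/bar2", path -> {
            throw new FileNotFoundException(path + " (404)");
        });
        mounts.put("foo/bar3", path -> "status:" + path);
        return mounts;
    }

    public static void main(String[] args) {
        Map<String, Target> mounts = sampleMounts();
        try {
            listRethrowing(mounts);
        } catch (FileNotFoundException ex) {
            System.out.println("rethrowing listing failed: " + ex.getMessage());
        }
        System.out.println("skipping listing: " + listSkipping(mounts));
    }
}
```

In this model the rethrowing variant fails the listing of the parent as soon as one child is missing, while the skipping variant still returns the two healthy children.
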
You can verify this bug simply by adding a unit/integration test that mocks {{foo/bar1}}, {{foo/bar2}}, and {{foo/bar3}} as {{ChRootedFileSystem}} in a {{ViewFileSystem}} via {{ViewFileSystem.getMyFs()}}. Perform a {{ViewFileSystem.listStatus()}} on {{foo/}} and see that it returns 3 children. Then have {{getMyFs().getFileStatus()}} return a {{404}} error only for {{foo/bar2}}. Do a {{ViewFileSystem.listStatus()}} on {{foo/}} again, and instead of returning 2 children, it will claim that {{foo/}} itself does not exist.

Fixing this bug is very simple: remove the {{throw ex}} altogether on line 1449.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)