JohnZZGithub edited a comment on pull request #2185: URL: https://github.com/apache/hadoop/pull/2185#issuecomment-686183399
Thanks @umamaheswararao for the comments. Please see the replies inline.

> Hi @JohnZZGithub, I got few other points to discuss.
>
> 1. We have exposed getMountPoints API. It seems we can't return any mount points from REGEX based because you would not know until you got src paths to resolve and find real target fs. What should we do for this API?

It's a great question. I guess most callers of getMountPoints want to traverse all the file systems to perform some operation, e.g. setVerifyChecksum(). We didn't see issues on our internal Yarn + HDFS and Yarn + GCS clusters. The usage patterns include, but are not limited to, MR, Spark, Presto, and Vertica loading. But it's possible that some users rely on these APIs. I see two options going forward:

1. Return a MountPoint with a special FileSystem for regex mount points. We could cache the initialized file systems under the regex mount point and perform the operation on them. For file systems that might appear in the future, we could either record past calls from callers and try to apply them, or simply not support that case.
2. Indicate that we don't support such APIs for regex mount points.

To extend the topic a little: this kind of ViewFileSystem API (one that tries to visit all file systems) has caused several problems for us. For example, setVerifyChecksum() initialized a file system for a mount point the user never intended to use. The initialization then failed because it requires credentials, which the user doesn't have since they never meant to visit that mount point. We developed a LazyChRootedFileSystem on top of every target file system (not public) to do lazy initialization for path-based APIs, but it's hard to tackle APIs that don't take a path. So to summarize, we see cases where users want to keep these non-path-based APIs from triggering actions on every child file system.
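To make the lazy-initialization idea concrete, here is a minimal sketch. It is not our internal LazyChRootedFileSystem and it does not use the Hadoop API: `TargetFs` and `LazyTargetFs` are hypothetical names invented for illustration. It assumes a non-path call such as setVerifyChecksum() can be recorded and applied once a path-based call forces the target to be created.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a target file system; NOT the Hadoop FileSystem class.
interface TargetFs {
    void setVerifyChecksum(boolean verify);
    boolean exists(String path);
}

// Defers creating the real target until a path-based call needs it.
final class LazyTargetFs implements TargetFs {
    private final Supplier<TargetFs> factory;
    private TargetFs delegate;      // guarded by this
    private Boolean pendingVerify;  // guarded by this; null means "never set"

    LazyTargetFs(Supplier<TargetFs> factory) {
        this.factory = factory;
    }

    // Creates the target on first use and applies any recorded setting.
    private synchronized TargetFs delegate() {
        if (delegate == null) {
            delegate = factory.get();
            if (pendingVerify != null) {
                delegate.setVerifyChecksum(pendingVerify);
            }
        }
        return delegate;
    }

    // Non-path call: must not trigger initialization (and its credential checks).
    @Override
    public synchronized void setVerifyChecksum(boolean verify) {
        if (delegate != null) {
            delegate.setVerifyChecksum(verify);
        } else {
            pendingVerify = verify;
        }
    }

    // Path-based call: the caller really wants this target, so initialize it.
    @Override
    public boolean exists(String path) {
        return delegate().exists(path);
    }

    synchronized boolean isInitialized() {
        return delegate != null;
    }
}
```

With this shape, a caller that only invokes setVerifyChecksum() never pays the initialization cost for mount points it does not visit.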
At the same time, some users (though rare in our scenarios) might want these APIs applied to all child file systems. I feel it's hard to satisfy both needs.

> 2. Other API is getDelegationTokenIssuers. Applications like YARN uses this API to get all child fs delegation tokens. This also will not work for REGEX based mount points.

We did see an issue with addDelegationTokens in a secure Hadoop cluster. The problem we hit was that not all normal mount points are secure, so the API failed when it tried to initialize all child file systems. We worked around it by making the call path-based. As for getDelegationTokens, I guess the problem is similar; we didn't see issues because it's not used. Could we make it path-based too? Or we could take one of the approaches described under point 1.

> 3. Other question is how this child filesystem objects gets closed. There was an issue with [ViewFileSystem#close | https://issues.apache.org/jira/browse/HADOOP-15565 ]. I would like to know how that get addressed in this case as don't keep anything in InnerCache.

Could we make the inner cache a thread-safe structure and track all the opened file systems under regex mount points?

These are really great points, thanks a lot.
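A rough sketch of what a thread-safe cache that tracks opened file systems could look like. This is only an illustration under stated assumptions: `TrackedFsCache` is a hypothetical name, the cache is keyed by the resolved target URI string, and it does not reflect the actual InnerCache in ViewFileSystem.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch; the real InnerCache in ViewFileSystem has a different shape.
final class TrackedFsCache<F extends Closeable> implements Closeable {
    private final ConcurrentMap<String, F> opened = new ConcurrentHashMap<>();
    private final Function<String, F> factory;

    TrackedFsCache(Function<String, F> factory) {
        this.factory = factory;
    }

    // Returns the instance for a resolved target URI, creating it at most once
    // even when several threads resolve the same regex mount point concurrently.
    F get(String targetUri) {
        return opened.computeIfAbsent(targetUri, factory);
    }

    int size() {
        return opened.size();
    }

    // Closes every tracked instance; keeps going if one close() fails and
    // rethrows the first failure with the others attached as suppressed.
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (String key : new ArrayList<>(opened.keySet())) {
            F fs = opened.remove(key);
            if (fs == null) {
                continue; // already removed by a concurrent close()
            }
            try {
                fs.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

One limitation of this sketch: a get() racing with close() can still add an entry after the sweep, so a real implementation would probably also need a closed flag that get() checks.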