[
https://issues.apache.org/jira/browse/HADOOP-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059161#comment-18059161
]
Christopher Tubbs commented on HADOOP-19815:
--------------------------------------------
It would probably be enough to document that the URI constructor for Path
preserves the provided URI, whereas the String constructor normalizes it and
strips the trailing slash. However, that still leaves bugs in the FileSystem
implementations, which return Paths based on URIs that have already been
stripped. So the Path APIs can be addressed with better documentation, while
the FileSystem APIs need a behavior change to fix the bug.
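
For reference, the trailing-slash dependence comes from java.net.URI itself and
can be reproduced without Hadoop. The following is a minimal sketch using only
the JDK; the class name TrailingSlashDemo and the helper resolve(...) are
hypothetical names chosen for illustration.

```java
import java.net.URI;

class TrailingSlashDemo {
    // Resolve a relative reference against a base using plain java.net.URI.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // No trailing slash: the last segment ("somewhere") is replaced.
        System.out.println(resolve("hdfs://localhost:8020/path/to/somewhere", "other"));
        // prints hdfs://localhost:8020/path/to/other

        // Trailing slash: the reference is resolved inside "somewhere/".
        System.out.println(resolve("hdfs://localhost:8020/path/to/somewhere/", "other"));
        // prints hdfs://localhost:8020/path/to/somewhere/other
    }
}
```

This is why a Path that drops the trailing slash changes the result of a later
resolve call, even though the two strings look equivalent as directory names.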
> Path normalizes away important trailing slash used for URI.resolve(other)
> -------------------------------------------------------------------------
>
> Key: HADOOP-19815
> URL: https://issues.apache.org/jira/browse/HADOOP-19815
> Project: Hadoop Common
> Issue Type: Bug
> Components: common
> Affects Versions: 3.4.2
> Reporter: Christopher Tubbs
> Priority: Major
>
> This issue appears to be a relatively long-standing bug with Hadoop's
> FileSystem and Path classes, but is nevertheless important.
> The core of the issue is that {{URI.resolve(...)}} relies on a trailing slash
> to determine how to resolve path components, but the trailing slash is often
> stripped out in common code paths for FileSystem and Path. This causes
> problems when trying to resolve new URIs/Paths from existing ones.
> Constructing a Path from a URI, rather than a String or another Path, does
> preserve the original URI, so things do resolve correctly, but that yields
> highly inconsistent behavior, and depends on the specifics of how it was
> constructed and how the original URI was preserved internally.
> However, even if one argues that the String constructor for Path is supposed
> to normalize, and the URI constructor is supposed to preserve, the problem
> also exists with many of the {{FileSystem}} methods, such as
> {{fs.getUri()}}, {{fs.getHomeDirectory()}}, {{fs.getWorkingDirectory()}},
> etc. So, one must do convoluted string manipulation to resolve one Path
> from another.
> For example:
> {code:java}
> new Path("hdfs://localhost:8020/path/to/somewhere").toUri().resolve("other");
> // expected ==> URI(hdfs://localhost:8020/path/to/other)
> // actual ==> URI(hdfs://localhost:8020/path/to/other)
> new Path("hdfs://localhost:8020/path/to/somewhere/").toUri().resolve("other");
> // expected ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
> // actual ==> URI(hdfs://localhost:8020/path/to/other)
> new Path(new URI("hdfs://localhost:8020/path/to/somewhere")).toUri().resolve("other");
> // expected ==> URI(hdfs://localhost:8020/path/to/other)
> // actual ==> URI(hdfs://localhost:8020/path/to/other)
> new Path(new URI("hdfs://localhost:8020/path/to/somewhere/")).toUri().resolve("other");
> // expected ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
> // actual ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
> var fs = FileSystem.get(new Configuration());
> fs.getUri();
> // expected ==> URI(hdfs://localhost:8020/)
> // actual ==> URI(hdfs://localhost:8020)
> // (probably matters more for LocalFileSystem or viewfs, etc.)
> fs.getWorkingDirectory().toUri();
> fs.getHomeDirectory().toUri();
> // expected ==> URI(hdfs://localhost:8020/user/me/)
> // actual ==> URI(hdfs://localhost:8020/user/me)
> // broken code
> URI relativeURI = new URI("mytempdir");
> fs.getWorkingDirectory().toUri().resolve(relativeURI);
> // expected ==> hdfs://localhost:8020/user/me/mytempdir
> // actual ==> hdfs://localhost:8020/user/mytempdir
> // convoluted workaround (assuming relative path in the suffix without any
> // other URI elements)
> URI relativeURI = new URI("mytempdir");
> fs.getWorkingDirectory().suffix("/" + relativeURI.toString()).toUri();
> // expected ==> hdfs://localhost:8020/user/me/mytempdir
> // actual ==> hdfs://localhost:8020/user/me/mytempdir
> {code}
> Some of this is workable so long as you stay with Path, but the moment you
> work with URIs/URLs, things get convoluted quickly: you need {{toString()}}
> calls and concatenation with slash {{/}} characters, and you must handle the
> edge cases where the other path isn't relative or contains a different
> authority or scheme. {{URI.resolve()}} already handles these cases, so code
> becomes unnecessarily complex just to work around these API problems.
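
One possible caller-side workaround that keeps URI.resolve() in charge of the
edge cases is to re-add the trailing slash to the base before resolving. The
sketch below uses only java.net.URI; the class DirectoryResolve and the method
resolveAsDirectory are hypothetical and not part of Hadoop, and the sketch
assumes the base URI is known to denote a directory.

```java
import java.net.URI;
import java.net.URISyntaxException;

class DirectoryResolve {
    // Treat 'base' as a directory: append a trailing slash to its path if one
    // is missing, then let URI.resolve() handle relative, absolute-path, and
    // fully qualified references.
    static URI resolveAsDirectory(URI base, URI other) {
        if (base.isOpaque() || base.getPath().endsWith("/")) {
            return base.resolve(other);
        }
        try {
            URI dir = new URI(base.getScheme(), base.getAuthority(),
                    base.getPath() + "/", base.getQuery(), base.getFragment());
            return dir.resolve(other);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot treat as directory: " + base, e);
        }
    }

    public static void main(String[] args) {
        // Stands in for fs.getWorkingDirectory().toUri(), which lacks the slash.
        URI workingDir = URI.create("hdfs://localhost:8020/user/me");
        System.out.println(resolveAsDirectory(workingDir, URI.create("mytempdir")));
        // prints hdfs://localhost:8020/user/me/mytempdir
        System.out.println(resolveAsDirectory(workingDir, URI.create("/tmp/x")));
        // prints hdfs://localhost:8020/tmp/x
        System.out.println(resolveAsDirectory(workingDir, URI.create("s3a://bucket/key")));
        // prints s3a://bucket/key
    }
}
```

The helper avoids the string concatenation from the suffix-based workaround,
but it still pushes the directory-versus-file decision onto every caller, which
is the burden a FileSystem-side fix would remove.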