Christopher Tubbs created HADOOP-19815:
------------------------------------------
Summary: Path normalizes away important trailing slash used for
URI.resolve(other)
Key: HADOOP-19815
URL: https://issues.apache.org/jira/browse/HADOOP-19815
Project: Hadoop Common
Issue Type: Bug
Components: common
Affects Versions: 3.4.2
Reporter: Christopher Tubbs
This issue appears to be a relatively long-standing bug with Hadoop's
FileSystem and Path classes, but is nevertheless important.
The core of the issue is that {{URI.resolve(...)}} relies on a trailing slash
to determine how to resolve path components, but the trailing slash is often
stripped out in common code paths for FileSystem and Path. This causes problems
when trying to resolve new URIs/Paths from existing ones. Constructing a Path
from a URI, rather than from a String or another Path, does preserve the
original URI, so resolution works correctly in that case. The result, however,
is inconsistent behavior that depends on how the Path was constructed and how
the original URI was preserved internally.
However, even if one argues that the String constructor for Path is supposed to
normalize, and the URI constructor is supposed to preserve, the problem also
exists with many of the {{FileSystem}} methods, such as {{fs.getUri()}},
{{fs.getHomeDirectory()}}, {{fs.getWorkingDirectory()}}, etc. So, one
must do convoluted string manipulation to resolve one Path from another.
For example:
{code:java}
new Path("hdfs://localhost:8020/path/to/somewhere").toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/other)
// actual ==> URI(hdfs://localhost:8020/path/to/other)
new Path("hdfs://localhost:8020/path/to/somewhere/").toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
// actual ==> URI(hdfs://localhost:8020/path/to/other)
new Path(new URI("hdfs://localhost:8020/path/to/somewhere")).toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/other)
// actual ==> URI(hdfs://localhost:8020/path/to/other)
new Path(new URI("hdfs://localhost:8020/path/to/somewhere/")).toUri().resolve("other");
// expected ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
// actual ==> URI(hdfs://localhost:8020/path/to/somewhere/other)
var fs = FileSystem.get(new Configuration());
fs.getUri();
// expected ==> URI(hdfs://localhost:8020/)
// actual ==> URI(hdfs://localhost:8020) // probably matters more for
// LocalFileSystem or viewfs, etc.
fs.getWorkingDirectory().toUri();
fs.getHomeDirectory().toUri();
// expected ==> URI(hdfs://localhost:8020/user/me/)
// actual ==> URI(hdfs://localhost:8020/user/me)
// broken code
URI relativeURI = new URI("mytempdir");
fs.getWorkingDirectory().toUri().resolve(relativeURI);
// expected ==> hdfs://localhost:8020/user/me/mytempdir
// actual ==> hdfs://localhost:8020/user/mytempdir
// convoluted workaround (assuming relative path in the suffix without any
// other URI elements)
URI relativeURI = new URI("mytempdir");
fs.getWorkingDirectory().suffix("/" + relativeURI.toString()).toUri();
// expected ==> hdfs://localhost:8020/user/me/mytempdir
// actual ==> hdfs://localhost:8020/user/me/mytempdir
{code}
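The trailing-slash dependence comes from {{java.net.URI}} itself and can be reproduced without Hadoop on the classpath. The following is a minimal sketch; the class name {{UriResolveDemo}} and its {{resolve}} helper are hypothetical names chosen for illustration.

```java
import java.net.URI;

// Hypothetical demo class: shows how java.net.URI.resolve treats the base
// differently depending on whether its path ends with a slash.
class UriResolveDemo {

    // Resolves a relative reference against a base URI given as a string.
    static URI resolve(String base, String reference) {
        return URI.create(base).resolve(reference);
    }

    public static void main(String[] args) {
        // No trailing slash: the last segment ("somewhere") is replaced.
        System.out.println(resolve("hdfs://localhost:8020/path/to/somewhere", "other"));
        // prints hdfs://localhost:8020/path/to/other

        // Trailing slash: the reference is appended below "somewhere/".
        System.out.println(resolve("hdfs://localhost:8020/path/to/somewhere/", "other"));
        // prints hdfs://localhost:8020/path/to/somewhere/other
    }
}
```

This matches the "expected" values listed for the URI-based cases in the example, which is why stripping the slash in Path changes the outcome.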
Some of this is workable, so long as you're staying with Path, but the moment
you try to work with URIs/URLs, things get convoluted quickly: the code needs
{{toString()}} calls, concatenation with slash {{/}} characters, and special
handling for edge cases where the other path isn't relative, or contains a
different authority or scheme, etc. These are things {{URI.resolve()}} would
already handle, so code becomes unnecessarily complex when working around these
API problems.
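As an illustration of what calling code currently has to do, one possible interim workaround is to restore the trailing slash on the base URI and then let {{URI.resolve()}} deal with the edge cases. This is only a sketch under stated assumptions: {{DirectoryResolve}}, {{asDirectory}} and {{resolveUnder}} are hypothetical names, not Hadoop APIs, and the caller is assumed to know that the base refers to a directory.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop: treats the base URI as a directory
// by restoring the trailing slash before delegating to URI.resolve().
class DirectoryResolve {

    // Returns the URI with a trailing slash on its path, if it lacks one.
    static URI asDirectory(URI uri) {
        String path = uri.getRawPath();
        // Opaque URIs have no path; slash-terminated paths need no change.
        if (path == null || path.endsWith("/")) {
            return uri;
        }
        // An empty path without an authority is an empty relative reference;
        // appending "/" would turn it into the root path, so leave it alone.
        if (path.isEmpty() && uri.getRawAuthority() == null) {
            return uri;
        }
        // Reassemble from the raw components so nothing is re-encoded.
        StringBuilder sb = new StringBuilder();
        if (uri.getScheme() != null) {
            sb.append(uri.getScheme()).append(':');
        }
        if (uri.getRawAuthority() != null) {
            sb.append("//").append(uri.getRawAuthority());
        }
        sb.append(path).append('/');
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return URI.create(sb.toString());
    }

    // Resolves "other" below "base", treating "base" as a directory.
    static URI resolveUnder(URI base, URI other) {
        return asDirectory(base).resolve(other);
    }

    public static void main(String[] args) {
        URI workingDir = URI.create("hdfs://localhost:8020/user/me");
        System.out.println(resolveUnder(workingDir, URI.create("mytempdir")));
        // prints hdfs://localhost:8020/user/me/mytempdir
        System.out.println(resolveUnder(workingDir, URI.create("/tmp/x")));
        // prints hdfs://localhost:8020/tmp/x
        System.out.println(resolveUnder(workingDir, URI.create("file:///tmp/x")));
        // prints file:///tmp/x
    }
}
```

Absolute paths and references with their own scheme or authority are handled by {{URI.resolve()}} rather than by string concatenation. The sketch would not be needed if the FileSystem and Path methods preserved the trailing slash.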