stevenzwu commented on code in PR #16174:
URL: https://github.com/apache/iceberg/pull/16174#discussion_r3290473847
##########
core/src/main/java/org/apache/iceberg/util/LocationUtil.java:
##########
@@ -57,4 +57,61 @@ public static String tableLocation(TableIdentifier
tableIdentifier, boolean useU
return tableIdentifier.name();
}
}
+
+ /**
+ * Returns true if the location contains a URI scheme (e.g. {@code s3:}, {@code hdfs:}, {@code
+ * file:}), per <a href="https://datatracker.ietf.org/doc/html/rfc3986#section-3.1">RFC 3986
+ * section 3.1</a>.
+ */
+ private static boolean hasScheme(String location) {
+ for (int i = 0; i < location.length(); i += 1) {
+ char ch = location.charAt(i);
+ if (ch == ':') {
+ return i > 0;
+ }
+
+ if (!Character.isLetterOrDigit(ch) && ch != '+' && ch != '-' && ch != '.') {
Review Comment:
`Character.isLetterOrDigit(char)` admits any BMP letter/digit category (CJK
ideographs, Cyrillic, Arabic-Indic digits, etc.), which is broader than what
the scheme grammar allows — RFC 3986 defines the URI grammar over US-ASCII per
[§2](https://datatracker.ietf.org/doc/html/rfc3986#section-2):
> The ABNF notation defines its terminal values to be non-negative integers
> (codepoints) based on the US-ASCII coded character set [ASCII]. Because a URI
> is a sequence of characters, we must invert that relation in order to
> understand the URI syntax. Therefore, the integer values used by the ABNF must
> be mapped back to their corresponding characters via US-ASCII in order to
> complete the syntax rules.

And `ALPHA` / `DIGIT` in the scheme production at
[§3.1](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) come from
[RFC 5234 Appendix
B.1](https://datatracker.ietf.org/doc/html/rfc5234#appendix-B.1), which
restricts them to `%x41-5A`, `%x61-7A`, and `%x30-39`.
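To make the gap concrete, here is a small sketch comparing the JDK predicate with the ABNF ranges. The class `LetterOrDigitGap` and both helper names are hypothetical and exist only for illustration:

```java
final class LetterOrDigitGap {
  private LetterOrDigitGap() {}

  /** True if ch is in ALPHA (%x41-5A / %x61-7A) or DIGIT (%x30-39) per RFC 5234 B.1. */
  static boolean isAsciiAlphaOrDigit(char ch) {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9');
  }

  /** True if the JDK predicate accepts ch but the ABNF ranges do not. */
  static boolean acceptedOnlyByJdk(char ch) {
    return Character.isLetterOrDigit(ch) && !isAsciiAlphaOrDigit(ch);
  }
}
```

`acceptedOnlyByJdk` is true for `'\u4E2D'` (a CJK ideograph), `'\u0416'` (Cyrillic Zhe), and `'\u0663'` (Arabic-Indic digit three), and false for `'s'` and `'3'`.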
In practice no Iceberg location puts non-ASCII characters before the `:`, so
the concern is theoretical. Either note in the Javadoc that the check is
deliberately more permissive than the RFC, or tighten it to inline ASCII ranges:
```java
(ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
    || (i > 0 && ((ch >= '0' && ch <= '9') || ch == '+' || ch == '-' || ch == '.'))
```
`java.lang.Character` does not expose a public ASCII-only classification helper
(`isLetter`, `isDigit`, and `isLetterOrDigit` are all Unicode-aware), so an
inline range check is the cleanest path.
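
For reference, a minimal sketch of the whole method with the tightened check. It assumes the method falls through to `return false` when no `:` is found, since the quoted hunk is cut off before that point. The class name `SchemeCheck` and the helpers `isAsciiAlpha` / `isAsciiDigit` are hypothetical names, not existing JDK or Iceberg APIs:

```java
final class SchemeCheck {
  private SchemeCheck() {}

  /** ALPHA per RFC 5234 Appendix B.1: %x41-5A / %x61-7A. */
  static boolean isAsciiAlpha(char ch) {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
  }

  /** DIGIT per RFC 5234 Appendix B.1: %x30-39. */
  static boolean isAsciiDigit(char ch) {
    return ch >= '0' && ch <= '9';
  }

  /** RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ). */
  static boolean hasScheme(String location) {
    for (int i = 0; i < location.length(); i += 1) {
      char ch = location.charAt(i);
      if (ch == ':') {
        // An empty scheme (leading ':') is not valid.
        return i > 0;
      }

      // The first character must be ALPHA; later ones may also be DIGIT, '+', '-', '.'.
      boolean allowed =
          isAsciiAlpha(ch)
              || (i > 0 && (isAsciiDigit(ch) || ch == '+' || ch == '-' || ch == '.'));
      if (!allowed) {
        return false;
      }
    }

    // Assumption: no ':' found means no scheme.
    return false;
  }
}
```

With this version `hasScheme("s3://bucket/path")` and `hasScheme("file:/tmp")` return true, while `hasScheme("/tmp/a:b")`, `hasScheme("1abc:foo")`, and `hasScheme("\u65E5\u672C:foo")` return false. Note that it also rejects a leading digit, which the original loop accepted.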
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]