gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206966623
##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
int utf8Upto = utf8.offset;
final byte[] bytes = utf8.bytes;
final int utf8Limit = utf8.offset + utf8.length;
+ UTF8CodePoint reuse = null;
while (utf8Upto < utf8Limit) {
- final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
- int v = 0;
- switch (numBytes) {
- case 1:
- ints[utf32Count++] = bytes[utf8Upto++];
- continue;
- case 2:
- // 5 useful bits
- v = bytes[utf8Upto++] & 31;
- break;
- case 3:
- // 4 useful bits
- v = bytes[utf8Upto++] & 15;
- break;
- case 4:
- // 3 useful bits
- v = bytes[utf8Upto++] & 7;
- break;
- default:
- throw new IllegalArgumentException("invalid utf8");
- }
+ reuse = codePointAt(bytes, utf8Upto, reuse);
+ ints[utf32Count++] = reuse.codePoint;
+ utf8Upto += reuse.codePointBytes;
+ }
- // TODO: this may read past utf8's limit.
- final int limit = utf8Upto + numBytes - 1;
- while (utf8Upto < limit) {
- v = v << 6 | bytes[utf8Upto++] & 63;
+ return utf32Count;
+ }
+
+ /**
+ * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+ * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+ * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+ * validation.
+ *
+ * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is
Review Comment:
You're correct that it could throw an AIOOBE (`ArrayIndexOutOfBoundsException`)
on a particularly malformed header byte. I think the `v` business is OK since
the default switch case translates that to an IAE (`IllegalArgumentException`),
but I agree with your suggestion to make a more general statement that this
method may do all sorts of terrible and unexpected things if you feed it
invalid UTF-8 (or reference an invalid start position).
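
To make the failure modes concrete, here is a minimal standalone sketch. It is not the code in this PR: `Utf8DecodeSketch` and its nested `CodePoint` holder are hypothetical names, and the header-byte classification uses plain range checks rather than the `utf8CodeLength` table. It only illustrates how a non-validating decoder of this shape can throw an IAE for a bad header byte, throw an AIOOBE on truncated input, or silently return garbage.

```java
// Hypothetical sketch, not Lucene's implementation.
final class Utf8DecodeSketch {

  /** Hypothetical reusable result holder, loosely modeled on the diff above. */
  static final class CodePoint {
    int codePoint;
    int numBytes;
  }

  /** Decodes one code point starting at pos; performs no full UTF-8 validation. */
  static CodePoint codePointAt(byte[] utf8, int pos, CodePoint reuse) {
    if (reuse == null) {
      reuse = new CodePoint();
    }
    final int lead = utf8[pos] & 0xFF; // an invalid start position may already fail here
    final int numBytes;
    int v;
    if (lead < 0x80) {
      numBytes = 1;
      v = lead;
    } else if (lead >= 0xC0 && lead < 0xE0) {
      numBytes = 2;
      v = lead & 31; // 5 useful bits
    } else if (lead >= 0xE0 && lead < 0xF0) {
      numBytes = 3;
      v = lead & 15; // 4 useful bits
    } else if (lead >= 0xF0 && lead < 0xF8) {
      numBytes = 4;
      v = lead & 7; // 3 useful bits
    } else {
      // Continuation bytes (0x80..0xBF) and 0xF8..0xFF are not valid header bytes.
      throw new IllegalArgumentException("invalid utf8");
    }
    for (int i = 1; i < numBytes; i++) {
      // No bounds check and no continuation-byte check: truncated input throws
      // ArrayIndexOutOfBoundsException, and malformed trailing bytes are decoded
      // into a meaningless value without any error.
      v = v << 6 | (utf8[pos + i] & 63);
    }
    reuse.codePoint = v;
    reuse.numBytes = numBytes;
    return reuse;
  }
}
```

In this sketch, a well-formed sequence such as `E2 82 AC` decodes to U+20AC, a lone `0x80` header raises IAE, and a truncated `E2 82` raises AIOOBE, which is the kind of behavior a broader Javadoc caveat would cover.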
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]