steffenvan commented on code in PR #1071: URL: https://github.com/apache/jackrabbit-oak/pull/1071#discussion_r1319566342
########## oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java: ########## @@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr return fieldAdded; } + /** + * Returns a {@code BytesRef} object constructed from the given {@code String} value and also truncates the length + * of the {@code BytesRef} object to the specified {@code maxLength}, ensuring that the multi-byte sequences are + * properly truncated. + * + * <p>The {@code BytesRef} object is created from the provided {@code String} value using UTF-8 encoding. As a result, its length + * can exceed that of the {@code String} value, since Java strings use UTF-16 encoding. This necessitates appropriate truncation. + * + * <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}. + * The method first truncates continuation bytes, which start with {@code 10} in binary. It then truncates the head byte, which + * starts with {@code 11}. Both truncation operations use a binary mask of {@code 11100000}. Review Comment: Do both truncation operations really use a binary mask of `11100000`? From what I can see, the first while loop: ```java while ((ref.bytes[end] & 0b11000000) == 0b10000000) { end--; } ``` is checking if the byte at `ref.bytes[end]` is of the form `10xxxxxx`. It's doing that by masking the byte with `11000000` and checking if the result is `10000000`. If it is, then the byte has the form `10xxxxxx`. And the following `if` statement: ```java if ((ref.bytes[end] & 0b11000000) == 0b11000000) { end--; } ``` checks in a similar manner if the byte-sequence is of the pattern `110xxxxx`. But I don't see that any of the truncation operations use a binary mask of `11100000`. In any case, I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly. That way, writing a proper example wouldn't clutter the `getTruncatedBytesRef` function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@jackrabbit.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org