amit-jain commented on code in PR #1071: URL: https://github.com/apache/jackrabbit-oak/pull/1071#discussion_r1319588274
########## oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java: ########## @@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr return fieldAdded; } + /** + * Returns a {@code BytesRef} object constructed from the given {@code String} value and also truncates the length + * of the {@code BytesRef} object to the specified {@code maxLength}, ensuring that the multi-byte sequences are + * properly truncated. + * + * <p>The {@code BytesRef} object is created from the provided {@code String} value using UTF-8 encoding. As a result, its length + * can exceed that of the {@code String} value, since Java strings use UTF-16 encoding. This necessitates appropriate truncation. + * + * <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}. + * The method first truncates continuation bytes, which start with {@code 10} in binary. It then truncates the head byte, which + * starts with {@code 11}. Both truncation operations use a binary mask of {@code 11100000}. Review Comment: You are right the comment essentially mentions the wrong mask. > I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly. I am not in favor of refactoring it as the operation is atomic and really does one thing which is to get a bytesRef. But then if you feel strongly let's take the refactoring as a separate issue. ########## oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java: ########## @@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr return fieldAdded; } + /** + * Returns a {@code BytesRef} object constructed from the given {@code String} value and also truncates the length + * of the {@code BytesRef} object to the specified {@code maxLength}, ensuring that the multi-byte sequences are + * properly truncated. + * + * <p>The {@code BytesRef} object is created from the provided {@code String} value using UTF-8 encoding. As a result, its length + * can exceed that of the {@code String} value, since Java strings use UTF-16 encoding. This necessitates appropriate truncation. + * + * <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}. + * The method first truncates continuation bytes, which start with {@code 10} in binary. It then truncates the head byte, which + * starts with {@code 11}. Both truncation operations use a binary mask of {@code 11100000}. Review Comment: You are right the comment essentially mentions the wrong mask. > I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly. I am not in favor of refactoring it as the operation is atomic and really does one thing which is to get a bytesRef. But then if you feel strongly let's take the refactoring as a separate issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@jackrabbit.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org