steffenvan commented on code in PR #1071:
URL: https://github.com/apache/jackrabbit-oak/pull/1071#discussion_r1319566342


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java:
##########
@@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, 
String pname, int tag, Pr
         return fieldAdded;
     }
 
+    /**
+     * Returns a {@code BytesRef} object constructed from the given {@code 
String} value and also truncates the length
+     * of the {@code BytesRef} object to the specified {@code maxLength}, 
ensuring that the multi-byte sequences are 
+     * properly truncated.
+     *
+     * <p>The {@code BytesRef} object is created from the provided {@code 
String} value using UTF-8 encoding. As a result, its length
+     * can exceed that of the {@code String} value, since Java strings use 
UTF-16 encoding. This necessitates appropriate truncation.
+     *
+     * <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 
10xxxxxx 10xxxxxx}.
+     * The method first truncates continuation bytes, which start with {@code 
10} in binary. It then truncates the head byte, which
+     * starts with {@code 11}. Both truncation operations use a binary mask of 
{@code 11100000}.

Review Comment:
   Do both truncation operations really use a binary mask of `11100000`?
   
   From what I can see, the first while loop:
   ```java
   while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
       end--;
   }
   ```
    is checking if the byte at `ref.bytes[end]` is of the form `10xxxxxx`. It's 
doing that by masking the byte with `11000000` and checking if the result is 
`10000000`. If it is, then the byte has the form `10xxxxxx`. 
   
   And the following `if` statement:
   ```java
   if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
       end--;
   }
   ``` 
   checks in a similar manner if the byte-sequence is of the pattern 
`110xxxxx`. But I don't see that any of the truncation operations use a binary 
mask of `11100000`. 
   
   In any case, I think this operation is complex enough that I'd be in strong 
favour of refactoring it into its own function and documented accordingly. That 
way, writing a proper example wouldn't clutter the `getTruncatedBytesRef` 
function. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to