amit-jain commented on code in PR #1071:
URL: https://github.com/apache/jackrabbit-oak/pull/1071#discussion_r1319588274


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java:
##########
@@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, 
String pname, int tag, Pr
         return fieldAdded;
     }
 
+    /**
+     * Returns a {@code BytesRef} object constructed from the given {@code 
String} value and also truncates the length
+     * of the {@code BytesRef} object to the specified {@code maxLength}, 
ensuring that the multi-byte sequences are 
+     * properly truncated.
+     *
+     * <p>The {@code BytesRef} object is created from the provided {@code 
String} value using UTF-8 encoding. As a result, its length
+     * can exceed that of the {@code String} value, since Java strings use 
UTF-16 encoding. This necessitates appropriate truncation.
+     *
+     * <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 
10xxxxxx 10xxxxxx}.
+     * The method first truncates continuation bytes, which start with {@code 
10} in binary. It then truncates the head byte, which
+     * starts with {@code 11}. Both truncation operations use a binary mask of 
{@code 11100000}.

Review Comment:
   You are right the comment essentially mentions the wrong mask.
   
   >  I think this operation is complex enough that I'd be in strong favour of 
refactoring it into its own function and documented accordingly.
   I am not in favor of refactoring it as the operation is atomic and really 
does one thing which is to get a bytesRef.  But then if you feel strongly let's 
take the refactoring as a separate issue. 



##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java:
##########
@@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, 
String pname, int tag, Pr
         return fieldAdded;
     }
 
+    /**
+     * Returns a {@code BytesRef} object constructed from the given {@code 
String} value and also truncates the length
+     * of the {@code BytesRef} object to the specified {@code maxLength}, 
ensuring that the multi-byte sequences are 
+     * properly truncated.
+     *
+     * <p>The {@code BytesRef} object is created from the provided {@code 
String} value using UTF-8 encoding. As a result, its length
+     * can exceed that of the {@code String} value, since Java strings use 
UTF-16 encoding. This necessitates appropriate truncation.
+     *
+     * <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 
10xxxxxx 10xxxxxx}.
+     * The method first truncates continuation bytes, which start with {@code 
10} in binary. It then truncates the head byte, which
+     * starts with {@code 11}. Both truncation operations use a binary mask of 
{@code 11100000}.

Review Comment:
   You are right the comment essentially mentions the wrong mask.
   
   >  I think this operation is complex enough that I'd be in strong favour of 
refactoring it into its own function and documented accordingly.
   
   I am not in favor of refactoring it as the operation is atomic and really 
does one thing which is to get a bytesRef.  But then if you feel strongly let's 
take the refactoring as a separate issue. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to