steffenvan commented on code in PR #1071:
URL: https://github.com/apache/jackrabbit-oak/pull/1071#discussion_r1318352601
##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java:
##########
@@ -315,6 +313,38 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr
         }
         return fieldAdded;
     }
+
+    protected static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) {
+        log.trace("Property {} at path:[{}] has value {}", prop, path, value);
+
+        BytesRef ref = new BytesRef(value);
+        if (ref.length <= maxLength) {
+            return ref;
+        }
+        log.info("Truncating property {} at path:[{}] as length after encoding {} is > {} ",
+                prop, path, ref.length, maxLength);
+        int end = maxLength - 1;
+        // skip over tails of utf-8 multi-byte sequences (up to 3 bytes)
+        while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
+            end--;
+        }
+        // remove one head of a utf-8 multi-byte sequence (at most 1)
+        if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
+            end--;
+        }
+        byte[] bytes2 = Arrays.copyOf(ref.bytes, end + 1);
+        String truncated = new String(bytes2, StandardCharsets.UTF_8);
+        ref = new BytesRef(truncated);
+        log.trace("Truncated property {} at path:[{}] to {}", prop, path, ref.utf8ToString());
+        while (ref.length > maxLength) {

Review Comment:
   I think it's important to document this "emergency" even more. When can it happen, what do we do to fix it, and why is it okay for us to fix it the way we are? From a quick glance at the code, it is not immediately clear why this works. Specifically, one or more clear examples would make it very clear imo.
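As one possible shape for such an example, here is a minimal standalone sketch of the backward scan over UTF-8 bytes. It is not the PR's code: it works on a plain `byte[]` from `String.getBytes` rather than Lucene's `BytesRef`, the names `Utf8TruncateExample` and `truncateUtf8` are hypothetical, the bounds checks on `end` are an addition, and it assumes `maxLength >= 0`. It only illustrates the two steps in the diff (skip continuation bytes, then drop one lead byte) and does not cover the final `while (ref.length > maxLength)` loop.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8TruncateExample {

    // Hypothetical helper: truncates the UTF-8 encoding of value to at most
    // maxLength bytes without cutting through a multi-byte sequence.
    // Assumes maxLength >= 0.
    static String truncateUtf8(String value, int maxLength) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxLength) {
            return value; // already fits, nothing to do
        }
        int end = maxLength - 1;
        // Step back over continuation bytes (bit pattern 10xxxxxx).
        while (end >= 0 && (bytes[end] & 0b11000000) == 0b10000000) {
            end--;
        }
        // If we now sit on a lead byte (bit pattern 11xxxxxx), drop it as well.
        // The scan does not check whether the sequence was complete, so a
        // multi-byte character that ends exactly at the limit is dropped too.
        if (end >= 0 && (bytes[end] & 0b11000000) == 0b11000000) {
            end--;
        }
        return new String(Arrays.copyOf(bytes, end + 1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASCII only: the cut lands on a single-byte character.
        System.out.println(truncateUtf8("abcdef", 4));       // abcd

        // "h\u00e9llo" encodes as 68 C3 A9 6C 6C 6F (6 bytes).
        // Limit 2 would split C3 A9, so the lead byte C3 is dropped.
        System.out.println(truncateUtf8("h\u00e9llo", 2));   // h

        // Limit 3 ends exactly after C3 A9, yet the character is still
        // dropped: the scan steps back over A9 and then removes C3.
        System.out.println(truncateUtf8("h\u00e9llo", 3));   // h

        // "a\u20acb" encodes as 61 E2 82 AC 62 (5 bytes).
        // Limit 3 would split the 3-byte euro sign, so all of it is dropped.
        System.out.println(truncateUtf8("a\u20acb", 3));     // a
    }
}
```

In this sketch the result can be shorter than strictly necessary (the `maxLength = 3` case above), but its encoded length never exceeds `maxLength`, which is the property the Javadoc would need to state and justify.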