Hello,
The current commons-lang3 StringUtils.abbreviate implementation (3.17.0)
breaks 4-byte emoji characters and larger grapheme clusters. I know that
handling grapheme clusters correctly before Java 20 is not possible, but at
least a code-point-aware solution based on String.offsetByCodePoints could be
built. I wrote a small test to show the problem.
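To illustrate the code-point-aware idea, here is a minimal sketch. The class name CodePointAbbreviator and the decision to count maxWidth in code points are assumptions made for this example; this is not the commons-lang3 implementation or a proposed patch.

```java
// Hypothetical helper, not part of commons-lang3.
final class CodePointAbbreviator {

    private static final String MARKER = "...";

    // Abbreviates str to at most maxWidth code points (marker included).
    // Cuts only on code point boundaries, so a surrogate pair is never split.
    static String abbreviate(String str, int maxWidth) {
        if (str == null) {
            return null;
        }
        if (maxWidth < MARKER.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        int codePoints = str.codePointCount(0, str.length());
        if (codePoints <= maxWidth) {
            return str;
        }
        // Translate "number of code points to keep" into a char index.
        int end = str.offsetByCodePoints(0, maxWidth - MARKER.length());
        return str.substring(0, end) + MARKER;
    }
}
```

This still cuts inside grapheme clusters (for example between an emoji and its skin tone modifier), but the result is always a well-formed UTF-16 string.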
Whether abbreviate should cut at the zero-width joiners in the family emoji is
debatable, but the result should never contain a question mark for an invalid
char, as it does now.
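For the grapheme case, a rough sketch based on java.text.BreakIterator could look like the following. The class name GraphemeAbbreviator is hypothetical, and counting maxWidth in grapheme clusters is an assumption of this example. How well the iterator groups ZWJ sequences and skin tone modifiers depends on the JDK version, so on older JDKs this may behave more like a code-point-aware cut.

```java
import java.text.BreakIterator;

// Hypothetical helper, not part of commons-lang3.
final class GraphemeAbbreviator {

    private static final String MARKER = "...";

    // Abbreviates str to at most maxWidth user-perceived characters
    // (marker included), cutting only on boundaries reported by BreakIterator.
    static String abbreviate(String str, int maxWidth) {
        if (str == null) {
            return null;
        }
        if (maxWidth < MARKER.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(str);
        int keep = maxWidth - MARKER.length();
        int count = 0;   // clusters seen so far
        int cut = -1;    // char index after the last cluster to keep
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            count++;
            if (count == keep) {
                cut = end;
            }
            if (count > maxWidth) {
                // More clusters than fit: keep the first ones and add the marker.
                return str.substring(0, cut) + MARKER;
            }
        }
        return str; // maxWidth clusters or fewer, nothing to abbreviate
    }
}
```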
The problem is not so much the "doesn't look nice" aspect of the broken emoji.
If the abbreviated string is passed to an XML writer
(com.ctc.wstx.io.UTF8Writer in my case), the writer throws an exception on the
broken surrogate pair, like this:
Caused by: java.io.IOException: Broken surrogate pair: first char 0xd83c, second 0x2e; illegal combination
    at com.ctc.wstx.io.UTF8Writer._convertSurrogate(UTF8Writer.java:402) ~[woodstox-core-7.0.0.jar:7.0.0]
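The char values in that message fit a cut in the middle of a surrogate pair: 0xd83c is the high surrogate of the skin tone modifiers, and 0x2e is the first '.' of the "..." marker. The sketch below reproduces that shape with a plain char-index cut and adds a small check for unpaired surrogates. The class name SurrogateCheck is hypothetical, and the cut in the usage note only imitates what the reported behaviour suggests; it does not call commons-lang3.

```java
// Hypothetical helper to detect strings that a strict UTF-8 encoder would reject.
final class SurrogateCheck {

    // Returns true if s contains a high surrogate without a following low
    // surrogate, or a low surrogate without a preceding high surrogate.
    static boolean hasUnpairedSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair, skip the low surrogate
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, "woman + light skin tone" is the four chars D83D DC69 D83C DFFB. Cutting after three chars and appending "..." leaves D83C directly followed by 0x2E, and hasUnpairedSurrogate returns true for that string.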
Thanks,
Carsten
import org.apache.commons.lang3.StringUtils;
import org.junit.Test;

import static org.junit.Assert.*;

public class AbbreviateTest {

    // Expected results for maxWidth 4..10 when cutting on code point boundaries.
    String[] expectedResultsFox = {
        "🦊...", // maxWidth 4
        "🦊🦊...",
        "🦊🦊🦊...",
        "🦊🦊🦊🦊...",
        "🦊🦊🦊🦊🦊...",
        "🦊🦊🦊🦊🦊🦊...",
        "🦊🦊🦊🦊🦊🦊🦊...", // maxWidth 10
    };

    // Expected results for maxWidth 4..10 when each code point counts as one character.
    String[] expectedResultsFamilyWithCodepoints = {
        "👩...",
        "👩🏻...",
        "👩🏻‍...", // zero width joiner
        "👩🏻‍👨...",
        "👩🏻‍👨🏻...",
        "👩🏻‍👨🏻‍...",
        "👩🏻‍👨🏻‍👦..."
    };

    // Expected results for maxWidth 4..10 when each grapheme cluster counts as one character.
    String[] expectedResultsFamilyWithGrapheme = {
        "👩🏻‍👨🏻‍👦🏻‍👦🏻...", // maxWidth 4
        "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼...",
        "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽...",
        "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾...",
        "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿...",
        "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻...",
        "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼..." // maxWidth 10
    };

    @Test
    public void abbreviateTest() {
        String abbreviateResult;

        // Single code point emoji (two chars each), maxWidth 4..10.
        for (var i = 4; i <= 10; i++) {
            abbreviateResult =
                StringUtils.abbreviate("🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊", i);
            System.out.println(abbreviateResult);
            //assertEquals(expectedResultsFox[i - 4], abbreviateResult);
        }

        // Family emoji built from skin tone modifiers and zero width joiners, maxWidth 4..10.
        for (var i = 4; i <= 10; i++) {
            abbreviateResult =
                StringUtils.abbreviate("👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿",
                    i);
            System.out.println(abbreviateResult);
            //assertEquals(expectedResultsFamilyWithCodepoints[i - 4], abbreviateResult);
        }
    }
}