[ https://issues.apache.org/jira/browse/LANG-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089965#comment-18089965 ]
Gary D. Gregory commented on LANG-1770:
---------------------------------------
[~ggregory]
Of course, but as I noted above there are pitfalls and no current consensus as
to a solution. This could be a large-scale issue across the whole library
without a simple solution. IMO, the fix should be all (library-wide) or nothing,
where nothing might mean updating the Javadoc and leaving it at that. Needs more
research...
> StringUtils.abbreviate is not emoji aware, breaks surrogate pairs
> -----------------------------------------------------------------
>
> Key: LANG-1770
> URL: https://issues.apache.org/jira/browse/LANG-1770
> Project: Commons Lang
> Issue Type: Bug
> Components: lang.*
> Affects Versions: 3.17.0
> Reporter: Gary D. Gregory
> Priority: Major
>
> ---------- Forwarded message ---------
> From: Carsten Kirschner <[email protected]>
> Date: Fri, Apr 11, 2025 at 10:15 AM
> Subject: [lang] StringUtils.abbreviate is not emoji aware, breaks surrogate
> pairs
> To: [email protected] <[email protected]>
> Hello,
> The current commons lang3 StringUtils.abbreviate (3.17.0) implementation will
> destroy 4-byte emoji characters and larger grapheme clusters. I know that
> handling graphemes correctly before Java 20 is not possible, but at least a
> code-point-aware solution with String.offsetByCodePoints could be built. I
> wrote a small test to show the problem.
> How abbreviate should treat the zero-width joiners in the family emoji is
> debatable, but the result should never contain a question mark for an invalid
> char, as it does now.
> The problem is not so much the "doesn't look nice" aspect of the broken
> emoji. It is that when the abbreviated string is passed to an XML writer
> (com.ctc.wstx.io.UTF8Writer in my case), the writer throws an exception on the
> broken byte sequence, like this:
> Caused by: java.io.IOException: Broken surrogate pair: first char 0xd83c,
> second 0x2e; illegal combination
> at com.ctc.wstx.io.UTF8Writer._convertSurrogate(UTF8Writer.java:402)
> ~[woodstox-core-7.0.0.jar:7.0.0]
> Thanks,
> Carsten
> {code:java}
> import org.apache.commons.lang3.StringUtils;
> import org.junit.Test;
> import static org.junit.Assert.*;
> public class AbbreviateTest {
> String[] expectedResultsFox = {
> "🦊...", // 4
> "🦊🦊...",
> "🦊🦊🦊...",
> "🦊🦊🦊🦊...",
> "🦊🦊🦊🦊🦊...",
> "🦊🦊🦊🦊🦊🦊...",
> "🦊🦊🦊🦊🦊🦊🦊...", // 10
> };
> String[] expectedResultsFamilyWithCodepoints = {
> "👩...",
> "👩🏻...",
> "👩🏻...", // zero width joiner
> "👩🏻👨...",
> "👩🏻👨🏻...",
> "👩🏻👨🏻...",
> "👩🏻👨🏻👦..."
> };
> String[] expectedResultsFamilyWithGrapheme = {
> "👩🏻👨🏻👦🏻👦🏻...", // 4
> "👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼...",
>
> "👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼👩🏽👨🏽👦🏽👦🏽...",
>
> "👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼👩🏽👨🏽👦🏽👦🏽👩🏾👨🏾👦🏾👦🏾...",
>
> "👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼👩🏽👨🏽👦🏽👦🏽👩🏾👨🏾👦🏾👦🏾👩🏿👨🏿👦🏿👦🏿...",
>
> "👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼👩🏽👨🏽👦🏽👦🏽👩🏾👨🏾👦🏾👦🏾👩🏿👨🏿👦🏿👦🏿👩🏻👨🏻👦🏻👦🏻...",
>
> "👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼👩🏽👨🏽👦🏽👦🏽👩🏾👨🏾👦🏾👦🏾👩🏿👨🏿👦🏿👦🏿👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼..."
> // 10
> };
> @Test
> public void abbreviateTest() {
> String abbreviateResult;
> for(var i = 4; i <= 10; i++) {
> abbreviateResult =
> StringUtils.abbreviate("🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊", i);
>
> System.out.println(abbreviateResult);
>
> //assertEquals(expectedResultsFox[i - 4], abbreviateResult);
> }
> for(var i = 4; i <= 10; i++) {
> abbreviateResult =
> StringUtils.abbreviate("👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼👩🏽👨🏽👦🏽👦🏽👩🏾👨🏾👦🏾👦🏾👩🏿👨🏿👦🏿👦🏿👩🏻👨🏻👦🏻👦🏻👩🏼👨🏼👦🏼👦🏼👩🏽👨🏽👦🏽👦🏽👩🏾👨🏾👦🏾👦🏾👩🏿👨🏿👦🏿👦🏿",
> i);
>
> System.out.println(abbreviateResult);
>
> //assertEquals(expectedResultsFamilyWithCodepoints[i - 4], abbreviateResult);
> }
> }
> }
> {code}
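
For illustration only, the code-point-aware approach suggested in the report could be sketched roughly as follows. The class name `CodePointAbbreviate`, the method `abbreviate`, and the parameter `maxCodePoints` are hypothetical and are not part of Commons Lang. The sketch also assumes the width is counted in code points rather than in `char` units, which is a behavior change and not something the issue has settled. It only avoids splitting surrogate pairs; grapheme clusters (skin-tone modifiers, zero-width-joiner sequences) can still be cut apart.

```java
// Hypothetical sketch, NOT the Commons Lang implementation.
// Assumption: widths are measured in Unicode code points, not UTF-16 chars.
final class CodePointAbbreviate {

    private static final String ELLIPSIS = "...";

    private CodePointAbbreviate() {
    }

    /**
     * Abbreviates str to at most maxCodePoints code points, ending in "...".
     * The cut position is found with String.offsetByCodePoints, so it never
     * falls between the two halves of a surrogate pair.
     */
    public static String abbreviate(String str, int maxCodePoints) {
        if (str == null) {
            return null;
        }
        if (maxCodePoints < ELLIPSIS.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        int total = str.codePointCount(0, str.length());
        if (total <= maxCodePoints) {
            return str; // already short enough, nothing to cut
        }
        // Convert "keep this many code points" into a char index.
        int end = str.offsetByCodePoints(0, maxCodePoints - ELLIPSIS.length());
        return str.substring(0, end) + ELLIPSIS;
    }

    public static void main(String[] args) {
        StringBuilder foxes = new StringBuilder();
        for (int i = 0; i < 14; i++) {
            foxes.append("\uD83E\uDD8A"); // U+1F98A FOX FACE as a surrogate pair
        }
        for (int width = 4; width <= 10; width++) {
            System.out.println(abbreviate(foxes.toString(), width));
        }
    }
}
```

With the fox input from the test above, a width of 4 keeps one whole fox followed by the ellipsis, which matches the first entry of `expectedResultsFox`. For the family emoji the output corresponds to the code-point expectations, not the grapheme ones, so a woman emoji can still be separated from its skin-tone modifier.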