[ 
https://issues.apache.org/jira/browse/LANG-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089965#comment-18089965
 ] 

Gary D. Gregory commented on LANG-1770:
---------------------------------------

[~ggregory] 

Of course, but as I noted above, there are pitfalls and no current consensus as 
to a solution. This could be a large-scale issue across the whole library without a 
simple solution. IMO, the fix should be all (library-wide) or nothing, where 
nothing might mean updating the Javadoc and leaving it at that. Needs more 
research...

> StringUtils.abbreviate is not emoji aware, breaks surrogate pairs
> -----------------------------------------------------------------
>
>                 Key: LANG-1770
>                 URL: https://issues.apache.org/jira/browse/LANG-1770
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 3.17.0
>            Reporter: Gary D. Gregory
>            Priority: Major
>
> ---------- Forwarded message ---------
> From: Carsten Kirschner <[email protected]>
> Date: Fri, Apr 11, 2025 at 10:15 AM
> Subject: [lang] StringUtils.abbreviate is not emoji aware, breaks surrogate 
> pairs
> To: [email protected] <[email protected]>
> Hello,
> The current commons lang3 StringUtils.abbreviate (3.17.0) implementation will 
> destroy 4-byte emoji characters and larger grapheme clusters. I know that 
> handling graphemes correctly before Java 20 is not possible, but at least a 
> code-point-aware solution with String.offsetByCodePoints could be built. I 
> wrote a small test to show the problem.
> How abbreviate should treat the zero-width joiners in the family emoji is 
> debatable, but the result should never contain a question mark standing in 
> for an invalid char, as it does now.
> The problem is not so much the "doesn't look nice" aspect of the broken 
> emoji, but that when the abbreviated string is passed to an XML writer 
> (com.ctc.wstx.io.UTF8Writer in my case), the writer throws an exception on 
> the broken surrogate pair. Like this:
> Caused by: java.io.IOException: Broken surrogate pair: first char 0xd83c, second 0x2e; illegal combination
>                 at com.ctc.wstx.io.UTF8Writer._convertSurrogate(UTF8Writer.java:402) ~[woodstox-core-7.0.0.jar:7.0.0]
> Thanks,
> Carsten
> {code:java}
> import org.apache.commons.lang3.StringUtils;
> import org.junit.Test;
> import static org.junit.Assert.*;
> public class AbbreviateTest {
>                 String[] expectedResultsFox = {
>                                                "🦊...", // 4
>                                                "🦊🦊...",
>                                                "🦊🦊🦊...",
>                                                "🦊🦊🦊🦊...",
>                                                "🦊🦊🦊🦊🦊...",
>                                                "🦊🦊🦊🦊🦊🦊...",
>                                                "🦊🦊🦊🦊🦊🦊🦊...", // 10
>                 };
>                 String[] expectedResultsFamilyWithCodepoints = {
>                                                "👩...",
>                                                "👩🏻...",
>                                                "👩🏻‍...", // zero width joiner
>                                                "👩🏻‍👨...",
>                                                "👩🏻‍👨🏻...",
>                                                "👩🏻‍👨🏻‍...",
>                                                "👩🏻‍👨🏻‍👦..."
>                 };
>                 String[] expectedResultsFamilyWithGrapheme = {
>                                                "👩🏻‍👨🏻‍👦🏻‍👦🏻...", // 4
>                                                "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻...",
>                                                
> "👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼..."
>  // 10
>                 };
>                 @Test
>                 public void abbreviateTest() {
>                                String abbreviateResult;
>                                for(var i = 4; i <= 10; i++) {
>                                                abbreviateResult = StringUtils.abbreviate("🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊🦊", i);
>                                                System.out.println(abbreviateResult);
>                                                //assertEquals(expectedResultsFox[i - 4], abbreviateResult);
>                                }
>                                for(var i = 4; i <= 10; i++) {
>                                                abbreviateResult = 
> StringUtils.abbreviate("👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿👩🏻‍👨🏻‍👦🏻‍👦🏻👩🏼‍👨🏼‍👦🏼‍👦🏼👩🏽‍👨🏽‍👦🏽‍👦🏽👩🏾‍👨🏾‍👦🏾‍👦🏾👩🏿‍👨🏿‍👦🏿‍👦🏿",
>                                                        i);
>                                                System.out.println(abbreviateResult);
>                                                //assertEquals(expectedResultsFamilyWithCodepoints[i - 4], abbreviateResult);
>                                }
>                 }
> }
> {code}
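
The code-point-aware approach mentioned in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name CodePointAbbreviate and both method names are hypothetical and not part of Commons Lang, the width is counted in code points (which matches the expectedResultsFox array above), and only the simple "cut at the end and append ..." case is covered, not the offset variants of StringUtils.abbreviate.

```java
/**
 * Hypothetical helper, not part of Commons Lang: abbreviates on code point
 * boundaries so that a surrogate pair is never split.
 */
public final class CodePointAbbreviate {

    private static final String MARKER = "...";

    private CodePointAbbreviate() {
    }

    /**
     * Returns str unchanged if it has at most maxCodePoints code points,
     * otherwise the first (maxCodePoints - 3) code points followed by "...".
     */
    public static String abbreviate(String str, int maxCodePoints) {
        if (str == null) {
            return null;
        }
        if (maxCodePoints < MARKER.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        int total = str.codePointCount(0, str.length());
        if (total <= maxCodePoints) {
            return str;
        }
        int keep = maxCodePoints - MARKER.length();
        // Translate a count of code points into a char index.
        int end = str.offsetByCodePoints(0, keep);
        return str.substring(0, end) + MARKER;
    }

    /** Returns true if s contains a high or low surrogate without its partner. */
    public static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair, skip the low surrogate
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, abbreviating the fox string to width 4 keeps one whole fox plus the marker, and hasUnpairedSurrogate can serve as a check before handing the result to an XML writer. It still cuts inside zero-width-joiner sequences such as the family emoji, so it addresses the broken surrogate pair but not grapheme clusters.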



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
