[ https://issues.apache.org/jira/browse/SLING-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427630#comment-15427630 ]
Ben Fortuna commented on SLING-5973:
------------------------------------

[~olli] I'm not able to run that test in my local environment just now, but I have created another simple test that I believe demonstrates the issue:

https://github.com/micronode/whistlepost/blob/master/whistlepost-rewrite-lib/src/test/groovy/org/apache/cocoon/components/serializers/encoding/XMLEncoderTest.groovy

As these emoji characters lie above the 16-bit Unicode range, they can't be represented by a single Unicode escape sequence in Java. However, they can be represented as a "surrogate pair", e.g.:

{code}U+1F340 in Java is: "\ud83c\udf40"{code}

I found a good explanation of this here: http://stackoverflow.com/a/26231925/163223

The Cocoon HTMLSerializer uses an XMLEncoder implementation that encodes characters *one at a time*, which means it doesn't recognize that these two characters form a surrogate pair and consequently doesn't return the correct HTML escape code.

So I don't think this is necessarily a Sling issue, but a possible workaround might be to provide a different encoder that supports surrogate pairs.

> HTMLSerializer not handling some unicode characters (emoji, etc.)
> -----------------------------------------------------------------
>
>                 Key: SLING-5973
>                 URL: https://issues.apache.org/jira/browse/SLING-5973
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Ben Fortuna
>         Attachments: emoji-no-sling-rewriter.png, emoji-with-sling-rewriter.png
>
>
> I've noticed that when I have unicode special characters (e.g. emoji) in my
> sling content and the sling rewriter is enabled the characters are not output
> correctly to the browser. For example:
> {code}😁{code} becomes {code}��{code}
> If I disable the rewriter pipeline the output is as expected.
> I've looked in the code and I suspect the issue is in the HTMLSerializer from
> the Cocoon library, however I'm not sure why as it should be using the
> default encoding for output (which is UTF-8).
> My rewriter pipeline is using the default html-generator and html-serializer
> provided by sling.
> My code is available on GitHub here:
> https://github.com/Whistlepost/emojistrip
> It provides a very simple app/content project pair with some emoji characters
> in the content (see src/main/resources/SLING-INF/content/phrases.json). Many
> thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
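
The surrogate-pair point made in the comment can be illustrated with a minimal Java sketch. It is not the Cocoon XMLEncoder and not a proposed patch: the class `SurrogateAwareEncoder` and both method names are hypothetical, the choice to escape every non-ASCII code point as a hexadecimal numeric character reference is an assumption made for the illustration, and markup characters such as `<` and `&` are deliberately left untouched. The per-char variant only mimics the behaviour described in the comment.

```java
// Hypothetical illustration only; not part of Cocoon or Sling.
final class SurrogateAwareEncoder {

    private SurrogateAwareEncoder() {
    }

    // Walks the string by code point, so a surrogate pair is escaped as one reference.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            appendEscaped(out, codePoint);
            // Advance by 2 chars for a supplementary code point, otherwise by 1.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    // Walks the string one char at a time, so each half of a surrogate pair
    // is escaped separately (the failure mode described in the comment).
    static String encodePerChar(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            appendEscaped(out, text.charAt(i));
        }
        return out.toString();
    }

    // Assumption for this sketch: ASCII passes through, everything else
    // becomes a hexadecimal numeric character reference.
    private static void appendEscaped(StringBuilder out, int value) {
        if (value < 0x80) {
            out.append((char) value);
        } else {
            out.append("&#x").append(Integer.toHexString(value).toUpperCase()).append(';');
        }
    }

    public static void main(String[] args) {
        String clover = "\ud83c\udf40"; // U+1F340 as a surrogate pair
        System.out.println(encode(clover));        // prints &#x1F340;
        System.out.println(encodePerChar(clover)); // prints &#xD83C;&#xDF40;
    }
}
```

The second line of output references two lone surrogates, which are not valid character references in HTML, so a browser would be expected to render replacement characters much like those shown in the issue description. Iterating with `String.codePointAt` and `Character.charCount` is one way an alternative encoder could avoid that.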