Re: Getting UTF-16 encoding on dynamic content regardless of output content type

Christopher Schultz Thu, 31 Mar 2022 09:18:16 -0700

Greg,

On 3/29/22 13:41, gelo1234 wrote:

> Have you also tried HTMLT or XHTMLT Serializers?
> Default HTMLSerializer cannot handle some unicode characters:
> https://issues.apache.org/jira/browse/SLING-5973?attachmentOrder=asc

Hmm. Are the HTMLT / XHTMLT serializers built-in? I have disabled all blocks during the build, so I'm just using Cocoon core.


Thanks,
-chris

On Tue, 29 Mar 2022 at 19:37, gelo1234 <[email protected]> wrote:


    Hello Chris,

    I think you will not get any icon-type character on output without
    using proper font rendering - like Emoji support? Emoji might not be
    supported by default in Cocoon.
    So this might be the reason why you get HTML entities instead of
    Emoji-icons.
    Also notice:
    https://www.mail-archive.com/[email protected]/msg61629.html

    Greetings,
    Greg



    On Tue, 29 Mar 2022 at 18:36, Christopher Schultz
    <[email protected]> wrote:

        Cédric,

        On 3/29/22 12:06, Cédric Damioli wrote:
         > Could you provide more details?
         > How is your XML processed before outputting the wrong UTF-8
         > sequence?

        It's somewhat straightforward:

        <map:match pattern="/foo">
            <map:generate src="https://source/" />

            <map:transform src="stuff-to-cincludes.xsl" />

            <map:transform src="other-stuff-to-cincludes.xsl" />

            <map:transform type="cinclude" />

            <map:transform src="my-big-transformer-to-xhtml.xsl" />

            <map:transform type="cinclude" /><!-- Yes, another one -->

            <map:transform type="i18n" />

            <map:transform src="strip-namespaces.xsl" />
            <!-- This is mine, not Cocoon's -->

            <map:serialize type="xhtml" />
        </map:match>

        The xhtml serializer is the default, with encoding set to UTF-8. The
        HTTP response has "Content-Type: text/html" and the document itself
        contains:

        <?xml version="1.0" encoding="UTF-8"?>

        and

        <meta content="text/html; charset=utf-8"
        http-equiv="content-type" />

        So I think everything is configured correctly; it's just that those
        characters are getting mangled by something. I can try to cut out
        some of those steps and see where it's happening.

        I seem to remember being able to give each pipeline step a "marker"
        or something, so you can say "stop after step 3" or whatever instead
        of having to chop out configuration. Can you remind me what that is
        again?

        Thanks,
        -chris

         > On 29/03/2022 at 17:48, Christopher Schultz wrote:
         >> All,
         >>
         >> I'm still struggling with this. I have upgraded to 2.1.13, which
         >> includes the fix for
         >> https://issues.apache.org/jira/browse/COCOON-2352
         >> but I'm still getting that American flag converted into those 4
         >> HTML entities:
         >>
         >> &#55356;&#56826;&#55356;&#56824;
         >>
         >> I would expect there to be a single (multibyte) character in the
         >> output with no HTML entities.
         >>
         >> I've double-checked, and the source XML contains the flag as a
         >> single multi-byte character, served as UTF-8.
         >>
         >> Any ideas for how to get this working? I'm sure I could put
         >> together a trivial test-case.
         >>
         >> Thanks,
         >> -chris
         >>
         >> On 10/30/18 12:18, Christopher Schultz wrote:
         >>> All,
         >>>
         >>> Some additional information at the end.
         >>>
         >>> On 10/30/18 11:58, Christopher Schultz wrote:
         >>>> All,
         >>>
         >>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I
         >>>> have a servlet generating XML in UTF-8 encoding and I have a
         >>>> pipeline with a few transforms in it, ultimately serializing to
         >>>> XHTML.
         >>>
         >>>> If I have a Unicode character in the XML which is outside of the
         >>>> BMP, such as this one: 🇺🇸 (that's an American flag, in case your
         >>>> mail reader doesn't render it correctly), then I end up getting
         >>>> a series of bytes coming from Cocoon after the transform that
         >>>> look like UTF-16.
         >>>
         >>>> Here's what's in the XML:
         >>>
         >>>> <first-name>Test🇺🇸</first-name>
         >>>
         >>>> Just like that. The bytes in the message for the flag character
         >>>> are:
         >>>
         >>>> f0  9f  87  ba  f0  9f  87  b8
         >>>
         >>>> When rendering that into XHTML, I'm getting this in the output:
         >>>
         >>>> Test&#55356;&#56826;&#55356;&#56824;
         >>>
         >>>> The American flag in Unicode reference can be found here:
         >>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
         >>>
         >>>> You can see it broken down a bit better here for "Regional U":
         >>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
         >>>
         >>>> and "Regional S":
         >>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
         >>>
         >>>> What's happening is that some component in Cocoon has decided to
         >>>> generate HTML entities instead of just emitting the character.
         >>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
         >>>> output encoding.
         >>>
         >>>> The first two entities "&#55356;&#56826;" are the decimal values
         >>>> of the UTF-16 surrogate pair for "Regional Indicator Symbol
         >>>> Letter U" and they are correct... for UTF-16. If I change the
         >>>> output encoding from UTF-8 to UTF-16, then the browser will
         >>>> render these correctly. Using UTF-8, they show as four of those
         >>>> ugly [?] characters on the screen.
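
As a quick check of the arithmetic in that paragraph, here is a minimal Java sketch that recombines each pair of decimal entity values into the code point it stands for. The class and method names are made up for illustration; `Character.toCodePoint` is the standard JDK call.

```java
// Hypothetical helper: treat two decimal entity values as a UTF-16
// surrogate pair and recombine them into a single Unicode code point.
class SurrogateDemo {
    static int combine(int high, int low) {
        // Character.toCodePoint expects (high surrogate, low surrogate)
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        // &#55356;&#56826; -> Regional Indicator Symbol Letter U
        System.out.println(Integer.toHexString(combine(55356, 56826))); // 1f1fa
        // &#55356;&#56824; -> Regional Indicator Symbol Letter S
        System.out.println(Integer.toHexString(combine(55356, 56824))); // 1f1f8
    }
}
```

Both pairs come out as supplementary code points, which fits the reading that the entities carry UTF-16 code units rather than code points.
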
         >>>
         >>>> I had originally just decided to throw up my hands and use
         >>>> UTF-16 encoding even though it's dumb. But it seems that MSIE
         >>>> cannot be convinced to use UTF-16 no matter what, and I must
         >>>> continue to support MSIE. :(
         >>>
         >>>> So it's back to UTF-8 for me.
         >>>
         >>>> How can I get Cocoon to output that character (or "those
         >>>> characters") correctly?
         >>>
         >>>> It needs to be one of the following:
         >>>
         >>>> &#127482;&#127480; (HTML decimal entities)
         >>>> &#x1f1fa;&#x1f1f8; (HTML hex entities)
         >>>> f0 9f 87 ba f0 9f 87 b8 (raw UTF-8 bytes)
         >>>
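
For reference, all three acceptable forms can be derived from the same Java string. This is only a sketch, assuming Java 8 or later for `String.codePoints()`; the class and method names are hypothetical and none of this is Cocoon's own serializer code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper producing the three output forms listed above.
class FlagForms {
    // One decimal numeric entity per code point (not per UTF-16 char)
    static String decimalEntities(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    // One hexadecimal numeric entity per code point
    static String hexEntities(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
            sb.append("&#x").append(Integer.toHexString(cp)).append(';'));
        return sb.toString();
    }

    // The raw UTF-8 bytes, as space-separated hex pairs
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String flag = "\uD83C\uDDFA\uD83C\uDDF8"; // the flag as Java sees it
        System.out.println(decimalEntities(flag)); // &#127482;&#127480;
        System.out.println(hexEntities(flag));     // &#x1f1fa;&#x1f1f8;
        System.out.println(utf8Hex(flag));         // f0 9f 87 ba f0 9f 87 b8
    }
}
```
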
         >>>> Does anyone know how/where this conversion is being performed in
         >>>> Cocoon? Probably in an XHTML serializer (I'm using
         >>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
         >>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my
         >>>> sitemap for that serializer (the one named "xhtml"). I believe
         >>>> I've made very few changes from the default, if any.
         >>>
         >>>> I haven't yet figured out how to get from what Java sees (\uE50C
         >>>> for the "S" for example) to &#x1f1f8;, but knowing where the
         >>>> code is that is making that decision would be very helpful.
         >>>
         >>>> Any ideas?
         >>>
         >>>> -chris
         >>>
         >>> I created a text file (UTF-8) containing only the flag and read
         >>> it in using Java and printed all of the code points. There should
         >>> be 2 "characters" in the file. It's 4 bytes per UTF-8 character
         >>> so I assumed I'd end up with 2 'char' primitives in the file, but
         >>> I ended up with more.
         >>>
         >>> Here's the loop and the output:
         >>>
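         >>> // FileReader decodes using the default charset (assumed UTF-8)
         >>> try (java.io.FileReader in = new java.io.FileReader("file.txt")) {
         >>>     char[] chars = new char[10];
         >>>
         >>>     // read() fills the array with UTF-16 code units (Java chars)
         >>>     int count = in.read(chars);
         >>>
         >>>     // Deliberately visit every char index, surrogates included
         >>>     for (int i = 0; i < count; ++i)
         >>>         System.out.println("Code point at " + i + " is "
         >>>             + Integer.toHexString(Character.codePointAt(chars, i)));
         >>>
         >>> } catch (Exception e) {
         >>>     e.printStackTrace();
         >>> }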
         >>>
         >>> == output ==
         >>>
         >>> Code point at 0 is 1f1fa
         >>> Code point at 1 is ddfa
         >>> Code point at 2 is 1f1f8
         >>> Code point at 3 is ddf8
         >>> Code point at 4 is a
         >>>
         >>> So Java thinks there are 4 things there, not 2. That could be a
         >>> part of the confusion. The code points shown for indexes 0 and 2
         >>> are the "correct" ones. Those at indexes 1 and 3 should actually
         >>> be *skipped*.
         >>>
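
To make the "4 things, not 2" point concrete: a Java `char` is a UTF-16 code unit, so the flag has a length of 4 but only 2 code points. The sketch below (class and method names are hypothetical) steps through the string by `Character.charCount`, so the low surrogates at indexes 1 and 3 are never visited on their own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper contrasting UTF-16 code units with code points.
class CodeUnitsVsCodePoints {
    // Collect code points, advancing by 1 or 2 chars as appropriate
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 2 for supplementary code points
        }
        return result;
    }

    public static void main(String[] args) {
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(flag.length());                         // 4
        System.out.println(flag.codePointCount(0, flag.length())); // 2
        for (int cp : codePoints(flag))
            System.out.println(Integer.toHexString(cp)); // 1f1fa, then 1f1f8
    }
}
```
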
         >>> So, to render this string as an HTML numeric entity, we'd do
         >>> something like this:
         >>>
         >>> String str = "\uD83C\uDDFA\uD83C\uDDF8"; // this is the input
         >>>
         >>> for (int i = 0; i < str.length(); ++i) {
         >>>    // Read the full code point that starts at this char index
         >>>    int cp = str.codePointAt(i);
         >>>
         >>>    // print, not println: no line break between entities
         >>>    out.print("&#x");
         >>>    out.print(Integer.toHexString(cp));
         >>>    out.print(';');
         >>>
         >>>    // Skip any trailing "characters" that are actually a part of
         >>>    // this one
         >>>    if (1 < Character.charCount(cp))
         >>>      i += Character.charCount(cp) - 1;
         >>> }
         >>>
         >>> Using the above code is completely encoding-agnostic, because
         >>> it's describing the Unicode code point and not some set of bytes
         >>> in a particular flavor of UTF-x.
         >>>
         >>> -chris
         >>
         >>
         >> ---------------------------------------------------------------------
         >> To unsubscribe, e-mail: [email protected]
         >> For additional commands, e-mail: [email protected]
         >>
         >

