Greg,
On 3/29/22 13:41, gelo1234 wrote:
Have you also tried HTMLT or XHTMLT Serializers?
Default HTMLSerializer cannot handle some unicode characters:
https://issues.apache.org/jira/browse/SLING-5973?attachmentOrder=asc
Hmm. Are the HTMLT / XHTMLT serializers built-in? I have disabled all
blocks during the build, so I'm just using Cocoon core.
Thanks,
-chris
On Tue, 29 Mar 2022 at 19:37, gelo1234 <gelo1...@gmail.com> wrote:
Hello Chris,
I think you will not get any icon-type character on output without
using proper font rendering - like Emoji support? Emoji might not be
supported by default in Cocoon.
So this might be the reason why you get HTML entities instead of
Emoji-icons.
Also notice:
https://www.mail-archive.com/dev@cocoon.apache.org/msg61629.html
Greetings,
Greg
On Tue, 29 Mar 2022 at 18:36, Christopher Schultz
<ch...@christopherschultz.net> wrote:
Cédric,
On 3/29/22 12:06, Cédric Damioli wrote:
> Could you provide more details ?
> How is your XML processed before outputting the wrong UTF-8 sequence ?
It's somewhat straightforward:
<map:match pattern="/foo">
<map:generate src="https://source/" />
<map:transform src="stuff-to-cincludes.xsl" />
<map:transform src="other-stuff-to-cincludes.xsl" />
<map:transform type="cinclude" />
<map:transform src="my-big-transformer-to-xhtml.xsl" />
<map:transform type="cinclude" /><!-- Yes, another one -->
<map:transform type="i18n" />
<map:transform src="strip-namespaces.xsl" /><!-- This is mine, not Cocoon's -->
<map:serialize type="xhtml" />
</map:match>
The xhtml serializer is the default, with encoding set to UTF-8. The
HTTP response has "Content-Type: text/html" and the document itself
contains:
<?xml version="1.0" encoding="UTF-8"?>
and
<meta content="text/html; charset=utf-8" http-equiv="content-type" />
So I think everything is configured correctly; it's just that those
characters are getting mangled by something. I can try to cut out some
of those steps and see where it's happening.
I seem to remember being able to give each pipeline step a "marker" or
something where you can say "stop after step 3" or whatever instead of
having to chop out configuration. Can you remind me what that is again?
Thanks,
-chris
> On 29/03/2022 at 17:48, Christopher Schultz wrote:
>> All,
>>
>> I'm still struggling with this. I have upgraded to 2.1.13 which
>> includes the fix for
>> https://issues.apache.org/jira/browse/COCOON-2352
>> but I'm still getting that American flag converted into those 4 HTML
>> entities:
>>
>> &#55356;&#56826;&#55356;&#56824;
>>
>> I would expect there to be a single (multibyte) character in the
>> output with no HTML entities.
>>
>> I've double-checked, and the source XML contains the flag as a single
>> multi-byte character, served as UTF-8.
>>
>> Any ideas for how to get this working? I'm sure I could put together a
>> trivial test-case.
>>
>> Thanks,
>> -chris
>>
>> On 10/30/18 12:18, Christopher Schultz wrote:
>>> All,
>>>
>>> Some additional information at the end.
>>>
>>> On 10/30/18 11:58, Christopher Schultz wrote:
>>>> All,
>>>
>>>> I'm attempting to do everything with UTF-8 in Cocoon 2.1.11. I have
>>>> a servlet generating XML in UTF-8 encoding and I have a pipeline
>>>> with a few transforms in it, ultimately serializing to XHTML.
>>>
>>>> If I have a Unicode character in the XML which is outside of the
>>>> BMP, such as this one: 🇺🇸 (that's an American flag, in case your
>>>> mail reader doesn't render it correctly), then I end up getting a
>>>> series of bytes coming from Cocoon after the transform that look
>>>> like UTF-16.
>>>
>>>> Here's what's in the XML:
>>>
>>>> <first-name>Test🇺🇸</first-name>
>>>
>>>> Just like that. The bytes in the message for the flag character
>>>> are:
>>>
>>>> f0 9f 87 ba f0 9f 87 b8
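As a cross-check of the byte sequence quoted above, here is a minimal Java sketch (not from the original thread; the class name Utf8Bytes and the method utf8Hex are made up for illustration) that encodes the two regional indicator code points U+1F1FA and U+1F1F8 as UTF-8 and prints the bytes in hex. It assumes nothing about Cocoon.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Hypothetical helper: returns the UTF-8 bytes of s as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            // Mask to 0..255 so negative bytes format as two hex digits.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the flag from its two code points instead of pasting the emoji.
        String flag = new String(Character.toChars(0x1F1FA))
                    + new String(Character.toChars(0x1F1F8));
        System.out.println(utf8Hex(flag)); // f0 9f 87 ba f0 9f 87 b8
    }
}
```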
>>>
>>>> When rendering that into XHTML, I'm getting this in the output:
>>>
>>>> Test&#55356;&#56826;&#55356;&#56824;
>>>
>>>> The American flag in Unicode reference can be found here:
>>>> https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%87%BA%F0%9F%87%B8
>>>
>>>> You can see it broken down a bit better here for "Regional U":
>>>> http://www.fileformat.info/info/unicode/char/1f1fa/index.htm
>>>
>>>> and "Regional S":
>>>> http://www.fileformat.info/info/unicode/char/1f1f8/index.htm
>>>
>>>> What's happening is that some component in Cocoon has decided to
>>>> generate HTML entities instead of just emitting the character.
>>>> That's okay IMO. But what it does doesn't make sense for a UTF-8
>>>> output encoding.
>>>
>>>> The first two entities "&#55356;&#56826;" are the decimal numbers
>>>> that represent the UTF-16 code units (a surrogate pair) for that
>>>> "Regional Indicator Symbol Letter U" and they are correct... for
>>>> UTF-16. If I change the output encoding from UTF-8 to UTF-16, then
>>>> the browser will render these correctly. Using UTF-8, they show as
>>>> four of those ugly [?] characters on the screen.
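To make the arithmetic behind those numbers concrete, here is a minimal Java sketch (not from the original thread; SurrogateEntities and its method names are hypothetical) that contrasts the two ways of writing U+1F1FA as a decimal numeric entity: one entity per UTF-16 code unit, which matches the output described above, and one entity for the whole code point, which is what an HTML parser expects. It does not claim to reproduce Cocoon's serializer code.

```java
class SurrogateEntities {
    // One decimal entity per UTF-16 code unit (the problematic form).
    static String perCodeUnit(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append("&#").append((int) unit).append(';');
        }
        return sb.toString();
    }

    // One decimal entity for the whole code point (the valid form).
    static String perCodePoint(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        int regionalU = 0x1F1FA; // REGIONAL INDICATOR SYMBOL LETTER U
        System.out.println(perCodeUnit(regionalU));  // &#55356;&#56826;
        System.out.println(perCodePoint(regionalU)); // &#127482;
    }
}
```

The surrogate values follow from the UTF-16 rule: subtract 0x10000, then the high surrogate is 0xD800 plus the top 10 bits (0xD83C = 55356) and the low surrogate is 0xDC00 plus the bottom 10 bits (0xDDFA = 56826).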
>>>
>>>> I had originally just decided to throw up my hands and use UTF-16
>>>> encoding even though it's dumb. But it seems that MSIE cannot be
>>>> convinced to use UTF-16 no matter what, and I must continue to
>>>> support MSIE. :(
>>>
>>>> So it's back to UTF-8 for me.
>>>
>>>> How can I get Cocoon to output that character (or "those
>>>> characters") correctly?
>>>
>>>> It needs to be one of the following:
>>>
>>>> &#127482;&#127480; (HTML decimal entities)
>>>> &#x1f1fa;&#x1f1f8; (HTML hex entities)
>>>> f0 9f 87 ba f0 9f 87 b8 (raw UTF-8 bytes)
>>>
>>>> Does anyone know how/where this conversion is being performed in
>>>> Cocoon? Probably in an XHTML serializer (I'm using
>>>> org.apache.cocoon.serialization.XMLSerializer). I'm using
>>>> mime-type "text/html" and <encoding>UTF-8</encoding> in my sitemap
>>>> for that serializer (the one named "xhtml"). I believe I've made
>>>> very few changes from the default, if any.
>>>
>>>> I haven't yet figured out how to get from what Java sees (\uE50C
>>>> for the "S" for example) to &#x1f1f8;, but knowing where the code
>>>> is that is making that decision would be very helpful.
>>>
>>>> Any ideas?
>>>
>>>> -chris
>>>
>>> I created a text file (UTF-8) containing only the flag and read it in
>>> using Java and printed all of the code points. There should be 2
>>> "characters" in the file. It's 4 bytes per UTF-8 character so I
>>> assumed I'd end up with 2 'char' primitives in the file, but I ended
>>> up with more.
>>>
>>> Here's the loop and the output:
>>>
>>> // FileReader decodes using the default charset, which must be UTF-8 here
>>> try(java.io.FileReader in = new java.io.FileReader("file.txt"))
>>> {
>>>     char[] chars = new char[10];
>>>
>>>     int count = in.read(chars);
>>>
>>>     // Visits every char index, including low-surrogate positions
>>>     for(int i=0; i<count; ++i)
>>>         System.out.println("Code point at " + i + " is " +
>>>             Integer.toHexString(Character.codePointAt(chars, i)));
>>>
>>> } catch (Exception e) {
>>>     e.printStackTrace();
>>> }
>>>
>>> == output ==
>>>
>>> Code point at 0 is 1f1fa
>>> Code point at 1 is ddfa
>>> Code point at 2 is 1f1f8
>>> Code point at 3 is ddf8
>>> Code point at 4 is a
>>>
>>> So Java thinks there are 4 things there, not 2. That could be a part
>>> of the confusion. The code points shown for indexes 0 and 2 are the
>>> "correct" ones. Those at indexes 1 and 3 should actually be *skipped*.
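The 4-versus-2 mismatch is the difference between UTF-16 code units (Java's char) and Unicode code points. A minimal sketch, assuming Java 8 or later for String.codePoints() (the class and method names are hypothetical, not from the original thread), that walks the string by code point so the low surrogates are never visited on their own:

```java
class CodePointWalk {
    // Returns the code points of s; surrogate pairs are combined.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // The flag written as its four UTF-16 code units.
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";

        System.out.println(flag.length());                         // 4
        System.out.println(flag.codePointCount(0, flag.length())); // 2

        for (int cp : codePointsOf(flag)) {
            System.out.println(Integer.toHexString(cp)); // 1f1fa, then 1f1f8
        }
    }
}
```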
>>>
>>> So, to render this string as an HTML numeric entity, we'd do something
>>> like this:
>>>
>>> String str = ...; // this is the input
>>>
>>> for(int i=0; i<str.length(); ++i) {
>>>     int cp = str.codePointAt(i);
>>>
>>>     out.print("&#x");
>>>     out.print(Integer.toHexString(cp));
>>>     out.print(';');
>>>
>>>     // Skip any trailing "characters" that are actually a part of this one
>>>     if(1 < Character.charCount(cp))
>>>         i += Character.charCount(cp) - 1;
>>> }
>>>
>>> Using the above code is completely encoding-agnostic, because it's
>>> describing the Unicode code point and not some set of bytes in a
>>> particular flavor of UTF-x.
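For anyone wanting to try the idea in isolation, here is a self-contained variant of that loop as a minimal sketch (the class EntityEncoder and method escapeNonAscii are hypothetical names, and this is not Cocoon's serializer code). As an assumption for illustration, it passes ASCII through unchanged and writes every other code point as a hex numeric entity.

```java
class EntityEncoder {
    // Replaces each non-ASCII code point with a hex numeric entity.
    // Note: this does not escape markup characters such as & or <.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            // Advance by 1 or 2 chars so surrogate pairs are consumed whole.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "Test\uD83C\uDDFA\uD83C\uDDF8";
        System.out.println(escapeNonAscii(name)); // Test&#x1f1fa;&#x1f1f8;
    }
}
```

Because the entity names the code point itself, the same output is valid whether the document is served as UTF-8, UTF-16, or plain ASCII.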
>>>
>>> -chris
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
>> For additional commands, e-mail: users-h...@cocoon.apache.org
>>
>