Re: [2.1] Overzealous escaping of high Unicode code points

2017-06-21 Thread gelo1234
Hi Chris,

I suppose you cannot use two different encodings in one Serializer, so if you
changed your Serializer config to UTF-16, you also have to use _external_
UTF-16 encoded CSS styles. Of course, you can define many different Serializer
configs, one per pipeline.

By default commons-lang/Cocoon uses 2-byte (16-bit) chars as its encoding base.
A code point above U+FFFF needs 32 bits in either encoding: in UTF-8 it is
4 code units (each 8 bits), and in UTF-16 it is 2 code units (each 16 bits),
i.e. one surrogate pair. An escaper that works on one 2-byte char at a time
therefore sees that pair as two separate characters.
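
The sizes discussed above can be checked with a few lines of plain Java. This is only a sketch for illustration: the class name EncodingSizes and its helper methods are made up here and are not part of Cocoon or commons-lang.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Cocoon code.
final class EncodingSizes {
    // U+1F60D, the emoji from this thread, built from its code point.
    static final String EMOJI = new String(Character.toChars(0x1F60D));

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16. UTF_16BE is used because plain UTF_16
    // prepends a 2-byte byte-order mark when encoding.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println("code points:  " + EMOJI.codePointCount(0, EMOJI.length())); // 1
        System.out.println("Java chars:   " + EMOJI.length());                          // 2
        System.out.println("UTF-8 bytes:  " + utf8Bytes(EMOJI));                        // 4
        System.out.println("UTF-16 bytes: " + utf16Bytes(EMOJI));                       // 4
    }
}
```

Either encoding spends 4 bytes on this code point; what differs is whether the code handling it looks at bytes, 16-bit chars, or whole code points.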

Greetings,
Greg


2017-06-20 22:14 GMT+02:00 Christopher Schultz :

>
> Greg,
>
> On 6/20/17 4:11 PM, Christopher Schultz wrote:
> > Greg,
> >
> > On 6/8/17 2:17 PM, gelo1234 wrote:
> >> Chris,
> >
> >> Even with C3 (Cocoon 3.0 beta), unless you specify the optional
> >> encoding in your Serializer config, you fall back to the default
> >> UTF-8:
> >
> >> org.apache.cocoon.optional.servlet.components.sax.serializers.util
> >
> >> public class ConfigurationUtils {
> >>
> >>     private ConfigurationUtils() {
> >>     }
> >>
> >>     public static String getEncoding(Map configuration) {
> >>         String encoding = (String) configuration.get("encoding");
> >>
> >>         if (encoding == null || "".equals(encoding)) {
> >>             encoding = "UTF-8";
> >>         }
> >>
> >>         return encoding;
> >>     }
> >>     ...
> >
> > I would have expected the Unicode code point to be converted into a
> > single 4-byte UTF-8 sequence without any &-encoding at all. It looks
> > like what I got was a pair of 2-byte characters with &-encoding.
> >
> > I'll try UTF-16 but my expectation is that it's going to get
> > worse, not better.
>
> Interestingly enough, my emojis are now showing (though I don't totally
> understand why!), but it looks like my CSS isn't being loaded. That's
> a separate problem I'll have to figure out for myself.
>
> In my own application, switching from commons-lang to commons-lang3
> HTML/XML escaping allowed me to use these 4-byte emojis and UTF-8
> together. I'm surprised that Cocoon can't do the same thing. (I think
> it comes down to exactly how the character escaper makes its decisions.)
>
> Thanks,
> -chris
>
> -
> To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
> For additional commands, e-mail: users-h...@cocoon.apache.org
>
>


Re: [2.1] Overzealous escaping of high Unicode code points

2017-06-20 Thread Christopher Schultz

Greg,

On 6/20/17 4:11 PM, Christopher Schultz wrote:
> Greg,
> 
> On 6/8/17 2:17 PM, gelo1234 wrote:
>> Chris,
> 
>> Even with C3 (Cocoon 3.0 beta), unless you specify the optional
>> encoding in your Serializer config, you fall back to the default
>> UTF-8:
>
>> org.apache.cocoon.optional.servlet.components.sax.serializers.util
>
>> public class ConfigurationUtils {
>>
>>     private ConfigurationUtils() {
>>     }
>>
>>     public static String getEncoding(Map configuration) {
>>         String encoding = (String) configuration.get("encoding");
>>
>>         if (encoding == null || "".equals(encoding)) {
>>             encoding = "UTF-8";
>>         }
>>
>>         return encoding;
>>     }
>>     ...
> 
> I would have expected the Unicode code point to be converted into a
> single 4-byte UTF-8 sequence without any &-encoding at all. It looks
> like what I got was a pair of 2-byte characters with &-encoding.
> 
> I'll try UTF-16 but my expectation is that it's going to get
> worse, not better.

Interestingly enough, my emojis are now showing (though I don't totally
understand why!), but it looks like my CSS isn't being loaded. That's
a separate problem I'll have to figure out for myself.

In my own application, switching from commons-lang to commons-lang3
HTML/XML escaping allowed me to use these 4-byte emojis and UTF-8
together. I'm surprised that Cocoon can't do the same thing. (I think
it comes down to exactly how the character escaper makes its decisions.)
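
To illustrate the kind of decision the escaper has to make, here is a minimal sketch of an escaper that iterates over code points instead of chars. The class CodePointEscaper and its asciiOnly flag are hypothetical names for this example; this is not the commons-lang3 or Cocoon implementation, and whether either library works this way internally is not confirmed here.

```java
// Hypothetical escaper for illustration only.
final class CodePointEscaper {
    // Escapes XML markup characters. Characters outside ASCII are passed
    // through untouched, or, if asciiOnly is set, written as ONE numeric
    // reference per code point (never one per surrogate).
    static String escape(String s, boolean asciiOnly) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // reads a full surrogate pair if present
            i += Character.charCount(cp);  // advance by 1 or 2 chars
            switch (cp) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (asciiOnly && cp > 0x7F) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60D));
        System.out.println(escape("a<b " + emoji, false)); // emoji passes through as-is
        System.out.println(escape(emoji, true));           // prints "&#128525;"
    }
}
```

With a UTF-8 (or UTF-16) output encoding the pass-through branch is enough; the single numeric reference is only needed when the output encoding cannot represent the character.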

Thanks,
-chris




Re: [2.1] Overzealous escaping of high Unicode code points

2017-06-20 Thread Christopher Schultz

Greg,

On 6/8/17 2:17 PM, gelo1234 wrote:
> Chris,
> 
> Even with C3 (Cocoon 3.0 beta), unless you specify the optional encoding
> in your Serializer config, you fall back to the default UTF-8:
>
> org.apache.cocoon.optional.servlet.components.sax.serializers.util
>
> public class ConfigurationUtils {
>
>     private ConfigurationUtils() {
>     }
>
>     public static String getEncoding(Map configuration) {
>         String encoding = (String) configuration.get("encoding");
>
>         if (encoding == null || "".equals(encoding)) {
>             encoding = "UTF-8";
>         }
>
>         return encoding;
>     }
>     ...

I would have expected the Unicode code point to be converted into a
single 4-byte UTF-8 sequence without any &-encoding at all. It looks like
what I got was a pair of 2-byte characters with &-encoding.

I'll try UTF-16 but my expectation is that it's going to get worse,
not better.

Thanks,
-chris




Re: [2.1] Overzealous escaping of high Unicode code points

2017-06-08 Thread gelo1234
Chris,

Even with C3 (Cocoon 3.0 beta), unless you specify the optional encoding in your
Serializer config, you fall back to the default UTF-8:

org.apache.cocoon.optional.servlet.components.sax.serializers.util

public class ConfigurationUtils {

    private ConfigurationUtils() {
    }

    public static String getEncoding(Map configuration) {
        String encoding = (String) configuration.get("encoding");

        if (encoding == null || "".equals(encoding)) {
            encoding = "UTF-8";
        }

        return encoding;
    }
    ...
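
The fallback behaviour of the method above can be exercised on its own. The following is a standalone sketch: EncodingFallbackDemo is a made-up class that copies the quoted logic, and the Map<String, Object> parameter type is an assumption, since the generic type is not visible in the listing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone copy of the quoted fallback logic.
final class EncodingFallbackDemo {
    // Returns the configured encoding, or "UTF-8" if it is missing or empty.
    // The Map<String, Object> type is assumed for this sketch.
    static String getEncoding(Map<String, Object> configuration) {
        String encoding = (String) configuration.get("encoding");
        if (encoding == null || "".equals(encoding)) {
            encoding = "UTF-8";
        }
        return encoding;
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        System.out.println(getEncoding(config)); // no entry: prints "UTF-8"

        config.put("encoding", "");
        System.out.println(getEncoding(config)); // empty entry: prints "UTF-8"

        config.put("encoding", "UTF-16");
        System.out.println(getEncoding(config)); // explicit entry: prints "UTF-16"
    }
}
```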

Greetings,
Greg


2017-06-08 20:11 GMT+02:00 gelo1234 :

>
> It depends on what type of Serializer you use and what kind of Serializer
> config you put into your sitemap.
>
> By default, XMLSerializer/HTMLSerializer uses UTF-8 encoding. So instead of
> 1 UTF-16 char you got 2 chars UTF-8 encoded.
> Of course, there might also be an issue with the emoji charset, but I would
> first try changing the encoding in the Serializer config (to UTF-16).
>
> Greetings,
> -Greg
>
> 2017-06-07 10:43 GMT+02:00 Flynn, Peter :
>
>> I had a related problem with 3–4 CJK characters being converted to their
>> 

Re: [2.1] Overzealous escaping of high Unicode code points

2017-06-08 Thread gelo1234
It depends on what type of Serializer you use and what kind of Serializer
config you put into your sitemap.

By default, XMLSerializer/HTMLSerializer uses UTF-8 encoding. So instead of
1 UTF-16 char you got 2 chars UTF-8 encoded.
Of course, there might also be an issue with the emoji charset, but I would
first try changing the encoding in the Serializer config (to UTF-16).

Greetings,
-Greg

2017-06-07 10:43 GMT+02:00 Flynn, Peter :

> I had a related problem with 3–4 CJK characters being converted to their
> 

Re: [2.1] Overzealous escaping of high Unicode code points

2017-06-07 Thread Flynn, Peter
I had a related problem with 3–4 CJK characters being converted to their 

[2.1] Overzealous escaping of high Unicode code points

2017-06-06 Thread Christopher Schultz

All,

I've been testing my application for use with high Unicode code points
such as emoji like 😍 (U+1F60D), which is this one:
http://www.fileformat.info/info/unicode/char/1F60D/index.htm

My application and database can handle this code point, but Cocoon
butchers it in a way that I have seen before -- the same way that
commons-lang's StringEscapeUtils.escapeXml/escapeHtml seems to.

Instead of letting the character through as-is, it tries to convert it
into these two numbered entities:



Oddly enough, those are the two 16-bit UTF-16 code units (the surrogate
pair) you'd get, but I don't think they should be split up like that.
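
The splitting described above is what one would expect from an escaper that handles each Java char separately. The sketch below reproduces that effect; SurrogateSplit and escapePerChar are hypothetical names, and this is not Cocoon's or commons-lang's actual code, only an assumption about the kind of logic that produces such output.

```java
// Hypothetical demo of per-char escaping, not Cocoon code.
final class SurrogateSplit {
    // Writes every char above 0x7F as a decimal numeric reference.
    // Because it works on chars (UTF-16 code units), a supplementary
    // code point comes out as two references, one per surrogate.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60D));
        // High surrogate 0xD83D = 55357, low surrogate 0xDE0D = 56845.
        System.out.println(escapePerChar(emoji)); // prints "&#55357;&#56845;"
    }
}
```

References to lone surrogates are not valid character references in XML, which is why the pair has to be treated as one code point.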

I haven't found a version of commons-lang 2.x that doesn't break these
kinds of characters. commons-lang3 does the right thing, but the two
libraries are incompatible.

Does anyone know the code well enough to know how difficult it would
be to change the way Cocoon 2.1 escapes its output? For example, by
using commons-lang3?

I haven't tried Cocoon 2.2, yet, and I can't tell what dependencies it
has. I also can't exactly tell what to do now that I've downloaded the
binary package. Can this just be used as a drop-in replacement for
Cocoon 2.1.x? Cocoon 2.1.x could build a WAR file that I then
customized for my own application, adding various libraries and
configuration files to it. I think I'll follow up with a separate post
about this.

-chris
