Re: [2.1] Overzealous escaping of high Unicode code points
Hi Chris,

I suppose you cannot use two different encodings in one Serializer, so if you changed your Serializer config to UTF-16, your _external_ CSS styles also have to be UTF-16 encoded. Of course, you can define a different Serializer config for each pipeline.

By default commons-lang/Cocoon uses a 2-byte char sequence as its encoding base. With UTF-8, a 32-bit character is 4 units of 8 bits each, and it gets encoded as one PAIR of 2-byte sequences. If you switch to UTF-16, the same character is 2 units of 16 bits each, encoded as one SINGLE 4-byte sequence.

Greetings,
Greg

2017-06-20 22:14 GMT+02:00 Christopher Schultz:
> Interestingly enough, my emojis are now showing (which I don't totally
> understand why!) but it looks like my CSS aren't being loaded. That's
> a separate problem I'll have to figure out for myself.
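As a quick check on the sizes discussed in this message, here is a minimal Java sketch that counts the units for U+1F60D in each encoding. The class and method names are made up for illustration, and nothing here calls Cocoon or commons-lang; it only uses the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Cocoon or commons-lang.
final class CodeUnitDemo {

    // Builds a String from a single code point (may need two Java chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Number of 16-bit Java chars (UTF-16 code units) in the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String emoji = fromCodePoint(0x1F60D);
        System.out.println("code points:  " + emoji.codePointCount(0, emoji.length()));
        System.out.println("UTF-16 units: " + utf16Units(emoji));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(emoji));
        System.out.println("UTF-16 bytes: " + utf16Bytes(emoji));
    }
}
```

For this code point the byte count comes out the same in both encodings; what differs is the unit size (four 8-bit units versus two 16-bit units), which matters to any escaper that walks the string one Java char at a time.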
Re: [2.1] Overzealous escaping of high Unicode code points
Greg,

On 6/20/17 4:11 PM, Christopher Schultz wrote:
> I'll try UTF-16 but my expectation is that it's going to get
> worse, not better.

Interestingly enough, my emojis are now showing (I don't totally understand why!), but it looks like my CSS isn't being loaded. That's a separate problem I'll have to figure out for myself.

In my own application, switching from commons-lang to commons-lang3 HTML/XML escaping allowed me to use these 4-byte emojis and UTF-8 together. I'm surprised that Cocoon can't do the same thing. (I think it comes down to exactly how the character-escaper makes its decisions.)

Thanks,
-chris

To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
For additional commands, e-mail: users-h...@cocoon.apache.org
Re: [2.1] Overzealous escaping of high Unicode code points
Greg,

On 6/8/17 2:17 PM, gelo1234 wrote:
> Even with C3 (cocoon 3.0 beta) unless you specify optional encoding
> in your Serializer config, you fallback to default UTF-8:
> [...]

I would have expected the Unicode code point to be converted into a single 4-byte UTF-8 sequence without any &-encoding at all. It looks like what I got was a pair of 2-byte characters with &-encoding.

I'll try UTF-16 but my expectation is that it's going to get worse, not better.

Thanks,
-chris
Re: [2.1] Overzealous escaping of high Unicode code points
Chris,

Even with C3 (Cocoon 3.0 beta), unless you specify the optional encoding in your Serializer config, you fall back to the default UTF-8:

    org.apache.cocoon.optional.servlet.components.sax.serializers.util

    public class ConfigurationUtils {

        private ConfigurationUtils() {
        }

        public static String getEncoding(Map configuration) {
            String encoding = (String) configuration.get("encoding");

            // No encoding configured: fall back to UTF-8.
            if (encoding == null || "".equals(encoding)) {
                encoding = "UTF-8";
            }

            return encoding;
        }
        ...

Greetings,
Greg

2017-06-08 20:11 GMT+02:00 gelo1234:
> Of course there might also be an issue with the emoji charset, but I would
> first try to change the encoding in the Serializer config (to be UTF-16).
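To make the fallback concrete, here is a small self-contained sketch that copies the lookup logic quoted in this message into a hypothetical class and calls it with and without an "encoding" entry. The class name and the `Map<String, ?>` parameter type are assumptions made so the example compiles on its own; it does not use the real Cocoon class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the quoted ConfigurationUtils logic.
final class EncodingFallbackDemo {

    // Returns the configured encoding, or "UTF-8" when none (or an empty one) is set.
    static String getEncoding(Map<String, ?> configuration) {
        String encoding = (String) configuration.get("encoding");
        if (encoding == null || "".equals(encoding)) {
            encoding = "UTF-8";
        }
        return encoding;
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        System.out.println(getEncoding(config)); // no entry: falls back to UTF-8

        config.put("encoding", "UTF-16");
        System.out.println(getEncoding(config)); // explicit entry wins
    }
}
```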
Re: [2.1] Overzealous escaping of high Unicode code points
It depends on which Serializer you use and what Serializer config you put into your sitemap.

By default XMLSerializer/HTMLSerializer uses UTF-8 encoding. So instead of 1 UTF-16 char you got 2 chars UTF-8 encoded.

Of course there might also be an issue with the emoji charset, but I would first try to change the encoding in the Serializer config (to be UTF-16).

Greetings,
-Greg

2017-06-07 10:43 GMT+02:00 Flynn, Peter:
> I had a related problem with 3–4 CJK characters being converted to their
Re: [2.1] Overzealous escaping of high Unicode code points
I had a related problem with 3–4 CJK characters being converted to their
[2.1] Overzealous escaping of high Unicode code points
All,

I've been testing my application for use with high Unicode code points such as the emoji U+1F60D, which is this one:

http://www.fileformat.info/info/unicode/char/1F60D/index.htm

My application and database can handle this code point, but Cocoon butchers it in a way that I have seen before -- the way that commons-lang's StringEscapeUtils.escapeXml/escapeHtml seems to do. Instead of letting the character through as-is, it tries to convert it into two numbered entities.

Oddly enough, those are the two double-byte UTF-16 characters you'd get, but they shouldn't be split up like that, I don't think.

I haven't found a version of commons-lang 2.x that doesn't break these kinds of characters. commons-lang3 does the right thing, but they are incompatible libraries.

Does anyone know the code well enough to know how difficult it would be to change the way Cocoon 2.1 escapes its output? For example, by using commons-lang3?

I haven't tried Cocoon 2.2 yet, and I can't tell what dependencies it has. I also can't exactly tell what to do now that I've downloaded the binary package. Can this just be used as a drop-in replacement for Cocoon 2.1.x? Cocoon 2.1.x could build a WAR file that I then customized for my own application, adding various libraries and configuration files to it. I think I'll follow up with a separate post about this.

-chris
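The splitting described in this message can be reproduced with a minimal Java sketch. It assumes the escaper walks the string one 16-bit char at a time, which is a guess at the cause, not something confirmed from the commons-lang 2.x or Cocoon source. Both methods and the class name are hypothetical, they only handle non-ASCII characters (no `<`, `>` or `&` handling), and the decimal entity form is an assumption.

```java
// Hypothetical escaper sketch; not the commons-lang or Cocoon implementation.
final class SurrogateEscapeDemo {

    // Escapes each non-ASCII UTF-16 code unit on its own, so a surrogate
    // pair turns into two numeric references.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes per Unicode code point, so a surrogate pair turns into a
    // single numeric reference.
    static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60D));
        System.out.println(escapePerChar(emoji));      // &#55357;&#56845;
        System.out.println(escapePerCodePoint(emoji)); // &#128525;
    }
}
```

The first output is the UTF-16 surrogate pair (0xD83D, 0xDE0D) written as two references, which is not a valid way to reference U+1F60D in XML or HTML. A serializer that writes UTF-8 could also skip the reference entirely and emit the character's four bytes directly.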