[Lift] Re: xml parser, utf-8, special characters... kill me now

Charles F. Munat Sun, 15 Mar 2009 12:59:11 -0700

That was my thinking. It doesn't explain why &ccedil; in gets changed to 
&amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there 
are two separate issues here.


The ç can be represented in two different ways in Unicode. One is the single 
"c with a cedilla" character. The second is a c character followed by a 
combining cedilla character. I am not sure how UTF-8 indicates that these two 
characters should be displayed as one. Neither am I sure that this has 
anything to do with the problem. Maybe it is simply that something is 
assuming Latin1 input even though the input is UTF-8.
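
To make the two representations concrete, here is a minimal sketch using the JDK's java.text.Normalizer. The object and method names (CedillaForms, utf8Hex, nfc) are made up for illustration. It suggests that neither form accounts for the bytes seen in this thread, since neither encodes to c3 83 c2 a7.

```scala
import java.text.Normalizer

object CedillaForms {
  // Precomposed form: LATIN SMALL LETTER C WITH CEDILLA (U+00E7)
  val precomposed: String = "\u00E7"
  // Decomposed form: 'c' followed by COMBINING CEDILLA (U+0327)
  val decomposed: String = "c\u0327"

  // Hex dump of the UTF-8 encoding of a string
  def utf8Hex(s: String): String =
    s.getBytes("UTF-8").map(b => f"${b & 0xff}%02x").mkString(" ")

  // Normalize to the composed form (NFC) so both inputs compare equal
  def nfc(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFC)
}
```

Here utf8Hex(precomposed) gives "c3 a7" and utf8Hex(decomposed) gives "63 cc a7". Normalizing input to NFC before storing it is one way to keep the two forms from diverging, but it is a separate concern from the Latin1 mix-up.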

It is definitely on the front end, because it is stored in the database 
as Ã§.
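
The bytes Derek logged (c3 83 c2 a7) are exactly what results when the UTF-8 bytes for ç are decoded as Latin1 and the result is then encoded as UTF-8 again. A minimal sketch that reproduces this with plain JDK calls follows; MojibakeDemo and its method names are hypothetical, and which component in the request path does the wrong decoding is not established here.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

object MojibakeDemo {
  // Hex dump of a byte array
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Simulate the bug: take the UTF-8 bytes and read them as Latin1
  def misdecode(s: String): String =
    new String(s.getBytes(UTF_8), ISO_8859_1)

  // Reverse the damage; only valid if the wrong charset really was Latin1
  def repair(s: String): String =
    new String(s.getBytes(ISO_8859_1), UTF_8)

  // URL-decode a form value with an explicit charset name
  def decodeParam(raw: String, charset: String): String =
    URLDecoder.decode(raw, charset)
}
```

With this, decodeParam("%C3%A7", "UTF-8") yields ç, while decodeParam("%C3%A7", "ISO-8859-1") yields Ã§, whose UTF-8 bytes are c3 83 c2 a7. One possible culprit, offered as an assumption only, is a servlet container falling back to ISO-8859-1 for request parameters when no request character encoding has been set.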

When I use &ccedil; instead, the problem is that it is *not* converted 
to ç as it goes into the database, and then on the way out the XML 
interpreter does not recognize it as a character entity reference and so 
converts the & to &amp;.
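
One possible workaround for the entity problem is to turn named entity references into real characters before the value is stored, so the XML serializer never sees a bare ampersand to escape. The sketch below is an assumption about how that could look, not anything Lift provides: EntityDecode, its deliberately tiny entity table, and decode are all hypothetical.

```scala
import java.util.regex.Matcher

object EntityDecode {
  // Hypothetical, deliberately small table; a real one would cover every HTML entity
  val named: Map[String, String] = Map(
    "ccedil" -> "\u00E7", // c with cedilla
    "ntilde" -> "\u00F1", // n with tilde
    "eacute" -> "\u00E9", // e acute
    "egrave" -> "\u00E8", // e grave
    "reg"    -> "\u00AE", // registered trademark
    "trade"  -> "\u2122"  // trademark
  )

  // Matches references such as &ccedil; and captures the entity name
  private val Ref = "&([A-Za-z][A-Za-z0-9]*);".r

  // Replace known references with their character; leave unknown ones untouched
  def decode(s: String): String =
    Ref.replaceAllIn(s, m =>
      Matcher.quoteReplacement(named.getOrElse(m.group(1), m.matched)))
}
```

For example, decode("cacha&ccedil;a") returns "cachaça", and references that are not in the table (including &amp;) pass through unchanged.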

Chas.

Marc Boschma wrote:
> Now that I have some breakfast in me: to be clear, it appears that the UTF-8 byte 
> stream is being interpreted as Latin1 and then converted to Unicode...
> 
> Marc
> On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
> 
>> excuse the typo:
>> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
>>
>>> Just looking at http://jeppesn.dk/utf-8.html , I found the following 
>>> lines:
>>> Character   Latin1 code   Unicode code   UTF-8   Latin1 interpr.
>>> ç           E7            00 E7          C3 A7   Ã§
>>> Ã is C38C, § is C2 A7
>> Ã is C383
>>> So it appears that somewhere there is a translation to Latin 1 going on.
>>> Hopefully that helps somewhat...
>>> Regards,
>>> Marc
>>>
>>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
>>>
>>>> This is really interesting. I've narrowed it down to something on 
>>>> form submission. The database shows gibberish, too, and if I 
>>>> manually enter the correct value in the DB it works fine on display. 
>>>> If I print the UTF-8 byte values of the string I get from the 
>>>> browser for my description when I submit a cedilla (ç), I see:
>>>>
>>>> INFO - Submitted desc bytes = c3 83 c2 a7
>>>>
>>>> A c with cedilla (ç) is c3 a7 in UTF-8, so I'm not sure where the "83 c2" is 
>>>> coming from. I googled around a bit and I found other people having 
>>>> the same issue but it wasn't clear in those posts what the cause 
>>>> was. I did a packet capture just as a sanity check, and here's what 
>>>> I got:
>>>>
>>>> POST / HTTP/1.1
>>>> ... headers here ...
>>>>  
>>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
>>>>
>>>> As you can see, the (url encoded) value of the F956759623048IZR 
>>>> field (description) is %C3%A7, so something isn't properly 
>>>> converting that. Helpers.urlDecode seems to be working properly:
>>>>
>>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")   
>>>> res1: java.lang.String = F956759623048IZR=ç
>>>>
>>>> So I have no idea where this is coming from. All I know is that 
>>>> between the actual POST and when my submit function is called, 
>>>> something is tweaking the string. I'm going to dig some more, but I 
>>>> wanted to post this in case it triggers any thoughts out there.
>>>>
>>>> Derek
>>>>
>>>> PS - I just found this:
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
>>>>
>>>> May be related?
>>>>
>>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker 
>>>> <dchenbec...@gmail.com> wrote:
>>>>
>>>>     OK, I can replicate this in our PocketChange app (also going
>>>>     against a PostgreSQL DB). Let me dig a bit.
>>>>
>>>>     Derek
>>>>
>>>>
>>>>     On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
>>>>     <c...@munat.com> wrote:
>>>>
>>>>
>>>>         This might help, but I don't think I was clear. I have an online form.
>>>>         My clients enter text into it. Their text has characters like a c with a
>>>>         cedilla. That text gets saved into a PostgreSQL database (UTF-8) varchar
>>>>         field via JPA/Hibernate.
>>>>
>>>>         Then I pull it back out and dump it into a template, and it comes out
>>>>         gibberish. If I try using &ccedil; instead, I get &amp;ccedil; back out.
>>>>
>>>>         Here is what I have:
>>>>
>>>>         "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
>>>>
>>>>         If I enter "cachaça" in the field, I get cachaÃ§a back out. The weird
>>>>         thing is that sometimes when I copy and paste text from another document
>>>>         into the form, it works. But if I use the keyboard, it fails every time.
>>>>
>>>>         I'll play around with this. Thanks.
>>>>
>>>>         Chas.
>>>>
>>>>         Derek Chen-Becker wrote:
>>>>         > Oops, forgot scala.xml.Unparsed, too:
>>>>         >
>>>>         > scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
>>>>         > m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>>         >
>>>>         > That one might be what you're looking for.
>>>>         >
>>>>         > Derek
>>>>         >
>>>>         > On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
>>>>         > <dchenbec...@gmail.com> wrote:
>>>>         >
>>>>         >     I think it depends on how you're embedding them in the XML:
>>>>         >
>>>>         >     scala> val m = <span>a&ccedil;b</span>
>>>>         >     m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>>         >
>>>>         >     scala> val m = <span>a{"&ccedil;"}b</span>
>>>>         >     m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
>>>>         >
>>>>         >     scala> val m = <span>a{"ç"}b</span>
>>>>         >     m: scala.xml.Elem = <span>açb</span>
>>>>         >
>>>>         >     That last one was input using dead keys (alt+,) on my Linux (USA
>>>>         >     International with dead keys) layout. Let me know if this doesn't
>>>>         >     help; if not, could you send the code/template that's having issues?
>>>>         >
>>>>         >     Derek
>>>>         >
>>>>         >
>>>>         >     On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
>>>>         >     <c...@munat.com> wrote:
>>>>         >
>>>>         >
>>>>         >         I have a site that uses a lot of "special" characters (a remarkably
>>>>         >         biased description, since there is nothing "special" about accented
>>>>         >         characters to the people who use them daily). In particular, I need the
>>>>         >         c with cedilla and the n with the tilde.
>>>>         >
>>>>         >         These characters are being input to a database (UTF-8) via an online
>>>>         >         form, then spit back out onto the page.
>>>>         >
>>>>         >         It's a fucking disaster. Apparently, everything goes through the xml
>>>>         >         parser, which is great, except when I try to enter these as entity
>>>>         >         references, such as &ccedil;, the parser changes & to &amp; and I get
>>>>         >         the literal &ccedil; back out again.
>>>>         >
>>>>         >         When I type ç using the keyboard (or copy and paste it from a page or a
>>>>         >         text editor), I get gibberish.
>>>>         >
>>>>         >         Anyone know the trick to getting around this? I need everything from e
>>>>         >         acute to e grave to trademark and registered trademark symbols, and I
>>>>         >         need to enter them this way.
>>>>         >
>>>>         >         Thanks for any help. If I can get this to work, I'll add an explanation
>>>>         >         to the wiki.
>>>>         >
>>>>         >         Chas.

