[Lift] Re: xml parser, utf-8, special characters... kill me now

Derek Chen-Becker Sun, 15 Mar 2009 07:08:52 -0700

This is really interesting. I've narrowed it down to something on form
submission. The database shows gibberish, too, and if I manually enter the
correct value in the DB it works fine on display. If I print the UTF-8 byte
values of the string I get from the browser for my description when I submit
a cedilla (ç), I see:


INFO - Submitted desc bytes = c3 83 c2 a7

A c with cedilla (ç) is c3 a7 in UTF-8, so I'm not sure where the "83 c2" is coming
from. I googled around a bit and I found other people having the same issue
but it wasn't clear in those posts what the cause was. I did a packet
capture just as a sanity check, and here's what I got:

POST / HTTP/1.1
... headers here ...

F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test

As you can see, the (url encoded) value of the F956759623048IZR field
(description) is %C3%A7, so something isn't properly converting that.
Helpers.urlDecode seems to be working properly:

scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
res1: java.lang.String = F956759623048IZR=ç
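
For comparison, the same percent-encoded value can be decoded with the JDK's
java.net.URLDecoder under two different charsets. This is only a sketch of
what a charset mismatch at the decoding step would look like. It does not show
where in Lift or the servlet container such a mismatch might happen, and the
object name DecodeCheck is made up for illustration.

```scala
import java.net.URLDecoder

object DecodeCheck {
  // The percent-encoded description value from the packet capture
  val encoded = "%C3%A7"

  // Decoding the bytes c3 a7 as UTF-8 yields the single character ç
  val asUtf8: String = URLDecoder.decode(encoded, "UTF-8")

  // Decoding the same bytes as ISO-8859-1 yields two characters: Ã and §
  val asLatin1: String = URLDecoder.decode(encoded, "ISO-8859-1")

  def main(args: Array[String]): Unit = {
    println(asUtf8 + " has length " + asUtf8.length)
    println(asLatin1 + " has length " + asLatin1.length)
  }
}
```

The ISO-8859-1 result is the same "Ã§" pair that shows up in the "cachaÃ§a"
output reported earlier in the thread.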

So I have no idea where this is coming from. All I know is that between the
actual POST and when my submit function is called, something is tweaking the
string. I'm going to dig some more, but I wanted to post this in case it
triggers any thoughts out there.
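
One observation on the logged bytes: c3 83 c2 a7 is exactly what results when
the UTF-8 bytes c3 a7 are read as ISO-8859-1 and the two resulting characters
are then written out as UTF-8 again. The sketch below reproduces that sequence
with plain JDK string conversions. The object and method names are invented
for illustration, and it says nothing about which layer between the POST and
the submit function is doing the misreading.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

object DoubleEncoding {
  // Format a string's UTF-8 bytes as space-separated hex, like the log line
  def utf8Hex(s: String): String =
    s.getBytes(UTF_8).map(b => "%02x".format(b & 0xff)).mkString(" ")

  // Misread UTF-8 bytes as ISO-8859-1 (the suspected corruption)
  def misread(s: String): String = new String(s.getBytes(UTF_8), ISO_8859_1)

  // Undo the misreading; only valid if the string was corrupted exactly once
  def repair(s: String): String = new String(s.getBytes(ISO_8859_1), UTF_8)

  def main(args: Array[String]): Unit = {
    val original = "\u00e7"            // ç
    val garbled  = misread(original)   // Ã§
    println(utf8Hex(original))         // c3 a7
    println(utf8Hex(garbled))          // c3 83 c2 a7
    println(repair(garbled) == original)
  }
}
```

The repair function is only a diagnostic aid for checking rows already stored
in the database; the real fix would be to stop the misreading at its source.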

Derek

PS - I just found this:

http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e

May be related?

On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker <dchenbec...@gmail.com> wrote:

> OK, I can replicate this in our PocketChange app (also going against a
> PostgreSQL DB). Let me dig a bit.
>
> Derek
>
>
> On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat <c...@munat.com> wrote:
>
>>
>> This might help, but I don't think I was clear. I have an online form.
>> My clients enter text into it. Their text has characters like a c with a
>> cedilla. That text gets saved into a PostgreSQL database (UTF-8) varchar
>> field via JPA/Hibernate.
>>
>> Then I pull it back out and dump it into a template, and it comes out
>> gibberish. If I try using &ccedil; instead, I get &amp;ccedil; back out.
>>
>> Here is what I have:
>>
>> "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
>>
>> If I enter "cachaça" in the field, I get cachaÃ§a back out. The weird
>> thing is that sometimes when I copy and paste text from another document
>> into the form, it works. But if I use the keyboard, it fails every time.
>>
>> I'll play around with this. Thanks.
>>
>> Chas.
>>
>> Derek Chen-Becker wrote:
>> > Oops, forgot scala.xml.Unparsed, too:
>> >
>> > scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
>> > m: scala.xml.Elem = <span>a&ccedil;b</span>
>> >
>> > That one might be what you're looking for.
>> >
>> > Derek
>> >
>> > On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
>> > <dchenbec...@gmail.com> wrote:
>> >
>> >     I think it depends on how you're embedding them in the XML:
>> >
>> >     scala> val m = <span>a&ccedil;b</span>
>> >     m: scala.xml.Elem = <span>a&ccedil;b</span>
>> >
>> >     scala> val m = <span>a{"&ccedil;"}b</span>
>> >     m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
>> >
>> >     scala> val m = <span>a{"ç"}b</span>
>> >     m: scala.xml.Elem = <span>açb</span>
>> >
>> >     That last one was input using dead keys (alt+,) on my Linux (USA
>> >     International with dead keys) layout. Let me know if this doesn't
>> >     help; if not, could you send the code/template that's having issues?
>> >
>> >     Derek
>> >
>> >
>> >     On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat <c...@munat.com>
>> >     wrote:
>> >
>> >
>> >         I have a site that uses a lot of "special" characters (a
>> >         remarkably biased description, since there is nothing "special"
>> >         about accented characters to the people who use them daily). In
>> >         particular, I need the c with cedilla and the n with the tilde.
>> >
>> >         These characters are being input to a database (UTF-8) via an
>> >         online form, then spit back out onto the page.
>> >
>> >         It's a fucking disaster. Apparently, everything goes through the
>> >         xml parser, which is great, except when I try to enter these as
>> >         entity references, such as &ccedil;, the parser changes & to
>> >         &amp; and I get the literal &ccedil; back out again.
>> >
>> >         When I type ç using the keyboard (or copy and paste it from a
>> >         page or a text editor), I get gibberish.
>> >
>> >         Anyone know the trick to getting around this? I need everything
>> >         from e acute to e grave to trademark and registered trademark
>> >         symbols, and I need to enter them this way.
>> >
>> >         Thanks for any help. If I can get this to work, I'll add an
>> >         explanation to the wiki.
>> >
>> >         Chas.

