That was my thinking. It doesn't explain why ç in gets changed to Ã§, but it explains why &ccedil; in becomes &amp;ccedil; out. So I think there are two separate issues here.
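On the first issue, here is a minimal sketch of the Latin1 theory from Marc's message quoted below: take the UTF-8 bytes of ç, decode them as Latin1, and re-encode the result as UTF-8. The object and helper names (`Mojibake`, `hex`, `misdecode`, `repair`) are made up for illustration and are not part of Lift; only standard JDK charset calls are used. It reproduces the byte sequence Derek logged, which suggests (but does not prove) that a single UTF-8-read-as-Latin1 step happens somewhere between the POST and the submit function.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Hypothetical helper object, for illustration only.
object Mojibake {
  // Render bytes as space-separated lowercase hex, like the log line in the thread.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Simulate the suspected bug: the UTF-8 bytes of s are decoded as Latin1.
  def misdecode(s: String): String =
    new String(s.getBytes(UTF_8), ISO_8859_1)

  // Undo the damage, assuming exactly one round of misdecoding happened.
  def repair(s: String): String =
    new String(s.getBytes(ISO_8859_1), UTF_8)

  def main(args: Array[String]): Unit = {
    val original = "\u00e7"            // ç
    val garbled  = misdecode(original) // the two characters Ã and §

    println(hex(original.getBytes(UTF_8))) // c3 a7
    println(hex(garbled.getBytes(UTF_8)))  // c3 83 c2 a7
    println(repair(garbled) == original)   // true
  }
}
```

The `repair` direction is only a diagnostic aid; the real fix is to make whatever reads the request parameters decode them as UTF-8. Servlet containers fall back to ISO-8859-1 for request parameters when no character encoding has been set, so that is one place to look, though the thread does not confirm it is the cause here.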
The ç can be created in two different ways in UTF-8. One is the single "c with a cedilla" character. The second is a c character followed by a cedilla character. I am not sure how UTF-8 indicates that these two characters should be displayed as one. Neither am I sure that this has anything to do with the problem. Maybe it is simply that something is assuming Latin1 input even though the input is UTF-8.

It is definitely on the front end, because it is stored in the database as Ã§.

When I use &ccedil; instead, the problem is that it is *not* converted to ç as it goes into the database, and then on the way out the XML interpreter does not recognize it as a character entity reference and so converts the & to &amp;.

Chas.

Marc Boschma wrote:
> Now I have some breakfast in me, to be clear it appears that UTF-8 byte
> stream is being interpreted as Latin1 and then converted to unicode...
>
> Marc
> On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
>
>> excuse the typo:
>> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
>>
>>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
>>> lines:
>>> Character   Latin1   Unicode   UTF-8   Latin1
>>>             code                       interpr.
>>> ç           E7       00 E7     C3 A7   Ã§
>>> Ã is C38C, § is C2 A7
>> Ã is C383
>>> So it appears that somewhere there is a translation to Latin 1 going on.
>>> Hopefully that helps some what...
>>> Regards,
>>> Marc
>>>
>>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
>>>
>>>> This is really interesting. I've narrowed it down to something on
>>>> form submission. The database shows gibberish, too, and if I
>>>> manually enter the correct value in the DB it works fine on display.
>>>> If I print the UTF-8 byte values of the string I get from the
>>>> browser for my description when I submit a cedilla (ç), I see:
>>>>
>>>> INFO - Submitted desc bytes = c3 83 c2 a7
>>>>
>>>> A cedilla is c3 a7 in UTF-8, so I'm not sure where the "83 c2" is
>>>> coming from.
>>>> I googled around a bit and I found other people having
>>>> the same issue but it wasn't clear in those posts what the cause
>>>> was. I did a packet capture just as a sanity check, and here's what
>>>> I got:
>>>>
>>>> POST / HTTP/1.1
>>>> ... headers here ...
>>>>
>>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
>>>>
>>>> As you can see, the (url encoded) value of the F956759623048IZR
>>>> field (description) is %C3%A7, so something isn't properly
>>>> converting that. Helpers.urlDecode seems to be working properly:
>>>>
>>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
>>>> res1: java.lang.String = F956759623048IZR=ç
>>>>
>>>> So I have no idea where this is coming from. All I know is that
>>>> between the actual POST and when my submit function is called,
>>>> something is tweaking the string. I'm going to dig some more, but I
>>>> wanted to post this in case it triggers any thoughts out there.
>>>>
>>>> Derek
>>>>
>>>> PS - I just found this:
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
>>>>
>>>> May be related?
>>>>
>>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
>>>> <dchenbec...@gmail.com> wrote:
>>>>
>>>> OK, I can replicate this in our PocketChange app (also going
>>>> against a PostgreSQL DB). Let me dig a bit.
>>>>
>>>> Derek
>>>>
>>>>
>>>> On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
>>>> <c...@munat.com> wrote:
>>>>
>>>>
>>>> This might help, but I don't think I was clear. I have an online form.
>>>> My clients enter text into it. Their text has characters like a c with a
>>>> cedilla. That text gets saved into a PostgreSQL database (UTF-8) varchar
>>>> field via JPA/Hibernate.
>>>>
>>>> Then I pull it back out and dump it into a template, and it comes out
>>>> gibberish.
>>>> If I try using &ccedil; instead, I get &amp;ccedil; back out.
>>>>
>>>> Here is what I have:
>>>>
>>>> "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
>>>>
>>>> If I enter "cachaça" in the field, I get cachaÃ§a back out. The weird
>>>> thing is that sometimes when I copy and paste text from another document
>>>> into the form, it works. But if I use the keyboard, it fails every time.
>>>>
>>>> I'll play around with this. Thanks.
>>>>
>>>> Chas.
>>>>
>>>> Derek Chen-Becker wrote:
>>>> > Oops, forgot scala.xml.Unparsed, too:
>>>> >
>>>> > scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
>>>> > m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>> >
>>>> > That one might be what you're looking for.
>>>> >
>>>> > Derek
>>>> >
>>>> > On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
>>>> > <dchenbec...@gmail.com> wrote:
>>>> >
>>>> > I think it depends on how you're embedding them in the XML:
>>>> >
>>>> > scala> val m = <span>a&ccedil;b</span>
>>>> > m: scala.xml.Elem = <span>a&ccedil;b</span>
>>>> >
>>>> > scala> val m = <span>a{"&ccedil;"}b</span>
>>>> > m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
>>>> >
>>>> > scala> val m = <span>a{"ç"}b</span>
>>>> > m: scala.xml.Elem = <span>açb</span>
>>>> >
>>>> > That last one was input using dead keys (alt+,) on my linux (USA
>>>> > International with dead keys) layout. Let me know if this doesn't
>>>> > help; if not, could you send the code/template that's having issues?
>>>> >
>>>> > Derek
>>>> >
>>>> >
>>>> > On Sat, Mar 14, 2009 at 6:36 PM, Charles F.
>>>> > Munat <c...@munat.com> wrote:
>>>> >
>>>> >
>>>> > I have a site that uses a lot of "special" characters (a remarkably
>>>> > biased description, since there is nothing "special" about accented
>>>> > characters to the people who use them daily). In particular, I need the
>>>> > c with cedilla and the n with the tilde.
>>>> >
>>>> > These characters are being input to a database (UTF-8) via an online
>>>> > form, then spit back out onto the page.
>>>> >
>>>> > It's a fucking disaster. Apparently, everything goes through the xml
>>>> > parser, which is great, except when I try to enter these as entity
>>>> > references, such as &ccedil;, the parser changes & to &amp; and I get
>>>> > the literal &ccedil; back out again.
>>>> >
>>>> > When I type ç using the keyboard (or copy and paste it from a page or a
>>>> > text editor), I get gibberish.
>>>> >
>>>> > Anyone know the trick to getting around this? I need everything from e
>>>> > acute to e grave to trademark and registered trademark symbols, and I
>>>> > need to enter them this way.
>>>> >
>>>> > Thanks for any help. If I can get this to work, I'll add an explanation
>>>> > to the wiki.
>>>> >
>>>> > Chas.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Lift" group.
To post to this group, send email to liftweb@googlegroups.com
To unsubscribe from this group, send email to liftweb+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/liftweb?hl=en
-~----------~----~----~----~------~----~------~--~---