My understanding was that *all* ampersands get converted in raw embedded strings:

scala> val m = <span>{ "&amp;" }</span>
m: scala.xml.Elem = <span>&amp;amp;</span>

scala> val m = <span>{ "&" }</span>
m: scala.xml.Elem = <span>&amp;</span>

If you don't "embed" the string, it doesn't do validation on the entities:

scala> val m = <span>&ccedil;</span>
m: scala.xml.Elem = <span>&ccedil;</span>

If you want to embed a string with ampersands and you don't want them escaped, use scala.xml.Unparsed:

scala> import scala.xml.Unparsed
import scala.xml.Unparsed

scala> val m = <span>{ Unparsed("&") }</span>
m: scala.xml.Elem = <span>&</span>

scala> val m = <span>{ Unparsed("&ccedil;") }</span>
m: scala.xml.Elem = <span>&ccedil;</span>

Back to the more important part of this post, I found this interesting article:

http://www.intertwingly.net/blog/2004/04/15/Character-Encoding-and-HTML-Forms

It indicates that you can use the "accept-charset" attribute on the form element itself to force a particular input encoding:

http://www.w3.org/TR/html401/interact/forms.html#h-17.3

It seems like we should be able to simply force UTF-8 via the form tag and then fix whatever is interpreting the string as Latin-1. I'm going to hack a little more here.

Derek

On Sun, Mar 15, 2009 at 2:45 PM, Marc Boschma <marc+lift...@boschma.cx> wrote:

>
> On 16/03/2009, at 6:59 AM, Charles F. Munat wrote:
>
> > That was my thinking. It doesn't explain why &ccedil; in gets changed to
> > &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there
> > are two separate issues here.
>
> I tend to agree.
>
> > The ç can be created in two different ways in UTF-8. One is the single
> > "c with a cedilla" character. The second is a c character followed by a
> > cedilla character. I am not sure how UTF-8 indicates that these two
> > characters should be displayed as one.
>
> the c with a cedilla two character sequence is encoded as 0063 0327
> which is equivalent to 00E7 (at least optically). the 0327 is seen as
> a modifier to the 'c' (0063) character.
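
For anyone who wants to check the two spellings Marc describes, here is a minimal sketch using only the JDK's java.text.Normalizer. The object and method names (CedillaForms, utf8Hex, nfc) are made up for illustration and are not part of Lift or of anything in this thread. It only shows that the two forms are different strings with different UTF-8 bytes, and that NFC normalization maps the combining form onto the single code point. Whether the browser ever sends the combining form here is an open question.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.text.Normalizer

// Hypothetical helper object, for illustration only.
object CedillaForms {
  // Single precomposed code point: U+00E7 (c with cedilla).
  val precomposed: String = "\u00e7"
  // Two code points: U+0063 (c) followed by U+0327 (combining cedilla).
  val combining: String = "c\u0327"

  // Space-separated hex dump of a string's UTF-8 bytes.
  def utf8Hex(s: String): String =
    s.getBytes(UTF_8).map(b => f"${b & 0xff}%02x").mkString(" ")

  // Canonical composition (NFC) folds c + U+0327 into U+00E7.
  def nfc(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFC)

  def main(args: Array[String]): Unit = {
    println(utf8Hex(precomposed))          // c3 a7
    println(utf8Hex(combining))            // 63 cc a7
    println(precomposed == combining)      // false
    println(nfc(combining) == precomposed) // true
  }
}
```

If the stored bytes turn out to start with 63 cc a7, normalizing on input would be worth a try; bytes starting with c3 point at the precomposed form instead.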
>
> > Neither am I sure that this has
> > anything to do with the problem. Maybe it is simply that something is
> > assuming Latin1 input even though the input is UTF-8.
> >
> > It is definitely on the front end, because it is stored in the database
> > as Ã§.
> >
> > When I use &ccedil; instead, the problem is that it is *not* converted
> > to ç as it goes into the database, and then on the way out the XML
> > interpreter does not recognize it as a character entity reference and so
> > converts the & to &amp;.
>
> I think this is due to using the standard Scala XML load functions
> rather than the lift XML parser. From memory I don't think the
> standard parser recognises that many named entities. ie. does &#231;
> work instead of &ccedil; ? If so then that is probably what is
> happening on this issue.
>
> > Chas.
> >
> > Marc Boschma wrote:
> >> Now I have some breakfast in me, to be clear it appears that UTF-8 byte
> >> stream is being interpreted as Latin1 and then converted to unicode...
> >>
> >> Marc
> >> On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
> >>
> >>> excuse the typo:
> >>> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
> >>>
> >>>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
> >>>> lines:
> >>>> Character   Latin1 code   Unicode   UTF-8   Latin1 interpr.
> >>>> ç           E7            00 E7     C3 A7   Ã§
> >>>> Ã is C38C, § is C2 A7
> >>> Ã is C383
> >>>> So it appears that somewhere there is a translation to Latin 1 going on.
> >>>> Hopefully that helps some what...
> >>>> Regards,
> >>>> Marc
> >>>>
> >>>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
> >>>>
> >>>>> This is really interesting. I've narrowed it down to something on
> >>>>> form submission. The database shows gibberish, too, and if I
> >>>>> manually enter the correct value in the DB it works fine on
> >>>>> display.
> >>>>>
> >>>>> If I print the UTF-8 byte values of the string I get from the
> >>>>> browser for my description when I submit a cedilla (ç), I see:
> >>>>>
> >>>>> INFO - Submitted desc bytes = c3 83 c2 a7
> >>>>>
> >>>>> A cedilla is c3 a7 in UTF-8, so I'm not sure where the "83 c2" is
> >>>>> coming from. I googled around a bit and I found other people having
> >>>>> the same issue but it wasn't clear in those posts what the cause
> >>>>> was. I did a packet capture just as a sanity check, and here's what
> >>>>> I got:
> >>>>>
> >>>>> POST / HTTP/1.1
> >>>>> ... headers here ...
> >>>>>
> >>>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
> >>>>>
> >>>>> As you can see, the (url encoded) value of the F956759623048IZR
> >>>>> field (description) is %C3%A7, so something isn't properly
> >>>>> converting that. Helpers.urlDecode seems to be working properly:
> >>>>>
> >>>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
> >>>>> res1: java.lang.String = F956759623048IZR=ç
> >>>>>
> >>>>> So I have no idea where this is coming from. All I know is that
> >>>>> between the actual POST and when my submit function is called,
> >>>>> something is tweaking the string. I'm going to dig some more, but I
> >>>>> wanted to post this in case it triggers any thoughts out there.
> >>>>>
> >>>>> Derek
> >>>>>
> >>>>> PS - I just found this:
> >>>>>
> >>>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
> >>>>>
> >>>>> May be related?
> >>>>>
> >>>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
> >>>>> <dchenbec...@gmail.com> wrote:
> >>>>>
> >>>>> OK, I can replicate this in our PocketChange app (also going
> >>>>> against a PostgreSQL DB). Let me dig a bit.
> >>>>>
> >>>>> Derek
> >>>>>
> >>>>>
> >>>>> On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
> >>>>> <c...@munat.com> wrote:
> >>>>>
> >>>>> This might help, but I don't think I was clear. I have an online form.
> >>>>> My clients enter text into it. Their text has characters like a c with a
> >>>>> cedilla. That text gets saved into a PostgreSQL database (UTF-8) varchar
> >>>>> field via JPA/Hibernate.
> >>>>>
> >>>>> Then I pull it back out and dump it into a template, and it comes out
> >>>>> gibberish. If I try using &ccedil; instead, I get
> >>>>> &amp;ccedil; back out.
> >>>>>
> >>>>> Here is what I have:
> >>>>>
> >>>>> "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
> >>>>>
> >>>>> If I enter "cachaça" in the field, I get cachaÃ§a back out. The weird
> >>>>> thing is that sometimes when I copy and paste text from another document
> >>>>> into the form, it works. But if I use the keyboard, it fails
> >>>>> every time.
> >>>>>
> >>>>> I'll play around with this. Thanks.
> >>>>>
> >>>>> Chas.
> >>>>>
> >>>>> Derek Chen-Becker wrote:
> >>>>>> Oops, forgot scala.xml.Unparsed, too:
> >>>>>>
> >>>>>> scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
> >>>>>> m: scala.xml.Elem = <span>a&ccedil;b</span>
> >>>>>>
> >>>>>> That one might be what you're looking for.
> >>>>>>
> >>>>>> Derek
> >>>>>>
> >>>>>> On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
> >>>>>> <dchenbec...@gmail.com> wrote:
> >>>>>>
> >>>>>> I think it depends on how you're embedding them in the XML:
> >>>>>>
> >>>>>> scala> val m = <span>a&ccedil;b</span>
> >>>>>> m: scala.xml.Elem = <span>a&ccedil;b</span>
> >>>>>>
> >>>>>> scala> val m = <span>a{"&ccedil;"}b</span>
> >>>>>> m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
> >>>>>>
> >>>>>> scala> val m = <span>a{"ç"}b</span>
> >>>>>> m: scala.xml.Elem = <span>açb</span>
> >>>>>>
> >>>>>> That last one was input using dead keys (alt+,) on my linux (USA
> >>>>>> International with dead keys) layout. Let me know if this doesn't
> >>>>>> help; if not, could you send the code/template that's having issues?
> >>>>>>
> >>>>>> Derek
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
> >>>>>> <c...@munat.com> wrote:
> >>>>>>
> >>>>>> I have a site that uses a lot of "special" characters (a remarkably
> >>>>>> biased description, since there is nothing "special" about accented
> >>>>>> characters to the people who use them daily). In particular, I need the
> >>>>>> c with cedilla and the n with the tilde.
> >>>>>>
> >>>>>> These characters are being input to a database (UTF-8) via an online
> >>>>>> form, then spit back out onto the page.
> >>>>>>
> >>>>>> It's a fucking disaster. Apparently, everything goes through the xml
> >>>>>> parser, which is great, except when I try to enter these as entity
> >>>>>> references, such as &ccedil;, the parser changes & to &amp; and I get
> >>>>>> the literal &ccedil; back out again.
> >>>>>>
> >>>>>> When I type ç using the keyboard (or copy and paste it from a
> >>>>>> page or a text editor), I get gibberish.
> >>>>>>
> >>>>>> Anyone know the trick to getting around this? I need everything
> >>>>>> from e acute to e grave to trademark and registered trademark symbols,
> >>>>>> and I need to enter them this way.
> >>>>>>
> >>>>>> Thanks for any help. If I can get this to work, I'll add an
> >>>>>> explanation to the wiki.
> >>>>>>
> >>>>>> Chas.

You received this message because you are subscribed to the Google Groups "Lift" group.
To post to this group, send email to liftweb@googlegroups.com
To unsubscribe from this group, send email to liftweb+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/liftweb?hl=en
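
PS - The "c3 83 c2 a7" bytes quoted in the thread can be reproduced outside Lift. Below is a minimal sketch, assuming the culprit is a percent-decoding step that uses ISO-8859-1 instead of UTF-8. That is a guess at the mechanism, not a confirmed diagnosis of Lift or the servlet container. It uses only JDK calls (java.net.URLDecoder and the standard charsets), and the names MojibakeSketch, utf8Hex, decodeParam and repair are invented for illustration.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Hypothetical helper object, for illustration only.
object MojibakeSketch {
  // Space-separated hex dump of a string's UTF-8 bytes.
  def utf8Hex(s: String): String =
    s.getBytes(UTF_8).map(b => f"${b & 0xff}%02x").mkString(" ")

  // Percent-decode a form value using the named charset.
  def decodeParam(raw: String, charset: String): String =
    URLDecoder.decode(raw, charset)

  // Undo one round of "UTF-8 bytes read as Latin-1". This is only
  // meaningful if that is really what happened to the string.
  def repair(garbled: String): String =
    new String(garbled.getBytes(ISO_8859_1), UTF_8)

  def main(args: Array[String]): Unit = {
    val good = decodeParam("%C3%A7", "UTF-8")      // c with cedilla
    val bad  = decodeParam("%C3%A7", "ISO-8859-1") // two Latin-1 characters
    println(utf8Hex(good))        // c3 a7
    println(utf8Hex(bad))         // c3 83 c2 a7
    println(repair(bad) == good)  // true
  }
}
```

The second line of output matches the submitted bytes exactly, which is consistent with the Latin-1 theory. The repair function is a diagnostic aid rather than a fix; the cleaner route is still to make the request get decoded as UTF-8 in the first place.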