Sorry, I'm not suggesting that this is the appropriate method for users; they should just be able to type. I was just trying to explain why the "&" is getting expanded. I think that the current behavior is not really what anyone wants, and hopefully we can fix it in a transparent manner.
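For reference, here is a minimal sketch of the other half of the problem, the gibberish on keyboard input. The byte dump further down the thread (c3 83 c2 a7) is consistent with the UTF-8 bytes for ç being decoded as Latin-1 and then re-encoded as UTF-8. The object and method names below (MojibakeDemo, hex, misdecode, repair) are made up for illustration and are not part of Lift; the sketch uses only java.nio.charset and says nothing about where in the request path the wrong decoding actually happens.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Hypothetical helper object for illustration only; not part of Lift.
object MojibakeDemo {
  // Format bytes as lowercase hex pairs separated by spaces.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Simulate the suspected bug: UTF-8 bytes wrongly decoded as Latin-1.
  def misdecode(s: String): String =
    new String(s.getBytes(UTF_8), ISO_8859_1)

  // Undo it: recover the raw bytes via Latin-1, then decode them as UTF-8.
  def repair(s: String): String =
    new String(s.getBytes(ISO_8859_1), UTF_8)

  def main(args: Array[String]): Unit = {
    val original = "\u00e7"           // c with cedilla
    val garbled  = misdecode(original) // two characters: U+00C3, U+00A7
    println(hex(original.getBytes(UTF_8))) // c3 a7
    println(hex(garbled.getBytes(UTF_8)))  // c3 83 c2 a7
    println(repair(garbled) == original)   // true
  }
}
```

The repair step only works while the garbled string still contains exactly the Latin-1 view of the original bytes, so it is a diagnostic aid rather than a fix; the real fix is to decode the request as UTF-8 in the first place.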
Derek

On Sun, Mar 15, 2009 at 2:38 PM, Charles F. Munat <c...@munat.com> wrote:
>
> Unfortunately, there is no easy way to do that with user input. But the
> use of character entity references is problematic in itself. I can't
> teach all my site's users all the references they will need, nor is it
> really reasonable to expect, for example, an international group of
> users to have to hand code every accented character.
>
> There must be a way to input UTF-8 and have it come out properly. I've
> set the keyboard on my Mac to U.S. Extended, which makes everything
> UTF-8. I note that *most* of the keyboards available for the Mac are
> UTF-8 (though the default U.S. keyboard is Roman, and there are many
> European keyboards that are Roman or Cyrillic).
>
> Ideally, Lift would recognize the character encoding and act
> appropriately. (I'd be happy to convert everything to UTF-8.) Another
> possibility, much less preferred but at least workable, would be to add
> the ability for the user to select the character encoding (they could
> use trial and error if they weren't sure).
>
> But the upshot is that someone with a keyboard set to UTF-8 (which
> includes much of the world) should be able to use that keyboard and have
> it come out the same way it went in. I have no idea how to accomplish
> this, however, as I don't know how that part of Lift works.
>
> Chas.
>
> Derek Chen-Becker wrote:
> > The scala XML syntax automatically converts any "&" in embedded strings
> > to "&amp;". You have to put the string inside a scala.xml.Unparsed node
> > to prevent that from happening.
> >
> > Derek
> >
> > On Sun, Mar 15, 2009 at 1:59 PM, Charles F. Munat <c...@munat.com> wrote:
> >
> >     That was my thinking. It doesn't explain why &ccedil; in gets changed to
> >     &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there
> >     are two separate issues here.
> >
> >     The ç can be created in two different ways in UTF-8.
> >     One is the single "c with a cedilla" character. The second is a c
> >     character followed by a cedilla character. I am not sure how UTF-8
> >     indicates that these two characters should be displayed as one. Neither
> >     am I sure that this has anything to do with the problem. Maybe it is
> >     simply that something is assuming Latin1 input even though the input is
> >     UTF-8.
> >
> >     It is definitely on the front end, because it is stored in the database
> >     as Ã§.
> >
> >     When I use &ccedil; instead, the problem is that it is *not* converted
> >     to ç as it goes into the database, and then on the way out the XML
> >     interpreter does not recognize it as a character entity reference and so
> >     converts the & to &amp;.
> >
> >     Chas.
> >
> >     Marc Boschma wrote:
> >     > Now I have some breakfast in me, to be clear it appears that UTF-8 byte
> >     > stream is being interpreted as Latin1 and then converted to unicode...
> >     >
> >     > Marc
> >     > On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
> >     >
> >     >> excuse the typo:
> >     >> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
> >     >>
> >     >>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
> >     >>> lines:
> >     >>> Character  Latin1 code  Unicode  UTF-8  Latin1 interpr.
> >     >>> ç          E7           00 E7    C3 A7  Ã§
> >     >>> Ã is C38C, § is C2 A7
> >     >> Ã is C383
> >     >>> So it appears that somewhere there is a translation to Latin 1 going on.
> >     >>> Hopefully that helps somewhat...
> >     >>> Regards,
> >     >>> Marc
> >     >>>
> >     >>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
> >     >>>
> >     >>>> This is really interesting. I've narrowed it down to something on
> >     >>>> form submission. The database shows gibberish, too, and if I
> >     >>>> manually enter the correct value in the DB it works fine on display.
> >     >>>> If I print the UTF-8 byte values of the string I get from the
> >     >>>> browser for my description when I submit a cedilla (ç), I see:
> >     >>>>
> >     >>>> INFO - Submitted desc bytes = c3 83 c2 a7
> >     >>>>
> >     >>>> A cedilla is c3 a7 in UTF-8, so I'm not sure where the "83 c2" is
> >     >>>> coming from. I googled around a bit and I found other people having
> >     >>>> the same issue but it wasn't clear in those posts what the cause
> >     >>>> was. I did a packet capture just as a sanity check, and here's what
> >     >>>> I got:
> >     >>>>
> >     >>>> POST / HTTP/1.1
> >     >>>> ... headers here ...
> >     >>>>
> >     >>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
> >     >>>>
> >     >>>> As you can see, the (url encoded) value of the F956759623048IZR
> >     >>>> field (description) is %C3%A7, so something isn't properly
> >     >>>> converting that. Helpers.urlDecode seems to be working properly:
> >     >>>>
> >     >>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
> >     >>>> res1: java.lang.String = F956759623048IZR=ç
> >     >>>>
> >     >>>> So I have no idea where this is coming from. All I know is that
> >     >>>> between the actual POST and when my submit function is called,
> >     >>>> something is tweaking the string. I'm going to dig some more, but I
> >     >>>> wanted to post this in case it triggers any thoughts out there.
> >     >>>>
> >     >>>> Derek
> >     >>>>
> >     >>>> PS - I just found this:
> >     >>>>
> >     >>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
> >     >>>>
> >     >>>> May be related?
> >     >>>>
> >     >>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
> >     >>>> <dchenbec...@gmail.com> wrote:
> >     >>>>
> >     >>>>     OK, I can replicate this in our PocketChange app (also going
> >     >>>>     against a PostgreSQL DB).
> >     >>>>     Let me dig a bit.
> >     >>>>
> >     >>>>     Derek
> >     >>>>
> >     >>>>     On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
> >     >>>>     <c...@munat.com> wrote:
> >     >>>>
> >     >>>>         This might help, but I don't think I was clear. I have an
> >     >>>>         online form. My clients enter text into it. Their text has
> >     >>>>         characters like a c with a cedilla. That text gets saved
> >     >>>>         into a PostgreSQL database (UTF-8) varchar field via
> >     >>>>         JPA/Hibernate.
> >     >>>>
> >     >>>>         Then I pull it back out and dump it into a template, and it
> >     >>>>         comes out gibberish. If I try using &ccedil; instead, I get
> >     >>>>         &amp;ccedil; back out.
> >     >>>>
> >     >>>>         Here is what I have:
> >     >>>>
> >     >>>>         "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
> >     >>>>
> >     >>>>         If I enter "cachaça" in the field, I get cachaÃ§a back out.
> >     >>>>         The weird thing is that sometimes when I copy and paste text
> >     >>>>         from another document into the form, it works. But if I use
> >     >>>>         the keyboard, it fails every time.
> >     >>>>
> >     >>>>         I'll play around with this. Thanks.
> >     >>>>
> >     >>>>         Chas.
> >     >>>>
> >     >>>>         Derek Chen-Becker wrote:
> >     >>>>         > Oops, forgot scala.xml.Unparsed, too:
> >     >>>>         >
> >     >>>>         > scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
> >     >>>>         > m: scala.xml.Elem = <span>a&ccedil;b</span>
> >     >>>>         >
> >     >>>>         > That one might be what you're looking for.
> >     >>>>         >
> >     >>>>         > Derek
> >     >>>>         >
> >     >>>>         > On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
> >     >>>>         > <dchenbec...@gmail.com> wrote:
> >     >>>>         >
> >     >>>>         >     I think it depends on how you're embedding them in the XML:
> >     >>>>         >
> >     >>>>         >     scala> val m = <span>a&ccedil;b</span>
> >     >>>>         >     m: scala.xml.Elem = <span>a&ccedil;b</span>
> >     >>>>         >
> >     >>>>         >     scala> val m = <span>a{"&ccedil;"}b</span>
> >     >>>>         >     m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
> >     >>>>         >
> >     >>>>         >     scala> val m = <span>a{"ç"}b</span>
> >     >>>>         >     m: scala.xml.Elem = <span>açb</span>
> >     >>>>         >
> >     >>>>         >     That last one was input using dead keys (alt+,) on my
> >     >>>>         >     linux (USA International with dead keys) layout. Let me
> >     >>>>         >     know if this doesn't help; if not, could you send the
> >     >>>>         >     code/template that's having issues?
> >     >>>>         >
> >     >>>>         >     Derek
> >     >>>>         >
> >     >>>>         >     On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
> >     >>>>         >     <c...@munat.com> wrote:
> >     >>>>         >
> >     >>>>         >         I have a site that uses a lot of "special" characters
> >     >>>>         >         (a remarkably biased description, since there is
> >     >>>>         >         nothing "special" about accented characters to the
> >     >>>>         >         people who use them daily). In particular, I need the
> >     >>>>         >         c with cedilla and the n with the tilde.
> >     >>>>         >
> >     >>>>         >         These characters are being input to a database
> >     >>>>         >         (UTF-8) via an online form, then spit back out onto
> >     >>>>         >         the page.
> >     >>>>         >
> >     >>>>         >         It's a fucking disaster.
> >     >>>>         >         Apparently, everything goes through the xml parser,
> >     >>>>         >         which is great, except when I try to enter these as
> >     >>>>         >         entity references, such as &ccedil;, the parser
> >     >>>>         >         changes & to &amp; and I get the literal &ccedil;
> >     >>>>         >         back out again.
> >     >>>>         >
> >     >>>>         >         When I type ç using the keyboard (or copy and paste
> >     >>>>         >         it from a page or a text editor), I get gibberish.
> >     >>>>         >
> >     >>>>         >         Anyone know the trick to getting around this? I need
> >     >>>>         >         everything from e acute to e grave to trademark and
> >     >>>>         >         registered trademark symbols, and I need to enter
> >     >>>>         >         them this way.
> >     >>>>         >
> >     >>>>         >         Thanks for any help. If I can get this to work, I'll
> >     >>>>         >         add an explanation to the wiki.
> >     >>>>         >
> >     >>>>         >         Chas.