[Lift] Re: xml parser, utf-8, special characters... kill me now

Derek Chen-Becker Sun, 15 Mar 2009 13:41:32 -0700

Sorry, I'm not suggesting that this is the appropriate method for users;
they should just be able to type. I was just trying to explain why the "&"
is getting expanded. I think that the current behavior is not really what
anyone wants, and hopefully we can fix it in a transparent manner.


Derek
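
A rough sketch of what a transparent fix for the entity case could look like, assuming user input is cleaned up before it is stored: named references typed by the user are turned into the characters they stand for, so nothing starting with "&" is left for the XML layer to escape. EntityDecoding, decodeNamedEntities, escapeText and the small entity table are hypothetical names for illustration only, not part of Lift or scala.xml, and escapeText merely imitates the escaping that is applied to text nodes.

```scala
import scala.util.matching.Regex

object EntityDecoding {
  // Illustrative subset of HTML named entities (hypothetical table, not Lift's).
  private val named: Map[String, String] = Map(
    "ccedil" -> "\u00e7", // c with cedilla
    "ntilde" -> "\u00f1", // n with tilde
    "eacute" -> "\u00e9", // e acute
    "egrave" -> "\u00e8", // e grave
    "reg"    -> "\u00ae", // registered trademark
    "trade"  -> "\u2122"  // trademark
  )

  private val EntityRef: Regex = "&([A-Za-z]+);".r

  // Replace known named references with their character; leave unknown ones untouched.
  def decodeNamedEntities(input: String): String =
    EntityRef.replaceAllIn(input, m =>
      Regex.quoteReplacement(named.getOrElse(m.group(1), m.matched)))

  // Imitation of text-node escaping, reduced to the characters that matter here.
  def escapeText(text: String): String =
    text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
}
```

With this, escapeText("&ccedil;") still gives "&amp;ccedil;" (the behavior discussed in this thread), while decoding first leaves a plain ç that needs no escaping at all.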

On Sun, Mar 15, 2009 at 2:38 PM, Charles F. Munat <c...@munat.com> wrote:

>
> Unfortunately, there is no easy way to do that with user input. But the
> use of character entity references is problematic in itself. I can't
> teach all my site's users all the references they will need, nor is it
> really reasonable to expect, for example, an international group of
> users to have to hand code every accented character.
>
> There must be a way to input UTF-8 and have it come out properly. I've
> set the keyboard on my Mac to U.S. Extended, which makes everything
> UTF-8. I note that *most* of the keyboards available for the Mac are
> UTF-8 (though the default U.S. keyboard is Roman, and there are many
> European keyboards that are Roman or Cyrillic).
>
> Ideally, Lift would recognize the character encoding and act
> appropriately. (I'd be happy to convert everything to UTF-8.) Another
> possibility, much less preferred but at least workable, would be to add
> the ability for the user to select the character encoding (they could
> use trial and error if they weren't sure).
>
> But the upshot is that someone with a keyboard set to UTF-8 (which
> includes much of the world) should be able to use that keyboard and have
> it come out the same way it went in. I have no idea how to accomplish
> this, however, as I don't know how that part of Lift works.
>
> Chas.
>
> Derek Chen-Becker wrote:
> > The scala XML syntax automatically converts any "&" in embedded strings
> > to "&amp;". You have to put the string inside a scala.xml.Unparsed node
> > to prevent that from happening.
> >
> > Derek
> >
> > On Sun, Mar 15, 2009 at 1:59 PM, Charles F. Munat <c...@munat.com> wrote:
> >
> >
> >     That was my thinking. It doesn't explain why &ccedil; in gets changed to
> >     &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think there
> >     are two separate issues here.
> >
> >     The ç can be created in two different ways in UTF-8. One is the single
> >     "c with a cedilla" character. The second is a c character followed by a
> >     cedilla character. I am not sure how UTF-8 indicates that these two
> >     characters should be displayed as one. Neither am I sure that this has
> >     anything to do with the problem. Maybe it is simply that something is
> >     assuming Latin1 input even though the input is UTF-8.
> >
> >     It is definitely on the front end, because it is stored in the database
> >     as Ã§.
> >
> >     When I use &ccedil; instead, the problem is that it is *not* converted
> >     to ç as it goes into the database, and then on the way out the XML
> >     interpreter does not recognize it as a character entity reference and so
> >     converts the & to &amp;.
> >
> >     Chas.
> >
> >     Marc Boschma wrote:
> >      > Now that I have some breakfast in me: to be clear, it appears that the
> >      > UTF-8 byte stream is being interpreted as Latin1 and then converted to
> >      > Unicode...
> >      >
> >      > Marc
> >      > On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
> >      >
> >      >> excuse the typo:
> >      >> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
> >      >>
> >      >>> Just looking at http://jeppesn.dk/utf-8.html , I found the following
> >      >>> lines:
> >      >>> Character   Latin1   Unicode   UTF-8   Latin1
> >      >>>                      code              interpr.
> >      >>> ç           E7       00 E7     C3 A7   Ã§
> >      >>> Ã is C38C, § is C2 A7
> >      >> Ã is C383
> >      >>> So it appears that somewhere there is a translation to Latin 1 going on.
> >      >>> Hopefully that helps somewhat...
> >      >>> Regards,
> >      >>> Marc
> >      >>>
> >      >>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
> >      >>>
> >      >>>> This is really interesting. I've narrowed it down to something on
> >      >>>> form submission. The database shows gibberish, too, and if I
> >      >>>> manually enter the correct value in the DB it works fine on display.
> >      >>>> If I print the UTF-8 byte values of the string I get from the
> >      >>>> browser for my description when I submit a c with cedilla (ç), I see:
> >      >>>>
> >      >>>> INFO - Submitted desc bytes = c3 83 c2 a7
> >      >>>>
> >      >>>> A c with cedilla is c3 a7 in UTF-8, so I'm not sure where the "83 c2"
> >      >>>> is coming from. I googled around a bit and I found other people having
> >      >>>> the same issue but it wasn't clear in those posts what the cause
> >      >>>> was. I did a packet capture just as a sanity check, and here's what
> >      >>>> I got:
> >      >>>>
> >      >>>> POST / HTTP/1.1
> >      >>>> ... headers here ...
> >      >>>>
> >      >>>>
> >      >>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
> >      >>>>
> >      >>>> As you can see, the (url encoded) value of the F956759623048IZR
> >      >>>> field (description) is %C3%A7, so something isn't properly
> >      >>>> converting that. Helpers.urlDecode seems to be working properly:
> >      >>>>
> >      >>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
> >      >>>> res1: java.lang.String = F956759623048IZR=ç
> >      >>>>
> >      >>>> So I have no idea where this is coming from. All I know is that
> >      >>>> between the actual POST and when my submit function is called,
> >      >>>> something is tweaking the string. I'm going to dig some more, but I
> >      >>>> wanted to post this in case it triggers any thoughts out there.
> >      >>>>
> >      >>>> Derek
> >      >>>>
> >      >>>> PS - I just found this:
> >      >>>>
> >      >>>>
> >      >>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
> >      >>>>
> >      >>>> May be related?
> >      >>>>
> >      >>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
> >      >>>> <dchenbec...@gmail.com> wrote:
> >      >>>>
> >      >>>>     OK, I can replicate this in our PocketChange app (also going
> >      >>>>     against a PostgreSQL DB). Let me dig a bit.
> >      >>>>
> >      >>>>     Derek
> >      >>>>
> >      >>>>
> >      >>>>     On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
> >      >>>>     <c...@munat.com> wrote:
> >      >>>>
> >      >>>>
> >      >>>>         This might help, but I don't think I was clear. I have an
> >      >>>>         online form. My clients enter text into it. Their text has
> >      >>>>         characters like a c with a cedilla. That text gets saved into
> >      >>>>         a PostgreSQL database (UTF-8) varchar field via JPA/Hibernate.
> >      >>>>
> >      >>>>         Then I pull it back out and dump it into a template, and it
> >      >>>>         comes out gibberish. If I try using &ccedil; instead, I get
> >      >>>>         &amp;ccedil; back out.
> >      >>>>
> >      >>>>         Here is what I have:
> >      >>>>
> >      >>>>         "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
> >      >>>>
> >      >>>>         If I enter "cachaça" in the field, I get cachaÃ§a back out.
> >      >>>>         The weird thing is that sometimes when I copy and paste text
> >      >>>>         from another document into the form, it works. But if I use
> >      >>>>         the keyboard, it fails every time.
> >      >>>>
> >      >>>>         I'll play around with this. Thanks.
> >      >>>>
> >      >>>>         Chas.
> >      >>>>
> >      >>>>         Derek Chen-Becker wrote:
> >      >>>>         > Oops, forgot scala.xml.Unparsed, too:
> >      >>>>         >
> >      >>>>         > scala> val m = <span>a{scala.xml.Unparsed("&ccedil;")}b</span>
> >      >>>>         > m: scala.xml.Elem = <span>a&ccedil;b</span>
> >      >>>>         >
> >      >>>>         > That one might be what you're looking for.
> >      >>>>         >
> >      >>>>         > Derek
> >      >>>>         >
> >      >>>>         > On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
> >      >>>>         > <dchenbec...@gmail.com> wrote:
> >      >>>>         >
> >      >>>>         >     I think it depends on how you're embedding them in the XML:
> >      >>>>         >
> >      >>>>         >     scala> val m = <span>a&ccedil;b</span>
> >      >>>>         >     m: scala.xml.Elem = <span>a&ccedil;b</span>
> >      >>>>         >
> >      >>>>         >     scala> val m = <span>a{"&ccedil;"}b</span>
> >      >>>>         >     m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
> >      >>>>         >
> >      >>>>         >     scala> val m = <span>a{"ç"}b</span>
> >      >>>>         >     m: scala.xml.Elem = <span>açb</span>
> >      >>>>         >
> >      >>>>         >     That last one was input using dead keys (alt+,) on my
> >      >>>>         >     Linux (USA International with dead keys) layout. Let me
> >      >>>>         >     know if this doesn't help; if not, could you send the
> >      >>>>         >     code/template that's having issues?
> >      >>>>         >
> >      >>>>         >     Derek
> >      >>>>         >
> >      >>>>         >
> >      >>>>         >     On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
> >      >>>>         >     <c...@munat.com> wrote:
> >      >>>>         >
> >      >>>>         >
> >      >>>>         >         I have a site that uses a lot of "special" characters
> >      >>>>         >         (a remarkably biased description, since there is
> >      >>>>         >         nothing "special" about accented characters to the
> >      >>>>         >         people who use them daily). In particular, I need the
> >      >>>>         >         c with cedilla and the n with the tilde.
> >      >>>>         >
> >      >>>>         >         These characters are being input to a database
> >      >>>>         >         (UTF-8) via an online form, then spit back out onto
> >      >>>>         >         the page.
> >      >>>>         >
> >      >>>>         >         It's a fucking disaster. Apparently, everything goes
> >      >>>>         >         through the xml parser, which is great, except when I
> >      >>>>         >         try to enter these as entity references, such as
> >      >>>>         >         &ccedil;, the parser changes & to &amp; and I get the
> >      >>>>         >         literal &ccedil; back out again.
> >      >>>>         >
> >      >>>>         >         When I type ç using the keyboard (or copy and paste
> >      >>>>         >         it from a page or a text editor), I get gibberish.
> >      >>>>         >
> >      >>>>         >         Anyone know the trick to getting around this? I need
> >      >>>>         >         everything from e acute to e grave to trademark and
> >      >>>>         >         registered trademark symbols, and I need to enter
> >      >>>>         >         them this way.
> >      >>>>         >
> >      >>>>         >         Thanks for any help. If I can get this to work, I'll
> >      >>>>         >         add an explanation to the wiki.
> >      >>>>         >
> >      >>>>         >         Chas.
>
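
The byte sequence in the log line above (c3 83 c2 a7) is what results when the UTF-8 bytes of ç are read as Latin-1 and then encoded as UTF-8 again, as Marc suggests. Below is a minimal sketch that reproduces this with the JVM's standard charsets. Mojibake, hex, misdecode and repair are names made up for this example, and nothing here shows where in the request path the wrong decoding actually happens.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

object Mojibake {
  // Render bytes the way the log line in the thread does, e.g. "c3 a7".
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Decode UTF-8 bytes as if they were Latin-1: every byte becomes one character.
  def misdecode(correct: String): String =
    new String(correct.getBytes(UTF_8), ISO_8859_1)

  // Undo the damage; only valid if every character of the garbled string fits in Latin-1.
  def repair(garbled: String): String =
    new String(garbled.getBytes(ISO_8859_1), UTF_8)
}
```

Mojibake.misdecode("\u00e7") gives the two characters Ã§, whose UTF-8 bytes print as c3 83 c2 a7. repair is only a way to salvage rows that are already stored garbled; the cleaner fix is to make sure the form data is decoded as UTF-8 in the first place.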

