[Lift] Re: xml parser, utf-8, special characters... kill me now

Derek Chen-Becker Sun, 15 Mar 2009 13:55:17 -0700

My understanding was that *all* ampersands get converted in raw embedded
strings:


scala> val m = <span>{ "&amp;" }</span>
m: scala.xml.Elem = <span>&amp;amp;</span>

scala> val m = <span>{ "&" }</span>
m: scala.xml.Elem = <span>&amp;</span>


If you write the entity directly in the XML literal instead of embedding a
string, it is passed through without any validation of the entity name:

scala> val m = <span>&ccedil;</span>
m: scala.xml.Elem = <span>&ccedil;</span>

If you want to embed a string with ampersands and you don't want them
escaped, use scala.xml.Unparsed:

scala> import scala.xml.Unparsed
import scala.xml.Unparsed

scala> val m = <span>{ Unparsed("&") }</span>
m: scala.xml.Elem = <span>&</span>

scala> val m = <span>{ Unparsed("&ccedil;") }</span>
m: scala.xml.Elem = <span>&ccedil;</span>


Back to the more important part of this post, I found this interesting
article:

http://www.intertwingly.net/blog/2004/04/15/Character-Encoding-and-HTML-Forms

It indicates that you can use the "accept-charset" attribute on the form
element itself to force a particular input encoding:

http://www.w3.org/TR/html401/interact/forms.html#h-17.3

It seems like we should be able to simply force UTF-8 via the form tag and
then fix whatever is interpreting the string as Latin-1. I'm going to hack a
little more here.
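
For what it's worth, the "c3 83 c2 a7" bytes from my earlier log are exactly
what you get if the UTF-8 bytes for ç are decoded as Latin-1 and the result
is encoded as UTF-8 again. Here is a minimal sketch of that round trip using
only java.nio.charset. MojibakeDemo, misdecode and repair are names I made up
for illustration, and this only simulates the suspected bug; it doesn't show
where in the request handling it actually happens.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

object MojibakeDemo {
  // Render bytes as lowercase hex pairs separated by spaces.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Simulate the suspected bug: take the UTF-8 bytes of a string
  // and (wrongly) decode them as Latin-1.
  def misdecode(s: String): String =
    new String(s.getBytes(UTF_8), ISO_8859_1)

  // Undo the damage: re-encode as Latin-1 to recover the original
  // bytes, then decode those bytes as UTF-8.
  def repair(s: String): String =
    new String(s.getBytes(ISO_8859_1), UTF_8)

  def main(args: Array[String]): Unit = {
    val original = "\u00e7" // c with cedilla
    val garbled  = misdecode(original)
    println(hex(original.getBytes(UTF_8))) // c3 a7
    println(hex(garbled.getBytes(UTF_8)))  // c3 83 c2 a7
    println(repair(garbled) == original)   // true
  }
}
```

The repair step only works as long as nothing in between has dropped or
replaced characters, so it is a diagnostic aid rather than a fix.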

Derek

On Sun, Mar 15, 2009 at 2:45 PM, Marc Boschma <marc+lift...@boschma.cx> wrote:

>
> On 16/03/2009, at 6:59 AM, Charles F. Munat wrote:
>
> >
> > That was my thinking. It doesn't explain why &ccedil; in gets changed
> > to &amp;ccedil;, but it explains why ç in becomes Ã§ out. So I think
> > there are two separate issues here.
>
> I tend to agree.
>
> >
> >
> > The ç can be created in two different ways in UTF-8. One is the single
> > "c with a cedilla" character. The second is a c character followed by a
> > cedilla character. I am not sure how UTF-8 indicates that these two
> > characters should be displayed as one.
>
> The two-character form is the code point sequence U+0063 U+0327, which
> is canonically equivalent to U+00E7 (and looks the same). The U+0327 is
> a combining cedilla that modifies the preceding 'c' (U+0063).
>
> > Neither am I sure that this has anything to do with the problem. Maybe
> > it is simply that something is assuming Latin1 input even though the
> > input is UTF-8.
> >
> > It is definitely on the front end, because it is stored in the database
> > as Ã§.
> >
> > When I use &ccedil; instead, the problem is that it is *not* converted
> > to ç as it goes into the database, and then on the way out the XML
> > interpreter does not recognize it as a character entity reference and
> > so converts the & to &amp;.
>
> I think this is due to using the standard Scala XML load functions
> rather than the lift XML parser. From memory, I don't think the
> standard parser recognises many named entities. Does &#x00E7; work
> instead of &ccedil;? If so, then that is probably what is happening
> with this issue.
>
> >
> >
> > Chas.
> >
> > Marc Boschma wrote:
> >> Now I have some breakfast in me, to be clear it appears that the UTF-8
> >> byte stream is being interpreted as Latin1 and then converted to
> >> unicode...
> >>
> >> Marc
> >> On 16/03/2009, at 6:25 AM, Marc Boschma wrote:
> >>
> >>> excuse the typo:
> >>> On 16/03/2009, at 6:23 AM, Marc Boschma wrote:
> >>>
> >>>> Just looking at http://jeppesn.dk/utf-8.html , I found the
> >>>> following lines:
> >>>> Character  Latin1  Unicode  UTF-8   Latin1
> >>>>            code                     interpr.
> >>>> ç          E7      00 E7    C3 A7   Ã§
> >>>> Ã is C38C, § is C2 A7
> >>> Ã is C383
> >>>> So it appears that somewhere there is a translation to Latin 1
> >>>> going on.
> >>>> Hopefully that helps somewhat...
> >>>> Regards,
> >>>> Marc
> >>>>
> >>>> On 16/03/2009, at 1:08 AM, Derek Chen-Becker wrote:
> >>>>
> >>>>> This is really interesting. I've narrowed it down to something on
> >>>>> form submission. The database shows gibberish, too, and if I
> >>>>> manually enter the correct value in the DB it works fine on
> >>>>> display. If I print the UTF-8 byte values of the string I get from
> >>>>> the browser for my description when I submit a cedilla (ç), I see:
> >>>>>
> >>>>> INFO - Submitted desc bytes = c3 83 c2 a7
> >>>>>
> >>>>> A c with cedilla (ç) is c3 a7 in UTF-8, so I'm not sure where the
> >>>>> "83 c2" is coming from. I googled around a bit and I found other
> >>>>> people having the same issue but it wasn't clear in those posts
> >>>>> what the cause was. I did a packet capture just as a sanity check,
> >>>>> and here's what I got:
> >>>>>
> >>>>> POST / HTTP/1.1
> >>>>> ... headers here ...
> >>>>>
> >>>>> F956759623045OFT=true&F956759623046BU5=1&F9567596230472LR=2009%2F03%2F18&F956759623048IZR=%C3%A7&F956759623049S3E=3&F956759623050E25=test
> >>>>>
> >>>>> As you can see, the (url encoded) value of the F956759623048IZR
> >>>>> field (description) is %C3%A7, so something isn't properly
> >>>>> converting that. Helpers.urlDecode seems to be working properly:
> >>>>>
> >>>>> scala> Helpers.urlDecode("F956759623048IZR=%C3%A7")
> >>>>> res1: java.lang.String = F956759623048IZR=ç
> >>>>>
> >>>>> So I have no idea where this is coming from. All I know is that
> >>>>> between the actual POST and when my submit function is called,
> >>>>> something is tweaking the string. I'm going to dig some more, but I
> >>>>> wanted to post this in case it triggers any thoughts out there.
> >>>>>
> >>>>> Derek
> >>>>>
> >>>>> PS - I just found this:
> >>>>>
> >>>>>
> >>>>> http://mail-archives.apache.org/mod_mbox/struts-dev/200604.mbox/%3c3769847.1145910729808.javamail.j...@brutus%3e
> >>>>>
> >>>>> May be related?
> >>>>>
> >>>>> On Sun, Mar 15, 2009 at 7:26 AM, Derek Chen-Becker
> >>>>> <dchenbec...@gmail.com> wrote:
> >>>>>
> >>>>>    OK, I can replicate this in our PocketChange app (also going
> >>>>>    against a PostgreSQL DB). Let me dig a bit.
> >>>>>
> >>>>>    Derek
> >>>>>
> >>>>>
> >>>>>    On Sun, Mar 15, 2009 at 3:58 AM, Charles F. Munat
> >>>>>    <c...@munat.com> wrote:
> >>>>>
> >>>>>
> >>>>>        This might help, but I don't think I was clear. I have an
> >>>>>        online form. My clients enter text into it. Their text has
> >>>>>        characters like a c with a cedilla. That text gets saved
> >>>>>        into a PostgreSQL database (UTF-8) varchar field via
> >>>>>        JPA/Hibernate.
> >>>>>
> >>>>>        Then I pull it back out and dump it into a template, and it
> >>>>>        comes out gibberish. If I try using &ccedil; instead, I get
> >>>>>        &amp;ccedil; back out.
> >>>>>
> >>>>>        Here is what I have:
> >>>>>
> >>>>>        "name" -> SHtml.text(thing.name, thing.name = _, ("size", "40"))
> >>>>>
> >>>>>        If I enter "cachaça" in the field, I get cachaÃ§a back out.
> >>>>>        The weird thing is that sometimes when I copy and paste text
> >>>>>        from another document into the form, it works. But if I use
> >>>>>        the keyboard, it fails every time.
> >>>>>
> >>>>>        I'll play around with this. Thanks.
> >>>>>
> >>>>>        Chas.
> >>>>>
> >>>>>        Derek Chen-Becker wrote:
> >>>>>> Oops, forgot scala.xml.Unparsed, too:
> >>>>>>
> >>>>>> scala> val m = <span>a{ scala.xml.Unparsed("&ccedil;") }b</span>
> >>>>>> m: scala.xml.Elem = <span>a&ccedil;b</span>
> >>>>>>
> >>>>>> That one might be what you're looking for.
> >>>>>>
> >>>>>> Derek
> >>>>>>
> >>>>>> On Sat, Mar 14, 2009 at 9:57 PM, Derek Chen-Becker
> >>>>>> <dchenbec...@gmail.com> wrote:
> >>>>>>
> >>>>>>    I think it depends on how you're embedding them in the XML:
> >>>>>>
> >>>>>>    scala> val m = <span>a&ccedil;b</span>
> >>>>>>    m: scala.xml.Elem = <span>a&ccedil;b</span>
> >>>>>>
> >>>>>>    scala> val m = <span>a{"&ccedil;"}b</span>
> >>>>>>    m: scala.xml.Elem = <span>a&amp;ccedil;b</span>
> >>>>>>
> >>>>>>    scala> val m = <span>a{"ç"}b</span>
> >>>>>>    m: scala.xml.Elem = <span>açb</span>
> >>>>>>
> >>>>>>    That last one was input using dead keys (alt+,) on my linux (USA
> >>>>>>    International with dead keys) layout. Let me know if this doesn't
> >>>>>>    help; if not, could you send the code/template that's having issues?
> >>>>>>
> >>>>>>    Derek
> >>>>>>
> >>>>>>
> >>>>>>    On Sat, Mar 14, 2009 at 6:36 PM, Charles F. Munat
> >>>>>>    <c...@munat.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>        I have a site that uses a lot of "special" characters (a
> >>>>>>        remarkably biased description, since there is nothing
> >>>>>>        "special" about accented characters to the people who use
> >>>>>>        them daily). In particular, I need the c with cedilla and
> >>>>>>        the n with the tilde.
> >>>>>>
> >>>>>>        These characters are being input to a database (UTF-8) via
> >>>>>>        an online form, then spit back out onto the page.
> >>>>>>
> >>>>>>        It's a fucking disaster. Apparently, everything goes through
> >>>>>>        the xml parser, which is great, except when I try to enter
> >>>>>>        these as entity references, such as &ccedil;, the parser
> >>>>>>        changes & to &amp; and I get the literal &ccedil; back out
> >>>>>>        again.
> >>>>>>
> >>>>>>        When I type ç using the keyboard (or copy and paste it from
> >>>>>>        a page or a text editor), I get gibberish.
> >>>>>>
> >>>>>>        Anyone know the trick to getting around this? I need
> >>>>>>        everything from e acute to e grave to trademark and
> >>>>>>        registered trademark symbols, and I need to enter them this
> >>>>>>        way.
> >>>>>>
> >>>>>>        Thanks for any help. If I can get this to work, I'll add an
> >>>>>>        explanation to the wiki.
> >>>>>>
> >>>>>>        Chas.

