Thanks for tracking this down, Godmar. I've emailed tictocs and we'll see what they say.

-Glen :-)

------------------------------------------------------------------
From: Godmar Back <god...@gmail.com>
Sender: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU>
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 13:20:08 -0500
Message-ID: <719dced30912211020y7b726c83jc54d0fadcba92...@mail.gmail.com>

The string in question is double-encoded, that is, a string that was
already in UTF-8 was run through a UTF-8 encoder a second time.

The string is "Acta Ortopedica" where the 'e' is really '\u00e9', aka
'Latin Small Letter E with Acute'. [1]

In UTF-8, the e-acute is encoded as the two bytes C3 A9. If you run the
bytes C3 A9 through a UTF-8 encoder, C3 ('\u00c3', Capital A with tilde)
becomes C3 83 and A9 ('\u00a9', copyright sign) becomes C2 A9.

C3 83 C2 A9 is exactly what JISC is serving; what it should be serving
is C3 A9.

Send email to them.

 - Godmar

[1] http://www.utf8-chartable.de/

2009/12/21 Glen Newton <glen.new...@nrc-cnrc.gc.ca>
>
> [I realise there was a recent related 'Character-sets for dummies'[1]
> discussion]
>
> I am using the tictocs[2] list of journal RSS feeds, and I am getting
> gibberish in places for diacritics. Below is an example:
>
> in emacs:
> 221 Acta Ortop dica Brasileira
> http://www.scielo.br/rss.php?pid=1413-7852&lang=en 1413-7852
> in Firefox:
> 221 Acta Ortop dica Brasileira
> http://www.scielo.br/rss.php?pid=1413-7852&lang=en 1413-7852
>
> Note that the emacs view is the same both for a file saved from
> Firefox and for a direct download using 'wget'.
>
> Is this something on my end, or are the tictocs people not serving
> proper UTF-8?
>
> The HTTP header from wget claims UTF-8:
> > wget -S http://www.tictocs.ac.uk/text.php
> > --2009-12-21 12:47:59-- http://www.tictocs.ac.uk/text.php
> > Resolving www.tictocs.ac.uk... 130.88.101.131
> > Connecting to www.tictocs.ac.uk|130.88.101.131|:80... connected.
> > HTTP request sent, awaiting response...
> > HTTP/1.1 200 OK
> > Date: Mon, 21 Dec 2009 17:42:05 GMT
> > Server: Apache/2.2.13 (Unix) mod_ssl/2.2.13 OpenSSL/0.9.8k PHP/5.3.0 DAV/2
> > X-Powered-By: PHP/5.3.0
> > Content-Type: text/plain; charset=utf-8
> > Connection: close
> > Length: unspecified [text/plain]
> ><....stuff removed>
>
> Can someone validate if they are also experiencing this issue?
>
> Thanks,
> Glen
>
> [1]https://listserv.nd.edu/cgi-bin/wa?S2=CODE4LIB&q=&s=character-sets+for+dummies&f=&a=&b=
> [2]http://www.tictocs.ac.uk/text.php
>
> --
> Glen Newton | glen.new...@nrc-cnrc.gc.ca
> Researcher, Information Science, CISTI Research
> & NRC W3C Advisory Committee Representative
> http://tinyurl.com/yvchmu
> tel/tél: 613-990-9163 | facsimile/télécopieur 613-952-8246
> Canada Institute for Scientific and Technical Information (CISTI)
> National Research Council Canada (NRC)| M-55, 1200 Montreal Road
> http://www.nrc-cnrc.gc.ca/
> Institut canadien de l'information scientifique et technique (ICIST)
> Conseil national de recherches Canada | M-55, 1200 chemin Montréal
> Ottawa, Ontario K1A 0R6
> Government of Canada | Gouvernement du Canada
> --
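
The double encoding Godmar describes in this thread can be reproduced in a few lines of Python. This is only a sketch: it assumes the already-encoded UTF-8 bytes were misread as Latin-1 (ISO-8859-1) characters before being encoded a second time, which matches the bytes C3 83 C2 A9 he reports but is not confirmed to be what the tictocs server actually does. The helper name undo_double_encoding is hypothetical and chosen only for illustration.

```python
# The title as it should appear; the 'é' is U+00E9.
correct = "Acta Ortopédica Brasileira"

# Correct UTF-8: 'é' becomes the two bytes C3 A9.
good_bytes = correct.encode("utf-8")

# Double encoding (assumed mechanism): the UTF-8 bytes are misread as
# Latin-1 characters (C3 -> 'Ã', A9 -> '©') and then encoded as UTF-8
# a second time, so C3 A9 turns into C3 83 C2 A9.
double_bytes = good_bytes.decode("latin-1").encode("utf-8")


def undo_double_encoding(data: bytes) -> bytes:
    """Reverse one extra round of UTF-8 encoding (hypothetical helper).

    Only valid if the extra round went through Latin-1, as assumed above.
    """
    return data.decode("utf-8").encode("latin-1")


e_acute_once = "é".encode("utf-8")
e_acute_twice = e_acute_once.decode("latin-1").encode("utf-8")

print(e_acute_once.hex())   # bytes for a correctly encoded e-acute
print(e_acute_twice.hex())  # bytes after the second, unwanted encoding
print(undo_double_encoding(double_bytes).decode("utf-8"))
```

Reversing the damage on the client side is a workaround at best; as Godmar says, the real fix is for the server to stop encoding the string twice.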