Re: [CODE4LIB] Character problems with tictoc
Thanks to everyone for drawing our attention to this issue.

A couple of days ago the ticTOCs service moved to a new server where the data is stored as UTF-8 (which it wasn't before). We'd forgotten to remove the UTF-8 conversion in text.php, so we were serving double-encoded content (UTF-8 encoded as UTF-8) until our developer put it right in the middle of the discussion on this list (which started at 5pm our time!).

You should find the problem is fixed now.

Terry

Terry Bucknell
Electronic Resources Manager
Sydney Jones Library
University of Liverpool
Chatham St, PO Box 123
Liverpool, L69 3DA, UK
Tel: +44 (0)151 794 2692
Fax: +44 (0)151 794 2681

-----Original Message-----
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Glen Newton
Sent: 21 December 2009 17:52
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Character problems with tictoc

[…]
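The fix Terry describes was simply to remove the leftover conversion step. The actual text.php code is not shown in this thread, so the following is only an illustration, in Python rather than PHP, of how a conversion step could be made safe to run on data that may already be UTF-8. The function name to_utf8 and the Latin-1 default are assumptions made for this sketch.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Convert raw bytes to UTF-8 unless they already are valid UTF-8.

    Hypothetical helper: the legacy encoding is assumed to be Latin-1.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as legacy data and convert it.
        return raw.decode(legacy_encoding).encode("utf-8")
    # Already valid UTF-8: converting again would double-encode it.
    return raw


legacy = b"Ortop\xe9dica"           # Latin-1 bytes for the journal title fragment
already = b"Ortop\xc3\xa9dica"      # the same text stored as UTF-8

print(to_utf8(legacy) == already)   # True
print(to_utf8(already) == already)  # True
```

The check is a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be passed through unchanged, so knowing the stored encoding remains the more reliable fix.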
[CODE4LIB] Character problems with tictoc
[I realise there was a recent related 'Character-sets for dummies'[1] discussion]

I am using the tictocs[2] list of journal RSS feeds, and I am getting gibberish in places for diacritics. Below is an example:

in emacs:
221 Acta Ortop dica Brasileira http://www.scielo.br/rss.php?pid=1413-7852&lang=en 1413-7852

in Firefox:
221 Acta Ortop dica Brasileira http://www.scielo.br/rss.php?pid=1413-7852&lang=en 1413-7852

Note that the emacs view is the same for both a file saved from Firefox and a direct download using 'wget'.

Is this something on my end, or are the tictocs people not serving proper UTF-8?

The HTTP header from wget claims UTF-8:

wget -S http://www.tictocs.ac.uk/text.php
--2009-12-21 12:47:59-- http://www.tictocs.ac.uk/text.php
Resolving www.tictocs.ac.uk... 130.88.101.131
Connecting to www.tictocs.ac.uk|130.88.101.131|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 21 Dec 2009 17:42:05 GMT
  Server: Apache/2.2.13 (Unix) mod_ssl/2.2.13 OpenSSL/0.9.8k PHP/5.3.0 DAV/2
  X-Powered-By: PHP/5.3.0
  Content-Type: text/plain; charset=utf-8
  Connection: close
Length: unspecified [text/plain]
stuff removed

Can someone confirm whether they are also experiencing this issue?

Thanks,
Glen

[1]https://listserv.nd.edu/cgi-bin/wa?S2=CODE4LIBq=s=character-sets+for+dummiesf=a=b=
[2]http://www.tictocs.ac.uk/text.php

--
Glen Newton | glen.new...@nrc-cnrc.gc.ca
Researcher, Information Science, CISTI Research
NRC W3C Advisory Committee Representative
http://tinyurl.com/yvchmu
tel/tél: 613-990-9163 | facsimile/télécopieur 613-952-8246
Canada Institute for Scientific and Technical Information (CISTI)
National Research Council Canada (NRC) | M-55, 1200 Montreal Road
http://www.nrc-cnrc.gc.ca/
Institut canadien de l'information scientifique et technique (ICIST)
Conseil national de recherches Canada | M-55, 1200 chemin Montréal
Ottawa, Ontario K1A 0R6
Government of Canada | Gouvernement du Canada
--
Re: [CODE4LIB] Character problems with tictoc
The string in question is double-encoded, that is, a string that was already in UTF-8 was run through a UTF-8 encoder again.

The string is Acta Ortopedica where the 'e' is really '\u00e9', aka 'Latin Small Letter E with Acute'. [1]

In UTF-8, the e-acute is encoded as the two bytes C3 A9. If you run the bytes C3 A9 through a UTF-8 encoder, C3 ('\u00c3', Capital A with tilde) becomes C3 83 and A9 ('\u00a9', copyright sign) becomes C2 A9. C3 83 C2 A9 is exactly what JISC is serving; what it should be serving is C3 A9.

Send email to them.

- Godmar

[1] http://www.utf8-chartable.de/

2009/12/21 Glen Newton glen.new...@nrc-cnrc.gc.ca

[…]
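Godmar's byte arithmetic above can be reproduced in a few lines of Python. This is a sketch under one assumption: that the second encoding pass treated each byte of the UTF-8 data as a Latin-1 character, which is what his description implies. The helper names are made up for illustration.

```python
def encode_once(text: str) -> bytes:
    """Correct behaviour: encode the text as UTF-8 a single time."""
    return text.encode("utf-8")


def encode_twice(text: str) -> bytes:
    """Buggy behaviour: the UTF-8 bytes are misread as Latin-1 characters
    (an assumption about the faulty step) and encoded as UTF-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def as_hex(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(as_hex(encode_once("\u00e9")))   # C3 A9
print(as_hex(encode_twice("\u00e9")))  # C3 83 C2 A9
```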
Re: [CODE4LIB] Character problems with tictoc
Thanks for tracking this down, Godmar. I've emailed tictocs and we'll see what they say.

-Glen :-)

--
From: Godmar Back god...@gmail.com
Sender: Code for Libraries CODE4LIB@LISTSERV.ND.EDU
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 13:20:08 -0500
Message-ID: 719dced30912211020y7b726c83jc54d0fadcba92...@mail.gmail.com

[…]
Re: [CODE4LIB] Character problems with tictoc
It seems that different people are seeing different things in their respective viewers (i.e., some are OK and others are like what I am seeing).

When I use wget and view the local file in Firefox (3.0.4, Linux Suse 11.0) I see:
http://cuvier.cisti.nrc.ca/~gnewton/tictoc1.gif
[gif used as it is not lossy]

The text is clearly not correct.

The file I got with wget is:
http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt

Is this just a question of different client software (and/or OSes) viewing or mangling the content?

-glen
Re: [CODE4LIB] Character problems with tictoc
Thanks, Erik, some useful tools and advice.

I've solved the problem. Using the emacs hexl-find-file, I could see that the wget file was OK:

21b0: 2d33 3638 320a 3232 3109 4163 7461 204f  -3682.221.Acta O
21c0: 7274 6f70 c3a9 6469 6361 2042 7261 7369  rtop..dica Brasi
21d0: 6c65 6972 6109 6874 7470 3a2f 2f77  leira.http://www

But the file saved from Firefox was not:

21b0: 2d33 3638 320a 3232 3109 4163 7461 204f  -3682.221.Acta O
21c0: 7274 6f70 c383 c2a9 6469 6361 2042 7261  rtopdica Bra
21d0: 7369 6c65 6972 6109 6874 7470 3a2f 2f77  sileira.http://w

I checked my default character encoding in Firefox [3.0.4: Edit--Preferences; Content.Default Font.Advanced; Character encoding.Default Character Encoding] and it turned out it was 'Western ISO-Latin 8859-1' (!). I changed it to 'UTF-8' and all the diacritic problems went away.

So it was a client software configuration problem, not the tictocs site. I'll send tictocs an update email.

But I don't understand why Firefox was ignoring the
  Content-Type: text/plain; charset=utf-8
It should not be using the default charset (ISO-Latin 8859-1) for this content, as it has been told the text encoding is UTF-8...

Thanks to all who helped (on- and off-list),
Glen

--
From: Erik Hetzner erik.hetz...@ucop.edu
Sender: Code for Libraries CODE4LIB@LISTSERV.ND.EDU
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 11:24:49 -0800
Message-ID: p-irc-exbe01l9ntdej1...@ex.ucop.edu

At Mon, 21 Dec 2009 14:09:28 -0500, Glen Newton wrote:
> […]
> Is this just a question of different client software (and/or OSes)
> viewing or mangling the content?

When dealing with character set issues (especially the dreaded double-encoding!) I find it best to use hex editors or dumpers. If in emacs, try M-x hexl-find-file. On a Unix command line, the od or hd commands are useful.

For the record:

0000 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d |HTTP/1.1 200 OK.|
0010 0a 44 61 74 65 3a 20 4d 6f 6e 2c 20 32 31 20 44 |.Date: Mon, 21 D|
0020 65 63 20 32 30 30 39 20 31 39 3a 32 32 3a 33 38 |ec 2009 19:22:38|
0030 20 47 4d 54 0d 0a 53 65 72 76 65 72 3a 20 41 70 | GMT..Server: Ap|
0040 61 63 68 65 2f 32 2e 32 2e 31 33 20 28 55 6e 69 |ache/2.2.13 (Uni|
0050 78 29 20 6d 6f 64 5f 73 73 6c 2f 32 2e 32 2e 31 |x) mod_ssl/2.2.1|
0060 33 20 4f 70 65 6e 53 53 4c 2f 30 2e 39 2e 38 6b |3 OpenSSL/0.9.8k|
0070 20 50 48 50 2f 35 2e 33 2e 30 20 44 41 56 2f 32 | PHP/5.3.0 DAV/2|
0080 0d 0a 58 2d 50 6f 77 65 72 65 64 2d 42 79 3a 20 |..X-Powered-By: |
0090 50 48 50 2f 35 2e 33 2e 30 0d 0a 43 6f 6e 74 65 |PHP/5.3.0..Conte|
00a0 6e 74 2d 54 79 70 65 3a 20 74 65 78 74 2f 70 6c |nt-Type: text/pl|
00b0 61 69 6e 3b 20 63 68 61 72 73 65 74 3d 75 74 66 |ain; charset=utf|
00c0 2d 38 0d 0a 54 72 61 6e 73 66 65 72 2d 45 6e 63 |-8..Transfer-Enc|
00d0 6f 64 69 6e 67 3a 20 63 68 75 6e 6b 65 64 0d 0a |oding: chunked..|
...
2230 4f 72 74 68 6f 70 61 65 64 69 63 61 09 68 74 74 |Orthopaedica.htt|
2240 70 3a 2f 2f 69 6e 66 6f 72 6d 61 68 65 61 6c 74 |p://informahealt|
2250 68 63 61 72 65 2e 63 6f 6d 2f 61 63 74 69 6f 6e |hcare.com/action|
2260 2f 73 68 6f 77 46 65 65 64 3f 6a 63 3d 6f 72 74 |/showFeed?jc=ort|
2270 26 74 79 70 65 3d 65 74 6f 63 26 66 65 65 64 3d |&type=etoc&feed=|
2280 72 73 73 09 31 37 34 35 2d 33 36 37 34 09 31 37 |rss.1745-3674.17|
2290 34 35 2d 33 36 38 32 0a 32 32 31 09 41 63 74 61 |45-3682.221.Acta|
22a0 20 4f 72 74 6f 70 c3 a9 64 69 63 61 20 42 72 61 | Ortop..dica Bra|
22b0 73 69 6c 65 69 72 61 09 68 74 74 70 3a 2f 2f 77 |sileira.http://w|
...

best,
Erik Hetzner

--
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
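Where od, hd or hexl-mode are not to hand, a dump in the same spirit as the one above can be produced with a few lines of Python. This is a minimal sketch and not a replacement for those tools; the function name hexdump and the exact column layout are choices made for this example.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Return an hd-style dump: offset, hex bytes, printable ASCII column."""
    hex_width = width * 3 - 1  # two hex digits per byte plus separators
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as '.', as hd does.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:04x}  {hex_part:<{hex_width}}  |{text_part}|")
    return "\n".join(lines)


# Correctly encoded and double-encoded versions of the same fragment.
print(hexdump(b" Ortop\xc3\xa9dica Bra"))
print(hexdump(b" Ortop\xc3\x83\xc2\xa9dica Bra"))
```

Looking at the bytes rather than the rendered characters removes the viewer's own encoding settings from the picture, which is what made this thread confusing.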
Re: [CODE4LIB] Character problems with tictoc
At Mon, 21 Dec 2009 14:59:01 -0500, Glen Newton wrote:
> Thanks, Erik, some useful tools and advice.

Glad to help!

> […]
> But I don't understand why Firefox was ignoring the
> Content-Type: text/plain; charset=utf-8
> It should not be using the default charset (ISO-Latin 8859-1) for this
> content, as it has been told the text encoding is UTF-8...

It seems to work fine in my version of Firefox (Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.6) Gecko/20091215 Ubuntu/9.10 (karmic) Firefox/3.5.6), with latin-1 default.

best,
Erik

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
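The behaviour Glen expected, where the charset declared in the Content-Type header takes precedence over a client default, can be sketched as follows. This illustrates the principle only and makes no claim about how Firefox implements it; the function name and the ISO-8859-1 fallback are assumptions for the example.

```python
from email.message import Message


def charset_from_content_type(value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, else a default."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset() or default


body = b"Acta Ortop\xc3\xa9dica Brasileira"  # correctly encoded UTF-8 bytes
declared = charset_from_content_type("text/plain; charset=utf-8")

print(body.decode(declared))      # Acta Ortopédica Brasileira
print(body.decode("iso-8859-1"))  # Acta OrtopÃ©dica Brasileira
```

The second line of output shows what a client produces if it ignores the header and falls back to Latin-1: correct bytes rendered as if they were double-encoded.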
Re: [CODE4LIB] Character problems with tictoc
Just for the record, I was using:
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.4) Gecko/2008103100 SUSE/3.0.4-4.7 Firefox/3.0.4

I have upgraded to 3.5.6 :-)

-glen

--
From: Erik Hetzner erik.hetz...@ucop.edu
Sender: Code for Libraries CODE4LIB@LISTSERV.ND.EDU
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 12:14:54 -0800
Message-ID: p-irc-exbe01xjmxehy1...@ex.ucop.edu

[…]
Re: [CODE4LIB] Character problems with tictoc
I believe they've changed it while we were having the discussion. When I downloaded the file (with curl), it looked like this:

0020700   r   t   o   p   C etx   B   )   d   i   c   a  sp   B   r   a
         72  74  6f  70  c3  83  c2  a9  64  69  63  61  20  42  72  61
0020720   s   i   l   e   i   r   a  ht   h   t   t   p   :   /   /   w
         73  69  6c  65  69  72  61  09  68  74  74  70  3a  2f  2f  77

- Godmar

On Mon, Dec 21, 2009 at 2:24 PM, Erik Hetzner erik.hetz...@ucop.edu wrote:

[…]
Re: [CODE4LIB] Character problems with tictoc
On Mon, Dec 21, 2009 at 2:09 PM, Glen Newton glen.new...@nrc-cnrc.gc.ca wrote:
> The file I got with wget is:
> http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt

(Just to convince myself I'm not going nuts...) - this file, which Glen downloaded with wget, appears double-encoded:

# curl -s http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt | od -a -t x1 | head -1082 | tail -4
0020660   -   3   6   8   2  nl   2   2   1  ht   A   c   t   a  sp   O
         2d  33  36  38  32  0a  32  32  31  09  41  63  74  61  20  4f
0020700   r   t   o   p   C etx   B   )   d   i   c   a  sp   B   r   a
         72  74  6f  70  c3  83  c2  a9  64  69  63  61  20  42  72  61

- Godmar
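The same check Godmar does by eye on the od output can be automated. The sketch below assumes the extra encoding pass went through Latin-1, as in his earlier analysis; the function names are invented for this example, and the test is a heuristic rather than a proof.

```python
def undo_double_encoding(data: bytes) -> bytes:
    """Strip one surplus layer of UTF-8 encoding, or return data unchanged."""
    try:
        candidate = data.decode("utf-8").encode("latin-1")
        candidate.decode("utf-8")  # the result must still be valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
    return candidate


def looks_double_encoded(data: bytes) -> bool:
    """True if removing one encoding layer changes the bytes."""
    return undo_double_encoding(data) != data


served = bytes.fromhex("72 74 6f 70 c3 83 c2 a9 64 69 63 61")  # bytes from the dump above
wanted = bytes.fromhex("72 74 6f 70 c3 a9 64 69 63 61")        # what should be served

print(looks_double_encoded(served))            # True
print(looks_double_encoded(wanted))            # False
print(undo_double_encoding(served) == wanted)  # True
```

On a whole file this is best applied line by line, since a single character outside Latin-1 anywhere in the input makes the function give up and return it unchanged.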
Re: [CODE4LIB] Character problems with tictoc
I agree with Godmar: it looks like (some) change happened to tictocs between my original wget download and the one I downloaded after I changed my browser settings.

It appears Godmar is not going nuts (or at least this issue is not due to him going nuts!) ;-)

Viewing the file http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt with my newly installed Firefox 3.5.6, I see mangled characters:

221 Acta Ortop \u0192 dica Brasileira http://www.scielo.br/rss.php?pid=1413-7852lang=en 1413-7852

And my browser's default encoding is UTF-8.

So ignore most of my solution! :-)

-glen

PS. I am contemplating trademarking "I see mangled characters"!! :-)