Re: [CODE4LIB] Character problems with tictoc

2009-12-22 Thread Bucknell, Terry
Thanks to everyone for drawing our attention to this issue.

A couple of days ago the ticTOCs service moved to a new server where the data 
is stored as UTF-8 (which it wasn't before). We'd forgotten to remove the UTF-8 
conversion in text.php, so we were serving double-encoded content (UTF-8 encoded 
as UTF-8) until our developer put it right in the middle of the discussion on 
this list (which started at 5pm our time!).

You should find the problem is fixed now.
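
For anyone maintaining a similar script, here is a minimal sketch in Python of one way to guard against this kind of double conversion. The thread does not show what text.php actually does, so this is only an illustration: the function name to_utf8_once is hypothetical, and the sketch assumes the legacy data was Latin-1.

```python
def to_utf8_once(raw: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Return UTF-8 bytes, converting only if raw is not already valid UTF-8.

    Hypothetical helper; assumes the pre-migration data was Latin-1.
    """
    try:
        raw.decode("utf-8")   # already valid UTF-8: leave it untouched
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as legacy data and convert once.
        return raw.decode(legacy_encoding).encode("utf-8")


title = "Acta Ortop\u00e9dica Brasileira"
already_utf8 = title.encode("utf-8")
legacy_latin1 = title.encode("latin-1")

print(to_utf8_once(already_utf8) == already_utf8)    # unchanged
print(to_utf8_once(legacy_latin1) == already_utf8)   # converted once
```

The validity check is a heuristic: a few Latin-1 byte sequences also happen to be valid UTF-8, so removing the conversion once the stored data is known to be UTF-8 remains the cleaner fix.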


Terry


Terry Bucknell
Electronic Resources Manager
Sydney Jones Library
University of Liverpool
Chatham St, PO Box 123
Liverpool, L69 3DA, UK
Tel: +44 (0)151 794 2692
Fax: +44 (0)151 794 2681





[CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
[I realise there was a related 'Character-sets for dummies'[1]
discussion recently] 

I am using the tictocs[2] list of journal RSS feeds, and I am getting
gibberish in places for diacritics. Below is an example:

in emacs:
 221  Acta OrtopÃ©dica Brasileira
http://www.scielo.br/rss.php?pid=1413-7852&lang=en  1413-7852
in Firefox:
 221  Acta OrtopÃ©dica Brasileira
http://www.scielo.br/rss.php?pid=1413-7852&lang=en  1413-7852

Note that the emacs view is the same for a file saved from Firefox and
for a direct download using 'wget'.

Is this something on my end, or are the tictocs people not serving
proper UTF-8? 

The HTTP header from wget claims UTF-8:
 wget -S http://www.tictocs.ac.uk/text.php
 --2009-12-21 12:47:59--  http://www.tictocs.ac.uk/text.php
 Resolving www.tictocs.ac.uk... 130.88.101.131
 Connecting to www.tictocs.ac.uk|130.88.101.131|:80... connected.
 HTTP request sent, awaiting response... 
   HTTP/1.1 200 OK
   Date: Mon, 21 Dec 2009 17:42:05 GMT
   Server: Apache/2.2.13 (Unix) mod_ssl/2.2.13 OpenSSL/0.9.8k PHP/5.3.0 DAV/2
   X-Powered-By: PHP/5.3.0
   Content-Type: text/plain; charset=utf-8
   Connection: close
 Length: unspecified [text/plain]
stuff removed

Can someone confirm whether they are also experiencing this issue?

Thanks,
Glen

[1]https://listserv.nd.edu/cgi-bin/wa?S2=CODE4LIB&q=&s=character-sets+for+dummies&f=&a=&b=
[2]http://www.tictocs.ac.uk/text.php

-- 
Glen Newton | glen.new...@nrc-cnrc.gc.ca
Researcher, Information Science, CISTI Research
 NRC W3C Advisory Committee Representative
http://tinyurl.com/yvchmu
tel/tél: 613-990-9163 | facsimile/télécopieur 613-952-8246
Canada Institute for Scientific and Technical Information (CISTI)
National Research Council Canada (NRC)| M-55, 1200 Montreal Road
http://www.nrc-cnrc.gc.ca/
Institut canadien de l'information scientifique et technique (ICIST) 
Conseil national de recherches Canada | M-55, 1200 chemin Montréal
Ottawa, Ontario K1A 0R6  
Government of Canada | Gouvernement du Canada   
--


Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
The string in question is double-encoded, that is, a string that's in
UTF-8 already was run through a UTF-8 encoder.

The string is Acta Ortopedica where the 'e' is really '\u00e9' aka
'Latin Small Letter E with Acute'. [1]

In UTF-8, the e-acute is encoded as two bytes, C3 A9.  If you run the
bytes C3 A9 through a UTF-8 encoder that treats each byte as a Latin-1
character, C3 ('\u00c3' - Capital A with tilde) becomes C3 83 and A9
(copyright sign, '\u00a9') becomes C2 A9.  C3 83 C2 A9 is exactly what
JISC is serving; what it should be serving is C3 A9.
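
The byte arithmetic above can be reproduced with a short Python sketch. It assumes the second encoder read its input as Latin-1, which is what the observed bytes suggest; the helper hex_bytes is only for display and is not part of any tool mentioned in this thread.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


original = "\u00e9"               # LATIN SMALL LETTER E WITH ACUTE
once = original.encode("utf-8")   # the correct UTF-8 encoding

# The second pass misreads the two bytes as the Latin-1 characters
# U+00C3 and U+00A9, then encodes each of those as two UTF-8 bytes.
twice = once.decode("latin-1").encode("utf-8")

print(hex_bytes(once))    # C3 A9
print(hex_bytes(twice))   # C3 83 C2 A9
```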

Send email to them.

 - Godmar

[1] http://www.utf8-chartable.de/



Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
Thanks for tracking this down, Godmar. 
I've emailed tictocs and we'll see what they say.

-Glen :-)




Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
It seems that different people are seeing different things in their
respective viewers (i.e., some are OK and others see what I am
seeing). 

When I use wget and view the local file in Firefox (3.0.4, Linux Suse
11.0) I see:
 http://cuvier.cisti.nrc.ca/~gnewton/tictoc1.gif
[gif used as it is not lossy]

The text is clearly not correct.

The file I got with wget is:
  http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt

Is this just a question of different client software (and/or OSes)
viewing or mangling the content?

-glen



Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
Thanks, Erik, some useful tools and advice.

I've solved the problem:

Using the emacs hexl-find-file, I could see that the wget file was OK:   
 
21b0: 2d33 3638 320a 3232 3109 4163 7461 204f  -3682.221.Acta O
21c0: 7274 6f70 c3a9 6469 6361 2042 7261 7369  rtop..dica Brasi
21d0: 6c65 6972 6109 6874 7470 3a2f 2f77 7777  leira.http://www

But not the file saved from Firefox:

21b0: 2d33 3638 320a 3232 3109 4163 7461 204f  -3682.221.Acta O
21c0: 7274 6f70 c383 c2a9 6469 6361 2042 7261  rtop....dica Bra
21d0: 7369 6c65 6972 6109 6874 7470 3a2f 2f77  sileira.http://w

I checked my default character encoding in Firefox
[3.0.4: Edit->Preferences; Content.Default Font.Advanced; Character
encoding.Default Character Encoding] and it turned out it was
'Western ISO-Latin 8859-1' (!). I changed it to 'UTF-8' and all the 
diacritic problems went away.

So it was a client software configuration problem, not the tictocs
site. 

I'll send tictocs an update email.

But I don't understand why Firefox was ignoring the
 Content-Type: text/plain; charset=utf-8
It should not be using the default charset (ISO-Latin 8859-1) for 
this content, as it has been told the text encoding is UTF-8...
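
A minimal sketch in Python of why the display alone cannot settle this question. It assumes a client that honours the charset in Content-Type and falls back to a default otherwise; the function name charset_from_content_type and the sample byte strings are illustrative, not taken from Firefox.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, else the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


declared = charset_from_content_type("text/plain; charset=utf-8")

good_body = b"Ortop\xc3\xa9dica"            # correct UTF-8
double_body = b"Ortop\xc3\x83\xc2\xa9dica"  # double-encoded UTF-8

as_declared = good_body.decode(declared)       # what a correct client shows
as_latin1 = good_body.decode("iso-8859-1")     # client ignoring the header
double_shown = double_body.decode(declared)    # correct client, bad data

print(as_latin1 == double_shown)
```

Good bytes decoded as Latin-1 and double-encoded bytes decoded as UTF-8 produce the same mangled text, so only a look at the raw bytes tells the two causes apart.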

--

Thanks to all who helped (on- and off-list),

Glen

--
From: Erik Hetzner erik.hetz...@ucop.edu
Sender:   Code for Libraries CODE4LIB@LISTSERV.ND.EDU
To:   CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 11:24:49 -0800
Message-ID:  p-irc-exbe01l9ntdej1...@ex.ucop.edu

When dealing with character set issues (especially the dreaded
double-encoding!) I find it best to use hex editors or dumpers. If in
emacs, try M-x hexl-find-file. On a Unix command line, the od or hd
commands are useful.
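
Where od or hd is not to hand, a rough equivalent can be sketched in a few lines of Python. This is an illustration only; the function name hex_dump and its layout are my own and only loosely imitate hd.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return an hd-style dump: offset, hex bytes, printable ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as dots.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:04x}  {hex_part:<{width * 3 - 1}}  |{text_part}|")
    return "\n".join(lines)


print(hex_dump(b" Ortop\xc3\xa9dica Bra"))           # correct UTF-8
print(hex_dump(b" Ortop\xc3\x83\xc2\xa9dica Bra"))   # double-encoded
```

The pair c3 a9 in the first dump and the run c3 83 c2 a9 in the second are the patterns to look for.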

For the record:

0000  48 54 54 50 2f 31 2e 31  20 32 30 30 20 4f 4b 0d  |HTTP/1.1 200 OK.|
0010  0a 44 61 74 65 3a 20 4d  6f 6e 2c 20 32 31 20 44  |.Date: Mon, 21 D|
0020  65 63 20 32 30 30 39 20  31 39 3a 32 32 3a 33 38  |ec 2009 19:22:38|
0030  20 47 4d 54 0d 0a 53 65  72 76 65 72 3a 20 41 70  | GMT..Server: Ap|
0040  61 63 68 65 2f 32 2e 32  2e 31 33 20 28 55 6e 69  |ache/2.2.13 (Uni|
0050  78 29 20 6d 6f 64 5f 73  73 6c 2f 32 2e 32 2e 31  |x) mod_ssl/2.2.1|
0060  33 20 4f 70 65 6e 53 53  4c 2f 30 2e 39 2e 38 6b  |3 OpenSSL/0.9.8k|
0070  20 50 48 50 2f 35 2e 33  2e 30 20 44 41 56 2f 32  | PHP/5.3.0 DAV/2|
0080  0d 0a 58 2d 50 6f 77 65  72 65 64 2d 42 79 3a 20  |..X-Powered-By: |
0090  50 48 50 2f 35 2e 33 2e  30 0d 0a 43 6f 6e 74 65  |PHP/5.3.0..Conte|
00a0  6e 74 2d 54 79 70 65 3a  20 74 65 78 74 2f 70 6c  |nt-Type: text/pl|
00b0  61 69 6e 3b 20 63 68 61  72 73 65 74 3d 75 74 66  |ain; charset=utf|
00c0  2d 38 0d 0a 54 72 61 6e  73 66 65 72 2d 45 6e 63  |-8..Transfer-Enc|
00d0  6f 64 69 6e 67 3a 20 63  68 75 6e 6b 65 64 0d 0a  |oding: chunked..|
...
2230  4f 72 74 68 6f 70 61 65  64 69 63 61 09 68 74 74  |Orthopaedica.htt|
2240  70 3a 2f 2f 69 6e 66 6f  72 6d 61 68 65 61 6c 74  |p://informahealt|
2250  68 63 61 72 65 2e 63 6f  6d 2f 61 63 74 69 6f 6e  |hcare.com/action|
2260  2f 73 68 6f 77 46 65 65  64 3f 6a 63 3d 6f 72 74  |/showFeed?jc=ort|
2270  26 74 79 70 65 3d 65 74  6f 63 26 66 65 65 64 3d  |&type=etoc&feed=|
2280  72 73 73 09 31 37 34 35  2d 33 36 37 34 09 31 37  |rss.1745-3674.17|
2290  34 35 2d 33 36 38 32 0a  32 32 31 09 41 63 74 61  |45-3682.221.Acta|
22a0  20 4f 72 74 6f 70 c3 a9  64 69 63 61 20 42 72 61  | Ortop..dica Bra|
22b0  73 69 6c 65 69 72 61 09  68 74 74 70 3a 2f 2f 77  |sileira.http://w|
...

best,
Erik Hetzner

--
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3



Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Erik Hetzner
At Mon, 21 Dec 2009 14:59:01 -0500,
Glen Newton wrote:
 Thanks, Erik, some useful tools and advice.

Glad to help!

 […]

 But I don't understand why Firefox was ignoring the
  Content-Type: text/plain; charset=utf-8
 It should not be using the default charset (ISO-Latin 8859-1) for 
 this content, as it has been told the text encoding is UTF-8...

It seems to work fine in my version of Firefox (Mozilla/5.0 (X11; U;
Linux i686; en-US; rv:1.9.1.6) Gecko/20091215 Ubuntu/9.10 (karmic)
Firefox/3.5.6), with latin-1 default.

best,
Erik
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
Just for the record, I was using:
 Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.4) Gecko/2008103100
 SUSE/3.0.4-4.7 Firefox/3.0.4 

I have upgraded to 3.5.6  :-)

-glen



Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
I believe they've changed it while we were having the discussion.

When I downloaded the file (with curl), it looked like this:

0020700   r   t   o   p   C etx   B   )   d   i   c   a  sp   B   r   a
72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61
0020720   s   i   l   e   i   r   a  ht   h   t   t   p   :   /   /   w
73 69 6c 65 69 72 61 09 68 74 74 70 3a 2f 2f 77

 - Godmar




Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Godmar Back
On Mon, Dec 21, 2009 at 2:09 PM, Glen Newton glen.new...@nrc-cnrc.gc.ca wrote:

 The file I got with wget is:
  http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt


(Just to convince myself I'm not going nuts...) - this file, which
Glen downloaded with wget, appears double-encoded:

# curl -s http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt | od -a -t x1
| head -1082 | tail -4
0020660   -   3   6   8   2  nl   2   2   1  ht   A   c   t   a  sp   O
2d 33 36 38 32 0a 32 32 31 09 41 63 74 61 20 4f
0020700   r   t   o   p   C etx   B   )   d   i   c   a  sp   B   r   a
72 74 6f 70 c3 83 c2 a9 64 69 63 61 20 42 72 61
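
For a copy of the file that has already been saved in this state, one layer of double encoding can often be undone. A minimal Python sketch follows, assuming the extra pass treated the bytes as Latin-1 (a Windows-1252 pass would need a different codec); the function name undo_double_utf8 is hypothetical.

```python
def undo_double_utf8(data: bytes) -> bytes:
    """Remove one layer of Latin-1-as-UTF-8 double encoding, if present.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        repaired = data.decode("utf-8").encode("latin-1")
        repaired.decode("utf-8")   # the result must itself be valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data                # not double-encoded, or not reversible
    return repaired


double = b"Ortop\xc3\x83\xc2\xa9dica"
single = b"Ortop\xc3\xa9dica"

print(undo_double_utf8(double) == single)   # one layer removed
print(undo_double_utf8(single) == single)   # correct data left alone
```

This is a heuristic: text that legitimately contains a sequence such as a capital A with tilde followed by a copyright sign would be altered, so it is best run only on data already known to be affected.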

 - Godmar


Re: [CODE4LIB] Character problems with tictoc

2009-12-21 Thread Glen Newton
I agree with Godmar: it looks like (some) change happened to tictocs
between my original wget download and the one I downloaded after I
changed my browser settings. 

It appears Godmar is not going nuts (or at least this issue is not due
to him going nuts!)  ;-)

Viewing the file http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt
with my newly installed firefox 3.5.6 I see mangled characters:

221 Acta OrtopÃƒÂ©dica Brasileira
http://www.scielo.br/rss.php?pid=1413-7852&lang=en  1413-7852

And my browser's default encoding is UTF-8.

So ignore most of my solution!  :-)

-glen

PS. I am contemplating trademarking "I see mangled characters"!! :-)

