Re: UTF-8 and changelog

2003-08-07 Thread Dagfinn Ilmari Mannsåker
Joel Baker <[EMAIL PROTECTED]> writes:

> Thus, unless you're using "high characters" not defined in US-ASCII, all
> of the following three statements are true:
>
> 1) It is a valid US-ASCII file
> 2) It is a valid ISO-8859-1 file

More generally, it's valid (and will appear identical) in any of the
ISO-8859-* encodings, since they are all the same in the first 128
characters.

> 3) It is a valid UTF-8 file
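
The point about the ISO-8859-* family can be checked mechanically. The following is only a minimal sketch in Python (not something used in the thread); the function name is made up for illustration, and it assumes the codec names Python's standard library ships for ISO-8859 parts 1 through 16 (there is no part 12).

```python
# ISO-8859 parts that Python ships codecs for (there is no part 12).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def decodes_identically(data: bytes) -> bool:
    """Return True if pure-ASCII data decodes to the same text everywhere.

    Hypothetical helper for illustration only.
    """
    reference = data.decode("ascii")  # raises UnicodeDecodeError on high bytes
    codec_names = ["utf-8"] + [f"iso8859_{n}" for n in ISO_8859_PARTS]
    return all(data.decode(name) == reference for name in codec_names)


# Every byte value from 0x00 through 0x7F.
print(decodes_identically(bytes(range(128))))
```

Input containing any byte above 0x7F is rejected by the initial ASCII decode, which is exactly the case where the encodings start to disagree.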

-- 
ilmari


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Re: UTF-8 and changelog

2003-08-04 Thread Matthias Urlichs
Hi, Stephen Gran wrote:

> Just a quick question about encoding changelog in utf-8.  My normal
> locale is iso-8859-1 (en_US or so, I guess), and `file changelog`
> returns 'ASCII text'.

Note that "file" doesn't look at the whole file, so it may miss non-ASCII
characters if they're not at the beginning.
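
If you would rather not depend on how much of the file gets inspected, a full scan is easy to write yourself. This is a rough sketch in Python, not a replacement for "file"; the function name and the sample data are invented for illustration.

```python
from typing import Optional


def first_non_ascii_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte >= 0x80, or None if all ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return None


# Hypothetical content: plain ASCII for a long stretch, then one Latin-1 byte.
sample = b"x" * 100_000 + "\u00e9".encode("latin-1")
print(first_non_ascii_offset(sample))
```

For a real changelog you would pass in the result of open(path, "rb").read(), so that every byte is examined no matter where the first high character sits.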

Personally, I use vim. I have the following line in my .vimrc:

:set fileencodings=utf-8,latin1

In other words, use utf-8 if the file is well-formed utf-8, else use
Latin1. (The set of latin1-encoded files which contain non-ASCII
characters and are still valid as UTF-8 is empty, for all practical
purposes.) This preserves the file's original encoding.
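
The same "try UTF-8 first, fall back to Latin-1" idea can be sketched outside the editor. This Python snippet only mirrors the logic described above; it is not how vim implements it, and the function name is an assumption made for the example.

```python
def guess_encoding(data: bytes) -> str:
    """Return "utf-8" if data is well-formed UTF-8, otherwise "latin-1"."""
    try:
        data.decode("utf-8")  # strict by default: malformed input raises
    except UnicodeDecodeError:
        return "latin-1"  # every byte sequence decodes as Latin-1
    return "utf-8"


name = "Manns\u00e5ker"
print(guess_encoding(name.encode("utf-8")))
print(guess_encoding(name.encode("latin-1")))
```

The fallback works because a lone Latin-1 high byte almost never forms a valid UTF-8 sequence with its neighbours, so the strict UTF-8 decode fails and the Latin-1 branch is taken. Pure ASCII input is reported as utf-8, which is harmless since the bytes are the same either way.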

Changing the encoding to utf-8 is as simple as typing
:set fileencoding=utf-8
before saving. (Note the singular.)

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  [EMAIL PROTECTED]
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
-- 
"Burn the libraries, for their value is in this one book (the Koran)."
   [Omar I, 2nd Caliph, at the capture of Alexandria]



Re: UTF-8 and changelog

2003-08-04 Thread Stephen Gran
This one time, at band camp, Joel Baker said:
> US-ASCII only defines characters from 0x00 through 0x7F (0 - 127); it is a
> formal subset of both ISO-8859-1 (Latin-1) and UTF-8. Or, more precisely,
> both Latin-1 and UTF-8 are proper supersets of US-ASCII, largely to prevent
> being gratuitously backwards-incompatible with the standard that has been
> used for decades.

This was what I had thought, mostly from osmosis, rather than any real
research.

> Thus, unless you're using "high characters" not defined in US-ASCII, all
> of the following three statements are true:
> 
> 1) It is a valid US-ASCII file
> 2) It is a valid ISO-8859-1 file
> 3) It is a valid UTF-8 file
> 
> It's only once you get into characters not found in US-ASCII that things
> differ. So, unless and until you add any, you don't need to worry about
> conversions (and, at that point, you should just add them as UTF-8
> characters, and not worry about Latin-1 at all :)
> 
> FWIW, you can use the 'en_US.UTF-8' locale if you want to see everything in
> Unicode. However, at least on woody, some applications won't cope with this
> well (many of them have newer versions in unstable that cope just fine,
> though).

Thanks for filling in the blanks, and at least giving me enough to
google intelligently from here.  The changelog issue is at rest, but I
would like to know more about this for the future.

Thanks again,
-- 
 -
|   ,''`.Stephen Gran |
|  : :' :[EMAIL PROTECTED] |
|  `. `'Debian user, admin, and developer |
|`- http://www.debian.org |
 -




Re: UTF-8 and changelog

2003-08-04 Thread Joel Baker
On Mon, Aug 04, 2003 at 01:15:07PM -0400, Stephen Gran wrote:
> Hello all,
> 
> Just a quick question about encoding changelog in utf-8.  My normal
> locale is iso-8859-1 (en_US or so, I guess), and `file changelog`
> returns 'ASCII text'.  I tried 
> `iconv -f ISO-8859-1 -t utf8 changelog -o changelog.new`, but then 
> `file changelog.new` returns 'ASCII text' again, and diff shows no
> difference.  Do I need to be doing this each time, or can I leave it be?
> 
> As you can probably tell, I am not that familiar with the issues around
> utf-8, but my impression was that it is a superset of ASCII, so if I
> only use ASCII characters, it should be fine.  I checked with the line
> from developers-reference (footnote 76, IIRC) and got an exit code of 0,
> but since I am not sure about this kind of thing, I thought I had better
> ask.

US-ASCII only defines characters from 0x00 through 0x7F (0 - 127); it is a
formal subset of both ISO-8859-1 (Latin-1) and UTF-8. Or, more precisely,
both Latin-1 and UTF-8 are proper supersets of US-ASCII, largely to prevent
being gratuitously backwards-incompatible with the standard that has been
used for decades.

Thus, unless you're using "high characters" not defined in US-ASCII, all
of the following three statements are true:

1) It is a valid US-ASCII file
2) It is a valid ISO-8859-1 file
3) It is a valid UTF-8 file

It's only once you get into characters not found in US-ASCII that things
differ. So, unless and until you add any, you don't need to worry about
conversions (and, at that point, you should just add them as UTF-8
characters, and not worry about Latin-1 at all :)
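
To make the "things differ" part concrete, here is a small Python sketch (the helper name is invented for illustration) comparing the bytes the two encodings produce for the same text.

```python
def encodings_agree(text: str) -> bool:
    """True if Latin-1 and UTF-8 produce the same bytes for this text."""
    try:
        return text.encode("latin-1") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character that Latin-1 cannot represent at all.
        return False


print(encodings_agree("changelog"))  # ASCII only
print("\u00e9".encode("latin-1"))    # one byte: 0xE9
print("\u00e9".encode("utf-8"))      # two bytes: 0xC3 0xA9
print(encodings_agree("\u00e9"))
```

As long as the text stays within US-ASCII the byte sequences match, so there is nothing to convert. The first character above 0x7F is where a Latin-1 file and a UTF-8 file stop being interchangeable.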

FWIW, you can use the 'en_US.UTF-8' locale if you want to see everything in
Unicode. However, at least on woody, some applications won't cope with this
well (many of them have newer versions in unstable that cope just fine,
though).
-- 
Joel Baker <[EMAIL PROTECTED]>




Re: UTF-8 and changelog

2003-08-04 Thread Colin Watson
On Mon, Aug 04, 2003 at 01:15:07PM -0400, Stephen Gran wrote:
> Just a quick question about encoding changelog in utf-8.  My normal
> locale is iso-8859-1 (en_US or so, I guess), and `file changelog`
> returns 'ASCII text'.  I tried 
> `iconv -f ISO-8859-1 -t utf8 changelog -o changelog.new`, but then 
> `file changelog.new` returns 'ASCII text' again, and diff shows no
> difference.  Do I need to be doing this each time, or can I leave it be?

If the file is already plain ASCII (i.e. contains no top-bit-set
characters), then there is no difference between ISO-8859-1 and UTF-8,
and you needn't worry. The procedure you just described demonstrates
that the file is indeed plain ASCII.
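
The reasoning behind that can be sketched in a few lines of Python. This is an illustration of the same round trip, not what iconv does internally, and both function names are assumptions made for the example.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret Latin-1 bytes as text and re-encode that text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def is_plain_ascii(data: bytes) -> bool:
    """True when the conversion is a no-op, i.e. no byte has its top bit set."""
    return latin1_to_utf8(data) == data


print(is_plain_ascii(b"  * Initial release.\n"))
print(is_plain_ascii(b"caf\xe9"))
```

Any byte with the top bit set becomes two bytes in UTF-8, so the converted output can only equal the input when no such byte is present. An empty diff after the conversion therefore shows the file was plain ASCII to begin with.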

Cheers,

-- 
Colin Watson  [EMAIL PROTECTED]



Re: UTF-8 and changelog

2003-08-04 Thread Stephen Gran
This one time, at band camp, Mike Hommey said:
> On Monday 04 August 2003 19:15, Stephen Gran wrote:
> [snip]
> > As you can probably tell, I am not that familiar with the issues around
> > utf-8, but my impression was that it is a superset of ASCII, so if I
> > only use ASCII characters, it should be fine. 
> 
> You are perfectly right.

OK, thanks - that makes life easy.

-- 
 -
|   ,''`.Stephen Gran |
|  : :' :[EMAIL PROTECTED] |
|  `. `'Debian user, admin, and developer |
|`- http://www.debian.org |
 -




Re: UTF-8 and changelog

2003-08-04 Thread Mike Hommey
On Monday 04 August 2003 19:15, Stephen Gran wrote:
[snip]
> As you can probably tell, I am not that familiar with the issues around
> utf-8, but my impression was that it is a superset of ASCII, so if I
> only use ASCII characters, it should be fine. 

You are perfectly right.

-- 
"I have sampled every language, french is my favorite. Fantastic language,
especially to curse with. Nom de dieu de putain de bordel de merde de
saloperie de connard d'enculé de ta mère. It's like wiping your ass
with silk! I love it." -- The Merovingian, in the Matrix Reloaded


