Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Jan Engelhardt

On Jan 8 2007 14:17, Tim Pepper wrote:
> On 1/8/07, Pavel Machek <[EMAIL PROTECTED]> wrote:
>> On Sun 2007-01-07 22:30:55, Alan wrote:
>> > I think that would be a good idea - and add it to the coding/docs
>> > specs
>> > that documentation is UTF-8. Code should IMHO say 7bit though.
>> 
>> Yes, yes, please.
>> 
>> I have been flamed when someone tried to do 8bit patch, and I was
>> trying to NAK it...
>
> Could this get put in Documentation/CodingStyle?

Someone do that.

> And an item added to
> the kernel janitors' list to fix up 8bit files?  Last I looked trying

I have just done that: http://lkml.org/lkml/2007/1/8/222

> to decide if there was a standard here I found a mish-mash of
> encodings based on the output of file vs Linus' git tree.

-`J'
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Eberhard Moenkeberg
Hi,

On Tue, 9 Jan 2007, Jan Engelhardt wrote:
> On Jan 8 2007 22:00, Ken Moffat wrote:

> > Looks nicely done, but I query the postal address changes in
> >Documentation/cdrom/sbpcd - that seems to be a change of address
> >(without anything to explain it).
> 
> Eberhard [cc], please attach an Acked-by: YourName 
> keep Ccs, thanks ;-)
> 
> [thread/patch: http://lkml.org/lkml/2007/1/8/222 ]

Acked-by: Eberhard Moenkeberg <[EMAIL PROTECTED]>

Jan had contacted me before, and I had sent him my new address data.

This very young guy is doing a really good job. ;-))

Cheers -e
-- 
Eberhard Moenkeberg ([EMAIL PROTECTED], [EMAIL PROTECTED])


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Jan Engelhardt

On Jan 8 2007 22:00, Ken Moffat wrote:

> Looks nicely done, but I query the postal address changes in
>Documentation/cdrom/sbpcd - that seems to be a change of address
>(without anything to explain it).

Eberhard [cc], please attach an Acked-by: YourName 
keep Ccs, thanks ;-)

[thread/patch: http://lkml.org/lkml/2007/1/8/222 ]

-`J'
-- 


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Tim Pepper

On 1/8/07, Pavel Machek <[EMAIL PROTECTED]> wrote:

On Sun 2007-01-07 22:30:55, Alan wrote:
> I think that would be a good idea - and add it to the coding/docs specs
> that documentation is UTF-8. Code should IMHO say 7bit though.

Yes, yes, please.

I have been flamed when someone tried to do 8bit patch, and I was
trying to NAK it...


Could this get put in Documentation/CodingStyle?  And an item added to
the kernel janitors' list to fix up 8bit files?  Last I looked trying
to decide if there was a standard here I found a mish-mash of
encodings based on the output of file vs Linus' git tree.
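
A rough sketch of what such a janitor check could look like, in Python. This is only an illustration under stated assumptions: the function names classify_encoding and scan_tree are made up here, and "8bit-other" simply means "not ASCII and not valid UTF-8"; the bytes alone cannot say which legacy charset was intended.

```python
from pathlib import Path


def classify_encoding(data: bytes) -> str:
    """Classify raw file contents as 'ascii', 'utf-8', or '8bit-other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some legacy 8-bit charset (or binary data); the bytes alone
        # do not say which one.
        return "8bit-other"


def scan_tree(root: str) -> dict[str, str]:
    """Map every non-ASCII regular file under root to its classification."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            kind = classify_encoding(path.read_bytes())
            if kind != "ascii":
                report[str(path)] = kind
    return report
```

Files reported as "utf-8" would satisfy a "documentation is UTF-8" rule; files reported as "8bit-other" are the ones a human would have to inspect and convert.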


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Ken Moffat
On Mon, Jan 08, 2007 at 09:17:06PM +0100, Jan Engelhardt wrote:
> 
> On Jan 8 2007 02:22, Jan Engelhardt wrote:
> >On Jan 7 2007 22:30, Alan wrote:
> >>
> >>> >The kernel maintainers/help/config pretty consistently use UTF8
> >>> 
> >>> I've seen a lot of places that don't do so. Want a patch?
> >>
> >>I think that would be a good idea - and add it to the coding/docs specs
> >>that documentation is UTF-8. Code should IMHO say 7bit though.
> 
> Most memorable issues:
> 
> * "don´t" (standalone acute accent) rather than "don't"
> (apostrophe)
> * non-breaking spaces
> * cp437 encoding in some files (heh, heh, DOS!)
> * iso8859-1/utf-8 mixed in some files
 Looks nicely done, but I query the postal address changes in
Documentation/cdrom/sbpcd - that seems to be a change of address
(without anything to explain it).  Everything else seems to be just
character-set conversion or the occasional translation of comments
into English.  (And no, I didn't attempt to review the character-set
changes; even if there is an occasional error it will be better than
where we are now, and easy to patch.)
> 
> My compose key is hot now...
 I prefer the AltGr dead keys in X (they seem to work more reliably
for me), but I guess I'm straying OT.
> 
> None of you people screw that patch with your buggy MUAs! I'll pack
> it up into a .bz2 to get it marked as application/octet-stream to
> not even give your MUA the chance to. ;-) [and because it's 221 K 
> uncompressed and I am not sure if splitting it up makes much sense for 
> such 'trivial' changes, or not?]
> 
> Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>
> 
> 
>   -`J'
> -- 

 Thanks for doing this, I hope it wasn't in vain.

Ken
-- 
das eine Mal als Tragödie, das andere Mal als Farce


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Pavel Machek
On Sun 2007-01-07 22:30:55, Alan wrote:
> > >The kernel maintainers/help/config pretty consistently use UTF8
> > 
> > I've seen a lot of places that don't do so. Want a patch?
> 
> I think that would be a good idea - and add it to the coding/docs specs
> that documentation is UTF-8. Code should IMHO say 7bit though.

Yes, yes, please.

I have been flamed when someone tried to do 8bit patch, and I was
trying to NAK it...

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Valdis . Kletnieks
On Mon, 08 Jan 2007 01:38:57 +0100, Willy Tarreau said:
> it's clearly the proof of a flaw in the initial design. And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

It's no more stupid than the *current* situation with Linux kernel code, where
the stupidity actually requires that even if you know that there are only 60
characters on a given line, you actually have to look at each one in order to
figure out if the line goes past column 80.
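
A small Python sketch of that point, assuming the usual cause is tab indentation with 8-column tab stops (the helper name display_width is made up for this illustration). The character count of a line and the column it ends on are different numbers:

```python
def display_width(line: str, tabsize: int = 8) -> int:
    """Return the column reached after expanding each tab to the next tab stop."""
    return len(line.expandtabs(tabsize))


# Three tabs of indentation followed by 57 ordinary characters.
line = "\t" * 3 + "x" * 57

print(len(line))            # 60 characters
print(display_width(line))  # 81 columns, one past the 80-column limit
```

Note that this counts code points after tab expansion; it does not account for double-width or combining characters.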





Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Nicolas Mailhot

Le Lun 8 janvier 2007 11:44, Alan a écrit :
>> (case in point: Russell's system. I was ROTFL when he proudly announced
>> he
>> was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
>> the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)
>
> There is no correct UK encoding. You need -14 or -15 depending upon
> language and can come horribly unstuck the moment a name is involved.

Either way it's not iso-8859-1 :)

-- 
Nicolas Mailhot



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Alan
> (case in point: Russell's system. I was ROTFL when he proudly announced he
> was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
> the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)

There is no correct UK encoding. You need -14 or -15 depending upon
language and can come horribly unstuck the moment a name is involved.

Alan


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Nicolas Mailhot
>> How would you do this technically in a way that it's significantly
>> easier than simply finishing the UTF-8 transition?

> In how many decades do you think the transition will be finished ?

Right now it looks like it will be finished way earlier than apps will bother
supporting the later 8-bit encodings such as iso-8859-15.

(case in point: Russell's system. I was ROTFL when he proudly announced he
was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)

-- 
Nicolas Mailhot



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Nicolas Mailhot
> elinks is one such program.  It now assumes UTF-8 _only_ displays.
> That's no better than programs which assume ISO-8859-1 only or US-ASCII
> only.

That's way better than programs:
- which assume an encoding you can't write most world languages in (BTW
ISO-8859-1 & US-ASCII are broken by design for Western Europe since at
least the Euro creation)
- which perpetuate the myth that local 8-bit encodings are manageable (they
aren't; people spent decades trying to limp along with them, and Unicode &
UTF-8 were not created just to make your life miserable)
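
The Euro case can be checked with a few lines of Python. This is a minimal sketch using Python's standard codecs; the helper name can_encode is made up for the illustration:

```python
EURO = "\u20ac"  # EURO SIGN


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for charset in ("ascii", "iso-8859-1", "iso-8859-15", "utf-8"):
    print(charset, can_encode(EURO, charset))

# iso-8859-15 puts the Euro sign at byte 0xA4, where iso-8859-1 has the
# generic currency sign, so the same byte reads differently under each.
euro_byte = EURO.encode("iso-8859-15")
misread = euro_byte.decode("iso-8859-1")
```

Only iso-8859-15 and utf-8 can represent the character at all, and a byte written under iso-8859-15 silently turns into a different symbol when read as iso-8859-1.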

Show me one program that spurns Unicode and I'll show you one that "passed on"
iso-8859-15 (typically, though it's the easiest non-iso-8859-1 to do).

The only reason you have the UTF-8 big stick approach nowadays is that people
have tried for years to get app writers to manage 8-bit locales properly, with
dismal results. The old system was only working for en_US users (and
perhaps for .uk people).

-- 
Nicolas Mailhot



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-08 Thread Adrian Bunk
On Mon, Jan 08, 2007 at 07:52:48AM +0100, Jan Engelhardt wrote:
> 
> On Jan 8 2007 02:03, Adrian Bunk wrote:
> >
> >The only major MUA not supporting UTF-8 is Eudora.
> >
> >And if you are talking about buggy old pine, in the latest development 
> >version [1] it does not only become open source, it also got some 
> >working Unicode support.
> 
> Uhm, just for the record, I run pine 4.61 where my mail delivers to,
> and Unicode works, yes, including the spam.

For some years I have been using pine only as a newsreader, and I remember some 
display problems with Unicode characters that are fixed in Alpine.

It might be that the support in pine was already better than I thought
(but my switch to MUA was so many years ago...).

>   -`J'

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Jan Engelhardt

On Jan 8 2007 02:03, Adrian Bunk wrote:
>
>The only major MUA not supporting UTF-8 is Eudora.
>
>And if you are talking about buggy old pine, in the latest development 
>version [1] it does not only become open source, it also got some 
>working Unicode support.

Uhm, just for the record, I run pine 4.61 on the machine my mail is delivered
to, and Unicode works, yes, including the spam.


-`J'
-- 


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread David Woodhouse
On Sun, 2007-01-07 at 15:05 -0500, Dave Jones wrote:
> This has been bugging me for a while.
> Viewing the mail I applied in mutt shows his name correctly as Rafał
> Applying it with git-applymbox and viewing the log on master.kernel.org
> with git log shows Rafa   And then later when put into email
> it turns into Rafa³ 

I believe you need to use the misnamed '-u' option to git-applymbox,
which _really_ ought to be the default behaviour. Otherwise, it fails to
pay any attention to the character set tags in the mail it's decoding --
it commits the sin which rmk was whining about; assuming the input data
is of a given type and ignoring the explicit tags which indicate the
contrary.

The '-u' option is misdocumented as 'causes the resulting commit to be
encoded in utf-8', but in fact I believe it doesn't necessarily do that
-- it actually causes the resulting commit to be encoded in the
configured storage charset for the repository, which just _happens_ to
default to UTF-8 unless otherwise specified. That is something which
should definitely be the _default_ behaviour.

We should make the '-u' behaviour the default, and if anyone really
wants the old behaviour of importing arbitrary data in untagged 
binary form overriding its labelling then they can have a separate
option which does that.
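
To illustrate what "paying attention to the character set tags" means, here is a minimal Python sketch. It is not how git-applymbox is implemented; the sample message, its address, and the helper name body_as_utf8 are made up for this example. It reads the charset from the Content-Type header, decodes the body accordingly, and re-encodes it as UTF-8:

```python
from email import message_from_bytes

# A hypothetical 8-bit mail tagged as iso-8859-2. In that charset the
# byte 0xB3 is LATIN SMALL LETTER L WITH STROKE.
RAW = (
    b"From: someone@example.invalid\r\n"
    b"Content-Type: text/plain; charset=iso-8859-2\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Signed-off-by: Rafa\xb3\r\n"
)


def body_as_utf8(raw: bytes) -> bytes:
    """Decode the body using the charset the mail declares, then store as UTF-8."""
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset() or "us-ascii"
    payload = msg.get_payload(decode=True)  # raw body bytes
    return payload.decode(charset).encode("utf-8")


honoured = body_as_utf8(RAW).decode("utf-8")

# Ignoring the tag and assuming iso-8859-1 turns the same byte into
# SUPERSCRIPT THREE, which matches the mangled name quoted above.
ignored = b"Rafa\xb3".decode("iso-8859-1")
```

Honouring the tag yields the name with the Polish letter intact; assuming iso-8859-1 for the same bytes yields "Rafa³".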

-- 
dwmw2



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Adrian Bunk
On Mon, Jan 08, 2007 at 02:14:41AM +0100, Willy Tarreau wrote:
> On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote:
> > On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote:
> > > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> > > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > > > > 
> > > > > > On Jan 7 2007 17:06, Russell King wrote:
> > > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > > > > >
> > > > > > >$ git log | head -n 1000 | tail -n 200 > o
> > > > > > >$ file -i o
> > > > > > >o: text/plain; charset=us-ascii
> > > > > > >$ git log | head -n 1000 | tail -n 300 > o
> > > > > > >$ file -i o
> > > > > > >o: text/plain; charset=us-ascii
> > > > > > >$ git log | head -n 1000 | tail -n 400 > o
> > > > > > >$ file -i o
> > > > > > >o: text/plain; charset=utf-8
> > > > > > 
> > > > > > I am inclined to say that "file" does not count, because it tries 
> > > > > > to guess an
> > > > > > ambiguous mapping from bytes to character set. Even more, file 
> > > > > > should be
> > > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or 
> > > > > > worse: 15)
> > > > > > file. This program is soo... forget it, it's not an argument. It 
> > > > > > works well for
> > > > > > headerful files, but text files don't really contain one. The next 
> > > > > > best thing
> > > > > > > would be html, with a proper <meta> tag.
> > > > > 
> > > > > The stupidity from the start up with those character sets is that they
> > > > > consider that a whole file is written with a given set. In fact, the
> > > > > charset should apply to characters themselves. At least, the
> > > > > quoted-printable, non-human friendly, encoding was the least stupid.
> > > > 
> > > > I doubt doing this would really be worth the effort.
> > > > 
> > > > In the 21st century, people should simply use UTF-8.
> > > > 
> > > > > Now that UTF8 comes everywhere, everyone receives tons of mangled 
> > > > > mails,
> > > > > and even mailers which correctly support UTF8 and use it by default 
> > > > > manage
> > > > > to shoot themselves in the foot when they reply to, or forward a 
> > > > > mail. The
> > > > > system is completely broken because limited by design, and we have to 
> > > > > learn
> > > > > to live with this brokenness.
> > > > 
> > > > Only if MUAs have broken charset support or don't set a correct 
> > > > "charset" header in the mails they are sending.
> > > > 
> > > > If some software still can't handle UTF-8 correctly more than 10 years 
> > > > after it was introduced, that's not a brokenness you can blame on UTF-8.
> > > 
> > > I'm not blaming UTF-8 per se, but people who still believe in encoding
> > > *whole documents*. Copy-paste, text insertion, git output, etc... 
> > > everything
> > > has a good reason not to be in the same encoding as what your MUA 
> > > believes.
> > 
> > How would you do this technically in a way that it's significantly 
> > easier than simply finishing the UTF-8 transition?
> 
> In how many decades do you think the transition will be finished ?
> 
> > > If major MUAs still have problems with UTF-8 10 years after it was 
> > > introduced,
> > > it's clearly the proof of a flaw in the initial design. And I'm not even
> > > discussing the stupidity which requires that you read a whole text to get
> > > its number of characters !
> > 
> > The only major MUA not supporting UTF-8 is Eudora.
> > 
> > And if you are talking about buggy old pine, in the latest development 
> > version [1] it does not only become open source, it also got some 
> > working Unicode support.
> 
> No, I'm not speaking about "not supporting", but "having problems". Every
> one of us has already received mails from Thunderbird, Outlook, Notes, etc...
> with erroneously encoded characters because of this :
> 
>   - an UTF8 MUA sends a mail to a non-UTF8 aware one.

"non-UTF8 aware one" = Eudora (BTW: there's no Linux version)

>   - this last one only sees double chars. When it wants to forward the mail
> to someone else, it keeps the chars verbatim, and sets the encoding type
> to its own, something like iso8859-1 for instance.

Let's not base everything on the one broken non-Linux MUA.

>   - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
> combinations in the forwarded mail and decides that everything in it is
> UTF8, then you get lots of chars mangled in the mail, in the middle of
> UTF8 combinations. Then, this crappy mail can be forwarded as long as
> you want between UTF8 MUAs, they will all apply heuristics and do the
> wrong thing : consider the *whole* document with *one* type.

Which MUAs exactly do ignore the "charset" of an email and try their own 
guessing instead?

Or which MUAs exactly do not set a "charset" so that the receiving MUA 
might have a reason for guessing?

> What I find even funni

Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Horst H. von Brand
Russell King <[EMAIL PROTECTED]> wrote:

[...]

> All that UTF-8 has done is added to the "which charset is this data"
> problem rather than actually solving any proper real life problem.

It solves real-world problems; the pain is that it is not (yet) universally
used. The charset problems are much more visible today than, say, 15
years back, that is all.
-- 
Dr. Horst H. von Brand   User #22616 counter.li.org
Departamento de InformaticaFono: +56 32 2654431
Universidad Tecnica Federico Santa Maria +56 32 2654239
Casilla 110-V, Valparaiso, Chile   Fax:  +56 32 2797513


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Jan Engelhardt

On Jan 7 2007 22:30, Alan wrote:
>
>> >The kernel maintainers/help/config pretty consistently use UTF8
>> 
>> I've seen a lot of places that don't do so. Want a patch?
>
>I think that would be a good idea - and add it to the coding/docs specs
>that documentation is UTF-8. Code should IMHO say 7bit though.

Hm, where do the lists of authors in .c/.h files and kerneldoc
in .c/.h files belong: doc or code?


-`J'
-- 


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Willy Tarreau
On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote:
> On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote:
> > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > > > 
> > > > > On Jan 7 2007 17:06, Russell King wrote:
> > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > > > >
> > > > > >$ git log | head -n 1000 | tail -n 200 > o
> > > > > >$ file -i o
> > > > > >o: text/plain; charset=us-ascii
> > > > > >$ git log | head -n 1000 | tail -n 300 > o
> > > > > >$ file -i o
> > > > > >o: text/plain; charset=us-ascii
> > > > > >$ git log | head -n 1000 | tail -n 400 > o
> > > > > >$ file -i o
> > > > > >o: text/plain; charset=utf-8
> > > > > 
> > > > > I am inclined to say that "file" does not count, because it tries to 
> > > > > guess an
> > > > > ambiguous mapping from bytes to character set. Even more, file should 
> > > > > be
> > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or 
> > > > > worse: 15)
> > > > > file. This program is soo... forget it, it's not an argument. It 
> > > > > works well for
> > > > > headerful files, but text files don't really contain one. The next 
> > > > > best thing
> > > > > > would be html, with a proper <meta> tag.
> > > > 
> > > > The stupidity from the start up with those character sets is that they
> > > > consider that a whole file is written with a given set. In fact, the
> > > > charset should apply to characters themselves. At least, the
> > > > quoted-printable, non-human friendly, encoding was the least stupid.
> > > 
> > > I doubt doing this would really be worth the effort.
> > > 
> > > In the 21st century, people should simply use UTF-8.
> > > 
> > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > > > and even mailers which correctly support UTF8 and use it by default 
> > > > manage
> > > > to shoot themselves in the foot when they reply to, or forward a mail. 
> > > > The
> > > > system is completely broken because limited by design, and we have to 
> > > > learn
> > > > to live with this brokenness.
> > > 
> > > Only if MUAs have broken charset support or don't set a correct 
> > > "charset" header in the mails they are sending.
> > > 
> > > If some software still can't handle UTF-8 correctly more than 10 years 
> > > after it was introduced, that's not a brokenness you can blame on UTF-8.
> > 
> > I'm not blaming UTF-8 per se, but people who still believe in encoding
> > *whole documents*. Copy-paste, text insertion, git output, etc... everything
> > has a good reason not to be in the same encoding as what your MUA believes.
> 
> How would you do this technically in a way that it's significantly 
> easier than simply finishing the UTF-8 transition?

In how many decades do you think the transition will be finished ?

> > If major MUAs still have problems with UTF-8 10 years after it was 
> > introduced,
> > it's clearly the proof of a flaw in the initial design. And I'm not even
> > discussing the stupidity which requires that you read a whole text to get
> > its number of characters !
> 
> The only major MUA not supporting UTF-8 is Eudora.
> 
> And if you are talking about buggy old pine, in the latest development 
> version [1] it does not only become open source, it also got some 
> working Unicode support.

No, I'm not speaking about "not supporting", but "having problems". Every
one of us has already received mails from Thunderbird, Outlook, Notes, etc...
with erroneously encoded characters because of this :

  - an UTF8 MUA sends a mail to a non-UTF8 aware one.

  - this last one only sees double chars. When it wants to forward the mail
to someone else, it keeps the chars verbatim, and sets the encoding type
to its own, something like iso8859-1 for instance.

  - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
combinations in the forwarded mail and decides that everything in it is
UTF8, then you get lots of chars mangled in the mail, in the middle of
UTF8 combinations. Then, this crappy mail can be forwarded as long as
you want between UTF8 MUAs, they will all apply heuristics and do the
wrong thing : consider the *whole* document with *one* type.
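
The failure mode in the three steps above can be reproduced in a few lines of Python. This is only a sketch with made-up sample text: a body that mixes UTF-8 bytes kept verbatim with text typed in an iso-8859-1 MUA cannot be decoded correctly under any single charset.

```python
def decode_both_ways(body: bytes) -> tuple[str, str]:
    """Decode one mixed-encoding body as iso-8859-1 and as UTF-8."""
    as_latin1 = body.decode("iso-8859-1")
    as_utf8 = body.decode("utf-8", errors="replace")
    return as_latin1, as_utf8


# Bytes kept verbatim from the UTF-8 sender ("Tragoedie" with o-umlaut).
quoted = "Trag\u00f6die".encode("utf-8")
# Text added by the forwarding iso-8859-1 MUA ("Cafe" with e-acute).
comment = "Caf\u00e9".encode("iso-8859-1")
body = quoted + b"\n" + comment

as_latin1, as_utf8 = decode_both_ways(body)
# as_latin1: the quoted part shows two characters where one was meant.
# as_utf8:   the quoted part is fine, the comment has a replacement mark.
```

Whichever single label the final MUA picks for the whole document, one of the two parts comes out mangled.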

What I find even funnier is when, for no apparent reason, the same MUA is used
on both ends and the contents get mangled because the sender copies a portion
of text from somewhere else.

Anyway, I don't want to follow up on this thread, it's *highly* off-topic here.

Cheers,
Willy



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Adrian Bunk
On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote:
> On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > > 
> > > > On Jan 7 2007 17:06, Russell King wrote:
> > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > > >
> > > > >$ git log | head -n 1000 | tail -n 200 > o
> > > > >$ file -i o
> > > > >o: text/plain; charset=us-ascii
> > > > >$ git log | head -n 1000 | tail -n 300 > o
> > > > >$ file -i o
> > > > >o: text/plain; charset=us-ascii
> > > > >$ git log | head -n 1000 | tail -n 400 > o
> > > > >$ file -i o
> > > > >o: text/plain; charset=utf-8
> > > > 
> > > > I am inclined to say that "file" does not count, because it tries to 
> > > > guess an
> > > > ambiguous mapping from bytes to character set. Even more, file should be
> > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or 
> > > > worse: 15)
> > > > file. This program is soo... forget it, it's not an argument. It works 
> > > > well for
> > > > headerful files, but text files don't really contain one. The next best 
> > > > thing
> > > > > would be html, with a proper <meta> tag.
> > > 
> > > The stupidity from the start up with those character sets is that they
> > > consider that a whole file is written with a given set. In fact, the
> > > charset should apply to characters themselves. At least, the
> > > quoted-printable, non-human friendly, encoding was the least stupid.
> > 
> > I doubt doing this would really be worth the effort.
> > 
> > In the 21st century, people should simply use UTF-8.
> > 
> > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > > and even mailers which correctly support UTF8 and use it by default manage
> > > to shoot themselves in the foot when they reply to, or forward a mail. The
> > > system is completely broken because limited by design, and we have to 
> > > learn
> > > to live with this brokenness.
> > 
> > Only if MUAs have broken charset support or don't set a correct 
> > "charset" header in the mails they are sending.
> > 
> > If some software still can't handle UTF-8 correctly more than 10 years 
> > after it was introduced, that's not a brokenness you can blame on UTF-8.
> 
> I'm not blaming UTF-8 per se, but people who still believe in encoding
> *whole documents*. Copy-paste, text insertion, git output, etc... everything
> has a good reason not to be in the same encoding as what your MUA believes.

How would you do this technically in a way that it's significantly 
easier than simply finishing the UTF-8 transition?

> If major MUAs still have problems with UTF-8 10 years after it was introduced,
> it's clearly the proof of a flaw in the initial design. And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

The only major MUA not supporting UTF-8 is Eudora.

And if you are talking about buggy old pine, in the latest development 
version [1] it does not only become open source, it also got some 
working Unicode support.

> Willy

cu
Adrian

[1] Alpine

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Willy Tarreau
On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > > 
> > > On Jan 7 2007 17:06, Russell King wrote:
> > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > >
> > > >$ git log | head -n 1000 | tail -n 200 > o
> > > >$ file -i o
> > > >o: text/plain; charset=us-ascii
> > > >$ git log | head -n 1000 | tail -n 300 > o
> > > >$ file -i o
> > > >o: text/plain; charset=us-ascii
> > > >$ git log | head -n 1000 | tail -n 400 > o
> > > >$ file -i o
> > > >o: text/plain; charset=utf-8
> > > 
> > > I am inclined to say that "file" does not count, because it tries to 
> > > guess an
> > > ambiguous mapping from bytes to character set. Even more, file should be
> > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or 
> > > worse: 15)
> > > file. This program is soo... forget it, it's not an argument. It works 
> > > well for
> > > headerful files, but text files don't really contain one. The next best 
> > > thing
> > > > would be html, with a proper <meta> tag.
> > 
> > The stupidity from the start up with those character sets is that they
> > consider that a whole file is written with a given set. In fact, the
> > charset should apply to characters themselves. At least, the
> > quoted-printable, non-human friendly, encoding was the least stupid.
> 
> I doubt doing this would really be worth the effort.
> 
> In the 21st century, people should simply use UTF-8.
> 
> > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > and even mailers which correctly support UTF8 and use it by default manage
> > to shoot themselves in the foot when they reply to, or forward a mail. The
> > system is completely broken because limited by design, and we have to learn
> > to live with this brokenness.
> 
> Only if MUAs have broken charset support or don't set a correct 
> "charset" header in the mails they are sending.
> 
> If some software still can't handle UTF-8 correctly more than 10 years 
> after it was introduced, that's not a brokenness you can blame on UTF-8.

I'm not blaming UTF-8 per se, but people who still believe in encoding
*whole documents*. Copy-paste, text insertion, git output, etc... everything
has a good reason not to be in the same encoding as what your MUA believes.
If major MUAs still have problems with UTF-8 10 years after it was introduced,
it's clearly the proof of a flaw in the initial design. And I'm not even
discussing the stupidity which requires that you read a whole text to get
its number of characters !
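
To illustrate the point about character counts: in UTF-8 the byte length and the number of characters differ, and getting the latter means scanning the whole buffer. A minimal sketch, assuming the input is already valid UTF-8; the function name `count_code_points` is made up for this example.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 data by scanning every byte once."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a new
    # code point, so the count needs one pass over the whole buffer.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "Rafa\u0142 Bilski"           # U+0142 is the Polish l with stroke
utf8 = text.encode("utf-8")
latin2 = text.encode("iso-8859-2")

print(len(utf8), count_code_points(utf8))  # byte length vs. character count
print(len(latin2))                         # single-byte charset: bytes == characters
```

With a single-byte charset the length is known from the byte count alone, which is the trade-off being complained about here.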

Willy



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Adrian Bunk
On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > 
> > On Jan 7 2007 17:06, Russell King wrote:
> > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > >
> > >$ git log | head -n 1000 | tail -n 200 > o
> > >$ file -i o
> > >o: text/plain; charset=us-ascii
> > >$ git log | head -n 1000 | tail -n 300 > o
> > >$ file -i o
> > >o: text/plain; charset=us-ascii
> > >$ git log | head -n 1000 | tail -n 400 > o
> > >$ file -i o
> > >o: text/plain; charset=utf-8
> > 
> > I am inclined to say that "file" does not count, because it tries to guess
> > an ambiguous mapping from bytes to character set. Even more, file should be
> > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse:
> > 15) file. This program is soo... forget it, it's not an argument. It works
> > well for headerful files, but text files don't really contain one. The next
> > best thing would be html, with a proper <meta> tag.
> 
> The stupidity from the start up with those character sets is that they
> consider that a whole file is written with a given set. In fact, the
> charset should apply to characters themselves. At least, the
> quoted-printable, non-human friendly, encoding was the least stupid.

I doubt doing this would really be worth the effort.

In the 21st century, people should simply use UTF-8.

> Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> and even mailers which correctly support UTF8 and use it by default manage
> to shoot themselves in the foot when they reply to, or forward a mail. The
> system is completely broken because limited by design, and we have to learn
> to live with this brokenness.

Only if MUAs have broken charset support or don't set a correct 
"charset" header in the mails they are sending.

If some software still can't handle UTF-8 correctly more than 10 years 
after it was introduced, that's not a brokenness you can blame on UTF-8.

> Willy

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Alan
> >The kernel maintainers/help/config pretty consistently use UTF8
> 
> I've seen a lot of places that don't do so. Want a patch?

I think that would be a good idea - and add it to the coding/docs specs
that documentation is UTF-8. Code should IMHO say 7bit though.

Alan


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Xavier Bestel
On Sunday 07 January 2007 at 21:40 +0100, Jan Engelhardt wrote:
> >On Sun, 7 Jan 2007 15:05:53 -0500
> >Dave Jones <[EMAIL PROTECTED]> wrote:
> >
> >> If there's something I should be doing when I commit that I'm not,
> >> I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
> >> which should DTRT to the best of my knowledge, but clearly, that isn't
> >> the case.
> 
> No, LC_CTYPE defines what charset you use. (I may be wrong, though.)

IIRC LANG is a superset for all LC_* - i.e. if only LANG is defined, it
sets all your locales, but you can individually set the charset, numeric
format, date format, etc.
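
The fallback order described here can be sketched as a small lookup. This is an illustration under the usual POSIX precedence (LC_ALL first, then the category variable, then LANG), not a replacement for what the C library does; `effective_ctype` is a hypothetical helper, and the final "C" default is an assumption since the default is implementation-defined.

```python
def effective_ctype(env: dict) -> str:
    """Return the locale name that would govern the character set."""
    # LC_ALL overrides everything, LC_CTYPE overrides LANG, and LANG is the
    # fallback for every category that is not set individually.
    # Empty values are treated as unset.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # assumed default when nothing is set


print(effective_ctype({"LANG": "en_US.UTF-8"}))
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_CTYPE": "en_GB.ISO-8859-1"}))
```

So a LANG of en_US.UTF-8 does select UTF-8 for LC_CTYPE unless LC_CTYPE or LC_ALL says otherwise.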

Xav




Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> 
> On Jan 7 2007 17:06, Russell King wrote:
> >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> >
> >$ git log | head -n 1000 | tail -n 200 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 300 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 400 > o
> >$ file -i o
> >o: text/plain; charset=utf-8
> 
> I am inclined to say that "file" does not count, because it tries to guess an
> ambiguous mapping from bytes to character set. Even more, file should be
> _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> file. This program is soo... forget it, it's not an argument. It works well
> for headerful files, but text files don't really contain one. The next best
> thing would be html, with a proper <meta> tag.

The stupidity from the start up with those character sets is that they
consider that a whole file is written with a given set. In fact, the
charset should apply to characters themselves. At least, the
quoted-printable, non-human friendly, encoding was the least stupid.
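
Mail headers already have a mechanism that labels text fragment by fragment rather than per document: MIME encoded words (RFC 2047), which use a quoted-printable style encoding and name the charset inside each word. Whether this is exactly what is meant above is an assumption. The sketch below decodes a made-up header with Python's `email.header.decode_header`; `labelled_fragments` is a hypothetical helper.

```python
from email.header import decode_header


def labelled_fragments(header: str) -> list:
    """Decode each fragment of a header using the charset attached to it."""
    out = []
    for raw, charset in decode_header(header):
        if isinstance(raw, bytes):
            # Fragments without a label are plain ASCII in a well-formed header.
            raw = raw.decode(charset or "ascii", errors="replace")
        out.append((raw, charset))
    return out


# Hypothetical header: the same name, once in ISO-8859-2 and once in UTF-8.
header = "=?iso-8859-2?Q?Rafa=B3?= =?utf-8?Q?Rafa=C5=82?="
for text, charset in labelled_fragments(header):
    # ascii() keeps the output printable whatever the terminal charset is.
    print(charset, ascii(text))
```

Both words decode to the same name even though their bytes differ, because each one carries its own label.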

Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
and even mailers which correctly support UTF8 and use it by default manage
to shoot themselves in the foot when they reply to, or forward a mail. The
system is completely broken because limited by design, and we have to learn
to live with this brokenness.

Willy



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Jan Engelhardt

>On Sun, 7 Jan 2007 15:05:53 -0500
>Dave Jones <[EMAIL PROTECTED]> wrote:
>
>> If there's something I should be doing when I commit that I'm not,
>> I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
>> which should DTRT to the best of my knowledge, but clearly, that isn't
>> the case.

No, LC_CTYPE defines what charset you use. (I may be wrong, though.)


-`J'
-- 


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Robin Rosenberg
On Sunday 07 January 2007 20:17, Russell King wrote:
[...]
> clearly not UTF-8.  I doubt whether any of the commits I do on my
> en_GB ISO-8859-1 systems end up being UTF-8 encoded.

They don't. Git doesn't convert, with the exception of two mail-related tools,
which is the reason the commit being discussed ended up as UTF-8
in GIT. The mail containing the patch was in ISO-8859-1. All other git tools
just store whatever byte sequence they are fed, be it ISO-latin, UTF-8 or
something (to westerners) more exotic.
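
The difference between the two paths can be modelled in a few lines. This is only a sketch of the behaviour described above, not git's code; the function names and the sample payload are invented for the illustration.

```python
def to_utf8(payload: bytes, declared_charset: str) -> bytes:
    """Mail path: re-encode using the charset the mail declares."""
    return payload.decode(declared_charset).encode("utf-8")


def store_verbatim(payload: bytes) -> bytes:
    """Every other path: keep whatever byte sequence was fed in."""
    return payload


payload = b"na\xefve"                    # hypothetical ISO-8859-1 text
print(to_utf8(payload, "iso-8859-1"))    # converted, now two bytes for the accented letter
print(store_verbatim(payload))           # unchanged, still a single 0xef byte
```

The conversion is only as good as the declared charset: a mail that mislabels its payload is transcoded into the wrong characters.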

-- robin



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Sean
On Sun, 7 Jan 2007 15:05:53 -0500
Dave Jones <[EMAIL PROTECTED]> wrote:

Including the Git list...

> On Sun, Jan 07, 2007 at 07:17:30PM +, Russell King wrote:
> 
>  > commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
>  > tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
>  > parent 264166e604a7e14c278e31cadd1afb06a7d51a11
>  > author Rafa³ Bilski <[EMAIL PROTECTED]> 1167691774 +0100
>  > committer Dave Jones <[EMAIL PROTECTED]> 1167799119 -0500
>  > 
>  > and looking at that "author" closer with od:
>  > 
>  > 140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
>  >   t   h   o   r   R   a   f   a   ³   B   i   l   s   k
>  > 
>  > clearly not UTF-8.  I doubt whether any of the commits I do on my
>  > en_GB ISO-8859-1 systems end up being UTF-8 encoded.
> 
> This has been bugging me for a while.
> Viewing the mail I applied in mutt shows his name correctly as Rafał
> Applying it with git-applymbox and viewing the log on master.kernel.org
> with git log shows Rafa   And then later when put into email
> it turns into Rafa³
> 
>  > But the point is there is charset damage which has happened _long_ before
>  > Linus' action.  There is no character set defined for the contents of git
>  > repositories, and as such the output of the git tools can not be
>  > interpreted as any one single character set.
> 
> If there's something I should be doing when I commit that I'm not,
> I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
> which should DTRT to the best of my knowledge, but clearly, that isn't
> the case.



Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Dave Jones
On Sun, Jan 07, 2007 at 07:17:30PM +, Russell King wrote:

 > commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
 > tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
 > parent 264166e604a7e14c278e31cadd1afb06a7d51a11
 > author Rafa³ Bilski <[EMAIL PROTECTED]> 1167691774 +0100
 > committer Dave Jones <[EMAIL PROTECTED]> 1167799119 -0500
 > 
 > and looking at that "author" closer with od:
 > 
 > 140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
 >   t   h   o   r   R   a   f   a   ³   B   i   l   s   k
 > 
 > clearly not UTF-8.  I doubt whether any of the commits I do on my
 > en_GB ISO-8859-1 systems end up being UTF-8 encoded.

This has been bugging me for a while.
Viewing the mail I applied in mutt shows his name correctly as Rafał
Applying it with git-applymbox and viewing the log on master.kernel.org
with git log shows Rafa   And then later when put into email
it turns into Rafa³

 > But the point is there is charset damage which has happened _long_ before
 > Linus' action.  There is no character set defined for the contents of git
 > repositories, and as such the output of the git tools can not be
 > interpreted as any one single character set.

If there's something I should be doing when I commit that I'm not,
I'll be happy to change my scripts.  My $LANG is set to en_US.UTF-8
which should DTRT to the best of my knowledge, but clearly, that isn't
the case.

Dave

-- 
http://www.codemonkey.org.uk


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Russell King
On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> 
> On Jan 7 2007 17:06, Russell King wrote:
> >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> >
> >$ git log | head -n 1000 | tail -n 200 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 300 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 400 > o
> >$ file -i o
> >o: text/plain; charset=utf-8
> 
> I am inclined to say that "file" does not count, because it tries to guess an
> ambiguous mapping from bytes to character set. Even more, file should be
> _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> file. This program is soo... forget it, it's not an argument. It works well
> for headerful files, but text files don't really contain one. The next best
> thing would be html, with a proper <meta> tag.

You're discarding a perfectly reasonable argument - file itself obviously
is not good at guessing the charset, but inspecting the resulting file
manually and identifying *both* ISO-8859 and UTF-8 character sequences
in there is pretty conclusive.  As I did indeed do prior to sending
that message.

In this case, 'file' was doing a remarkably accurate job.
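
The manual inspection can be approximated mechanically: decode as UTF-8 and count, separately, the multi-byte characters that decode cleanly and the high bytes that do not. A minimal sketch using Python's `surrogateescape` error handler; the sample bytes are invented, with the 0xb3 byte modelled on the od dump elsewhere in this thread.

```python
def inspect_mixed(data: bytes) -> dict:
    """Count valid non-ASCII UTF-8 characters and stray 8-bit bytes."""
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN instead of
    # failing, so both kinds of content survive in one decoded string.
    text = data.decode("utf-8", errors="surrogateescape")
    stray = sum(1 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)
    non_ascii = sum(1 for ch in text if ord(ch) > 0x7F)
    return {"utf8_chars": non_ascii - stray, "stray_bytes": stray}


sample = b"author Rafa\xb3 Bilski\nauthor Rafa\xc5\x82 Bilski\n"
print(inspect_mixed(sample))
```

When both counts are non-zero, the stream holds UTF-8 sequences and some other 8-bit encoding side by side.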

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Russell King
On Sun, Jan 07, 2007 at 06:21:51PM +, Alan wrote:
> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled.  If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
> 
> Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode.

The same is true of ISO-8859-1.

> It's just old broken 8bit encodings that are problematic.
> 
> The kernel maintainers/help/config pretty consistently use UTF8

As I've tried to point out, that's not universally true.  For instance:

commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
parent 264166e604a7e14c278e31cadd1afb06a7d51a11
author Rafa³ Bilski <[EMAIL PROTECTED]> 1167691774 +0100
committer Dave Jones <[EMAIL PROTECTED]> 1167799119 -0500

and looking at that "author" closer with od:

140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
  t   h   o   r   R   a   f   a   ³   B   i   l   s   k

clearly not UTF-8.  I doubt whether any of the commits I do on my
en_GB ISO-8859-1 systems end up being UTF-8 encoded.
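
The od bytes above can be checked directly. A short sketch that takes the hex values from the dump and asks a strict UTF-8 decoder where it gives up; reading the name as ISO-8859-2 is an assumption about what the author's mail used, and `first_invalid_utf8` is a hypothetical helper.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that breaks UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


raw = bytes.fromhex("74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b")
offset = first_invalid_utf8(raw)
print(offset, hex(raw[offset]))          # 0xb3 is a continuation byte with no lead byte

# Assumption: the original text was ISO-8859-2, where 0xb3 is U+0142.
print(ascii(raw.decode("iso-8859-2")))
```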

And _this_ is the problem when it comes to generating the logs,
irrespective of whether or not Linus loads UTF-8 data into an
ISO-8859-1 message.  For all we know, Linus' system could be using
an ISO-8859 charset rather than UTF-8.

But the point is there is charset damage which has happened _long_ before
Linus' action.  There is no character set defined for the contents of git
repositories, and as such the output of the git tools can not be
interpreted as any one single character set.

All that UTF-8 has done is added to the "which charset is this data"
problem rather than actually solving any proper real life problem.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Jan Engelhardt

On Jan 7 2007 18:21, Alan wrote:
>
>> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
>> is UTF-8 enabled.  If you're operating in a mixed charset environment
>> it's one bloody big pain in the butt.
>
>Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
>broken 8bit encodings that are problematic.
>
>The kernel maintainers/help/config pretty consistently use UTF8

I've seen a lot of places that don't do so. Want a patch?


-`J'
-- 


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Jan Engelhardt

On Jan 7 2007 17:06, Russell King wrote:
>On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
>
>$ git log | head -n 1000 | tail -n 200 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 300 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 400 > o
>$ file -i o
>o: text/plain; charset=utf-8

I am inclined to say that "file" does not count, because it tries to guess an
ambiguous mapping from bytes to character set. Even more, file should be
_unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
file. This program is soo... forget it, it's not an argument. It works well for
headerful files, but text files don't really contain one. The next best thing
would be html, with a proper <meta> tag.
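
Why no tool can separate these charsets by validity alone: each of them assigns a character to all 256 byte values, so every byte string is "valid" in all three and only the meaning of some high bytes changes. A small sketch using Python's built-in codecs:

```python
all_bytes = bytes(range(256))

# None of these decodes can fail: all three charsets define every byte value.
as_latin1 = all_bytes.decode("iso-8859-1")
as_latin2 = all_bytes.decode("iso-8859-2")
as_latin9 = all_bytes.decode("iso-8859-15")

differing_1_2 = [b for b in range(256) if as_latin1[b] != as_latin2[b]]
differing_1_15 = [b for b in range(256) if as_latin1[b] != as_latin9[b]]

print(len(differing_1_2), "byte values differ between iso-8859-1 and iso-8859-2")
print([hex(b) for b in differing_1_15])  # the few positions iso-8859-15 changed
```

Any guess between them therefore has to rest on statistics or dictionaries, never on the bytes being well-formed.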


-`J'
-- 


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Alan
> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> is UTF-8 enabled.  If you're operating in a mixed charset environment
> it's one bloody big pain in the butt.

Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
broken 8bit encodings that are problematic.
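
The 1:1 mapping is easy to demonstrate: 7-bit text produces identical bytes under ASCII, UTF-8 and ISO-8859-1, and the encodings only diverge once a character outside ASCII appears. A brief sketch; the sample strings are arbitrary.

```python
ascii_text = "Signed-off-by: Dave Jones"
ascii_forms = {ascii_text.encode(enc) for enc in ("ascii", "utf-8", "iso-8859-1")}
print(len(ascii_forms))                  # number of distinct byte strings

accented = "Rafa\u0142"                  # one character outside ASCII
print(accented.encode("utf-8"))
print(accented.encode("iso-8859-2"))
```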

The kernel maintainers/help/config pretty consistently use UTF8


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Russell King
On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 15:38 +, Russell King wrote:
> > When a text file is stored on disk, there's no way to tell what
> > character set the characters in that file belong to.  As a result,
> > ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> > UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
> > utter confusion.
> 
> Only if you are making different assumptions about the _same_ set of
> files, on the _same_ system. But that would be silly.

$ git log | head -n 1000 | tail -n 200 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 300 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 400 > o
$ file -i o
o: text/plain; charset=utf-8

(and you know what charset the file is thought to have with all 1000
lines in it.)
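
The way the verdict shifts with the amount of input can be reproduced with a crude whole-buffer guesser. This is a simplified heuristic written for this illustration, NOT the algorithm file(1) actually uses, and the sample lines are invented.

```python
def guess_charset(data: bytes) -> str:
    """Crude whole-buffer guess: ASCII, else UTF-8, else an 8-bit fallback."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"  # fallback label; any 8-bit charset would fit


lines = [b"plain ascii line", b"Rafa\xc5\x82 (UTF-8)", b"Rafa\xb3 (8-bit)"]
guesses = [guess_charset(b"\n".join(lines[:n])) for n in range(1, len(lines) + 1)]
print(guesses)
```

A single stray byte anywhere in the window is enough to flip the label for the whole buffer, which is why the three windows above get different answers.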

All on a system with LANG set to en_GB (iow ISO-8859-1).

> > To see what I mean, try the following:
> > 
> > $ git log | head -n 1000 > o
> > $ file -i o
> > o: text/x-c; charset=iso-8859-1
> > 
> > According to that, the charset of the 'git log' output (which on that
> > test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> > was right to include it as ISO-8859-1.
> 
> Yes. When you stored it on disk, the character set information was lost.

The same thing actually happens when I look at it via:

  $ git log | head -n 1000 | less

but in this case the output is always interpreted by the terminal to be
in its character set.

> If you were running a mixed-charset system then attempting to recreate
> the lost information with heuristics and assumptions is obviously going
> to be problematic.

I'm not - I'm running a pure ISO-8859-1 system:

$ echo $LANG
en_GB
$ locale -k LC_CTYPE | grep charmap
charmap="ISO-8859-1"

> Actually, because UTF-8 allows me to run a system which is purely based
> on a single character set, I get better results when I try the same
> trick:
>   shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
>   shinybook /shiny/git/mtd-2.6 $ file -i o
>   o: text/plain; charset=utf-8

$ LANG=en_GB.UTF-8 locale -k LC_CTYPE | grep charmap
charmap="UTF-8"
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB.UTF-8 file -i o
o: text/x-c; charset=iso-8859-1
$ git version
git version 1.4.4.2

Looks like the output is iso-8859-1 even with UTF-8!

> > In reality, the output from git log contains an ad-hoc collection of
> > character sets making its interpretation under any one character set
> > incorrect.
> 
> No, the contents of the git log ought to be UTF-8, unless people have
> been misusing it. Git stores its text in UTF-8 (by default), and is
> capable of converting to and from legacy character sets on input
> (git-commit) and output (git-log).

Git may store its text internally in UTF-8 (I don't know but I have no
evidence to suggest it does - in fact I have some evidence in this test
that it doesn't care about charsets.)  git log output on a non-UTF-8
system certainly is not in the host's character set.  For example:

$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB git log | head -n 1000 > o2
$ diff -u o o2

That includes the UTF-8 encoded part of Leonard's name.  It also includes
Rafa? Bilski's name which is non-UTF-8 encoded.

So, in both cases, exactly the same output bytestream was created
independent of the character set _actually_ being used, which both
includes untranslated UTF-8 and non-UTF-8 sequences.

There is obviously no character set translation going on with the output.
So we can add 'git' to my list of charset-broken programs.

Also, since we have recent data in the git repository which is non-UTF-8
as well, it is clear that there is no character set translation going on
at input time either.

Looking at the git-commit script, there appears to be no character set
conversion going on in there either.

So, I think you'll find that the contents of git _is_ an ad-hoc collection
of character sets which people happen to have in use on their machines.

> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled.  If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
> 
> A mixed charset environment was _already_ a pain in the butt, because
> almost nobody got labelling right. It's wrong to blame that on UTF-8.

I'm not talking about a mixed charset environment.  I'm talking about
non-UTF-8 single charset environments being broken by programs which
universally think the universe is UTF-8 only.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread David Woodhouse
On Sun, 2007-01-07 at 15:38 +, Russell King wrote:
> On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> > On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > > Russell King wrote:
> > > > Welcome to the mess which the UTF-8 charset creates.
> > 
> > Utter bollocks.
> 
> Wrong.  The problem is partly caused by not everything understanding
> multi-byte character encodings, 

No, that's a different problem; not the one you were referring to above.
And it's a problem which is rapidly diminishing, too.

> and text files containing absolutely
> _no_ information about their character encodings.

That's a real problem, yes -- but it was a problem long before UTF-8 was
added to the collection of character sets in use. Even within the UK, we
had to choose between ISO8859-1 and ISO8859-15.

> When a text file is stored on disk, there's no way to tell what
> character set the characters in that file belong to.  As a result,
> ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
> utter confusion.

Only if you are making different assumptions about the _same_ set of
files, on the _same_ system. But that would be silly.

If I suddenly "assume" that my laptop has a Dvorak keyboard layout
despite that blatantly not being true, I'll get the same kind of
confusion. That isn't Dvorak's fault, either.

If, on the other hand, I have one system which is entirely ISO8859-1 and
a separate system which is entirely UTF-8, each of those is _fine_ and
unconfusing. Obviously I have to make sure files are properly labelled
and converted in transport between different systems -- but that's
nothing new.

> To see what I mean, try the following:
> 
> $ git log | head -n 1000 > o
> $ file -i o
> o: text/x-c; charset=iso-8859-1
> 
> According to that, the charset of the 'git log' output (which on that
> test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> was right to include it as ISO-8859-1.

Yes. When you stored it on disk, the character set information was lost.
If you were running a mixed-charset system then attempting to recreate
the lost information with heuristics and assumptions is obviously going
to be problematic.

Actually, because UTF-8 allows me to run a system which is purely based
on a single character set, I get better results when I try the same
trick:
shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
shinybook /shiny/git/mtd-2.6 $ file -i o
o: text/plain; charset=utf-8

Again, the problem of labelling isn't at all new to UTF-8. The only
thing that's new with UTF-8 is that it's now actually _practical_ to
have a system which only uses one character set throughout, and which
thus _can_ get its 'guess' right when you don't bother to label
everything.

> In reality, the output from git log contains an ad-hoc collection of
> character sets making its interpretation under any one character set
> incorrect.

No, the contents of the git log ought to be UTF-8, unless people have
been misusing it. Git stores its text in UTF-8 (by default), and is
capable of converting to and from legacy character sets on input
(git-commit) and output (git-log).

(Obviously, that's likely to be lossy if you convert it to any given
legacy character set, because ∀ legacy character set, ∃ characters
within UTF-8 that aren't in that legacy character set.)
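
The lossiness is simple to show: a character survives conversion only if the target charset happens to contain it. A minimal sketch using Python's `errors="replace"` handler, which substitutes "?" for anything the target lacks; `to_legacy` is a hypothetical helper.

```python
def to_legacy(text: str, charset: str) -> bytes:
    """Convert to a legacy charset, replacing unrepresentable characters."""
    return text.encode(charset, errors="replace")


name = "Rafa\u0142 Bilski"
print(to_legacy(name, "iso-8859-2"))     # the character exists in this charset
print(to_legacy(name, "iso-8859-1"))     # it does not exist here, so it is lost
```

Once the "?" has been written there is no way back to the original character.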
 
> > Far from being the cause of the problem, UTF-8 actually offers the
> > chance of a _solution_. Because once the Luddites catch up, it'll
> > largely eliminate the need for using the multitude of legacy character
> > sets and converting between them -- and the problem of mislabelling will
> > fairly much go away.
> 
> In other words, the UTF-8 luddites require the entire Internet to
> upgrade to UTF-8 for UTF-8 to work properly.

Not at all. The problems arise when character set information is lost,
which can happen at any point during the flow of information.

Anything we can do to reduce the likelihood of charset information being
lost is an overall improvement. We already demonstrated an example
(git-log > o; file -i o) of a case where a _consistent_ system gets it
right, while an inconsistent system introduces an error.

If any individual system processes all text in a single character set,
then that system is no longer a likely source of corruption due to
labelling errors. And because UTF-8 fully covers the set of characters
which can be represented in the legacy character sets, it allows us to
deploy systems which do just that.
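
The coverage claim can be spot-checked for a few legacy charsets by pushing every possible byte value through UTF-8 and back. A sketch only, and limited to the three charsets named in this thread; `survives_utf8_round_trip` is a hypothetical helper.

```python
def survives_utf8_round_trip(charset: str) -> bool:
    """Check that legacy -> UTF-8 -> legacy returns the original bytes."""
    original = bytes(range(256))
    via_utf8 = original.decode(charset).encode("utf-8")
    return via_utf8.decode("utf-8").encode(charset) == original


for charset in ("iso-8859-1", "iso-8859-2", "iso-8859-15"):
    print(charset, survives_utf8_round_trip(charset))
```

The conversion is lossless in this direction, provided the source charset is known; the trouble discussed in the thread is entirely about not knowing it.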

> I _regularly_ struggle with idiotic programs that assume that the world
> is UTF-8 and nothing else. 

I don't think I've encountered such a program in my distribution of
choice. If I had, I would have filed a bug. Making assumptions about
character sets, outside of the locally-controlled environment, is
invalid. That's been true since the first 8-bit character sets, if not
longer.

> So, in short,

Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread Russell King
On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > Russell King wrote:
> > > Welcome to the mess which the UTF-8 charset creates.
> 
> Utter bollocks.

Wrong.  The problem is partly caused by not everything understanding
multi-byte character encodings, and text files containing absolutely
_no_ information about their character encodings.

When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to.  As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
utter confusion.

To see what I mean, try the following:

$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1

According to that, the charset of the 'git log' output (which on that
test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
was right to include it as ISO-8859-1.

In reality, the output from git log contains an ad-hoc collection of
character sets making its interpretation under any one character set
incorrect.

> > The problem of different character encodings coexisting on the same
> > platform, and the resulting occasional messing-up, far predates Unicode.
> > I distinctly remember one case of being bitten by this myself in 1977
> > when Unicode wasn't even on the horizon yet, and I don't think that was
> > the first time.
> 
> Indeed. If you take arbitrary content and send it out to the world
> labelled as ISO8859-1, of _course_ you're likely to be corrupting it.
> 
> Far from being the cause of the problem, UTF-8 actually offers the
> chance of a _solution_. Because once the Luddites catch up, it'll
> largely eliminate the need for using the multitude of legacy character
> sets and converting between them -- and the problem of mislabelling will
> fairly much go away.

In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.

I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else.  UTF-8 does _not_ solve these inter-operability
problems - it only makes the entire situation worse by introducing yet
another different charset.  (Yes, it's also true that there are programs
which assume the world is only another, different, character set.)

Rather than having these problems fixed properly (by looking at the LANG
environment variable) many of these programs now assume that the world
is UTF-8.  It isn't.

elinks is one such program.  It now assumes UTF-8 _only_ displays.
That's no better than programs which assume ISO-8859-1 only or US-ASCII
only.

So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled.  If you're operating in a mixed charset environment
it's one bloody big pain in the butt.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: OT: character encodings (was: Linux 2.6.20-rc4)

2007-01-07 Thread David Woodhouse
On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> Russell King wrote:
> > Welcome to the mess which the UTF-8 charset creates.

Utter bollocks.

> The problem of different character encodings coexisting on the same
> platform, and the resulting occasional messing-up, far predates Unicode.
> I distinctly remember one case of being bitten by this myself in 1977
> when Unicode wasn't even on the horizon yet, and I don't think that was
> the first time.

Indeed. If you take arbitrary content and send it out to the world
labelled as ISO8859-1, of _course_ you're likely to be corrupting it.

Far from being the cause of the problem, UTF-8 actually offers the
chance of a _solution_. Because once the Luddites catch up, it'll
largely eliminate the need for using the multitude of legacy character
sets and converting between them -- and the problem of mislabelling will
fairly much go away.

-- 
dwmw2
