Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Jan 8 2007 14:17, Tim Pepper wrote:
> On 1/8/07, Pavel Machek <[EMAIL PROTECTED]> wrote:
>> On Sun 2007-01-07 22:30:55, Alan wrote:
>> > I think that would be a good idea - and add it to the coding/docs specs
>> > that documentation is UTF-8. Code should IMHO say 7bit though.
>>
>> Yes, yes, please.
>>
>> I have been flamed when someone tried to do 8bit patch, and I was
>> trying to NAK it...
>
> Could this get put in Documentation/CodingStyle?

Someone do that.

> And an item added to the kernel janitors' list to fix up 8bit files?

That's already been just done by me.
http://lkml.org/lkml/2007/1/8/222

> Last I looked trying to decided if there was a standard here I found a
> mish-mash of encodings based output of file vs Linus' git tree.

-`J'
--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: OT: character encodings (was: Linux 2.6.20-rc4)
Hi,

On Tue, 9 Jan 2007, Jan Engelhardt wrote:
> On Jan 8 2007 22:00, Ken Moffat wrote:
>
> > Looks nicely done, but I query the postal address changes in
> >Documentation/cdrom/sbpcd - that seems to be a change of address
> >(without anything to explain it).
>
> Eberhard [cc], please attach an Acked-by: YourName
> keep Ccs, thanks ;-)
>
> [thread/patch: http://lkml.org/lkml/2007/1/8/222 ]

Acked-by: Eberhard Moenkeberg <[EMAIL PROTECTED]>

Jan had contacted me before, and I had sent him my new address data.
This very young guy is doing a really good job. ;-))

Cheers -e
--
Eberhard Moenkeberg ([EMAIL PROTECTED], [EMAIL PROTECTED])
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Jan 8 2007 22:00, Ken Moffat wrote:
> Looks nicely done, but I query the postal address changes in
>Documentation/cdrom/sbpcd - that seems to be a change of address
>(without anything to explain it).

Eberhard [cc], please attach an Acked-by: YourName
keep Ccs, thanks ;-)

[thread/patch: http://lkml.org/lkml/2007/1/8/222 ]

-`J'
--
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On 1/8/07, Pavel Machek <[EMAIL PROTECTED]> wrote:
> On Sun 2007-01-07 22:30:55, Alan wrote:
> > I think that would be a good idea - and add it to the coding/docs specs
> > that documentation is UTF-8. Code should IMHO say 7bit though.
>
> Yes, yes, please.
>
> I have been flamed when someone tried to do 8bit patch, and I was
> trying to NAK it...

Could this get put in Documentation/CodingStyle? And an item added to
the kernel janitors' list to fix up 8bit files? Last I looked, trying
to decide if there was a standard here, I found a mish-mash of
encodings based on the output of file vs Linus' git tree.
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, Jan 08, 2007 at 09:17:06PM +0100, Jan Engelhardt wrote:
>
> On Jan 8 2007 02:22, Jan Engelhardt wrote:
> >On Jan 7 2007 22:30, Alan wrote:
> >>
> >>> >The kernel maintainers/help/config pretty consistently use UTF8
> >>>
> >>> I've seen a lot of places that don't do so. Want a patch?
> >>
> >>I think that would be a good idea - and add it to the coding/docs specs
> >>that documentation is UTF-8. Code should IMHO say 7bit though.
>
> Most memorable issues:
>
>  * "don´t" (standalone accent aigu) rather than "don't" (apostrophe)
>  * "", non breaking spaces
>  * cp437 encoding in some files (heh, heh, DOS!)
>  * iso8859-1/utf-8 mixed in some files

Looks nicely done, but I query the postal address changes in
Documentation/cdrom/sbpcd - that seems to be a change of address
(without anything to explain it). Everything else seems to be just
character-set conversion or the occasional translation of comments
into English. (And no, I didn't attempt to review the character-set
changes, even if there is an occasional error it will be better than
where we are now, and easy to patch.)

> My compose key is hot now...

I prefer the AltGr dead keys in X (they seem to work more reliably
for me), but I guess I'm straying OT.

> None of you people screw that patch with your buggy MUAs! I'll pack
> it up into a .bz2 to get it marked as application/octet-stream to
> not even give your MUA the chance to. ;-) [and because it's 221 K
> uncompressed and I am not sure if splitting it up makes much sense for
> such 'trivial' changes, or not?]
>
> Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]>
>
> -`J'
> --

Thanks for doing this, I hope it wasn't in vain.

Ken
--
the first time as tragedy, the second time as farce
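The kinds of problems listed in this message (stray 8-bit bytes, cp437 leftovers, iso8859-1 and UTF-8 mixed in one file) can be spotted mechanically. What follows is only a rough sketch and not the script used for the patch in this thread: `classify_line` and `scan` are hypothetical helper names, and anything that is neither pure ASCII nor valid UTF-8 is just labelled 'legacy-8bit' without trying to guess which legacy charset it is.

```python
def classify_line(raw: bytes) -> str:
    """Label one line as 'ascii', 'utf-8' or 'legacy-8bit'."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: some 8-bit charset, but we cannot tell which.
        return "legacy-8bit"


def scan(data: bytes) -> dict:
    """Map each label to the 1-based numbers of the lines carrying it."""
    report = {"ascii": [], "utf-8": [], "legacy-8bit": []}
    for number, line in enumerate(data.splitlines(), start=1):
        report[classify_line(line)].append(number)
    return report


# Illustrative input: one ASCII line, one UTF-8 line, one ISO-8859-2 line.
sample = b"plain\n" + "Rafa\u0142\n".encode("utf-8") + b"Rafa\xb3\n"
report = scan(sample)
print(report)
```

A file whose report has entries under both 'utf-8' and 'legacy-8bit' is a candidate for the "mixed in some files" case and needs a human to decide what the legacy lines were meant to be.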
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun 2007-01-07 22:30:55, Alan wrote:
> > >The kernel maintainers/help/config pretty consistently use UTF8
> >
> > I've seen a lot of places that don't do so. Want a patch?
>
> I think that would be a good idea - and add it to the coding/docs specs
> that documentation is UTF-8. Code should IMHO say 7bit though.

Yes, yes, please.

I have been flamed when someone tried to do 8bit patch, and I was
trying to NAK it...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, 08 Jan 2007 01:38:57 +0100, Willy Tarreau said:

> it's clearly the proof of a flaw in the initial design. And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

It's no more stupid than the *current* situation with Linux kernel code,
where the stupidity actually requires that even if you know that there are
only 60 characters on a given line, you actually have to look at each one
in order to figure out if the line goes past column 80
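To make the column-80 point concrete: with tab stops every 8 columns (an assumption here, though it is the usual kernel convention), the display width of a line depends on where its tabs fall, not on how many characters it has. A small sketch, with `display_width` as a made-up helper name:

```python
def display_width(line: str, tabsize: int = 8) -> int:
    """Column reached after the last character, expanding tabs to tab stops."""
    return len(line.expandtabs(tabsize))


# 60 characters in total: three leading tabs plus 57 ordinary characters.
line = "\t\t\t" + "x" * 57
print(len(line), display_width(line))
```

The three tabs take the cursor to column 24, so the 60-character line ends at column 81 and is over the limit.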
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, 8 January 2007 11:44, Alan wrote:
>> (case in point: Russel's system. I was ROTFL when he proudly announced he
>> was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
>> the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)
>
> There is no correct UK encoding. You need -14 or -15 depending upon
> language and can come horribly unstuck the moment a name is involved.

Either way it's not iso-8859-1 :)

--
Nicolas Mailhot
Re: OT: character encodings (was: Linux 2.6.20-rc4)
> (case in point: Russel's system. I was ROTFL when he proudly announced he
> was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
> the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)

There is no correct UK encoding. You need -14 or -15 depending upon
language and can come horribly unstuck the moment a name is involved.

Alan
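The "-14 or -15" remark can be checked with the codecs that ship with Python. This is only an illustration of the point, not anything posted in the thread; `encodable` is a hypothetical helper, and the two sample characters (the euro sign and the Welsh w-circumflex) are my own choice of examples.

```python
def encodable(text: str, charset: str) -> bool:
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


CHARSETS = ["iso-8859-1", "iso-8859-14", "iso-8859-15", "utf-8"]
samples = {"euro sign": "\u20ac", "Welsh w-circumflex": "\u0175"}

for label, text in samples.items():
    print(label, [c for c in CHARSETS if encodable(text, c)])
```

No single 8-bit charset in the list covers both characters, so a price in euros next to a Welsh name already forces a choice; UTF-8 covers both.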
Re: OT: character encodings (was: Linux 2.6.20-rc4)
>> How would you do this technically in a way that it's significantely
>> easier than simply finishing the UTF=8 transition?

> In how many decades do you think the transition will be finished ?

Right now it looks like it will be finished way earlier than apps bother
supporting the later 8-bit encodings such as iso-8859-15

(case in point: Russel's system. I was ROTFL when he proudly announced he
was running a full iso-8859-1 system after dissing UTF-8. Last I've seen
the official 8bit EU encoding was iso-8859-15, and UK is part of the EU)

--
Nicolas Mailhot
Re: OT: character encodings (was: Linux 2.6.20-rc4)
> elinks is one such program. It now assumes UTF-8 _only_ displays.
> That's no better than programs which assume ISO-8859-1 only or US-ASCII
> only.

That's way better than programs:
- which assume an encoding you can't write most world languages in
  (BTW ISO-8859-1 & US-ASCII are broken by design for Western Europe since
  at least the Euro creation)
- which perpetuate the myth local 8-bit encodings are manageable
  (they aren't, people spent decades trying to limp along with them,
  unicode & UTF-8 were not created just to make your life miserable)

Show me one program that spurns Unicode and I'll show you one that
"passed on" iso-8859-15 (typically, though it's the easiest
non-iso-8859-1 to do)

The only reason you have the UTF-8 big stick approach nowadays is people
have tried for years to get app writers to manage 8-bit locales properly,
to dismal results. The old system was only working for en_US users (and
perhaps to .uk people)

--
Nicolas Mailhot
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, Jan 08, 2007 at 07:52:48AM +0100, Jan Engelhardt wrote:
>
> On Jan 8 2007 02:03, Adrian Bunk wrote:
> >
> >The only major MUA not supporting UTF-8 is Eudora.
> >
> >And if you are talking about buggy old pine, in the latest development
> >version [1] it does not only become open source, it also got some
> >working Unicode support.
>
> Uhm, just for the record, I run pine 4.61 where my mail delivers to,
> and Unicode works, yes, including the spam.

For some years I've been using pine only as a newsreader, and I remember
some display problems of Unicode characters that are fixed in Alpine.

It might be that the support in pine was already better than I thought
(but my switch to MUA was so many years ago...).

> -`J'

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Jan 8 2007 02:03, Adrian Bunk wrote:
>
>The only major MUA not supporting UTF-8 is Eudora.
>
>And if you are talking about buggy old pine, in the latest development
>version [1] it does not only become open source, it also got some
>working Unicode support.

Uhm, just for the record, I run pine 4.61 where my mail delivers to,
and Unicode works, yes, including the spam.

-`J'
--
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, 2007-01-07 at 15:05 -0500, Dave Jones wrote:
> This has been bugging me for a while.
> Viewing the mail I applied in mutt shows his name correctly as Rafał
> Applying it with git-applymbox and viewing the log on master.kernel.org
> with git log shows Rafa And then later when put into email
> it turns into Rafa³

I believe you need to use the misnamed '-u' option to git-applymbox,
which _really_ ought to be the default behaviour. Otherwise, it fails to
pay any attention to the character set tags in the mail it's decoding --
it commits the sin which rmk was whining about; assuming the input data
is of a given type and ignoring the explicit tags which indicate the
contrary.

The '-u' option is misdocumented as 'causes the resulting commit to be
encoded in utf-8', but in fact I believe it doesn't necessarily do that
-- it actually causes the resulting commit to be encoded in the
configured storage charset for the repository, which just _happens_ to
default to UTF-8 unless otherwise specified.

That is something which should definitely be the _default_ behaviour. We
should make the '-u' behaviour the default, and if anyone really wants
the old behaviour of importing arbitrary data in untagged binary form
overriding its labelling then they can have a separate option which does
that.

--
dwmw2
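The difference between honouring the charset tag and ignoring it can be shown in a few lines. This is a sketch of the general idea only, not git's actual code: `import_text` is a hypothetical function, and I am assuming the mail in question was tagged ISO-8859-2, which would be consistent with the "Rafa³" symptom quoted in this message.

```python
def import_text(raw: bytes, declared_charset: str,
                storage_charset: str = "utf-8") -> bytes:
    """Decode using the charset the mail declares, re-encode for storage."""
    return raw.decode(declared_charset).encode(storage_charset)


# The name as it would travel in a mail tagged charset=iso-8859-2 (assumed).
raw = "Rafa\u0142".encode("iso-8859-2")

# Honouring the tag: the stored bytes are proper UTF-8 for the same name.
stored = import_text(raw, "iso-8859-2")
print(stored.decode("utf-8"))

# Ignoring the tag and assuming ISO-8859-1: the last letter turns into
# a superscript three.
print(raw.decode("iso-8859-1"))
```

The byte for the Polish l-with-stroke in ISO-8859-2 is 0xB3, and 0xB3 in ISO-8859-1 is the superscript three, which is exactly the substitution reported.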
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, Jan 08, 2007 at 02:14:41AM +0100, Willy Tarreau wrote: > On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote: > > On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote: > > > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote: > > > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > > > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > > > > > > > > > On Jan 7 2007 17:06, Russell King wrote: > > > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > > > > > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > > > > > >$ file -i o > > > > > > >o: text/plain; charset=us-ascii > > > > > > >$ git log | head -n 1000 | tail -n 300 > o > > > > > > >$ file -i o > > > > > > >o: text/plain; charset=us-ascii > > > > > > >$ git log | head -n 1000 | tail -n 400 > o > > > > > > >$ file -i o > > > > > > >o: text/plain; charset=utf-8 > > > > > > > > > > > > I am inclined to say that "file" does not count, because it tries > > > > > > to guess an > > > > > > ambiguous mapping from bytes to character set. Even more, file > > > > > > should be > > > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or > > > > > > worse: 15) > > > > > > file. This program is soo... forget it, it's not an argument. It > > > > > > works well for > > > > > > headerful files, but text files don't really contain one. The next > > > > > > best thing > > > > > > would be html, with a proper tag. > > > > > > > > > > The stupidity from the start up with those character sets is that they > > > > > consider that a whole file is written with a given set. In fact, the > > > > > charset should apply to characters themselves. At least, the > > > > > quoted-printable, non-human friendly, encoding was the least stupid. > > > > > > > > I doubt doing this would really be worth the effort. > > > > > > > > In the 21st century, people should simply use UTF-8. 
> > > > > > > > > Now that UTF8 comes everywhere, everyone receives tons of mangled > > > > > mails, > > > > > and even mailers which correctly support UTF8 and use it by default > > > > > manage > > > > > to shoot themselves in the foot when they reply to, or forward a > > > > > mail. The > > > > > system is completely broken because limited by design, and we have to > > > > > learn > > > > > to live with this brokenness. > > > > > > > > Only if MUAs have broken charset support or don't set a correct > > > > "charset" header in the mails they are sending. > > > > > > > > If some software still can't handle UTF-8 correctly more than 10 years > > > > after it was introduced, that's not a brokenness you can blame on UTF-8. > > > > > > I'm not blaming UTF-8 per se, but people who still believe in encoding > > > *whole documents*. Copy-paste, text insertion, git output, etc... > > > everything > > > has a good reason not to be in the same encoding as what your MUA > > > believes. > > > > How would you do this technically in a way that it's significantely > > easier than simply finishing the UTF=8 transition? > > In how many decades do you think the transition will be finished ? > > > > If major MUAs still have problems with UTF-8 10 years after it was > > > introduced, > > > it's clearly the proof of a flaw in the initial design. And I'm not even > > > discussing the stupidity which requires that you read a whole text to get > > > its number of characters ! > > > > The only major MUA not supporting UTF-8 is Eudora. > > > > And if you are talking about buggy old pine, in the latest development > > version [1] it does not only become open source, it also got some > > working Unicode support. > > No, I'm not speaking about "not supporting", but "having problems". Every > one of us has already received mails from Thunderbird, Outlook, Notes, etc... > with erroneously encoded characters because of this : > > - an UTF8 MUA sends a mail to a non-UTF8 aware one. 

"non-UTF8 aware one" = Eudora (BTW: there's no Linux version)

> - this last one only sees double chars. When it wants to forward the mail
>   to someone else, it keeps the chars verbatim, and sets the encoding type
>   to its own, something like iso8859-1 for instance.

Let's not base everything on the one broken non-Linux MUA,

> - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
>   combinations in the forwarded mail and decides that everything in it is
>   UTF8, then you get lots of chars mangled in the mail, in the middle of
>   UTF8 combinations. Then, this crappy mail can be forwarded as long as
>   you want between UTF8 MUAs, they will all apply heuristics and to the
>   wrong thing : consider the *whole* document with *one* type.

Which MUAs exactly do ignore the "charset" of an email and try their own
guessing instead?

Or which MUAs exactly do not set a "charset" so that the receiving MUA
might have a reason for guessing?

> What I find even funni
Re: OT: character encodings (was: Linux 2.6.20-rc4)
Russell King <[EMAIL PROTECTED]> wrote:

[...]

> All that UTF-8 has done is added to the "which charset is this data"
> problem rather than actually solving any proper real life problem.

It solves real-world problems, the pain is that it is not (yet)
universally used. The charset problems are much more visible today than,
say, 15 years back, that is all.
--
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Jan 7 2007 22:30, Alan wrote:
>
>> >The kernel maintainers/help/config pretty consistently use UTF8
>>
>> I've seen a lot of places that don't do so. Want a patch?
>
>I think that would be a good idea - and add it to the coding/docs specs
>that documentation is UTF-8. Code should IMHO say 7bit though.

Hm, what do the list of authors in .c/.h files and kerneldoc in .c/h
belong to? doc or code?

-`J'
--
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, Jan 08, 2007 at 02:03:37AM +0100, Adrian Bunk wrote: > On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote: > > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote: > > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > > > > > > > On Jan 7 2007 17:06, Russell King wrote: > > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > > > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > > > > >$ file -i o > > > > > >o: text/plain; charset=us-ascii > > > > > >$ git log | head -n 1000 | tail -n 300 > o > > > > > >$ file -i o > > > > > >o: text/plain; charset=us-ascii > > > > > >$ git log | head -n 1000 | tail -n 400 > o > > > > > >$ file -i o > > > > > >o: text/plain; charset=utf-8 > > > > > > > > > > I am inclined to say that "file" does not count, because it tries to > > > > > guess an > > > > > ambiguous mapping from bytes to character set. Even more, file should > > > > > be > > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or > > > > > worse: 15) > > > > > file. This program is soo... forget it, it's not an argument. It > > > > > works well for > > > > > headerful files, but text files don't really contain one. The next > > > > > best thing > > > > > would be html, with a proper tag. > > > > > > > > The stupidity from the start up with those character sets is that they > > > > consider that a whole file is written with a given set. In fact, the > > > > charset should apply to characters themselves. At least, the > > > > quoted-printable, non-human friendly, encoding was the least stupid. > > > > > > I doubt doing this would really be worth the effort. > > > > > > In the 21st century, people should simply use UTF-8. 
> > >
> > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > > > and even mailers which correctly support UTF8 and use it by default manage
> > > > to shoot themselves in the foot when they reply to, or forward a mail. The
> > > > system is completely broken because limited by design, and we have to learn
> > > > to live with this brokenness.
> > >
> > > Only if MUAs have broken charset support or don't set a correct
> > > "charset" header in the mails they are sending.
> > >
> > > If some software still can't handle UTF-8 correctly more than 10 years
> > > after it was introduced, that's not a brokenness you can blame on UTF-8.
> >
> > I'm not blaming UTF-8 per se, but people who still believe in encoding
> > *whole documents*. Copy-paste, text insertion, git output, etc... everything
> > has a good reason not to be in the same encoding as what your MUA believes.
>
> How would you do this technically in a way that it's significantely
> easier than simply finishing the UTF=8 transition?

In how many decades do you think the transition will be finished ?

> > If major MUAs still have problems with UTF-8 10 years after it was introduced,
> > it's clearly the proof of a flaw in the initial design. And I'm not even
> > discussing the stupidity which requires that you read a whole text to get
> > its number of characters !
>
> The only major MUA not supporting UTF-8 is Eudora.
>
> And if you are talking about buggy old pine, in the latest development
> version [1] it does not only become open source, it also got some
> working Unicode support.

No, I'm not speaking about "not supporting", but "having problems". Every
one of us has already received mails from Thunderbird, Outlook, Notes, etc...
with erroneously encoded characters because of this :

  - an UTF8 MUA sends a mail to a non-UTF8 aware one.

  - this last one only sees double chars. When it wants to forward the mail
    to someone else, it keeps the chars verbatim, and sets the encoding type
    to its own, something like iso8859-1 for instance.

  - the final MUA, which is UTF8-aware, is very happy to detect lots of UTF8
    combinations in the forwarded mail and decides that everything in it is
    UTF8, then you get lots of chars mangled in the mail, in the middle of
    UTF8 combinations. Then, this crappy mail can be forwarded as long as
    you want between UTF8 MUAs, they will all apply heuristics and do the
    wrong thing : consider the *whole* document with *one* type.

What I find even funnier is when, for no apparent reason, the same MUA is
used on both ends and the contents get mangled because the sender copies
a portion of text from somewhere else.

Anyway, I don't want to follow up on this thread, it's *highly* off-topic
here.

Cheers,
Willy
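The three-step scenario described in this message can be replayed on bytes. This is a toy model under stated assumptions, not a description of how any particular MUA behaves: the sample words are invented, the "legacy" MUA is assumed to append its own ISO-8859-1 text to the body it received, and the final MUA is assumed to guess UTF-8 for the whole body.

```python
# Step 1: a UTF-8 MUA sends some text.
original = "na\u00efve"                      # "naive" with i-diaeresis
wire = original.encode("utf-8")

# Step 2: a non-UTF-8 MUA shows the bytes as ISO-8859-1 ("double chars"),
# then forwards them verbatim with its own ISO-8859-1 comment appended.
seen_by_legacy = wire.decode("iso-8859-1")
forwarded = wire + b" / " + "d\u00e9j\u00e0".encode("iso-8859-1")

# Step 3a: the final MUA ignores the label and guesses UTF-8 for everything.
guessed = forwarded.decode("utf-8", errors="replace")

# Step 3b: the final MUA trusts the iso8859-1 label for everything.
labelled = forwarded.decode("iso-8859-1")

print(seen_by_legacy)
print(guessed)
print(labelled)
```

Whichever single charset is applied to the whole body, one of the two parts comes out mangled, because the body really does contain bytes in two encodings.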
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, Jan 08, 2007 at 01:38:57AM +0100, Willy Tarreau wrote: > On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote: > > On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote: > > > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote: > > > > > > > > On Jan 7 2007 17:06, Russell King wrote: > > > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote: > > > > > > > > > >$ git log | head -n 1000 | tail -n 200 > o > > > > >$ file -i o > > > > >o: text/plain; charset=us-ascii > > > > >$ git log | head -n 1000 | tail -n 300 > o > > > > >$ file -i o > > > > >o: text/plain; charset=us-ascii > > > > >$ git log | head -n 1000 | tail -n 400 > o > > > > >$ file -i o > > > > >o: text/plain; charset=utf-8 > > > > > > > > I am inclined to say that "file" does not count, because it tries to > > > > guess an > > > > ambiguous mapping from bytes to character set. Even more, file should be > > > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or > > > > worse: 15) > > > > file. This program is soo... forget it, it's not an argument. It works > > > > well for > > > > headerful files, but text files don't really contain one. The next best > > > > thing > > > > would be html, with a proper tag. > > > > > > The stupidity from the start up with those character sets is that they > > > consider that a whole file is written with a given set. In fact, the > > > charset should apply to characters themselves. At least, the > > > quoted-printable, non-human friendly, encoding was the least stupid. > > > > I doubt doing this would really be worth the effort. > > > > In the 21st century, people should simply use UTF-8. > > > > > Now that UTF8 comes everywhere, everyone receives tons of mangled mails, > > > and even mailers which correctly support UTF8 and use it by default manage > > > to shoot themselves in the foot when they reply to, or forward a mail. 
> > > The system is completely broken because limited by design, and we have to learn
> > > to live with this brokenness.
> >
> > Only if MUAs have broken charset support or don't set a correct
> > "charset" header in the mails they are sending.
> >
> > If some software still can't handle UTF-8 correctly more than 10 years
> > after it was introduced, that's not a brokenness you can blame on UTF-8.
>
> I'm not blaming UTF-8 per se, but people who still believe in encoding
> *whole documents*. Copy-paste, text insertion, git output, etc... everything
> has a good reason not to be in the same encoding as what your MUA believes.

How would you do this technically in a way that it's significantly
easier than simply finishing the UTF-8 transition?

> If major MUAs still have problems with UTF-8 10 years after it was introduced,
> it's clearly the proof of a flaw in the initial design. And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

The only major MUA not supporting UTF-8 is Eudora.

And if you are talking about buggy old pine, in the latest development
version [1] it does not only become open source, it also got some
working Unicode support.

> Willy

cu
Adrian

[1] Alpine

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, Jan 08, 2007 at 12:37:50AM +0100, Adrian Bunk wrote:
> On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> > On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> > >
> > > On Jan 7 2007 17:06, Russell King wrote:
> > > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > > >
> > > >$ git log | head -n 1000 | tail -n 200 > o
> > > >$ file -i o
> > > >o: text/plain; charset=us-ascii
> > > >$ git log | head -n 1000 | tail -n 300 > o
> > > >$ file -i o
> > > >o: text/plain; charset=us-ascii
> > > >$ git log | head -n 1000 | tail -n 400 > o
> > > >$ file -i o
> > > >o: text/plain; charset=utf-8
> > >
> > > I am inclined to say that "file" does not count, because it tries to guess an
> > > ambiguous mapping from bytes to character set. Even more, file should be
> > > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> > > file. This program is soo... forget it, it's not an argument. It works well for
> > > headerful files, but text files don't really contain one. The next best thing
> > > would be html, with a proper tag.
> >
> > The stupidity from the start up with those character sets is that they
> > consider that a whole file is written with a given set. In fact, the
> > charset should apply to characters themselves. At least, the
> > quoted-printable, non-human friendly, encoding was the least stupid.
>
> I doubt doing this would really be worth the effort.
>
> In the 21st century, people should simply use UTF-8.
>
> > Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> > and even mailers which correctly support UTF8 and use it by default manage
> > to shoot themselves in the foot when they reply to, or forward a mail. The
> > system is completely broken because limited by design, and we have to learn
> > to live with this brokenness.
>
> Only if MUAs have broken charset support or don't set a correct
> "charset" header in the mails they are sending.
>
> If some software still can't handle UTF-8 correctly more than 10 years
> after it was introduced, that's not a brokenness you can blame on UTF-8.

I'm not blaming UTF-8 per se, but people who still believe in encoding
*whole documents*. Copy-paste, text insertion, git output, etc... everything
has a good reason not to be in the same encoding as what your MUA believes.

If major MUAs still have problems with UTF-8 10 years after it was introduced,
it's clearly the proof of a flaw in the initial design. And I'm not even
discussing the stupidity which requires that you read a whole text to get
its number of characters !

Willy
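For readers wondering why the number of characters is not simply the number of bytes: UTF-8 uses one to four bytes per code point, so counting means walking the data. A minimal sketch, assuming the input is already valid UTF-8 (`count_code_points` is a hypothetical helper, and it counts code points, which is not always what a user would call a character):

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new code point.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


encoded = "Rafa\u0142".encode("utf-8")
print(len(encoded), count_code_points(encoded))
```

The scan is linear in the length of the text, which is the cost being complained about; the byte length alone is only an upper bound.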
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, Jan 07, 2007 at 09:48:34PM +0100, Willy Tarreau wrote:
> On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
> >
> > On Jan 7 2007 17:06, Russell King wrote:
> > >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> > >
> > >$ git log | head -n 1000 | tail -n 200 > o
> > >$ file -i o
> > >o: text/plain; charset=us-ascii
> > >$ git log | head -n 1000 | tail -n 300 > o
> > >$ file -i o
> > >o: text/plain; charset=us-ascii
> > >$ git log | head -n 1000 | tail -n 400 > o
> > >$ file -i o
> > >o: text/plain; charset=utf-8
> >
> > I am inclined to say that "file" does not count, because it tries to guess an
> > ambiguous mapping from bytes to character set. Even more, file should be
> > _unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse: 15)
> > file. This program is soo... forget it, it's not an argument. It works well for
> > headerful files, but text files don't really contain one. The next best thing
> > would be html, with a proper tag.
>
> The stupidity from the start up with those character sets is that they
> consider that a whole file is written with a given set. In fact, the
> charset should apply to characters themselves. At least, the
> quoted-printable, non-human friendly, encoding was the least stupid.

I doubt doing this would really be worth the effort.

In the 21st century, people should simply use UTF-8.

> Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
> and even mailers which correctly support UTF8 and use it by default manage
> to shoot themselves in the foot when they reply to, or forward a mail. The
> system is completely broken because limited by design, and we have to learn
> to live with this brokenness.

Only if MUAs have broken charset support or don't set a correct
"charset" header in the mails they are sending.

If some software still can't handle UTF-8 correctly more than 10 years
after it was introduced, that's not a brokenness you can blame on UTF-8.

> Willy

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: OT: character encodings (was: Linux 2.6.20-rc4)
> >The kernel maintainers/help/config pretty consistently use UTF8
>
> I've seen a lot of places that don't do so. Want a patch?

I think that would be a good idea - and add it to the coding/docs specs
that documentation is UTF-8. Code should IMHO say 7bit though.

Alan
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sunday 07 January 2007 at 21:40 +0100, Jan Engelhardt wrote:
> >On Sun, 7 Jan 2007 15:05:53 -0500
> >Dave Jones <[EMAIL PROTECTED]> wrote:
> >
> >> If there's something I should be doing when I commit that I'm not,
> >> I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8
> >> which should DTRT to the best of my knowledge, but clearly, that isn't
> >> the case.
>
> No, LC_CTYPE defines what charset you use. (I may be wrong, though.)

IIRC LANG is the default for all LC_* categories - i.e. if only LANG is
defined, it sets all your locales, but you can individually set the
charset, numeric format, date format, etc.

Xav
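The lookup order described above can be sketched as follows, assuming Python 3 and the usual POSIX precedence (LC_ALL first, then the category variable, then LANG). The function name effective_ctype and the "C" fallback are assumptions for illustration; the fallback when nothing is set is implementation-defined.

```python
def effective_ctype(env):
    """Return the locale name that would govern LC_CTYPE for this environment.

    Precedence: LC_ALL overrides everything, then LC_CTYPE, then LANG.
    Empty values are treated as unset.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    # Assumed default when no variable is set.
    return "C"


only_lang = effective_ctype({"LANG": "en_US.UTF-8"})
ctype_wins = effective_ctype({"LANG": "en_US.UTF-8", "LC_CTYPE": "en_GB.ISO-8859-1"})
all_wins = effective_ctype({"LANG": "en_US.UTF-8", "LC_CTYPE": "en_GB.ISO-8859-1", "LC_ALL": "C"})
```

Under this reading both statements in the exchange hold: LC_CTYPE is what defines the charset, and LANG alone is enough to set it when nothing more specific is exported. Passing os.environ as env would apply the same lookup to the current process.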
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
>
> On Jan 7 2007 17:06, Russell King wrote:
> >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> >
> >$ git log | head -n 1000 | tail -n 200 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 300 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 400 > o
> >$ file -i o
> >o: text/plain; charset=utf-8
>
> I am inclined to say that "file" does not count, because it tries to
> guess an ambiguous mapping from bytes to character set. Even more, file
> should be _unable at all_ to distinguish an iso-8859-1 from an
> iso-8859-2 (or worse: 15) file. This program is soo... forget it, it's
> not an argument. It works well for headerful files, but text files
> don't really contain one. The next best thing would be html, with a
> proper tag.

The stupidity from the start up with those character sets is that they
consider that a whole file is written with a given set. In fact, the
charset should apply to characters themselves. At least, the
quoted-printable, non-human friendly, encoding was the least stupid.

Now that UTF8 comes everywhere, everyone receives tons of mangled mails,
and even mailers which correctly support UTF8 and use it by default manage
to shoot themselves in the foot when they reply to, or forward a mail. The
system is completely broken because limited by design, and we have to learn
to live with this brokenness.

Willy
Re: OT: character encodings (was: Linux 2.6.20-rc4)
>On Sun, 7 Jan 2007 15:05:53 -0500
>Dave Jones <[EMAIL PROTECTED]> wrote:
>
>> If there's something I should be doing when I commit that I'm not,
>> I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8
>> which should DTRT to the best of my knowledge, but clearly, that isn't
>> the case.

No, LC_CTYPE defines what charset you use. (I may be wrong, though.)

	-`J'
--
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sunday 07 January 2007 20:17, Russell King wrote:
[...]
> clearly not UTF-8. I doubt whether any of the commits I do on my
> en_GB ISO-8859-1 systems end up being UTF-8 encoded.

They don't. Git doesn't convert, with the exception of two mail-related
tools, which is the reason the commit being discussed ended up as UTF-8 in
GIT. The mail containing the patch was in ISO-8859-1.

All other git tools just store whatever byte sequence they are fed, be it
ISO-latin, utf-8 or something (to westerners) more exotic.

-- robin
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, 7 Jan 2007 15:05:53 -0500
Dave Jones <[EMAIL PROTECTED]> wrote:

Including the Git list...

> On Sun, Jan 07, 2007 at 07:17:30PM +, Russell King wrote:
>
> > commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
> > tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
> > parent 264166e604a7e14c278e31cadd1afb06a7d51a11
> > author Rafa³ Bilski <[EMAIL PROTECTED]> 1167691774 +0100
> > committer Dave Jones <[EMAIL PROTECTED]> 1167799119 -0500
> >
> > and looking at that "author" closer with od:
> >
> > 140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
> >      t  h  o  r     R  a  f  a  ³     B  i  l  s  k
> >
> > clearly not UTF-8. I doubt whether any of the commits I do on my
> > en_GB ISO-8859-1 systems end up being UTF-8 encoded.
>
> This has been bugging me for a while.
> Viewing the mail I applied in mutt shows his name correctly as Rafał
> Applying it with git-applymbox and viewing the log on master.kernel.org
> with git log shows Rafa
> And then later when put into email it turns into Rafa³
>
> > But the point is there is charset damage which has happened _long_ before
> > Linus' action. There is no character set defined for the contents of git
> > repositories, and as such the output of the git tools can not be
> > interpreted as any one single character set.
>
> If there's something I should be doing when I commit that I'm not,
> I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8
> which should DTRT to the best of my knowledge, but clearly, that isn't
> the case.
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, Jan 07, 2007 at 07:17:30PM +, Russell King wrote:

> commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
> tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
> parent 264166e604a7e14c278e31cadd1afb06a7d51a11
> author Rafa³ Bilski <[EMAIL PROTECTED]> 1167691774 +0100
> committer Dave Jones <[EMAIL PROTECTED]> 1167799119 -0500
>
> and looking at that "author" closer with od:
>
> 140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
>      t  h  o  r     R  a  f  a  ³     B  i  l  s  k
>
> clearly not UTF-8. I doubt whether any of the commits I do on my
> en_GB ISO-8859-1 systems end up being UTF-8 encoded.

This has been bugging me for a while.
Viewing the mail I applied in mutt shows his name correctly as Rafał
Applying it with git-applymbox and viewing the log on master.kernel.org
with git log shows Rafa
And then later when put into email it turns into Rafa³

> But the point is there is charset damage which has happened _long_ before
> Linus' action. There is no character set defined for the contents of git
> repositories, and as such the output of the git tools can not be
> interpreted as any one single character set.

If there's something I should be doing when I commit that I'm not,
I'll be happy to change my scripts. My $LANG is set to en_US.UTF-8
which should DTRT to the best of my knowledge, but clearly, that isn't
the case.

		Dave

--
http://www.codemonkey.org.uk
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, Jan 07, 2007 at 08:11:38PM +0100, Jan Engelhardt wrote:
>
> On Jan 7 2007 17:06, Russell King wrote:
> >On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> >
> >$ git log | head -n 1000 | tail -n 200 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 300 > o
> >$ file -i o
> >o: text/plain; charset=us-ascii
> >$ git log | head -n 1000 | tail -n 400 > o
> >$ file -i o
> >o: text/plain; charset=utf-8
>
> I am inclined to say that "file" does not count, because it tries to
> guess an ambiguous mapping from bytes to character set. Even more, file
> should be _unable at all_ to distinguish an iso-8859-1 from an
> iso-8859-2 (or worse: 15) file. This program is soo... forget it, it's
> not an argument. It works well for headerful files, but text files
> don't really contain one. The next best thing would be html, with a
> proper tag.

You're discarding a perfectly reasonable argument - file itself obviously
is not good at guessing the charset, but inspecting the resulting file
manually and identifying *both* ISO-8859 and UTF-8 character sequences
in there is pretty conclusive. As I did indeed do prior to sending that
message.

In this case, 'file' was doing a remarkably accurate job.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, Jan 07, 2007 at 06:21:51PM +, Alan wrote:
> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled. If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
>
> Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode.

The same is true of ISO-8859-1.

> It's just old broken 8bit encodings that are problematic.
>
> The kernel maintainers/help/config pretty consistently use UTF8

As I've tried to point out, that's not universally true. For instance:

commit 24ebead82bbf9785909d4cf205e2df5e9ff7da32
tree 921f686860e918a01c3d3fb6cd106ba82bf4ace6
parent 264166e604a7e14c278e31cadd1afb06a7d51a11
author Rafa³ Bilski <[EMAIL PROTECTED]> 1167691774 +0100
committer Dave Jones <[EMAIL PROTECTED]> 1167799119 -0500

and looking at that "author" closer with od:

140 74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b
     t  h  o  r     R  a  f  a  ³     B  i  l  s  k

clearly not UTF-8. I doubt whether any of the commits I do on my
en_GB ISO-8859-1 systems end up being UTF-8 encoded.

And _this_ is the problem when it comes to generating the logs,
irrespective of whether or not Linus loads UTF-8 data into an ISO-8859-1
message. For all we know, Linus' system could be using an ISO-8859
charset rather than UTF-8.

But the point is there is charset damage which has happened _long_ before
Linus' action. There is no character set defined for the contents of git
repositories, and as such the output of the git tools can not be
interpreted as any one single character set.

All that UTF-8 has done is added to the "which charset is this data"
problem rather than actually solving any proper real life problem.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
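As a rough check of the od dump above, assuming Python 3, the same sixteen bytes can be decoded under three candidate charsets. The variable names are illustrative only. The byte b3 is a valid character in both ISO-8859-1 and ISO-8859-2, but it cannot start a UTF-8 sequence.

```python
# The bytes from the od dump of the "author" header line.
raw = bytes.fromhex("74 68 6f 72 20 52 61 66 61 b3 20 42 69 6c 73 6b")

# Every byte value is defined in both 8-bit charsets, so both decodes succeed.
as_latin1 = raw.decode("iso-8859-1")  # 0xb3 becomes superscript three
as_latin2 = raw.decode("iso-8859-2")  # 0xb3 becomes l with stroke

# 0xb3 is a UTF-8 continuation byte with no lead byte before it.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

Nothing in the bytes themselves says which of the two 8-bit readings was intended; only the ISO-8859-2 reading gives the name as its owner spells it.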
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Jan 7 2007 18:21, Alan wrote:
>
>> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
>> is UTF-8 enabled. If you're operating in a mixed charset environment
>> it's one bloody big pain in the butt.
>
>Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
>broken 8bit encodings that are problematic.
>
>The kernel maintainers/help/config pretty consistently use UTF8

I've seen a lot of places that don't do so. Want a patch?

	-`J'
--
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Jan 7 2007 17:06, Russell King wrote:
>On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
>
>$ git log | head -n 1000 | tail -n 200 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 300 > o
>$ file -i o
>o: text/plain; charset=us-ascii
>$ git log | head -n 1000 | tail -n 400 > o
>$ file -i o
>o: text/plain; charset=utf-8

I am inclined to say that "file" does not count, because it tries to guess
an ambiguous mapping from bytes to character set. Even more, file should be
_unable at all_ to distinguish an iso-8859-1 from an iso-8859-2 (or worse:
15) file. This program is soo... forget it, it's not an argument. It works
well for headerful files, but text files don't really contain one. The next
best thing would be html, with a proper tag.

	-`J'
--
Re: OT: character encodings (was: Linux 2.6.20-rc4)
> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> is UTF-8 enabled. If you're operating in a mixed charset environment
> it's one bloody big pain in the butt.

Net ASCII is 7bit and is 1:1 mapped with UTF-8 unicode. It's just old
broken 8bit encodings that are problematic.

The kernel maintainers/help/config pretty consistently use UTF8
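The 1:1 mapping claim can be illustrated with a small sketch, assuming Python 3; the sample strings are arbitrary. Pure 7-bit text has identical bytes under ASCII, UTF-8 and ISO-8859-1. Above 0x7f the picture splits: ISO-8859-1 byte values equal the Unicode code points, but the UTF-8 bytes for the same character differ.

```python
ascii_text = "Linux 2.6.20-rc4"

# For 7-bit text the three encodings produce the same byte string.
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("utf-8")
    == ascii_text.encode("iso-8859-1")
)

# U+00E9 (e with acute) is outside the 7-bit range.
e_acute = "\u00e9"
latin1_bytes = e_acute.encode("iso-8859-1")  # one byte, equal to the code point
utf8_bytes = e_acute.encode("utf-8")         # two bytes

# ISO-8859-1 maps byte value N to code point N for all 256 values.
latin1_is_identity = all(bytes([n]).decode("iso-8859-1") == chr(n) for n in range(256))
```

So "1:1 mapped" holds at the byte level only for the 7-bit range; for ISO-8859-1 it holds at the code-point level, which is not enough for a UTF-8 reader to accept the bytes unchanged.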
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 15:38 +, Russell King wrote:
> > When a text file is stored on disk, there's no way to tell what
> > character set the characters in that file belong to. As a result,
> > ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> > UTF-8 folk assume all text files are UTF-8 encoded. This leads to
> > utter confusion.
>
> Only if you are making different assumptions about the _same_ set of
> files, on the _same_ system. But that would be silly.

$ git log | head -n 1000 | tail -n 200 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 300 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 400 > o
$ file -i o
o: text/plain; charset=utf-8

(and you know what charset the file is thought to have with all 1000 lines
in it.)

All on a system with LANG set to en_GB (iow ISO-8859-1).

> > To see what I mean, try the following:
> >
> > $ git log | head -n 1000 > o
> > $ file -i o
> > o: text/x-c; charset=iso-8859-1
> >
> > According to that, the charset of the 'git log' output (which on that
> > test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> > was right to include it as ISO-8859-1.
>
> Yes. When you stored it on disk, the character set information was lost.

The same thing actually happens when I look at it via:

$ git log | head -n 1000 | less

but in this case the output is always interpreted by the terminal to be
in its character set.

> If you were running a mixed-charset system then attempting to recreate
> the lost information with heuristics and assumptions is obviously going
> to be problematic.

I'm not - I'm running a pure ISO-8859-1 system:

$ echo $LANG
en_GB
$ locale -k LC_CTYPE | grep charmap
charmap="ISO-8859-1"

> Actually, because UTF-8 allows me to run a system which is purely based
> on a single character set, I get better results when I try the same
> trick:
> shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
> shinybook /shiny/git/mtd-2.6 $ file -i o
> o: text/plain; charset=utf-8

$ LANG=en_GB.UTF-8 locale -k LC_CTYPE | grep charmap
charmap="UTF-8"
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB.UTF-8 file -i o
o: text/x-c; charset=iso-8859-1
$ git version
git version 1.4.4.2

Looks like the output is iso-8859-1 even with UTF-8!

> > In reality, the output from git log contains an ad-hoc collection of
> > character sets making its interpretation under any one character set
> > incorrect.
>
> No, the contents of the git log ought to be UTF-8, unless people have
> been misusing it. Git stores its text in UTF-8 (by default), and is
> capable of converting to and from legacy character sets on input
> (git-commit) and output (git-log).

Git may store its text internally in UTF-8 (I don't know but I have no
evidence to suggest it does - in fact I have some evidence in this test
that it doesn't care about charsets.)

git log output on a non-UTF-8 system certainly is not in the host's
character set. For example:

$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB git log | head -n 1000 > o2
$ diff -u o o2

That includes the UTF-8 encoded part of Leonard's name. It also includes
Rafa? Bilski's name which is non-UTF-8 encoded.

So, in both cases, exactly the same output bytestream was created
independent of the character set _actually_ being used, which both
includes untranslated UTF-8 and non-UTF-8 sequences. There is obviously
no character set translation going on with the output.

So we can add 'git' to my list of charset-broken programs.

Also, since we have recent data in the git repository which is non-UTF-8
as well, it is clear that there is no character set translation going on
at input time either.

Looking at the git-commit script, there appears to be no character set
conversion going on in there either.

So, I think you'll find that the contents of git _is_ an ad-hoc
collection of character sets which people happen to have in use on their
machines.

> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled. If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
>
> A mixed charset environment was _already_ a pain in the butt, because
> almost nobody got labelling right. It's wrong to blame that on UTF-8.

I'm not talking about a mixed charset environment. I'm talking about
non-UTF-8 single charset environments being broken by programs which
universally think the universe is UTF-8 only.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
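The "ad-hoc collection of character sets" observation can be approximated with a per-line check, sketched here under the assumption of Python 3. The function classify_line and the sample log are hypothetical: the first author line reuses the b3 byte from the od dump, while the second name is invented for illustration. This is not how file(1) or git work internally.

```python
def classify_line(line: bytes) -> str:
    """Label a byte string as 'ascii', 'utf-8', or 'other-8bit'."""
    try:
        line.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        line.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit charset; the bytes alone cannot say which one.
        return "other-8bit"


# Hypothetical log: one ISO-8859-2 name, one UTF-8 name, one plain ASCII line.
sample_log = (
    b"author Rafa\xb3 Bilski\n"
    + "author Zo\u00eb Example\n".encode("utf-8")
    + b"committer Dave Jones\n"
)

labels = [classify_line(line) for line in sample_log.splitlines()]
whole_stream = classify_line(sample_log)
```

Checked line by line, the stream shows three different answers; checked as a whole, a single label is forced onto all of it. That is consistent with the guess changing as head and tail select different slices of the same log.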
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, 2007-01-07 at 15:38 +, Russell King wrote:
> On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> > On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > > Russell King wrote:
> > > > Welcome to the mess which the UTF-8 charset creates.
> >
> > Utter bollocks.
>
> Wrong. The problem is partly caused by not everything understanding
> multi-byte character encodings,

No, that's a different problem; not the one you were referring to above.
And it's a problem which is rapidly diminishing, too.

> and text files containing absolutely
> _no_ information about their character encodings.

That's a real problem, yes -- but it was a problem long before UTF-8 was
added to the collection of character sets in use. Even within the UK, we
had to choose between ISO8859-1 and ISO8859-15.

> When a text file is stored on disk, there's no way to tell what
> character set the characters in that file belong to. As a result,
> ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> UTF-8 folk assume all text files are UTF-8 encoded. This leads to
> utter confusion.

Only if you are making different assumptions about the _same_ set of
files, on the _same_ system. But that would be silly.

If I suddenly "assume" that my laptop has a Dvorak keyboard layout
despite that blatantly not being true, I'll get the same kind of
confusion. That isn't Dvorak's fault, either.

If, on the other hand, I have one system which is entirely ISO8859-1 and
a separate system which is entirely UTF-8, each of those are _fine_ and
unconfusing. Obviously I have to make sure files are properly labelled
and converted in transport between different systems -- but that's
nothing new.

> To see what I mean, try the following:
>
> $ git log | head -n 1000 > o
> $ file -i o
> o: text/x-c; charset=iso-8859-1
>
> According to that, the charset of the 'git log' output (which on that
> test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> was right to include it as ISO-8859-1.

Yes. When you stored it on disk, the character set information was lost.
If you were running a mixed-charset system then attempting to recreate
the lost information with heuristics and assumptions is obviously going
to be problematic.

Actually, because UTF-8 allows me to run a system which is purely based
on a single character set, I get better results when I try the same
trick:

shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
shinybook /shiny/git/mtd-2.6 $ file -i o
o: text/plain; charset=utf-8

Again, the problem of labelling isn't at all new to UTF-8. The only
thing that's new with UTF-8 is that it's now actually _practical_ to
have a system which only uses one character set throughout, and which
thus _can_ get its 'guess' right when you don't bother to label
everything.

> In reality, the output from git log contains an ad-hoc collection of
> character sets making its interpretation under any one character set
> incorrect.

No, the contents of the git log ought to be UTF-8, unless people have
been misusing it. Git stores its text in UTF-8 (by default), and is
capable of converting to and from legacy character sets on input
(git-commit) and output (git-log).

(Obviously, that's likely to be lossy if you convert it to any given
legacy character set, because ∀ legacy character set, ∃ characters
within UTF-8 that aren't in that legacy character set.)

> > Far from being the cause of the problem, UTF-8 actually offers the
> > chance of a _solution_. Because once the Luddites catch up, it'll
> > largely eliminate the need for using the multitude of legacy character
> > sets and converting between them -- and the problem of mislabelling will
> > fairly much go away.
>
> In other words, the UTF-8 luddites require the entire Internet to
> upgrade to UTF-8 for UTF-8 to work properly.

Not at all. The problems arise when character set information is lost,
which can happen at any point during the flow of information. Anything
we can do to reduce the likelihood of charset information being lost is
an overall improvement.

We already demonstrated an example (git-log > o; file -i o) of a case
where a _consistent_ system gets it right, while an inconsistent system
introduces an error.

If any individual system processes all text in a single character set,
then that system is no longer a likely source of corruption due to
labelling errors. And because UTF-8 fully covers the set of characters
which can be represented in the legacy character sets, it allows us to
deploy systems which do just that.

> I _regularly_ struggle with idiotic programs that assume that the world
> is UTF-8 and nothing else.

I don't think I've encountered such a program in my distribution of
choice. If I had, I would have filed a bug. Making assumptions about
character sets, outside of the locally-controlled environment, is
invalid. That's been true since the first 8-bit character sets, if not
longer.

> So, in short,
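The remark about lossy conversion to a legacy character set can be sketched like this, assuming Python 3 and its standard codecs; the variable names are illustrative. The name from the thread fits ISO-8859-2 but not ISO-8859-1, so converting it to the latter either fails or substitutes a placeholder.

```python
name = "Rafa\u0142"  # ends in U+0142, l with stroke

# ISO-8859-2 has this character, at byte 0xb3.
latin2 = name.encode("iso-8859-2")

# ISO-8859-1 does not, so a strict conversion fails.
try:
    name.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# A forgiving conversion silently replaces the character instead.
lossy = name.encode("iso-8859-1", errors="replace")
```

Whether a tool fails loudly or substitutes quietly is a policy choice in that tool; either way the original character cannot be recovered from the ISO-8859-1 output.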
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > Russell King wrote:
> > > Welcome to the mess which the UTF-8 charset creates.
>
> Utter bollocks.

Wrong. The problem is partly caused by not everything understanding
multi-byte character encodings, and text files containing absolutely
_no_ information about their character encodings.

When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to. As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded. This leads to
utter confusion.

To see what I mean, try the following:

$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1

According to that, the charset of the 'git log' output (which on that
test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
was right to include it as ISO-8859-1.

In reality, the output from git log contains an ad-hoc collection of
character sets making its interpretation under any one character set
incorrect.

> > The problem of different character encodings coexisting on the same
> > platform, and the resulting occasional messing-up, far predates Unicode.
> > I distinctly remember one case of being bitten by this myself in 1977
> > when Unicode wasn't even on the horizon yet, and I don't think that was
> > the first time.
>
> Indeed. If you take arbitrary content and send it out to the world
> labelled as ISO8859-1, of _course_ you're likely to be corrupting it.
>
> Far from being the cause of the problem, UTF-8 actually offers the
> chance of a _solution_. Because once the Luddites catch up, it'll
> largely eliminate the need for using the multitude of legacy character
> sets and converting between them -- and the problem of mislabelling will
> fairly much go away.

In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.

I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else. UTF-8 does _not_ solve these inter-operability
problems - it only makes the entire situation worse by introducing yet
another different charset.

(Yes, it's also true that there are programs which assume the world is
only another, different, character set.)

Rather than having these problems fixed properly (by looking at the
LANG environment variable) many of these programs now assume that the
world is UTF-8. It isn't.

elinks is one such program. It now assumes UTF-8 _only_ displays. That's
no better than programs which assume ISO-8859-1 only or US-ASCII only.

So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled. If you're operating in a mixed charset environment
it's one bloody big pain in the butt.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: OT: character encodings (was: Linux 2.6.20-rc4)
On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> Russell King wrote:
> > Welcome to the mess which the UTF-8 charset creates.

Utter bollocks.

> The problem of different character encodings coexisting on the same
> platform, and the resulting occasional messing-up, far predates Unicode.
> I distinctly remember one case of being bitten by this myself in 1977
> when Unicode wasn't even on the horizon yet, and I don't think that was
> the first time.

Indeed. If you take arbitrary content and send it out to the world
labelled as ISO8859-1, of _course_ you're likely to be corrupting it.

Far from being the cause of the problem, UTF-8 actually offers the
chance of a _solution_. Because once the Luddites catch up, it'll
largely eliminate the need for using the multitude of legacy character
sets and converting between them -- and the problem of mislabelling will
fairly much go away.

--
dwmw2
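The effect of sending content out under the wrong label can be sketched as follows, assuming Python 3; the variable names are illustrative. UTF-8 bytes read as ISO-8859-1 decode without any error, which is why the damage tends to go unnoticed until a person looks at the result.

```python
original = "Rafa\u0142"               # name ending in l with stroke
wire_bytes = original.encode("utf-8")  # what the sender actually transmits

# The receiver trusts a wrong "charset=ISO-8859-1" label.
mislabelled = wire_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte to a character, so the original bytes can be
# recovered and decoded again once the correct charset is known.
repaired = mislabelled.encode("iso-8859-1").decode("utf-8")
```

The repair only works while the mangled text is still intact; once a second tool re-encodes or replaces the stray characters, the original bytes are gone.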