Re: [9fans] simplicity
On 10/9/07, erik quanstrom [EMAIL PROTECTED] wrote: i think i see what the reasoning is. the thought is that, e.g., in spanish [a-z] should match ñ. Ah, thanks! I was thinking of the simplistic scenario, where someone might be looking for niño in some file, regardless of what locale they might happen to be in. Now I can imagine the nightmare it must be for non-English speakers looking for letter combinations irrespective of accents. But it seems more like a problem with the shorthand than with grep per se. I could see an argument for [:alpha:] potentially matching n and ñ depending on the locale, but [a-z] not matching ñ in any locale. Even then, my tendency would be for [:alpha:] to match ñ in every locale. But then, does [:alpha:] match ἄγαθος? How ironic that it doesn't match α. What an ugly problem. -Jack
Re: [9fans] simplicity
My most annoying locale problem concerned reading Czech HTML emails in mh. Don't ask why, just accept that I got a lot of these and could not simply ignore them. The problem was that mh saw a text/html MIME type and, as it does for text, helpfully converted from the original encoding, usually CP1250 or iso8859-2, to the encoding specified in my locale environment variable, utf-8. Since the content was html, it then handed it to a ``browser'', in my case w3m, for pretty formatting. w3m read the encoding from the html header, thought its input was CP1250 or iso8859-2, and helpfully converted to utf-8. Both programs were behaving in a vaguely sensible way, but iconv was being run twice, and the result was gibberish. It took me a while to figure out what was happening and a while to figure out a way to make it stop. I don't know what the general answer to problems like this is. Forcing everyone to use English is not an option. Forcing everyone to use utf-8 would be better, but is not going to happen either. John -- John Stalker School of Mathematics Trinity College Dublin tel +353 1 896 1983 fax +353 1 896 2282
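The double conversion John describes is easy to reproduce. This sketch (with a hypothetical sample word, not the actual mails) shows the first conversion doing the right thing and the second one, applied to already-converted text, producing gibberish:

```python
# mh converts iso8859-2 -> utf-8; w3m, trusting the html header's claim
# of iso8859-2, then converts the already-converted text a second time.
original = "čeština"                  # hypothetical Czech sample text
wire = original.encode("iso8859-2")   # what the mail body actually held

once = wire.decode("iso8859-2")       # first conversion: correct text
twice = once.encode("utf-8").decode("iso8859-2")  # second conversion

print(once)    # the original word, intact
print(twice)   # mojibake, not čeština
```

Both steps are individually sensible, which is John's point: neither program is wrong, only their composition is.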
Re: [9fans] simplicity
Forcing everyone to use utf-8 would be better, but is not going to happen either. it will, it will just take some time (some things will be in utf-x for x>8) partly because it isn't `forced' (who could ever do the `forcing')
Re: [9fans] simplicity
Heh, funny that this thread got revived the very day that my colleague's backup script choked because he was running in a utf8 locale and hit a filename encoded in iso8859-1. Apparently GNU sed's . stops matching when it hits an invalid bytestream (which is not entirely unreasonable I guess). -sqweek clearly in their world, it is unreasonable. - erik
Re: [9fans] simplicity
My most annoying locale problem concerned reading Czech HTML emails in mh. Don't ask why, just accept that I got a lot of these and could not simply ignore them. The problem was that mh saw a text/html MIME type and, as it does for text, helpfully converted from the original encoding, usually CP1250 or iso8859-2, [...] i think this is a character set conversion problem, not a locale problem. a small distinction, but i think one can live with converting character sets as they come onto a system. localized (ha!) complexity. - erik
Re: [9fans] simplicity
I was thinking of the simplistic scenario, where someone might be looking for niño in some file, regardless of what locale they might happen to be in. Now I can imagine the nightmare it must be for non-English speakers looking for letter combinations irrespective of accents. But, it seems more like a problem with the shorthand than grep, per se. i agree with this. or it's a historical problem with the character set. clearly if you were designing a universal character set with no compatibility constraints, the alphabet would have n and ñ together so [a-z] would match both. I could see an argument for [:alpha:] potentially matching n and ñ depending on the locale, but [a-z] not matching ñ in any locale. But even that, my tendency would be that [:alpha:] match ñ in every locale. But then, does [:alpha:] match ἄγαθος? How ironic that it doesn't match α. i don't think one can go this route. you can't have a magic environment variable that changes everything. testing is a nightmare in such a world. you have to go through every combination of (data cs, locale) to see if things are working. a better solution is to use the properties of unicode. ñ is noted in the table as 00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1 field 6 has the base codepoint 006E as its first subfield. it would not be hard to build a table quickly mapping a codepoint to its base codepoint σ. but it would probably be most useful to also have a mapping from base codepoints to all composed forms ξ. suppose, for lack of creativity, we use » to mean all codepoints sharing the base of the next character, so »a matches ä, as does »[a-z]. then » of a letter c can be grepped by taking ξ(σ(c)), which results in a character class. plan 9 already has some of this in the c library with tolowerrune, etc. i did some work with this some time ago and wrote some rc scripts to generate the to*rune tables from the unicode standard data. 
it would be easy to adapt them to generate ξ and σ. (the tables would be pretty big.) What an ugly problem. it can be made ugly quickly. but i'm not convinced that all approaches to this problem are bad. - erik
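erik's σ (codepoint to base) and ξ (base to composed forms) can be sketched in a few lines. This version uses Python's unicodedata in place of parsing the raw UnicodeData.txt table; the function names and the 0x250 scan limit (Latin plus its extensions, for brevity) are my own shorthand, not anything from the thread:

```python
import unicodedata

def sigma(ch):
    # σ: map a character to its base by NFD-decomposing it and
    # dropping combining marks (the 006E 0303 decomposition field)
    return "".join(c for c in unicodedata.normalize("NFD", ch)
                   if not unicodedata.combining(c))

def xi(basech, limit=0x250):
    # ξ: all precomposed codepoints sharing a base, scanned over a
    # small range here rather than the whole of unicode
    return {chr(cp) for cp in range(limit) if sigma(chr(cp)) == basech}

# »n rendered as an ordinary character class: n plus ñ, ń, ň, ...
cls = "[" + "".join(sorted(xi(sigma("n")))) + "]"
```

The resulting class can be handed to any existing regexp engine, which is the appeal of the scheme: the locale machinery is replaced by a fixed, locale-independent table derived once from the unicode data.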
Re: [9fans] simplicity
i think this is a character set conversion problem, not a locale problem. a small distinction, but i think one can live with converting character sets as they come onto a system. localized (ha!) complexity. I'm not sure your solution is always the correct one, or is implementable. Should an MTA silently convert incoming mail to the local character set? I'm not sure I want that. The other program in my example was a web browser reading from a pipe. It can't know whether it's processing data as it comes into the system or data which is already there and has already been converted, unless either it can trust the meta tag in the document to have been updated or the conversion is pushed out into the network layer. Also, it's meaningful to talk about the system character set in the plan9 world or the windows world, but not under UNIX, which is where I spend most of my time, for better or worse. -- John Stalker School of Mathematics Trinity College Dublin tel +353 1 896 1983 fax +353 1 896 2282
Re: [9fans] simplicity
I'm not sure your solution is always the correct one, or is implementable. Should an MTA silently convert incoming mail to the local character set? it doesn't have to. upas/fs does given the character set in the file. i've thought about the mta doing it. i think that would be a nice solution. In my case this was being done by the MUA, which was mh rather than upas, but the net effect is the same. I'm not sure I want that. The other program in my example was a web browser reading from a pipe. It can't know whether it's processing data as it comes into the system or data which is already there and has already been converted, unless either it can trust the meta tag in the document to have been updated or the conversion is pushed out into the network layer. what is the standard. if the encoding in the message header is x does that mean that the encoding in the html header needs to be x? what happens if they differ? the only case that makes sense is that they have to be the same. but html and http generally run counter to common sense. ;-) I don't know what happens if they differ. In my case they were the same, but the problem was that both programs assigned themselves the job of converting. I think that the mailer SHOULD NOT, to use the RFC capitals, convert the character set if it is handing off the display job to another program. In any case that's the way I set things up once I figured out what was going on. This is counter to the way the CRLF issue is handled, though. There the network standard is CRLF and systems which use other conventions, including all the ones I use, are expected to convert before sending and after receiving so no local programs need to know about such issues. -- John Stalker School of Mathematics Trinity College Dublin tel +353 1 896 1983 fax +353 1 896 2282
Re: [9fans] simplicity
In article [EMAIL PROTECTED] Uriel wrote: Don't complain, at least it is not producing random behaviour, I have seen versions of gnu awk that when fed plain ASCII input, if the locale was UTF-8, rules would match random lines of input, the fix? set the locale to 'C' at the top of all your scripts (and don't even think of dealing with files which actually contain non-ASCII UTF-8). This was some years ago, it might be fixed by now, but it demonstrates how the locale insanity makes life so much more fun. It likely is fixed by now. If not, I'd like to have a sample program and data and locale name to test under. And the truth is, even if it doesn't work, I can blame the library routines and locale and not my code. :-) Testing should be performed using current sources, available via anonymous CVS from savannah.gnu.org, check out the gawk-stable module. From CVS use: ./bootstrap.sh ./configure make make check to build on a Unix or Linux system. I hope to make a formal release in the next few weeks. As to the original thread, yeah, configure (= autoconf + automake + libtool + gnulib) has gotten way too hairy to handle. I don't use gnulib on principle: I have the gut feeling that the configuration goop would likely outweigh the source code in line count. The only reason I added Automake support was to get GNU Gettext, which on balance is a good thing. Locales, on the other hand, I think are very painful. I hope that people who use them find them valuable (I'm a parochial English speaking American myself, so ASCII is usually enough for me.) My two cents, Arnold -- Aharon (Arnold) Robbins arnold AT skeeve DOT com P.O. Box 354, Nof Ayalon, D.N. Shimshon 99785 ISRAEL Home Phone: +972 8 979-0381 Fax: +1 206 202 4333 Cell Phone: +972 50 729-7545
Re: [9fans] simplicity
This was some years ago, it might be fixed by now, but it demonstrates how the locale insanity makes life so much more fun. It likely is fixed by now. If not, I'd like to have a sample program and data and locale name to test under. And the truth is, even if it doesn't work, I can blame the library routines and locale and not my code. :-) Yes, it is likely fixed now, and it was very likely a bug in the libraries rather than awk, but it illustrates the kinds of problems locales create. And I can tell you, in a production environment it can be a pain when who knows what tool who knows where in your whole system starts to misbehave because it is not happy with your locale. I also find most sad how in the name of 'localization' the output of many tools (especially error messages) has become unpredictable. It makes providing support most fun when you ask people can you copy-paste the output you get when you run this, and they answer with a bunch of stuff in Aramaic. If you use unix, you are supposed to understand English, period. (Or what is next? will they have a set of 'magic symlinks' that links '/bin/gato' to '/bin/cat' if your locale is in Spanish?) And now that you mention Gettext, if only I could get back all the time I wasted trying to compile some stupid program (that should never have been 'localized' in the first place) which is somehow unhappy about the gettext version I have (or the other way around)... uriel P.S.: Oh, and people who insist on using encodings other than UTF-8 should be locked up in padded cells (without access to computers and ideally even without electricity, unless it is to help them electrocute themselves) for the good of mankind.
Re: [9fans] simplicity
Yes, old thread, sorry. Blame Uriel. On 9/18/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote: erik quanstrom wrote: suppose Linux user a and user b grep the same text file for the same string. results will depend on the users' locales. But if they're trying to match an alphabetic character class, the result *should* depend on the locale. This baffles me. Can anyone think of examples where one might want differing results depending on your locale? -Jack
Re: [9fans] simplicity
Yes, old thread, sorry. Blame Uriel. On 9/18/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote: erik quanstrom wrote: suppose Linux user a and user b grep the same text file for the same string. results will depend on the users' locales. But if they're trying to match an alphabetic character class, the result *should* depend on the locale. This baffles me. Can anyone think of examples where one might want differing results depending on your locale? -Jack i think i see what the reasoning is. the thought is that, e.g., in spanish [a-z] should match ñ. the problem is this means that grep(regexp, data) now returns a set of results, one for each locale. so on the one hand, one would like [a-z] to do the Right Thing, depending on language. and on the other hand, one wants grep(regexp, data) to return a single result. i think the way to see through this issue is to notice that the reason we want ñ to be in [a-z] is because of visual similarity. what if we were dealing with chinese? i think it's pretty clear that [a-z] should map to a contiguous set of unicode codepoints. if you want to deal with ñ, the unicode tables do note that ñ is n+combining ~, so one could come up with a new denotation for base codepoint. unfortunately combining that with existing regexps would be a bit painful. - erik
Re: [9fans] simplicity
On 9/18/07, Uriel [EMAIL PROTECTED] wrote: Don't complain, at least it is not producing random behaviour, I have seen versions of gnu awk that when fed plain ASCII input, if the locale was UTF-8, rules would match random lines of input, the fix? set the locale to 'C' at the top of all your scripts (and don't even think of dealing with files which actually contain non-ASCII UTF-8). This was some years ago, it might be fixed by now, but it demonstrates how the locale insanity makes life so much more fun. Heh, funny that this thread got revived the very day that my colleague's backup script choked because he was running in a utf8 locale and hit a filename encoded in iso8859-1. Apparently GNU sed's . stops matching when it hits an invalid bytestream (which is not entirely unreasonable I guess). -sqweek
Re: [9fans] simplicity
Uriel wrote: found this gem in one of the many X headers: #define NBBY 8 /* number of bits in a byte */ So what is supposed to be wrong with using a manifest constant instead of hard-coding 8 in various places? As I recall, The Elements of Programming Style recommended this approach. Similar definitions have been in Unix system headers for decades. CHAR_BIT is defined in limits.h. (Yes, I know there is a difference between a char and a byte. Less well known, there is a difference between a byte and an octet.) I'm not saying that some of the complaints don't have a point, especially when important tools perform poorly. However, I've observed an unusual degree of arrogance in the Plan 9 newsgroup, approaching religion. Plan 9's way of doing things is not the only intelligent way; others may have different goals and constraints that affect how they do things in their particular environments.
Re: [9fans] simplicity
So what is supposed to be wrong with using a manifest constant instead of hard-coding 8 in various places? As I recall, The Elements of Programming Style recommended this approach. i see two problems with this sort of indirection. if i see NBBY in the code, i have to look up its value. NBBY doesn't mean anything to me. this layer of mental gymnastics makes the code hard to read and understand. on the other hand, 8 means something to me. more importantly, it implies that the code would work with NBBY of 10 or 12. (the c standard says you can't have <8; §5.2.4.2.1.) i'd bet there are many things in the code that depend on the size of a byte that don't reference NBBY. so this define goes 0 fer 2. it can't be changed and it is not informative. Similar definitions have been in Unix system headers for decades. CHAR_BIT is defined in limits.h. (Yes, I know there is a difference between a char and a byte. Less well known, there is a difference between a byte and an octet.) this mightn't be the right place to defend a practice by saying that unix systems have been doing it for years. - erik
Re: [9fans] simplicity
Less well known, there is a difference between a byte and an octet. grep octet /sys/games/lib/fortunes 20 octets is 160 guys playing flutes -- rob easily one of my favourites
Re: [9fans] simplicity
On 9/19/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote: I'm not saying that some of the complaints don't have a point, especially when important tools perform poorly. However, I've observed an unusual degree of arrogance in the Plan 9 newsgroup, approaching religion. Plan 9's way of doing things is not the only intelligent way; others may have different goals and constraints that affect how they do things in their particular environments. imho a big problem is that in the mentioned places every environment is always thought of as a particular one. iru
Re: [9fans] simplicity
i see two problems with this sort of indirection. if i see NBBY in the code, i have to look up its value. NBBY doesn't mean anything to me. this layer of mental gymnastics makes the code hard to read and understand. on the other hand, 8 means something to me. more importantly, it implies that the code would work with NBBY of 10 or 12. (the c standard says you can't have <8; §5.2.4.2.1.) i'd bet there are many things in the code that depend on the size of a byte that don't reference NBBY. so this define goes 0 fer 2. it can't be changed and it is not informative. 8 can be a lot of things besides the number of bits in a byte (the number of bytes in a double or vlong, for example). if you're doing enough conversions between byte counts and bit counts, then using NBBY makes it clear *why* you're using an 8 there, which might help a lot. in other contexts, it might not be worth the effort. jumping all over a #define without seeing how or why it is being used is not productive. nor interesting. in fact i can't believe i'm writing this. sorry. russ
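russ's point about bit/byte conversions can be made concrete with a hypothetical helper of the kind such headers support (the function is my own illustration, not from any of the code under discussion):

```python
NBBY = 8  # number of bits in a byte, as in the X header

def bytes_for_bits(nbits):
    # sizing a bitmap, say: here NBBY documents *why* the 8 appears,
    # where a bare 8 could as easily be sizeof(double) or sizeof(vlong)
    return (nbits + NBBY - 1) // NBBY
```

Whether that clarity outweighs erik's objection (the value can never actually change) is exactly the disagreement in the thread.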
Re: [9fans] simplicity
However, I've observed an unusual degree of arrogance in the Plan 9 newsgroup, approaching religion. elitism, not arrogance. I don't want to belong to any club that will accept me as a member. - Groucho Marx
Re: [9fans] simplicity
erik quanstrom wrote: wchar_t is not the equivalent of Rune. Rune is always utf-8. wchar_t can be whatever. I could have sworn that Plan 9 rune is used to contain a Unicode value (UCS-2). wchar_t can do the same thing, and does on some platforms. On others, wchar_t holds a full 31-bit UCS-4 code, and on others (Solaris for example) its encoding is locale-dependent (which I would agree is not a good design). suppose Linux user a and user b grep the same text file for the same string. results will depend on the users' locales. But if they're trying to match an alphabetic character class, the result *should* depend on the locale.
Re: [9fans] simplicity
But if they're trying to match an alphabetic character class, the result *should* depend on the locale. ... so what *should* the result be if the locale specifies an ideographic script? DaveL
Re: [9fans] simplicity
On 9/18/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: But if they're trying to match an alphabetic character class, the result *should* depend on the locale. ... so what *should* the result be if the locale specifies an ideographic script? DaveL the result *should* be 'now go and use plan 9' iru
Re: [9fans] simplicity
On 9/17/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote: erik quanstrom wrote: i think the devolution of gnu grep is quite instructive. ... it gets to the heart of why plan9's invention and use (thanks, rob, ken) of utf-8 is so great. If the problem is that Gnu grep converts any non-8-bit character set to wchar_t (the equivalent of Plan 9 rune), then it's not really a fair criticism of the software. The conversion approach handles a wide variety of character encoding schemes, whereas grepping the encodings directly (the fast approach) doesn't work well for many non-UTF-8 encodings. Well, on a 2GHz x86, gnu grep ran for me at about 9600 baud on an ASCII file if I set my locale to the UTF-8 locale. UTF-8 is ASCII compatible - explicitly, publicly, and on purpose - so there is no excuse for this sort of performance penalty. To be specific, in the UTF-8 locale it should take just a few instructions to convert any character to wchar_t, ASCII or not, but gnu grep was calling malloc for this, even for an ASCII byte. It is a fair criticism to say this is unacceptable, whatever the intentions of the authors may be. -rob
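Rob's "just a few instructions" claim is easy to verify: decoding one rune from well-formed UTF-8 is a handful of shifts and masks, and the ASCII case is a single compare. A sketch (it omits the validation of malformed input that a real decoder needs, so it assumes well-formed bytes):

```python
def decode_rune(b, i):
    # decode one codepoint from bytes b at offset i; return (rune, next i)
    c = b[i]
    if c < 0x80:                          # ASCII: one compare, no work
        return c, i + 1
    if c < 0xE0:                          # 110xxxxx: 2-byte sequence
        return ((c & 0x1F) << 6) | (b[i+1] & 0x3F), i + 2
    if c < 0xF0:                          # 1110xxxx: 3-byte sequence
        return (((c & 0x0F) << 12) | ((b[i+1] & 0x3F) << 6)
                | (b[i+2] & 0x3F)), i + 3
    return (((c & 0x07) << 18) | ((b[i+1] & 0x3F) << 12)  # 4-byte sequence
            | ((b[i+2] & 0x3F) << 6) | (b[i+3] & 0x3F)), i + 4
```

Nothing in this path allocates, which is why a malloc per ASCII byte is indefensible.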
Re: [9fans] simplicity
Don't complain, at least it is not producing random behaviour, I have seen versions of gnu awk that when fed plain ASCII input, if the locale was UTF-8, rules would match random lines of input, the fix? set the locale to 'C' at the top of all your scripts (and don't even think of dealing with files which actually contain non-ASCII UTF-8). This was some years ago, it might be fixed by now, but it demonstrates how the locale insanity makes life so much more fun. And talking of simplicity, don't forget to mention X. By chance I just found this gem in one of the many X headers: #define NBBY 8 /* number of bits in a byte */ uriel On 9/18/07, Rob Pike [EMAIL PROTECTED] wrote: On 9/17/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote: erik quanstrom wrote: i think the devolution of gnu grep is quite instructive. ... it gets to the heart of why plan9's invention and use (thanks, rob, ken) of utf-8 is so great. If the problem is that Gnu grep converts any non-8-bit character set to wchar_t (the equivalent of Plan 9 rune), then it's not really a fair criticism of the software. The conversion approach handles a wide variety of character encoding schemes, whereas grepping the encodings directly (the fast approach) doesn't work well for many non-UTF-8 encodings. Well, on a 2GHz x86, gnu grep ran for me at about 9600 baud on an ASCII file if I set my locale to the UTF-8 locale. UTF-8 is ASCII compatible - explicitly, publicly, and on purpose - so there is no excuse for this sort of performance penalty. To be specific, in the UTF-8 locale it should take just a few instructions to convert any character to wchar_t, ASCII or not, but gnu grep was calling malloc for this, even for an ASCII byte. It is a fair criticism to say this is unacceptable, whatever the intentions of the authors may be. -rob
Re: [9fans] simplicity
Iruata Souza wrote: On 9/18/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: But if they're trying to match an alphabetic character class, the result *should* depend on the locale. ... so what *should* the result be if the locale specifies an ideographic script? the result *should* be 'now go and use plan 9' That doesn't address the issue Dave L raised. I don't know off hand what POSIX decreed for character classes involving ideographs. My guess is that they have to not count as uppercase or lowercase, and probably not as alphabetic nor alphanumeric. You could ask similar questions about accented characters in alphabet-based languages. This isn't about character coding so much as it is about classification.
Re: [9fans] simplicity
On 9/18/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote: Iruata Souza wrote: On 9/18/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: But if they're trying to match an alphabetic character class, the result *should* depend on the locale. ... so what *should* the result be if the locale specifies an ideographic script? the result *should* be 'now go and use plan 9' That doesn't address the issue Dave L raised. I can't see why not. iru
Re: [9fans] simplicity
On 9/16/07, Francisco J Ballesteros [EMAIL PROTECTED] wrote: Any other suggestion? ELF prelinking (on, e.g., FC7) how to take a bad decision and make it worse ron
Re: [9fans] simplicity
oh, yeah, the utf8 example is great. abiword used to be fast. before internationalization. Now it is so slow as to be totally useless. ron
Re: [9fans] simplicity
Francisco J Ballesteros wrote: the slides are a bunch of programs. In fact, I use a terminal to compile and run programs from the 9.intro.pdf book. ... By the way, I've been reading through that book in my spare time, and it's a pretty good resource.
Re: [9fans] simplicity
erik quanstrom wrote: i think the devolution of gnu grep is quite instructive. ... it gets to the heart of why plan9's invention and use (thanks, rob, ken) of utf-8 is so great. If the problem is that Gnu grep converts any non-8-bit character set to wchar_t (the equivalent of Plan 9 rune), then it's not really a fair criticism of the software. The conversion approach handles a wide variety of character encoding schemes, whereas grepping the encodings directly (the fast approach) doesn't work well for many non-UTF-8 encodings.
Re: [9fans] simplicity
Steve Simon wrote: Top of my over-complex list would be configure. My experience with configure is that it seldom selects the compiler I wanted to use, for some reason preferring the Gnu software even though the conventional Unix versions work at least as well for the purpose.
Re: [9fans] simplicity
erik quanstrom wrote: i think the devolution of gnu grep is quite instructive. ... it gets to the heart of why plan9's invention and use (thanks, rob, ken) of utf-8 is so great. If the problem is that Gnu grep converts any non-8-bit character set to wchar_t (the equivalent of Plan 9 rune), then it's not really a fair criticism of the software. The conversion approach handles a wide variety of character encoding schemes, whereas grepping the encodings directly (the fast approach) doesn't work well for many non-UTF-8 encodings. performance may suck, but that's just a symptom of a bigger problem. wchar_t is not the equivalent of Rune. Rune is always utf-8. wchar_t can be whatever. this is not a feature. it is a bug. suppose Linux user a and user b grep the same text file for the same string. results will depend on the users' locales. contrast plan 9. any two users grepping the same file for the same string will get the same results. in either case a character set conversion might be necessary to match the locale. but in the plan 9 case, one conversion will fix things for any plan 9 user. in the Linux case, there is no conversion that will fix things for any Linux user. - erik p.s. gnu grep does special-case utf-8 and avoid wchar_t conversions
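erik's determinism point is easy to demonstrate: matching over the UTF-8 bytes themselves gives one answer everywhere, with no environment variable in sight. A sketch using Python's bytes regexps as a stand-in for plan 9 grep:

```python
import re

data = "niño".encode("utf-8")   # the file's bytes, identical for every user

# byte-level [a-z] never matches ñ, in any locale, for any user
assert re.findall(rb"[a-z]+", data) == [b"ni", b"o"]

# searching for a literal string is a plain byte comparison, so niño
# is found regardless of the searcher's locale settings
assert re.search("niño".encode("utf-8"), data) is not None
```

The price, as the rest of the thread discusses, is that accent-insensitive matching needs an explicit mechanism (expanding the class from the unicode tables) rather than an implicit locale.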
Re: [9fans] simplicity
In my experience, the one thing that really gets Plan 9 across to people is the telco server. That's an example of something that you can't nicely do in Unix, and that exhibits power and elegance as a consequence of a few basic design choices.
[9fans] simplicity
Some time ago, Ron said I know we have some faculty on this list. Please talk to your students :-) regarding the madness of making complex software (that time, it was about configure). I have allocated half of the presentation lecture for this semester to Why does this matter at all. Among other things, I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture. Any other suggestion?
Re: [9fans] simplicity
I have allocated half of the presentation lecture for this semester to Why does this matter at all. Among other things, I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture. Any other suggestion? Please do put up the slides online, if possible, for the benefit of the students on this list :) -- Anant
Re: [9fans] simplicity
Top of my over-complex list would be configure. -Steve
Re: [9fans] simplicity
the slides are a bunch of programs. In fact, I use a terminal to compile and run programs from the 9.intro.pdf book. I introduce mistakes and show the consequences, and then I fix them. In this particular course, I use slides just for the introduction class. I'll put them on the web once we update the web pages for the semester. On 9/16/07, Anant Narayanan [EMAIL PROTECTED] wrote: I have allocated half of the presentation lecture for this semester to Why does this matter at all. Among other things, I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture. Any other suggestion? Please do put up the slides online, if possible, for the benefit of the students on this list :) -- Anant
Re: [9fans] simplicity
I know we have some faculty on this list. Please talk to your students :-) regarding the madness of making complex software (that time, it was about configure). I have allocated half of the presentation lecture for this semester to Why does this matter at all. Among other things, I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture. Any other suggestion? i think the devolution of gnu grep is quite instructive. once upon a time it was simple and very fast. (thanks, mike.) today it is neither. the last time i tried to fix a utf-8 problem (it was 80 times slower processing utf8 than ascii), i gave up after encountering dozens of if(special char set){fast version}else{slow version} constructions. it gets to the heart of why plan9's invention and use (thanks, rob, ken) of utf-8 is so great. and speaking of regular expressions, one could use russ' excellent work on perl regular expressions vs. plan 9 regular expressions to talk about how seemingly straightforward extensions are not always Mostly Harmless; complexity is a sneaky thing. - erik