i had to dig this off 9fans.net/archive. htmlfmt does some very bad things with non-ascii characters. i hope i put them back correctly.
> Yes, and then there is locale: does [a-z] include ij when you run it
> in Holland (it should)? Does it include á, è, ô in France (it should)?
> Does it include ø, å in Norway (it should not)? And what happens when
> you evaluate "è" < "o" (it depends)?
>
> Fixing awk is much harder than anyone thinks. I had a chat about it with
> Brian Kernighan and he says he's been thinking about fixing awk for a
> long time, but that it really is a hard problem.

how does a program know where it's being run? ☺

how do you write a program that processes byte streams from a dutch user and from a norwegian? how does one deal with a multi-language file?

i see some problems with localized regexps. like pre-utf character sets, it's impossible to tell from a byte stream what the character set is. two users can run the same program and get different results. (how do you test in an environment like this?) and, of course, you can't switch locale within a file, making multi-language files difficult.

perhaps it would be more effective to break down the concept a bit. instead of a general locale hammer, why not expose some operations that could go into a locale? for example, have a base-character folding switch that allows regexps to fold codepoints into base codepoints so that íïìîi -> i. this information is in the unicode tables. perhaps the language-dependent character mapping should be specified explicitly. &c.

- erik
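
a minimal sketch of the base-character folding idea, in python rather than awk. the names fold_base and search_folded are made up for illustration, and the approach (canonical decomposition, then dropping combining marks) is only one assumed way of reading "fold codepoints into base codepoints" from the unicode tables; it is not how any existing awk or regexp library does it.

```python
import re
import unicodedata


def fold_base(text: str) -> str:
    """Fold each codepoint to its base codepoint.

    Assumption: "base" means what is left after canonical
    decomposition (NFD) once the combining marks are dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_folded(pattern: str, text: str):
    """Hypothetical folding switch: match the pattern against folded text."""
    return re.search(pattern, fold_base(text))


if __name__ == "__main__":
    print(fold_base("íïìîi"))
    print(search_folded(r"[a-z]+", "été"))
```

the result does not depend on where the program is run, so two users get the same answer. note that ø has no canonical decomposition and is left alone, while å folds to a; whether that is right for a norwegian is exactly the sort of language-dependent mapping that would have to be specified explicitly.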