Re: [9fans] awk, not utf aware...

erik quanstrom Thu, 28 Feb 2008 07:30:36 -0800

i had to dig this off 9fans.net/archive.  htmlfmt does some very bad things
with non-ascii characters.  i hope i put them back correctly.


> Yes, and then there is locale: does [a-z] include ĳ when you run it
> in Holland (it should)?  Does it include á, è, ô in France (it should)?
> Does it include ø, å in Norway (it should not)?  And what happens when
> you evaluate "è"< "o" (it depends)?
> 
> Fixing awk is much harder than anyone thinks.  I had a chat about it with
> Brian Kernighan and he says he's been thinking about fixing awk for a
> long time, but that it really is a hard problem.

how does a program know where it's being run?  ☺ how do you write a
program that processes byte streams from a dutch user and from a
norwegian?  how does one deal with a multi-language file?

i see some problems with localized regexps.  like pre-utf character
sets, it's impossible to tell from a byte stream what the character
set is.  two users can run the same program and get different results.
(how do you test in an environment like this?) and, of course, you
can't switch locale within a file making multi-language files
difficult.
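
a small sketch of the byte-stream ambiguity, in python only for
illustration.  the choice of byte (0xE8) and of the two character
sets is an assumption made for the example, not anything from the
thread: the same byte is a valid character in both, and nothing in
the stream says which one was meant.

```python
# one byte, with no label saying which character set it belongs to
raw = bytes([0xE8])


def decode_as(data: bytes, charset: str) -> str:
    """decode data under the character set the reader assumes."""
    return data.decode(charset)


# two readers, two assumptions, same input bytes
latin1 = decode_as(raw, "latin-1")      # western european reading
cyrillic = decode_as(raw, "iso8859_5")  # cyrillic reading

# both decodes succeed, so the program cannot detect a wrong guess
print(repr(latin1), repr(cyrillic))
```

a locale-dependent regexp has the same shape of problem: the answer
depends on a setting that is not carried in the data.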

perhaps it would be more effective to break down the concept
a bit.  instead of a general locale hammer, why not expose some
operations that could go into a locale?  for example, have a
base-character folding switch that allows regexps to fold codepoints
into base codepoints so that íïìîi -> i.  this information is in the
unicode tables.  perhaps the language-dependent character mapping
should be specified explicitly. &c.
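
a minimal sketch of what such a folding switch might do, written in
python rather than awk.  it assumes that "base codepoint" can be
approximated by canonical decomposition (NFD) followed by dropping
combining marks; the names fold_to_base and folded_search are made
up for the example.

```python
import re
import unicodedata


def fold_to_base(text: str) -> str:
    """fold accented letters to their base letters.

    decompose each character canonically (NFD), then drop the
    combining marks, keeping only the base codepoints.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def folded_search(pattern: str, text: str):
    """search the folded text, so that [a-z] also matches e.g. 'è'."""
    return re.search(pattern, fold_to_base(text))


print(fold_to_base("íïìîi"))
print(folded_search("[a-z]+", "è") is not None)
```

the fold is only as good as the decomposition data: ø has no
canonical decomposition, so it is left alone by this sketch, and
whether that is the right answer for a given language is exactly the
kind of mapping that would have to be specified explicitly.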

- erik
