Re: [9fans] awk, not utf aware...

2008-03-03 Thread Jack Johnson

On Thu, Feb 28, 2008 at 6:10 AM, erik quanstrom [EMAIL PROTECTED] wrote:
  perhaps it would be more effective to break down the concept
  a bit.  instead of a general locale hammer, why not expose some
  operations that could go into a locale?  for example, have a base-
  character folding switch that allows regexps to fold codepoints into
  base codepoints so that íïìîi -> i.  this information is in the unicode
  tables.  perhaps the language-dependent character mapping should
  be specified explicitly. c.

Loosely-related tangent:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg20395.html

 On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
 in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.

 I was very astonished, when I copied a mac-filename, pasted into a
 texteditor and looked at the file:

 In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
 means the letter a followed by a $0308. (Combining diacritical marks)
 So the Mac combines the letter a with the two points above it instead
 using the E4 letter
 Now the things are clear: The filenames are different, in spite of
 looking equally.

So, if folding codepoints is a reasonable tactic, how many
representations do you need to fold?  How many binary representations
are needed to fold íïìîi -> i?

-Jack
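
To make the quoted observation concrete, here is a minimal sketch in C of the two byte sequences mentioned above. The names `precomposed`, `decomposed` and `samebytes` are made up for illustration; the byte values are the ones given in the quoted message.

```c
#include <string.h>

/* "ä" as a single precomposed codepoint U+00E4: UTF-8 bytes C3 A4 */
static const char precomposed[] = "\xC3\xA4";

/* "ä" as 'a' (U+0061) followed by the combining diaeresis U+0308:
 * UTF-8 bytes 61 CC 88 */
static const char decomposed[] = "a\xCC\x88";

/* Returns 1 when two names are byte-for-byte identical, which is all
 * that a strcmp-based comparison looks at. */
static int
samebytes(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

Under these assumptions `samebytes(precomposed, decomposed)` is 0 even though both strings render as the same letter, so any folding scheme has to decide whether to treat the two forms as equal.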

Re: [9fans] awk, not utf aware...

2008-03-03 Thread erik quanstrom

  On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
  in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
 
  I was very astonished, when I copied a mac-filename, pasted into a
  texteditor and looked at the file:
 
  In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
  means the letter a followed by a $0308. (Combining diacritical marks)
  So the Mac combines the letter a with the two points above it instead
  using the E4 letter
  Now the things are clear: The filenames are different, in spite of
  looking equally.
 
 So, if folding codepoints is a reasonable tactic, how many
 representations do you need to fold?  How many binary representations
 are needed to fold íïìîi -> i?

i didn't make my point very well.  in this case i was suggesting a -f flag
for grep that would map codepoints into their base codepoint.  the match
result would be the original text --- in the manner of the -i flag.
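
A rough sketch of what such a fold could look like, in C. Everything here is an assumption for illustration: the `foldtab` table is hand-written and covers only the four accented codepoints from the example, and `foldbase` is a hypothetical name, not an existing grep or libc function. A real table would be generated from the decomposition data in the unicode tables.

```c
#include <stddef.h>

/* Hypothetical hand-written fold table: only the accented forms of 'i'
 * from the example above. */
static const struct { unsigned long from, to; } foldtab[] = {
	{ 0x00EC, 'i' },	/* ì */
	{ 0x00ED, 'i' },	/* í */
	{ 0x00EE, 'i' },	/* î */
	{ 0x00EF, 'i' },	/* ï */
};

/* Map a codepoint to its base codepoint; codepoints that are not in
 * the table pass through unchanged. */
static unsigned long
foldbase(unsigned long r)
{
	size_t i;

	for(i = 0; i < sizeof foldtab / sizeof foldtab[0]; i++)
		if(foldtab[i].from == r)
			return foldtab[i].to;
	return r;
}
```

A matcher would apply `foldbase` to both the pattern and the input while comparing, but print the original text, as with -i.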

separately, however ...

utf combining characters are a really unfortunate choice, imho.  there
is no limit to the number of combining codepoints one can add to
a base codepoint.  you can, for example build a single letter like this
U+0061 U+0302 ... U+0302
i don't think it's possible to build legible glyphs from bitmaps using
combining diacriticals.

therefore, i would argue for reducing letters made up of base+combiners
to a precombined codepoint whenever possible.  it would be helpful
if tcs did this.  unfortunately some transliterations of russian into the roman
alphabet use characters with no precombined form in unicode.

rob probably has a more informed opinion on this than i.

- erik

Re: [9fans] awk, not utf aware...

2008-02-29 Thread Douglas A. Gwyn

Joel C. Salomon wrote:
 Also recall that sizeof('c') == sizeof(int).  I suspect, though, that
 literals like 'abcd' are left from the B (word-addressable, not
 byte-addressable) days.

Yes, in C ordinary character constants have always had type int.
Multi-character constants were used in the first C version of troff,
for one example, so the language permits them even though their use
has nonportable aspects.
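
The point about character constants having type int can be checked with a one-line sketch in C (the function name `charconst_is_int` is made up for illustration, and the result holds for C only, not C++):

```c
/* In C an ordinary character constant such as 'c' has type int,
 * so the two sizes compared here are the same. */
static int
charconst_is_int(void)
{
	return sizeof('c') == sizeof(int);
}
```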

Re: [9fans] awk, not utf aware...

2008-02-28 Thread erik quanstrom

i had to dig this off 9fans.net/archive.  htmlfmt does some very bad things
with non-ascii characters.  i hope i put them back correctly.

 Yes, and then there is locale: does [a-z] include ĳ when you run it
 in Holland (it should)?  Does it include á, è, ô in France (it should)?
 Does it include ø, å in Norway (it should not)?  And what happens when
 you evaluate è < o (it depends)?
 
 Fixing awk is much harder than anyone thinks.  I had a chat about it with
 Brian Kernighan and he says he's been thinking about fixing awk for a
 long time, but that it really is a hard problem.

how does a program know where it's being run?  ☺ how do you write a
program that processes byte streams from a dutch user and from a
norwegian?  how does one deal with a multi-language file.

i see some problems with localized regexps.  like pre-utf character
sets, it's impossible to tell from a byte stream what the character
set is.  two users can run the same program and get different results.
(how do you test in an environment like this?) and, of course, you
can't switch locale within a file making multi-language files
difficult.

perhaps it would be more effective to break down the concept
a bit.  instead of a general locale hammer, why not expose some
operations that could go into a locale?  for example, have a base-
character folding switch that allows regexps to fold codepoints into
base codepoints so that íïìîi -> i.  this information is in the unicode
tables.  perhaps the language-dependent character mapping should
be specified explicitly. c.

- erik

Re: [9fans] awk, not utf aware...

2008-02-28 Thread Aharon Robbins

 Date: Wed, 27 Feb 2008 21:01:33 +0100
 From: Uriel [EMAIL PROTECTED]
 Subject: Re: [9fans] awk, not utf aware...
 To: Fans of the OS Plan 9 from Bell Labs 9fans@cse.psu.edu

 None of those issues are specific to AWK, they apply just as well to
 sed(1) or any program dealing with regexps. I think the plan9 tools
 demonstrate that it is not so hard to find a 'good enough' solution;
 and the lunix locale debacle demonstrates that if you want to get it
 'right' you will end up with a nightmare.

Plan 9 had the luxury of starting over with Unicode from the ground
up. Many of the C mb* interfaces predate Unicode, as do many of the
character encodings in use in different parts of the world. Unix vendors
(and standards bodies) have the very real problems of trying to make
their software work, and continue to work for the foreseeable future,
in different countries, encodings, etc.

I am not saying that the POSIX locale stuff is wonderful, elegant,
clean, etc.  It has real problems, and for the most recent gawk
release, gawk no longer uses the locale's decimal point for numeric
output by default.

But one has to give the standards groups and Unix vendors credit for
trying to grapple with a real problem instead of side stepping it and
then crowing about it.

 The problem with awk is that it is not a native plan9 app, and its
 simian nature shows in too many places. For example system() and | are
 badly broken:

 %  echo |awk '{print |"echo $KSH_VERSION"}'
 @(#)PD KSH v5.2.14 99/07/13.2

Why is this broken?  If the shell that awk is running is PDKSH, or
KSH_VERSION exists in the environment, this is to be expected.

For awk specifically, off the top of my head, the functions that have to
be character-set aware are: index, substr, length, tolower, toupper, and
match.  Gawk has been multibyte aware for several years, although there
were some bugs initially.  And someone recently pointed out another one:

str = sprintf("%.5s", otherstr)

has to work in terms of characters, not bytes, which I overlooked
and still have to fix.
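
As an illustration of the characters-versus-bytes distinction (a sketch only, not gawk's code), the precision of a "%.5s" conversion can be turned into a byte count before the copy is made. The helper name `utfprefixbytes` is hypothetical, and the sketch assumes the input is well-formed UTF-8.

```c
#include <stddef.h>

/* Return the number of bytes occupied by the first nchars characters
 * of the UTF-8 string s, or by all of s if it is shorter.
 * Assumes well-formed UTF-8: continuation bytes look like 10xxxxxx. */
static size_t
utfprefixbytes(const char *s, size_t nchars)
{
	const unsigned char *p = (const unsigned char*)s;
	size_t n = 0;

	while(p[n] != 0 && nchars > 0){
		n++;				/* lead byte */
		while((p[n] & 0xC0) == 0x80)	/* its continuation bytes */
			n++;
		nchars--;
	}
	return n;
}
```

With this, something like `printf("%.*s", (int)utfprefixbytes(str, 5), str)` prints five characters rather than five bytes.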

 Boyd made a native port of awk that fixed most (all?) of these issues,
 it can be found somewhere in his contrib dir but I don't think it is
 production-ready.

I remember talking to him about this some, since for a long while the Plan
9 awk was one that was forked from BWK's circa 1993 and needed updating.

 On Wed, Feb 27, 2008 at 4:54 PM, Sape Mullender
 [EMAIL PROTECTED] wrote:
   There is split and other functions,
for example:

toupper("aí")
gives
Aí

My guess is that there are many more little (or not) corners where it
doesn't work.

   Yes, and then there is locale: does [a-z] include ĳ when you run it
   in Holland (it should)?  Does it include á, è, ô in France (it should)?
   Does it include ø, å in Norway (it should not)?  And what happens when
   you evaluate è < o (it depends)?

   Fixing awk is much harder than anyone thinks.  I had a chat about it with
   Brian Kernighan and he says he's been thinking about fixing awk for a
   long time, but that it really is a hard problem.

Indeed.  I bit the bullet; Brian hasn't been willing to suffer the complaints,
and I don't blame him. :-)  You can see some of his travails by looking
at the CHANGES file in his distribution, available from his Bell Labs
and Princeton web pages.

As far as I know, gawk and the Solaris /usr/xpg4/bin/awk are the only
awks that are multibyte aware.  The Solaris version is derived from the MKS
one (see the code from opensolaris.org) with multibyte fixes. I can supply
simple patches to make it compile on Linux if anyone wants.  This version
doesn't handle some dark corners, but has the advantage of being
very small.

Arnold

Re: [9fans] awk, not utf aware...

2008-02-28 Thread Uriel

   %  echo |awk '{print |"echo $KSH_VERSION"}'
   @(#)PD KSH v5.2.14 99/07/13.2

  Why is this broken?  If the shell that awk is running is PDKSH, or
  KSH_VERSION exists in the environment, this is to be expected.

I thought it was obvious that the output was from a 'standard' Plan 9
terminal. But given the percentage of people actually using plan9 in
this list, I guess I should have been much more explicit.

And the problem is precisely that the environment under which awk run
commands is completely different from the one awk is run in; in other
words, awk spreads its 'simian' (ape-ish) nature.

uriel

Re: [9fans] awk, not utf aware...

2008-02-28 Thread erik quanstrom

 I thought it was obvious that the output was from a 'standard' Plan 9
 terminal. But given the percentage of people actually using plan9 in
 this list, I guess I should have been much more explicit.
 
 And the problem is precisely that the environment under which awk run
 commands is completely different from the one awk is run in; in other
 words, awk spreads its 'simian' (ape-ish) nature.

i think that awk is in a no-win situation here.  if it used rc, then
awk scripts from plan 9 would break on unix and vice versa.  sam and
acme have similar issues in p9p's environment.  i don't see how either
using the native shell or using the shell from the original
environment is wrong a priori.  awk picks a lane and sticks to it.
i'd bet that benefits other ape stuff like lp.

if you really don't like this situation, perhaps the solution is to
improve upon awk.  a plan 9 scripting language based on sre's --- as
suggested by rob --- could be really cool.

- erik

Re: [9fans] awk, not utf aware...

2008-02-27 Thread erik quanstrom

 There is split and other functions,
 for example:
 
 toupper("aí")
 gives
 Aí
 
 My guess is that there are many more little (or not) corners where it
 doesn't work.
 We can go on and on looking for crevices and hiding the bugs further
 under the rug
 so that they are not evident and find everyone completely unaware,
 leave awk as it is now or really fix the problem. The first approach
 doesn't work. I am going to take
 the second till I have time to take the third which means use runes or
 at least revise all the
 code so that it is uniformly aware of the existence of non-ascii characters.

i don't understand this approach.  you propose redoing a fundamental
part of awk.   yet at the end you won't have solved the bug that's bothering
you.

ignoring the fact that awk is an ape program and doesn't use runes, the
problem with toupper is independent of the internal representation
of strings. as far as i can tell, ape doesn't even have towupper and towlower.

so if you provide those functions, fixing toupper and tolower could be
a 5 minute fix.  and you know you won't have broken anything else.
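
to give an idea of the size of such a function, here is a sketch of a stand-in for towupper in C.  the name `runeupper` is hypothetical and the sketch covers only ascii and the latin-1 supplement block; a complete version needs the full unicode case tables.

```c
/* Hypothetical stand-in for towupper: handles ASCII and the
 * Latin-1 Supplement only, everything else passes through. */
static unsigned long
runeupper(unsigned long r)
{
	if(r >= 'a' && r <= 'z')
		return r - ('a' - 'A');
	/* U+00E0..U+00FE map to U+00C0..U+00DE,
	 * except U+00F7 (division sign), which has no case */
	if(r >= 0x00E0 && r <= 0x00FE && r != 0x00F7)
		return r - 0x20;
	return r;
}
```

awk's toupper would still have to decode each character of the string (for example with mbtowc), call this, and re-encode the result.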

/sys/doc/utf.ps is worth a read.  it's not too hard to think of situations
that depend on character boundaries or operate on non-ascii characters.
generally there are few.  for example, rc only bothers with character
boundaries in matching. perhaps you could build a utf testsuite for awk.
make sure to use non-latin1 languages, too.

- erik

Re: [9fans] awk, not utf aware...

2008-02-27 Thread Sape Mullender

 There is split and other functions,
 for example:
 
 toupper("aí")
 gives
 Aí
 
 My guess is that there are many more little (or not) corners where it
 doesn't work.

Yes, and then there is locale: does [a-z] include ĳ when you run it
in Holland (it should)?  Does it include á, è, ô in France (it should)?
Does it include ø, å in Norway (it should not)?  And what happens when
you evaluate è < o (it depends)?

Fixing awk is much harder than anyone thinks.  I had a chat about it with
Brian Kernighan and he says he's been thinking about fixing awk for a
long time, but that it really is a hard problem.

Sape

Re: [9fans] awk, not utf aware...

2008-02-27 Thread Uriel

None of those issues are specific to AWK, they apply just as well to
sed(1) or any program dealing with regexps. I think the plan9 tools
demonstrate that it is not so hard to find a 'good enough' solution;
and the lunix locale debacle demonstrates that if you want to get it
'right' you will end up with a nightmare.

The problem with awk is that it is not a native plan9 app, and its
simian nature shows in too many places. For example system() and | are
badly broken:

%  echo |awk '{print |"echo $KSH_VERSION"}'
@(#)PD KSH v5.2.14 99/07/13.2

Boyd made a native port of awk that fixed most (all?) of these issues,
it can be found somewhere in his contrib dir but I don't think it is
production-ready.

uriel

On Wed, Feb 27, 2008 at 4:54 PM, Sape Mullender
[EMAIL PROTECTED] wrote:
  There is split and other functions,
   for example:
  
   toupper("aí")
   gives
   Aí
  
   My guess is that there are many more little (or not) corners where it
   doesn't work.

  Yes, and then there is locale: does [a-z] include ĳ when you run it
  in Holland (it should)?  Does it include á, è, ô in France (it should)?
  Does it include ø, å in Norway (it should not)?  And what happens when
  you evaluate è < o (it depends)?

  Fixing awk is much harder than anyone thinks.  I had a chat about it with
  Brian Kernighan and he says he's been thinking about fixing awk for a
  long time, but that it really is a hard problem.

 Sape

[9fans] awk, not utf aware...

2008-02-26 Thread Gorka Guardiola

I think this has come up before, but I didn't find a reply.
If I do in awk something like:

split($0, c, "");

c should be an array of Runes internally, UTF externally, but apparently
it is not. Is it just broken? Is there a replacement? Is it just the
builtins, or is the whole awk broken?

Example, freqpair

--
#!/bin/awk -f

{
	n = split($0, c, "");
	for(i=1; i<n; i++){
		pair=c[i] c[i+1]
		f[pair]++;
	}
}
END{
	for(h in f)
		printf("%d %s\n", f[h], h);
}

--

% echo abcd|freqpair
1 ab
1 cd
1 bc
% echo aícd|freqpair
1 cd
1 �c
1 í
1 a�


where the ? is a Peter face...

Thanks.

-- 
- curiosity sKilled the cat

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Martin Neubauer

Awk is one of the few programs in the distribution that is maintained
externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
actually be the only one - I didn't bother to check.) A quick glimpse at
lex.c suggests that awk scans input one char at a time. In hindsight I'm a
bit surprised that I haven't got bitten by this, but I probably didn't split
within multibyte sequences. It's probably not too hard to change awk to read
runes for the price of creating ``the other one true awk.''

Martin

* Gorka Guardiola ([EMAIL PROTECTED]) wrote:
 I think this has come up before, but I didn't find a reply.
 If I do in awk something like:
 
 split($0, c, "");
 
 c should be an array of Runes internally, UTF externally, but apparently
 it is not. Is it just broken? Is there a replacement? Is it just the
 builtins, or is the whole awk broken?
 
 Example, freqpair
 
 --
 #!/bin/awk -f
 
 {
 	n = split($0, c, "");
 	for(i=1; i<n; i++){
 		pair=c[i] c[i+1]
 		f[pair]++;
 	}
 }
 END{
 	for(h in f)
 		printf("%d %s\n", f[h], h);
 }
 
 --
 
 % echo abcd|freqpair
 1 ab
 1 cd
 1 bc
 % echo aícd|freqpair
 1 cd
 1 �c
 1 í
 1 a�
 
 
 where the ? is a Peter face...
 
 Thanks.
 
 -- 
 - curiosity sKilled the cat

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Gorka Guardiola

On Tue, Feb 26, 2008 at 2:16 PM, Martin Neubauer [EMAIL PROTECTED] wrote:
 Awk is one of the few programs in the distribution that is maintained
  externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
  actually be the only one - I didn't bother to check.) A quick glimpse at
  lex.c suggests that awk scans input one char at a time. In hindsight I'm a
  bit surprised that I haven't got bitten by this, but I probably didn't split
  within multibyte sequences. It's probably not too hard to change awk to read
  runes for the price of creating ``the other one true awk.''


I don't know if it is as easy. I leave it in my todo list for the future :-).
Anyway, the BUGS section should say it does not know about UTF.
I'll send a patch.


-- 
- curiosity sKilled the cat

Re: [9fans] awk, not utf aware...

2008-02-26 Thread erik quanstrom

 I think this has come up before, but I didn't find a reply.
 If I do in awk something like:
 
 split($0, c, "");
 
 c should be an array of Runes internally, UTF externally, but apparently
 it is not. Is it just broken? Is there a replacement? Is it just the
 builtins, or is the whole awk broken?

i think the comments about this problem are missing the point
a bit.  utf8 should be transparent to awk unless the situation demands
that awk needs to know the length of a character.  it's not necessary
to keep strings as Rune*s internally to work with utf8.  splitting on
"" is a special case where awk does need to know the length of
a character.  e.g. this script should work fine

; cat /tmp/smile
#!/bin/awk -f
{
	n = split($0, c, "☺");
	for(i = 1; i <= n; i++)
		print c[i]
}
; echo fu☺bar|/tmp/smile
fu
bar

but splitting on "" won't.  i attached a patch that fixes this problem
as an illustration.  i'm not using utflen because pcc won't see it.
it's an ugly patch.

i don't think i know what a proper fix for awk would be.  i wouldn't
think there are many cases like this, but i haven't spent much time
with awk internals.

- erik

--

9diff run.c
/n/sources/plan9//sys/src/cmd/awk/run.c:1191,1196 - run.c:1191,1219
  	return(False);
  }
  
+ static int
+ utf8len(char *s)
+ {
+ 	int c, n, i;
+ 
+ 	c = *(unsigned char*)s++;
+ 	if ((c&0xe0) == 0xc0)
+ 		n = 2;
+ 	else if ((c&0xf0) == 0xe0)
+ 		n = 3;
+ 	else if ((c&0xf8) == 0xf0)
+ 		n = 4;
+ 	else
+ 		return 1;	//-1;
+ 	i = n-1;
+ 	if(strlen(s) < i)
+ 		return 1;	// -1;
+ 	for(; i-- && (c = *(unsigned char*)s++);)
+ 		if(0x80 != (c&0xc0))
+ 			return 1;	//-1;
+ 	return n;
+ }
+ 
  Cell *split(Node **a, int nnn)	/* split(a[0], a[1], a[2]); a[3] is type */
  {
  	Cell *x = 0, *y, *ap;
/n/sources/plan9//sys/src/cmd/awk/run.c:1279,1290 - run.c:1302,1316
  			s++;
  		}
  	} else if (sep == 0) {	/* new: split(s, a, "") => 1 char/elem */
- 		for (n = 0; *s != 0; s++) {
- 			char buf[2];
+ 		int i, len;
+ 		char buf[5];
+ 		for (n = 0; *s != 0; s += len) {
  			n++;
  			sprintf(num, "%d", n);
- 			buf[0] = *s;
- 			buf[1] = 0;
+ 			len = utf8len(s);
+ 			for(i = 0; i < len; i++)
+ 				buf[i] = s[i];
+ 			buf[len] = 0;
  			if (isdigit(buf[0]))
  				setsymtab(num, buf, atof(buf), STR|NUM, (Array *) ap->sval);
  			else

Re: [9fans] awk, not utf aware...

2008-02-26 Thread geoff

Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
mbtowc/wctomb functions to deal with UTF.  Thus it uses mblen rather
than utflen or utf8len.

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Pietro Gagliardi

And it's wonderful that the C standard defines a character literal as  
so:


char-literal:
' characters '
characters:
character
characters character

(or something like that)

Question, then: why do we need wchar_t/Rune?

On Feb 26, 2008, at 4:08 PM, [EMAIL PROTECTED] wrote:


Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
mbtowc/wctomb functions to deal with UTF.  Thus it uses mblen rather
than utflen or utf8len.

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Steven Vormwald

On Tue, 2008-02-26 at 16:21 -0500, Pietro Gagliardi wrote:
 And it's wonderful that the C standard defines a character literal as  
 so:
 
   char-literal:
   ' characters '
   characters:
   character
   characters character
 
 (or something like that)
 
 Question, then: why do we need wchar_t/Rune?

The definitions are ( used to indicate non-terminals in the
grammar...):

(6.4.4.4) character-constant:
' c-char-sequence '
L' c-char-sequence '

(6.4.4.4) c-char-sequence:
c-char
c-char-sequence c-char

(6.4.4.4) c-char:
any member of the source character set except the single-quote ',
backslash \, or new-line character

escape-sequence

Steven Vormwald
[EMAIL PROTECTED]

Re: [9fans] awk, not utf aware...

2008-02-26 Thread erik quanstrom

 And it's wonderful that the C standard defines a character literal as  
 so:
 
   char-literal:
   ' characters '
   characters:
   character
   characters character
 
 (or something like that)
 
 Question, then: why do we need wchar_t/Rune?
 

because we have more than 255 characters.

- erik

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Pietro Gagliardi


Yes. I'm too lazy to pick up my copy of the standard.

On Feb 26, 2008, at 4:32 PM, Steven Vormwald wrote:


On Tue, 2008-02-26 at 16:21 -0500, Pietro Gagliardi wrote:

And it's wonderful that the C standard defines a character literal as
so:

char-literal:
' characters '
characters:
character
characters character

(or something like that)

Question, then: why do we need wchar_t/Rune?


The definitions are ( used to indicate non-terminals in the
grammar...):

(6.4.4.4) character-constant:
' c-char-sequence '
L' c-char-sequence '

(6.4.4.4) c-char-sequence:
c-char
c-char-sequence c-char

(6.4.4.4) c-char:
any member of the source character set except the single-quote ',
backslash \, or new-line character

escape-sequence

Steven Vormwald
[EMAIL PROTECTED]

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Pietro Gagliardi


(which I have sitting next to me)

On Feb 26, 2008, at 4:40 PM, Pietro Gagliardi wrote:


Yes. I'm too lazy to pick up my copy of the standard.

On Feb 26, 2008, at 4:32 PM, Steven Vormwald wrote:


On Tue, 2008-02-26 at 16:21 -0500, Pietro Gagliardi wrote:
And it's wonderful that the C standard defines a character literal as
so:

char-literal:
' characters '
characters:
character
characters character

(or something like that)

Question, then: why do we need wchar_t/Rune?


The definitions are ( used to indicate non-terminals in the
grammar...):

(6.4.4.4) character-constant:
' c-char-sequence '
L' c-char-sequence '

(6.4.4.4) c-char-sequence:
c-char
c-char-sequence c-char

(6.4.4.4) c-char:
any member of the source character set except the single-quote ',
backslash \, or new-line character

escape-sequence

Steven Vormwald
[EMAIL PROTECTED]

Re: [9fans] awk, not utf aware...

2008-02-26 Thread erik quanstrom

thanks for catching that.

my brain's not on today.  generally i avoid the mb functions because they
rely on locale.  of course this doesn't apply on plan 9 and so there's no reason
for utf8len.

it looks like mblen is used elsewhere; perhaps this would now be a worthwhile
patch.

- erik

 Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
 mbtowc/wctomb functions to deal with UTF.  Thus it uses mblen rather
 than utflen or utf8len.

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Steven Vormwald

On Tue, 2008-02-26 at 16:40 -0500, Pietro Gagliardi wrote:
 Yes. I'm too lazy to pick up my copy of the standard.

I just happened to be reading through Annex A (the grammar) at the time,
so I thought I'd send it out.

Steven Vormwald
[EMAIL PROTECTED]

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Joel C. Salomon

On Tue, Feb 26, 2008 at 4:21 PM, Pietro Gagliardi [EMAIL PROTECTED] wrote:
 And it's wonderful that the C standard defines a character literal as
  so:

But it leaves the meaning of a literal like 'abcd' up to the compiler.
 I did something very perverse -- but 'legal' -- in the compiler I
started writing for class...

Also recall that sizeof('c') == sizeof(int).  I suspect, though, that
literals like 'abcd' are left from the B (word-addressable, not
byte-addressable) days.

A quick check of /sys/src/cmd/cc/lex.c shows that kenc disallows such horrors.

--Joel

Re: [9fans] awk, not utf aware...

2008-02-26 Thread Gorka Guardiola

On Tue, Feb 26, 2008 at 9:24 PM, erik quanstrom [EMAIL PROTECTED] wrote:

  i think the comments about this problem are missing the point
  a bit.  utf8 should be transparent to awk unless the situation demands

No. It is not transparent at all. It is semitranslucid because someone did it
partways and because of that I have been bitten hard by this in different
situations (I am not complaining, just saying that this may not be the right
approach to take in the future).

What someone did is make it so:
/a.j/
matches
a☺j
because someone fixed the regexp part of awk somehow it already understands this
which made me (falsely) think originally that it works and conned me
into the bug.

There is split and other functions,
for example:

toupper("aí")
gives
Aí

My guess is that there are many more little (or not) corners where it
doesn't work.
We can go on and on looking for crevices and hiding the bugs further
under the rug
so that they are not evident and find everyone completely unaware,
leave awk as it is now or really fix the problem. The first approach
doesn't work. I am going to take
the second till I have time to take the third which means use runes or
at least revise all the
code so that it is uniformly aware of the existence of non-ascii characters.
-- 
- curiosity sKilled the cat
