Re: [9fans] simplicity

2007-10-10 Thread Jack Johnson
On 10/9/07, erik quanstrom [EMAIL PROTECTED] wrote:
 i think i see what the reasoning is.  the thought is that, e.g.,
 in spanish [a-z] should match ñ.

Ah, thanks!

I was thinking of the simplistic scenario, where someone might be
looking for niño in some file, regardless of what locale they might
happen to be in.  Now I can imagine the nightmare it must be for
non-English speakers looking for letter combinations irrespective of
accents.

But, it seems more like a problem with the shorthand than grep, per
se.  I could see an argument for [:alpha:] potentially matching n and
ñ depending on the locale, but [a-z] not matching ñ in any locale. But
even that, my tendency would be that [:alpha:] match ñ in every
locale.

But then, does [:alpha:] match ἄγαθος?  How ironic that it doesn't match α.

What an ugly problem.

-Jack


Re: [9fans] simplicity

2007-10-10 Thread John Stalker
My most annoying locale problem concerned reading Czech HTML emails in
mh.  Don't ask why, just accept that I got a lot of these and could not
simply ignore them.  The problem was that mh saw a text/html MIME type
and, as it does for text, helpfully converted from the original encoding,
usually CP1250 or iso8859-2, to the encoding specified in my locale
environment variable, utf-8.  Since the content was html, it then handed
it to a ``browser'', in my case w3m, for pretty formatting.  w3m read the
encoding from the html header, thought its input was CP1250 or iso8859-2,
and helpfully converted to utf-8.  Both programs were behaving in a
vaguely sensible way, but iconv was being run twice, and the result was
gibberish.  It took me a while to figure out what was happening and a
while to figure out a way to make it stop.  I don't know what the general
answer to problems like this is.  Forcing everyone to use English is not
an option.  Forcing everyone to use utf-8 would be better, but is not
going to happen either.

John
-- 
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282
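
The double conversion described in this message is easy to reproduce. The following is a minimal sketch in Python, not a model of what mh or w3m actually do internally; the sample sentence and the variable names are assumptions chosen for illustration.

```python
# Sample Czech text (an assumption; any text with accented letters works).
original = "Příliš žluťoučký kůň"

# What arrives on the wire: the body encoded as ISO 8859-2.
wire_bytes = original.encode("iso8859-2")

# Step 1, the mailer's role: convert ISO 8859-2 to UTF-8 for the local locale.
after_mailer = wire_bytes.decode("iso8859-2").encode("utf-8")

# Step 2, the browser's role: it trusts the charset named in the HTML header,
# so it treats the already-converted UTF-8 bytes as ISO 8859-2 once more.
after_browser = after_mailer.decode("iso8859-2")

# Converting exactly once gives the text back.
converted_once = wire_bytes.decode("iso8859-2")

print(after_browser)                 # mojibake, not the original text
print(after_browser == original)     # False
print(converted_once == original)    # True
```

Neither step is wrong on its own; the damage comes from both programs acting on the same charset declaration.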


Re: [9fans] simplicity

2007-10-10 Thread Charles Forsyth
 Forcing everyone to use utf-8 would be better, but is not
 going to happen either.

it will, it will just take some time (some things will be in utf-x for x>8)
partly because it isn't `forced' (who could ever do the `forcing')



Re: [9fans] simplicity

2007-10-10 Thread erik quanstrom
  Heh, funny that this thread got revived the very day that my
 colleague's backup script choked because he was running in a utf8
 locale and hit a filename encoded in iso8859-1. Apparently GNU sed's .
 stops matching when it hits an invalid bytestream (which is not
 entirely unreasonable I guess).
 -sqweek

clearly in their world, it is unreasonable.

- erik


Re: [9fans] simplicity

2007-10-10 Thread erik quanstrom
 My most annoying locale problem concerned reading Czech HTML emails in
 mh.  Don't ask why, just accept that I got a lot of these and could not
 simply ignore them.  The problem was that mh saw a text/html MIME type
 and, as it does for text, helpfully converted from the original encoding,
 usually CP1250 or iso8859-2, [...]

i think this is a character set conversion problem, not a locale
problem.  a small distinction, but i think one can live with converting
character sets as they come onto a system.  localized (ha!) complexity.

- erik


Re: [9fans] simplicity

2007-10-10 Thread erik quanstrom
 I was thinking of the simplistic scenario, where someone might be
 looking for niño in some file, regardless of what locale they might
 happen to be in.  Now I can imagine the nightmare it must be for
 non-English speakers looking for letter combinations irrespective of
 accents.
 
 But, it seems more like a problem with the shorthand than grep, per
 se.

i agree with this.  or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would have nñ together so [a-z] would 
match both.

 I could see an argument for [:alpha:] potentially matching n and
 ñ depending on the locale, but [a-z] not matching ñ in any locale. But
 even that, my tendency would be that [:alpha:] match ñ in every
 locale.
 
 But then, does [:alpha:] match ἄγαθος?  How ironic that it doesn't match α.

i don't think one can go this route.  you can't have a magic environment
variable that changes everything.  testing is a nightmare in such a world.
you have to go through every combination of (data cs, locale) to see if
things are working.

a better solution is to use the properties of unicode.  ñ is noted in the
table as

00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield.  it would not be hard
to build a table quickly mapping a codepoint to its base codepoint σ.
but it would probably be most useful to also have a mapping from
base codepoints to all composed forms ξ.

suppose, for lack of creativity, we use » to mean all characters whose
base codepoint matches that of the next item, so »a matches ä, as does »[a-z].
then » of a letter c can be grepped by taking ξ(σ(c)), which results
in a character class.

plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data.  it would
be easy to adapt them to generate ξ and σ.  (the tables would be pretty big.)

 
 What an ugly problem.

it can be made ugly quickly.  but i'm not convinced that all approaches
to this problem are bad.

- erik
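
The σ and ξ tables proposed in this message can be sketched with Python's standard unicodedata module, which exposes the decomposition field of the Unicode data table. This is only an illustration under stated assumptions: the function names sigma, build_xi, expand and char_class are hypothetical, the tables are limited to the Basic Multilingual Plane to keep them small, and only canonical decompositions are followed (compatibility mappings such as ligatures are ignored).

```python
import re
import string
import unicodedata
from collections import defaultdict


def sigma(ch):
    """Map a character to its base codepoint by following the first
    element of its canonical decomposition until none is left."""
    while True:
        decomp = unicodedata.decomposition(ch)
        # An empty field means no decomposition; a leading "<tag>"
        # marks a compatibility mapping, which this sketch skips.
        if not decomp or decomp.startswith("<"):
            return ch
        ch = chr(int(decomp.split()[0], 16))


def build_xi(limit=0x10000):
    """Map each base codepoint to every character below `limit`
    (the base itself included) that reduces to it under sigma."""
    xi = defaultdict(set)
    for cp in range(limit):
        ch = chr(cp)
        xi[sigma(ch)].add(ch)
    return xi


XI = build_xi()


def expand(c):
    """The hypothetical » operator applied to one character: xi(sigma(c))."""
    return XI[sigma(c)]


def char_class(chars):
    """Build a regexp character class matching » of every character given."""
    members = set()
    for c in chars:
        members |= expand(c)
    return "[" + "".join(re.escape(m) for m in sorted(members)) + "]"


print(sigma("ñ"))                                   # n
print(bool(re.fullmatch(char_class("n"), "ñ")))     # True
word = re.compile(char_class(string.ascii_lowercase) + "+")
print(bool(word.fullmatch("niño")))                 # True
```

The sketch only handles precomposed characters. Text that stores ñ as n followed by a separate combining tilde would need normalizing first, which is a separate design decision.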


Re: [9fans] simplicity

2007-10-10 Thread John Stalker
 i think this is a character set conversion problem, not a locale
 problem.  a small distinction, but i think one can live with converting
 character sets as they come onto a system.  localized (ha!) complexity.

I'm not sure your solution is always the correct one, or is implementable.
Should an MTA silently convert incoming mail to the local character set?
I'm not sure I want that.  The other program in my example was a web
browser reading from a pipe.  It can't know whether it's processing data
as it comes into the system or data which is already there and has already
been converted, unless either it can trust the meta tag in the document to
have been updated or the conversion is pushed out into the network layer.
Also, it's meaningful to talk about the system character set in the plan9
world or the windows world, but not under UNIX, which is where I spend
most of my time, for better or worse.
-- 
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282


Re: [9fans] simplicity

2007-10-10 Thread John Stalker
  I'm not sure your solution is always the correct one, or is implementable.
  Should an MTA silently convert incoming mail to the local character set?
 
 it doesn't have to.  upas/fs does given the character set in the file.
 i've thought about the mta doing it.  i think that would be a nice solution.

In my case this was being done by the MUA, which was mh rather than upas,
but the net effect is the same.

  I'm not sure I want that.  The other program in my example was a web
  browser reading from a pipe.  It can't know whether it's processing data
  as it comes into the system or data which is already there and has already
  been converted, unless either it can trust the meta tag in the document to
  have been updated or the conversion is pushed out into the network layer.
 
 what is the standard?  if the encoding in the header is x does that mean
 that the encoding in the html header needs to be x?  what happens if they
 differ?
 
 the only case that makes sense is that they have to be the same.  but html
 and http generally run counter to common sense. ;-)

I don't know what happens if they differ.  In my case they were the same, but
the problem was that both programs assigned themselves the job of converting.
I think that the mailer SHOULD NOT, to use the RFC capitals, convert the
character set if it is handing off the display job to another program.  In any
case that's the way I set things up once I figured out what was going on.
This is counter to the way the CRLF issue is handled, though.  There the network
standard is CRLF and systems which use other systems, including all the ones I use,
are expected to convert before sending and after receiving so no local programs
need to know about such issues.
-- 
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282
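
The CRLF convention mentioned in this message, converting at the network boundary so that local programs never see it, can be sketched as a pair of functions. The names to_wire and from_wire are hypothetical, and the sketch assumes ASCII text and a local convention of bare LF.

```python
def to_wire(text):
    """Convert local LF line endings to network CRLF before sending.
    Existing CRLF pairs are collapsed first so they are not doubled."""
    return text.replace("\r\n", "\n").replace("\n", "\r\n").encode("ascii")


def from_wire(data):
    """Convert network CRLF back to local LF after receiving."""
    return data.decode("ascii").replace("\r\n", "\n")


message = "Subject: test\n\nbody\n"
print(to_wire(message))                       # b'Subject: test\r\n\r\nbody\r\n'
print(from_wire(to_wire(message)) == message) # True
```

The conversion is safe to apply at the boundary because it is cheap to make idempotent. Character set conversion has no such property, which is why running it twice did damage.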


Re: [9fans] simplicity

2007-10-09 Thread Aharon Robbins
In article [EMAIL PROTECTED] Uriel wrote:
Don't complain, at least it is not producing random behaviour, I have
seen versions of gnu awk that when fed plain ASCII input, if the
locale was UTF-8, rules would match random lines of input, the fix?
set the locale to 'C' at the top of all your scripts (and don't even
think of dealing with files which actually contain non-ASCII UTF-8).

This was some years ago, it might be fixed by now, but it demonstrates
how the locale insanity makes life so much more fun.

It likely is fixed by now.  If not, I'd like to have a sample program and
data and locale name to test under. And the truth is, even if it doesn't work,
I can blame the library routines and locale and not my code. :-)

Testing should be performed using current sources, available via anonymous
CVS from savannah.gnu.org, check out the gawk-stable module.  From CVS use:

./bootstrap.sh
./configure && make && make check

to build on a Unix or Linux system.

I hope to make a formal release in the next few weeks.

As to the original thread, yeah, configure (= autoconf + automake +
libtool + gnulib) has gotten way too hairy to handle. I don't use gnulib
on principle: I have the gut feeling that the configuration goop would
likely outweigh the source code in line count.

The only reason I added Automake support was to get GNU Gettext, which
on balance is a good thing.  Locales, on the other hand, I think are
very painful.  I hope that people who use them find them valuable (I'm
a parochial English speaking American myself, so ASCII is usually
enough for me.)

My two cents,

Arnold
-- 
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354          Home Phone: +972  8 979-0381    Fax: +1 206 202 4333
Nof Ayalon            Cell Phone: +972 50  729-7545
D.N. Shimshon 99785 ISRAEL


Re: [9fans] simplicity

2007-10-09 Thread Uriel
 This was some years ago, it might be fixed by now, but it demonstrates
 how the locale insanity makes life so much more fun.

 It likely is fixed by now.  If not, I'd like to have a sample program and
 data and locale name to test under. And the truth is, even if it doesn't work,
 I can blame the library routines and locale and not my code. :-)

Yes, it is likely fixed now, and it was very likely a bug in the
libraries rather than awk, but illustrates the kinds of problems
locales create. And I can tell you, in a production environment it can
be a pain when who knows what tool who knows where in your whole
system starts to misbehave because it is not happy with your locale.

I also find most sad how in the name of 'localization' the output of
many tools (especially error messages) has become unpredictable. It
makes providing support most fun when you ask people 'can you copy and
paste the output you get when you run this?', and they answer with a
bunch of stuff in Aramaic. If you use unix, you are supposed to
understand English, period. (Or what is next? will they have a set of
'magic symlinks' that links '/bin/gato' to '/bin/cat' if your locale
is in Spanish?)

And now that you mention Gettext, if only I could get back all the
time I wasted trying to compile some stupid program (that should never
have been 'localized' in the first place) which is somehow unhappy
about the gettext version I have (or the other way around)...

uriel

P.S.: Oh, and people who insist on using encodings other than UTF-8
should be locked up in padded cells (without access to computers and
ideally even without electricity, unless it is to help them
electrocute themselves) for the good of mankind.


Re: [9fans] simplicity

2007-10-09 Thread Jack Johnson
Yes, old thread, sorry.  Blame Uriel.

On 9/18/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote:
 erik quanstrom wrote:
  suppose Linux user a and user b grep the same text file for the same string.
  results will depend on the users' locales.

 But if they're trying to match an alphabetic character class, the
 result *should* depend on the locale.

This baffles me.  Can anyone think of examples where one might want
differing results depending on your locale?

-Jack


Re: [9fans] simplicity

2007-10-09 Thread erik quanstrom
 Yes, old thread, sorry.  Blame Uriel.
 
 On 9/18/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote:
  erik quanstrom wrote:
   suppose Linux user a and user b grep the same text file for the same string.
   results will depend on the users' locales.
 
  But if they're trying to match an alphabetic character class, the
  result *should* depend on the locale.
 
 This baffles me.  Can anyone think of examples where one might want
 differing results depending on your locale?
 
 -Jack

i think i see what the reasoning is.  the thought is that, e.g.,
in spanish [a-z] should match ñ.  

the problem is this means that grep(regexp, data) now
returns a set of results, one for each locale.

so on the one hand, one would like [a-z] to do the Right Thing,
depending on language.  and on the other hand, one wants
grep(regexp, data) to return a single result.

i think the way to see through this issue is to notice that
the reason we want ñ to be in [a-z] is because of visual
similarity.  what if we were dealing with chinese?  i think
it's pretty clear that [a-z] should map to a contiguous set
of unicode codepoints.

if you want to deal with ñ, the unicode tables do note that ñ
is n+combining ~, so one could come up with a new
denotation for base codepoint.  unfortunately, combining
that with existing regexp would be a bit painful.

- erik
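
The position taken in this message, that [a-z] should denote a contiguous range of codepoints and give one answer for everybody, is how Python's re module treats string patterns. The sketch below is an illustration of that behaviour only; the word list is an assumption.

```python
import re

words = ["nino", "niño", "NINO"]

# In a str pattern, [a-z] means codepoints 97 through 122 and nothing else,
# regardless of the LANG or LC_* environment variables.
lower_ascii = re.compile(r"[a-z]+")
matched = [w for w in words if lower_ascii.fullmatch(w)]

print(matched)                          # ['nino']
print(ord("a"), ord("z"), ord("ñ"))     # 97 122 241
```

Since ñ sits at codepoint 241, outside the range, matching it has to be requested explicitly, for example through a separate notation for base codepoints as suggested above.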


Re: [9fans] simplicity

2007-10-09 Thread sqweek
On 9/18/07, Uriel [EMAIL PROTECTED] wrote:
 Don't complain, at least it is not producing random behaviour, I have
 seen versions of gnu awk that when fed plain ASCII input, if the
 locale was UTF-8, rules would match random lines of input, the fix?
 set the locale to 'C' at the top of all your scripts (and don't even
 think of dealing with files which actually contain non-ASCII UTF-8).

 This was some years ago, it might be fixed by now, but it demonstrates
 how the locale insanity makes life so much more fun.

 Heh, funny that this thread got revived the very day that my
colleague's backup script choked because he was running in a utf8
locale and hit a filename encoded in iso8859-1. Apparently GNU sed's .
stops matching when it hits an invalid bytestream (which is not
entirely unreasonable I guess).
-sqweek
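
Why an iso8859-1 filename trips up a tool working in a utf8 locale can be shown directly: the byte that encodes ñ in iso8859-1 is not a valid UTF-8 sequence in that position. The sketch below uses Python, not GNU sed, and the filename is an assumption. The surrogateescape error handler is one way a program can carry such bytes through unchanged.

```python
# A filename as it would be stored by a program using ISO 8859-1.
name = "niño.txt".encode("iso8859-1")        # b'ni\xf1o.txt'

try:
    name.decode("utf-8")
    valid = True
    bad_offset = None
except UnicodeDecodeError as err:
    valid = False
    bad_offset = err.start                   # index of the offending byte

print(valid, bad_offset)                     # False 2

# One way to cope: map undecodable bytes to lone surrogates and back,
# so the original byte string survives a round trip.
text = name.decode("utf-8", errors="surrogateescape")
print(text.encode("utf-8", errors="surrogateescape") == name)   # True
```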


Re: [9fans] simplicity

2007-09-19 Thread Douglas A. Gwyn
Uriel wrote:
 found this gem in one of the many X headers:
 #define NBBY 8   /* number of bits in a byte */

So what is supposed to be wrong with using a manifest constant
instead of hard-coding 8 in various places?  As I recall,
The Elements of Programming Style recommended this approach.

Similar definitions have been in Unix system headers for
decades.  CHAR_BIT is defined in limits.h. (Yes, I know
there is a difference between a char and a byte.  Less well
known, there is a difference between a byte and an octet.)

I'm not saying that some of the complaints don't have a
point, especially when important tools perform poorly.
However, I've observed an unusual degree of arrogance in
the Plan 9 newsgroup, approaching religion.  Plan 9's way
of doing things is not the only intelligent way; others
may have different goals and constraints that affect how
they do things in their particular environments.


Re: [9fans] simplicity

2007-09-19 Thread erik quanstrom
 So what is supposed to be wrong with using a manifest constant
 instead of hard-coding 8 in various places?  As I recall,
 The Elements of Programming Style recommended this approach.

i see two problems with this sort of indirection.  if i see NBBY
in the code, i have to look up its value.  NBBY doesn't mean anything
to me.  this is a layer of mental gymnastics that makes the code hard
to read and understand.  on the other hand, 8 means something to me.

more importantly, it implies that the code would work with NBBY
of 10 or 12.  (c standard says you can't have < 8, §5.2.4.2.1.)
i'd bet there are many things in the code that depend on the size of
a byte that don't reference NBBY.

so this define goes 0 fer 2.  it can't be changed and it is not informative.

 Similar definitions have been in Unix system headers for
 decades.  CHAR_BIT is defined in limits.h. (Yes, I know
 there is a difference between a char and a byte.  Less well
 known, there is a difference between a byte and an octet.)

this mightn't be the right place to defend a practice by saying that
unix systems have been doing it for years.

- erik



Re: [9fans] simplicity

2007-09-19 Thread Charles Forsyth
Less well known, there is a difference between a byte and an octet.

grep octet /sys/games/lib/fortunes
20 octets is 160 guys playing flutes -- rob

easily one of my favourites



Re: [9fans] simplicity

2007-09-19 Thread Iruata Souza
On 9/19/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote:
 I'm not saying that some of the complaints don't have a
 point, especially when important tools perform poorly.
 However, I've observed an unusual degree of arrogance in
 the Plan 9 newsgroup, approaching religion.  Plan 9's way
 of doing things is not the only intelligent way; others
 may have different goals and constraints that affect how
 they do things in their particular environments.


imho a big problem is that in the mentioned places every environment
is always thought of as a particular one.

iru


Re: [9fans] simplicity

2007-09-19 Thread Russ Cox
 i see two problems with this sort of indirection.  if i see NBBY
 in the code, i have to look up its value.  NBBY doesn't mean anything
 to me.  this is a layer of mental gymnastics that makes the code hard
 to read and understand.  on the other hand, 8 means something to me.
 
 more importantly, it implies that the code would work with NBBY
 of 10 or 12.  (c standard says you can't have < 8, §5.2.4.2.1.)
 i'd bet there are many things in the code that depend on the size of
 a byte that don't reference NBBY.
 
 so this define goes 0 fer 2.  it can't be changed and it is not informative.

8 can be a lot of things besides the number of bits in a byte
(the number of bytes in a double or vlong, for example).
if you're doing enough conversions between byte counts
and bit counts, then using NBBY makes it clear *why* you're
using an 8 there, which might help a lot.

in other contexts, it might not be worth the effort.

jumping all over a #define without seeing how or 
why it is being used is not productive.  nor interesting.
in fact i can't believe i'm writing this.  sorry.

russ
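
The point about a named constant saying *why* an 8 is there can be illustrated with a small sketch. It is written in Python for brevity, and the names BITS_PER_BYTE, DOUBLE_BYTES, bitmap_bytes and bit_position are hypothetical, not taken from any X header.

```python
BITS_PER_BYTE = 8   # plays the role of NBBY
DOUBLE_BYTES = 8    # a different 8: the number of bytes in a double


def bitmap_bytes(nbits):
    """Bytes needed to hold nbits one-bit flags, rounding up."""
    return (nbits + BITS_PER_BYTE - 1) // BITS_PER_BYTE


def bit_position(n):
    """Split flag number n into (byte index, bit index within that byte)."""
    return n // BITS_PER_BYTE, n % BITS_PER_BYTE


print(bitmap_bytes(9))       # 2
print(bit_position(10))      # (1, 2)
print(4 * DOUBLE_BYTES)      # 32, bytes occupied by four doubles
```

With bare 8s the two uses would be indistinguishable at a glance. Whether that is worth a name depends on how much bit and byte arithmetic the code does, which is the caveat made above.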



Re: [9fans] simplicity

2007-09-19 Thread Skip Tavakkolian
 However, I've observed an unusual degree of arrogance in
 the Plan 9 newsgroup, approaching religion.

elitism, not arrogance.

I don't want to belong to any club that will accept me as a member. - Groucho Marx



Re: [9fans] simplicity

2007-09-18 Thread Douglas A. Gwyn
erik quanstrom wrote:
 wchar_t is not the equivalent of Rune.  Rune is always utf-8.  wchar_t
 can be whatever.

I could have sworn that Plan 9 rune is used to contain a Unicode
value (UCS-2).  wchar_t can do the same thing, and does on some
platforms.  On others, wchar_t holds a full 31-bit UCS-4 code, and
on others (Solaris for example) its encoding is locale-dependent
(which I would agree is not a good design).

 suppose Linux user a and user b grep the same text file for the same string.
 results will depend on the users' locales.

But if they're trying to match an alphabetic character class, the
result *should* depend on the locale.


Re: [9fans] simplicity

2007-09-18 Thread dave . l
But if they're trying to match an alphabetic character class, the
result *should* depend on the locale.

... so what *should* the result be if the locale specifies an ideographic script?

DaveL


Re: [9fans] simplicity

2007-09-18 Thread Iruata Souza
On 9/18/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 But if they're trying to match an alphabetic character class, the
 result *should* depend on the locale.

 ... so what *should* the result be if the locale specifies an ideographic script?

 DaveL


the result *should* be 'now go and use plan 9'

iru


Re: [9fans] simplicity

2007-09-18 Thread Rob Pike
On 9/17/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote:
 erik quanstrom wrote:
  i think the devolution of gnu grep is quite instructive.  ...
  it gets to the heart of why plan9's invention and use (thanks, rob, ken) of
  utf-8 is so great.

 If the problem is that Gnu grep converts any non-8-bit character set
 to wchar_t (the equivalent of Plan 9 rune), then it's not really a
 fair criticism of the software.  The conversion approach handles a
 wide variety of character encoding schemes, whereas grepping the
 encodings directly (the fast approach) doesn't work well for many
 non-UTF-8 encodings.

Well, on a 2GHz x86, gnu grep ran for me at about 9600 baud on an
ASCII file if I set my locale to the UTF-8 locale.  UTF-8 is ASCII
compatible - explicitly, publicly, and on purpose - so there is no
excuse for this sort of performance penalty.  To be specific, in
the UTF-8 locale it should take just a few instructions to convert
any character to wchar_t, ASCII or not, but gnu grep was calling
malloc for this, even for an ASCII byte.

It is a fair criticism to say this is unacceptable, whatever the
intentions of the authors may be.

-rob
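
The claim that converting a UTF-8 character to a wide value takes only a few operations, and none at all beyond a comparison for ASCII, can be made concrete. The sketch below is in Python and says nothing about how gnu grep or Plan 9's chartorune are written; decode_one and decode_all are hypothetical names. For brevity it does not reject overlong forms, surrogates, or values above U+10FFFF, which a production decoder should.

```python
def decode_one(buf, i=0):
    """Decode one UTF-8 sequence starting at buf[i].
    Returns (codepoint, number of bytes consumed)."""
    b0 = buf[i]
    if b0 < 0x80:                     # ASCII fast path: the byte is the codepoint
        return b0, 1
    if 0xC0 <= b0 < 0xE0:             # 110xxxxx: one continuation byte follows
        need, cp = 1, b0 & 0x1F
    elif 0xE0 <= b0 < 0xF0:           # 1110xxxx: two continuation bytes follow
        need, cp = 2, b0 & 0x0F
    elif 0xF0 <= b0 < 0xF8:           # 11110xxx: three continuation bytes follow
        need, cp = 3, b0 & 0x07
    else:
        raise ValueError("invalid lead byte")
    tail = buf[i + 1:i + 1 + need]
    if len(tail) != need:
        raise ValueError("truncated sequence")
    for b in tail:
        if b & 0xC0 != 0x80:          # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, need + 1


def decode_all(buf):
    """Decode a whole byte string into a list of codepoints."""
    out, i = [], 0
    while i < len(buf):
        cp, n = decode_one(buf, i)
        out.append(cp)
        i += n
    return out


print(decode_one(b"a"))                                   # (97, 1)
print(decode_all("niño".encode("utf-8")))                 # [110, 105, 241, 111]
```

No allocation is needed per character: the state is one integer and a byte count.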


Re: [9fans] simplicity

2007-09-18 Thread Uriel
Don't complain, at least it is not producing random behaviour, I have
seen versions of gnu awk that when fed plain ASCII input, if the
locale was UTF-8, rules would match random lines of input, the fix?
set the locale to 'C' at the top of all your scripts (and don't even
think of dealing with files which actually contain non-ASCII UTF-8).

This was some years ago, it might be fixed by now, but it demonstrates
how the locale insanity makes life so much more fun.

And talking of simplicity, don't forget to mention X. By chance I just
found this gem in one of the many X headers:

#define NBBY 8   /* number of bits in a byte */

uriel


On 9/18/07, Rob Pike [EMAIL PROTECTED] wrote:
 On 9/17/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote:
  erik quanstrom wrote:
   i think the devolution of gnu grep is quite instructive.  ...
   it gets to the heart of why plan9's invention and use (thanks, rob, ken) of
   utf-8 is so great.
 
  If the problem is that Gnu grep converts any non-8-bit character set
  to wchar_t (the equivalent of Plan 9 rune), then it's not really a
  fair criticism of the software.  The conversion approach handles a
  wide variety of character encoding schemes, whereas grepping the
  encodings directly (the fast approach) doesn't work well for many
  non-UTF-8 encodings.

 Well, on a 2GHz x86, gnu grep ran for me at about 9600 baud on an
 ASCII file if I set my locale to the UTF-8 locale.  UTF-8 is ASCII
 compatible - explicitly, publicly, and on purpose - so there is no
 excuse for this sort of performance penalty.  To be specific, in
 the UTF-8 locale it should take just a few instructions to convert
 any character to wchar_t, ASCII or not, but gnu grep was calling
 malloc for this, even for an ASCII byte.

 It is a fair criticism to say this is unacceptable, whatever the
 intentions of the authors may be.

 -rob



Re: [9fans] simplicity

2007-09-18 Thread Douglas A. Gwyn
Iruata Souza wrote:
 On 9/18/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  But if they're trying to match an alphabetic character class, the
  result *should* depend on the locale.
  ... so what *should* the result be if the locale specifies an ideographic script?
 the result *should* be 'now go and use plan 9'

That doesn't address the issue Dave L raised.

I don't know off hand what POSIX decreed for character classes
involving ideographs.  My guess is that they have to not count
as uppercase or lowercase, and probably not as alphabetic nor
alphanumeric.  You could ask similar questions about accented
characters in alphabet-based languages.  This isn't about
character coding so much as it is about classification.
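
For comparison with the guess above about POSIX, here is how the Unicode character database, as exposed by Python's unicodedata module and str methods, classifies a few of the characters discussed in this thread. This shows Unicode's locale-independent classification only; it is not a statement about what any POSIX locale does.

```python
import unicodedata

samples = ["n", "ñ", "α", "中"]

# For each character: (general category, is a letter, is lowercase, is uppercase)
table = {
    ch: (unicodedata.category(ch), ch.isalpha(), ch.islower(), ch.isupper())
    for ch in samples
}

for ch, row in table.items():
    print(ch, row)
# n ('Ll', True, True, False)
# ñ ('Ll', True, True, False)
# α ('Ll', True, True, False)
# 中 ('Lo', True, False, False)
```

Under this scheme an ideograph is a letter (category Lo, "letter, other") that is neither uppercase nor lowercase, and accented letters are classified the same way in every environment.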


Re: [9fans] simplicity

2007-09-18 Thread Iruata Souza
On 9/18/07, Douglas A. Gwyn [EMAIL PROTECTED] wrote:
 Iruata Souza wrote:
  On 9/18/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
   But if they're trying to match an alphabetic character class, the
   result *should* depend on the locale.
   ... so what *should* the result be if the locale specifies an ideographic script?
  the result *should* be 'now go and use plan 9'

 That doesn't address the issue Dave L raised.

I can't see why not.

iru


Re: [9fans] simplicity

2007-09-17 Thread ron minnich
On 9/16/07, Francisco J Ballesteros [EMAIL PROTECTED] wrote:

 Any other suggestion?

ELF prelinking (on, e.g., FC7)

how to take a bad decision and make it worse

ron


Re: [9fans] simplicity

2007-09-17 Thread ron minnich
oh, yeah, the utf8 example is great.

abiword used to be fast. before internationalization. Now it is so slow
as to be totally useless.

ron


Re: [9fans] simplicity

2007-09-17 Thread Douglas A. Gwyn
Francisco J Ballesteros wrote:
 the slides are a bunch of programs. In fact, I use a terminal to compile and run
 programs from the 9.intro.pdf book. ...

By the way, I've been reading through that book in my spare time,
and it's a pretty good resource.


Re: [9fans] simplicity

2007-09-17 Thread Douglas A. Gwyn
erik quanstrom wrote:
 i think the devolution of gnu grep is quite instructive.  ...
 it gets to the heart of why plan9's invention and use (thanks, rob, ken) of
 utf-8 is so great.

If the problem is that Gnu grep converts any non-8-bit character set
to wchar_t (the equivalent of Plan 9 rune), then it's not really a
fair criticism of the software.  The conversion approach handles a
wide variety of character encoding schemes, whereas grepping the
encodings directly (the fast approach) doesn't work well for many
non-UTF-8 encodings.


Re: [9fans] simplicity

2007-09-17 Thread Douglas A. Gwyn
Steve Simon wrote:
 Top of my over-complex list would be configure.

My experience with configure is that it seldom selects the compiler
I wanted to use, for some reason preferring the Gnu software even
though the conventional Unix versions work at least as well for the
purpose.


Re: [9fans] simplicity

2007-09-17 Thread erik quanstrom
 erik quanstrom wrote:
  i think the devolution of gnu grep is quite instructive.  ...
  it gets to the heart of why plan9's invention and use (thanks, rob, ken) of
  utf-8 is so great.
 
 If the problem is that Gnu grep converts any non-8-bit character set
 to wchar_t (the equivalent of Plan 9 rune), then it's not really a
 fair criticism of the software.  The conversion approach handles a
 wide variety of character encoding schemes, whereas grepping the
 encodings directly (the fast approach) doesn't work well for many
 non-UTF-8 encodings.

performance may suck, but that's just a symptom of a bigger problem.

wchar_t is not the equivalent of Rune.  Rune is always utf-8.  wchar_t
can be whatever.

this is not a feature.  it is a bug.

suppose Linux user a and user b grep the same text file for the same string.
results will depend on the users' locales.

contrast plan 9.  any two users grepping the same file for the same string
will get the same results.

in either case a character set conversion might be necessary to match
the locale.  but in the plan 9 case, one conversion will fix things for
any plan 9 user.  in the Linux case, there is no conversion that will fix
things for any Linux user.

- erik

p.s. gnu grep does special-case utf-8 and avoids wchar_t conversions



Re: [9fans] simplicity

2007-09-17 Thread Scott Schwartz
In my experience, the one thing that really gets Plan 9 across to people
is the telco server.  That's an example of something that you can't nicely
do in Unix, and that exhibits power and elegance as a consequence of a
few basic design choices.



[9fans] simplicity

2007-09-16 Thread Francisco J Ballesteros
Some time ago, Ron said

 I know we have some faculty on this list. Please talk to your students :-)

regarding the madness of making complex software (that time, it was
about configure).

I have allocated half of the presentation lecture for this semester to
"Why does this matter at all". Among other things,
I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.

Any other suggestion?


Re: [9fans] simplicity

2007-09-16 Thread Anant Narayanan
I have allocated half of the presentation lecture for this semester to
"Why does this matter at all". Among other things,
I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.

Any other suggestion?


Please do put up the slides online, if possible, for the benefit of  
the students on this list :)


--
Anant

Re: [9fans] simplicity

2007-09-16 Thread Steve Simon
Top of my over-complex list would be configure.

-Steve


Re: [9fans] simplicity

2007-09-16 Thread Francisco J Ballesteros
the slides are a bunch of programs. In fact, I use a terminal to compile and run
programs from the 9.intro.pdf book. I introduce mistakes and show the consequences,
and then I fix them.

In this particular course, I use slides just for the introduction class. I'll put
them on the web once we update the web pages for the semester.


On 9/16/07, Anant Narayanan [EMAIL PROTECTED] wrote:
  I have allocated half of the presentation lecture for this semester to
  "Why does this matter at all". Among other things,
  I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.
 
  Any other suggestion?

 Please do put up the slides online, if possible, for the benefit of
 the students on this list :)

 --
 Anant


Re: [9fans] simplicity

2007-09-16 Thread erik quanstrom
 I know we have some faculty on this list. Please talk to your students :-)
 
 regarding the madness of making complex software (that time, it was
 about configure).
 
 I have allocated half of the presentation lecture for this semester to
 "Why does this matter at all". Among other things,
 I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.
 
 Any other suggestion?

i think the devolution of gnu grep is quite instructive.  once upon a time
it was simple and very fast.  (thanks, mike.)  today it is neither.

the last time i tried to fix a utf-8 problem (it was 80 times slower
processing utf8 than ascii), i gave up after encountering dozens of
if(special char set){fast version}else{slow version} constructions.

it gets to the heart of why plan9's invention and use (thanks, rob, ken) of
utf-8 is so great.

and speaking of regular expressions, one could use russ' excellent work
on perl regular expressions vs. plan 9 regular expressions to talk about
how seemingly straightforward extensions are not always Mostly Harmless;
complexity is a sneaky thing.

- erik