Re: wcwidth

Marcin 'Qrczak' Kowalczyk Wed, 27 Sep 2000 11:38:37 -0700
Wed, 27 Sep 2000 17:36:42 +0100, Markus Kuhn <[EMAIL PROTECTED]> wrote:

> On POSIX systems, I think wcwidth() should be defined by the current
> locale, because the locale is already used to define the character
> coding of these simple output devices.

OK, but Haskell uses a fixed charset internally, Unicode, and in this
language it does make a difference when a function might depend on
the environment state. Character properties stuff is already designed
to not depend on the environment.

There are the following choices for the interface:

* Forget about width, until a generic stateful locale framework
  is designed. It will more nicely fit the picture that will be
  established.

* Have this one strange stateful character property function, among
  others that are stateless, leaving an open question: why isn't it
  placed together with other I/O functions? It does not belong here!

* Cheat by providing a stateless interface, even though in reality
  it might change or at least depend on when the program is run.
  Explain that this is a compromise between elegance and reality,
  and the function is really stateless, only its implementation is
  created on the fly when the program is run, or something like that.

* Pretend that this width is a property of the character itself, not
  of a particular combination of terminal, library used for doing the
  output (ncurses etc.), and encoding used in the session. Specialized
  libraries would provide other variants, more exact in their own
  contexts, anyway.

The interface choice is quite independent of the implementation
choices. Even the fourth variant could be implemented using wcwidth,
as long as we assume that it behaves well and does not, by design,
change with the locale when the meaning of wchar_t does not change.

I guess that wcwidth does not depend on $TERM anyway, or whether I
am using ncurses or outputting characters to stdout directly. But
IMHO this would be the only sensible dependency. I can't believe that
the same Unicode character could have a different width in Hungarian
and Japanese.

That's why I am currently assuming the fourth variant, and want
to put width in the same family as isSpace and toUpper, which are
already implemented privately.

Note that toUpper is split between the first and fourth variants: there
is a quick stateless approximation, because many programs expect that
and its interface was designed a long time ago, and a future locale
framework will certainly provide locale-dependent String->String case
converters. Even then the good old toUpper will be used, e.g. when
a compiler support tool needs to find a default mapping of a type
name from another language to Haskell, where type names start with
a capital letter.

The primary difference between treating characters in Haskell and
in C is that Haskell uses a fixed charset internally (otherwise it
would be impractical to use a stateful isSpace in many contexts),
and C always works in terms of an unknown locale-dependent charset.

> I recommend usage of my hardwired wcwidth() only for systems,
> where the C library does not provide an adequate locale-based one.

What should be used on a system where wchar_t is not Unicode?
I'm only interested in the width of characters expressed in Unicode.
I don't want to expose to the user the encoding of wchar_t.

Another reason for using a private mapping instead of wcwidth could
be that, as the discussion here shows, wcwidth implementation in
glibc is often wrong.

An argument for using wcwidth would be that width is not a property
of characters, but of the system on which the program is compiled.

Anyway I see only two main implementation choices:

* Try to detect at configure time if wcwidth should be used,
  and use a private implementation if wcwidth is not present
  or is determined to be unusable.

* Always use a private implementation.
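
A rough sketch of how the two choices could sit behind one stateless
entry point. Everything named here is an assumption made for
illustration: HAVE_USABLE_WCWIDTH is a hypothetical macro that a
configure script would define, char_width and private_width are
made-up names, and the tables are deliberately tiny and incomplete
(a real table would be generated from Unicode data).

```c
#define _XOPEN_SOURCE 700  /* must precede all includes so <wchar.h> declares wcwidth */
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Closed interval of Unicode code points [first, last]. */
struct interval { uint32_t first, last; };

/* Illustrative subset only, NOT a complete width table. */
static const struct interval zero_width[] = {
    { 0x0300, 0x036F },  /* combining diacritical marks */
    { 0x200B, 0x200F },  /* zero width space, joiners, direction marks */
};

static const struct interval double_width[] = {
    { 0x1100, 0x115F },  /* Hangul Jamo leading consonants */
    { 0x4E00, 0x9FFF },  /* CJK unified ideographs */
    { 0xAC00, 0xD7A3 },  /* Hangul syllables */
    { 0xFF01, 0xFF60 },  /* fullwidth forms */
};

/* Linear scan; good enough for a table this small. */
static int in_table(uint32_t ucs, const struct interval *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ucs >= t[i].first && ucs <= t[i].last)
            return 1;
    return 0;
}

/* Private implementation: depends only on the code point, never on the locale. */
int private_width(uint32_t ucs)
{
    if (ucs == 0)
        return 0;
    if (ucs < 0x20 || (ucs >= 0x7F && ucs < 0xA0))
        return -1;  /* control characters have no defined width */
    if (in_table(ucs, zero_width, sizeof zero_width / sizeof zero_width[0]))
        return 0;
    if (in_table(ucs, double_width, sizeof double_width / sizeof double_width[0]))
        return 2;
    return 1;
}

/* Single entry point; the choice is made once, at build time. */
int char_width(uint32_t ucs)
{
#ifdef HAVE_USABLE_WCWIDTH
    /* Assumes wchar_t values are Unicode code points on this system. */
    return wcwidth((wchar_t)ucs);
#else
    return private_width(ucs);
#endif
}
```

The caller always passes a Unicode code point and never sees wchar_t,
so the encoding of wchar_t stays hidden whichever branch is compiled.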

With the first choice I still don't know how to decide when to use it.
One approximation: always use it when available (that's wrong when
wchar_t is not Unicode or when wcwidth is too buggy). Another
approximation: use it only when __STDC_ISO_10646__ is defined (that's
wrong when wcwidth is too buggy, and is wrong in the other direction
in e.g. glibc-2.1.3, assuming that wcwidth is not too buggy there).
Perhaps the configure script could test some particular characters to
see if the results look sensible.
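
Such a spot check might look roughly like the sketch below. The
sample characters and their expected widths are my own picks, not an
agreed test set, and all names (width_fn_looks_sane, always_one,
sample_correct) are hypothetical. The candidate function is passed
in as a pointer so the check itself does not depend on the locale.

```c
#include <stddef.h>
#include <wchar.h>

/* Any function with the shape of wcwidth. */
typedef int (*width_fn)(wchar_t);

struct width_sample { wchar_t ch; int expected; };

/* A few characters whose width is rarely disputed (an assumption). */
static const struct width_sample samples[] = {
    { 0x0041, 1 },  /* LATIN CAPITAL LETTER A */
    { 0x0301, 0 },  /* COMBINING ACUTE ACCENT */
    { 0x4E00, 2 },  /* a CJK unified ideograph */
};

/* Returns 1 if f gives the expected width for every sample, else 0. */
int width_fn_looks_sane(width_fn f)
{
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (f(samples[i].ch) != samples[i].expected)
            return 0;
    return 1;
}

/* Stand-in for a broken library routine: every character is one column. */
int always_one(wchar_t c)
{
    (void)c;
    return 1;
}

/* Stand-in for a routine that gets the sample characters right. */
int sample_correct(wchar_t c)
{
    if (c == 0x0301)
        return 0;
    if (c == 0x4E00)
        return 2;
    return 1;
}
```

A real configure test would presumably compile this only when
__STDC_ISO_10646__ is defined, call setlocale(LC_CTYPE, "") in a
UTF-8 locale, and pass the system wcwidth instead of the stand-ins.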

> Why not add a separate function isassigned() that tests whether
> the implementation knows about the definition of the character.

Its functionality is already there: by checking if the category is Cn.
Maybe it's worth providing an explicit property.

Then it would be easy to simulate either definition of printable
characters with the other one.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/