Re: glibc wcwidth
On Fri, May 28, 2004 at 12:39:21AM -0400, srintuar wrote:
> I'm running with glibc-2.3.2, and the wcwidth system call seems to have

(same; Debian unstable)

> For example, in the locale ja_JP.utf8:
>
> 0x6BDF "毟" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x30E2 "モ" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x8AAD "読" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x307F "み" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x4EEE "仮" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x540D "名" mk_wcwidth=2 wcwidth=-1 iswprint=no
>
> Does anyone know if wcwidth is/was broken in glibc?

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    printf("%lc: %i\n", 0x6bdf, wcwidth(0x6bdf));
    return 0;
}

prints "毟: 2" for me, in en_US.UTF-8 and ja_JP.UTF-8.

Did you forget to call setlocale()? If not, the data probably isn't loaded.
(Tip: always include your test program.)

--
Glenn Maynard

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: glibc wcwidth
On Fri, May 28, 2004 at 01:04:57AM -0400, srintuar wrote:
> > Did you forget to call setlocale()? If not, the data probably isn't
> > loaded. (Tip: always include your test program.)
>
> Yeah, that was it. Embarrassingly obvious in hindsight, I guess :)

Well, it's reasonable to forget: on modern systems, wchar_t doesn't change
across locales, widths don't change much, and wcwidth(3) doesn't mention
setlocale() at all on my system. I would have tried it first without,
myself, except that I knew it was needed for %lc.

--
Glenn Maynard
Re: iconv limitations
On Thu, Apr 08, 2004 at 04:17:41AM -0400, Michael B Allen wrote:
> > - knows that the input is zero terminated
>
> I have great difficulty in envisioning the opposite.

Binary file formats and network protocols have a lot of zero-terminated
strings in all sorts of encodings.

> > - does not know whether this is an 8-bit, 16-bit or 32-bit wide and
> >   aligned zero
>
> Again, for me it's rare that an application would not need to know what
> data it's dealing with. Applications do not exist in a vacuum. You have
> to do I/O, in which case the encoding of text is usually predefined or
> negotiated.

You do not always have the luxury of defining how text is represented
throughout the system. However, the case where 1: data is zero-terminated
*and* 2: you don't at least know whether you're dealing with an 8-, 16- or
32-bit encoding is, in my experience, non-existent. After all,
zero-terminated is meaningless unless you know what zero means--an 8-bit,
16-bit or 32-bit zero?

--
Glenn Maynard
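The last point can be made concrete with a small sketch (the function names are mine, purely illustrative): a "zero-terminated" scan has to be written per code-unit width, so the claim "the input is zero terminated" already presumes you know which width you have.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in code units up to the terminator -- one scanner per unit
 * width.  "Zero-terminated" is ambiguous until you pick one: the same
 * buffer can hold a terminator for one width and live data for another. */
size_t len8(const uint8_t *s)   { size_t n = 0; while (s[n]) n++; return n; }
size_t len16(const uint16_t *s) { size_t n = 0; while (s[n]) n++; return n; }
size_t len32(const uint32_t *s) { size_t n = 0; while (s[n]) n++; return n; }
```

For example, the bytes 41 00 end a string as far as len8 is concerned, but on a little-endian machine they are the single nonzero UTF-16 code unit U+0041, so len16 keeps scanning.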
Re: iconv limitations
On Thu, Apr 08, 2004 at 06:17:55PM -0400, Michael B Allen wrote:
> > > On the other hand, the iconv API is more flexible the way it is. It
> > > can handle strings with embedded zeroes,
> >
> > Now *that* is rare. I use std::string, which is 8-bit clean, and I
> > always like to make things remain that way unless I have a strong
> > reason not to.
>
> For that use iconv. ... Just because the conversion routine stops at a
> null terminator in the source doesn't mean it cannot operate on a string
> that is not null terminated. The encdec interface I described can
> convert non-null-terminated strings by limiting the number of bytes
> inspected in src using the sn parameter.

I'd suggest that one shouldn't have to use two notably different
interfaces just because your nul-termination needs are different, and that
"stop on nul" should be a conversion flag, as should other things that
some need and some don't want: replacing unconvertible characters (with
"?"), transliteration (to "a"), etc.

Better would be a low-level conversion interface that allows implementing
these things efficiently (which iconv doesn't), with iconv, encdec, etc.
interfaces being implemented on top of that. At the very least, this could
solve the problem of having to lug around large conversion tables when you
outgrow iconv().

> pages and MIME messages with bogus length parameters. The W3C claims
> all apps should use UTF-16 internally so if you want to use those in
> your

FWIW, I'd say that what the W3C claims applications should use internally
is no more interesting than what the FSF claims I should eat for
breakfast. :) (Not to mention that UTF-16 is such a horrible
recommendation to be making!)

--
Glenn Maynard
Re: W3C and UTF-16
On Thu, Apr 08, 2004 at 08:35:21PM -0400, Michael B Allen wrote:
> This probably states the definitive position for text handling:
>
> http://www.w3.org/TR/1999/WD-charmod-19991129/#Encodings
>
> But even though the encoding is not clearly stated as UTF-16, the
> Document Object Model (DOM), which is basically the document tree inside
> a web browser and key to all HTML and XML processing including
> JavaScript and XSLT processing, *requires* the encoding be UTF-16:
>
> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-C74D1578
>
> "The UTF-16 encoding was chosen because of its widespread industry
> practice."

Very funny; it was chosen since it's what Windows is stuck with.

That aside, all of the above is incorrect. You don't have to use DOM to
process HTML and XML. (Ultimately, if one *had* to use UTF-16 to process
HTML, then something along the line is horribly wrong: a language
specification can't legitimately make any requirements about transparent
implementation details.)

--
Glenn Maynard
Re: Perl unicode weirdness.
On Mon, Feb 02, 2004 at 12:09:07PM -0800, Larry Wall wrote:
> (To avoid confusion, we don't call our encoding UTF-8. We tend to say
> UTF-8 when we mean UTF-8, and utf8 when we mean the more general
> not-necessarily-Unicode encoding.)

This is an insane way to make a distinction, just as silly as trying to
differentiate between kilobits and kilobytes with "kb" and "kB". Changing
hyphens and case doesn't make distinctions or avoid confusion.

--
Glenn Maynard
Re: Perl unicode weirdness.
On Mon, Feb 02, 2004 at 12:21:40PM -0800, Larry Wall wrote:
> locales for everyone willy-nilly. So 5.8.1 backed off on that, with the
> result that you have to be a little more intentional about your input
> formats (or set the PERL_UNICODE environment variable).

What's the normal way to say "use the locale", like every other Unix
program that processes text? Setting PERL_UNICODE seems to make it
*always* use Unicode:

04:39pm [EMAIL PROTECTED]/5 [~] export LANG=en_US.ISO-8859-1
04:39pm [EMAIL PROTECTED]/5 [~] perl -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'
ú
ú
04:39pm [EMAIL PROTECTED]/5 [~] export PERL_UNICODE=1
04:39pm [EMAIL PROTECTED]/5 [~] perl -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'
ú

Also, with PERL_UNICODE=1 in en_US.UTF-8, entering "ú" outputs one byte,
0xfa (the codepoint), instead of 0xc3 0xba; why?

This is perl, v5.8.2 built for i386-linux-thread-multi

(It's a shame that Perl doesn't behave like everyone else and obey locale
settings correctly; I thought we were finally getting away from having to
tell each program individually to use UTF-8. I don't understand the logic
of "RedHat set the locale to UTF-8 prematurely, so Perl shouldn't obey the
locale".)

--
Glenn Maynard
Re: Perl unicode weirdness.
On Mon, Feb 02, 2004 at 04:49:22PM -0800, Larry Wall wrote:
> I believe use open ':locale' does that.

This seems to work:

perl -e 'use open ":locale";' -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'

(rather ugly for command-line one-liners)

> Well, hey, I'm the one who agreed with you in the first place and asked
> that 5.8.0 be done that way, but apparently the current maintainers of
> Perl 5 got an excessive amount of grief from people whose production
> programs broke under RedHat. And I've been so far off in Perl 6 La La
> Land (aka second system syndrome done right) that I let the Perl 5 folks
> make the decision to back that out. Oh well.

Please do get locale handling right this time around. :)

--
Glenn Maynard
Re: Perl unicode weirdness.
On Sat, Jan 31, 2004 at 02:07:07PM, Markus Kuhn wrote:
> Question: What is a quick way in Perl to get a regular expression that
> matches all Unicode characters in the range U0100..U10, in other
> words all non-ASCII Unicode characters?

It looks like /[\x{100}-\x{10}]/ should do that, but it doesn't work here.

$ perl -v
This is perl, v5.8.2 built for i386-linux-thread-multi

$ LANG=en_US.UTF-8 perl -ne 'if(/^(\x{61})$/) { print "$1\n"; }'
(in)  a
(out) a

$ perl -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'
(in)  ú
(nothing out)

$ perl -ne 'if(/^(.)$/) { print "$1\n"; }'
(in)  a
(out) a
(in)  ú
(nothing out)

$ grep '^.$'
(in)  a
(out) a
(in)  ú
(out) ú

$ perl -ne 'if(/^(..)$/) { print "$1\n"; }'
(in)  ú
(out) ú

Why is . matching a single byte in perl, instead of a single codepoint?
Why isn't \x{fa} working?

--
Glenn Maynard
Re: Linux console UTF-8 by default
On Wed, Jan 14, 2004 at 08:31:16PM +0100, Brian Foster wrote:
> yes there is. if the illegal 5-byter has the first 4 bytes legal
> followed by an US-ASCII byte (which is what makes the 5-byter illegal),
> a parser that never considers sequences longer than 4 bytes will see an
> illegal sequence of 4 bytes and then a valid byte.

That would be correct: if a byte that was expected to be a continuation
byte is not, the UTF-8 string should be considered invalid, and the
character that was just read should start a new sequence. A 5-byte
sequence with the fifth byte invalid:

fb bf bf bf 41

should be parsed as an invalid sequence, followed by 0x41 ('A'). (That's
only sensible; on many media, lost bytes are much more common than bit
errors.)

Looking at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
3.3.4: if it was parsed as you suggest, then the ASCII quote after the
partial sequence would be considered part of the sequence, and not
displayed.

--
Glenn Maynard
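A hedged sketch of the resynchronization rule described above: when an expected continuation byte is missing, report one invalid sequence and restart scanning at the offending byte. This is my own illustration, not any particular library's decoder; it checks only lead-byte/continuation structure, not overlong forms or codepoint range.

```c
#include <stddef.h>

/* Expected sequence length for a UTF-8 lead byte, or 0 if the byte
 * cannot start a sequence (stray continuation byte, 0xfe, 0xff). */
static int seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c < 0xc0) return 0;   /* 10xxxxxx: continuation, not a lead */
    if (c < 0xe0) return 2;
    if (c < 0xf0) return 3;
    if (c < 0xf8) return 4;
    if (c < 0xfc) return 5;   /* always invalid in modern UTF-8 */
    if (c < 0xfe) return 6;   /* always invalid in modern UTF-8 */
    return 0;
}

/* Count characters and errors, treating each broken sequence as ONE
 * error and restarting at the first non-continuation byte. */
void count_utf8(const unsigned char *s, size_t n,
                size_t *chars, size_t *errors)
{
    size_t i = 0;
    *chars = *errors = 0;
    while (i < n) {
        int len = seq_len(s[i]);
        if (len == 0) { (*errors)++; i++; continue; }
        size_t j = i + 1;
        while (j < i + (size_t)len && j < n && (s[j] & 0xc0) == 0x80)
            j++;
        if (j == i + (size_t)len && len <= 4)
            (*chars)++;       /* complete, well-formed length */
        else
            (*errors)++;      /* truncated or over-long lead: resync at s[j] */
        i = j;
    }
}
```

On the example from the message, fb bf bf bf 41 counts as exactly one error followed by one character ('A'), matching the "invalid sequence, then 0x41" parse.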
Re: Unicode fonts on Debian
On Wed, Dec 17, 2003 at 08:24:35PM +0100, Jan Willem Stumpel wrote:
> > If you see <html lang=ja> then the page should use the font specified
> > by the Japanese setting by default. [..] Encoding is fairly irrelevant
> > to this, afaik
>
> http://ken2403king.kir.jp/form.htm
>
> That's a funny one, indeed. When I opened it in Mozilla it was displayed
> as gibberish; for a moment I thought it was Chinese (which I do not
> know). So, isn't the LANG attribute *more* irrelevant, because it did
> not help Mozilla (1.5a) to display the text correctly? A META tag
> attribute charset=shift-jis added to (a copy of) the page did. Doesn't
> that mean that encoding is more relevant than language?

Encoding is more relevant to being able to decode the text. It's not
relevant to deciding which font to use. (Well, if you don't have a
language tag, the encoding can be used to help guess it, but not if it's
UTF-8.) That's what he said, of course. :)

--
Glenn Maynard
Re: Perl in a UTF-8 locale
On Mon, Nov 10, 2003 at 05:20:59PM, Edmund GRIMLEY EVANS wrote:
> I have a problem here with Perl v5.8.0 on Red Hat 9. Simplified, my
> script looks like this:
>
> while (<>) { s/ĉ/cx/g; print; }
>
> This works with older versions of Perl, and it works in the C locale,
> but it doesn't work here in a UTF-8 locale. I tried putting stuff like
> "use bytes" or "no utf8" or "no locale", but it didn't help.

As long as the Perl script and the input are in the same encoding, it
works for me. (Debian unstable)

This is perl, v5.8.0 built for i386-linux-thread-multi

10:14am [EMAIL PROTECTED]/2 [~] cat testing.txt; file testing.txt
abĉd
testing.txt: UTF-8 Unicode text
10:17am [EMAIL PROTECTED]/2 [~] LANG=en_US.UTF-8 ./xxx.pl testing.txt
abcxd
10:14am [EMAIL PROTECTED]/2 [~] LANG=C ./xxx.pl testing.txt
abcxd
10:14am [EMAIL PROTECTED]/2 [~] LANG=en_US.ISO-8859-3 ./xxx.pl testing.txt
abcxd

ISO-8859-3:

10:17am [EMAIL PROTECTED]/2 [~] LANG=en_US.UTF-8 ./xxx3.pl testing-3.txt
abcxd
10:18am [EMAIL PROTECTED]/2 [~] LANG=C ./xxx3.pl testing-3.txt
abcxd
10:18am [EMAIL PROTECTED]/2 [~] LANG=en_US.ISO-8859-3 ./xxx3.pl testing-3.txt
abcxd

(Of course, no locale works if I mix encodings.)

> exec(/path/to/this/script, @ARGV); }

.)??D??-|??{??v??W?z[

Hmm. What's this garbage at the end of the message? Oh. Poking at the raw
message body, it's the stupid footer that the mailing list blindly spams
on every message (despite this being a base64 message).

--
Glenn Maynard
Re: grep is horribly slow in UTF-8 locales
On Fri, Nov 07, 2003 at 12:52:44PM +0000, Markus Kuhn wrote:
> $ grep --version
> grep (GNU grep) 2.5.1
>
> $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
> Command exited with non-zero status 1
> 6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (157major+34minor)pagefaults 0swaps
>
> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (125major+24minor)pagefaults 0swaps

FYI:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=206470
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=181378

I've noticed this, too. I often use LANG=C for grepping due to this.
Someone mentioned --with-included-regex, but that's not good enough (a 10%
gain for me).

--
Glenn Maynard
Re: grep is horribly slow in UTF-8 locales
On Fri, Nov 07, 2003 at 04:49:58PM +0100, Danilo Segan wrote:
> This doesn't happen with:
>
> $ grep --version
> grep (GNU grep) 2.4.2

This was probably before full multibyte support was added to grep; the
issue here specifically only happens in multibyte encodings. (My grep is
slow in en_US.UTF-8, and fast in en_US.ISO-8859-1.) Try:

# echo tést | grep 't.st'
tést
# echo tést | grep 't[aé]st'
tést

> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.04user 0.06system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (118major+25minor)pagefaults 0swaps
>
> Last example shows that CPU usage is not really any kind of rule to base
> conclusions on (sr_CS.UTF-8 is my everyday locale, and I would really
> notice if grep had any problems with it).

The field you should be reading is "user". "CPU" is roughly
(user+system)/elapsed, and isn't very relevant here.

--
Glenn Maynard
Re: FYI: Some links about UTF-16
On Tue, Jul 08, 2003 at 02:03:14PM +0800, Wu Yongwei wrote:
> Don't you know what respect is? I do not call other people silly, and I

(reply made in private)

--
Glenn Maynard
Re: FYI: Some links about UTF-16
On Tue, Jul 08, 2003 at 11:22:19AM +0800, Wu Yongwei wrote:
> Is it true that "Almost all modern software that supports Unicode,
> especially software that supports it well, does so using 16-bit Unicode
> internally: Windows and all Microsoft applications (Office etc.), Java,
> MacOS X and its applications, ECMAScript/JavaScript/JScript, Python,
> Rosette, ICU, C#, XML DOM, KDE/Qt, Opera, Mozilla/NetScape,
> OpenOffice/StarOffice, ..." ?

Blatantly false. Lots of modern software uses UTF-8 internally.

--
Glenn Maynard
Re: FYI: Some links about UTF-16
On Tue, Jul 08, 2003 at 01:29:04PM +0800, Wu Yongwei wrote:
> > > Is it true that "Almost all modern software that supports Unicode,
> > > especially software that supports it well, does so using 16-bit
> > > Unicode internally: Windows and all Microsoft applications (Office
> > > etc.), Java, MacOS X and its applications,
> > > ECMAScript/JavaScript/JScript, Python, Rosette, ICU, C#, XML DOM,
> > > KDE/Qt, Opera, Mozilla/NetScape, OpenOffice/StarOffice, ..." ?
> >
> > Blatantly false. Lots of modern software uses UTF-8 internally.
>
> Name them.

Why? There are so many (such as the editor I'm typing in right now) that
you sound rather silly asking me to name examples. Sorry; if you don't
know even this, I'm not interested in having this conversation. Do some
research.

--
Glenn Maynard
Re: Wide character APIs
On Thu, Jul 03, 2003 at 09:03:40PM +0200, Bruno Haible wrote:
> But no one answered my original question; why are the format specifiers
> for wide character functions different?

Here's the answer: so that a given format specifier corresponds to a given
argument type.

Format specifier    Argument type
%d                  int
%s                  char *
%ls                 wchar_t *
%c                  int (promoted from char)
%lc                 wint_t (promoted from wchar_t)

Changing between char and wchar_t at compile time with macros (TCHAR) is a
hideous Windows hack. If you really want to generalize it, you could fork
printf to have a TCHAR type, e.g.:

const TCHAR *t = _T("abc");
printf("%t, %t", t, _T("def"));

(%t probably has some meaning in printf that I don't know off the top of
my head; I'm not suggesting you actually do this.)

This type switching is just a gross migration scheme, for programmers who
want to distribute both Unicode and ANSI versions of their programs (for
Win9x compatibility). I doubt this was the intent with the C wide
functions having similar parameters; that's just consistency.

--
Glenn Maynard
Re: gtk2
On Wed, Apr 02, 2003 at 06:17:42PM +0900, Tomohiro KUBOTA wrote:
> And, do you say that non-European-language-speaking people don't need to
> have choices? For example, there are people who like Eterm, Aterm,
> Wterm, Rxvt, Xterm, or so on. (Note that all of them support XIM.) Is it
> a privilege of European-language-speaking people to say such
> preferences? It is what I wanted to call ethno-centrism.

People write code to do what *they* need; I guess that's "self-centrism".
(After all, most of this is written by people in their spare time.)

I suppose the problem you're really complaining about is a likely typical
response by writers of terminal emulators: "why should we support it; use
xterm if you want that". You'd probably get a similar response if you
tried to get Eterm's silly eyecandy bloat features added to Xterm. There's
a difference, of course--handling Unicode in all terminal emulators is
actually a good idea (adding bloat to Xterm is not :); i18n just needs to
be more widely understood as a fundamentally important feature. That's
happening steadily. Nobody's saying that you shouldn't have choices, of
course.

On the topic of toolkits: libraries like GTK and Qt absolutely should be
able to automatically handle as much i18n (IM, font rendering, widget
repositioning) as possible. Line input should automatically hint the IM
for clean over-the-spot rendering, and whatever else is useful. They just
can't be required; we must be able to handle input methods anywhere
(without having to learn a complicated library).

(I'm not sure what this subthread is really arguing about, though, since I
don't see anyone disagreeing on this. :)

--
Glenn Maynard
Re: gtk2
On Tue, Apr 01, 2003 at 10:02:36PM -0500, srintuar26 wrote:
> gnome-terminal and multi-gnome-terminal are fairly lightweight.

Package: gnome-terminal
Depends: bonobo-activation (= 1:2.2.0), libart-2.0-2 (= 2.3.8),
libatk1.0-0 (= 1.2.2), libaudiofile0 (= 0.2.3-4), libbonobo-activation4
(= 1:2.2.0), libbonobo2-0 (= 2.2.0), libbonoboui2-0 (= 2.2.0), libc6
(= 2.3.1-1), libesd0 (= 0.2.23-1) | libesd-alsa0 (= 0.2.23-1),
libfontconfig1 (= 2.1), libfreetype6 (= 2.1.3-5), libgconf2-4 (= 2.2.0),
libgcrypt1 ( 1.1.11-0), libglade2-0 (= 2.0.0), libglib2.0-0 (= 2.2.1),
libgnome2-0 (= 2.1.90), libgnomecanvas2-0 (= 2.1.90), libgnomeui-0
(= 2.1.90), libgnomevfs2-0 (= 2.2.0), libgnutls5 (= 0.8.0-1), libgtk2.0-0
(= 2.2.0), libjpeg62, liblinc1 (= 1:1.0.0), libncurses5 (= 5.3.20021109-1),
liborbit2 (= 1:2.6.0), libpango1.0-0 (= 1.2.1), libpopt0 (= 1.6.4),
libstartup-notification0, libtasn1-0 (= 0.1.1-2), libvte4 (= 0.10.10),
libxft2 (= 2.1), libxml2 (= 2.5.0-1), xlibs ( 4.1.0), xlibs ( 4.2.0),
zlib1g (= 1:1.1.4), scrollkeeper (= 0.3.8), yelp

Of course, much of this is optional, but nothing about any GTK app is
"lightweight" unless you happen to be on a GTK system.

--
Glenn Maynard
Re: supporting XIM
On Mon, Mar 31, 2003 at 08:19:49AM +0900, Tomohiro KUBOTA wrote:
> I think there are no people who explicitly think so. However, what do
> you think if a developer thinks, for example, that italic character
> support for 8-bit characters is very important, while he/she doesn't
> understand the importance of multibyte support?

I believe this is perfectly understandable and normal, even though it's
very annoying to Japanese users. A side-effect of open source is people
prioritizing features that they care about at the expense of those they
don't. English-speaking programmers are bound to care more about features
for English than features for other languages--just as programmers in X
care more about X support than Windows support (which is very annoying to
Windows users, who often end up with old, buggy ports of X software when
they get them at all).

The only things that can be done about this are what's being done and
discussed: making it easier (so the time commitment is reduced) and
submitting patches. Actually, there's one more: give them a reason to
care. I wonder if there's any way to sneak a few double-width characters
into common use among English-speaking programmers. :)

This is actually one advantage of NFD: it makes combining support much
more important. (At least, it's an advantage from this perspective; those
who would have to implement combining who wouldn't otherwise probably
wouldn't see it that way.)

By the way, I just gave lv a try: apt-get installed it, used it on a UTF-8
text file containing Japanese, and I'm seeing garbage. It looks like it's
stripping off the high bits of each byte and printing it as ASCII. I had
to play around with switches to get it to display; apparently it ignores
the locale. Very poor. less, on the other hand, displays it without having
to play games. It has some problems with double-width characters,
unfortunately.

--
Glenn Maynard
Re: supporting XIM
On Fri, Mar 28, 2003 at 11:32:21AM -0800, H. Peter Anvin wrote:
> WHOA... that's a pretty darn strong statement. In particular, that would
> seem to request internationalization of kernel (or other debugging or
> logging) messages, which is probably a completely unrealistic goal. For
> user-interface issues, I would agree with you, however.

I think handling i18n in cooked input mode is realistic and important.
(This is both UI and kernel.)

> When it comes to (a), it pretty much means that the complexity needs to
> be hidden from the application programmer.

Terminal applications, toolkits, and perhaps libraries like readline need
to support this, but applications shouldn't need to be affected beyond a
few basic guidelines, such as "don't assume byte == character".

> Getting UTF-8 universally deployed will be a huge part of this, because
> it means that anything other than 7-bit ASCII will have to take this
> into consideration.

Chicken and egg. :)

> > Of course several Japanese companies are competing in the Input Method
> > area on Windows. These companies are researching better input
> > methods -- larger and better-tuned dictionaries with newly coined
> > words and phrases, better grammatical and semantic analyzers, and so
> > on. I imagine this area is one of the areas where Open Source people
> > cannot compete with commercial software by full-time developer teams.
>
> This seems to call for a plugin architecture.

More than anything, I suspect we need *standards*. And, in this case,
non-GPL licensing (if being able to use proprietary input method plugins
is desired) ...

--
Glenn Maynard
Re: supporting XIM
On Sat, Mar 29, 2003 at 01:33:02AM +0900, Tomohiro KUBOTA wrote:
> Another point: I want to purge all non-internationalized software.
> Today, internationalization (such as Japanese character support) is
> regarded as a special feature. However, I think that not supporting
> internationalization should be regarded as a bug which is as severe as
> "racist software". However, GTK is a relatively heavy toolkit, and
> developers who want to write lightweight software won't use it.

Stop using the word "racist". It's like saying "if you don't support a
feature I want, you're supporting terrorism"; it makes people groan and
stop paying attention. It's inflammatory, doesn't help your case at all,
and injures your credibility.

Not being racist is free: it takes no time, doesn't take any new code or
testing, has no support costs, and doesn't require people to learn new
APIs. If i18n ever becomes implicit, such that supporting i18n is as easy
and effortless as not being racist, and not supporting i18n takes a
deliberate act by the programmer, then the word "racist" might have some
relevance (but it'd still be inflammatory and cause groaning and
ignoring).

I'm aware that English isn't your native language, but I'm pretty sure you
know how strong this comparison is.

--
Glenn Maynard
Re: supporting XIM
On Wed, Mar 26, 2003 at 03:38:54PM -0500, Maiorana, Jason wrote:
> > You can, you just select which keyboard/input method you'd like to use
> > from the keyboard menu (which lists all the installed/enabled ones)!
> > But wait... That's Windows... And Mac...
>
> No you can't. I have access to a Windows machine, with Global IME
> installed. The keyboard is rearranged into Dvorak layout, and all other
> input methods aside from English fail.

Yes, you can; I did it to type this: 漢字. Nobody's claiming it's perfect
or bug-free, but it's indisputably there and useful to many people who
need to input text in multiple languages. Imperfection is not
nonexistence.

> The Windows model is not perfect, imo. (Beyond-BMP codepoints may break
> many applications, etc.)

I don't see how Windows's use of UTF-16 is relevant to the discussion (the
ability to change keyboard mappings on the fly). The only point was that
it's taking X a while to do things that Windows has been doing gracefully
(relatively speaking) since at least Win2K.

--
Glenn Maynard
Re: supporting XIM
On Tue, Mar 25, 2003 at 05:12:12PM -0800, H. Peter Anvin wrote:
> > However, locale-dependence itself is not a bad thing. For example,
> > XCIN supports both traditional and simplified Chinese depending on
> > locale. We can imagine an improvement where the default mode would be
> > determined by locale, even when run-time switching between traditional
> > and simplified Chinese is supported.
>
> Indeed. It would be nice, at some point in the future, to be able to
> edit, for example, a Swedish-language document and suddenly decide I
> need to insert some Japanese text, and call up the appropriate input
> method without having to have anticipated this need (other than having
> it installed, of course).

As a person who's only done IM-related stuff in Windows, this seems
fundamental. I simply hit lcontrol+lshift to switch between English,
Japanese, Korean, and Finnish (which I seem to have accidentally
installed) input systems. X is miles behind in this, unfortunately.

--
Glenn Maynard
Re: FYI: lamerpad
On Wed, Mar 12, 2003 at 10:01:03AM +0900, Tomohiro KUBOTA wrote:
> Lamerpad, http://www.debian.org.hk/~ypwong/lamerpad.html, seems to be a
> good way for developers who don't know CJK languages to test whether
> their own software supports Kanji input or not.

A partial test, anyway. The IMs I've used need to know the cursor
position, to render the current composition, to know where to put
selection dialogs, and so on. I'd imagine that this type of program
wouldn't test that very well. (Unless it shows the best match as a
composition string, but I can't run it to see if it does that.)

> Of course, adoption of Unicode alone cannot make your software support
> CJK languages (more effort is needed). I hope Lamerpad will help test
> software and will lead to more software supporting CJK languages.

What more is needed? Combining (Korean) and double-width characters (in
the case of console apps) are two things that need special attention, but
they're both just parts of supporting Unicode. Other than that, and input
method support (which is unreasonably difficult at the moment--based on
conversations on this list--except in Windows, where it's merely
annoying), what more is needed in the general case?

--
Glenn Maynard
Re: FYI: lamerpad
On Wed, Mar 12, 2003 at 08:02:59AM +0100, Janusz S. Bień wrote:
> Sorry, but I've no time to look into the problem...

A 0.1 program whose upstream author won't look into problems is of limited
value, unfortunately. :)

--
Glenn Maynard
Re: mp3-tags, zip-archives, tool to convert filenames to UTF
On Tue, Feb 18, 2003 at 09:41:48AM +0100, Nikolai Prokoschenko wrote:
> > - mutt will work but you have to compile it against ncursesw (that
> >   means getting the ncurses 5.3 source and recompiling also)
>
> mutt from Debian doesn't have any problems at all!

Debian has a mutt-utf8 package that's compiled against ncursesw.

--
Glenn Maynard
Re: mutt and ncursesw
On Tue, Feb 18, 2003 at 01:50:58PM +0200, Jari P.T. Alhonen wrote:
> > Last time I checked, mutt compiled against the ordinary ncurses (as
> > opposed to ncursesw) does NOT work for characters with East Asian
> > width of 'full'. You may get the impression that it works because you
> > use it only for characters with East Asian width of 'half'. For CJK,
> > compiling mutt against ncursesw is a must.
>
> mutt-utf8 seems to contain the mutt binary and nothing else (apart from
> a changelog).

Of course; mutt-utf8 in Debian is a diversion. And it certainly does work
with CJK (because it's compiled against ncursesw).

Why mutt-utf8 is a separate package instead of the default in Debian, I
have no idea. It used to make sense, when mutt-utf8 was compiled against a
buggy Slang hack, but that's no longer the case; it's now just as
functional as the main binary. (I don't feel like spending the time trying
to convince the Mutt maintainer to change this, though.)

--
Glenn Maynard
Re: RE: filename and normalization (was gcc identifiers)
On Thu, Dec 05, 2002 at 11:02:17AM -0500, Maiorana, Jason wrote:
> Also, imagine the extra load on your system if, when you do:
>
> cat bigfiles | b | c | d | less
>
> the text is being normalized back and forth at every step of the
> pipeline.

That has nothing to do with the filesystem; pipes are 8-bit clean for
completely different reasons (you can pipe binary data through them).
(Not that I disagree; this is just a bad example.)

--
Glenn Maynard
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 04:03:38PM +0100, Keld Jørn Simonsen wrote: Well, users should not expect these two sequences to be identical, they are not, according to ISO/IEC 10646. Users expect that Ö == Ö, and don't know or care about Unicode, and that's reasonable. Programmers should care, of course, but programmers aren't the only ones who use filenames, and this problem, as Henry pointed out, is a more general one. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization
On Wed, Dec 04, 2002 at 12:49:15PM -0500, Maiorana, Jason wrote: As a side-note, I copy/pasted a command line flag from a RH8.0 manpage back into the console, and tried to execute the command. It failed, and gave me usage. The reason, I discovered, is that the manpage was not using a regular ASCII '-', but instead one of the HYPHEN or EM DASH things (which is why I HATE them). I think they're perfectly useful, including in manpages, but I agree they shouldn't be used in syntax displays. (Unless the application can actually handle them; which would, in fact, be neat in a novel way, though I think that would ultimately be a bad idea. :) Regardless, I don't think the O/S or filesystem code should enforce, require, or even know about normalization forms. Instead, a well-designed user interface should simply show non-normalized, over-coded, or invalid UTF-8 sequences as mojibake, in some standard way (such as big rectangles), such that it can still be copy/pasted and worked with, but not easily confused with proper stuff. The input method would always generate normal UTF-8, naturally. It's not clear whose responsibility this is. There are quite a few things that are invalid, and they're not easy to handle at every layer. For example, suppose you have a filename that begins with a combining character. If it's the terminal's job to deal with weird output, it can't do that here; if you run 'ls', the combining character will just get attached to the whitespace preceding the filename. ls has to handle it. It's probably the terminal's job only so far as always sending NFC when the user types (which seems to be the de facto standard, at least); beyond that it seems to be the job of tools. Pasting is a little fuzzier. What if I'm in Windows, and some other app I'm using uses NFD (for some, possibly valid, reason)? I don't want my terminal pasting text from that app in NFD (since it'll result in filenames on my system in NFD, for example). 
If the shell interface is designed to allow me to do everything in NFC (eg. by having ls and friends escape anything that's not in NFC, along with all of the other things it should be escaping), then it shouldn't be a problem to have terminals normalize output text in NFC. I think it's important that, in the end, I'm always consistently able to reference any filename displayed by ls via copy-and-paste; otherwise I'll have to go to annoying lengths to, for example, delete a file with a bad filename. Note that when I'm talking about ls escaping text, I mean that it should have a new flag indicating that it's allowed to use \u and \U escapes and that it should use those--and \x--for escaping UTF-8-related things; this would combine with whatever --quoting-style is in use, and might be good to default to being on. Things that would be useful to escape are invalid/overlong UTF-8 sequences, using \x; combining characters at the beginning of filenames, too many combining characters--configurable; anything of width zero that isn't a combining character (control characters); and possibly anything that isn't in NFC (all with \u and \U). (But, of course, none of this should be enforced by the kernel or libc; I think everyone is in agreement here.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
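The escaping behavior sketched in the post above is easy to prototype. Here's a minimal illustration (not ls's actual code; the function names are mine, and the validity rules follow RFC 3629: no overlongs, no surrogates, nothing above U+10FFFF): valid UTF-8 sequences pass through untouched, and every other byte becomes a \xNN escape.

```c
#include <stdio.h>
#include <string.h>

/* Return the length (1-4) of a valid UTF-8 sequence starting at s,
 * or 0 if the sequence is invalid, overlong, or truncated. */
static int utf8_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    int len, i;
    unsigned int cp;
    if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                   /* stray continuation or 0xF8+ byte */
    if (n < (size_t)len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* reject overlong encodings, surrogates, and out-of-range values */
    if (len == 2 && cp < 0x80) return 0;
    if (len == 3 && cp < 0x800) return 0;
    if (len == 4 && cp < 0x10000) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;
    return len;
}

/* Copy src into dst, replacing each byte that isn't part of a valid
 * UTF-8 sequence with a \xNN escape.  dst must hold at least
 * 4 * strlen(src) + 1 bytes. */
void escape_invalid_utf8(char *dst, const char *src)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = strlen(src);
    while (n > 0) {
        int len = utf8_seq_len(s, n);
        if (len > 0) {
            memcpy(dst, s, len);
            dst += len; s += len; n -= len;
        } else {
            dst += sprintf(dst, "\\x%02X", *s);
            s++; n--;
        }
    }
    *dst = '\0';
}
```

A real implementation would layer the \u/\U escaping for combining characters and non-NFC text on top of this, but the invalid-byte case is the core of it.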
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 08:41:42PM +0100, Keld Jørn Simonsen wrote: Users expect that Ö == Ö, and don't know or care about Unicode, and that's reasonable. Well, it is not equal if you code it differently. One is a letter and the other is a letter with some special combining accent. They do not compare equal either, at the most detailed level according to ISO/IEC 14651, the ISO sorting standard. This isn't something users care about, and it's not something users (including clueful Unix users) should ever have to care about. The only people who should ever have to care about this are programmers. It's perfectly reasonable for a user to expect that, if he creates a file with Ö in it on a Unix system from a Windows terminal and then tries to cat it from a Mac terminal, it'll work, even if the filename is pasted from another Mac program that happens to use NFD. The terminal should renormalize everything (including pastes) to NFC. Of course, it's reasonable for this to be an option, but NFC seems to be a sensible default, at least when connecting to Unix systems. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization
On Wed, Dec 04, 2002 at 03:11:01PM -0500, Henry Spencer wrote: When --help is printed, I want to see two hyphens, not a dash. You probably want to see two minus signs, not two hyphens... Err. Right. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 03:17:24PM -0500, Maiorana, Jason wrote: The terminal should renormalize everything (including pastes) to NFC. Then how will I paste in some wacky invalid filename into my terminal in order to, say, rm it? Like I was saying, pastes should not be normalized. I already explained this at length: ls (and other tools) should escape wacky filenames using \x, \u and \U. This is nothing new; ls already escapes things, so it's just an extension on existing functionality. Even if you don't normalize, unless ls does some quoting work, you're not going to be able to paste all strange filenames. For example, as I mentioned, combining characters at the start of a filename. Also, it's very difficult for terminals to handle this consistently. Is an invalid UTF-8 string one column wide? One per byte? There are definitions (eg. Markus has a page on it), but it's difficult enough to get width right without having to deal with this. Also, it's more difficult to have a terminal implementation that can remember invalid sequences on-screen to be able to copy them later; and it'd need to be handled in terminal layers, like Screen, and mbswidth() identically, or it'd become desynced. In practice, since this (precise displaying of invalid UTF-8 sequences) is a relatively obscure issue, this will never happen, and the result would be broken filenames causing screen desyncs and not easily being referenced (eg. to rm). Normalization form D has some serious drawbacks: if you were to try to implement, say, Vietnamese using only composing characters, it would look horrible. The appearance, position, shape, and size of the combining accents depends on which letter they are being combined with, as well as which other diacritics are being combined with that same letter. That's entirely a rendering implementation detail; it should be easy for the terminal's font renderer to normalize internally in whatever way is most appropriate. 
What scripts do you think NFD would be more appropriate than NFC for? NFC seems to be fairly (de-facto) standard in Unix. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 12:33:59PM -0800, McDonald, Ira wrote: Actually, I rarely link with just one library. And if the two (or more) different libraries had their identifiers normalized into different forms, then no solution will be possible. And since all these different codepoint representations of the same character look alike, any but the most sophisticated programmers will be defeated and just unable to link those two libraries with the same program. That aside, I use NFC, and I certainly don't want to have to switch my environment to NFD just to use a library! My environment shouldn't be dictated by the environment of some random library programmer. (That would have to include my terminal, so I'm able to type in identifiers for gdb, and so on.) In practice, my terminal isn't even capable of sending NFD, and I like it that way; it does help to ensure people who don't know what they're doing don't accidentally switch to NFD and start polluting filesystems with NFD filenames. (The situation would be uglier if people were actually doing that.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 04:11:46PM -0500, Maiorana, Jason wrote: I meant that rather than invisibly normalizing the paste, it would do what you say and print the escape sequences out. If it were to normalize on paste, it could be hiding problems. But other apps on the system might be using NFD. On those systems (eg Macs), that might be normal, and the text needs to be changed to NFC somewhere between being copied and being sent to the remote machine. Likewise, on those systems it might be appropriate for an NFC terminal to change copied text (eg. terminal - clipboard) to NFD, if that system expects NFD. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: English Unicode keyboards?
On Sun, Nov 10, 2002 at 09:08:17PM -0500, Henry Spencer wrote: I think that's pushing it a bit far; adoption of such a thing will be far more likely if space (which *is* the single most common character in most forms of text) remains under the right thumb. But it doesn't need to be particularly wide -- examine most well-used keyboards and you'll find a relatively narrow shiny spot on the space bar. Actually, mine has a dent ... -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Please do not use en_US.UTF-8 outside the US
On Sun, Oct 20, 2002 at 12:06:32AM +0200, Antoine Leca wrote: What's being suggested is that locales be generated per-region/language; eg. tell the system to generate tr_TR, and then be able to use all relevant encodings (ISO-8859-9 and UTF-8 and whatever else is convertible). Case mappings, collation rules, translation text and so on can be stored in Unicode and converted at runtime, probably still caching common encodings for speed. Seems like a nice, but naive, idea. If such a simple, generic solution were possible, I'd imagine it would have been done already. Windows NT did that in 1993. Exactly what you describe. Sorry. Sorry? I don't even see how this is relevant. NT and POSIX i18n are completely different, so just because NT can do it doesn't mean it's practical here. If you have a point, please say it; I can't even tell whether you agree with the idea (which is not my own) or not. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Please do not use en_US.UTF-8 outside the US
On Thu, Oct 17, 2002 at 07:06:52PM -0400, [EMAIL PROTECTED] wrote: I suppose one reason this isn't done is because locale generation does take quite a while (maybe 20 seconds per locale on my system). There are probably other, less obvious reasons this isn't done, but I don't know them. One such might be http://bugs.debian.org/99623 ; but that doesn't seem to prevent generating UTF-8 most of the time. It would be yet simpler to eliminate all non-utf-8 locales. It would be simpler, but since the vast majority of the world is still using legacy locales, it's irrelevant. Come back in 5-10 years, maybe; I'm talking about things that can be done today. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: SPAM
On Tue, Oct 15, 2002 at 10:08:41PM -0400, [EMAIL PROTECTED] wrote: This is one of my more favored lists, but it is a major spam re-forwarder. Can anyone in the world set it to subscribers-only-posting? (or actually filter) You can filter yourself, too, you know--SpamAssassin, for example. Subscriber-only posting is overly restrictive, since threads occasionally get crossposted to multiple relevant lists, and people posting are often in only one of them. Preventing that in the name of a little less spam is a poor trade. Besides, I only see a couple spams a day on this list at most. That's minuscule. By the way, if you have a name, you might want to set it. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Lazy man's UTF8
On Thu, Sep 19, 2002 at 03:03:30AM -0400, Michael B. Allen wrote: Is libiconv capable of doing wchar_t, UCS-4, and UTF-8 operations on Windows? I couldn't even build it (although I didn't try very hard). It should be able to do any conversion it can in *nix ... Giving wchar_t to iconv isn't portable, though, is it? (It's a bit of a hack, too, but a bearable one.) Hmm. Another thing, while we're on iconv: How do you get the number of non-reversible conversions when -1/E2BIG is returned? It seems that converting blocks into a small output buffer (eg. taking advantage of E2BIG) means that count is lost. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
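To make the E2BIG question above concrete, here is a sketch of the situation (the encoding pair, function name, and buffer size are arbitrary illustration choices): iconv(3) returns the number of non-reversible conversions it performed, but a call that fills the output buffer returns (size_t)-1 with errno == E2BIG, so the count for that particular call is unavailable.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Convert 'in' from UTF-8 through a deliberately tiny output buffer,
 * accumulating iconv()'s return value (the non-reversible conversion
 * count) where it's visible.  Calls that fail with E2BIG return
 * (size_t)-1, so their counts are lost -- which is exactly the
 * problem being asked about.  Returns the partial total, or -1 on a
 * real conversion error. */
long convert_counting(const char *in)
{
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) return -1;

    char *inp = (char *)in;
    size_t inleft = strlen(in);
    long total = 0;

    while (inleft > 0) {
        char outbuf[4];              /* tiny on purpose, to force E2BIG */
        char *outp = outbuf;
        size_t outleft = sizeof outbuf;
        size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
        if (r != (size_t)-1) {
            total += (long)r;        /* count visible only on success */
        } else if (errno == E2BIG) {
            if (outp == outbuf)      /* no progress at all: give up */
                break;
            /* iconv advanced inp past what it converted; loop around
             * with a fresh output buffer, but this call's count of
             * non-reversible conversions is gone */
        } else {
            iconv_close(cd);
            return -1;
        }
    }
    iconv_close(cd);
    return total;
}
```

(Stateless target encodings need no final flush call; a general wrapper would also do the trailing iconv(cd, NULL, NULL, ...).)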
Re: Lazy man's UTF8
On Thu, Sep 19, 2002 at 04:02:07AM -0400, Michael B. Allen wrote: Not if you're importing/exporting. But you might very well use it internally and if someone wanted to run that app on Windows too that's the kind of thing I would think libiconv should be good for so I was surprised I couldn't build it with full support. No, I'm referring to passing wchar_t as an iconv parameter; when was this added to iconv? I thought it was relatively recently. (It's a bit of a hack, too, but a bearable one.) Are you talking about Bruno's implementation? I have wondered if wchar_t could just be treated like any other encoding. It may not have a rigid definition but it wasn't clear to me why those wchar_t clauses in the main conversion loops really had to be there. The iconv interface is for char*'s; passing wchar_t* through it is a hack of forced casting, and you have to deal with adjusting buffer sizes for byte counts. It's easily fixed with wrappers, though. Yikes! You just left my sphere of knowledge :-) That was to anyone on the list who can answer it. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
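The "hack of forced casting" described above looks roughly like this (a sketch: the "WCHAR_T" pseudo-encoding is a glibc extension and not portable, the function name is mine, and a real wrapper would grow the output buffer on E2BIG instead of failing):

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a wide string to the given encoding via iconv's "WCHAR_T"
 * pseudo-encoding.  iconv's interface traffics in char*, so the
 * wchar_t* is forced through a cast, and the input size must be
 * given in bytes, not characters.  Returns a malloc'd NUL-terminated
 * string, or NULL on error. */
char *wcs_to_encoding(const wchar_t *ws, const char *tocode)
{
    iconv_t cd = iconv_open(tocode, "WCHAR_T");
    if (cd == (iconv_t)-1) return NULL;

    char *inp = (char *)ws;                        /* the forced cast */
    size_t inleft = wcslen(ws) * sizeof(wchar_t);  /* bytes, not chars */

    size_t outsize = inleft * 2 + 8;   /* generous for most encodings */
    char *out = malloc(outsize);
    char *outp = out;
    size_t outleft = outsize - 1;

    if (out == NULL ||
        iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        free(out);                /* a real wrapper would retry E2BIG */
        iconv_close(cd);
        return NULL;
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```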
Re: Linux and UTF8 filenames
On Thu, Sep 19, 2002 at 09:57:43AM +0200, Radovan Garabik wrote: There is a concept of filesystem encoding (NLS), but it requires root assistance, and does not solve the problem of two users having different locales, accessing the same filesystem - considering this situation, the only possible solution is to have filenames in UTF-8, and applications (such as ls) aware of it. No, the only possible solution is for all terminals UTF-8, too, and ls continues printing filenames as it is now. If I have a file héllo in UTF-8, and my terminal is ISO-8859-1, and ls helpfully recodes that for me, and I type cat héllo, cat doesn't know to recode the filename, so it doesn't work. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Linux and UTF8 filenames
On Fri, Sep 20, 2002 at 01:31:21AM +0200, Pablo Saratxaga wrote: Now, if what you meant, was the ability to mount an ext2 partition and tell it to convert its filenames using the kernel nls modules; yes, it could be done. But it would be somewhat tricky, since filenames need to be 8-bit clean except for / and NULL. It's a bag of worms with very little value ... -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Lazy man's UTF8
On Wed, Sep 18, 2002 at 10:14:35PM +0100, Robert de Bath wrote: iconv() is _fairly_ easy to use, the problem isn't that it's difficult, just that there's a lot you have to remember to do for a function that appears (at first) to have a simple job. It's easy to write a wrapper for the simple, common tasks. You almost never want to call iconv() directly from most code, unless you actually need to. //here is an example utf-8 formatter BTDTGTTS. BTDPQKKD! (trans: what?) Obeying the locale's encoding is both good practice and an absolute requirement for most; outputting UTF-8 in all locales is simply wrong. It's certainly very bad advice. But, you're converting utf-8 values that (strictly speaking) are out of range _and_ assuming the wchar_t is a UCS character. Why does Mr. Lazy even care about ancient non-__STDC_ISO_10646__ systems? He's lazy! :) (But you should be using mb[r]towc anyway.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
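For the mb[r]towc route mentioned above, a minimal locale-respecting decoder might look like this (a sketch; the function name is mine). The point is that it honors whatever encoding the current locale specifies instead of hard-coding UTF-8:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Decode a multibyte string in the current locale's encoding into a
 * wide-character buffer using mbrtowc(), rather than hand-rolling a
 * UTF-8 decoder.  Returns the number of wide characters written, or
 * (size_t)-1 on an invalid or truncated sequence. */
size_t decode_locale_string(wchar_t *dst, size_t dstlen, const char *src)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);     /* initial shift state */
    size_t srclen = strlen(src), n = 0;

    while (srclen > 0 && n < dstlen) {
        size_t r = mbrtowc(&dst[n], src, srclen, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;     /* invalid or truncated sequence */
        if (r == 0)                /* decoded an embedded NUL */
            break;
        src += r; srclen -= r; n++;
    }
    return n;
}
```

Remember to call setlocale(LC_ALL, "") first, or this decodes in the C locale (exactly the pitfall from the wcwidth thread above).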
Re: Lazy man's UTF8
On Thu, Sep 19, 2002 at 01:21:10AM -0400, [EMAIL PROTECTED] wrote: Unless you believe that locales shouldn't specify encoding, and are unhappy with their implementation (too global). If an application wants to provide more detailed encoding configurations (such as editing multiple files in different windows, like Vim can do), that's fine, but it should always default to obeying the locale (which Vim does). The locale certainly shouldn't allow saying things like use UTF-8 for the terminal and EUC-JP for files, since that's far more complicated. (What do you use if you're formatting from stdin? It might be either.) Also, using them isn't necessarily future-proof. For example you generally wouldn't want to use the mb functions if all your output was ucs-4 wide characters. (are there any utf-32 locales?) (assuming s/utf-32/ucs-4/; they're close, but not synonymous) No, but if there was, then the multibyte encoding would be UCS-4, and the mb* functions would treat them as such--wide characters and locale characters would contain the same binary data, mblen() would always return 0 or 4, and converting wc-mb would be a no-op. (Ignoring endianness, and all of the other numerous reasons you don't use UCS-4 as a locale encoding.) Why does Mr. Lazy even care about ancient non-__STDC_ISO_10646__ systems? He's lazy! :) Taking this argument to its logical conclusion; why care about those using legacy (non-UTF-8) encodings... My personal opinion is that there's been plenty of time for systems to support __STDC_ISO_10646__; the fact that almost all systems do is evidence that it's been long enough, and I don't want to go out of my way to support systems that are lagging so far behind. However, there are a lot more people who still, for one reason or another, can't use UTF-8, so there's a lot more reason to support legacy encodings. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: input methods
On Thu, Aug 29, 2002 at 05:43:24PM -0400, Maiorana, Jason wrote: Does anyone know of a general purpose input method library which is not dependent upon anything else? By that I mean not dependent upon X-Windows, not dependent upon a console, not relying upon locales whatsoever, and not tied to any specific application, and doesn't even know about fonts. I'd imagine this would be useful both as the backend of normal GUI IMs and also for use where standard IMs aren't suitable. For example, games: you want to render everything yourself, feed input from the user to the IM by hand (since you might be using something system IMs might not like, such as DirectInput), and not tie yourself to platform-specific IMs. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Forcing vim 6.0 to stay in UTF-8 mode in a UTF-8 locale
On Tue, Aug 20, 2002 at 10:42:23AM +0200, Bram Moolenaar wrote: do also: set fileencoding=utf-8 so that you do not encounter those nasty CONVERSION ERRORs The value of 'fileencoding' is changed as soon as you open a file. It's used to remember the encoding of the file (can be different from the encoding used inside Vim). You can also change it after reading a file, so that :w writes it with a different encoding. Well, is this exact? My default fenc is cp1252 (as I'm using the test setting I mentioned). If I load a UTF-8 file, fenc becomes UTF-8. But, if I then :new, the new window is created with fenc=cp1252, despite fenc being UTF-8. Doing a :set fenc in each window then shows that it's different for each, but :new always creates fenc=cp1252. This makes me conclude that there's a global fenc, which determines the default encoding of new files, and a local fenc to each window, marking the encoding of that file. That's fine, except it seems undocumented, and it's not clear how to explicitly set the global fenc versus the current local one. You probably want to set 'fileencodings' to utf-8 or make it empty. Then Vim won't check for a BOM or fall back to using latin1. You still get CONVERSION ERRORs when editing a file with an illegal byte sequence, and that's a good hint for the user. It'll also set the file readonly, though, which probably isn't wanted here. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
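For reference, the settings discussed in this thread combine like this in a vimrc (a sketch of the advice above; adjust to taste):

```vim
" Keep Vim in UTF-8 mode regardless of the files it opens:
set encoding=utf-8        " Vim's internal encoding
set fileencodings=utf-8   " only try UTF-8 when detecting a file's encoding
                          " (empty also works: no BOM check, no latin1 fallback)
" 'fileencoding' is per-buffer: it records the file's own encoding,
" and can be set after loading to re-encode the file on :w.
```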
Re: Forcing vim 6.0 to stay in UTF-8 mode in a UTF-8 locale
On Mon, Aug 19, 2002 at 06:13:23PM +0100, Markus Kuhn wrote: properly in UTF-8 mode, but it deactivates UTF-8 mode when you load instead a file that contains malformed sequences, such as http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt Make sure fencs and fenc are empty. However, it'll still set the ro flag when it finds invalid characters. That shouldn't happen here. Even worse, it also deactivates UTF-8 mode when you load a file that contains new Unicode 3.2 characters, such as http://www.cl.cam.ac.uk/~mgk25/UTF-8-demo.txt (that's ucs/examples/UTF-8-demo.txt) This works for me even with my normal fencs=ucs-bom,utf-8,latin1 setup; there's no reason Vim should ever fall out of UTF-8 mode for this reason. VIM - Vi IMproved 6.1 (2002 Mar 24, compiled Aug 13 2002 15:12:46) Upgrade? BTW. Bram, Vim isn't handling overlong sequences well. (It also doesn't handle 3.3 in UTF-8-test.txt like Markus suggests, but I think the display-every-character-in-hex behavior is better for an editor.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Forcing vim 6.0 to stay in UTF-8 mode in a UTF-8 locale
On Mon, Aug 19, 2002 at 12:54:24PM -0700, H. Peter Anvin wrote: One way is to treat each byte of a malformed sequence as a character (different from all real Unicode characters). This is a mostly good approach, except that it allows the user to construct a valid UTF-8 character out of malformed sequence escapes -- this may or may not be a problem in any particular application, but it needs to be taken into account, lest we get another instance of the overlong sequence problem. That's what Vim does. Malformed sequences show up as HEX, which functions as a single character. If the editor is 8-bit-clean, and you combine bytes that were parts of invalid UTF-8 sequences such that you have a valid UTF-8 sequence, you have a UTF-8 sequence; if I combine 0xC2 with 0xA9, it'd better write those two bytes to disk, even though it happens to correspond to U+00A9; doing anything else isn't 8-bit-clean. I tested this, and that's exactly what happens; pasting A9 in front of C2 turns the pair into (C). What could be done differently? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: world of utf-8
On Mon, Aug 19, 2002 at 08:29:21PM -0400, [EMAIL PROTECTED] wrote: The ultimate goal is that older encodings can start to fade away, and having every app that deals with text have to deal with a plethora of encodings and codeset conversion issues will be a thing of the past. Um, I think he knows this. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mk_wcwidth (OT)
On Tue, Jun 18, 2002 at 02:20:24AM -0400, Seer wrote: (Err ... how in the nineteen hells is this a simplification?) Well, mk_wcwidth would be algorithmically simpler itself, and all the interval/width data would be in one table or tree. (though a tree itself looks pretty bad when written as an initialized set of C objects) It's the complexity of the whole that I'm referring to, and setting up a tree is much more complicated. (You actually suggested code generation, which is orders of magnitude more complex.) If you really need a speedup for specific cases, it could work, but it's actually a tradeoff; speed one up and slow down others. (And it's not an even trade: for every one you move up the tree, you move two down.) Except for ASCII, that kind of tradeoff isn't very useful in general-purpose code. Not sure I agree with that. I think that a tree lookup would be significantly fewer compares. Admittedly, a difference wouldn't likely matter unless one was widthing megs worth of data. They're both O(log n) compares. They're doing the same thing, except a binary tree conceptually moves the binary search logic into the data structure. The only way you'd have fewer compares is if you optimized the tree for certain data sets, and except for ASCII, you can't do that in generalized code. If you think I'm wrong, please be specific. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
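The flat-table-plus-bisection approach being defended here is the structure Markus Kuhn's mk_wcwidth uses: a sorted array of intervals and an O(log n) binary search over it. A sketch (the table below is a tiny illustrative subset of combining-character ranges, not the real data; the real table has hundreds of entries):

```c
#include <stddef.h>

struct interval { unsigned int first, last; };

/* A tiny illustrative subset of zero-width (combining) ranges,
 * sorted by 'first' and non-overlapping, as mk_wcwidth requires. */
static const struct interval combining[] = {
    { 0x0300, 0x036F },   /* combining diacritical marks */
    { 0x0483, 0x0486 },
    { 0x0591, 0x05BD },
    { 0x20D0, 0x20FF },   /* combining marks for symbols */
};

/* Standard bisection over the interval table: O(log n) compares,
 * the same asymptotics as a balanced binary tree, with no pointer
 * structure to build.  'max' is the table's last index, following
 * mk_wcwidth's convention. */
static int bisearch(unsigned int ucs, const struct interval *table, size_t max)
{
    size_t min = 0;
    if (ucs < table[0].first || ucs > table[max].last)
        return 0;
    while (max >= min) {
        size_t mid = (min + max) / 2;
        if (ucs > table[mid].last) {
            min = mid + 1;
        } else if (ucs < table[mid].first) {
            if (mid == 0) return 0;   /* avoid size_t underflow */
            max = mid - 1;
        } else {
            return 1;                 /* ucs falls inside table[mid] */
        }
    }
    return 0;
}
```

A wcwidth-style function then just returns 0 when bisearch hits, and falls through to the single/double-width logic otherwise.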
Re: XTerm patch to call luit (2)
On Thu, Jun 13, 2002 at 09:43:19AM +0900, Tomohiro KUBOTA wrote: So, how do you think about the default of false? I don't like programs that support locales, but need special configuration to turn it on. They're annoying. Once people think the default should be true, then the default can be changed to true without annoying people. For any given default, people will be annoyed. It'll annoy me if it's false, and it'll annoy some other people if it's true. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ASCII and JIS X 0201 Roman - the backslash problem
On Fri, May 10, 2002 at 02:58:21PM +0200, Bruno Haible wrote: So it is a minor annoyance over the time of a few months, but by far not the costs that you are estimating. The problem isn't the conversion costs, it's the fact that Windows will continue to use the characters incorrectly, and will reintroduce the problem continuously. I'd give my left leg if someone would just show up and give me a reliable way to change my local Windows JP fonts to have a correct backslash. That would fix it for me, at least. It wouldn't help people that actually need to *use* the Yen symbol, since there'd still be no way to input the real single-width yen symbol, though it might be possible to add that to the input method. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ASCII and JIS X 0201 Roman - the backslash problem
On Fri, May 10, 2002 at 08:03:08PM +0100, Markus Kuhn wrote: The only long-term solution out of this mess is pure Unicode. Use proper Unicode fonts where U+00A5 is a (single-width) YEN and U+005C is a backslash, and (you normally should never need it) U+FFE5 is the FULLWIDTH YEN SIGN. An ideal long-term solution is of no use if it's impossible to get people to use it. Microsoft refuses to fix their buggy fonts, so it's unlikely this solution can ever see widespread use. Forget about the Shift_JIS and EUC_JP tradition and start to think in a context, where character semantics is completely and exclusively defined by Unicode. You will lose a few double-width characters (such as doublewidth Cyrillic and double-width block graphics), and you will discover that it is perfectly possible to write nice Japanese plaintext files nicely without any of these. For old files, people will surely Aren't there enough obstacles to getting Unicode accepted in some places without having to convince them they don't really need something they've been using for years? It doesn't really matter if it's true or not; it seems there are enough battles to be fought already. Out of curiosity, Tomohiro, is full-width Yen commonly used? (I'd guess 円 would be a more obvious choice for full-width.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Switching to UTF-8
On Thu, May 02, 2002 at 02:03:06AM -0400, Jungshik Shin wrote: I know very little about Win32 APIs, but according to what little I learned from Mozilla source code, it doesn't seem to be so simple as you wrote in Windows, either. Actually, my impression is that Windows IME APIs are almost parallel (concept-wise) to those of XIM APIs. (btw, MS Windows XP introduced an enhanced IM related APIs called TSF?.) In both cases, you have to determine what type of preediting support (in XIM terms, over-the-spot, on-the-spot, off-the-spot and none?) is shared by clients and IM server. Depending on the preediting type, the amount of work to be done by clients varies. I'm afraid your impression that Windows IME clients have very little to do to get keyboard input comes from your not having written programs that can accept input from CJK IMEs (input method editors) as it appears to be confirmed by what I'm quoting below. I wrote the patch for PuTTY to accept input from Win2K's IME, and some fixes for Vim's. What I said is all that's necessary for simple support, and the vast majority of applications don't need any more than that. Of course, what you do with this input is up to the application, and if you have no support for storing anything but text in the system codepage, there might be a lot of work to do. That's a different topic entirely, of course. It just occurred to me that Mozilla.org has an excellent summary of input method supports on three major platforms (Unix/X11, MacOS, MS-Windows). See http://www.mozilla.org/projects/intl/input-method-spec.html. I've never seen any application do anything other than what this describes as Over-The-Spot composition. This includes system dialogs, Word, Notepad and IE. This document incorrectly says: Windows does not use the off-the-spot or over-the-spot styles of input. As far as I know, Windows uses *only* over-the-spot input. 
Perhaps on-the-spot can be implemented (and most people would probably agree that it's cosmetically better), but it would probably take a lot more work. Ex: http://zewt.org/~glenn/over1.jpg http://zewt.org/~glenn/over2.jpg (The rest of the first half of the document describes input styles that most programs don't use.) The document states Last modified May 18, 1999, so the information on it is probably out of date. The only other thing you have to handle is described in Platform Protocols: WM_IME_COMPOSITION. The other two messages can be ignored. The only API function listed here that's often needed is SetCaretPosition, to set the cursor position. It's little enough to add it easily to programs, but the fact that it exists at all means that I can't enter CJK into most programs. Since the regular 8-bit character message is in the system codepage, it's impossible to send CJK through. Even in English or any SBCS-based Windows 9x/ME, you can write programs that can accept CJK characters from CJK (global) IMEs. Mozilla, MS IE, MS Word, and MS OE are good examples. Yes, you're agreeing with what you quoted. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Paper size
On Thu, May 02, 2002 at 05:30:47PM +0100, Edmund GRIMLEY EVANS wrote: But there is! Firstly, if you cut a piece of A4 paper into two halves, each has the same proportions as A4. Secondly, a piece of An paper has area 1/2**n of a square metre. Standard photocopier paper weighs 80 grams a square metre, so a piece of A4 weighs 5 g, and airmail postage rates go in steps of 5 g or 10 g ... Of course, it's not really 210x297mm; it's more like 210.224x297.302mm. These are just novelties to most people; I don't remember the last time I made a photocopy, and when I do, I don't mind that it doesn't scale perfectly. It's probably very useful for some people, but not most, and it's the majority that'll keep everyone from switching. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Switching to UTF-8
On Thu, May 02, 2002 at 11:38:38AM +0900, Tomohiro KUBOTA wrote: * input methods Any way to input complex languages which cannot be supported by xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or, any software-specific input methods (like Emacs or Yudit)? How much extra work do X apps currently need to do to support input methods? In Windows, you do need to do a little--there's a small API to tell the input method the cursor position (for when it opens a character selection box) and to receive characters. (The former can be omitted and it'll still be usable, if annoying--the dialog will be at 0x0. The latter can be omitted for Unicode-based programs, or if the system codepage happens to match the characters.) It's little enough to add it easily to programs, but the fact that it exists at all means that I can't enter CJK into most programs. Since the regular 8-bit character message is in the system codepage, it's impossible to send CJK through. How does this compare with the situation in X? * fonts availability Though each software is not responsible for this, "This software is designed to require Times font" means that it cannot use non-Latin/Greek/Cyrillic characters. I can't think of ever using an (untranslated, English) X program and having it display anything but Latin characters. When is this actually a problem? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Is there a UTF-8 regex library?
On Sun, Mar 31, 2002 at 03:53:52PM -0600, David Starner wrote: The dict standard dictates that all data crossing the wire shall be in UTF-8. Unfortunately, the reference implementation doesn't even try to get it right. I was discussing the issue with a maintainer of a Russian dictionary for dict, and part of the problem was that there was no UTF-8 regex engine. Does anyone know of a UTF-8 regex engine, preferably one that can be plugged into a GPL'ed C program easily? I know GNU grep (at least the alpha versions) implements generic multibyte matching. That's not an easy drop-in, of course. It was also orders of magnitude slower; I don't know if it was simply unoptimized. pcre(7) mentions experimental UTF-8 support. I haven't tried it. By the description, it looks extremely limited. In particular: 5. A class is matched against a UTF-8 character instead of just a single byte, but it can match only characters whose values are less than 256. Characters with greater values always fail to match a class. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
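For what it's worth, the reason a byte-oriented engine can't just be pointed at UTF-8 is that constructs like '.' and character classes must consume a whole multibyte sequence, not one byte. A minimal sketch of the sequence-walking logic such an engine needs (my helper names, not taken from grep or pcre; assumes well-formed UTF-8 input):

```c
/* Length in bytes of the UTF-8 sequence whose lead byte is b;
 * returns 1 for a stray continuation or invalid byte so that a
 * scanner can resynchronize instead of running off the string. */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* ASCII */
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Characters (not bytes) in a UTF-8 string: what '.' should count. */
static int utf8_strlen(const char *s)
{
    int n = 0;
    while (*s) {
        s += utf8_seq_len((unsigned char)*s);
        n++;
    }
    return n;
}
```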
Re: encdec-0.2.1 released
On Wed, Mar 13, 2002 at 05:05:25AM -0500, Michael B Allen wrote: char *dec_mbscpy_new(char **src, const char *fromcode); char *dec_mbsncpy_new(char **src, size_t sn, size_t dn, int wn, const char *fromcode); mumble grumble -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: encdec-0.2.1 released
On Wed, Mar 13, 2002 at 01:59:07PM -0500, Michael B Allen wrote: char *dec_mbscpy_new(char **src, const char *fromcode); char *dec_mbsncpy_new(char **src, size_t sn, size_t dn, int wn, const char *fromcode); mumble grumble Are you serious or are you joking? Serious in that they're overly-long names that don't follow patterns most everyone is used to; joking in that it's not a major issue that's worth spending time debating. (I'd probably rename them if I ever used the code.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Mar 07, 2002 at 10:54:11AM -0800, H. Peter Anvin wrote: But I can't see the BOM; ls just shows hello. That's why I'm suggesting that zero-width characters not useful in filenames be escaped as the above by ls and friends. (Nothing new; ls already escapes ASCII control characters and other things.) Agreed. ls -b in particular needs to be extra careful here. This *does* raise the question of what iswprint() and friends actually return. What about wcwidth() == 0? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
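One plausible rule, sketched below: escape anything that either fails iswprint() or occupies zero columns. This is only a guess at a policy, not what ls actually does, and it assumes setlocale() has been called so the classification data is loaded:

```c
#define _XOPEN_SOURCE 700   /* wcwidth() is an XSI interface on glibc */
#include <wchar.h>
#include <wctype.h>

/* Candidate test for "should ls-style output escape this character?":
 * anything unprintable, plus printable characters that take up no
 * columns (combining marks, ZWJ/ZWNJ, BOM, ...).  Assumes the caller
 * has done setlocale(LC_ALL, "") first. */
static int should_escape(wchar_t wc)
{
    return !iswprint((wint_t)wc) || wcwidth(wc) <= 0;
}
```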
Re: Statically link LGPL cp1252.h with MIT Licensed code?
On Mon, Mar 04, 2002 at 03:37:55PM -0500, Michael B Allen wrote: int enc_mbscpy(const char *src, char **dst, const char *tocode); int enc_mbsncpy(const char *src, size_t sn, char **dst, size_t dn, int wn, const char *tocode); char *dec_mbscpy_new(char **src, const char *fromcode); char *dec_mbsncpy_new(char **src, size_t sn, size_t dn, int wn, const char *fromcode); size_t dec_mbscpy(char **src, char *dst, const char *fromcode); size_t dec_mbsncpy(char **src, size_t sn, char *dst, size_t dn, int wn, const char *fromcode); for encoding and decoding strings. The two main differences here are that we're converting to/from many to one, where the one is the locale-dependent multi-byte string encoding (eg UTF-8), and that in addition to constraining the operation by sn and dn bytes you can also constrain the operation by the number of characters wn. Mbsncpy_new is like a mbsndup, and if dst is NULL for the dec_ functions it still works but Why not call it dec_mbs[n]dup? (I'd lean toward putting _dec/_enc at the end, too, but that's just my habits.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Statically link LGPL cp1252.h with MIT Licensed code?
On Sat, Mar 02, 2002 at 01:42:13AM -0500, Michael B Allen wrote: Can I statically link one of the codepage headers (eg cp1252.h) from libiconv with an MIT Licensed module? I would not actually alter the file of course, so a user could not modify the LGPL files in my module any more than if they had used libiconv directly. The LGPL is designed to allow programs with GPL-incompatible licenses to link against them; that license (assuming you mean http://www.jclark.com/xml/copying.txt) is GPL-compatible (says http://www.gnu.org/licenses/license-list.html), so you could link against it even if the header in question was GPL'd. (Strictly speaking, using headers isn't linking; I'm not sure how this is covered in the license, but the LGPL would be useless if it permitted linking but not including.) IANAL nor a license expert; assume all of the above is false. You'd be much better off looking for a license-oriented list or mailing the FSF. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Statically link LGPL cp1252.h with MIT Licensed code?
On Sat, Mar 02, 2002 at 01:42:26PM -0500, Michael B Allen wrote: Very strange that you ref James Clark's site, because it is his expat product that encouraged me to license my DOM as MIT, and I want to use the libiconv codepage headers to add support for extended character sets to this DOM that uses expat. Well, GNU's site says that the license is really the Expat license, not the MIT license. (That's how I interpret it, anyway.) Well actually these headers are not public and have code in them. The design calls for abstracting the conversion of a character to and from UCS codes by using a function pointer to code included in many different files. Regardless of the fact that these include files are .h files, they each have code in them. Well, if you're going to include the header itself *with* the program, you'll need to include a copy of the LGPL, too. I'm not sure if there are any other issues in this case. (FWIW, many glibc headers have inline code.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbscmp
On Mon, Feb 25, 2002 at 08:52:38PM +0100, Bruno Haible wrote: strncpy, strncat, strncmp cannot work for multi-byte characters because they truncate characters. You could write multibyte-aware versions of these, too, making them not truncate characters. That'd be useful for strncpy and strncat, at least. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
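A multibyte-aware strncpy along those lines might look like this; a UTF-8-only sketch with a hypothetical name (a general multibyte version would use mbrlen() per character instead of decoding lead bytes by hand):

```c
#include <stddef.h>
#include <string.h>

/* Copy at most n bytes (including the NUL) of UTF-8 from src to dst,
 * stopping at a character boundary so no multibyte sequence is ever
 * split; always NUL-terminates.  Returns the number of bytes copied,
 * excluding the NUL.  Assumes src is valid UTF-8. */
static size_t utf8_strncpy(char *dst, const char *src, size_t n)
{
    size_t i = 0;
    if (n == 0)
        return 0;
    while (src[i]) {
        unsigned char b = (unsigned char)src[i];
        size_t len = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 1;
        if (i + len + 1 > n)        /* whole sequence plus NUL must fit */
            break;
        memcpy(dst + i, src + i, len);
        i += len;
    }
    dst[i] = '\0';
    return i;
}
```

With "héllo" (h, 2-byte é, l, l, o) and n=3, plain strncpy would cut é in half; this version copies just "h".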
Re: mbscmp
On Mon, Feb 25, 2002 at 02:56:09PM -0500, Jimmy Kaplowitz wrote: I haven't tested this, nor really done anything relating to programming with i18n, but based on looking at man pages, you can use one of three functions (mbstowcs, mbsrtowcs, or mbsnrtowcs) to convert your multibyte string to a wide character string (an array of type wchar_t, one wchar_t per *character*), and then use the many wcs* functions to do various tests. My recollection of the consensus on this list is that for That's extremely cumbersome for everyday ops. Doing conversions at every turn is expensive, too. internal purposes, wchar_t is the way to go, and conversion to multibyte strings of char is necessary only for I/O, and there only when you can't use functions like fwprintf. However, wchar_t is only guaranteed to be Not always. Some people use the locale encoding internally; some use UTF-8 internally. They all have their advantages. wchar-based programs are still harder to debug; gdb doesn't deal with them yet. I expect there'll be a lot more libraries that expect locale-encoded char * strings in their API than will be providing an alternate wide interface. Using locale encodings internally is the quickest to start, but then you know nothing about your strings and need to convert everything for most ops (if you really want it to work). Converting existing programs is a case where wchar is particularly difficult. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
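For reference, the smallest useful form of the conversion dance uses mbstowcs's dry-run mode; a sketch (the helper name is mine) that assumes the program has already called setlocale(LC_ALL, "") and the input is valid in the locale encoding:

```c
#include <stdlib.h>

/* Count characters (not bytes) in a multibyte string by asking
 * mbstowcs to do a dry run: with a NULL destination it converts
 * nothing and just measures.  Returns (size_t)-1 on an invalid
 * multibyte sequence in the current locale. */
static size_t mbs_char_count(const char *mbs)
{
    return mbstowcs(NULL, mbs, 0);
}
```

A real wcs* operation would then allocate count+1 wchar_t's, call mbstowcs again to fill them, and convert back with wcstombs afterwards — exactly the per-operation overhead complained about above.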
strcoll and hiragana
On Mon, Feb 25, 2002 at 05:30:59PM +0100, Bruno Haible wrote: No. In glibc-2.2 strcoll works fine for all multibyte encodings. Speaking of which, this is perplexing me:

05:12pm [EMAIL PROTECTED]/2 [~] sort
あ
こ
ん
ん
こ
あ
(eof)
あ
こ
ん
ん
こ
あ

strcoll is returning 0. (Same for あ and ア.) (Language shouldn't matter, but this happens in both en_US.UTF-8 and ja_JP.UTF-8.) Kanji appear to be getting collated, however:

05:13pm [EMAIL PROTECTED]/2 [~] sort
日本
$Be:No(B
日本
(eof)
日本
日本
$Be:No(B

(I couldn't tell if that's the correct collation order, but it's clear they're being reordered, where the hiragana above are not.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
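A minimal way to poke at this from C, for reference; the helper name is mine, and it assumes setlocale(LC_ALL, "") has been called first (without it, the C locale makes strcoll degenerate to strcmp):

```c
#include <string.h>

/* Do a and b collate as equal in the current locale?  strcoll()
 * returning 0 for *distinct* strings means the locale's collation
 * data assigns them no relative order — the behavior observed above
 * for hiragana. */
static int collates_equal(const char *a, const char *b)
{
    return strcoll(a, b) == 0;
}
```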
Re: sorting order of Kanji
On Tue, Feb 26, 2002 at 09:42:25AM +0900, Tomohiro KUBOTA wrote: Kanji appear to be getting collated, however:

05:13pm [EMAIL PROTECTED]/2 [~] sort
日本
$Be:No(B
日本
(eof)
日本
日本
$Be:No(B

(I couldn't tell if that's the correct collation order, but it's clear they're being reordered, where the hiragana above are not.) It is impossible to collate Kanji by using simple functions such as strcoll(), because one Kanji has several readings depending on context (or word) in most cases. (This is the Japanese case.) (It is technically virtually impossible. It will need a natural language understanding algorithm.) I'm not concerned about the collation order of Kanji. (It's probably useful that there be one, even if it's just UCS order, to allow i.e. "sort | uniq".) There does seem to be collation for Kanji; I showed this to distinguish it from hiragana. The question was, why aren't katakana and hiragana getting collated? As far as I can tell, they should be. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Thoughts on keyboard layout input
On Sat, Feb 23, 2002 at 06:53:11PM +0100, [EMAIL PROTECTED] wrote: [on POSIX] I quoted the POSIX definitions. Nevertheless many people claim contradictory things about the POSIX point of view. Wonder whether my post went out to the list? This is the only reply I received from you to this thread; you might want to repost. [on delays and composing symbols] I have very bad experiences. For example, when using mutt on some remote machine to read my mail I may have to press downarrow five times before it is accepted. Sometimes net delay is such that it is impossible to get mutt to see an escape sequence. Delays for control characters (^[[A) and delays expected when actually typing are different. The former should never be a problem and I'd assume there's something wrong with your environment if that's happening--normally, the entire escape sequence goes out in a single packet so lag between packets shouldn't affect them at all. (Try vi-like j and k, by the way.) You can set ESCDELAY, if you need to (but you shouldn't.) I tend to lower this a lot, to get better response time for a single ESC. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Sat, Feb 23, 2002 at 10:18:28AM +0900, Gaspar Sinai wrote: This was just a suggestion to clean up things by specifying the characters that can be allowed for filenames. Currently we can not have /, ., .. and \0 for a filename. What if we say we can not More precisely, you can't have . or .. for a filename and you can not have / and nul *in* filenames, and you can look at the first two as these files already exist and not really a restriction as such. have composing and zero-width characters for a filename? Er, composing characters are OK, NFC just avoids them when there's a precomposed alternative available. (And Pablo said that there are some zero-width characters that are useful in filenames ... which is rather annoying.) Why can't we do that? Because filenames would go from being nearly 8-bit clean to having UTF-8 specific requirements. That's not the FS's job. And this wouldn't apply only to NFS: the problems you're describing would happen with local FS's, too--and they need to work with all active charsets, not just UTF-8. That would not need complicated normalization - just a character check. The current restrictions on filenames have been around forever, are unavoidable, and are the only things keeping filenames from being completely 8-bit clean. (Normalization involves changing text, as well; the existing restrictions are simply pass or fail.) Aside: can a UTF-8 string ever grow longer due to being changed to NFC? It's obvious that a wide char string can't, but it's not clear that this holds with UTF-8 (and if so, that it always will.) The problem occurs if normalization does happen - and some programs may do normalization. If any are normalizing to NFD, they should probably be changed to not do that. Fixing that isn't the FS's job. But the filesystem, C library calls, network protocols, etc. should *never* change filenames at all. That stuff must remain 8-bit clean (as far as it is now.) I'm not advocating any low-level constraints or normalization at all. 
I just want to be able to use UTF-8 in filenames, without hitting filenames that I can't use c+p to enter. That's not the FS's job to fix, it's the interface's. The simple solution, have tools escape zero-width chars and other oddities, isn't quite good enough, due to some of these characters being useful in filenames. (I might settle for it myself--I don't use any languages that need them--but it'd be nice to find a more general solution.) This isn't a new problem, it's new symptoms of an old one. The old ones were fixed by escaping invalid byte sequences, spaces, and ASCII control characters--the new symptoms just need to be worked out. (Invalid UTF-8 sequences aren't one of these new problems--ls already escapes those.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:08:24AM +0100, Radovan Garabik wrote: One thing that's bound to be lost in the transition to UTF-8 filenames: the ability to reference any file on the filesystem with a pure CLI. If I see a file with a pi symbol in it, I simply can't type that; I have to copy and paste it or wildcard it. If I have a filename with all Kanji, I can only use wildcards. (Er, meant copy and paste for the last; wildcards aren't useful for selecting a filename where you can't enter *any* of the characters, unless the length is unique.) sorry, but that is just plain impossible. For one thing, the c can quite well be U+04AB, CYRILLIC SMALL LETTER ES, ditto for other letters. But I agree that normalization can save us a lot of headache. Normalization would catch the cases where it's impossible to tell from context what it's likely to be. Input method should produce normalized characters. Since most filenames are somehow produced via human operation, it would catch most of pathological cases. Not just at the input method. I'm in Windows; my input method produces wide characters, which my terminal emulator catches and converts to UTF-8, so my terminal would need to follow the same normalization as input methods in X. Terminal compose keys and real keybindings (actual non-English keyboards) are other things an IM isn't involved in; terminals and GUI apps (or at least widget sets) would need to handle it directly. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:59:14AM +0100, Pablo Saratxaga wrote: It isn't that much of a problem. I think it's not a completely trivial loss, compared to an ASCII environment where filenames were completely unambiguous (invalid characters being escaped.) There doesn't seem to be any obvious fix, so I suppose it's just a price paid. The same thing could happen here; well, not as bad, as I don't think any program will purposely *change* the chars composing a filename previously selected (eg when doing open then save there wouldn't be any name change); but when a user will type manually a filename it could happen If a program wants to operate in a normalized form internally, it might, but that's probably asking for trouble anyway. that the system will tell him no such filename and he will be puzzled as he sees there is; as there is no visual difference between a precomposed character like aacute and two characters a and composing acute accent. Should control characters ever end up in filenames? I'd be surprised if many terminal emulators handled copy and paste with control characters well, if at all. (They don't need to be drawn, so I'd expect most that don't use them would just discard them.)

06:29am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBF`;'
06:29am [EMAIL PROTECTED]/2 [~/testing] ls
06:29am [EMAIL PROTECTED]/2 [~/testing] ls -l
total 0
-rw-r--r-- 1 glenn users 0 Feb 21 06:29 (rm)
06:31am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBFfile`;'
06:31am [EMAIL PROTECTED]/2 [~/testing] ls
file
06:31am [EMAIL PROTECTED]/2 [~/testing] cat file
cat: file: No such file or directory

I can't copy and paste it. Wildcards wouldn't help much if I'd stuck BOM's between letters (and *f*i*l*e* isn't very obvious, especially if you don't know what's going on, or if one's not really the letter it looks like), and tab completion may or may not help, depending on the shell. 
(Someone mentioned moving everything out of the directory and rm -f'ing; I should never have to do that.) Are control characters (and all non-printing characters) useful in filenames at all? If not, they should be escaped, too, to avoid this kind of problem. (Another one, perhaps: a character with a ton of combining characters on top of it. Most terminal emulators won't deal with an arbitrary number of them.) This reminds me of a discussion in pango and the ability to have different view and edit modes: normal (with text showing as expected), and another mode where composing chars are de-composed, and invisible control characters (such as zwj, etc) are made visible. Reveal codes for filenames? :) I don't know who would actually normalize filenames, though--a shell can't just normalize all args (not all args are filenames) and doing it in all tools would be unreliable. The normalization should be done at the input method layer; that way it will be transparent and hopefully, if all OS do the same, the potential problem of duplicates will never happen. See my other response: characters are often entered in other ways than a nice modularized input method; terminal emulators will need to behave in the same way as IMs for this to work, as well as GUIs at some layer. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:23:20AM +, Edmund GRIMLEY EVANS wrote: People are advocating normalisation as a solution for various kinds of file name confusion, but I can imagine normalisation making things worse. For example, file names with a trailing space can certainly be confusing, but would life be any simpler if some programmer decided to strip trailing white space at some point in the processing of a file name? I don't think so. You would then potentially have files that are not just hard to delete, but impossible to delete. If I have two computers, one sending precomposed and one not, I can't access my câr file, created on one, from the other. If terminal emulators, IMs, etc. send normalized characters, this isn't a problem. (It doesn't fix all problems, but it would help fix up some of the major ones.) Then, if a filename is being displayed by ls which doesn't fit the normalization form expected in filenames, display it in a way that shows what it really is. (c\u00E2r.) (Optional, of course.) This is less useful with the other unavoidable glyph ambiguities, though. cat certainly shouldn't normalize its arguments. I'm not even convinced that it's a good idea to force file names to be in UTF-8. Perhaps it would be simpler and more robust to let file names be any null-terminated string of octets and just recommend that people use (some normalisation form of) UTF-8. That way you won't have the problem of some files (with ill-formed names) being visible locally but not remotely because the server or the client is either blocking the names or normalising them in some weird and unexpected way. I'm not suggesting NFS normalize anything; this is just as important on a single system being accessed from multiple terminals. Sorry, the switch from NFS to filenames in general wasn't clear. What's so bad about just being 8-bit clean? Oh, network protocols *should* be 8-bit clean for filenames (minus nul). 
If I have a remote file with an invalid filename (an overlong UTF-8 sequence or just plain garbage), I'd better be able to access it over NFS. I don't think the FS (NFS, local filesystem, FTP, whatever) should touch filenames at all. (Mandating that they be UTF-8 in the standard is a good thing; enforcing it at the FS layer is not.) Related: I frequently can't touch filenames with non-English characters over Samba, and filenames with characters Windows bans from filenames. Windows displays them as some random-looking series of characters, and it doesn't always map back correctly. This doesn't really have anything to do with the network protocol--though the actual implementation problem might be in there--it's that it doesn't deal with invalid filenames properly. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
By the way, to all of the people threading on inputting other language text: I was showing a loss from ASCII--you can't type all filenames because some of them will have characters you can't necessarily type. This was a minor point, since (as I've said) it can't really be fixed. (Well, it could be fixed, but not cleanly.) OTOH, the unprinting character problem is important. Would it be reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output, ie ls -b), or is there some reasonable use of them in filenames? Combining characters at the beginning of a filename probably shouldn't be output literally, either. On Thu, Feb 21, 2002 at 03:33:40PM +, Markus Kuhn wrote: One thing that's bound to be lost in the transition to UTF-8 filenames: the ability to reference any file on the filesystem with a pure CLI. I can generate plenty of file names with ISO 8859-1 that you will have troubles typing in. Try a file name that starts with CR or NBSP just to warm up. Nothing new with UTF-8 here. Keep it simple.

02:01pm [EMAIL PROTECTED]/5 [~/testing] touch dquote hello
02:01pm [EMAIL PROTECTED]/5 [~/testing] ls
\nhello

ls escapes the control character. If I'm not in escape mode, it outputs a question mark; it never outputs it literally. It doesn't do this for Unicode unprinting characters. (NBSP isn't a problem here, since it can be copy-and-pasted.) Just like with the file £¤¥¦§¨©ª« I guess. Has that been a problem in practice so far? That can still be copy-and-pasted; the control character examples can not. Overly combined characters probably couldn't, either. We agreed already ages ago here that Normalization Form C should be considered to be recommended practice under Linux and on the Web. But Then we're in agreement. nothing should prevent you in the future from using arbitrary opaque byte strings as POSIX file names. In particular, POSIX forbids that the file system applies any sort of normalization automatically. 
All the URL security issues that IIS on NTFS had demonstrate what a wise decision that was. Please do not even think about automatically normalizing file names anywhere. There is absolutely no need for introducing such nonsense, and deviating from the POSIX requirement that filenames be opaque byte strings is a Bad Idea[TM] (also known as NTFS). Nobody's disagreeing on any of this. No, it won't. Unicode normalization will not eliminate homoglyphs and can't possibly. You try to apply the wrong tool to the wrong problem. Again nothing new here. We have lived happily for over a decade with the homoglyphs SP and NBSP in ISO 8859-1 in POSIX file systems. Security problems have arisen in file systems that attempted to do case invariant matching and other forms of normalization, and now we know that that was a bad idea (see the web attack log I posted here 2002-02-14 as one example). (this has been said already) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Fri, Feb 22, 2002 at 12:55:31AM +0100, Pablo Saratxaga wrote: OTOH, the unprinting character problem is important. Would it be reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output, ie ls -b), or is there some reasonable use of them in filenames? There are reasonable uses of zwj and zwnj and similar; they are needed for proper writing in some languages. In fact, all the trouble comes from the xterm, not from "ls". If a filename is a BOM followed by "hello", how can I enter it? I don't expect my terminal emulator to remember all control characters sent at any cursor position and paste them along with other characters, so I'd end up pasting "hello" alone. It's worse when the filename is *only* unprinting characters, and there's nothing on screen to copy at all. (That's just plain confusing, too.) We can't blame the terminal for not being able to copy and paste arbitrary sequences of bytes. It's not ls's "fault" either, per se (it's inherent), but that doesn't mean it can't help. I would say that ls should not escape them, only invalid utf-8 and control chars. Then, another command line switch should be added to "escape all but printable ascii". Well, I'd like all nonprinting characters escaped, but not, say, 日本語. That means I can copy and paste the filename, and characters that *can* be copied and pasted aren't escaped. (but see below) more complex options are not to be done in the command line on an xterm, a graphical toolkit is more suited for that. It's acceptable to go from "able to type all filenames with the keyboard" to "need to copy and paste filenames which I can't type directly". That's reasonable (if only because it's unavoidable). (As has been pointed out, it's already there in ISO-8859-1.) It's not acceptable to have filenames that I can't access from a CLI (with C+P) reliably at all (or that I need to switch to a special ls mode that escapes *everything* over ASCII to access.) 
Wildcards are a useful fallback, but they don't stand alone--it still wouldn't help me target a file consisting only of control characters, for example. Telling me to "use a GUI" is simply no good. (I'm not installing X on a 486 running FTP to delete a file someone dumped in my /incoming.) Files are an extremely fundamental part of a Unix system, and all fundamental parts of Unix are accessible from a CLI. That's always been one of its greatest strengths, and we can't throw that away for filenames. This is why GNU ls supports escaping. the reason is that with ls/xterm the rendering and the tool handling the filenames are dissociated, so you cannot easily do interesting things, ls supports escaping that matches bash's. (\ooo, \xHH, \n, etc.) If this is extended to include \u and \U, then ls can be extended to allow (optionally, for the sake of compatibility) displaying escape characters, etc. in that form. (I think that extension is useful, whether or not ls uses it.) Just because the tools aren't maintained by the same person doesn't mean there can't be cooperation. (Though, considering how difficult it's proving to be to get UTF-8 support at all in bash, I don't expect *all* shells to support this.) This doesn't involve xterm (or any terminal) at all, just the shell and tools. So, the only interesting change that would be worth doing for the use of utf-8 in filenames will be an extra switch to ls to quote everything but ascii, and ensure it quotes incorrect utf-8 when the locale is in utf-8 mode. I disagree; I think it's interesting, useful and practical to escape certain other cases. Leading combining characters, probably, and any characters not useful in filenames. (Of course, it's not necessarily easy to determine what's useful. I don't see BIDI support in filenames as useful--that seems to be a property of whatever text is displaying the filenames, not the filename themselves--but I'm not a BIDI user, so I can only guess.) 
I'm unclear on how control characters that change state behave in filenames at all. To pick a simple example, what if a filename contains the language code "zh"? I can no longer do a simple C program that outputs "The first file is %s. The second file is %s. [...]" as the text after the first %s is marked Chinese. (This probably won't break anything, but other control characters probably would.) Invalidate all state after outputting a filename? Complicated. (I don't know what zwj and zwnj do; perhaps a more practical example could be made with them.) Anyone feel like filling me in here? This would be like embedding ANSI color sequences in filenames and ls letting it through: the color would bleed onto the next line unless ls knew to reset the color after each filename. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
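Backing up to the \u escaping idea from earlier in this thread: concretely, it could look like the sketch below. escape_name is a hypothetical helper, not anything ls does today, and the printable test assumes setlocale() has been called so iswprint()/wcwidth() have data to work with:

```c
#define _XOPEN_SOURCE 700   /* wcwidth() is an XSI interface on glibc */
#include <wchar.h>
#include <wctype.h>

/* Copy a filename into out (capacity outlen wide chars), rewriting
 * anything unprintable or zero-width as \uXXXX (\UXXXXXXXX beyond
 * the BMP), so a name like BOM+"hello" is visibly distinct from
 * "hello".  Truncates rather than overflowing; always terminates. */
static void escape_name(const wchar_t *name, wchar_t *out, size_t outlen)
{
    size_t used = 0;
    for (; *name; name++) {
        if (iswprint((wint_t)*name) && wcwidth(*name) > 0) {
            if (used + 2 > outlen)      /* char + NUL must fit */
                break;
            out[used++] = *name;
        } else {
            int n = swprintf(out + used, outlen - used,
                             *name <= 0xFFFF ? L"\\u%04X" : L"\\U%08X",
                             (unsigned)*name);
            if (n < 0)                  /* escape didn't fit */
                break;
            used += (size_t)n;
        }
    }
    out[used] = L'\0';
}
```

A "zh" language tag or a stray color sequence handled this way would at least be visible instead of silently changing state in the terminal.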
Re: brocken bar and UCS keyboard
On Thu, Feb 21, 2002 at 09:49:01PM -0500, Henry Spencer wrote: No question there, but I think you have missed my point. The most crucial step is simply to get people to realize that there is more than one symbol involved and that the choice matters. So long as hitting the - key always gets them hyphen, that's not going to happen. Having them grumble that the stupid software keeps picking the wrong one would be an *IMPROVEMENT*. When they're visibly very similar, do you think most users are going to use them right, no matter how accessible they are? Hyphen and dash are distinct (most people who use dashes also know that you need two hyphens to act as a dash, not one), but a single hyphen looks reasonable as a minus sign in most fonts. A real minus sign usually looks better, but I doubt most people will care enough to want to learn the difference between *four* different characters on their keyboard that generate a horizontal line--hyphen, dash, minus and underscore. If they won't do that, they won't even consider changing their typing habits. Would you add separate open double quote, close double quote, open single quote, close single quote, neutral single and double quotes, apostrophe and backtick keys, too? They're all useful, but that's one heck of a keyboard. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: POSIX:2001 now available online (OT)
On Thu, Feb 07, 2002 at 04:05:22PM +, Markus Kuhn wrote: The revised POSIX standard, which has been merged with the Single UNIX Specification is now available online in HTML! For your bookmarks: http://www.opengroup.org/onlinepubs/007904975/toc.htm Neat--it completely blows up in IE6; http://zewt.org/~glenn/oops.jpg for the curious. Looks like you have to go through their annoying registration to use that URL. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 01:26:33PM +0900, Gaspar Sinai wrote: I just browsed through RFC-3010 and I found one thing that bothers me and it has not been discussed yet (I think). RFC says: "The NFS version 4 protocol does not mandate the use of a particular normalization form at this time." How do we mount something that contains a precomposed character like: U+00E1 (Composed of U+0061 and U+0301) If the U+0061 U+0301 is used and our server is assuming U+00E1, can a malicious hacker set up another NFS server that has U+0061 and U+0301 to mount his NFS volume? I could even imagine very tricky combinations with Vietnamese text but that would be another question... Forgive my ignorance if this was discussed - I did not see it in the archives. One thing that's bound to be lost in the transition to UTF-8 filenames: the ability to reference any file on the filesystem with a pure CLI. If I see a file with a pi symbol in it, I simply can't type that; I have to copy and paste it or wildcard it. If I have a filename with all Kanji, I can only use wildcards. A normalization form would help a lot, though. It'd guarantee that in all cases where I *do* know how to enter a character in a filename, I can always manipulate the file. (If I see cár, I'd be able to cat cár and see it, reliably.) I don't know who would actually normalize filenames, though--a shell can't just normalize all args (not all args are filenames) and doing it in all tools would be unreliable. A mandatory normalization form would also eliminate visibly duplicate filenames. Of course, it can't be enforced, but tools that escape filenames for output could change unnormalized text to \u/\U. I don't quite understand the scenario you're trying to describe, though. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: isprint() under utf-8 locale
On Fri, Feb 15, 2002 at 12:37:27PM +0100, Radovan Garabik wrote: in theory, yes but often it is used to filter out characters that should not go straight to the terminal, where they can be a source of a DOS attack (colour codes, switching terminal into graphics mode, backspaces - I happened to be a victim of such a joke a long time ago). ASCII escape values are still recognized as nonprintable, so none of these are a problem. (UTF-8 terminals shouldn't have a graphics mode, of course.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Security
You didn't seem to respond to the comments of your page on the earlier thread. If you're going to take such an extreme stance as "Unicode text is inherently insecure", you need to defend it. So, my own impressions: On Fri, Feb 15, 2002 at 10:16:39AM +0900, Gaspar Sinai wrote: I mostly recovered my shock :) Most people pointed out that the real juice on my security page was the second example. http://www.yudit.org/security/ "At yudit.org, we maintain the view that Unicode text is inherently insecure, until the current bi-directional algorithm defined by the Unicode Consortium is changed to be reversible. There should be an algorithm defined that converts logical order to view order, and there should be a separate algorithm defined that converts view order to logical order. If such an algorithm pair existed we could also run a sanity check on our rendering software. At yudit.org we will not digitally sign a Unicode document while this possibility exists." Mind elaborating on this logic? Since there's an off chance that text might be seen incorrectly in a few languages (and if this happens, there's an off chance in a few extremely contrived cases that it might make a sentence with a different meaning), you'll never sign messages in any language at any time? Signing text doesn't say "you will interpret this message as I intend"; it just makes sure it doesn't get tampered with in transit and verifies who the message is from. It's not the signature's job to make sure it's rendered, read or interpreted correctly. Assuming that this *is* a real security problem, not signing messages doesn't help anything; it just reduces security further. I can hardly see what this has to do with signatures at all. Also, regardless of the severity of this problem, Unicode text is not *inherently* insecure; that implies it's fundamentally flawed and can't be fixed. I don't think that's what you mean. 
The rest of the page is useful as an example of the problem; whether or not it's a serious issue is debatable, but it's clearly something people should know about. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Security
On Fri, Feb 15, 2002 at 01:44:03PM +0900, Gaspar Sinai wrote: Which pretty much shows that there is an ambiguity and the algorithm should change. My argument would be: if it needs to be changed anyway, can it be changed to make digital signatures easier, and to put scripts in it, like Old Hungarian (rovasiras), that can be written in both directions? The Unicode standard and the standards concerning digital signatures are separate. Fixing Unicode doesn't imply any changes in signatures. I could not reach this level in my arguments because I was told that there is no problem at all, and I felt I had two choices: being violent or just silently unsubscribing from the list. I chose the latter. You showed that there are problems with bidi rendering, and I don't think anyone disagreed with that. Your example was too contrived for people to consider it a major problem. (By the way, I don't think "violent" is the word you're looking for, unless you think your first choice was to mailbomb the list or something. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 09:54:52AM +0200, Zvi Har'El wrote: I am using the same Mutt 1.3.27i (2002-01-22) with utf-8, and it even has problems 1.2.5 didn't have. For example, when the subject includes characters with 2-byte utf-8 representations, its length is not calculated correctly for display in the index page, and it is truncated prematurely; but when you step over it with the cursor, it highlights also the next line with the rest of the subject. Refreshing the index eliminates the phenomenon. I had no problems with viewing the subject in the message page, both in the title and in the headers. I am using an external pager, less, and mutt passes it the correct subject. But this was also ok in 1.2.5. My configuration

Current ncurses doesn't deal with multibyte characters, so the cursor position becomes desynchronized. Mutt has a special case for UTF-8; it sends UTF-8 line drawing characters manually:

    case M_TREE_LLCORNER:
      if (option (OPTASCIICHARS))
        addch ('`');
      else if (Charset_is_utf8)
        addstr ("\342\224\224"); /* WACS_LLCORNER */
      else
        addch (ACS_LLCORNER);
      break;

They may have added this since 1.2.5. It helps with Debian's multibyte-patched version of Slang, which breaks the ACS stuff in the ncurses emulation. When compiling with real ncurses, however, it'll just confuse it. (It'll draw the character correctly, and desync the cursor.) I don't know what happened to line drawing characters with ncurses in UTF-8 before this special case. Try adding "set ascii_chars" to your .muttrc as a quick workaround. I don't know of any quick workaround for actual subjects with non-ASCII characters (which will be represented with two or more bytes in a UTF-8 locale). You shouldn't be having any problem if you're in a simple 8-bit locale, displaying subjects with UTF-8 in them; I've never seen any problems like that, though. A warning about using slang as ncurses in general: it's not perfect. It'll break meta-characters in Mutt and most other apps. 
(This is fixed; the fix isn't released yet.) There are probably other glitches. (Note that the above special case isn't actually wrong; if the ncurses in use is locale-aware, it should be able to handle it. However, if the ncurses in use is locale-aware and does ACS properly, it's also completely unnecessary--it should go away once multibyte ncurses is stable.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 05:43:30AM -0500, Thomas E. Dickey wrote: Current ncurses doesn't deal with multibyte characters, so the cursor position becomes desynchronized. There are enough multibyte calls implemented in ncursesw to make this work. (addstr is not one of them - but I don't see that the OP was using ncursesw anyway). I don't believe that anyone tried making it work with ncurses first. There are 66 addstr() calls in Mutt, and I don't know what other functions can't cope with multibyte. Every ncurses call it makes will need to deal with it; even basic messages may contain multibyte UTF-8 if it's in a different language. If a basic function like addstr doesn't support it, then I'd assume a lot of work would be needed. The slang patch is almost drop-in, so it's an easy stopgap until multibyte support is in mainstream ncurses. (I'm assuming that the ncursesw naming is temporary, until it's fully implemented.) It just needs a couple of workarounds to make it work for UTF-8. (There may be other reasons for the UTF-8 line drawing special case; I'm naming the one major effect I know of.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 02:33:20PM -0500, Thomas E. Dickey wrote: ...and it's trivial to redefine it with a wrapper. And any other ncurses calls that take text. This, presumably, won't be needed once the regular (non-wide) functions are multibyte-aware; the slang patch is just a stopgap until that's ready. my point: the number of people who actually follow up with proposed patches I can count on one hand - while I've lost track of the people who stand around waiting for someone else to do the work. I'm saying why I think Mutt acts like it does; I'm not proposing it be changed. I'm fine with leaving it alone until ncursesw is done. Mostly. I'm not excited about it being the way it is in the next Debian release, since that means a lot of people will be stuck with it; if there's anything that'll get me to write a patch intended to be removed shortly after, it's that. By the way, ncurses(3X): The ncurses library is intended to be BASE-level conformant with the XSI Curses standard. Certain portions of the EXTENDED XSI Curses functionality (including color support) are supported. The following EXTENDED XSI Curses calls in support of wide (multibyte) characters are not yet implemented: ... addstr isn't in this list, so I assume this is a list of missing wide support, not multibyte; perhaps (multibyte) should be removed so this doesn't imply that multibyte is implemented? The slang patch is almost drop-in, so it's an easy stopgap until multibyte support is in ncurses mainstream. (I'm assuming that the ncursesw naming is temporary, until it's fully implemented.) It just needs a couple actually the more I look at it, the better it looks from the standpoint of compatibility - not that this is guaranteed to have much impression on bulk packagers. (I understand that slang users cannot possibly be concerned about compatibility - or else they haven't thought very long about it). Er, what looks better for compatibility with what? 
The slang patch is horrible for compatibility (it's not binary-compatible, not quite source-compatible, though most programs wouldn't notice, and it's a bit ugly). Do you mean that leaving wide and multibyte support in its own library is better for compatibility? I'd hate to see that, at least for multibyte support--which, presumably, would depend on wide support. What problems could that cause? (Programs that aren't locale-aware won't setlocale(), so the behavior should be unchanged.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 04:14:21AM +0100, Damjan wrote: Anyone seen this? I'm using mutt 1.2.5.1i and sometimes it would crash when entering my linux-utf8 mail folder. Well, it turns out that this message crashed mutt Message-ID: [EMAIL PROTECTED] because it contained a line like this: From: Richard =?utf-8?B?xIxlcGFz?= [EMAIL PROTECTED] Is there something wrong with my mutt version, or is this a known bug of mutt? btw - Slackware 8.0, glibc 2.2.3, gcc 2.95.3, mutt compiled from source. 10:31pm [EMAIL PROTECTED]/5 [~] mutt -v Mutt 1.3.27i (2002-01-22) I'd upgrade. (I'd point this at mutt-dev, too. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ncurses or slang [Re: UTF-8 support status.]
On Sun, Feb 03, 2002 at 11:28:40AM -0500, Thomas Dickey wrote: Er, xterm shouldn't honor ACS controls in UTF-8 mode. One of the reasons I like UTF-8 as a terminal encoding is that they don't explode if I accidentally dump random binary data to it, which I tend to do at least once a day. :) hm (doesn't explode). try this (if you do) reset; tput enacs I know how to fix it; UTF-8 means it never happens to begin with. It's something that should go away completely with the UTF-8 transition. Leaving it on in the meantime doesn't hurt, as long as it's configurable. (It doesn't matter to me; I don't use X, and my terminal emulator handles this the way I prefer.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ncurses or slang [Re: UTF-8 support status.]
On Sun, Feb 03, 2002 at 06:55:23PM +, Markus Kuhn wrote: Thomas Dickey wrote on 2002-02-03 16:28 UTC: Er, xterm shouldn't honor ACS controls in UTF-8 mode. One of the reasons I like UTF-8 as a terminal encoding is that they don't explode if I accidentally dump random binary data to it, which I tend to do at least once a day. :) hm (doesn't explode). When I execute in the UTF-8-mode xterm [XFree86 4.0.1h(149)] of Red Hat 7.1 in a shell the line printf '\x1b(0' then xterm changes the Unicode values U+0020 to U+007E to the DEC graphics character set, even though it is supposed to ignore ISO 2022 sequences while being in UTF-8 mode, because UTF-8 is one of the encodings outside ISO 2022 in the sense of ISO 2022. Has this bug been fixed in more recent versions of xterm? It seems the problem is that terminfo has no real way to deal with these sequences. That is, if I'm in UTF-8, then my terminfo caps acsc, enacs, smacs, and rmacs need to be changed. Terminfo can't simply blindly change what it returns for these sequences because the locale charset is UTF-8; there's a chance the library is being used to simply read caps (ie. infocmp). (The basic problem is that there's nothing in the terminfo API to handle this interaction between terminfo entries and the locale. It could be handled at a higher level--completely within ncurses--but then terminfo would still be returning incorrect information.) This needs to be sorted out before terminal emulators can drop these codes when in UTF-8 mode, since doing the latter first means breaking line drawing characters for all terminfo/ncurses/slang apps. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Announcing Bytext
On Sun, Feb 03, 2002 at 06:15:33PM +0100, Pablo Saratxaga wrote: Many of the elegant features of Unixes depend on the notion of 8 bit transparency: pipe, cat, echo... the byte stream is the common denominator. The functions are general purpose and thus more useful. Bytext takes this elegant notion to its logical conclusion: not only can you process text as bytes, you can also process bytes as text. I don't understand; how can you encode in an 8bit space all the characters of the world's languages? And if it is a multi-byte encoding, then it should have about the same problems as utf-8 or euc have when faced with byte-only utilities. It sounds to me that any 8-bit character sequence (hopefully excluding nuls) is a valid character. That doesn't sound particularly useful, though. (So what if an arbitrary byte sequence can be displayed as random-ish characters of equally random languages?) If it's the case that any string of bytes is a valid character, then that brings up the question of how robust it is. (Seeking, sync; issues that UTF-8 solved.) I tried to look this up, but one of the first things I saw when paging down the Word version (after it asked me for a password but worked anyway) was: "Unicode is messed up beyond repair." I promptly became disgusted and closed the window. Remarks like that have no place whatsoever in a standard. How can he possibly wonder why he gets negative reactions from Unicode folks when he's making comments like this? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ncurses or slang [Re: UTF-8 support status.]
On Sun, Feb 03, 2002 at 07:34:58PM +, Markus Kuhn wrote: I believe you are thinking the wrong way here. As soon as you are in UTF-8 mode, the only correct way to send block graphics characters to the terminal is via the U+25xx UTF-8 sequences, not via terminfo ISO 2022 fiddling. Terminfo sequences must *not at all* be used in UTF-8 locales to draw certain characters. The way to interface with this doesn't need to change substantially: make the acsc cap capable of dealing with multibyte encodings. Then, if you're in UTF-8, enacs, smacs and rmacs are blank (since there's no state) and make the acsc mapping map directly to UTF-8 strings. Exactly how this string can deal with multibyte characters is an internal terminfo implementation detail; the end result is that the acsc returned from terminfo can be interpreted as a multibyte string in the current locale. Make ncurses deal with this (not difficult) and you get UTF-8 support without changing the basic terminfo. (That's important, of course: UTF-8 doesn't need real special casing by things using terminfo, and apps are more likely to work in older encodings by people who write and test primarily in UTF-8.) The only problem with this is how terminfo knows the application wants or does not want this behavior, since apps using terminfo for purposes other than actually rendering to the terminal may not want it. If wctomb(&seq_hor, 0x2500) > 0, then do not use terminfo to draw this graphics character, because you already have the correct sequence to draw BOX DRAWINGS LIGHT HORIZONTAL stored in seq_hor. Then you have to special case these characters further; it'd be nice to avoid that. (And, er, don't you mean wctomb(seq_hor, 0x2500)? seq_hor needs to be a char[MB_CUR_MAX], not a char.) Long term, doing this is better than leaving things as they are, of course; I think the above is better, though. Sounds like there are bugs in both ncurses/slang and xterm here at the moment that cancel each other out. 
Both should be fixed as soon as possible. But the terminfo/ncurses/slang problems need to be fixed first; that way there's no period where line drawing characters simply don't work. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Announcing Bytext
On Sat, Feb 02, 2002 at 02:16:37AM -0800, Bernard Miller wrote: Hopefully flags will go off when members of this list read things that are equivalent to "I don't understand it, but here is my opinion on it". Flags certainly go off; but which flags depend on who is saying it. David Starner is not an idiot. Bytext is a superset of Unicode normalization form C, so it certainly encodes all of ASCII including form feed, and all combining characters. ASCII code points are rearranged partly so that characters like form feed can be quickly identified by normalization algorithms. This is far from losing ASCII compatibility. It simply means that conversion must be proper, not simply ignoring certain ranges. Also, there is no need for a new In other words, losing ASCII compatibility. If I have to convert the file, then it's not compatible; it needs an intermediary. That's the biggest reason UTF-8 exists; it provides a relatively easy transition path, since it's a superset of ASCII. Without that, UTF-8 would never have caught on, either. that it will never catch on. Many people who seem to have an emotional attachment to Unicode seem to be providing this as the only evidence that Bytext is not worthwhile... as if how interesting something is should be directly related to how well developed and popular it is. Again, I hope flags go off. If the only thing this has over UTF-8 is fast regex, then it loses overall; complexity is a strike bigger than the gain. I read a simple description of UTF-8 once and immediately had a strong understanding of its structure, capabilities (easy reverse scanning; fast substring searching), advantages (direct compatibility with ASCII, robustness that most multibyte encodings lack) and so on. Note also that popularity among developers *does* say something that popularity among the masses does not. Developers tend to choose their APIs and standards more deliberately than users choose their software. 
You really need to stop arguing your point by arguing the motives (emotional attachment) and insulting the intelligence of people (they just can't understand it!) disagreeing with you. As you say, flags go off when I see that. (Ad hominem flags, incidentally.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sun, Jan 13, 2002 at 03:38:55AM -0600, [EMAIL PROTECTED] wrote: Now, it's not too hard for Xiph to avoid this problem, as long as they define how to handle these translations. Why should they define it? It's at the wrong level - let the system define the conversion. Because that's not portable. Read http://www.debian.or.jp/~kubota/unicode-symbols.html. But the easy solution for Ogg--0x5C to U+00A5--doesn't work for a lot of things. I can't convert everything from CP932 to standard Unicode this way; my C source containing printf("Hi\n"); would no longer function, since the \ is converted to a yen symbol. Like anyone involved in this discussion couldn't have written code to convert the backslashes in C code intelligently in the time it took to have this argument. Heck, we could probably have even traced variable usages to find what's used as a filename argument in this time. An Excel programmer could probably have done the exact same thing in this time. Then you introduce all of the complexity and unreliability of intelligent parsers, instead of the simplicity of translation tables. It also means that iconv() simply won't work for this translation. Every application that uses iconv() would have to know data types (to know which parsers and heuristics to use) and have a special case for this. This isn't about translating CP932 to Unicode once, it's about allowing them to coexist peacefully, letting CP932 be phased out, as is done with every other charset. There is an upgrade path; intelligently convert the character. I think fixing the problem now is better than everyone dealing with it for the next 40 years. If it were so easy to do, we wouldn't be having this discussion (nor would any of the others who have had this discussion, so many times in the past.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
I'm not even certain where the conversation is now; there are two distinct issues: 1: handling of CP932 0x5C, and 2: portable translation tables. (These only partially overlap.) Since one of your mail readers doesn't honor References, the threads get broken and are much harder to follow. So, if I mix responses to these issues, let me know. On Sun, Jan 13, 2002 at 06:06:11PM -0600, David Starner wrote: Because that's not portable. Read http://www.debian.or.jp/~kubota/unicode-symbols.html. I know the problem. It still doesn't mean that every file format that includes Unicode should define its own solution. So we should sit back, accept Unicode as nonportable, and provide things like RFC 2047 so people can use other encodings? No thanks. And if we simply say "use UTF-8", and people use whatever translation tables their system happens to use, then it's a lot harder to fix things if and when Unicode standardizes it. If the file format uses a specific set of translation tables, then as long as you can tell if the format is using the old one, you can convert it to the new one automatically. If it doesn't do that, the file might have been converted with *any* table, and it's quite impossible to fix existing data. And file formats aren't going to wait to be used until Unicode fixes the portability problems, especially since it's not even clear that they intend to fix it at all. Yes? The main difference I see between my solution and yours is that yours introduces intelligent parsers into every Unicode system, whereas mine deals with it at one place, where the conversion from CP932 happens. I'm not advocating intelligent parsers at all. (In fact, all of the suggested solutions have their problems; I believe this particular suggestion has by far the most.) Every application has to special case it under your situation, too. Under mine, only systems that plan to deal with CP932 have to special case it, and that code will eventually be removable. Nope. 
Using a specific translation table merely means changing your iconv() call to one provided that uses them. Using intelligent parsers means you need to have different parsers for each data type, so you can't use a simple interface like that. Apparently they have a hard time coexisting - poor semantics on CP932's fault, not Unicode's. I don't see transfering that bug to Unicode will help things in the long run. It doesn't matter who's fault it is (I believe it would be JIS X 0201 Roman, where Tomohiro said CP932 got 0x5C.) It's in heavy use, and it needs to be dealt with. ISO646-DE users did it. So did ISO646-DK, ISO646-ES and all the rest of the 7-bit codes. Why is it so different for CP932? Considering that ISO646-DE puts a character on 0x5C that would be used as a part of words (unlike CP932), I'd suspect the situation is different. (It's one thing to not be able to use yen symbols in filenames in Windows; it's quite another to lose a character.) I don't know anything about their use; perhaps someone who does would enlighten us. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sun, Jan 13, 2002 at 08:26:57PM -0600, David Starner wrote: Is ISO-8859-1 not portable because you can't round trip CP932 through it? Why does CP932's lack of definition make Unicode unportable? People already pound Unicode for compromises with older systems; one more won't make people love it. Um, ISO-8859-1 is completely irrelevant. It doesn't claim to be a charset for Japanese users; Unicode does. If I take a CP932 document, convert it to Unicode, and then back to CP932, I'd better get exactly what I started with, or we don't have round-trip compatibility. That had better work across systems, too. This has nothing to do with any compromise on Unicode's part; it's merely a matter of defining a table and using it. (Incidentally, if programmers consistently distinguish CP932 from Shift-JIS, this isn't a problem for that particular codeset; since it's MS's charset, using MS's table is fine. This is a problem for all of the CJK encodings, not just CP932, however. In practice, many Japanese programmers may not know the difference and use a Shift-JIS translation. Also, making sure all of the original CCS mappings line up is probably more important, so if you go from CP932 to Unicode to EUC-JP to CP932 you end up with the same thing.) People are going to use whatever translation tables their system happens to use. Some systems are going to translate all strings to UTF-8 as standard practice - Java based systems, for example, and Gnome looks like it's heading that way. Others just aren't going to be interested in messing around with it - ANSIToUnicode, or iconv, or whatever the library call is already does it; why are they going to reinvent the wheel? The threat is that, if portable round-trip conversions aren't available, some users (programmers) who value round-trip compatibility more than Unicode will break spec and dump native charsets in the files. (This *did* happen with ID3 tags; this isn't a made-up threat.) 
That's probably the single worst case scenario, and must be avoided. What was your solution? I got that you expected systems to display the backslash as the yen sign under certain conditions. Right? At one point; that doesn't really do anything to help the conversion problems, though. I've yet to see a reasonable solution that does. Luckily, this doesn't affect Ogg, nor does it affect any file format or protocol that doesn't treat \ as special; map 0x5C to U+00A5 and be done with it. It doesn't matter who's fault it is Actually, it does. Part of Unicode's success is that it's a simpler It doesn't matter whose (oops) fault it is. Whether it was MS's fault, Unicode's fault, JIS X 0201 Roman's fault or Santa's fault, the end result is the same, and it still needs a solution. solution than dealing with dozens of charsets. If you import the bugs of dozens of charsets into Unicode, it loses part of that. Yes, Unicode should offer a unified translation table. Barring that, the tables available at http://www.w3.org/TR/japanese-xml/ could be referenced - accepting that some systems won't or can't follow the recommendations. But importing the quirks and problems of other charsets (separate from those inherent in the script) into Unicode won't help things in the long run. Like I said, I'd definitely suggest using an existing table, not making one up from scratch; that *would* exacerbate the problem. Thanks for the link, by the way. (Unfortunately, it leaves a lot of things undefined; it lists ambiguities but doesn't seem to suggest solutions.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 05:01:32PM +0900, Tomohiro KUBOTA wrote: I think the only solution I've seen that can *work* for everybody, and doesn't have any showstoppers (that I can see), is your own suggestion of giving up and making backslash and yen two glyphs of U+005C. I can see a few problems with that, but they're all within the bounds of compromise. (And the bounds for this particular problem are very large ...) Do you mean the usage of Variation Selector? I think it is an interesting suggestion and a good compromise. However, (1) the problem that Windows CP932 text files cannot be transcoded into Unicode automatically is not solved. As you said, doing this is nearly impossible; no matter how you mark it, no solution can do this, since you can't reliably tell which one a 0x5C is supposed to be. (2) I imagine the Variation Selector is always needed for U+005C as Yen Sign. I don't think Microsoft will accept this. I'm not sure there's anything they will ... Note that the existence of problems doesn't mean the idea is bad, because there cannot exist any ideas without problems. We have to seek a better compromise and a smaller nightmare, not to seek a perfect solution which cannot exist. Yep; as I said, the problems with this aren't showstoppers. (Well, the "Microsoft won't do it" may be, but that's likely for any such fix ...) By the way, you might want to update the links on http://www.debian.or.jp/~kubota/unicode-symbols.html. While the nature of the problems you list is different, with Unicode obsoleting their own tables, it's still very useful information. Yes, I think the mapping tables are useful and the Unicode Consortium should not obsolete them unless defining a new authorized mapping table, just as I wrote in the document. Yes, but they did obsolete them, which means your links to the tables are broken. I'm suggesting you update them, since the files are still available. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 03:13:00AM -0600, [EMAIL PROTECTED] wrote:
> It takes up space and developer time in the clients.  It's easy to end
> up with a spec that only gets partially implemented because it's so big.

If a player doesn't want to implement anything using the tags, it ignores
them--and if we didn't mention these tags at all, that's what they'd be
expected to do anyway.  (That's a reason I like this better than the
UTF8_LANG tag idea; it doesn't really add anything required at all, just
an "if you want to do this, then do it as Unicode defines".)

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 05:35:05PM -0600, [EMAIL PROTECTED] wrote:
> > If we tell CP932 users, "your 0x5C is a yen symbol, so translate it to
> > a Unicode yen symbol", what will they do?  Probably say "no, that'll
> > break almost all applications", just like our applications would break
> > if we changed ISO-8859-1 backslashes to Unicode yen symbols.
>
> You tell them that the Unicode backslash is a backslash, and the Unicode
> Yen is a Yen.  Let CP932 users make whatever arrangements they want -
> just please don't export the problems to general Unicode users.

Again, you give suggestions that, in practice, simply won't work.  That
kind of "let them deal with it, don't bother me" attitude is what Unicode
had (and has) to avoid in order to be universally accepted.

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 10:16:26PM -0600, [EMAIL PROTECTED] wrote:
> IMO, one of the big problems Unicode has is that it is a large, complex
> standard.  Telling everyone that the Backslash character may be the Yen
> character annoys all the people on Unix and Macintosh, who never had to
> deal with the problem, and even annoys the Windows people who never had
> to consciously deal with it.  "Bother everyone, because someone had some
> quirk in a system" has to be avoided, to make a reasonable,
> implementable standard.
>
> More cynically, CP932 users are already Unicode users; all new versions
> of Windows are Unicode based.  Whether they accept Unicode or not is

Except that the vast majority of Windows programs use the codepage
encoding for most things, *not* Unicode--even new applications, since
most still want compatibility with Win9x.  What an OS uses at a low level
and what applications use at a high level are two completely different
things.  If CP932 were likely to fade away reasonably soon, this wouldn't
be an issue at all; but it's going to be around for quite some time.

> irrelevant; if they leave Windows for another desktop system, they're
> going to another system that doesn't confuse the Yen sign and the
> Backslash.  For Unicode acceptance, they don't matter.

For Unicode acceptance, most Japanese users don't matter?  I certainly
hope the Unicode Consortium never takes that position.

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Fri, Jan 11, 2002 at 10:52:06PM +0900, Tomohiro KUBOTA wrote:
> > Fixing the source code at the source is a lot cleaner than inflicting
> > your fix on the rest of the world.  It's as bad as Oracle's attempt to
> > define a standard for its variant UTF-8 (CESU-8, which apparently
> > should be pronounced "sezyu" in English).  Their stated reason is the
> > same, that it's too much work to fix all of their databases, and their
> > cure is to lay even more work off on the rest of the world.
>
> First, this problem affects not only source code but also many texts of
> end users.  You can easily imagine that text files of end users contain
> many \ as a currency sign AND many \ as an element of file names.  Even
> if you succeed in persuading every Japanese Windows programmer to modify
> their source code, you won't succeed in persuading Japanese business
> users to modify files like accounts.xls.

A possibly more reasonable fix would be to change the fonts to the way
they're supposed to be, and reverse the problem: they get backslashes
instead of yen symbols for currency (and correctly get backslashes as
delimiters).  Everything still works, except they end up with the
problem, not the rest of the world.  Then change \ to the correct
Unicode yen symbol as appropriate (and most documents don't contain
directory delimiters).

The problem with this is that I suspect most Japanese wouldn't be
pleased to see backslashes instead of yen symbols.  (It's easy enough to
say "that's just as bad as the reverse; just do it", but that's not
going to get any of them to go along with it.)

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
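[Editor's aside, not part of the thread: the "change \ to the correct Unicode
yen symbol as appropriate" step above could only ever be a heuristic.  A
minimal sketch of one such pass follows; the function name and the
digit-based rule are illustrative assumptions of mine, not anything proposed
in the thread.]

```python
import re

def restore_yen(text: str) -> str:
    """Hypothetical cleanup pass after CP932 -> Unicode conversion.

    Assumes a U+005C backslash immediately followed by a digit was a
    currency amount in the original text (e.g. \\1500) and rewrites it
    to U+00A5 YEN SIGN.  Backslashes followed by anything else (path
    delimiters, escapes) are left untouched.
    """
    return re.sub(r"\\(?=\d)", "\u00a5", text)

print(restore_yen("C:\\temp\\readme.txt costs \\1500"))
```

Run on the sample string, the path delimiters survive and only the amount
gains a yen sign; real text would of course need a far more careful rule.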