Re: glibc wcwidth
On Fri, May 28, 2004 at 12:39:21AM -0400, srintuar wrote:
> I'm running with glibc-2.3.2, and the wcwidth system call seems to have

(same; Debian unstable)

> For example, in the locale ja_JP.utf8:
>
> 0x6BDF "毟" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x30E2 "モ" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x8AAD "読" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x307F "み" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x4EEE "仮" mk_wcwidth=2 wcwidth=-1 iswprint=no
> 0x540D "名" mk_wcwidth=2 wcwidth=-1 iswprint=no
>
> Does anyone know if wcwidth is/was broken in glibc?

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    printf("%lc: %i\n", 0x6bdf, wcwidth(0x6bdf));
    return 0;
}

prints "毟: 2" for me, in en_US.UTF-8 and ja_JP.UTF-8.

Did you forget to call setlocale()? If not, the data probably isn't loaded.
(Tip: always include your test program.)

--
Glenn Maynard

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: glibc wcwidth
On Fri, May 28, 2004 at 01:04:57AM -0400, srintuar wrote:
> > Did you forget to call setlocale()? If not, the data probably isn't
> > loaded. (Tip: always include your test program.)
>
> Yeah, that was it. Embarrassingly obvious in hindsight, I guess :)

Well, it's reasonable to forget: on modern systems, wchar_t doesn't change
across locales, widths don't change much, and wcwidth(3) doesn't mention
setlocale() at all on my system. I would have tried it first without,
myself, except that I knew it was needed for %lc.

--
Glenn Maynard
Re: iconv limitations
On Thu, Apr 08, 2004 at 04:17:41AM -0400, Michael B Allen wrote:
> > - knows that the input is zero terminated
>
> I have great difficulty in envisioning the opposite.

Binary file formats and network protocols have a lot of zero-terminated
strings in all sorts of encodings.

> > - does not know whether this is an 8-bit, 16-bit or 32-bit wide and
> >   aligned zero
>
> Again, for me it's rare that an application would not need to know what
> data it's dealing with. Applications do not exist in a vacuum. You have
> to do I/O, in which case the encoding of text is usually predefined or
> negotiated.

You do not always have the luxury of defining how text is represented
throughout the system. However, the case where 1: data is zero-terminated
*and* 2: you don't at least know whether you're dealing with an 8-, 16- or
32-bit encoding is, in my experience, non-existent. After all,
zero-terminated is meaningless unless you know what zero means--an 8-bit,
16-bit or 32-bit zero?

--
Glenn Maynard
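The last point can be made concrete with a small sketch (the function names are mine, purely illustrative): a "zero-terminated" scan has to be written per code-unit width, so the claim "the input is zero terminated" already presumes you know which width you have.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in code units up to the terminator -- one scanner per unit
 * width.  "Zero-terminated" is ambiguous until you pick one: the same
 * buffer can hold a terminator for one width and live data for another. */
size_t len8(const uint8_t *s)   { size_t n = 0; while (s[n]) n++; return n; }
size_t len16(const uint16_t *s) { size_t n = 0; while (s[n]) n++; return n; }
size_t len32(const uint32_t *s) { size_t n = 0; while (s[n]) n++; return n; }
```

For example, the bytes 41 00 end a string as far as len8 is concerned, but on a little-endian machine they are the single nonzero UTF-16 code unit U+0041, so len16 keeps scanning.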
Re: iconv limitations
On Thu, Apr 08, 2004 at 06:17:55PM -0400, Michael B Allen wrote:
> > > On the other hand, the iconv API is more flexible the way it is. It
> > > can handle strings with embedded zeroes,
> >
> > Now *that* is rare. I use std::string, which is 8-bit clean, and I
> > always like to make things remain that way unless I have a strong
> > reason not to.
>
> For that use iconv. ... Just because the conversion routine stops at a
> null terminator in the source doesn't mean it cannot operate on a string
> that is not null terminated. The encdec interface I described can
> convert non-null-terminated strings by limiting the number of bytes
> inspected in src using the sn parameter.

I'd suggest that one shouldn't have to use two notably different
interfaces just because your nul-termination needs are different, and that
"stop on nul" should be a conversion flag, as should other things that
some need and some don't want: replacing unconvertible characters (with
"?"), transliteration (to "a"), etc.

Better would be a low-level conversion interface that allows implementing
these things efficiently (which iconv doesn't), with iconv, encdec, etc.
interfaces being implemented on top of that. At the very least, this could
solve the problem of having to lug around large conversion tables when you
outgrow iconv().

> pages and MIME messages with bogus length parameters. The W3C claims
> all apps should use UTF-16 internally so if you want to use those in
> your

FWIW, I'd say that what the W3C claims applications should use internally
is no more interesting than what the FSF claims I should eat for
breakfast. :) (Not to mention that UTF-16 is such a horrible
recommendation to be making!)

--
Glenn Maynard
Re: W3C and UTF-16
On Thu, Apr 08, 2004 at 08:35:21PM -0400, Michael B Allen wrote:
> This probably states the definitive position for text handling:
>
> http://www.w3.org/TR/1999/WD-charmod-19991129/#Encodings
>
> But even though the encoding is not clearly stated as UTF-16, the
> Document Object Model (DOM), which is basically the document tree inside
> a web browser and key to all HTML and XML processing including
> JavaScript and XSLT processing, *requires* the encoding be UTF-16:
>
> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-C74D1578
>
> "The UTF-16 encoding was chosen because of its widespread industry
> practice."

Very funny; it was chosen since it's what Windows is stuck with.

That aside, all of the above is incorrect. You don't have to use DOM to
process HTML and XML. (Ultimately, if one *had* to use UTF-16 to process
HTML, then something along the line is horribly wrong: a language
specification can't legitimately make any requirements about transparent
implementation details.)

--
Glenn Maynard
Re: Perl unicode weirdness.
On Mon, Feb 02, 2004 at 12:09:07PM -0800, Larry Wall wrote:
> (To avoid confusion, we don't call our encoding UTF-8. We tend to say
> UTF-8 when we mean UTF-8, and utf8 when we mean the more general
> not-necessarily-Unicode encoding.)

This is an insane way to make a distinction, just as silly as trying to
differentiate between kilobits and kilobytes with "kb" and "kB". Changing
hyphens and case doesn't make distinctions or avoid confusion.

--
Glenn Maynard
Re: Perl unicode weirdness.
On Mon, Feb 02, 2004 at 12:21:40PM -0800, Larry Wall wrote:
> locales for everyone willy-nilly. So 5.8.1 backed off on that, with the
> result that you have to be a little more intentional about your input
> formats (or set the PERL_UNICODE environment variable).

What's the normal way to say "use the locale", like every other Unix
program that processes text? Setting PERL_UNICODE seems to make it
*always* use Unicode:

04:39pm [EMAIL PROTECTED]/5 [~] export LANG=en_US.ISO-8859-1
04:39pm [EMAIL PROTECTED]/5 [~] perl -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'
ú
ú
04:39pm [EMAIL PROTECTED]/5 [~] export PERL_UNICODE=1
04:39pm [EMAIL PROTECTED]/5 [~] perl -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'
ú

Also, with PERL_UNICODE=1 in en_US.UTF-8, entering "ú" outputs one byte,
0xfa (the codepoint), instead of 0xc3 0xba; why?

This is perl, v5.8.2 built for i386-linux-thread-multi

(It's a shame that Perl doesn't behave like everyone else and obey locale
settings correctly; I thought we were finally getting away from having to
tell each program individually to use UTF-8. I don't understand the logic
of "RedHat set the locale to UTF-8 prematurely, so Perl shouldn't obey the
locale".)

--
Glenn Maynard
Re: Perl unicode weirdness.
On Mon, Feb 02, 2004 at 04:49:22PM -0800, Larry Wall wrote:
> I believe use open ':locale' does that.

This seems to work:

perl -e 'use open ":locale";' -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'

(rather ugly for command-line one-liners)

> Well, hey, I'm the one who agreed with you in the first place and asked
> that 5.8.0 be done that way, but apparently the current maintainers of
> Perl 5 got an excessive amount of grief from people whose production
> programs broke under RedHat. And I've been so far off in Perl 6 La La
> Land (aka second system syndrome done right) that I let the Perl 5 folks
> make the decision to back that out. Oh well.

Please do get locale handling right this time around. :)

--
Glenn Maynard
Re: Perl unicode weirdness.
On Sat, Jan 31, 2004 at 02:07:07PM, Markus Kuhn wrote:
> Question: What is a quick way in Perl to get a regular expression that
> matches all Unicode characters in the range U0100..U10, in other
> words all non-ASCII Unicode characters?

It looks like /[\x{100}-\x{10}]/ should do that, but it doesn't work here.

$ perl -v
This is perl, v5.8.2 built for i386-linux-thread-multi

$ LANG=en_US.UTF-8 perl -ne 'if(/^(\x{61})$/) { print "$1\n"; }'
(in)  a
(out) a

$ perl -ne 'if(/^(\x{fa})$/) { print "$1\n"; }'
(in)  ú
(nothing out)

$ perl -ne 'if(/^(.)$/) { print "$1\n"; }'
(in)  a
(out) a
(in)  ú
(nothing out)

$ grep '^.$'
(in)  a
(out) a
(in)  ú
(out) ú

$ perl -ne 'if(/^(..)$/) { print "$1\n"; }'
(in)  ú
(out) ú

Why is . matching a single byte in perl, instead of a single codepoint?
Why isn't \x{fa} working?

--
Glenn Maynard
Re: Linux console UTF-8 by default
On Wed, Jan 14, 2004 at 08:31:16PM +0100, Brian Foster wrote:
> yes there is. if the illegal 5-byter has the first 4 bytes legal
> followed by an US-ASCII byte (which is what makes the 5-byter illegal),
> a parser that never considers sequences longer than 4 bytes will see an
> illegal sequence of 4 bytes and then a valid byte.

That would be correct: if a byte that was expected to be a continuation
byte is not, the UTF-8 string should be considered invalid, and the
character that was just read should start a new sequence. A 5-byte
sequence with the fifth byte invalid:

fb bf bf bf 41

should be parsed as an invalid sequence, followed by 0x41 ('A'). (That's
only sensible; on many media, lost bytes are much more common than bit
errors.)

Looking at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
3.3.4: if it was parsed as you suggest, then the ASCII quote after the
partial sequence would be considered part of the sequence, and not
displayed.

--
Glenn Maynard
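A hedged sketch of the resynchronization rule described above: when an expected continuation byte is missing, report one invalid sequence and restart scanning at the offending byte. This is my own illustration, not any particular library's decoder; it checks only lead-byte/continuation structure, not overlong forms or codepoint range.

```c
#include <stddef.h>

/* Expected sequence length for a UTF-8 lead byte, or 0 if the byte
 * cannot start a sequence (stray continuation byte, 0xfe, 0xff). */
static int seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c < 0xc0) return 0;   /* 10xxxxxx: continuation, not a lead */
    if (c < 0xe0) return 2;
    if (c < 0xf0) return 3;
    if (c < 0xf8) return 4;
    if (c < 0xfc) return 5;   /* always invalid in modern UTF-8 */
    if (c < 0xfe) return 6;   /* always invalid in modern UTF-8 */
    return 0;
}

/* Count characters and errors, treating each broken sequence as ONE
 * error and restarting at the first non-continuation byte. */
void count_utf8(const unsigned char *s, size_t n,
                size_t *chars, size_t *errors)
{
    size_t i = 0;
    *chars = *errors = 0;
    while (i < n) {
        int len = seq_len(s[i]);
        if (len == 0) { (*errors)++; i++; continue; }
        size_t j = i + 1;
        while (j < i + (size_t)len && j < n && (s[j] & 0xc0) == 0x80)
            j++;
        if (j == i + (size_t)len && len <= 4)
            (*chars)++;       /* complete, well-formed length */
        else
            (*errors)++;      /* truncated or over-long lead: resync at s[j] */
        i = j;
    }
}
```

On the example from the message, fb bf bf bf 41 counts as exactly one error followed by one character ('A'), matching the "invalid sequence, then 0x41" parse.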
Re: Unicode fonts on Debian
On Wed, Dec 17, 2003 at 08:24:35PM +0100, Jan Willem Stumpel wrote:
> > If you see <html lang=ja> then the page should use the font specified
> > by the Japanese setting by default. [..] Encoding is fairly irrelevant
> > to this, afaik
>
> http://ken2403king.kir.jp/form.htm
>
> That's a funny one, indeed. When I opened it in Mozilla it was displayed
> as gibberish; for a moment I thought it was Chinese (which I do not
> know). So, isn't the LANG attribute *more* irrelevant, because it did
> not help Mozilla (1.5a) to display the text correctly? A META tag
> attribute charset=shift-jis added to (a copy of) the page did. Doesn't
> that mean that encoding is more relevant than language?

Encoding is more relevant to being able to decode the text. It's not
relevant to deciding which font to use. (Well, if you don't have a
language tag, the encoding can be used to help guess it, but not if it's
UTF-8.) That's what he said, of course. :)

--
Glenn Maynard
Re: Perl in a UTF-8 locale
On Mon, Nov 10, 2003 at 05:20:59PM, Edmund GRIMLEY EVANS wrote:
> I have a problem here with Perl v5.8.0 on Red Hat 9. Simplified, my
> script looks like this:
>
> while (<>) { s/ĉ/cx/g; print; }
>
> This works with older versions of Perl, and it works in the C locale,
> but it doesn't work here in a UTF-8 locale. I tried putting stuff like
> "use bytes" or "no utf8" or "no locale", but it didn't help.

As long as the Perl script and the input are in the same encoding, it
works for me. (Debian unstable)

This is perl, v5.8.0 built for i386-linux-thread-multi

10:14am [EMAIL PROTECTED]/2 [~] cat testing.txt; file testing.txt
abĉd
testing.txt: UTF-8 Unicode text
10:17am [EMAIL PROTECTED]/2 [~] LANG=en_US.UTF-8 ./xxx.pl testing.txt
abcxd
10:14am [EMAIL PROTECTED]/2 [~] LANG=C ./xxx.pl testing.txt
abcxd
10:14am [EMAIL PROTECTED]/2 [~] LANG=en_US.ISO-8859-3 ./xxx.pl testing.txt
abcxd

ISO-8859-3:

10:17am [EMAIL PROTECTED]/2 [~] LANG=en_US.UTF-8 ./xxx3.pl testing-3.txt
abcxd
10:18am [EMAIL PROTECTED]/2 [~] LANG=C ./xxx3.pl testing-3.txt
abcxd
10:18am [EMAIL PROTECTED]/2 [~] LANG=en_US.ISO-8859-3 ./xxx3.pl testing-3.txt
abcxd

(Of course, no locale works if I mix encodings.)

> exec(/path/to/this/script, @ARGV); }

.)??D??-|??{??v??W?z[

Hmm. What's this garbage at the end of the message? Oh. Poking at the raw
message body, it's the stupid footer that the mailing list blindly spams
on every message (despite this being a base64 message).

--
Glenn Maynard
Re: grep is horribly slow in UTF-8 locales
On Fri, Nov 07, 2003 at 12:52:44PM +0000, Markus Kuhn wrote:
> $ grep --version
> grep (GNU grep) 2.5.1
>
> $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
> Command exited with non-zero status 1
> 6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (157major+34minor)pagefaults 0swaps
>
> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (125major+24minor)pagefaults 0swaps

FYI:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=206470
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=181378

I've noticed this, too. I often use LANG=C for grepping due to this.
Someone mentioned --with-included-regex, but that's not good enough (a 10%
gain for me).

--
Glenn Maynard
Re: grep is horribly slow in UTF-8 locales
On Fri, Nov 07, 2003 at 04:49:58PM +0100, Danilo Segan wrote:
> This doesn't happen with:
>
> $ grep --version
> grep (GNU grep) 2.4.2

This was probably before full multibyte support was added to grep; the
issue here specifically only happens in multibyte encodings. (My grep is
slow in en_US.UTF-8, and fast in en_US.ISO-8859-1.) Try:

# echo tést | grep 't.st'
tést
# echo tést | grep 't[aé]st'
tést

> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.04user 0.06system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (118major+25minor)pagefaults 0swaps
>
> Last example shows that CPU usage is not really any kind of rule to base
> conclusions on (sr_CS.UTF-8 is my everyday locale, and I would really
> notice if grep had any problems with it).

The field you should be reading is "user". "CPU" is roughly
(user+system)/elapsed, and isn't very relevant here.

--
Glenn Maynard
Re: FYI: Some links about UTF-16
On Tue, Jul 08, 2003 at 02:03:14PM +0800, Wu Yongwei wrote:
> Don't you know what respect is? I do not call other people silly, and I

(reply made in private)

--
Glenn Maynard
Re: FYI: Some links about UTF-16
On Tue, Jul 08, 2003 at 11:22:19AM +0800, Wu Yongwei wrote:
> Is it true that "Almost all modern software that supports Unicode,
> especially software that supports it well, does so using 16-bit Unicode
> internally: Windows and all Microsoft applications (Office etc.), Java,
> MacOS X and its applications, ECMAScript/JavaScript/JScript, Python,
> Rosette, ICU, C#, XML DOM, KDE/Qt, Opera, Mozilla/NetScape,
> OpenOffice/StarOffice, ..." ?

Blatantly false. Lots of modern software uses UTF-8 internally.

--
Glenn Maynard
Re: FYI: Some links about UTF-16
On Tue, Jul 08, 2003 at 01:29:04PM +0800, Wu Yongwei wrote:
> > > Is it true that "Almost all modern software that supports Unicode,
> > > especially software that supports it well, does so using 16-bit
> > > Unicode internally: Windows and all Microsoft applications (Office
> > > etc.), Java, MacOS X and its applications,
> > > ECMAScript/JavaScript/JScript, Python, Rosette, ICU, C#, XML DOM,
> > > KDE/Qt, Opera, Mozilla/NetScape, OpenOffice/StarOffice, ..." ?
> >
> > Blatantly false. Lots of modern software uses UTF-8 internally.
>
> Name them.

Why? There are so many (such as the editor I'm typing in right now) that
you sound rather silly asking me to name examples. Sorry; if you don't
know even this, I'm not interested in having this conversation. Do some
research.

--
Glenn Maynard
Re: Wide character APIs
On Thu, Jul 03, 2003 at 09:03:40PM +0200, Bruno Haible wrote:
> But no one answered my original question; why are the format specifiers
> for wide character functions different?

Here's the answer: so that a given format specifier corresponds to a given
argument type.

Format specifier    Argument type
%d                  int
%s                  char *
%ls                 wchar_t *
%c                  int (promoted from char)
%lc                 wint_t (promoted from wchar_t)

Changing between char and wchar_t at compile time with macros (TCHAR) is a
hideous Windows hack. If you really want to generalize it, you could fork
printf to have a TCHAR type, e.g.:

const TCHAR *t = _T("abc");
printf("%t, %t", t, _T("def"));

(%t probably has some meaning in printf that I don't know off the top of
my head; I'm not suggesting you actually do this.)

This type switching is just a gross migration scheme, for programmers who
want to distribute both Unicode and ANSI versions of their programs (for
Win9x compatibility). I doubt this was the intent with the C wide
functions having similar parameters; that's just consistency.

--
Glenn Maynard
Re: gtk2
On Wed, Apr 02, 2003 at 06:17:42PM +0900, Tomohiro KUBOTA wrote:
> And, do you say that non-European-language-speaking people don't need to
> have choices? For example, there are people who like Eterm, Aterm,
> Wterm, Rxvt, Xterm, or so on. (Note that all of them support XIM.) Is it
> a privilege of European-language-speaking people to say such
> preferences? It is what I wanted to call ethno-centrism.

People write code to do what *they* need; I guess that's "self-centrism".
(After all, most of this is written by people in their spare time.)

I suppose the problem you're really complaining about is a likely typical
response by writers of terminal emulators: "why should we support it; use
xterm if you want that". You'd probably get a similar response if you
tried to get Eterm's silly eyecandy bloat features added to Xterm. There's
a difference, of course--handling Unicode in all terminal emulators is
actually a good idea (adding bloat to Xterm is not :); i18n just needs to
be more widely understood as a fundamentally important feature. That's
happening steadily. Nobody's saying that you shouldn't have choices, of
course.

On the topic of toolkits: libraries like GTK and Qt absolutely should be
able to automatically handle as much i18n (IM, font rendering, widget
repositioning) as possible. Line input should automatically hint the IM
for clean over-the-spot rendering, and whatever else is useful. They just
can't be required; we must be able to handle input methods anywhere
(without having to learn a complicated library).

(I'm not sure what this subthread is really arguing about, though, since I
don't see anyone disagreeing on this. :)

--
Glenn Maynard
Re: gtk2
On Tue, Apr 01, 2003 at 10:02:36PM -0500, srintuar26 wrote:
> gnome-terminal and multi-gnome-terminal are fairly lightweight.

Package: gnome-terminal
Depends: bonobo-activation (= 1:2.2.0), libart-2.0-2 (= 2.3.8),
libatk1.0-0 (= 1.2.2), libaudiofile0 (= 0.2.3-4), libbonobo-activation4
(= 1:2.2.0), libbonobo2-0 (= 2.2.0), libbonoboui2-0 (= 2.2.0), libc6
(= 2.3.1-1), libesd0 (= 0.2.23-1) | libesd-alsa0 (= 0.2.23-1),
libfontconfig1 (= 2.1), libfreetype6 (= 2.1.3-5), libgconf2-4 (= 2.2.0),
libgcrypt1 ( 1.1.11-0), libglade2-0 (= 2.0.0), libglib2.0-0 (= 2.2.1),
libgnome2-0 (= 2.1.90), libgnomecanvas2-0 (= 2.1.90), libgnomeui-0
(= 2.1.90), libgnomevfs2-0 (= 2.2.0), libgnutls5 (= 0.8.0-1), libgtk2.0-0
(= 2.2.0), libjpeg62, liblinc1 (= 1:1.0.0), libncurses5 (= 5.3.20021109-1),
liborbit2 (= 1:2.6.0), libpango1.0-0 (= 1.2.1), libpopt0 (= 1.6.4),
libstartup-notification0, libtasn1-0 (= 0.1.1-2), libvte4 (= 0.10.10),
libxft2 (= 2.1), libxml2 (= 2.5.0-1), xlibs ( 4.1.0), xlibs ( 4.2.0),
zlib1g (= 1:1.1.4), scrollkeeper (= 0.3.8), yelp

Of course, much of this is optional, but nothing about any GTK app is
"lightweight" unless you happen to be on a GTK system.

--
Glenn Maynard
Re: supporting XIM
On Mon, Mar 31, 2003 at 08:19:49AM +0900, Tomohiro KUBOTA wrote:
> I think there are no people who explicitly think so. However, what do
> you think if a developer thinks, for example, that italic character
> support for 8-bit characters is very important, while he/she doesn't
> understand the importance of multibyte support?

I believe this is perfectly understandable and normal, even though it's
very annoying to Japanese users. A side-effect of open source is people
prioritizing features that they care about at the expense of those they
don't. English-speaking programmers are bound to care more about features
for English than features for other languages--just as programmers in X
care more about X support than Windows support (which is very annoying to
Windows users, who often end up with old, buggy ports of X software when
they get them at all).

The only things that can be done about this are what's being done and
discussed: making it easier (so the time commitment is reduced) and
submitting patches. Actually, there's one more: give them a reason to
care. I wonder if there's any way to sneak a few double-width characters
into common use among English-speaking programmers. :)

This is actually one advantage of NFD: it makes combining support much
more important. (At least, it's an advantage from this perspective; those
who would have to implement combining who wouldn't otherwise probably
wouldn't see it that way.)

By the way, I just gave lv a try: apt-get installed it, used it on a UTF-8
text file containing Japanese, and I'm seeing garbage. It looks like it's
stripping off the high bits of each byte and printing it as ASCII. I had
to play around with switches to get it to display; apparently it ignores
the locale. Very poor. less, on the other hand, displays it without having
to play games. It has some problems with double-width characters,
unfortunately.

--
Glenn Maynard
Re: supporting XIM
On Fri, Mar 28, 2003 at 11:32:21AM -0800, H. Peter Anvin wrote:
> WHOA... that's a pretty darn strong statement. In particular, that would
> seem to request internationalization of kernel (or other debugging or
> logging) messages, which is probably a completely unrealistic goal. For
> user-interface issues, I would agree with you, however.

I think handling i18n in cooked input mode is realistic and important.
(This is both UI and kernel.)

> When it comes to (a), it pretty much means that the complexity needs to
> be hidden from the application programmer.

Terminal applications, toolkits, and perhaps libraries like readline need
to support this, but applications shouldn't need to be affected beyond a
few basic guidelines, such as "don't assume byte == character".

> Getting UTF-8 universally deployed will be a huge part of this, because
> it means that anything other than 7-bit ASCII will have to take this
> into consideration.

Chicken and egg. :)

> > Of course several Japanese companies are competing in the Input Method
> > area on Windows. These companies are researching better input
> > methods -- larger and better-tuned dictionaries with newly coined
> > words and phrases, better grammatical and semantic analyzers, and so
> > on. I imagine this area is one of the areas where Open Source people
> > cannot compete with commercial software by full-time developer teams.
>
> This seems to call for a plugin architecture.

More than anything, I suspect we need *standards*. And, in this case,
non-GPL licensing (if being able to use proprietary input method plugins
is desired) ...

--
Glenn Maynard
Re: supporting XIM
On Sat, Mar 29, 2003 at 01:33:02AM +0900, Tomohiro KUBOTA wrote:
> Another point: I want to purge all non-internationalized software.
> Today, internationalization (such as Japanese character support) is
> regarded as a special feature. However, I think that not supporting
> internationalization should be regarded as a bug which is as severe as
> "racist software". However, GTK is a relatively heavy toolkit, and
> developers who want to write lightweight software won't use it.

Stop using the word "racist". It's like saying "if you don't support a
feature I want, you're supporting terrorism"; it makes people groan and
stop paying attention. It's inflammatory, doesn't help your case at all,
and injures your credibility.

Not being racist is free: it takes no time, doesn't take any new code or
testing, has no support costs, and doesn't require people to learn new
APIs. If i18n ever becomes implicit, such that supporting i18n is as easy
and effortless as not being racist, and not supporting i18n takes a
deliberate act by the programmer, then the word "racist" might have some
relevance (but it'd still be inflammatory and cause groaning and
ignoring).

I'm aware that English isn't your native language, but I'm pretty sure you
know how strong this comparison is.

--
Glenn Maynard
Re: supporting XIM
On Wed, Mar 26, 2003 at 03:38:54PM -0500, Maiorana, Jason wrote:
> > You can, you just select which keyboard/input method you'd like to use
> > from the keyboard menu (which lists all the installed/enabled ones)!
> > But wait... That's Windows... And Mac...
>
> No you can't. I have access to a Windows machine, with Global IME
> installed. The keyboard is rearranged into Dvorak layout, and all other
> input methods aside from English fail.

Yes, you can; I did it to type this: 漢字. Nobody's claiming it's perfect
or bug-free, but it's indisputably there and useful to many people who
need to input text in multiple languages. Imperfection is not
nonexistence.

> The Windows model is not perfect, imo. (Beyond-BMP codepoints may break
> many applications, etc.)

I don't see how Windows's use of UTF-16 is relevant to the discussion (the
ability to change keyboard mappings on the fly). The only point was that
it's taking X a while to do things that Windows has been doing gracefully
(relatively speaking) since at least Win2K.

--
Glenn Maynard
Re: supporting XIM
On Tue, Mar 25, 2003 at 05:12:12PM -0800, H. Peter Anvin wrote:
> > However, locale-dependence itself is not a bad thing. For example,
> > XCIN supports both traditional and simplified Chinese depending on
> > locale. We can imagine an improvement where the default mode would be
> > determined by locale, even when run-time switching between traditional
> > and simplified Chinese is supported.
>
> Indeed. It would be nice, at some point in the future, to be able to
> edit, for example, a Swedish-language document and suddenly decide I
> need to insert some Japanese text, and call up the appropriate input
> method without having to have anticipated this need (other than having
> it installed, of course).

As a person who's only done IM-related stuff in Windows, this seems
fundamental. I simply hit lcontrol+lshift to switch between English,
Japanese, Korean, and Finnish (which I seem to have accidentally
installed) input systems. X is miles behind in this, unfortunately.

--
Glenn Maynard
Re: FYI: lamerpad
On Wed, Mar 12, 2003 at 10:01:03AM +0900, Tomohiro KUBOTA wrote:
> Lamerpad, http://www.debian.org.hk/~ypwong/lamerpad.html, seems to be a
> good way for developers who don't know CJK languages to test whether
> their own software supports Kanji input or not.

A partial test, anyway. The IMs I've used need to know the cursor
position, to render the current composition, to know where to put
selection dialogs, and so on. I'd imagine that this type of program
wouldn't test that very well. (Unless it shows the best match as a
composition string, but I can't run it to see if it does that.)

> Of course, adoption of Unicode alone cannot make your software support
> CJK languages (more effort is needed). I hope Lamerpad will help test
> software and will lead to more software supporting CJK languages.

What more is needed? Combining (Korean) and double-width characters (in
the case of console apps) are two things that need special attention, but
they're both just parts of supporting Unicode. Other than that, and input
method support (which is unreasonably difficult at the moment--based on
conversations on this list--except in Windows, where it's merely
annoying), what more is needed in the general case?

--
Glenn Maynard
Re: FYI: lamerpad
On Wed, Mar 12, 2003 at 08:02:59AM +0100, Janusz S. Bień wrote:
> Sorry, but I've no time to look into the problem...

A 0.1 program whose upstream author won't look into problems is of limited
value, unfortunately. :)

--
Glenn Maynard
Re: mp3-tags, zip-archives, tool to convert filenames to UTF
On Tue, Feb 18, 2003 at 09:41:48AM +0100, Nikolai Prokoschenko wrote:
> > - mutt will work but you have to compile it against ncursesw (that
> >   means getting the ncurses 5.3 source and recompiling also)
>
> mutt from Debian doesn't have any problems at all!

Debian has a mutt-utf8 package that's compiled against ncursesw.

--
Glenn Maynard
Re: mutt and ncursesw
On Tue, Feb 18, 2003 at 01:50:58PM +0200, Jari P.T. Alhonen wrote:
> > Last time I checked, mutt compiled against the ordinary ncurses (as
> > opposed to ncursesw) does NOT work for characters with East Asian
> > width of 'full'. You may get the impression that it works because you
> > use it only for characters with East Asian width of 'half'. For CJK,
> > compiling mutt against ncursesw is a must.
>
> mutt-utf8 seems to contain the mutt binary and nothing else (apart from
> a changelog).

Of course; mutt-utf8 in Debian is a diversion. And it certainly does work
with CJK (because it's compiled against ncursesw).

Why mutt-utf8 is a separate package instead of the default in Debian, I
have no idea. It used to make sense, when mutt-utf8 was compiled against a
buggy Slang hack, but that's no longer the case; it's now just as
functional as the main binary. (I don't feel like spending the time trying
to convince the Mutt maintainer to change this, though.)

--
Glenn Maynard
Re: RE: filename and normalization (was gcc identifiers)
On Thu, Dec 05, 2002 at 11:02:17AM -0500, Maiorana, Jason wrote:
> Also, imagine the extra load on your system if, when you do:
>
> cat bigfiles | b | c | d | less
>
> the text is being normalized back and forth at every step of the
> pipeline.

That has nothing to do with the filesystem; pipes are 8-bit clean for
completely different reasons (you can pipe binary data through them).
(Not that I disagree; this is just a bad example.)

--
Glenn Maynard
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 04:03:38PM +0100, Keld Jørn Simonsen wrote: Well, users should not expect these two sequences to be identical, they are not, according to ISO/IEC 10646. Users expect that Ö == Ö, and don't know or care about Unicode, and that's reasonable. Programmers should care, of course, but programmers aren't the only ones who use filenames, and this problem, as Henry pointed out, is a more general one. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization
On Wed, Dec 04, 2002 at 12:49:15PM -0500, Maiorana, Jason wrote: As a side-note, I copy/pasted a command line flag from a RH8.0 manpage back into the console, and tried to execute the command. It failed, and gave me usage. The reason, I discovered, is that the manpage was not using a regular ASCII '-', but instead one of the HYPHEN or EM DASH things (which is why I HATE them). I think they're perfectly useful, including in manpages, but I agree they shouldn't be used in syntax displays. (Unless the application can actually handle them; which would, in fact, be neat in a novel way, though I think that would ultimately be a bad idea. :) Regardless, I don't think the O/S or filesystem code should enforce, require, or even know about normalization forms. Instead, a well-designed user interface should simply show non-normalized, over-coded, or invalid UTF-8 sequences as mojibake, in some standard way (such as big rectangles), such that it can still be copy/pasted and worked with, but not easily confused with proper stuff. The input method would always generate normal UTF-8, naturally. It's not clear whose responsibility this is. There are quite a few things that are invalid, and they're not easy to handle at every layer. For example, suppose you have a filename that begins with a combining character. If it's the terminal's job to deal with weird output, it can't do that here; if you run 'ls', the combining character will just get attached to the whitespace preceding the filename. ls has to handle it. It's probably the terminal's job only so far as always sending NFC when the user types (which seems to be the de facto standard, at least); beyond that it seems to be the job of tools. Pasting is a little fuzzier. What if I'm in Windows, and some other app I'm using uses NFD (for some, possibly valid, reason)? I don't want my terminal pasting text from that app in NFD (since it'll result in filenames on my system in NFD, for example). 
If the shell interface is designed to allow me to do everything in NFC (eg. by having ls and friends escape anything that's not in NFC, along with all of the other things it should be escaping), then it shouldn't be a problem to have terminals normalize output text in NFC. I think it's important that, in the end, I'm always consistently able to reference any filename displayed by ls via copy-and-paste; otherwise I'll have to go to annoying lengths to, for example, delete a file with a bad filename. Note that when I'm talking about ls escaping text, I mean that it should have a new flag indicating that it's allowed to use \u and \U escapes and that it should use those--and \x--for escaping UTF-8-related things; this would combine with whatever --quoting-style is in use, and might be good to default to being on. Things that would be useful to escape are invalid/overlong UTF-8 sequences, using \x; combining characters at the beginning of filenames, too many combining characters--configurable; anything of width zero that isn't a combining character (control characters); and possibly anything that isn't in NFC (all with \u and \U). (But, of course, none of this should be enforced by the kernel or libc; I think everyone is in agreement here.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
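The escaping behavior sketched in the post above is easy to prototype. Here's a minimal illustration (not ls's actual code; the function names are mine, and the validity rules follow RFC 3629: no overlongs, no surrogates, nothing above U+10FFFF): valid UTF-8 sequences pass through untouched, and every other byte becomes a \xNN escape.

```c
#include <stdio.h>
#include <string.h>

/* Return the length (1-4) of a valid UTF-8 sequence starting at s,
 * or 0 if the sequence is invalid, overlong, or truncated. */
static int utf8_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    int len, i;
    unsigned int cp;
    if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                   /* stray continuation or 0xF8+ byte */
    if (n < (size_t)len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* reject overlong encodings, surrogates, and out-of-range values */
    if (len == 2 && cp < 0x80) return 0;
    if (len == 3 && cp < 0x800) return 0;
    if (len == 4 && cp < 0x10000) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;
    return len;
}

/* Copy src into dst, replacing each byte that isn't part of a valid
 * UTF-8 sequence with a \xNN escape.  dst must hold at least
 * 4 * strlen(src) + 1 bytes. */
void escape_invalid_utf8(char *dst, const char *src)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = strlen(src);
    while (n > 0) {
        int len = utf8_seq_len(s, n);
        if (len > 0) {
            memcpy(dst, s, len);
            dst += len; s += len; n -= len;
        } else {
            dst += sprintf(dst, "\\x%02X", *s);
            s++; n--;
        }
    }
    *dst = '\0';
}
```

A real implementation would layer the \u/\U escaping for combining characters and non-NFC text on top of this, but the invalid-byte case is the core of it.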
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 08:41:42PM +0100, Keld Jørn Simonsen wrote: Users expect that Ö == Ö, and don't know or care about Unicode, and that's reasonable. Well, it is not equal if you code it differently. One is a letter and the other is a letter with some special combining accent. They do not compare equal either, at the most detailed level according to ISO/IEC 14651, the ISO sorting standard. This isn't something users care about, and it's not something users (including clueful Unix users) should ever have to care about. The only people who should ever have to care about this are programmers. It's perfectly reasonable for a user to expect that, if he creates a file with Ö in it on a Unix system from a Windows terminal and then tries to cat it from a Mac terminal, it'll work, even if the filename is pasted from another Mac program that happens to use NFD. The terminal should renormalize everything (including pastes) to NFC. Of course, it's reasonable for this to be an option, but NFC seems to be a sensible default, at least when connecting to Unix systems. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization
On Wed, Dec 04, 2002 at 03:11:01PM -0500, Henry Spencer wrote: When --help is printed, I want to see two hyphens, not a dash. You probably want to see two minus signs, not two hyphens... Err. Right. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 03:17:24PM -0500, Maiorana, Jason wrote: The terminal should renormalize everything (including pastes) to NFC. Then how will I paste in some wacky invalid filename into my terminal in order to, say, rm it? Like I was saying, pastes should not be normalized. I already explained this at length: ls (and other tools) should escape wacky filenames using \x, \u and \U. This is nothing new; ls already escapes things, so it's just an extension on existing functionality. Even if you don't normalize, unless ls does some quoting work, you're not going to be able to paste all strange filenames. For example, as I mentioned, combining characters at the start of a filename. Also, it's very difficult for terminals to handle this consistently. Is an invalid UTF-8 string one column wide? One per byte? There are definitions (eg. Markus has a page on it), but it's difficult enough to get width right without having to deal with this. Also, it's more difficult to have a terminal implementation that can remember invalid sequences on-screen to be able to copy them later; and it'd need to be handled in terminal layers, like Screen, and mbswidth() identically, or it'd become desynced. In practice, since this (precise displaying of invalid UTF-8 sequences) is a relatively obscure issue, this will never happen, and the result would be broken filenames causing screen desyncs and not easily being referenced (eg. to rm). Normalization form D has some serious drawbacks: if you were to try to implement, say, Vietnamese using only composing characters, it would look horrible. The appearance, position, shape, and size of the combining accents depends on which letter they are being combined with, as well as which other diacritics are being combined with that same letter. That's entirely a rendering implementation detail; it should be easy for the terminal's font renderer to normalize internally in whatever way is most appropriate. 
What scripts do you think NFD would be more appropriate than NFC for? NFC seems to be fairly (de-facto) standard in Unix. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 12:33:59PM -0800, McDonald, Ira wrote: Actually, I rarely link with just one library. And if the two (or more) different libraries had their identifiers normalized into different forms, then no solution will be possible. And since all these different codepoint representations of the same character look alike, any but the most sophisticated programmers will be defeated and just unable to link those two libraries with the same program. That aside, I use NFC, and I certainly don't want to have to switch my environment to NFD just to use a library! My environment shouldn't be dictated by the environment of some random library programmer. (That would have to include my terminal, so I'm able to type in identifiers for gdb, and so on.) In practice, my terminal isn't even capable of sending NFD, and I like it that way; it does help to ensure people who don't know what they're doing don't accidentally switch to NFD and start polluting filesystems with NFD filenames. (The situation would be uglier if people were actually doing that.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization (was gcc identifiers)
On Wed, Dec 04, 2002 at 04:11:46PM -0500, Maiorana, Jason wrote: I meant that rather than invisibly normalizing the paste, it would do what you say and print the escape sequences out. If it were to normalize on paste, it could be hiding problems. But other apps on the system might be using NFD. On those systems (eg Macs), that might be normal, and the text needs to be changed to NFC somewhere between being copied and being sent to the remote machine. Likewise, on those systems it might be appropriate for an NFC terminal to change copied text (eg. terminal - clipboard) to NFD, if that system expects NFD. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: English Unicode keyboards?
On Sun, Nov 10, 2002 at 09:08:17PM -0500, Henry Spencer wrote: I think that's pushing it a bit far; adoption of such a thing will be far more likely if space (which *is* the single most common character in most forms of text) remains under the right thumb. But it doesn't need to be particularly wide -- examine most well-used keyboards and you'll find a relatively narrow shiny spot on the space bar. Actually, mine has a dent ... -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Please do not use en_US.UTF-8 outside the US
On Sun, Oct 20, 2002 at 12:06:32AM +0200, Antoine Leca wrote: What's being suggested is that locales be generated per-region/language; eg. tell the system to generate tr_TR, and then be able to use all relevant encodings (ISO-8859-9 and UTF-8 and whatever else is convertible). Case mappings, collation rules, translation text and so on can be stored in Unicode and converted at runtime, probably still caching common encodings for speed. Seems like a nice, but naive, idea. If such a simple, generic solution were possible, I'd imagine it would have been done already. Windows NT did that in 1993. Exactly what you describe. Sorry. Sorry? I don't even see how this is relevant. NT and POSIX i18n are completely different, so just because NT can do it doesn't mean it's practical here. If you have a point, please say it; I can't even tell whether you agree with the idea (which is not my own) or not. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Please do not use en_US.UTF-8 outside the US
On Thu, Oct 17, 2002 at 07:06:52PM -0400, [EMAIL PROTECTED] wrote: I suppose one reason this isn't done is because locale generation does take quite a while (maybe 20 seconds per locale on my system). There are probably other, less obvious reasons this isn't done, but I don't know them. One such might be http://bugs.debian.org/99623 ; but that doesn't seem to prevent generating UTF-8 most of the time. It would be yet simpler to eliminate all non-utf-8 locales. It would be simpler, but since the vast majority of the world is still using legacy locales, it's irrelevant. Come back in 5-10 years, maybe; I'm talking about things that can be done today. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: SPAM
On Tue, Oct 15, 2002 at 10:08:41PM -0400, [EMAIL PROTECTED] wrote: This is one of my more favored lists, but it is a major spam re-forwarder. Can anyone in the world set it to subscribers-only-posting? (or actually filter) You can filter yourself, too, you know--SpamAssassin, for example. Subscriber-only posting is overly restrictive, since threads occasionally get crossposted to multiple relevant lists, and people posting are often in only one of them. Preventing that in the name of a little less spam is a poor trade. Besides, I only see a couple spams a day on this list at most. That's minuscule. By the way, if you have a name, you might want to set it. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Lazy man's UTF8
On Thu, Sep 19, 2002 at 03:03:30AM -0400, Michael B. Allen wrote: Is libiconv capable of doing wchar_t, UCS-4, and UTF-8 operations on Windows? I couldn't even build it (although I didn't try very hard). It should be able to do any conversion it can in *nix ... Giving wchar_t to iconv isn't portable, though, is it? (It's a bit of a hack, too, but a bearable one.) Hmm. Another thing, while we're on iconv: How do you get the number of non-reversible conversions when -1/E2BIG is returned? It seems that converting blocks into a small output buffer (eg. taking advantage of E2BIG) means that count is lost. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
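To make the E2BIG question above concrete, here is a sketch of the situation (the encoding pair, function name, and buffer size are arbitrary illustration choices): iconv(3) returns the number of non-reversible conversions it performed, but a call that fills the output buffer returns (size_t)-1 with errno == E2BIG, so the count for that particular call is unavailable.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Convert 'in' from UTF-8 through a deliberately tiny output buffer,
 * accumulating iconv()'s return value (the non-reversible conversion
 * count) where it's visible.  Calls that fail with E2BIG return
 * (size_t)-1, so their counts are lost -- which is exactly the
 * problem being asked about.  Returns the partial total, or -1 on a
 * real conversion error. */
long convert_counting(const char *in)
{
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) return -1;

    char *inp = (char *)in;
    size_t inleft = strlen(in);
    long total = 0;

    while (inleft > 0) {
        char outbuf[4];              /* tiny on purpose, to force E2BIG */
        char *outp = outbuf;
        size_t outleft = sizeof outbuf;
        size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
        if (r != (size_t)-1) {
            total += (long)r;        /* count visible only on success */
        } else if (errno == E2BIG) {
            if (outp == outbuf)      /* no progress at all: give up */
                break;
            /* iconv advanced inp past what it converted; loop around
             * with a fresh output buffer, but this call's count of
             * non-reversible conversions is gone */
        } else {
            iconv_close(cd);
            return -1;
        }
    }
    iconv_close(cd);
    return total;
}
```

(Stateless target encodings need no final flush call; a general wrapper would also do the trailing iconv(cd, NULL, NULL, ...).)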
Re: Lazy man's UTF8
On Thu, Sep 19, 2002 at 04:02:07AM -0400, Michael B. Allen wrote: Not if you're importing/exporting. But you might very well use it internally and if someone wanted to run that app on Windows too that's the kind of thing I would think libiconv should be good for so I was surprised I couldn't build it with full support. No, I'm referring to passing wchar_t as an iconv parameter; when was this added to iconv? I thought it was relatively recently. (It's a bit of a hack, too, but a bearable one.) Are you talking about Bruno's implementation? I have wondered if wchar_t could just be treated like any other encoding. It may not have a rigid definition but it wasn't clear to me why those wchar_t clauses in the main conversion loops really had to be there. The iconv interface is for char*'s; passing wchar_t* through it is a hack of forced casting, and you have to deal with adjusting buffer sizes for byte counts. It's easily fixed with wrappers, though. Yikes! You just left my sphere of knowledge :-) That was to anyone on the list who can answer it. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
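The "hack of forced casting" described above looks roughly like this (a sketch: the "WCHAR_T" pseudo-encoding is a glibc extension and not portable, the function name is mine, and a real wrapper would grow the output buffer on E2BIG instead of failing):

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a wide string to the given encoding via iconv's "WCHAR_T"
 * pseudo-encoding.  iconv's interface traffics in char*, so the
 * wchar_t* is forced through a cast, and the input size must be
 * given in bytes, not characters.  Returns a malloc'd NUL-terminated
 * string, or NULL on error. */
char *wcs_to_encoding(const wchar_t *ws, const char *tocode)
{
    iconv_t cd = iconv_open(tocode, "WCHAR_T");
    if (cd == (iconv_t)-1) return NULL;

    char *inp = (char *)ws;                        /* the forced cast */
    size_t inleft = wcslen(ws) * sizeof(wchar_t);  /* bytes, not chars */

    size_t outsize = inleft * 2 + 8;   /* generous for most encodings */
    char *out = malloc(outsize);
    char *outp = out;
    size_t outleft = outsize - 1;

    if (out == NULL ||
        iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        free(out);                /* a real wrapper would retry E2BIG */
        iconv_close(cd);
        return NULL;
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```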
Re: Linux and UTF8 filenames
On Thu, Sep 19, 2002 at 09:57:43AM +0200, Radovan Garabik wrote: There is a concept of filesystem encoding (NLS), but it requires root assistance, and does not solve the problem of two users having different locales, accessing the same filesystem - considering this situation, the only possible solution is to have filenames in UTF-8, and applications (such as ls) aware of it. No, the only possible solution is for all terminals UTF-8, too, and ls continues printing filenames as it is now. If I have a file héllo in UTF-8, and my terminal is ISO-8859-1, and ls helpfully recodes that for me, and I type cat héllo, cat doesn't know to recode the filename, so it doesn't work. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Linux and UTF8 filenames
On Fri, Sep 20, 2002 at 01:31:21AM +0200, Pablo Saratxaga wrote: Now, if what you meant, was the ability to mount an ext2 partition and tell it to convert its filenames using the kernel nls modules; yes, it could be done. But it would be somewhat tricky, since filenames need to be 8-bit clean except for / and NULL. It's a bag of worms with very little value ... -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Lazy man's UTF8
On Wed, Sep 18, 2002 at 10:14:35PM +0100, Robert de Bath wrote: iconv() is _fairly_ easy to use, the problem isn't that it's difficult, just that there's a lot you have to remember to do for a function that appears (at first) to have a simple job. It's easy to write a wrapper for the simple, common tasks. You almost never want to call iconv() directly from most code, unless you actually need to. //here is an example utf-8 formatter BTDTGTTS. BTDPQKKD! (trans: what?) Obeying the locale's encoding is both good practice and an absolute requirement for most; outputting UTF-8 in all locales is simply wrong. It's certainly very bad advice. But, you're converting utf-8 values that (strictly speaking) are out of range _and_ assuming the wchar_t is a UCS character. Why does Mr. Lazy even care about ancient non-__STDC_ISO_10646__ systems? He's lazy! :) (But you should be using mb[r]towc anyway.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
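For the mb[r]towc route mentioned above, a minimal locale-respecting decoder might look like this (a sketch; the function name is mine). The point is that it honors whatever encoding the current locale specifies instead of hard-coding UTF-8:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Decode a multibyte string in the current locale's encoding into a
 * wide-character buffer using mbrtowc(), rather than hand-rolling a
 * UTF-8 decoder.  Returns the number of wide characters written, or
 * (size_t)-1 on an invalid or truncated sequence. */
size_t decode_locale_string(wchar_t *dst, size_t dstlen, const char *src)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);     /* initial shift state */
    size_t srclen = strlen(src), n = 0;

    while (srclen > 0 && n < dstlen) {
        size_t r = mbrtowc(&dst[n], src, srclen, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;     /* invalid or truncated sequence */
        if (r == 0)                /* decoded an embedded NUL */
            break;
        src += r; srclen -= r; n++;
    }
    return n;
}
```

Remember to call setlocale(LC_ALL, "") first, or this decodes in the C locale (exactly the pitfall from the wcwidth thread above).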
Re: Lazy man's UTF8
On Thu, Sep 19, 2002 at 01:21:10AM -0400, [EMAIL PROTECTED] wrote: Unless you believe that locales shouldn't specify encoding, and are unhappy with their implementation (too global). If an application wants to provide more detailed encoding configurations (such as editing multiple files in different windows, like Vim can do), that's fine, but it should always default to obeying the locale (which Vim does). The locale certainly shouldn't allow saying things like use UTF-8 for the terminal and EUC-JP for files, since that's far more complicated. (What do you use if you're formatting from stdin? It might be either.) Also, using them isn't necessarily future-proof. For example you generally wouldn't want to use the mb functions if all your output was ucs-4 wide characters. (are there any utf-32 locales?) (assuming s/utf-32/ucs-4/; they're close, but not synonymous) No, but if there was, then the multibyte encoding would be UCS-4, and the mb* functions would treat them as such--wide characters and locale characters would contain the same binary data, mblen() would always return 0 or 4, and converting wc-mb would be a no-op. (Ignoring endianness, and all of the other numerous reasons you don't use UCS-4 as a locale encoding.) Why does Mr. Lazy even care about ancient non-__STDC_ISO_10646__ systems? He's lazy! :) Taking this argument to its logical conclusion; why care about those using legacy (non-UTF-8) encodings... My personal opinion is that there's been plenty of time for systems to support __STDC_ISO_10646__; the fact that almost all systems do is evidence that it's been long enough, and I don't want to go out of my way to support systems that are lagging so far behind. However, there are a lot more people who still, for one reason or another, can't use UTF-8, so there's a lot more reason to support legacy encodings. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: input methods
On Thu, Aug 29, 2002 at 05:43:24PM -0400, Maiorana, Jason wrote: Does anyone know of a general purpose input method library which is not dependent upon anything else? By that I mean not dependent upon X-Windows, not dependent upon a console, not relying upon locales whatsoever, and not tied to any specific application, and doesn't even know about fonts. I'd imagine this would be useful both as the backend of normal GUI IMs and also for use where standard IMs aren't suitable. For example, games: you want to render everything yourself, feed input from the user to the IM by hand (since you might be using something system IMs might not like, such as DirectInput), and not tie yourself to platform-specific IMs. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Forcing vim 6.0 to stay in UTF-8 mode in a UTF-8 locale
On Tue, Aug 20, 2002 at 10:42:23AM +0200, Bram Moolenaar wrote: do also: set fileencoding=utf-8 so that you do not encounter those nasty CONVERSION ERRORs The value of 'fileencoding' is changed as soon as you open a file. It's used to remember the encoding of the file (can be different from the encoding used inside Vim). You can also change it after reading a file, so that :w writes it with a different encoding. Well, is this exact? My default fenc is cp1252 (as I'm using the test setting I mentioned). If I load a UTF-8 file, fenc becomes UTF-8. But, if I then :new, the new window is created with fenc=cp1252, despite fenc being UTF-8. Doing a :set fenc in each window then shows that it's different for each, but :new always creates fenc=cp1252. This makes me conclude that there's a global fenc, which determines the default encoding of new files, and a local fenc to each window, marking the encoding of that file. That's fine, except it seems undocumented, and it's not clear how to explicitly set the global fenc versus the current local one. You probably want to set 'fileencodings' to utf-8 or make it empty. Then Vim won't check for a BOM or fall back to using latin1. You still get CONVERSION ERRORs when editing a file with an illegal byte sequence, and that's a good hint for the user. It'll also set the file readonly, though, which probably isn't wanted here. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
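For reference, the settings discussed in this thread combine like this in a vimrc (a sketch of the advice above; adjust to taste):

```vim
" Keep Vim in UTF-8 mode regardless of the files it opens:
set encoding=utf-8        " Vim's internal encoding
set fileencodings=utf-8   " only try UTF-8 when detecting a file's encoding
                          " (empty also works: no BOM check, no latin1 fallback)
" 'fileencoding' is per-buffer: it records the file's own encoding,
" and can be set after loading to re-encode the file on :w.
```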
Re: Forcing vim 6.0 to stay in UTF-8 mode in a UTF-8 locale
On Mon, Aug 19, 2002 at 06:13:23PM +0100, Markus Kuhn wrote: properly in UTF-8 mode, but it deactivates UTF-8 mode when you load instead a file that contains malformed sequences, such as http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt Make sure fencs and fenc are empty. However, it'll still set the ro flag when it finds invalid characters. That shouldn't happen here. Even worse, it also deactivates UTF-8 mode when you load a file that contains new Unicode 3.2 characters, such as http://www.cl.cam.ac.uk/~mgk25/UTF-8-demo.txt (that's ucs/examples/UTF-8-demo.txt) This works for me even with my normal fencs=ucs-bom,utf-8,latin1 setup; there's no reason Vim should ever fall out of UTF-8 mode for this reason. VIM - Vi IMproved 6.1 (2002 Mar 24, compiled Aug 13 2002 15:12:46) Upgrade? BTW. Bram, Vim isn't handling overlong sequences well. (It also doesn't handle 3.3 in UTF-8-test.txt like Markus suggests, but I think the display-every-character-in-hex behavior is better for an editor.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Forcing vim 6.0 to stay in UTF-8 mode in a UTF-8 locale
On Mon, Aug 19, 2002 at 12:54:24PM -0700, H. Peter Anvin wrote: One way is to treat each byte of a malformed sequence as a character (different from all real Unicode characters). This is a mostly good approach, except that it allows the user to construct a valid UTF-8 character out of malformed sequence escapes -- this may or may not be a problem in any particular application, but it needs to be taken into account, lest we get another instance of the overlong sequence problem. That's what Vim does. Malformed sequences show up as HEX, which functions as a single character. If the editor is 8-bit-clean, and you combine bytes that were parts of invalid UTF-8 sequences such that you have a valid UTF-8 sequence, you have a UTF-8 sequence; if I combine 0xC2 with 0xA9, it'd better write those two bytes to disk, even though it happens to correspond to U+00A9; doing anything else isn't 8-bit-clean. I tested this, and that's exactly what happens; pasting A9 in front of C2 turns the pair into (C). What could be done differently? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: world of utf-8
On Mon, Aug 19, 2002 at 08:29:21PM -0400, [EMAIL PROTECTED] wrote: The ultimate goal is that older encodings can start to fade away, and having every app that deals with text have to deal with a plethora of encodings and codeset conversion issues will be a thing of the past. Um, I think he knows this. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mk_wcwidth (OT)
On Tue, Jun 18, 2002 at 02:20:24AM -0400, Seer wrote: (Err ... how in the nineteen hells is this a simplification?) Well, mk_wcwidth would be algorithmically simpler itself, and all the interval/width data would be in one table or tree. (though a tree itself looks pretty bad when written as an initialized set of C objects) It's the complexity of the whole that I'm referring to, and setting up a tree is much more complicated. (You actually suggested code generation, which is orders of magnitude more complex.) If you really need a speedup for specific cases, it could work, but it's actually a tradeoff; speed one up and slow down others. (And it's not an even trade: for every one you move up the tree, you move two down.) Except for ASCII, that kind of tradeoff isn't very useful in general-purpose code. Not sure I agree with that. I think that a tree lookup would be significantly fewer compares. Admittedly, a difference wouldn't likely matter unless one was widthing megs worth of data. They're both O(log n) compares. They're doing the same thing, except a binary tree conceptually moves the binary search logic into the data structure. The only way you'd have fewer compares is if you optimized the tree for certain data sets, and except for ASCII, you can't do that in generalized code. If you think I'm wrong, please be specific. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
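The flat-table-plus-bisection approach being defended here is the structure Markus Kuhn's mk_wcwidth uses: a sorted array of intervals and an O(log n) binary search over it. A sketch (the table below is a tiny illustrative subset of combining-character ranges, not the real data; the real table has hundreds of entries):

```c
#include <stddef.h>

struct interval { unsigned int first, last; };

/* A tiny illustrative subset of zero-width (combining) ranges,
 * sorted by 'first' and non-overlapping, as mk_wcwidth requires. */
static const struct interval combining[] = {
    { 0x0300, 0x036F },   /* combining diacritical marks */
    { 0x0483, 0x0486 },
    { 0x0591, 0x05BD },
    { 0x20D0, 0x20FF },   /* combining marks for symbols */
};

/* Standard bisection over the interval table: O(log n) compares,
 * the same asymptotics as a balanced binary tree, with no pointer
 * structure to build.  'max' is the table's last index, following
 * mk_wcwidth's convention. */
static int bisearch(unsigned int ucs, const struct interval *table, size_t max)
{
    size_t min = 0;
    if (ucs < table[0].first || ucs > table[max].last)
        return 0;
    while (max >= min) {
        size_t mid = (min + max) / 2;
        if (ucs > table[mid].last) {
            min = mid + 1;
        } else if (ucs < table[mid].first) {
            if (mid == 0) return 0;   /* avoid size_t underflow */
            max = mid - 1;
        } else {
            return 1;                 /* ucs falls inside table[mid] */
        }
    }
    return 0;
}
```

A wcwidth-style function then just returns 0 when bisearch hits, and falls through to the single/double-width logic otherwise.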
Re: XTerm patch to call luit (2)
On Thu, Jun 13, 2002 at 09:43:19AM +0900, Tomohiro KUBOTA wrote: So, how do you think about the default of false? I don't like programs that support locales, but need special configuration to turn it on. They're annoying. Once people think the default should be true, then the default can be changed to true without annoying people. For any given default, people will be annoyed. It'll annoy me if it's false, and it'll annoy some other people if it's true. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ASCII and JIS X 0201 Roman - the backslash problem
On Fri, May 10, 2002 at 02:58:21PM +0200, Bruno Haible wrote: So it is a minor annoyance over the time of a few months, but by far not the costs that you are estimating. The problem isn't the conversion costs, it's the fact that Windows will continue to use the characters incorrectly, and will reintroduce the problem continuously. I'd give my left leg if someone would just show up and give me a reliable way to change my local Windows JP fonts to have a correct backslash. That would fix it for me, at least. It wouldn't help people that actually need to *use* the Yen symbol, since there'd still be no way to input the real single-width yen symbol, though it might be possible to add that to the input method. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ASCII and JIS X 0201 Roman - the backslash problem
On Fri, May 10, 2002 at 08:03:08PM +0100, Markus Kuhn wrote: The only long-term solution out of this mess is pure Unicode. Use proper Unicode fonts where U+00A5 is a (single-width) YEN and U+005C is a backslash, and (you normally should never need it) U+FFE5 is the FULLWIDTH YEN SIGN. An ideal long-term solution is of no use if it's impossible to get people to use it. Microsoft refuses to fix their buggy fonts, so it's unlikely this solution can ever see widespread use. Forget about the Shift_JIS and EUC_JP tradition and start to think in a context, where character semantics is completely and exclusively defined by Unicode. You will lose a few double-width characters (such as doublewidth Cyrillic and double-width block graphics), and you will discover that it is perfectly possible to write nice Japanese plaintext files nicely without any of these. For old files, people will surely Aren't there enough obstacles to getting Unicode accepted in some places without having to convince them they don't really need something they've been using for years? It doesn't really matter if it's true or not; it seems there are enough battles to be fought already. Out of curiosity, Tomohiro, is full-width Yen commonly used? (I'd guess 円 would be a more obvious choice for full-width.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Switching to UTF-8
On Thu, May 02, 2002 at 02:03:06AM -0400, Jungshik Shin wrote: I know very little about Win32 APIs, but according to what little I learned from Mozilla source code, it doesn't seem to be so simple as you wrote in Windows, either. Actually, my impression is that Windows IME APIs are almost parallel (concept-wise) to those of XIM APIs. (btw, MS Windows XP introduced an enhanced IM related APIs called TSF?.) In both cases, you have to determine what type of preediting support (in XIM terms, over-the-spot, on-the-spot, off-the-spot and none?) is shared by clients and IM server. Depending on the preediting type, the amount of work to be done by clients varies. I'm afraid your impression that Windows IME clients have very little to do to get keyboard input comes from your not having written programs that can accept input from CJK IMEs (input method editors) as it appears to be confirmed by what I'm quoting below. I wrote the patch for PuTTY to accept input from Win2K's IME, and some fixes for Vim's. What I said is all that's necessary for simple support, and the vast majority of applications don't need any more than that. Of course, what you do with this input is up to the application, and if you have no support for storing anything but text in the system codepage, there might be a lot of work to do. That's a different topic entirely, of course. It just occurred to me that Mozilla.org has an excellent summary of input method supports on three major platforms (Unix/X11, MacOS, MS-Windows). See http://www.mozilla.org/projects/intl/input-method-spec.html. I've never seen any application do anything other than what this describes as Over-The-Spot composition. This includes system dialogs, Word, Notepad and IE. This document incorrectly says: Windows does not use the off-the-spot or over-the-spot styles of input. As far as I know, Windows uses *only* over-the-spot input. 
Perhaps on-the-spot can be implemented (and most people would probably agree that it's cosmetically better), but it would probably take a lot more work. Ex: http://zewt.org/~glenn/over1.jpg http://zewt.org/~glenn/over2.jpg (The rest of the first half of the document describes input styles that most programs don't use.) The document states Last modified May 18, 1999, so the information on it is probably out of date. The only other thing you have to handle is described in Platform Protocols: WM_IME_COMPOSITION. The other two messages can be ignored. The only API function listed here that's often needed is SetCaretPosition, to set the cursor position. It's little enough to add it easily to programs, but the fact that it exists at all means that I can't enter CJK into most programs. Since the regular 8-bit character message is in the system codepage, it's impossible to send CJK through. Even in English or any SBCS-based Windows 9x/ME, you can write programs that can accept CJK characters from CJK (global) IMEs. Mozilla, MS IE, MS Word, and MS OE are good examples. Yes, you're agreeing with what you quoted. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Paper size
On Thu, May 02, 2002 at 05:30:47PM +0100, Edmund GRIMLEY EVANS wrote: But there is! Firstly, if you cut a piece of A4 paper into two halves, each has the same proportions as A4. Secondly, a piece of An paper has area 1/2**n of a square metre. Standard photocopier paper weighs 80 grams a square metre, so a piece of A4 weighs 5 g, and airmail postage rates go in steps of 5 g or 10 g ... Of course, it's not really 210x297mm; it's more like 210.224x297.302mm. These are just novelties to most people; I don't remember the last time I made a photocopy, and when I do, I don't mind that it doesn't scale perfectly. It's probably very useful for some people, but not most, and it's the majority that'll keep everyone from switching. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Switching to UTF-8
On Thu, May 02, 2002 at 11:38:38AM +0900, Tomohiro KUBOTA wrote: * input methods Any way to input complex languages which cannot be supported by xkb mechanism (i.e., CJK)? XIM? IIIMP? (How about Gnome2?) Or, any software-specific input methods (like Emacs or Yudit)? How much extra work do X apps currently need to do to support input methods? In Windows, you do need to do a little--there's a small API to tell the input method the cursor position (for when it opens a character selection box) and to receive characters. (The former can be omitted and it'll still be usable, if annoying--the dialog will be at 0x0. The latter can be omitted for Unicode-based programs, or if the system codepage happens to match the characters.) It's little enough to add it easily to programs, but the fact that it exists at all means that I can't enter CJK into most programs. Since the regular 8-bit character message is in the system codepage, it's impossible to send CJK through. How does this compare with the situation in X? * fonts availability Though each software is not responsible for this, "This software is designed to require Times font" means that it cannot use non-Latin/Greek/Cyrillic characters. I can't think of ever using an (untranslated, English) X program and having it display anything but Latin characters. When is this actually a problem? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Is there a UTF-8 regex library?
On Sun, Mar 31, 2002 at 03:53:52PM -0600, David Starner wrote: The dict standard dictates that all data crossing the wire shall be in UTF-8. Unfortunately, the reference implementation doesn't even try to get it right. I was discussing the issue with a maintainer of a Russian dictionary for dict, and part of the problem was that there was no UTF-8 regex engine. Does anyone know of a UTF-8 regex engine, preferably one that can be plugged into a GPL'ed C program easily? I know GNU grep (at least the alpha versions) implements generic multibyte matching. That's not an easy drop-in, of course. It was also orders of magnitude slower; I don't know if it was simply unoptimized. pcre(7) mentions experimental UTF-8 support. I haven't tried it. By the description, it looks extremely limited. In particular: 5. A class is matched against a UTF-8 character instead of just a single byte, but it can match only characters whose values are less than 256. Characters with greater values always fail to match a class. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
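For what it's worth, the reason a byte-oriented engine can't just be pointed at UTF-8 is that constructs like '.' and character classes must consume a whole multibyte sequence, not one byte. A minimal sketch of the sequence-walking logic such an engine needs (my helper names, not taken from grep or pcre; assumes well-formed UTF-8 input):

```c
/* Length in bytes of the UTF-8 sequence whose lead byte is b;
 * returns 1 for a stray continuation or invalid byte so that a
 * scanner can resynchronize instead of running off the string. */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* ASCII */
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Characters (not bytes) in a UTF-8 string: what '.' should count. */
static int utf8_strlen(const char *s)
{
    int n = 0;
    while (*s) {
        s += utf8_seq_len((unsigned char)*s);
        n++;
    }
    return n;
}
```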
Re: encdec-0.2.1 released
On Wed, Mar 13, 2002 at 05:05:25AM -0500, Michael B Allen wrote: char *dec_mbscpy_new(char **src, const char *fromcode); char *dec_mbsncpy_new(char **src, size_t sn, size_t dn, int wn, const char *fromcode); mumble grumble -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: encdec-0.2.1 released
On Wed, Mar 13, 2002 at 01:59:07PM -0500, Michael B Allen wrote: char *dec_mbscpy_new(char **src, const char *fromcode); char *dec_mbsncpy_new(char **src, size_t sn, size_t dn, int wn, const char *fromcode); mumble grumble Are you serious or are you joking? Serious in that they're overly-long names that don't follow patterns most everyone is used to; joking in that it's not a major issue that's worth spending time debating. (I'd probably rename them if I ever used the code.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Mar 07, 2002 at 10:54:11AM -0800, H. Peter Anvin wrote: But I can't see the BOM; ls just shows hello. That's why I'm suggesting that zero-width characters not useful in filenames be escaped as the above by ls and friends. (Nothing new; ls already escapes ASCII control characters and other things.) Agreed. ls -b in particular needs to be extra careful here. This *does* raise the question of what iswprint() and friends actually return. What about wcwidth() == 0? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
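One plausible rule, sketched below: escape anything that either fails iswprint() or occupies zero columns. This is only a guess at a policy, not what ls actually does, and it assumes setlocale() has been called so the classification data is loaded:

```c
#define _XOPEN_SOURCE 700   /* wcwidth() is an XSI interface on glibc */
#include <wchar.h>
#include <wctype.h>

/* Candidate test for "should ls-style output escape this character?":
 * anything unprintable, plus printable characters that take up no
 * columns (combining marks, ZWJ/ZWNJ, BOM, ...).  Assumes the caller
 * has done setlocale(LC_ALL, "") first. */
static int should_escape(wchar_t wc)
{
    return !iswprint((wint_t)wc) || wcwidth(wc) <= 0;
}
```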
Re: Statically link LGPL cp1252.h with MIT Licensed code?
On Mon, Mar 04, 2002 at 03:37:55PM -0500, Michael B Allen wrote: int enc_mbscpy(const char *src, char **dst, const char *tocode); int enc_mbsncpy(const char *src, size_t sn, char **dst, size_t dn, int wn, const char *tocode); char *dec_mbscpy_new(char **src, const char *fromcode); char *dec_mbsncpy_new(char **src, size_t sn, size_t dn, int wn, const char *fromcode); size_t dec_mbscpy(char **src, char *dst, const char *fromcode); size_t dec_mbsncpy(char **src, size_t sn, char *dst, size_t dn, int wn, const char *fromcode); for encoding and decoding strings. The two main differences here are that we're converting to/from many to one, where the one is the locale-dependent multi-byte string encoding (eg UTF-8), and that in addition to constraining the operation by sn and dn bytes you can also constrain the operation by the number of characters wn. Mbsncpy_new is like a mbsndup, and if dst is NULL for the dec_ functions it still works but Why not call it dec_mbs[n]dup? (I'd lean toward putting _dec/_enc at the end, too, but that's just my habits.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Statically link LGPL cp1252.h with MIT Licensed code?
On Sat, Mar 02, 2002 at 01:42:13AM -0500, Michael B Allen wrote: Can I statically link one of the codepage headers (eg cp1252.h) from libiconv with an MIT Licensed module? I would not actually alter the file of course, so a user could not modify the LGPL files in my module any more than if they had used libiconv directly. The LGPL is designed to allow programs with GPL-incompatible licenses to link against them; that license (assuming you mean http://www.jclark.com/xml/copying.txt) is GPL-compatible (says http://www.gnu.org/licenses/license-list.html), so you could link against it even if the header in question was GPL'd. (Strictly speaking, using headers isn't linking; I'm not sure how this is covered in the license, but the LGPL would be useless if it permitted linking but not including.) IANAL nor a license expert; assume all of the above is false. You'd be much better off looking for a license-oriented list or mailing the FSF. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Statically link LGPL cp1252.h with MIT Licensed code?
On Sat, Mar 02, 2002 at 01:42:26PM -0500, Michael B Allen wrote: Very strange that you ref James Clark's site, because it is his expat product that encouraged me to license my DOM as MIT, and I want to use the libiconv codepage headers to add support for extended character sets to this DOM that uses expat. Well, GNU's site says that the license is really the Expat license, not the MIT license. (That's how I interpret it, anyway.) Well actually these headers are not public and have code in them. The design calls for abstracting the conversion of a character to and from UCS codes by using a function pointer to code included in many different files. Regardless of the fact that these include files are .h files, they each have code in them. Well, if you're going to include the header itself *with* the program, you'll need to include a copy of the LGPL, too. I'm not sure if there are any other issues in this case. (FWIW, many glibc headers have inline code.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbscmp
On Mon, Feb 25, 2002 at 08:52:38PM +0100, Bruno Haible wrote: strncpy, strncat, strncmp cannot work for multi-byte characters because they truncate characters. You could write multibyte-aware versions of these, too, making them not truncate characters. That'd be useful for strncpy and strncat, at least. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
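A multibyte-aware strncpy along those lines might look like this; a UTF-8-only sketch with a hypothetical name (a general multibyte version would use mbrlen() per character instead of decoding lead bytes by hand):

```c
#include <stddef.h>
#include <string.h>

/* Copy at most n bytes (including the NUL) of UTF-8 from src to dst,
 * stopping at a character boundary so no multibyte sequence is ever
 * split; always NUL-terminates.  Returns the number of bytes copied,
 * excluding the NUL.  Assumes src is valid UTF-8. */
static size_t utf8_strncpy(char *dst, const char *src, size_t n)
{
    size_t i = 0;
    if (n == 0)
        return 0;
    while (src[i]) {
        unsigned char b = (unsigned char)src[i];
        size_t len = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 1;
        if (i + len + 1 > n)        /* whole sequence plus NUL must fit */
            break;
        memcpy(dst + i, src + i, len);
        i += len;
    }
    dst[i] = '\0';
    return i;
}
```

With "héllo" (h, 2-byte é, l, l, o) and n=3, plain strncpy would cut é in half; this version copies just "h".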
Re: mbscmp
On Mon, Feb 25, 2002 at 02:56:09PM -0500, Jimmy Kaplowitz wrote: I haven't tested this, nor really done anything relating to programming with i18n, but based on looking at man pages, you can use one of three functions (mbstowcs, mbsrtowcs, or mbsnrtowcs) to convert your multibyte string to a wide character string (an array of type wchar_t, one wchar_t per *character*), and then use the many wcs* functions to do various tests. My recollection of the consensus on this list is that for That's extremely cumbersome for everyday ops. Doing conversions at every turn is expensive, too. internal purposes, wchar_t is the way to go, and conversion to multibyte strings of char is necessary only for I/O, and there only when you can't use functions like fwprintf. However, wchar_t is only guaranteed to be Not always. Some people use the locale encoding internally; some use UTF-8 internally. They all have their advantages. wchar-based programs are still harder to debug; gdb doesn't deal with them yet. I expect there'll be a lot more libraries that expect locale-encoded char * strings in their API than will be providing an alternate wide interface. Using locale encodings internally is the quickest to start, but then you know nothing about your strings and need to convert everything for most ops (if you really want it to work). Converting existing programs is a case where wchar is particularly difficult. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
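For reference, the smallest useful form of the conversion dance uses mbstowcs's dry-run mode; a sketch (the helper name is mine) that assumes the program has already called setlocale(LC_ALL, "") and the input is valid in the locale encoding:

```c
#include <stdlib.h>

/* Count characters (not bytes) in a multibyte string by asking
 * mbstowcs to do a dry run: with a NULL destination it converts
 * nothing and just measures.  Returns (size_t)-1 on an invalid
 * multibyte sequence in the current locale. */
static size_t mbs_char_count(const char *mbs)
{
    return mbstowcs(NULL, mbs, 0);
}
```

A real wcs* operation would then allocate count+1 wchar_t's, call mbstowcs again to fill them, and convert back with wcstombs afterwards — exactly the per-operation overhead complained about above.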
strcoll and hiragana
On Mon, Feb 25, 2002 at 05:30:59PM +0100, Bruno Haible wrote: No. In glibc-2.2 strcoll works fine for all multibyte encodings. Speaking of which, this is perplexing me:

05:12pm [EMAIL PROTECTED]/2 [~] sort
あ
こ
ん
ん
こ
あ
(eof)
あ
こ
ん
ん
こ
あ

strcoll is returning 0. (Same for あ and ア.) (Language shouldn't matter, but this happens in both en_US.UTF-8 and ja_JP.UTF-8.) Kanji appear to be getting collated, however:

05:13pm [EMAIL PROTECTED]/2 [~] sort
日本
$Be:No(B
日本
(eof)
日本
日本
$Be:No(B

(I couldn't tell if that's the correct collation order, but it's clear they're being reordered, where the hiragana above are not.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
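A minimal way to poke at this from C, for reference; the helper name is mine, and it assumes setlocale(LC_ALL, "") has been called first (without it, the C locale makes strcoll degenerate to strcmp):

```c
#include <string.h>

/* Do a and b collate as equal in the current locale?  strcoll()
 * returning 0 for *distinct* strings means the locale's collation
 * data assigns them no relative order — the behavior observed above
 * for hiragana. */
static int collates_equal(const char *a, const char *b)
{
    return strcoll(a, b) == 0;
}
```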
Re: sorting order of Kanji
On Tue, Feb 26, 2002 at 09:42:25AM +0900, Tomohiro KUBOTA wrote: Kanji appear to be getting collated, however:

05:13pm [EMAIL PROTECTED]/2 [~] sort
日本
$Be:No(B
日本
(eof)
日本
日本
$Be:No(B

(I couldn't tell if that's the correct collation order, but it's clear they're being reordered, where the hiragana above are not.) It is impossible to collate Kanji by using simple functions such as strcoll(), because one Kanji has several readings depending on context (or word) in most cases. (This is the Japanese case.) (It is technically virtually impossible. It will need a natural language understanding algorithm.) I'm not concerned about the collation order of Kanji. (It's probably useful that there be one, even if it's just UCS order, to allow i.e. "sort | uniq".) There does seem to be collation for Kanji; I showed this to distinguish it from hiragana. The question was, why aren't katakana and hiragana getting collated? As far as I can tell, they should be. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Thoughts on keyboard layout input
On Sat, Feb 23, 2002 at 06:53:11PM +0100, [EMAIL PROTECTED] wrote: [on POSIX] I quoted the POSIX definitions. Nevertheless many people claim contradictory things about the POSIX point of view. Wonder whether my post went out to the list? This is the only reply I received from you to this thread; you might want to repost. [on delays and composing symbols] I have very bad experiences. For example, when using mutt on some remote machine to read my mail I may have to press downarrow five times before it is accepted. Sometimes net delay is such that it is impossible to get mutt to see an escape sequence. Delays for control characters (^[[A) and delays expected when actually typing are different. The former should never be a problem and I'd assume there's something wrong with your environment if that's happening--normally, the entire escape sequence goes out in a single packet so lag between packets shouldn't affect them at all. (Try vi-like j and k, by the way.) You can set ESCDELAY, if you need to (but you shouldn't.) I tend to lower this a lot, to get better response time for a single ESC. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Sat, Feb 23, 2002 at 10:18:28AM +0900, Gaspar Sinai wrote: This was just a suggestion to clean up things by specifying the characters that can be allowed for filenames. Currently we can not have /, ., .. and \0 for a filename. What if we say we can not More precisely, you can't have . or .. for a filename and you can not have / and nul *in* filenames, and you can look at the first two as these files already exist and not really a restriction as such. have composing and zero-width characters for a filename? Er, composing characters are OK, NFC just avoids them when there's a precomposed alternative available. (And Pablo said that there are some zero-width characters that are useful in filenames ... which is rather annoying.) Why can't we do that? Because filenames would go from being nearly 8-bit clean to having UTF-8 specific requirements. That's not the FS's job. And this wouldn't apply only to NFS: the problems you're describing would happen with local FS's, too--and they need to work with all active charsets, not just UTF-8. That would not need complicated normalization - just a character check. The current restrictions on filenames have been around forever, are unavoidable, and are the only things keeping filenames from being completely 8-bit clean. (Normalization involves changing text, as well; the existing restrictions are simply pass or fail.) Aside: can a UTF-8 string ever grow longer due to being changed to NFC? It's obvious that a wide char string can't, but it's not clear that this holds with UTF-8 (and if so, that it always will.) The problem occurs if normalization does happen - and some programs may do normalization. If any are normalizing to NFD, they should probably be changed to not do that. Fixing that isn't the FS's job. But the filesystem, C library calls, network protocols, etc. should *never* change filenames at all. That stuff must remain 8-bit clean (as far as it is now.) I'm not advocating any low-level constraints or normalization at all. 
I just want to be able to use UTF-8 in filenames, without hitting filenames that I can't use c+p to enter. That's not the FS's job to fix, it's the interface's. The simple solution, have tools escape zero-width chars and other oddities, isn't quite good enough, due to some of these characters being useful in filenames. (I might settle for it myself--I don't use any languages that need them--but it'd be nice to find a more general solution.) This isn't a new problem, it's new symptoms of an old one. The old ones were fixed by escaping invalid byte sequences, spaces, and ASCII control characters--the new symptoms just need to be worked out. (Invalid UTF-8 sequences aren't one of these new problems--ls already escapes those.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:08:24AM +0100, Radovan Garabik wrote: One thing that's bound to be lost in the transition to UTF-8 filenames: the ability to reference any file on the filesystem with a pure CLI. If I see a file with a pi symbol in it, I simply can't type that; I have to copy and paste it or wildcard it. If I have a filename with all Kanji, I can only use wildcards. (Er, meant copy and paste for the last; wildcards aren't useful for selecting a filename where you can't enter *any* of the characters, unless the length is unique.) sorry, but that is just plain impossible. For one thing, the c can quite well be U+04AB, CYRILLIC SMALL LETTER ES, ditto for other letters. But I agree that normalization can save us a lot of headache. Normalization would catch the cases where it's impossible to tell from context what it's likely to be. Input method should produce normalized characters. Since most filenames are somehow produced via human operation, it would catch most of pathological cases. Not just at the input method. I'm in Windows; my input method produces wide characters, which my terminal emulator catches and converts to UTF-8, so my terminal would need to follow the same normalization as input methods in X. Terminal compose keys and real keybindings (actual non-English keyboards) are other things an IM isn't involved in; terminals and GUI apps (or at least widget sets) would need to handle it directly. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:59:14AM +0100, Pablo Saratxaga wrote: It isn't that much of a problem. I think it's not a completely trivial loss, compared to an ASCII environment where filenames were completely unambiguous (invalid characters being escaped.) There doesn't seem to be any obvious fix, so I suppose it's just a price paid. The same thing could happen here; well, not as bad, as I don't think any program will purposely *change* the chars composing a filename previously selected (eg when doing open then save there wouldn't be any name change); but when a user will type manually a filename it could happen If a program wants to operate in a normalized form internally, it might, but that's probably asking for trouble anyway. that the system will tell him no such filename and he will be puzzled as he sees there is; as there is no visual difference between a precomposed character like aacute and two characters a and composing acute accent. Should control characters ever end up in filenames? I'd be surprised if many terminal emulators handled copy and paste with control characters well, if at all. (They don't need to be drawn, so I'd expect most that don't use them would just discard them.)

06:29am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBF`;'
06:29am [EMAIL PROTECTED]/2 [~/testing] ls
06:29am [EMAIL PROTECTED]/2 [~/testing] ls -l
total 0
-rw-r--r-- 1 glenn users 0 Feb 21 06:29 (rm)
06:31am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBFfile`;'
06:31am [EMAIL PROTECTED]/2 [~/testing] ls
file
06:31am [EMAIL PROTECTED]/2 [~/testing] cat file
cat: file: No such file or directory

I can't copy and paste it. Wildcards wouldn't help much if I'd stuck BOM's between letters (and *f*i*l*e* isn't very obvious, especially if you don't know what's going on, or if one's not really the letter it looks like), and tab completion may or may not help, depending on the shell. 
(Someone mentioned moving everything out of the directory and rm -f'ing; I should never have to do that.) Are control characters (and all non-printing characters) useful in filenames at all? If not, they should be escaped, too, to avoid this kind of problem. (Another one, perhaps: a character with a ton of combining characters on top of it. Most terminal emulators won't deal with an arbitrary number of them.) This reminds me of a discussion in pango and the ability to have different view and edit modes: normal (with text showing as expected), and another mode where composing chars are de-composed, and invisible control characters (such as zwj, etc) are made visible. Reveal codes for filenames? :) I don't know who would actually normalize filenames, though--a shell can't just normalize all args (not all args are filenames) and doing it in all tools would be unreliable. The normalization should be done at the input method layer; that way it will be transparent and hopefully, if all OS do the same, the potential problem of duplicates will never happen. See my other response: characters are often entered in other ways than a nice modularized input method; terminal emulators will need to behave in the same way as IMs for this to work, as well as GUIs at some layer. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:23:20AM +, Edmund GRIMLEY EVANS wrote: People are advocating normalisation as a solution for various kinds of file name confusion, but I can imagine normalisation making things worse. For example, file names with a trailing space can certainly be confusing, but would life be any simpler if some programmer decided to strip trailing white space at some point in the processing of a file name? I don't think so. You would then potentially have files that are not just hard to delete, but impossible to delete. If I have two computers, one sending precomposed and one not, I can't access my câr file, created on one, from the other. If terminal emulators, IMs, etc. send normalized characters, this isn't a problem. (It doesn't fix all problems, but it would help fix up some of the major ones.) Then, if a filename is being displayed by ls which doesn't fit the normalization form expected in filenames, display it in a way that shows what it really is. (c\u00E2r.) (Optional, of course.) This is less useful with the other unavoidable glyph ambiguities, though. cat certainly shouldn't normalize its arguments. I'm not even convinced that it's a good idea to force file names to be in UTF-8. Perhaps it would be simpler and more robust to let file names be any null-terminated string of octets and just recommend that people use (some normalisation form of) UTF-8. That way you won't have the problem of some files (with ill-formed names) being visible locally but not remotely because the server or the client is either blocking the names or normalising them in some weird and unexpected way. I'm not suggesting NFS normalize anything; this is just as important on a single system being accessed from multiple terminals. Sorry, the switch from NFS to filenames in general wasn't clear. What's so bad about just being 8-bit clean? Oh, network protocols *should* be 8-bit clean for filenames (minus nul). 
If I have a remote file with an invalid filename (an overlong UTF-8 sequence or just plain garbage), I'd better be able to access it over NFS. I don't think the FS (NFS, local filesystem, FTP, whatever) should touch filenames at all. (Mandating that they be UTF-8 in the standard is a good thing; enforcing it at the FS layer is not.) Related: I frequently can't touch filenames with non-English characters over Samba, and filenames with characters Windows bans from filenames. Windows displays them as some random-looking series of characters, and it doesn't always map back correctly. This doesn't really have anything to do with the network protocol--though the actual implementation problem might be in there--it's that it doesn't deal with invalid filenames properly. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
By the way, to all of the people threading on inputting other language text: I was showing a loss from ASCII--you can't type all filenames because some of them will have characters you can't necessarily type. This was a minor point, since (as I've said) it can't really be fixed. (Well, it could be fixed, but not cleanly.) OTOH, the unprinting character problem is important. Would it be reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output, ie ls -b), or is there some reasonable use of them in filenames? Combining characters at the beginning of a filename probably shouldn't be output literally, either. On Thu, Feb 21, 2002 at 03:33:40PM +, Markus Kuhn wrote: One thing that's bound to be lost in the transition to UTF-8 filenames: the ability to reference any file on the filesystem with a pure CLI. I can generate plenty of file names with ISO 8859-1 that you will have troubles typing in. Try a file name that starts with CR or NBSP just to warm up. Nothing new with UTF-8 here. Keep it simple.

02:01pm [EMAIL PROTECTED]/5 [~/testing] touch dquote hello
02:01pm [EMAIL PROTECTED]/5 [~/testing] ls
\nhello

ls escapes the control character. If I'm not in escape mode, it outputs a question mark; it never outputs it literally. It doesn't do this for Unicode unprinting characters. (NBSP isn't a problem here, since it can be copy-and-pasted.) Just like with the file £¤¥¦§¨©ª« I guess. Has that been a problem in practice so far? That can still be copy-and-pasted; the control character examples can not. Overly combined characters probably couldn't, either. We agreed already ages ago here that Normalization Form C should be considered to be recommended practice under Linux and on the Web. But Then we're in agreement. nothing should prevent you in the future from using arbitrary opaque byte strings as POSIX file names. In particular, POSIX forbids that the file system applies any sort of normalization automatically. 
All the URL security issues that IIS on NTFS had demonstrate what a wise decision that was. Please do not even think about automatically normalizing file names anywhere. There is absolutely no need for introducing such nonsense, and deviating from the POSIX requirement that filenames be opaque byte strings is a Bad Idea[TM] (also known as NTFS). Nobody's disagreeing on any of this. No, it won't. Unicode normalization will not eliminate homoglyphs and can't possibly. You try to apply the wrong tool to the wrong problem. Again nothing new here. We have lived happily for over a decade with the homoglyphs SP and NBSP in ISO 8859-1 in POSIX file systems. Security problems have arisen in file systems that attempted to do case invariant matching and other forms of normalization, and now we know that that was a bad idea (see the web attack log I posted here 2002-02-14 as one example). (this has been said already) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Fri, Feb 22, 2002 at 12:55:31AM +0100, Pablo Saratxaga wrote: OTOH, the unprinting character problem is important. Would it be reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output, ie ls -b), or is there some reasonable use of them in filenames? There are reasonable uses of zwj and zwnj and similar; they are needed for proper writing in some languages. In fact, all the trouble comes from the xterm, not from "ls". If a filename is a BOM followed by "hello", how can I enter it? I don't expect my terminal emulator to remember all control characters sent at any cursor position and paste them along with other characters, so I'd end up pasting "hello" alone. It's worse when the filename is *only* unprinting characters, and there's nothing on screen to copy at all. (That's just plain confusing, too.) We can't blame the terminal for not being able to copy and paste arbitrary sequences of bytes. It's not ls's "fault" either, per se (it's inherent), but that doesn't mean it can't help. I would say that ls should not escape them, only invalid utf-8 and control chars. Then, another command line switch should be added to "escape all but printable ascii". Well, I'd like all nonprinting characters escaped, but not, say, 日本語. That means I can copy and paste the filename, and characters that *can* be copied and pasted aren't escaped. (but see below) more complex options are not to be done in the command line on an xterm, a graphical toolkit is more suited for that. It's acceptable to go from "able to type all filenames with the keyboard" to "need to copy and paste filenames which I can't type directly". That's reasonable (if only because it's unavoidable). (As has been pointed out, it's already there in ISO-8859-1.) It's not acceptable to have filenames that I can't access from a CLI (with C+P) reliably at all (or that I need to switch to a special ls mode that escapes *everything* over ASCII to access.) 
Wildcards are a useful fallback, but they don't stand alone--it still wouldn't help me target a file consisting only of control characters, for example. Telling me to "use a GUI" is simply no good. (I'm not installing X on a 486 running FTP to delete a file someone dumped in my /incoming.) Files are an extremely fundamental part of a Unix system, and all fundamental parts of Unix are accessible from a CLI. That's always been one of its greatest strengths, and we can't throw that away for filenames. This is why GNU ls supports escaping. the reason is that with ls/xterm the rendering and the tool handling the filenames are dissociated, so you cannot easily do interesting things, ls supports escaping that matches bash's. (\ooo, \xHH, \n, etc.) If this is extended to include \u and \U, then ls can be extended to allow (optionally, for the sake of compatibility) displaying escape characters, etc. in that form. (I think that extension is useful, whether or not ls uses it.) Just because the tools aren't maintained by the same person doesn't mean there can't be cooperation. (Though, considering how difficult it's proving to be to get UTF-8 support at all in bash, I don't expect *all* shells to support this.) This doesn't involve xterm (or any terminal) at all, just the shell and tools. So, the only interesting change that would be worth doing for the use of utf-8 in filenames will be an extra switch to ls to quote everything but ascii, and ensure it quotes incorrect utf-8 when the locale is in utf-8 mode. I disagree; I think it's interesting, useful and practical to escape certain other cases. Leading combining characters, probably, and any characters not useful in filenames. (Of course, it's not necessarily easy to determine what's useful. I don't see BIDI support in filenames as useful--that seems to be a property of whatever text is displaying the filenames, not the filename themselves--but I'm not a BIDI user, so I can only guess.) 
I'm unclear on how control characters that change state behave in filenames at all. To pick a simple example, what if a filename contains the language code "zh"? I can no longer do a simple C program that outputs "The first file is %s. The second file is %s. [...]" as the text after the first %s is marked Chinese. (This probably won't break anything, but other control characters probably would.) Invalidate all state after outputting a filename? Complicated. (I don't know what zwj and zwnj do; perhaps a more practical example could be made with them.) Anyone feel like filling me in here? This would be like embedding ANSI color sequences in filenames and ls letting it through: the color would bleed onto the next line unless ls knew to reset the color after each filename. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
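Backing up to the \u escaping idea from earlier in this thread: concretely, it could look like the sketch below. escape_name is a hypothetical helper, not anything ls does today, and the printable test assumes setlocale() has been called so iswprint()/wcwidth() have data to work with:

```c
#define _XOPEN_SOURCE 700   /* wcwidth() is an XSI interface on glibc */
#include <wchar.h>
#include <wctype.h>

/* Copy a filename into out (capacity outlen wide chars), rewriting
 * anything unprintable or zero-width as \uXXXX (\UXXXXXXXX beyond
 * the BMP), so a name like BOM+"hello" is visibly distinct from
 * "hello".  Truncates rather than overflowing; always terminates. */
static void escape_name(const wchar_t *name, wchar_t *out, size_t outlen)
{
    size_t used = 0;
    for (; *name; name++) {
        if (iswprint((wint_t)*name) && wcwidth(*name) > 0) {
            if (used + 2 > outlen)      /* char + NUL must fit */
                break;
            out[used++] = *name;
        } else {
            int n = swprintf(out + used, outlen - used,
                             *name <= 0xFFFF ? L"\\u%04X" : L"\\U%08X",
                             (unsigned)*name);
            if (n < 0)                  /* escape didn't fit */
                break;
            used += (size_t)n;
        }
    }
    out[used] = L'\0';
}
```

A "zh" language tag or a stray color sequence handled this way would at least be visible instead of silently changing state in the terminal.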
Re: brocken bar and UCS keyboard
On Thu, Feb 21, 2002 at 09:49:01PM -0500, Henry Spencer wrote: No question there, but I think you have missed my point. The most crucial step is simply to get people to realize that there is more than one symbol involved and that the choice matters. So long as hitting the - key always gets them hyphen, that's not going to happen. Having them grumble that the stupid software keeps picking the wrong one would be an *IMPROVEMENT*. When they're visibly very similar, do you think most users are going to use them right, no matter how accessible they are? Hyphen and dash are distinct (most people who use dashes also know that you need two hyphens to act as a dash, not one), but a single hyphen looks reasonable as a minus sign in most fonts. A real minus sign usually looks better, but I doubt most people will care enough to want to learn the difference between *four* different characters on their keyboard that generate a horizontal line--hyphen, dash, minus and underscore. If they won't do that, they won't even consider changing their typing habits. Would you add separate open double quote, close double quote, open single quote, close single quote, neutral single and double quotes, apostrophe and backtick keys, too? They're all useful, but that's one heck of a keyboard. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: POSIX:2001 now available online (OT)
On Thu, Feb 07, 2002 at 04:05:22PM +, Markus Kuhn wrote: The revised POSIX standard, which has been merged with the Single UNIX Specification is now available online in HTML! For your bookmarks: http://www.opengroup.org/onlinepubs/007904975/toc.htm Neat--it completely blows up in IE6; http://zewt.org/~glenn/oops.jpg for the curious. Looks like you have to go through their annoying registration to use that URL. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 01:26:33PM +0900, Gaspar Sinai wrote: I just browsed through RFC-3010 and I found one thing that bothers me and it has not been discussed yet (I think). RFC says: "The NFS version 4 protocol does not mandate the use of a particular normalization form at this time." How do we mount something that contains a precomposed character like: U+00E1 (Composed of U+0061 and U+0301) If the U+0061 U+0301 is used and our server is assuming U+00E1, can a malicious hacker set up another NFS server that has U+0061 and U+0301 to mount his NFS volume? I could even imagine very tricky combinations with Vietnamese text but that would be another question... Forgive my ignorance if this was discussed - I did not see it in the archives. One thing that's bound to be lost in the transition to UTF-8 filenames: the ability to reference any file on the filesystem with a pure CLI. If I see a file with a pi symbol in it, I simply can't type that; I have to copy and paste it or wildcard it. If I have a filename with all Kanji, I can only use wildcards. A normalization form would help a lot, though. It'd guarantee that in all cases where I *do* know how to enter a character in a filename, I can always manipulate the file. (If I see cár, I'd be able to cat cár and see it, reliably.) I don't know who would actually normalize filenames, though--a shell can't just normalize all args (not all args are filenames) and doing it in all tools would be unreliable. A mandatory normalization form would also eliminate visibly duplicate filenames. Of course, it can't be enforced, but tools that escape filenames for output could change unnormalized text to \u/\U. I don't quite understand the scenario you're trying to describe, though. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: isprint() under utf-8 locale
On Fri, Feb 15, 2002 at 12:37:27PM +0100, Radovan Garabik wrote: in theory, yes but often it is used to filter out characters that should not go straight to the terminal, where they can be a source of a DOS attack (colour codes, switching terminal into graphics mode, backspaces - I happened to be a victim of such a joke a long time ago). ASCII escape values are still recognized as nonprintable, so none of these are a problem. (UTF-8 terminals shouldn't have a graphics mode, of course.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Security
You didn't seem to respond to the comments of your page on the earlier thread. If you're going to take such an extreme stance as "Unicode text is inherently insecure", you need to defend it. So, my own impressions: On Fri, Feb 15, 2002 at 10:16:39AM +0900, Gaspar Sinai wrote: I mostly recovered my shock :) Most people pointed out that the real juice on my security page was the second example. http://www.yudit.org/security/ "At yudit.org, we maintain the view that Unicode text is inherently insecure, until the current bi-directional algorithm defined by the Unicode Consortium is changed to be reversible. There should be an algorithm defined that converts logical order to view order, and there should be a separate algorithm defined that converts view order to logical order. If such an algorithm pair existed we could also run a sanity check on our rendering software. At yudit.org we will not digitally sign a Unicode document while this possibility exists." Mind elaborating on this logic? Since there's an off chance that text might be seen incorrectly in a few languages (and if this happens, there's an off chance in a few extremely contrived cases that it might make a sentence with a different meaning), you'll never sign messages in any language at any time? Signing text doesn't say "you will interpret this message as I intend"; it just makes sure it doesn't get tampered with in transit and verifies who the message is from. It's not the signature's job to make sure it's rendered, read or interpreted correctly. Assuming that this *is* a real security problem, not signing messages doesn't help anything; it just reduces security further. I can hardly see what this has to do with signatures at all. Also, regardless of the severity of this problem, Unicode text is not *inherently* insecure; that implies it's fundamentally flawed and can't be fixed. I don't think that's what you mean. 
The rest of the page is useful as an example of the problem; whether or not it's a serious issue is debatable, but it's clearly something people should know about. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Security
On Fri, Feb 15, 2002 at 01:44:03PM +0900, Gaspar Sinai wrote: Which pretty much shows that there is an ambiguity and the algorithm should change. My argument would be: if it needs to be changed anyway, can it be changed to make digital signatures easier, and to put scripts in it, like Old Hungarian (rovasiras), that can be written in both directions? The Unicode standard and the standards concerning digital signatures are separate. Fixing Unicode doesn't imply any changes in signatures. I could not reach this level in my arguments because I was told that there is no problem at all, and I felt I had two choices: being violent or just silently unsubscribing from the list. I chose the latter. You showed that there are problems with bidi rendering, and I don't think anyone disagreed with that. Your example was too contrived for people to consider it a major problem. (By the way, I don't think "violent" is the word you're looking for, unless you think your first choice was to mailbomb the list or something. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 09:54:52AM +0200, Zvi Har'El wrote: I am using the same Mutt 1.3.27i (2002-01-22) with utf-8, and it even has problems 1.2.5 didn't have. For example, when the subject includes characters with 2-byte utf-8 representations, its length is not calculated correctly for display in the index page, and it is truncated prematurely; but when you step over it with the cursor, it highlights also the next line with the rest of the subject. Refreshing the index eliminates the phenomenon. I had no problems with viewing the subject in the message page, both in the title and in the headers. I am using an external pager, less, and mutt passes it the correct subject. But this was also ok in 1.2.5. My configuration

Current ncurses doesn't deal with multibyte characters, so the cursor position becomes desynchronized. Mutt has a special case for UTF-8; it sends UTF-8 line drawing characters manually:

    case M_TREE_LLCORNER:
      if (option (OPTASCIICHARS))
        addch ('`');
      else if (Charset_is_utf8)
        addstr ("\342\224\224"); /* WACS_LLCORNER */
      else
        addch (ACS_LLCORNER);
      break;

They may have added this since 1.2.5. It helps with Debian's multibyte-patched version of Slang, which breaks the ACS stuff in the ncurses emulation. When compiling with real ncurses, however, it'll just confuse it. (It'll draw the character correctly, and desync the cursor.) I don't know what happened to line drawing characters with ncurses in UTF-8 before this special case. Try adding "set ascii_chars" to your .muttrc as a quick workaround. I don't know of any quick workaround for actual subjects with non-ASCII characters (which will be represented with two or more bytes in a UTF-8 locale). You shouldn't be having any problem if you're in a simple 8-bit locale, displaying subjects with UTF-8 in them; I've never seen any problems like that, though. A warning about using slang as ncurses in general: it's not perfect. It'll break meta-characters in Mutt and most other apps. 
(This is fixed; the fix isn't released yet.) There are probably other glitches. (Note that the above special case isn't actually wrong; if the ncurses in use is locale-aware, it should be able to handle it. However, if the ncurses in use is locale-aware and does ACS properly, it's also completely unnecessary--it should go away once multibyte ncurses is stable.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 05:43:30AM -0500, Thomas E. Dickey wrote: Current ncurses doesn't deal with multibyte characters, so the cursor position becomes desynchronized. There are enough multibyte calls implemented in ncursesw to make this work. (addstr is not one of them - but I don't see that the OP was using ncursesw anyway). I don't believe that anyone tried making it work with ncurses first. There are 66 addstr() calls in Mutt, and I don't know what other functions can't cope with multibyte. Every ncurses call it makes will need to deal with it; even basic messages may contain multibyte UTF-8 if it's in a different language. If a basic function like addstr doesn't support it, then I'd assume a lot of work would be needed. The slang patch is almost drop-in, so it's an easy stopgap until multibyte support is in mainstream ncurses. (I'm assuming that the ncursesw naming is temporary, until it's fully implemented.) It just needs a couple of workarounds to make it work for UTF-8. (There may be other reasons for the UTF-8 line drawing special case; I'm naming the one major effect I know of.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 02:33:20PM -0500, Thomas E. Dickey wrote: ...and it's trivial to redefine it with a wrapper. And any other ncurses calls that take text. This, presumably, won't be needed once the regular (non-wide) functions are multibyte-aware; the slang patch is just a stopgap until that's ready. my point: the number of people who actually follow up with proposed patches I can count on one hand - while I've lost track of the people who stand around waiting for someone else to do the work. I'm saying why I think Mutt acts like it does; I'm not proposing it be changed. I'm fine with leaving it alone until ncursesw is done. Mostly. I'm not excited about it being the way it is in the next Debian release, since that means a lot of people will be stuck with it; if there's anything that'll get me to write a patch intended to be removed shortly after, it's that. By the way, ncurses(3X): The ncurses library is intended to be BASE-level conformant with the XSI Curses standard. Certain portions of the EXTENDED XSI Curses functionality (including color support) are supported. The following EXTENDED XSI Curses calls in support of wide (multibyte) characters are not yet implemented: ... addstr isn't in this list, so I assume this is a list of missing wide support, not multibyte; perhaps (multibyte) should be removed so this doesn't imply that multibyte is implemented? The slang patch is almost drop-in, so it's an easy stopgap until multibyte support is in ncurses mainstream. (I'm assuming that the ncursesw naming is temporary, until it's fully implemented.) It just needs a couple actually the more I look at it, the better it looks from the standpoint of compatibility - not that this is guaranteed to have much impression on bulk packagers. (I understand that slang users cannot possibly be concerned about compatibility - or else they haven't thought very long about it). Er, what looks better for compatibility with what? 
The slang patch is horrible for compatibility (it's not binary-compatible, not quite source-compatible, though most programs wouldn't notice, and it's a bit ugly). Do you mean that leaving wide and multibyte support in its own library is better for compatibility? I'd hate to see that, at least for multibyte support--which, presumably, would depend on wide support. What problems could that cause? (Programs that aren't locale-aware won't setlocale(), so the behavior should be unchanged.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mutt crashes on utf-8 encoded headers
On Tue, Feb 12, 2002 at 04:14:21AM +0100, Damjan wrote: Anyone seen this? I'm using mutt 1.2.5.1i and sometimes it would crash when entering my linux-utf8 mail folder. Well, it turns out that this message crashed mutt Message-ID: [EMAIL PROTECTED] because it contained a line like this: From: Richard =?utf-8?B?xIxlcGFz?= [EMAIL PROTECTED] Is there something wrong with my mutt version, or is this a known bug of mutt? btw - Slackware 8.0, glibc 2.2.3, gcc 2.95.3, mutt compiled from source. 10:31pm [EMAIL PROTECTED]/5 [~] mutt -v Mutt 1.3.27i (2002-01-22) I'd upgrade. (I'd point this at mutt-dev, too. :) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ncurses or slang [Re: UTF-8 support status.]
On Sun, Feb 03, 2002 at 11:28:40AM -0500, Thomas Dickey wrote: Er, xterm shouldn't honor ACS controls in UTF-8 mode. One of the reasons I like UTF-8 as a terminal encoding is that they don't explode if I accidentally dump random binary data to it, which I tend to do at least once a day. :) hm (doesn't explode). try this (if you do) reset; tput enacs I know how to fix it; UTF-8 means it never happens to begin with. It's something that should go away completely with the UTF-8 transition. Leaving it on in the meantime doesn't hurt, as long as it's configurable. (It doesn't matter to me; I don't use X, and my terminal emulator handles this the way I prefer.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ncurses or slang [Re: UTF-8 support status.]
On Sun, Feb 03, 2002 at 06:55:23PM +, Markus Kuhn wrote: Thomas Dickey wrote on 2002-02-03 16:28 UTC: Er, xterm shouldn't honor ACS controls in UTF-8 mode. One of the reasons I like UTF-8 as a terminal encoding is that they don't explode if I accidentally dump random binary data to it, which I tend to do at least once a day. :) hm (doesn't explode). When I execute in the UTF-8-mode xterm [XFree86 4.0.1h(149)] of Red Hat 7.1 in a shell the line printf '\x1b(0' then xterm changes the Unicode values U+0020 to U+007E to the DEC graphics character set, even though it is supposed to ignore ISO 2022 sequences while being in UTF-8 mode, because UTF-8 is one of the encodings outside ISO 2022 in the sense of ISO 2022. Has this bug been fixed in more recent versions of xterm? It seems the problem is that terminfo has no real way to deal with these sequences. That is, if I'm in UTF-8, then my terminfo caps acsc, enacs, smacs, and rmacs need to be changed. Terminfo can't simply blindly change what it returns for these sequences because the locale charset is UTF-8; there's a chance the library is being used to simply read caps (ie. infocmp). (The basic problem is that there's nothing in the terminfo API to handle this interaction between terminfo entries and the locale. It could be handled at a higher level--completely within ncurses--but then terminfo would still be returning incorrect information.) This needs to be sorted out before terminal emulators can drop these codes when in UTF-8 mode, since doing the latter first means breaking line drawing characters for all terminfo/ncurses/slang apps. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Announcing Bytext
On Sun, Feb 03, 2002 at 06:15:33PM +0100, Pablo Saratxaga wrote: Many of the elegant features of Unixes depend on the notion of 8 bit transparency: pipe, cat, echo... the byte stream is the common denominator. The functions are general purpose and thus more useful. Bytext takes this elegant notion to its logical conclusion: not only can you process text as bytes, you can also process bytes as text. I don't understand; how can you encode in an 8bit space all the characters of the world's languages? And if it is a multi-byte encoding, then it should have about the same problems as utf-8 or euc have when faced with byte-only utilities. It sounds to me that any 8-bit character sequence (hopefully excluding nuls) is a valid character. That doesn't sound particularly useful, though. (So what if an arbitrary byte sequence can be displayed as random-ish characters of equally random languages?) If it's the case that any string of bytes is a valid character, then that brings up the question of how robust it is. (Seeking, sync; issues that UTF-8 solved.) I tried to look this up, but one of the first things I saw when paging down the Word version (after it asked me for a password but worked anyway) was: "Unicode is messed up beyond repair." I promptly became disgusted and closed the window. Remarks like that have no place whatsoever in a standard. How can he possibly wonder why he gets negative reactions from Unicode folks when he's making comments like this? -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ncurses or slang [Re: UTF-8 support status.]
On Sun, Feb 03, 2002 at 07:34:58PM +, Markus Kuhn wrote: I believe you are thinking the wrong way here. As soon as you are in UTF-8 mode, the only correct way to send block graphics characters to the terminal is via the U+25xx UTF-8 sequences, not via terminfo ISO 2022 fiddling. Terminfo sequences must *not at all* be used in UTF-8 locales to draw certain characters. The way to interface with this doesn't need to change substantially: make the acsc cap capable of dealing with multibyte encodings. Then, if you're in UTF-8, enacs, smacs and rmacs are blank (since there's no state) and make the acsc mapping map directly to UTF-8 strings. Exactly how this string can deal with multibyte characters is an internal terminfo implementation detail; the end result is that the acsc returned from terminfo can be interpreted as a multibyte string in the current locale. Make ncurses deal with this (not difficult) and you get UTF-8 support without changing the basic terminfo. (That's important, of course: UTF-8 doesn't need real special casing by things using terminfo, and apps are more likely to work in older encodings by people who write and test primarily in UTF-8.) The only problem with this is how terminfo knows the application wants or does not want this behavior, since apps using terminfo for purposes other than actually rendering to the terminal may not want it. If wctomb(&seq_hor, 0x2500) > 0, then do not use terminfo to draw this graphics character, because you already have the correct sequence to draw BOX DRAWINGS LIGHT HORIZONTAL stored in seq_hor. Then you have to special case these characters further; it'd be nice to avoid that. (And, er, don't you mean wctomb(seq_hor, 0x2500)? seq_hor needs to be a char[MB_CUR_MAX], not a char.) Long term, doing this is better than leaving things as they are, of course; I think the above is better, though. Sounds like there are bugs in both ncurses/slang and xterm here at the moment that cancel each other out. 
Both should be fixed as soon as possible. But the terminfo/ncurses/slang problems need to be fixed first; that way there's no period where line drawing characters simply don't work. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Announcing Bytext
On Sat, Feb 02, 2002 at 02:16:37AM -0800, Bernard Miller wrote: Hopefully flags will go off when members of this list read things that are equivalent to "I don't understand it, but here is my opinion on it". Flags certainly go off; but which flags depend on who is saying it. David Starner is not an idiot. Bytext is a superset of Unicode normalization form C, so it certainly encodes all of ASCII including form feed, and all combining characters. ASCII code points are rearranged partly so that characters like form feed can be quickly identified by normalization algorithms. This is far from losing ASCII compatibility. It simply means that conversion must be proper, not simply ignoring certain ranges. Also, there is no need for a new In other words, losing ASCII compatibility. If I have to convert the file, then it's not compatible; it needs an intermediary. That's the biggest reason UTF-8 exists; it provides a relatively easy transition path, since it's a superset of ASCII. Without that, UTF-8 would never have caught on, either. that it will never catch on. Many people who seem to have an emotional attachment to Unicode seem to be providing this as the only evidence that Bytext is not worthwhile... as if how interesting something is should be directly related to how well developed and popular it is. Again, I hope flags go off. If the only thing this has over UTF-8 is fast regex, then it loses overall; complexity is a strike bigger than the gain. I read a simple description of UTF-8 once and immediately had a strong understanding of its structure, capabilities (easy reverse scanning; fast substring searching), advantages (direct compatibility with ASCII, robustness that most multibyte encodings lack) and so on. Note also that popularity among developers *does* say something that popularity among the masses does not. Developers tend to choose their APIs and standards more deliberately than users choose their software. 
You really need to stop arguing your point by arguing the motives (emotional attachment) and insulting the intelligence of people (they just can't understand it!) disagreeing with you. As you say, flags go off when I see that. (Ad hominem flags, incidentally.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sun, Jan 13, 2002 at 03:38:55AM -0600, [EMAIL PROTECTED] wrote: Now, it's not too hard for Xiph to avoid this problem, as long as they define how to handle these translations. Why should they define it? It's at the wrong level - let the system define the conversion. Because that's not portable. Read http://www.debian.or.jp/~kubota/unicode-symbols.html. But the easy solution for Ogg--0x5C to U+00A5--doesn't work for a lot of things. I can't convert everything from CP932 to standard Unicode this way; my C source containing printf("Hi\n"); would no longer function, since the \ is converted to a yen symbol. Like anyone involved in this discussion couldn't have written code to convert the backslashes in C code intelligently in the time it took to have this argument. Heck, we could probably have even traced variable usages to find what's used as a filename argument in this time. An Excel programmer could probably have done the exact same thing in this time. Then you introduce all of the complexity and unreliability of intelligent parsers, instead of the simplicity of translation tables. It also means that iconv() simply won't work for this translation. Every application that uses iconv() would have to know data types (to know which parsers and heuristics to use) and have a special case for this. This isn't about translating CP932 to Unicode once, it's about allowing them to coexist peacefully, letting CP932 be phased out, as is done with every other charset. There is an upgrade path; intelligently convert the character. I think fixing the problem now is better than everyone dealing with it for the next 40 years. If it were so easy to do, we wouldn't be having this discussion (nor would any of the others who have had this discussion, so many times in the past.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
I'm not even certain where the conversation is now; there are two distinct issues: 1: handling of CP932 0x5C, and 2: portable translation tables. (These only partially overlap.) Since one of your mail readers doesn't honor References, the threads get broken and are much harder to follow. So, if I mix responses to these issues, let me know. On Sun, Jan 13, 2002 at 06:06:11PM -0600, David Starner wrote: Because that's not portable. Read http://www.debian.or.jp/~kubota/unicode-symbols.html. I know the problem. It still doesn't mean that every file format that includes Unicode should define its own solution. So we should sit back, accept Unicode as nonportable, and provide things like RFC 2047 so people can use other encodings? No thanks. And if we simply say "use UTF-8", and people use whatever translation tables their system happens to use, then it's a lot harder to fix things if and when Unicode standardizes it. If the file format uses a specific set of translation tables, then as long as you can tell if the format is using the old one, you can convert it to the new one automatically. If it doesn't do that, the file might have been converted with *any* table, and it's quite impossible to fix existing data. And file formats aren't going to wait to be used until Unicode fixes the portability problems, especially since it's not even clear that they intend to fix it at all. Yes? The main difference I see between my solution and yours is that yours introduces intelligent parsers into every Unicode system, whereas mine deals with it at one place, where the conversion from CP932 happens. I'm not advocating intelligent parsers at all. (In fact, all of the suggested solutions have their problems; I believe this particular suggestion has by far the most.) Every application has to special case it under your situation, too. Under mine, only systems that plan to deal with CP932 have to special case it, and that code will eventually be removable. Nope. 
Using a specific translation table merely means changing your iconv() call to one provided that uses them. Using intelligent parsers means you need to have different parsers for each data type, so you can't use a simple interface like that. Apparently they have a hard time coexisting - poor semantics on CP932's fault, not Unicode's. I don't see transfering that bug to Unicode will help things in the long run. It doesn't matter who's fault it is (I believe it would be JIS X 0201 Roman, where Tomohiro said CP932 got 0x5C.) It's in heavy use, and it needs to be dealt with. ISO646-DE users did it. So did ISO646-DK, ISO646-ES and all the rest of the 7-bit codes. Why is it so different for CP932? Considering that ISO646-DE puts a character on 0x5C that would be used as a part of words (unlike CP932), I'd suspect the situation is different. (It's one thing to not be able to use yen symbols in filenames in Windows; it's quite another to lose a character.) I don't know anything about their use; perhaps someone who does would enlighten us. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sun, Jan 13, 2002 at 08:26:57PM -0600, David Starner wrote: Is ISO-8859-1 not portable because you can't round trip CP932 through it? Why does CP932's lack of definition make Unicode unportable? People already pound Unicode for compromises with older systems; one more won't make people love it. Um, ISO-8859-1 is completely irrelevant. It doesn't claim to be a charset for Japanese users; Unicode does. If I take a CP932 document, convert it to Unicode, and then back to CP932, I'd better get exactly what I started with, or we don't have round-trip compatibility. That had better work across systems, too. This has nothing to do with any compromise on Unicode's part; it's merely a matter of defining a table and using it. (Incidentally, if programmers consistently distinguish CP932 from Shift-JIS, this isn't a problem for that particular codeset; since it's MS's charset, using MS's table is fine. This is a problem for all of the CJK encodings, not just CP932, however. In practice, many Japanese programmers may not know the difference and use a Shift-JIS translation. Also, making sure all of the original CCS mappings line up is probably more important, so if you go from CP932 to Unicode to EUC-JP to CP932 you end up with the same thing.) People are going to use whatever translation tables their system happens to use. Some systems are going to translate all strings to UTF-8 as standard practice - Java based systems, for example, and Gnome looks like it's heading that way. Others just aren't going to be interested in messing around with it - ANSIToUnicode, or iconv, or whatever the library call is already does it; why are they going to reinvent the wheel? The threat is that, if portable round-trip conversions aren't available, some users (programmers) who value round-trip compatibility more than Unicode will break spec and dump native charsets in the files. (This *did* happen with ID3 tags; this isn't a made-up threat.) 
That's probably the single worst case scenario, and must be avoided. What was your solution? I got that you expected systems to display the backslash as the yen sign under certain conditions. Right? At one point; that doesn't really do anything to help the conversion problems, though. I've yet to see a reasonable solution that does. Luckily, this doesn't affect Ogg, nor does it affect any file format or protocol that doesn't treat \ as special; map 0x5C to U+00A5 and be done with it. It doesn't matter who's fault it is Actually, it does. Part of Unicode's success is that it's a simpler It doesn't matter whose (oops) fault it is. Whether it was MS's fault, Unicode's fault, JIS X 0201 Roman's fault or Santa's fault, the end result is the same, and it still needs a solution. solution than dealing with dozens of charsets. If you import the bugs of dozens of charsets into Unicode, it loses part of that. Yes, Unicode should offer a unified translation table. Barring that, the tables available at http://www.w3.org/TR/japanese-xml/ could be referenced - accepting that some systems won't or can't follow the recommendations. But importing the quirks and problems of other charsets (separate from those inherent in the script) into Unicode won't help things in the long run. Like I said, I'd definitely suggest using an existing table, not making one up from scratch; that *would* exacerbate the problem. Thanks for the link, by the way. (Unfortunately, it leaves a lot of things undefined; it lists ambiguities but doesn't seem to suggest solutions.) -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 05:01:32PM +0900, Tomohiro KUBOTA wrote: I think the only solution I've seen that can *work* for everybody, and doesn't have any showstoppers (that I can see), is your own suggestion of giving up and making backslash and yen two glyphs of U+005C. I can see a few problems with that, but they're all within the bounds of compromise. (And the bounds for this particular problem are very large ...) Do you mean the usage of Variation Selector? I think it is an interesting suggestion and a good compromise. However, (1) the problem that Windows CP932 text files cannot be transcoded into Unicode automatically is not solved. As you said, doing this is nearly impossible; no matter how you mark it, no solution can do this, since you can't reliably tell which one a 0x5C is supposed to be. (2) I imagine the Variation Selector is always needed for U+005C as Yen Sign. I don't think Microsoft will accept this. I'm not sure there's anything they will ... Note that the existence of problems doesn't mean the idea is bad, because there cannot exist any ideas without problems. We have to seek a better compromise and a smaller nightmare, not to seek a perfect solution which cannot exist. Yep; as I said, the problems with this aren't showstoppers. (Well, the "Microsoft won't do it" may be, but that's likely for any such fix ...) By the way, you might want to update the links on http://www.debian.or.jp/~kubota/unicode-symbols.html. While the nature of the problems you list is different, with Unicode obsoleting their own tables, it's still very useful information. Yes, I think the mapping tables are useful and the Unicode Consortium should not obsolete them unless defining a new authorized mapping table, just as I wrote in the document. Yes, but they did obsolete them, which means your links to the tables are broken. I'm suggesting you update them, since the files are still available. -- Glenn Maynard -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 03:13:00AM -0600, [EMAIL PROTECTED] wrote:
> It takes up space and developer time in the clients.  It's easy to end
> up with a spec that only gets partially implemented because it's so big.

If a player doesn't want to implement anything using the tags, it ignores
them--and if we didn't mention these tags at all, that's what they'd be
expected to do anyway.  (That's a reason I like this better than the
UTF8_LANG tag idea; it doesn't really add anything required at all, just
an "if you want to do this, then do it as Unicode defines".)

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 05:35:05PM -0600, [EMAIL PROTECTED] wrote:
> > If we tell CP932 users, "your 0x5C is a yen symbol, so translate it to
> > a Unicode yen symbol", what will they do?  Probably say "no, that'll
> > break almost all applications", just like our applications would break
> > if we changed ISO-8859-1 backslashes to Unicode yen symbols.
>
> You tell them that the Unicode backslash is a backslash, and the Unicode
> Yen is a Yen.  Let CP932 users make whatever arrangements they want -
> just please don't export the problems to general Unicode users.

Again, you give suggestions that, in practice, simply won't work.  That
kind of "let them deal with it, don't bother me" attitude is what Unicode
had (and has) to avoid in order to be universally accepted.

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Sat, Jan 12, 2002 at 10:16:26PM -0600, [EMAIL PROTECTED] wrote:
> IMO, one of the big problems Unicode has is that it is a large, complex
> standard.  Telling everyone that the Backslash character may be the Yen
> character annoys all the people on Unix and Macintosh, who never had to
> deal with the problem, and even annoys the Windows people who never had
> to consciously deal with it.  "Bother everyone, because someone had some
> quirk in a system" has to be avoided, to make a reasonable,
> implementable standard.
>
> More cynically, CP932 users are already Unicode users; all new versions
> of Windows are Unicode based.  Whether they accept Unicode or not is

Except that the vast majority of Windows programs use the codepage
encoding for most things, *not* Unicode--even new applications, since
most still want compatibility with Win9x.  What an OS uses at a low level
and what applications use at a high level are two completely different
things.  If CP932 were likely to fade away reasonably soon, this wouldn't
be an issue at all; but it's going to be around for quite some time.

> irrelevant; if they leave Windows for another desktop system, they're
> going to another system that doesn't confuse the Yen sign and the
> Backslash.  For Unicode acceptance, they don't matter.

For Unicode acceptance, most Japanese users don't matter?  I certainly
hope the Unicode Consortium never takes that position.

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
Re: Unicode, character ambiguities
On Fri, Jan 11, 2002 at 10:52:06PM +0900, Tomohiro KUBOTA wrote:
> > Fixing the source code at the source is a lot cleaner than inflicting
> > your fix on the rest of the world.  It's as bad as Oracle's attempt to
> > define a standard for its variant UTF-8 (CESU-8, which apparently
> > should be pronounced "sezyu" in English).  Their stated reason is the
> > same, that it's too much work to fix all of their databases, and their
> > cure is to lay even more work off on the rest of the world.
>
> First, this problem affects not only source code but also many texts of
> end users.  You can easily imagine that text files of end users contain
> many \ as a currency sign AND many \ as an element of file names.  Even
> if you succeed in persuading every Japanese Windows programmer to modify
> their source code, you won't succeed in persuading Japanese business
> users to modify files like accounts.xls.

A possibly more reasonable fix would be to change the fonts to the way
they're supposed to be, and reverse the problem: they get backslashes
instead of yen symbols for currency (and correctly get backslashes as
delimiters).  Everything still works, except they end up with the
problem, not the rest of the world.  Then change \ to the correct
Unicode yen symbol as appropriate (and most documents don't contain
directory delimiters).

The problem with this is that I suspect most Japanese wouldn't be
pleased to see backslashes instead of yen symbols.  (It's easy enough to
say "that's just as bad as the reverse; just do it", but that's not
going to get any of them to go along with it.)

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
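[Editor's aside, not part of the thread: the "change \ to the correct Unicode
yen symbol as appropriate" step above could only ever be a heuristic.  A
minimal sketch of one such pass follows; the function name and the
digit-based rule are illustrative assumptions of mine, not anything proposed
in the thread.]

```python
import re

def restore_yen(text: str) -> str:
    """Hypothetical cleanup pass after CP932 -> Unicode conversion.

    Assumes a U+005C backslash immediately followed by a digit was a
    currency amount in the original text (e.g. \\1500) and rewrites it
    to U+00A5 YEN SIGN.  Backslashes followed by anything else (path
    delimiters, escapes) are left untouched.
    """
    return re.sub(r"\\(?=\d)", "\u00a5", text)

print(restore_yen("C:\\temp\\readme.txt costs \\1500"))
```

Run on the sample string, the path delimiters survive and only the amount
gains a yen sign; real text would of course need a far more careful rule.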