GNU libunistring 0.9 released
Hi,

GNU libunistring 0.9 was released this week. Find below the announcement.

There is a mailing list for this project at
https://savannah.gnu.org/mail/?group=libunistring
You are invited to join this mailing list, in order to influence and participate in future releases of this library.

Enjoy!

Bruno

===

GNU libunistring is a library that provides functions for manipulating Unicode strings and for manipulating C strings according to the Unicode standard.

It consists of the following parts:

  unistr.h      elementary string functions
  uniconv.h     conversion from/to legacy encodings
  unistdio.h    formatted output to strings
  uniname.h     character names
  unictype.h    character classification and properties
  uniwidth.h    string width when using nonproportional fonts
  uniwbrk.h     word breaks
  unilbrk.h     line breaking algorithm
  uninorm.h     normalization (composition and decomposition)
  unicase.h     case folding
  uniregex.h    regular expressions (not yet implemented)

libunistring is for you if your application involves non-trivial text processing, such as upper/lower case conversions, line breaking, operations on words, or more advanced analysis of text. Text provided by the user can, in general, contain characters of all kinds of scripts. The text processing functions provided by this library handle all scripts and all languages.

libunistring is for you if your application already uses the ISO C / POSIX ctype.h, wctype.h functions and the text it operates on is provided by the user and can be in any language.

libunistring is also for you if your application uses Unicode strings as internal in-memory representation.

Download:
  http://ftp.gnu.org/gnu/libunistring/libunistring-0.9.tar.gz

This is the first public release.

Bruno
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: wcwidth update
Hello Markus,

Could you update your wcwidth implementation at http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c to the latest Unicode data?

Done.

This code assigns width 2 to U+4DC0..U+4DFF. But they are marked as 'N' in Unicode 5.0.0's ucd/EastAsianWidth.txt, therefore they should have width 1.

Bruno
Re: Proposed fix for Malayalam ( other Indic?) chars and wcwidth
Hello Rich,

These characters are combining marks that attach on both sides of a cluster, and have canonical equivalence to the two separate pieces from which they are built, but yet Markus' wcwidth implementation and GNU libc assign them a width of 1. It appears very obvious to me that there's no hope of rendering both of these parts using only 1 character cell on a character cell device, and even if it were possible, it also seems horribly wrong for canonically equivalent strings to have different widths.

What rendering do other terminal emulators produce for these characters, especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit a patch to glibc based on the data of just 1 terminal emulator.

Bruno
Re: utf-8 and well-formed but illegal chars
Rich Felker wrote:

hope this isn't too off-topic -- i'm working on a utf-8 implementation and trying to decide what to do with byte sequences that are well-formed but represent illegal code positions, i.e. 0xd800-0xdfff, 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as illegal sequences (EILSEQ) or decoded as ordinary characters? is there a good reference on the precedents?

The three cases are probably best treated separately:

- The range 0xd800-0xdfff. You should catch and reject them as invalid when you are programming a conversion to UCS-2 or UTF-16, for example UTF-8 -> UTF-16 or UCS-4 -> UTF-16. Otherwise it becomes possible for malicious users to create non-BMP characters at a level of processing where earlier stages of processing did not see them. In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.

- For the other two ranges, the advice is dictated merely by consistency. Most software layers treat 0xfffe-0xffff like unassigned Unicode characters, therefore there is no need to catch them. The range >= 0x110000, I would catch and reject as invalid. Some time ago I had a crash in an application because the first level of processing rejected only values >= 0x80000000, with a reasonable error message, and later processing relied on valid Unicode and called abort() when a character code >= 0x110000 was seen. Making the first level as strict as the later one fixed this.

Bruno
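As an editorial illustration of the advice above (the function name and signature are mine, not from the thread), the range check for a decoded code point might look like this:

```c
#include <stdint.h>

/* Classify a decoded code point: returns 1 if it should be rejected
   as invalid, 0 if it may pass through like an unassigned character. */
int reject_code_point(uint32_t c, int converting_to_utf16)
{
    if (c >= 0xD800 && c <= 0xDFFF)
        /* Surrogates: must be caught when the target is UCS-2/UTF-16;
           a UTF-8 -> UCS-4 conversion need not catch them here. */
        return converting_to_utf16;
    if (c >= 0x110000)
        /* Beyond the Unicode range: always reject, so that later
           processing stages can rely on valid Unicode. */
        return 1;
    /* 0xFFFE-0xFFFF pass through, like unassigned characters. */
    return 0;
}
```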
Re: i18n of shell scripts
Koblinger Egmont wrote:

The Bash manual only mentions the $"..." facility, but I cannot recommend using this facility, as it has a security hole by design.

I was just planning to use this feature. Could you please tell something (e.g. a link) about this security hole by design?

See the GNU gettext-0.14.5 manual, section "bash - Bourne-Again Shell Script":

A translator could - voluntarily or inadvertently - use backquotes `...` or dollar-parentheses $(...) in her translations. The enclosed strings would be executed as command lists by the shell.

Bruno
Re: i18n of shell scripts
D. Dale Gulledge wrote:

For what it's worth, according to the gettext manual, there is an interface to the gettext library for shell scripts. It's documented here: http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC197

More info about this is found in the gettext-0.14.5 manual, section "sh - Shell Script".

The Bash Reference Manual is similarly terse about how to use it: http://www.gnu.org/software/bash/manual/bashref.html#SEC13

The Bash manual only mentions the $"..." facility, but I cannot recommend using this facility, as it has a security hole by design.

Bruno
Re: Using utf-8 in an application
Here are the questions.

1) In livido.h we #include <wchar.h> - is this the right header for dealing with utf-8?

No. Wide characters are useless, because they differ in width and in representation between platforms. On some platforms, wide character values are even locale dependent.

We want to keep the header file as light as possible, so it would be preferable to include as little code as possible. The only functions we need are to get a string length in bytes, so it can be stored, and then to add a terminating utf-8 NUL when the string is retrieved, since the NUL is not stored.

strlen() will do it.

2) For getting the utf-8 string length in bytes, we use wcslen(). Is this the correct function?

No, use strlen().

3) When a string is retrieved, we must add a utf-8 terminating NUL to the end. How is this done?

Like you add an ASCII '\0' to an 8-bit string.

4) For testing purposes, I want to create a utf-8 string. Is there a simple way to convert a char * string to utf-8?

A char * is normally in a locale dependent encoding. To convert it to UTF-8, you need to go through iconv(). Look for example
- at the function u8_conv_from_locale() in libuniconv/uniconv.c in ftp://ftp.ilog.fr/pub/Users/haible/gnu/libunistring-0.0.tar.gz,
- or at extras/iconv_string.c in libiconv-1.10.tar.gz,
- or at the 'iconvme' module in gnulib (http://savannah.gnu.org/projects/gnulib).

Bruno
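To illustrate answer 4): a minimal sketch of such a conversion helper (the function name is mine, not from the cited libraries; it assumes setlocale() has already been called, as any use of nl_langinfo() requires, and uses a crude worst-case output buffer):

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Convert a locale-encoded string to a freshly malloc'ed UTF-8
   string, going through iconv().  Returns NULL on error. */
char *to_utf8(const char *src)
{
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t) -1)
        return NULL;
    size_t inleft = strlen(src);
    size_t outsize = 4 * inleft + 1;   /* worst-case expansion */
    char *out = malloc(outsize);
    char *inptr = (char *) src;
    char *outptr = out;
    size_t outleft = outsize - 1;
    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }
    *outptr = '\0';                    /* add the terminating NUL */
    iconv_close(cd);
    return out;
}
```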
Re: Capitalisation of text, which library is it?
Several applications allow you to convert text to all caps, such as Firefox and OpenOffice.org. Do you know where this information is stored or which library deals with this task? Is it CLDR?

Yes, it should be CLDR, because the glibc locale data files are only accessible through the glibc API, and this API doesn't, for example, do toupper(ß) = "SS", as needed for the German locale. Similarly in French, where often toupper(é) = E and not É.

The libraries which exploit CLDR are ICU and GNU glocale ([1], work in progress).

Bruno

[1] http://live.gnome.org/LocaleProject
Re: question on Linux UTF8 support
Sergey Poznyakoff wrote:

The GNU tar maintainer is working on a GNU pax program. Maybe he will also provide a command-line option for GNU tar that would perform the same filename charset conversions (suitable for 'tar' archives with UTF-8 filenames)?

It has already been implemented. The current version of GNU tar (1.15.1) performs this conversion automatically when operating on an archive file in pax format.

Thanks, indeed that works: When I create a .pax file (*) in a UTF-8 locale and use GNU tar 1.15.1 to unpack it in an ISO-8859-15 locale, the filenames are correctly converted.

But it is hard to switch the general distribution of tar files to pax format, because - while a tar as old as GNU tar 1.11p supports pax files with just a warning, and AIX, HP-UX and IRIX tar similarly - the Solaris and OSF/1 /usr/bin/tar refuse to unpack them. Could you add to GNU tar an option, so that it performs the filename conversion _also_ when reading or creating archives in 'tar' format?

Bruno

(*) It's funny that to create a .pax file I have to use "tar -H pax", because pax on my system is OpenBSD's pax, which rejects the option "-x pax": it can only create cpio and tar archives, despite its name :-)
Re: question on Linux UTF8 support
Danilo Segan wrote:

2. Is there any known application which still uses ISO-8859-XXX codesets for creating file names?

Many old (and new?) applications use the current character set on the system (set e.g. through LC_CTYPE or other LC_* variables).

I'd suggest all new applications to use UTF-8.

This will mess up users who have their LC_CTYPE set to a non-UTF-8 encoding. It is weird if a user, in an application, enters a new file name Süß, and then in a terminal, the filename appears as Süà (wow, it even hangs my xterm!). It is just as bad as those old Motif applications which assume that everything is ISO-8859-1. This makes these applications useless in UTF-8 locales.

In summary, I'd suggest
- that ALL applications follow LC_ALL/LC_CTYPE/LANG, like POSIX specifies,
- that users switch to a UTF-8 locale when they want.

Bruno
Re: question on Linux UTF8 support
Danilo Segan wrote:

what about user deciding to change LC_CTYPE?

A user who switches to a different LC_CTYPE, or works in two different LC_CTYPEs in parallel, will need to convert his plain text files when moving them from one world to the other. It is not much more effort to also convert the file names at the same moment.

Or even worse, what if the administrator provides some dirs for the user in an encoding different from the one the user wants to use? E.g. imagine having a global /Müsik in ISO-8859-1, and the user desires to use UTF-8 or ISO-8859-5.

For this directory to be useful for different users, the files that it contains have to be in the same encoding. (If a user put the titles or lyrics of a song there in ISO-8859-5, and another user wants to see them in his UTF-8 locale, there will be a mess.) So a requirement for using a common directory is _anyway_ that all users are in locales with the same encoding.

My point is that the filesystem encoding should be filesystem-wide (not per-user).

All that you say about the file names is also valid for the file contents. A lot of them are in plain text, and filenames are easily converted into plain text. But all POSIX compliant applications have their interpretation of plain text guided by LC_CTYPE et al.

That's not closer to ever solving the problem. It's status quo. I think we should at least recommend improvements, if not require them (and nobody suggested requiring them). Basically, my recommendation was to set LC_CTYPE to UTF-8 on all new systems.

We have the same goal, namely to let all users use UTF-8, and get rid of any user-visible character set conversions. I agree with the recommendations that you make to users and sysops. However, when you recommend to an application author that his application should consider all filenames as being UTF-8, this is not an improvement. It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and KOI8-R users.

Bruno
Re: viewing UTF-8 encoded man pages
Jan Willem Stumpel wrote:

In languages like Japanese or Chinese, there are line breaking opportunities not only at spaces. And there are fewer spaces than in European languages. I guess that groff is looking for spaces when deciding to do line breaking, and this line breaking algorithm doesn't produce satisfactory results when there are long runs of characters without spaces.

Yes, this makes sense, but does it display correctly in your case?

With groff-utf8 it doesn't display correctly: the line breaks are not well positioned. But it should be enough for a translator who wants to proofread his/her translated man page.

Wonder how FC 3 solves this.

groff on Fedora contains a 400 KB patch for Japanese, which includes some adjustments to the line breaking algorithm.

Bruno
Re: [Groff] Re: man page encoding
Andries Brouwer wrote:

The very long pipeline contains invocations of refer, ideal, pic, tbl, eqn, ditroff but also lots of preprocessors of my own. If the groff version of refer or tbl decides to turn my Latin-1 into UTF-8, then my own preprocessors later on in the pipeline will no longer be able to handle the input. On the other hand, if they turn stuff into \[...] or \N[...] escape sequences, then again my preprocessors are confused since this syntax is not traditional troff syntax, and unexpected in the input.

Don't worry here: we don't plan to change 'refer' or 'tbl' to convert Latin-1 input to something else. The plan is that when a user invokes groff, the constructed pipeline contains an invocation of 'gpreconv'. A pipeline that you construct by yourself will continue to work.

Now you say tough luck, and I don't mind, but if the idea is that groff has a compatibility mode ...

The compatibility mode is made for compatibility with AT&T UNIX troff. At that time, Latin-1 as an encoding didn't exist. Therefore it's hard to argue that -C should imply interpretation of non-ASCII input as being Latin-1.

2) We would have low acceptance from the people who produce man pages in EUC-JP, with the consequence that these -Tnippon hacks in groff (or equivalent hacks in man in some distributions) would need to stay forever.

But you talk as if you are forced to change groff in ugly ways because man is set in stone. But it is very easy to change man.

It is not easy to change the opinion of many Japanese people regarding the issue of EUC-JP vs. Unicode.

Bruno
man page encoding
Andries,

Currently on a Linux system you find man pages in the following encodings:
- ISO-8859-1 (German, Spanish, French, Italian, Brazilian, ...),
- ISO-8859-2 (Hungarian, Polish, ...),
- KOI8-R (Russian),
- EUC-JP (Japanese),
- UTF-8 (Vietnamese),
- ISO-8859-7, ISO-8859-9, ISO-8859-15, ISO-8859-16 (man7/*),
and none of them contains an encoding marker.

The goal is that "groff -T... -mandoc" on any man page works, without need to specify the encoding as an argument to groff. There are two options:

a) Recognize only UTF-8 encoded man pages. This is the simplest. groff will be changed to emit errors when it is fed non-UTF-8 input, so that the man page maintainers are notified that they need to convert their man page to UTF-8.

b) Recognize the encoding according to a note in the first line:

   '\" -*- coding: EUC-JP -*-

groff will then emit errors when it is fed input that is non-ASCII and without a coding: marker, so that man page maintainers are notified that they need to add the coding: marker.

Which of the two would you, as Linux man pages maintainer, prefer?

Bruno
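A sketch of how option b) could be recognized (the function name and shape are mine, purely illustrative; a real implementation would live inside groff or man):

```c
#include <stddef.h>
#include <string.h>

/* Scan the first line of a man page for an Emacs-style coding marker
   such as:  '\" -*- coding: EUC-JP -*-
   On success, copy the encoding name into enc and return 1; else 0. */
int find_coding_marker(const char *line, char *enc, size_t encsize)
{
    const char *p = strstr(line, "-*- coding:");
    if (p == NULL)
        return 0;
    p += strlen("-*- coding:");
    while (*p == ' ')
        p++;
    size_t n = 0;
    /* The encoding name runs up to the next whitespace. */
    while (*p != '\0' && *p != ' ' && *p != '\t' && n + 1 < encsize)
        enc[n++] = *p++;
    enc[n] = '\0';
    return n > 0;
}
```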
Re: viewing UTF-8 encoded man pages
Andries Brouwer wrote:

Hmm. Long ago I added some code to man that sufficed to make some Russian users happy. Forgot all details. See man-iconv.c. (Maybe that threw in an invocation of iconv when reading the pages?)

That worked because KOI8-R, like ISO-8859-1, consists of only 256 characters, and they all have width 1. For Unicode in general, you need the other trick contained in groff-utf8.tar.gz.

Bruno
Re: C source and execution encodings
Roger Leigh wrote:

  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main (void)
  {
    setlocale (LC_ALL, "");
    printf ("‘Name1’\n");
    printf ("%ls\n", L"‘Name2’");
    fwide (stderr, 1);
    fwprintf (stderr, L"‘Name3’\n");
    fwprintf (stderr, L"%s\n", "‘Name4’");
    printf ("‘Name5’\n");
    return 0;
  }

Try running this in a C locale!

  $ ./test
  'Name3'
  ‘Name1’
  ‘Name5’

I get this (on a glibc 2.3 system):

  $ LC_ALL=C ./test
  ‘Name1’
  ???Name3???
  ‘Name5’

Since the encoding of the C locale is ASCII, you can see that none of the outputs is suitable for the C locale. Conclusion: use gettext().

Bruno
Re: Gettext and UTF-8
Roger Leigh wrote:

I created a C.po file, and this installed as schroot.mo under /usr/share/locale. This po file simply converts the UTF-8 chars to the nearest ASCII equivalent, e.g. © -> (C). However, when running under the C or POSIX locales, bindtextdomain() never even checks for the existence of a message catalogue (checked with strace). Is this correct? If so, is this a gettext or libc bug?

gettext() does no conversion at all when running in the C or POSIX locale. This is because the POSIX standard specifies the precise output of many commands in the C locale, and no localization is allowed in this case.

You can get the desired behaviour by using an English locale (such as en_US.US-ASCII - note: you have to create this locale first, using 'localedef'). You build the message catalog for this locale using the 'msgen' command. It can contain UTF-8 in both the msgid and the msgstr; the gettext() library function will take care of converting many common UTF-8 characters to ASCII when the locale's encoding is ASCII.

Bruno
Re: How to detect the encoding of a string?
Simos Xenitellis wrote:

Is there a library or sample program that can do such an encoding detection based on short strings of unknown encoding (or to choose from encodings based on a smaller list than iconv --list)?

It's very unfortunate that the encoding of the filenames is not specified in the central_directory_file_header in unzip.h. So the best you can do is to fall back on heuristics, based on these three bits of information:
1) the version_made_by[1] field, which contains the OS on which the zip file was made,
2) the locale (especially language) of the user who attempts to extract the zip,
3) the set of filenames in the zip file.

Here's how you can use this information to do something meaningful:

1) You know that AMIGA used the ISO-8859-1 encoding, ATARI used the ATARIST encoding, FS_NTFS and FS_VFAT use preferably Windows encodings, BEOS uses UTF-8, MAC uses the MAC-* specific encodings, MAC_OSX uses UTF-8 in decomposed normal form.

2) Assuming that the language of the person who extracts the zip often matches the language of the one who created it, you can set up a list of encodings to try:

  Afrikaans      UTF-8 ISO-8859-15 ISO-8859-1
  Albanian       UTF-8 ISO-8859-15 ISO-8859-1
  Arabic         UTF-8 ISO-8859-6 CP1256
  Armenian       UTF-8 ARMSCII-8
  Basque         UTF-8 ISO-8859-15 ISO-8859-1
  Breton         UTF-8 ISO-8859-15 ISO-8859-1
  Bulgarian      UTF-8 ISO-8859-5
  Byelorussian   UTF-8 ISO-8859-5
  Catalan        UTF-8 ISO-8859-15 ISO-8859-1
  Chinese        UTF-8 GB18030 CP936 CP950 BIG5 BIG5-HKSCS EUC-TW
  Cornish        UTF-8 ISO-8859-15 ISO-8859-1
  Croatian       UTF-8 ISO-8859-2
  Czech          UTF-8 ISO-8859-2
  Danish         UTF-8 ISO-8859-15 ISO-8859-1
  Dutch          UTF-8 ISO-8859-15 ISO-8859-1
  English        UTF-8 ISO-8859-15 ISO-8859-1
  Esperanto      UTF-8 ISO-8859-3
  Estonian       UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
  Faeroese       UTF-8 ISO-8859-15 ISO-8859-1
  Finnish        UTF-8 ISO-8859-15 ISO-8859-1
  French         UTF-8 ISO-8859-15 ISO-8859-1
  Frisian        UTF-8 ISO-8859-15 ISO-8859-1
  Galician       UTF-8 ISO-8859-15 ISO-8859-1
  Georgian       UTF-8 GEORGIAN-ACADEMY GEORGIAN-PS
  German         UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-2
  Greek          UTF-8 ISO-8859-7
  Greenlandic    UTF-8 ISO-8859-15 ISO-8859-1
  Hebrew         UTF-8 ISO-8859-8 CP1255
  Hungarian      UTF-8 ISO-8859-2
  Icelandic      UTF-8 ISO-8859-10 ISO-8859-15 ISO-8859-1
  Inuit          UTF-8 ISO-8859-10
  Irish          UTF-8 ISO-8859-14 ISO-8859-15 ISO-8859-1
  Italian        UTF-8 ISO-8859-15 ISO-8859-1
  Japanese       UTF-8 EUC-JP CP932
  Kazakh         UTF-8 PT154
  Korean         UTF-8 EUC-KR CP949 JOHAB
  Laotian        UTF-8 MULELAO-1 CP1133
  Latin          UTF-8 ISO-8859-15 ISO-8859-1
  Latvian        UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
  Lithuanian     UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
  Luxemburgish   UTF-8 ISO-8859-15 ISO-8859-1
  Macedonian     UTF-8 ISO-8859-5
  Maltese        UTF-8 ISO-8859-3
  Manx Gaelic    UTF-8 ISO-8859-14
  Norwegian      UTF-8 ISO-8859-15 ISO-8859-1
  Polish         UTF-8 ISO-8859-2 ISO-8859-13
  Portuguese     UTF-8 ISO-8859-15 ISO-8859-1
  Raeto-Romanic  UTF-8 ISO-8859-15 ISO-8859-1
  Romanian       UTF-8 ISO-8859-16
  Russian        UTF-8 KOI8-R ISO-8859-5 KOI8-RU
  Sami           UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
  Scottish       UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-14
  Serbian        UTF-8 ISO-8859-5
  Slovak         UTF-8 ISO-8859-2
  Slovenian      UTF-8 ISO-8859-2
  Sorbian        UTF-8 ISO-8859-2
  Spanish        UTF-8 ISO-8859-15 ISO-8859-1
  Swedish        UTF-8 ISO-8859-15 ISO-8859-1
  Tajik          UTF-8 KOI8-T
  Thai           UTF-8 ISO-8859-11 TIS-620 CP874
  Turkish        UTF-8 ISO-8859-9
  Ukrainian      UTF-8 KOI8-U ISO-8859-5
  Vietnamese     UTF-8 VISCII TCVN CP1258
  Welsh          UTF-8 ISO-8859-14

3) Look at the set of file names in the zip. If they _all_ happen to be in UTF-8, you can assume that's it (because there are very few meaningful strings which look like UTF-8 but aren't). Then go ahead similarly for the other encodings.

Furthermore, for Chinese, you can use frequency-of-characters based techniques such as
  http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
  http://kamares.ucsd.edu/~arobert/hanziData.html
  http://www.mandarintools.com/codeguess.html

Bruno
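Step 3 can be sketched as a plain validity check (my illustration; the strictness, rejecting overlong forms, surrogates, and values above U+10FFFF, is what makes "looks like UTF-8" a reliable signal):

```c
#include <stddef.h>

/* Return 1 if the byte string is well-formed UTF-8 (no overlong
   forms, no surrogates, maximum U+10FFFF), else 0. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len;
        unsigned int cp, min;
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;                     /* invalid lead byte */
        if (i + len > n)
            return 0;                      /* truncated sequence */
        for (size_t j = 1; j < len; j++) {
            if ((s[i + j] & 0xC0) != 0x80)
                return 0;                  /* bad continuation byte */
            cp = (cp << 6) | (s[i + j] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                      /* overlong, too big, surrogate */
        i += len;
    }
    return 1;
}
```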
Re: How to detect the encoding of a string?
Abel Cheung wrote:

(because there are very few meaningful strings which look like UTF-8 but aren't).

Yes, that's rare, though a real-world case has happened before, especially with multibyte characters. Here is a sample: http://qa.mandrakesoft.com/show_bug.cgi?id=3935

Yes. It's a heuristic, and heuristics are always buggy. The programmer has to weigh the benefit for the many users for which it just works against the problem that it will cause for a few ones. In this case, when the heuristic doesn't work, the result will be a filename that is garbage, and a different garbage than if no heuristic took place.

Bruno
Re: CSets 1.8 released
Michael B Allen wrote:

I didn't realize there could be so many differences. Why is that? Are these just mistakes? I mean, if Mac-Cyrillic is what it is on a Macintosh, how can glibc-2.3 just decide to change the mapping for 0xB6?

Some of the differences are because the character sets evolve: A new version of a Macintosh comes with new fonts, and suddenly a few particular, rarely used code points correspond to different glyphs. Even standardized character sets like ISO-8859-8 evolve over time.

Some of the differences are because the mapping to Unicode is done by independent vendors, based on glyph tables. Characters like OHM SIGN and GREEK CAPITAL LETTER OMEGA look very similar.

Some of the differences are because many vendors have to handle backward compatibility problems that other vendors don't have.

Some of the differences are just mistakes and bugs: Many charset converters are shipped without having been tested with a testsuite.

Bruno
Re: CSets 1.8 released
Mark Leisher wrote:

CSets is a collection of mapping tables between Unicode and 48 different character encodings. ... http://crl.nmsu.edu/~mleisher/csets.html

A repository for the more frequently used charset encoding tables, with emphasis on the variations found in the various implementations, is at http://www.haible.de/bruno/charsets/conversion-tables/index.html

Bruno
Re: Weird behaviour of emacs
David Sumbler wrote:

If I save the file in emacs-mule format, a lower case 'alpha' appears as bytes [92 a6 c1] in case (a), and [9c f4 a7 b1] in case (b). Other characters show similar differences. I've spent weeks trying to solve this, without success. Can someone point me in the direction of an explanation and/or solution?

The explanation: This is a well-known design flaw of Mule in Emacs/XEmacs.

Possibly the solution: The emacs-unicode[-2] branch of the Emacs CVS.

Bruno
Re: character width in terminal
Egmont Koblinger asked:

- Where can I find a specification of the terminal width of each and every Unicode character?

http://www.unicode.org/reports/tr11/ and the Unicode character database 4.1.

- Is glibc's wcwidth() considered to be a good implementation?

Yes. Note that for characters with ambiguous width (where the width is 1 in European contexts and 2 in Japanese contexts) it returns 1.

- What about the cases where it returns -1, including U+0603 mentioned above?

-1 is returned for control characters and similar, where the cursor movement is not predictable.

- Is it clearly a bug in the terminal emulator (gnome-terminal/vte) if it moves the cursor for a character whose wcwidth is zero? (I guess it is, and I found it in gnome's bugzilla as #162262.)

Yes. A terminal emulator is supposed to display these zero-width and combining characters in a way that doesn't move the cursor.

- Is it documented somewhere what a terminal emulator should do if it receives a character whose wcwidth equals -1?

These are control characters. For some, like U+000A, the semantics is clear; for others, it is unknown.

- What shall a terminal emulator do with the cursor position if it receives a character that is not assigned and known that it won't be assigned?

Undefined behaviour.

- Or when it receives a character that is not yet assigned?

It should assume that it is a normal graphic character whose width is 1, 2, or 0, depending on the numeric code of the character. For example, the characters U+20000..U+2FFFD and U+30000..U+3FFFD all have width 2, although many of them are not yet assigned.

Bruno
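As an illustration of how an application (not a terminal emulator) might apply wcwidth() to whole strings (the helper is mine, not a glibc function; characters with wcwidth() == -1 are simply counted as 0 here, which a real terminal emulator cannot get away with):

```c
#define _XOPEN_SOURCE 700   /* for wcwidth() on glibc */
#include <wchar.h>

/* Number of terminal cells a wide string occupies, summing
   wcwidth() over the characters and skipping control characters. */
int string_cells(const wchar_t *s)
{
    int cells = 0;
    for (; *s != L'\0'; s++) {
        int w = wcwidth(*s);
        if (w > 0)
            cells += w;
    }
    return cells;
}
```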
Re: wcsftime output encoding
Roger Leigh wrote:

Viewed as hexadecimal (aligned for comparison):

  Narrow UTF-8:            d0 9f  d1 82  d0 bd
  In UCS-4 these would be: 041F   0442   043D
  Wide (unknown):          1f     42     3d

So you can see that it simply used the low 8 bits of every UCS-4 character. Which is broken.

Before reporting this as a bug to the GCC people, you might want to find out whether it's a bug in std::wcsftime or a bug in the std::wcout stream.

Bruno
Re: gcc and utf-8 source
srintuar wrote:

1) For printf("%s\n", "Schöne Grüße"); ...

Being that UTF-8 is sort of an endpoint in the evolution of encodings, I also consider option 1 to be perfectly valid.

I would be careful with such statements. We don't know what the successor of UTF-8 might look like, nor when it will appear (in 6 years? 10 years? 15 years?). But predictions like "A personal computer will never need more than 640 KB of RAM" have too frequently turned out to be wrong.

Bruno
Re: gcc and utf-8 source
Egmont Koblinger wrote:

I was reading Markus's page and found the example:

  printf("%ls\n", L"Schöne Grüße");

and noticed that gcc always interprets the source code according to Latin-1.

gcc-3.4's documentation contains the following:

`-fexec-charset=CHARSET'
     Set the execution character set, used for string and character
     constants.  The default is UTF-8.  CHARSET can be any encoding
     supported by the system's `iconv' library routine.

`-fwide-exec-charset=CHARSET'
     Set the wide execution character set, used for wide string and
     character constants.  The default is UTF-32 or UTF-16, whichever
     corresponds to the width of `wchar_t'.  As with `-ftarget-charset',
     CHARSET can be any encoding supported by the system's `iconv'
     library routine; however, you will have problems with encodings
     that do not fit exactly in `wchar_t'.

`-finput-charset=CHARSET'
     Set the input character set, used for translation from the
     character set of the input file to the source character set used
     by GCC.  If the locale does not specify, or GCC cannot get this
     information from the locale, the default is UTF-8.  This can be
     overridden by either the locale or this command line option.
     Currently the command line option takes precedence if there's a
     conflict.  CHARSET can be any encoding supported by the system's
     `iconv' library routine.

and these options work fine for me.

However, these gcc options are normally not usable for portable programs. This is because

1) For printf("%s\n", "Schöne Grüße");

Many Linux users work in a UTF-8 locale, many others work in a pre-Unicode locale. Do you want to ship two executables, one produced with -fexec-charset=UTF-8 and one with -fexec-charset=ISO-8859-2?

2) For printf("%ls\n", L"Schöne Grüße");

On Solaris, FreeBSD and others, the wide character encoding is locale dependent and not documented. Therefore there is no good choice for the -fwide-exec-charset option that you could make.

The portable solution is to use gettext:

  printf("%s\n", gettext ("Schoene Gruesse"));
or
  printf("%s\n", gettext ("Greetings"));

This works on all platforms, with all compilers, and furthermore allows the program to be localized.

OTOH, if you limit yourself to Linux systems and don't want your programs to be portable or internationalized, you can now use option 2.

Bruno
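A minimal sketch of the portable pattern (the domain name "myprog" and the locale directory are placeholder values; with no message catalog installed, gettext() falls back to returning its argument unchanged):

```c
#include <libintl.h>
#include <locale.h>

/* Initialize gettext for a hypothetical program and fetch one
   translatable string. */
const char *greeting(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain("myprog", "/usr/share/locale");
    textdomain("myprog");
    return gettext("Greetings");
}
```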
Re: char * to unicode/UTF string
Tomohiro KUBOTA wrote:

Please use nl_langinfo(CODESET) for the encoding name of a char* string, because the encoding of char* strings depends on the locale. On most GNU-based systems it is available. You have to call setlocale() in advance.

  iconv_t ic = iconv_open("UTF-8", nl_langinfo(CODESET));

Right. And when you use GNU libc or GNU libiconv but your platform lacks nl_langinfo(CODESET) (like for example FreeBSD 4), then you can use the alias "char" instead. It has the same meaning: the locale dependent char* encoding:

  iconv_t ic = iconv_open("UTF-8", "char");

Bruno
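Putting the two pieces above together (a sketch; the function name is mine, and the point is that setlocale() must run first, otherwise nl_langinfo(CODESET) describes the "C" locale instead of the user's locale):

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>

/* Open an iconv descriptor from the locale's char* encoding to
   UTF-8.  Returns (iconv_t) -1 on failure, like iconv_open(). */
iconv_t open_locale_to_utf8(void)
{
    setlocale(LC_ALL, "");
    return iconv_open("UTF-8", nl_langinfo(CODESET));
}
```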
Re: Standardized encoding names for iconv_open()
Markus Kuhn wrote:

In general, the POSIX definition of iconv_open() would become *much* more useful if it actually specified a couple of encoding strings, and what exactly they mean.

I second that. Java has a similar minimal supported set of encodings in its conversion facility.

  ""        multi-byte encoding of current LC_CTYPE locale
  UTF-8     UTF-8 (with overlong sequences being illegal)
  UTF-16    UTF-16 (same byte order as C's short)
  UTF-16BE  UTF-16 BigEndian
  UTF-16LE  UTF-16 LittleEndian
  UTF-32    UTF-32 (same byte order as C's long)
  ...

UTF-16 and UTF-32 are defined differently than "same byte order as C's short", in RFC 2781. It's better to refer to their lengthy definition in RFC 2781.

and perhaps even EUC-JP, EUC-KR, EUC-TW, GB18030

I don't think there is a normative, widely used definition of EUC-TW. And for GB18030, the fact that its official definition is in Chinese, not English, doesn't prevent different implementations by different vendors.

Bruno
Re: iconv limitations
srintuar wrote: The knowledge of how to detect a null in a stateful encoding is not necessarily trivial. If there was a function which could return the unit-word-size of any encoding accepted by iconv, ... Here is how to write such a function: Given the unknown encoding, 1. convert "\000" from UTF-8 to the given encoding, 2. convert "\000\000" from UTF-8 to the given encoding, 3. return the difference of the lengths (measured in bytes) of the two results. 4. If the encoding is UTF-7, this does not work. Here return 1 instead. The corresponding CLISP code:

  (defun encoding-zeroes (encoding)
    (let ((name (ext:encoding-charset encoding))
          (table #.(make-hash-table :test #'equal
                     :initial-contents '(("UTF-7" . 1))))
          (tester #.(make-string 2 :initial-element (code-char 0))))
      (or (gethash name table)
          (setf (gethash name table)
                (- (length (ext:convert-string-to-bytes tester encoding))
                   (length (ext:convert-string-to-bytes tester encoding :end 1)))))))

Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
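The same recipe can be written against the iconv API directly. A sketch; encoding_zeroes is a made-up name, and like the Lisp version it special-cases UTF-7:

```c
#include <iconv.h>
#include <string.h>

/* Number of bytes a NUL character occupies in the given encoding:
   convert one and two U+0000 characters from UTF-8 and compare the
   output lengths. Returns -1 if the encoding is unknown to iconv. */
int encoding_zeroes(const char *encoding)
{
  if (strcmp(encoding, "UTF-7") == 0)
    return 1;                         /* step 4 above: UTF-7 special case */

  iconv_t cd = iconv_open(encoding, "UTF-8");
  if (cd == (iconv_t)(-1))
    return -1;

  char out1[16], out2[16];
  size_t len1, len2;
  {
    char in[1] = { 0 };               /* one U+0000 */
    char *inptr = in, *outptr = out1;
    size_t inleft = 1, outleft = sizeof out1;
    iconv(cd, &inptr, &inleft, &outptr, &outleft);
    len1 = outptr - out1;
  }
  iconv(cd, NULL, NULL, NULL, NULL);  /* back to the initial shift state */
  {
    char in[2] = { 0, 0 };            /* two U+0000 */
    char *inptr = in, *outptr = out2;
    size_t inleft = 2, outleft = sizeof out2;
    iconv(cd, &inptr, &inleft, &outptr, &outleft);
    len2 = outptr - out2;
  }
  iconv_close(cd);
  return (int)(len2 - len1);
}
```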
Re: iconv limitations
Michael B Allen wrote: Shift-JIS has embedded nulls. I don't think this is true. Shift_JIS is a multibyte encoding. It has the property that some bytes in the ASCII range (such as 'x' or '\') can occur as part of non-ASCII characters. But 0x00 cannot occur as part of a double-byte character. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1
Markus Kuhn wrote: I believe the only practical solution for this problem is to implement BACKSPACE in UTF-8 terminal emulators such that it moves one *character* to the left, not one *cell*. I agree. The objects being displayed are characters. It does not make sense for a user or for applications to position the cursor in the middle of a character, or after 1/3 or 2/3 of a character. We have little choice if we want to keep the kernel free of locale-dependent monsters such as wcwidth(). There is also the problem of the TAB: Currently linux/drivers/char/n_tty.c also transforms a TAB to a sequence of spaces, and an erase of a TAB to a sequence of BACKSPACEs. If we keep it this way, the kernel must still learn to distinguish single-width and double-width characters, in order to keep a notion of the current column number. What is the reason for treating TAB at the TTY level? Why can't TAB be treated like a graphic character of unknown width and be passed to the device driver unchanged? Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: GB18030
Jan Willem Stumpel wrote: What was wrong with UTF-8, one wonders (rhetorical question, don't really want to know the answer because it is probably very complicated). UTF-8 is upward compatible with ASCII, but the Chinese government wanted something that is upward compatible with GB2312, and thus they created GB18030. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: grep is horriby slow in UTF-8 locales
Markus Kuhn wrote: b) relying entirely on ISO C's generic multi-byte functions, to make sure that even stateful monsters like the ISO 2022 encodings are supported equally. Use of mbrlen is not done because of ISO 2022 encodings (which are not usable as locale encodings!), but because of the non-UTF-8 multibyte encodings: EUC-JP, Big5, GB18030 etc. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Uppercase string: broken tr?
Bob Proulx wrote: But sed and tr and other utilities just use the locale data provided on the system by glibc among other places. These programs are table driven by tables that are not part of these programs. This is why locale problems are global problems across the entire system of programs such as grep, sed, awk, tr, etc. or anything else that uses the locale data. The glibc locale data for 'ABÇ' has been correct in all locales since 2000, and is covered by glibc's testsuite. Before blaming glibc, you should make up a standalone test program that shows the glibc problem. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Uppercase string: broken tr?
Alex J. Dam wrote: $ echo 'ABÇ' | tr '[:upper:]' '[:lower:]' gives me abÇ (the last character is an uppercase cedilla) I expected its output to be: abç Am I doing something wrong? No, your expectations match what POSIX specifies. Is tr (version 2.1) broken? Yes, and even the i18n patches from IBM http://oss.software.ibm.com/developer/opensource/linux/patches/?patch_id=24 contain no fix for it. It happens with sed, too. $ echo 'ABÇ' | sed -e 's,\(.*\),\L\1\E,' abÇ Yes, this seems like a bug in GNU sed 4.0.3. I'm CCing bug-coreutils and the sed maintainer, so the maintainers can do something about it. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: To maintainer of the list
Wu Yongwei suggested that, to get rid of spam and worms, this list be made subscriber-only. This is now implemented. Sorry for the inconvenience that this will cause to well-behaved people who are not subscribed. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Wide character APIs
Michael B Allen said: Since Win32 is one of my target systems I need wide character support. But Win32 doesn't have reasonable wide characters. They have a 16-bit type called 'wchar_t' which cannot accommodate all characters since Unicode 3.1. So what they will likely end up doing is to use UTF-16 as an encoding for 'wchar_t *' strings, which means that wchar_t doesn't represent a *character* any more - it represents an UTF-16 memory unit. Is there a serious flaw with wchar_t on Linux? wchar_t by itself is OK on Linux (it's 32-bit wide). But the functions fgetwc() and fgetws() - as specified by ISO C 99 and POSIX:2001 - have a big drawback: When you use them, and the input stream/file is not in the expected encoding, you have no way to determine the invalid byte sequence and do some corrective action. Using these functions has the effect that your program becomes "garbage in - more garbage out" or "garbage in - abort". You need to use multibyte strings in order to get some decent program behaviour in the presence of invalid multibyte contents of streams/files. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Strings in a programming language
Hi Marcin, Most languages take 3, as I understand Perl it takes the mix of 3 and 2, and Python has both 3 and 1. I think I will take 1, but I need advice: - Don't look at Perl in this case - Perl has the handicap that for historical reasons it cannot make a clear distinction between byte arrays (= binary data) and character arrays (= strings = text). Python's way of doing it - byte arrays are automatically converted to character arrays when there is a need to - is OK when you consider what Python 1.5 looked like. But for a freshly designed language it'd be an unnecessary complexity. In Lisp (Common Lisp - Scheme guys appear not to care about Unicode or i18n) the common approach is to have one or two flavours of strings, namely strings containing Unicode characters, and possibly a second flavour, strings containing only ISO-8859-1 characters. Conversion is done during I/O. The Lisp 'open' function has had an argument 'external-format' since 1984 or 1986 at least; nowadays a combination of the encoding and the newline convention (Mac CR, Dos CRLF or Unix LF) gets passed here. You find details here: - GNU clisp http://clisp.sourceforge.net/impnotes/encoding.html http://clisp.sourceforge.net/impnotes/stream-dict.html#open - Allegro Common Lisp http://www.franz.com/support/documentation/6.2/doc/iacl.htm - Liquid Common Lisp http://www.lispworks.com/reference/lcl50/ics/ics-1.html - LispWorks http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-95.htm#pgfId-886156 http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-101.htm#98500 http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-76.htm#pgfId-902973 and some older details at http://www.cliki.net/Unicode%20Support 1. Strings are in UTF-32 on Unix Since strings are immutable in your language, you can also represent strings as UCS-2 or ISO-8859-1 if possible; this saves 75% of the memory in many cases, at the cost of a little more expensive element access. and UTF-16 on Windows. 
They are recoded on the fly during I/O and communication with an OS (e.g. in filenames), with some recoding framework to be designed. Why not use UTF-32 as the internal representation on Windows as well? I mean, once you have decided to put in place a conversion layer for I/O, this conversion layer can convert to UTF-16 on Windows. What you gain: you have the same internal representation on all platforms. 2. Strings are in UTF-8, otherwise it's the same as the above. The programmer can create malformed strings, they use byte offsets for indexing. Unless you provide some built-in language construct for safely iterating across a string, like a for (c across-string: str) statement, this would be too cumbersome for the user who is not aware of i18n. - How should the conversion API look like? Are there other such APIs which I can look at? It should permit interfacing with iconv and other platform-specific converters, and with C/C++ libraries which use various conventions (locale-based encoding in most, UTF-16 in Qt, UTF-8 in Gtk). The API typically has a data type 'external-format', consisting of EOL and encoding. Then you have some functions for creating streams (to files, pipes, sockets) which all take an 'external-format' argument. Furthermore you need some functions for converting a string from/to a byte sequence using an 'external-format'. (These can be methods on the 'external-format' object.) - What other issues will I encounter? People will want to switch the 'external-format' of a stream on the fly, because in some protocols like HTTP some part of the data is binary and other parts are text in a given encoding. The language is most similar to Dylan, but let's assume its purpose will be like Python's. It will have a compiler which produces C code The following article might be interesting for you. http://www.elwoodcorp.com/eclipse/papers/lugm98/lisp-c.html Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Wide character APIs
Michael B Allen wrote: I didn't know wchar_t was supposed to be able to represent an entire character. If wchar_t is not an entire character, the functions defined in wctype.h, like iswprint(), make no sense. And indeed, on Windows with UTF-16 as encoding of 'wchar_t *' strings, they make no sense. This is good to know. I have been avoiding those functions and converting to/from the locale encoding internally using mbstowcs and wcstombs. From the point of view of robustness versus malformed input, mbstowcs() is just as bad as fgetwc(). The only function that really helps is mbrtowc(). But no one answered my original question; why are the format specifiers for wide character functions different? Here's the answer: So that a given format specifier corresponds to a given argument type.

  Format specifier   Argument type
  %d                 int
  %s                 char *
  %ls                wchar_t *
  %c                 int (promoted from char)
  %lc                wint_t (promoted from wchar_t)

Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
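Here is what a robust decoding loop built on mbrtowc() can look like. A sketch only; count_chars is a made-up helper, and the skip-one-byte recovery policy is just one possible choice:

```c
#include <string.h>
#include <wchar.h>

/* Decode a multibyte buffer with mbrtowc(), recovering from invalid
   input instead of aborting: each undecodable byte is skipped and
   counted. Returns the number of characters decoded; *badbytes gets
   the number of bytes that could not be decoded.
   Uses the current LC_CTYPE locale. */
size_t count_chars(const char *buf, size_t len, size_t *badbytes)
{
  mbstate_t state;
  memset(&state, 0, sizeof state);
  size_t chars = 0;
  *badbytes = 0;

  while (len > 0) {
    wchar_t wc;
    size_t n = mbrtowc(&wc, buf, len, &state);
    if (n == (size_t)(-1) || n == (size_t)(-2)) {
      /* Invalid or truncated sequence: here we can see exactly where
         the garbage is -- skip one byte and resynchronize. */
      (*badbytes)++;
      buf++; len--;
      memset(&state, 0, sizeof state);
    } else {
      if (n == 0) n = 1;   /* mbrtowc returns 0 for the null character */
      chars++;
      buf += n; len -= n;
    }
  }
  return chars;
}
```

With fgetwc() the "invalid or truncated" branch is unreachable: the stream just returns WEOF.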
Re: mbrtowc with dlopen doesn't work?
Michael B Allen wrote: I was using an 'n' limit parameter of INT_MAX. Limiting this to 0x appears to solve the problem. ... but it is still wrong. The ISO C and POSIX specification of mbrtowc() [http://www.opengroup.org/onlinepubs/007904975/functions/mbrtowc.html] implies that the mbrtowc() function is free to look at 'n' bytes, starting from the beginning of the string. In other words, the caller of the function has to guarantee that 'n' bytes can be accessed. Passing blindly n = INT_MAX can crash your program. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: [Translation-i18n] Re: Proposal for declinations in gettext
Yann Dirson wrote: it is difficult in some cases to find unique English strings that can be mapped one to one in all languages. A common technique is to use a context marker in the msgid string, like this: my_gettext ("[menu item]Open") my_gettext ("[combobox item]Open") which translators can translate like this: msgid "[menu item]Open" msgstr "Ouvrir" msgid "[combobox item]Open" msgstr "Ouvert" The my_gettext function calls gettext and, if it returns the untranslated string, strips the "[...]" prefix. See also the gettext documentation, section "GUI program problems". The only problem (quite small, IMO) with this approach is that translators must be made aware where the context marker ends. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
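A minimal my_gettext could look like this. A sketch; it relies on the behaviour, used by the examples in the gettext documentation, that gettext() returns its argument pointer unchanged when no translation is found:

```c
#include <libintl.h>
#include <string.h>

/* Context-marker wrapper described above: the msgid carries a
   "[context]" prefix; if gettext() finds no translation and hands the
   msgid back, the prefix is stripped before the user sees the string. */
const char *my_gettext(const char *msgid)
{
  const char *translated = gettext(msgid);
  if (translated == msgid && msgid[0] == '[') {
    /* Untranslated: drop everything up to and including the ']'. */
    const char *end = strchr(msgid, ']');
    if (end != NULL)
      return end + 1;
  }
  return translated;
}
```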
Re: [Translation-i18n] Proposal for declinations in gettext
Danilo Segan wrote: The usual practice among English-speaking programmers is to compose strings out of smaller parts. You need to educate the programmer to use entire sentences. You can refer them to the gettext documentation, section "Preparing Translatable Strings". http://www.gnu.org/manual/gettext/html_chapter/gettext_3.html#SEC15 The reason is that in most languages sentences are not composed by juxtaposition, as in English: - For Serbian, you have given examples. - In many languages, a verb's form is spelled differently depending on the gender of the subject. - In Latin, the combiner "and" comes as a suffix "-que". - Etc. etc. The translation for "Workspace %d" would look like: msgid "Workspace %d" msgstr0 "der Workspace %d" msgstr1 "das Workspace %d" msgstr2 "dem Workspace %d" msgstr3 "den Workspace %d" So, the title of workspace 5 would be "der Workspace 5", while the menu which allows switching to that workspace would read "Switch to den Workspace 5". There are more bits of context that influence a translation than just a declination. For example, the beginning of a sentence is special. To pursue your example, an English programmer would be tempted to write "%0s is empty." which would have the German translation "%0s ist leer." and result in the final string "der Workspace %d ist leer." which is wrong because, in German, all sentences must start with a capital letter. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Hello world in UTF-8/X11
Manel de la Rosa writes: I don't need a complex rendering system or anything killer. Simply display a label with a UTF-8 encoded string. This is a contradiction in itself. The purpose of UTF-8 is that it can be used for languages from Russian over Vietnamese to Indic. This needs a complex rendering engine: for Russian you already need fonts in non-ISO-8859-1 encoding; for Vietnamese you need to attach multiple accents to a single letter, and for Indic (Devanagari etc.) you need vowel reordering. Not to mention right-to-left reordering (Hebrew, Arabic, Farsi), the problem of choosing the right fonts, and dealing with the subtleties of these fonts. Only two free GUI toolkits have the rendering engines today: Qt/KDE and GNOME. Also Mozilla and (to a more limited extent) GNU Emacs have some rendering engines, but not embedded in a GUI toolkit. With Motif/Lesstif you cannot go further than displaying Russian. There are no internationalization efforts underway there. (Except there is a complex rendering engine underway at the low X11 level, by Sun, http://stsf.sourceforge.net/, but I have no idea how easy it will be to use it when it is finished, and whether the Motif adaptation will be freely distributable.) So my recommendation is: Drop Motif, and use KDE/Qt (if the GPL is acceptable for your program) or GNOME. Qt has a module that helps in migrating from Motif to Qt. with a short X11/UTF-8 Hello World example, for instance Can't be done. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: redhat 8.0 - using locales
Maiorana, Jason writes: A few files appear under LC_MESSAGES, but it seems they don't show up even when LANG=eo. First, you need to have a locale, maybe eo_ES or so. Second, in the LANGUAGE environment variable, but not in the LANG environment variable, LL_CC combinations can be abbreviated as LL to denote the language's main dialect. So you should use LANG=eo_ES, not simply LANG=eo. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ``A Short Into ...'' - comments, suggestions?
Brian Foster writes: Suppose such a file is being opened. What bytes are passed as the name of the file? This is an unknown. It obviously depends on the Java/JVM implementation. The Sun Java 1.3 interprets the filenames on the file system according to the locale. This means, in an UTF-8 locale the file names are UTF-8, and in an ISO-8859-1 locale it replaces unencodable characters with question marks, while doing the conversion from Java String to filename. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ``A Short Into ...'' - comments, suggestions?
Sandip Bhattacharya writes: The Sun Java 1.3 interprets the filenames on the file system according to the locale Can you explain what you mean by interprets? Any encoded filename is just a sequence of bytes. Why should apps be concerned any further than that? On the filesystem the filename is just a sequence of bytes. Inside Java, a filename is a String, i.e. a sequence of Unicode characters. Which you can display, for example in a graphical file chooser. So there must be some conversion between the Linux notion of filename and the Java notion of filename. And this conversion works perfectly according to the locale. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: filename and normalization
Mike FABIAN writes: .char - \N'45' because I found quite a few man pages which used just "-o" to write command line options of programs, not "\-o"; for example the man page of less does this. Without that hack, groff translates "-" into yet another variant of "-": U+2010 (HYPHEN). It's better to fix the man pages instead. The groff input language has had the distinction between "-" and "\-" for ages. In some cases (not in command line options!) HYPHENs look better than MINUS signs, therefore I want to be able to write man pages where "-" gives a HYPHEN. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: acroread in UTF-8 locale
Markus Kuhn writes: I did report the third issue (acroread breaking in UTF-8 locales) to Adobe multiple times, but no reaction yet. I suspect it might be an issue with the widget library they use, and acroread ought in my opinion to ignore the locale entirely, as it has no locale-dependent functionality anyway. In my experience, they have a problem only with the LC_NUMERIC part of the locale, and only with some PDF documents. And it can be worked around by adding a single line to the 'acroread' shell script. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Linux and UTF8 filenames
Glenn Maynard writes: convert its filenames using the kernel nls modules; yes, it could be done. But it would be somewhat tricky, since filenames need to be 8-bit clean except for / and NUL. It's a bag of worms with very little value ... This is a non-issue. All locale encodings used on Linux, from ISO-8859-* over BIG5 to GB18030, use the bytes 0x2f and 0x00 only for '/' and '\0' respectively. The '/' is a problem with ISO-2022 based encodings, but no one with a brain in his head uses them as locale encodings. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Lazy man's UTF8
Robert de Bath writes: Mr. Lazy knows about wide characters and thinks they're a pain, especially for already existing code. Sure. And furthermore some of them are unreliable: when you use wprintf you don't know whether it failed because the disk was full, because there was a conversion error, or because the stdio stream was byte oriented. iconv() is _fairly_ easy to use; the problem isn't that it's difficult, just that there's a lot you have to remember to do for a function that appears (at first) to have a simple job. Have a look at the libunistr part of http://www.haible.de/bruno/gnu/libunistring-0.0.tar.gz Its unistr.h file declares simple functions for simple tasks - even though under the hood many of them are based on iconv. I don't think there's any support for 'character' counting as opposed to 'display cell' counting. In libunistring: u8_strlen vs. u8_strwidth. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Lazy man's UTF8
Glenn Maynard writes: Giving wchar_t to iconv isn't portable, though, is it? It is supported by glibc and GNU libiconv, and libiconv is portable. Hmm. Another thing, while we're on iconv: How do you get the number of non-reversible conversions when -1/E2BIG is returned? It seems that converting blocks into a small output buffer (eg. taking advantage of E2BIG) means that count is lost. Seems so, yes. But you can do one round of conversion to see how large you have to make your buffer, and then in the second round you are safe from E2BIG. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
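The two-round approach can be sketched like this. convert_all is a made-up helper; stateless target encodings are assumed, so no final shift sequence needs flushing:

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Round 1: convert into a small scratch buffer, absorbing E2BIG, just
   to measure the total output size. Round 2: convert for real into an
   exactly-sized buffer, where E2BIG can no longer occur.
   Returns a malloc()ed buffer (length in *outlen), or NULL on error. */
char *convert_all(iconv_t cd, const char *in, size_t inlen, size_t *outlen)
{
  size_t total = 0;
  {
    char scratch[64];
    char *inptr = (char *) in;
    size_t inleft = inlen;
    iconv(cd, NULL, NULL, NULL, NULL);          /* reset to initial state */
    while (inleft > 0) {
      char *outptr = scratch;
      size_t outleft = sizeof scratch;
      size_t r = iconv(cd, &inptr, &inleft, &outptr, &outleft);
      total += outptr - scratch;
      if (r == (size_t)(-1) && errno != E2BIG)
        return NULL;                            /* real conversion error */
    }
  }
  char *out = malloc(total ? total : 1);
  if (out == NULL)
    return NULL;
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen, outleft = total;
  iconv(cd, NULL, NULL, NULL, NULL);
  if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)(-1)) {
    free(out);
    return NULL;
  }
  *outlen = total;
  return out;
}
```

During round 2 the non-reversible conversion count returned by iconv() is not lost to E2BIG.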
Re: very small idea
Mike Fabian writes: I tried to cut and paste between gvim, mlterm, xterm, XEmacs, kedit. Worked in all directions without problems with UTF-8 encoded Japanese text. Can you tell me how to reproduce a situation where it doesn't work and where the patch helps? Try with Netscape Communicator. It's one of those clients which support only UTF8_STRING and not COMPOUND_TEXT. Whereas Emacs is one of those clients which support only COMPOUND_TEXT and not UTF8_STRING. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Linux and UTF8 filenames
Martin Kochanski writes: how can a poor innocent server discover enough about the context in which it is running to know what filename it has to use so that a user who lists a file directory will see "Rêve" on his screen? Since it depends on the user's locale, you'll have to convert the filename from the given encoding to the user's locale encoding. Start out with const char *given_encoding = "UTF-8"; // or "UTF-16", depends on what you have const char *localedependent = ""; // shortcut for glibc or libiconv iconv_t cd = iconv_open (localedependent, given_encoding); ... Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: [Fwd: unicode conversions]
[EMAIL PROTECTED] asks: i was wondering about libiconv: is there any plan to support a fall-back character when performing conversions, as opposed to always stopping conversion when a character with no destination representation is encountered? In general, providing fallback characters is the business of the caller of iconv(). The iconv() function's role is only to determine whether the input character is convertible to the output codeset, and if so, how. It would make sense to add a command line option to the iconv _program_ to force a question mark for unconvertible characters. It already has an option ('-c', most useful together with '-s') to omit unconvertible characters from the output. As a special case, glibc's iconv() function uses '?' as a fallback character if conversion is performed with transliteration (i.e. the target encoding has a //TRANSLIT suffix). Hrm, I was under the impression that converting from non-Unicode to Unicode was always possible. Yes it is, except for a few border cases like Inuktitut characters or some rare Chinese ideographs, which therefore are mapped to Unicode private use areas until they have been officially added to Unicode. Unfortunately, while experimenting with my system iconv, it appears to instead stop when there is no destination encoding for a character, rather than allowing a fallback to a default character. Try 'iconv -c -s'. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ASCII and JIS X 0201 Roman - the backslash problem
Tomohiro KUBOTA writes: 3) For programs that interpret backslash as some kind of escape character and use Unicode internally but should work with text in Shift_JIS encoding, consider the multibyte character 0x5C as being the escape trigger, not [only] the Unicode character U+005C. This is already done in bash and gettext. For example, in GNU gettext, we have the code I think interpretation of U+00A5 as an additional escape character doesn't always work, because Unicode texts don't have information on their origin (converted from Shift_JIS or not). These are particular kinds of text files, which are fed to such programs that do backslash interpretation: shell scripts, awk scripts, gettext PO files, etc. - yes, if the Yen sign should appear there, it needs to be doubled. If U+00A5 were always an escape character, it would be harmful for much software. Why is it more harmful if U+00A5 is an escape character than if U+005C is an escape character? In both cases you just double it to get the original character. I am interested in how European people succeeded in migrating from ISO 646 variants to ISO 8859. The Yen Sign Problem is exactly a problem of ISO 646, because 0x5C = YEN SIGN comes from JIS X 0201 Roman, which is the Japanese variant of ISO 646. For me, the migration occurred when I switched to using a different computer with a different OS and a different character set. (From ISO646-DE to CP437 at that time.) Few files were transported - there is usually a lot of text files that you can just drop once in three years. Among the remaining ones the disambiguation was usually easy, depending on the type of file: in letters I only used umlauts and no brackets, whereas in programs I mostly used brackets and no umlauts. Only few programs contained both brackets and umlauts, and I had to do the fixup by hand, usually the next time I needed the particular program. So it is a minor annoyance over the time of a few months, but by far not the costs that you are estimating. 
Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: readline (was: Switching to UTF-8)
Markus Kuhn writes: There is also bash/readline SuSE 8.0 ships with a bash/readline that works fine with (at least) width 1 characters in an UTF-8 locale. There is also an alpha release of a readline version that attempts to handle single-width, double-width and zero-width characters in all multibyte locales. But it's alpha (read: it doesn't work for me yet). Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: JISX0213 mapping table
Gaspar Sinai writes:

  555c555
  < 0x12678 0x30D7
  ---
  > 0x12678 0x31F7 0x309A

If we use 0x30D7 we will clash with: Table 5 row 4 column 8 0x8376 0x2557 0x30D7 # 1-5-55 (55 == 0x37) Yes, this character is a 'small' variant of 0x30D7. I concede. Let's use 0x31F7 0x309A. It will be the task of the display engine to position the small circle at the right position. But what shall we do with 0x12B65 0xFFFD? Maybe another symbol added to Unicode Yi radicals? Can you move this issue to the unicode.org mailing list?

  7950c7951
  < 0x17624 0xFA3E
  ---
  > 0x17624 0x69EA

You are right. Let's use 0x69EA here. Also can you tell the unicode.org people to add this one to Unihan.txt? Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: JISX0213 mapping table
Gaspar Sinai writes: I would be glad if we could reconcile these files and come up with a common format while it is undefined by Unicode. The diff is quite small now.

  81,82c81,82
  < 0x12171 0x00A2
  < 0x12172 0x00A3
  ---
  > 0x12171 0xFFE0
  > 0x12172 0xFFE1
  138c138
  < 0x1224C 0x00AC
  ---
  > 0x1224C 0xFFE2

These are due to differences in the JISX0208 mapping. I use the one which was on unicode.org for years (now declared obsolete).

  148,149c148,149
  < 0x12256 0xFF5F
  < 0x12257 0xFF60
  ---
  > 0x12256 0x2985
  > 0x12257 0x2986

Look at the glyphs. I used http://ftp.ora.com/cjkvinfo/pdf/jisx0208+0213.pdf http://www.itscj.ipsj.or.jp/ISO-IR/ 228 and 229

  214c214
  < 0x1233A 0x2299
  ---
  > 0x1233A 0x29BF
  555c555
  < 0x12678 0x30D7
  ---
  > 0x12678 0x31F7 0x309A

These are indeed debatable.

  996,997c996,997
  < 0x12B65 0xFFFD
  < 0x12B66 0xA4A3
  ---
  > 0x12B65 0x02E9 0x02E5
  > 0x12B66 0x02E5 0x02E9

I don't understand how the glyphs of 0x02E9 and 0x02E5 can combine to the RISING SIGN or FALLING SIGN.

  7765a7766
  > 0x17427 ???

An unmapped code point. jisx0208+0213.pdf shows "reserved" at 0xEAA5. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: 3.2 MAPPINGS/EASTASIA
Tomohiro KUBOTA writes: http://www.jca.apc.org/~earthian/aozora/0213.html http://www.jca.apc.org/~earthian/aozora/0213/jisx0213code.zip http://www.cse.cuhk.edu.hk/~irg/ http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip Thanks a lot for these pointers! With this information, I can write a JISX0213 converter for glibc and libiconv. Strictly speaking, JIS X 0213:2000 *cannot* be defined as a mapping table against ISO 10646, because JIS X 0213's Han unification rule is different from ISO 10646's. (You know, Unicode added several tens of compatibility ideographs which are different characters in JIS X 0213's point of view and different glyphs of the same character in Unicode's point of view.) I'll make use of these 59 compatibility ideographs in the converter. That's the whole reason why they were introduced in Unicode 3.2. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: 3.2 MAPPINGS/EASTASIA
Markus Kuhn writes: it is now up to the maintainers of legacy encoding standards to define the relationship of their respective encodings to Unicode properly. The ISO 8859 authors have already done this in their second editions, and I understand that the latest editions of the relevant JIS standards also contain official ISO 10646 cross-reference tables. Does this also apply to JISX0213:2000? Do you know where to find the conversion tables for this character encoding? The PDF file in the ISO-IR registry contains only the pictures of each glyph, but no conversion table. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Is there a UTF-8 regex library?
David Starner writes: Does anyone know of a UTF-8 regex engine, preferably one that can be plugged into a GPL'ed C program easily? Yes, such a regex engine is contained in the glibc CVS (:pserver:[EMAIL PROTECTED]:/cvs/glibc/libc/posix) It works not only with UTF-8 but with all multibyte encodings. It was contributed by Isamu Hasegawa. An UTF-16 regex engine is available at http://crl.NMSU.Edu/~mleisher/download.html Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
gettext-0.11.1 is released
It is at ftp.gnu.org (soon also its mirrors) in gnu/gettext/gettext-0.11.1.tar.gz

New in 0.11.1:
* xgettext now also supports Python, Tcl, Awk and Glade.
* msgfmt can create (and msgunfmt can dump) Tcl message catalogs.
* msggrep has a new option -C that allows searching for strings in translator comments.
* Bug fixes in the gettext.m4 autoconf macros.

New in 0.11:
* New programs:
    msgattrib - attribute matching and manipulation on message catalog,
    msgcat - combines several message catalogs,
    msgconv - character set conversion for message catalog,
    msgen - create English message catalog,
    msgexec - process translations of message catalog,
    msgfilter - edit translations of message catalog,
    msggrep - pattern matching on message catalog,
    msginit - initialize a message catalog,
    msguniq - unify duplicate translations in message catalog.
* msgfmt can create (and msgunfmt can dump) Java ResourceBundles.
* xgettext now also supports Lisp, Emacs Lisp, librep, Java, ObjectPascal, YCP.
* The tools now know about format strings in languages other than C. They recognize new message flags named lisp-format, elisp-format, librep-format, smalltalk-format, java-format, python-format, ycp-format. When such a flag is present, the msgfmt program verifies the consistency of the translated and the untranslated format string.
* The msgfmt command line options have changed. Option -c now also checks the header entry, a check which was previously activated through -v. Option -C corresponds to the compatibility checks previously activated through -v -v. Option -v now only increases verbosity and doesn't influence whether msgfmt succeeds or fails. A new option --check-accelerators is useful for GUI menu item translations.
* msgcomm now writes its results to standard output by default. The options -d/--default-domain and -p/--output-dir have been removed.
* Manual pages for all the programs have been added.
* PO mode changes:
  - New key bindings for 'po-previous-fuzzy-entry', 'po-previous-obsolete-entry', 'po-previous-translated-entry', 'po-previous-untranslated', 'po-undo', 'po-other-window', and 'po-select-auxiliary'.
  - Support for merging two message catalogs, based on msgcat and ediff.
* A fuzzy attribute of the header entry of a message catalog is now ignored by the tools, i.e. it is used even if marked fuzzy.
* gettextize has a new option --intl which determines whether a copy of the intl directory is included in the package.
* The Makefile variable INTLLIBS is deprecated. It is replaced with LIBINTL (in projects without libtool) or LTLIBINTL (in projects with libtool).
* New packaging hints for binary package distributors. See file PACKAGING.
* New documentation sections:
  - Manipulating
  - po/LINGUAS
  - po/Makevars
  - lib/gettext.h
  - autoconf macros
  - Other Programming Languages

Happy internationalization! Bonne francisation! Frohes Eindeutschen! Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Statically link LGPL cp1252.h with MIT Licensed code?
Michael B Allen writes: Can I statically link one of the codepage headers (e.g. cp1252.h) from libiconv with an MIT-licensed module? I would not actually alter the file, of course, so a user could not modify the LGPL files in my module any more than if they had used libiconv directly. Legally speaking: cp1252.h is code, not a public header file. As long as you don't distribute the resulting binaries/libraries, you can link it with anything you want. If you want to distribute the result, however, it must all fall under the LGPL, which for binaries is roughly equivalent to the GPL. Namely, you must distribute the source of the whole binary/library. Practically speaking: It is on purpose that linking with libiconv as a shared library is encouraged, whereas linking with libiconv as a static library is not so welcome. The reason is that some people in the countries not yet well supported by character set standards (South Asia and Africa, for example) should have an opportunity to adapt their system to their needs. I need to be able to convert one character at a time and provide a substitution character if the conversion is invalid, or stop if some number of *characters* has been reached. You can do that by using libiconv unmodified. There are even two ways to do it: 1) You can make the conversion one character at a time, by offering one input byte to iconv(), then two bytes, and so on. Kind of slow, but works. 2) You can convert to an encoding where each character occupies a fixed number of bytes, like UCS-4, and specify an output buffer of precisely the size that can hold the number of characters that you need. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbscmp
Michael B Allen writes: Do the str* functions handle strings differently if the locale is different? It depends on the functions. strcpy strncpy strcat strncat strcmp strncmp strdup strchr strrchr strcspn strspn strpbrk strstr strtok: NO strcoll strxfrm: YES strcasecmp: YES but doesn't work in multibyte locales. For example, does strcmp work on UTF-8 strings? Not well. Better use strcoll. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbscmp
Pablo Saratxaga writes: strcoll() doesn't have multibyte problems ? No. In glibc-2.2 strcoll works fine for all multibyte encodings. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbscmp
Michael B Allen writes: What's the ultimate goal here? Are any of these functions *supposed* to work on multi-byte characters, or will there be mbs* functions? strcpy strcat strdup already work for multi-byte characters. strncpy strncat strncmp cannot work for multi-byte characters because they truncate characters. strcspn strspn strpbrk strstr: you can write multibyte aware analogs of these. strchr strrchr: use a multibyte aware strstr analog instead. Nothing is standardized in this area, but IMO an mbstring.h include file which defines these for arbitrary encodings, and a unistring.h which defines these for UTF-8 strings, would be very nice. I'm working on an LGPL'ed implementation of the latter.

/*
 * Returns a pointer to the character at off within the multi-byte string
                                         ^^ Emphasize: at _screen_position_ off.
 * src not examining more than sn bytes.
 */
char *
mbsnoff(char *src, int off, size_t sn)
{
    wchar_t ucs;
    int w;
    size_t n;
    mbstate_t ps;

    ucs = 1;
    memset(&ps, 0, sizeof(ps));

    if (sn > INT_MAX) {
        sn = INT_MAX;
    }
    if (off < 0) {
        off = INT_MAX;
    }
    while (ucs && (n = mbrtowc(&ucs, src, sn, &ps)) != (size_t)-2) {

Change that to:

    while (sn > 0 && (n = mbrtowc(&ucs, src, sn, &ps)) != (size_t)-2) {

        if (n == (size_t)-1) {
            return NULL;
        }
        if ((w = wcwidth(ucs)) > 0) {
            if (w > off) {
                break;
            }
            off -= w;
        }
        sn -= n;
        src += n;
    }
    return src;
}

Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbscmp
Jimmy Kaplowitz writes: based on looking at man pages, you can use one of three functions (mbstowcs, mbsrtowcs, or mbsnrtowcs) to convert your multibyte string to a wide character string (an array of type wchar_t, one wchar_t per *character*), and then use the many wcs* functions to do various tests. My recollection of the consensus on this list is that for internal purposes, wchar_t is the way to go, and conversion to multibyte strings of char is necessary only for I/O, and there only when you can't use functions like fwprintf. That was my impression at the beginning as well. Until I realized that all this idea leads to is unreliable programs. Because fgetwc, which you would like to use for I/O, doesn't give you any chance of correction when it encounters an invalid multibyte character in the input file. And the output side of the streams is no better: fputwc on a stream on which someone has already done an fputc call is undefined behaviour (it can crash or do nothing). For an example, take the 'rev' program in util-linux, and feed it with ISO-8859-1 input while running in a UTF-8 locale. Simply unreliable. Also wchar_t[] occupies more memory. More memory means more cache misses, means less speed. Also wchar_t[] doesn't fulfill its promise of 1 character = 1 memory unit. Because a Vietnamese character is usually composed of two Unicode characters; the term complex character is used to denote this multi-wchar_t unit. And you cannot separate these two units, whether in truncation, regexp search, line breaking, or any other algorithm. For this reason, wchar_t is only good to call wctype.h libc APIs, not for in-memory representation of strings. The latter should still be done with char*.
And for iterating through characters in multibyte strings, you can use the inline functions found at http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbchar.h?rev=1.3&content-type=text/vnd.viewcvs-markup http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbiter_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbfile_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup However, wchar_t is only guaranteed to be Unicode (which encoding?) when the macro __STDC_ISO_10646__ is defined, as is done with glibc 2.2. Correct. But it does not mean that *every* Unicode character can be used: You cannot use Hangul Unicode characters in an ISO-8859-1 locale. In glibc the wctype.h functions work on these characters (in any locale, except the C locale), but when you convert a Hangul character to multibyte in such a locale, all you get is a '?'. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: NFS4 requires UTF-8
Markus Kuhn writes: I just spotted in section 1.1.3 of RFC 3010 (NFS version 4 Protocol) the following requirement: file and directory names are encoded with UTF-8. Good, they got it right. Where is the conversion between the NFS filenames and the user-visible filenames (in locale encoding) to take place? Probably in the kernel, and the user-visible encoding will be given by a mount option? Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: isprint() under utf-8 locale
Radovan Garabik writes: From my naive point of view, I would expect isprint() to return nonzero for a utf-8 locale, since this would allow older non-multibyte aware programs using isprint() just to pass utf-8 characters to output, which at least has a chance of working, instead of not displaying them at all. The purpose of calling isprint in such programs is to filter out control characters, right? Now when such an old program calls isprint on the individual bytes that constitute a multibyte character, it cannot know whether that character is a graphic character (like U+20AC) or a control character (like U+200E). Blindly returning 1 would work in some cases but not in others. It is better to port the application to use mbrtowc and iswprint. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Updated: Security in Unicode
Gaspar Sinai writes: http://www.yudit.org/security/ About the first of your samples: what happens there in the first and the third line is that inside the Java programs, the strings are embedded in left-to-right text, whereas in the JTextArea they have no preferred direction, and the Unicode bidi algorithm looks at the direction of the first logical character that has a direction. You can fix it by adding a left-to-right direction marker to the strings: new JLabel("\u200e...") or new JLabel("\u202a...") or new JLabel("\u202d..."). I don't see this as a security problem, because programmers ought to test their programs before releasing them. Can't comment on the second sample, though. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Cedilla: a manic text printer
Juliusz Chroboczek writes: A first beta of Cedilla, the manic text printer, is available from http://www.pps.jussieu.fr/~jch/software/cedilla/ If you happen to run it in CLISP 2.26, you need to apply the following bug fix to clisp, and also use ext:quit instead of lisp:quit. Bruno

*** clisp-2.26/src/io.d.bak	2001-04-17 09:31:13.0 +0200
--- clisp-2.26/src/io.d	2002-01-31 04:12:46.0 +0100
***************
*** 3108,3142 ****
      TheIarray(hstring)->data = token; # Datenvektor := O(token_buff_1)
      token = TheIarray(token)->data; # Normal-Simple-String mit Token
      var uintL pos = 0; # momentane Position im Token
!     loop { # Suche nächstes Hyphen
!       if (len-pos == 1) # einbuchstabiger Charactername?
!         break;
!       var uintL hyphen = pos; # hyphen := pos
!       loop {
!         if (hyphen == len) # schon Token-Ende?
!           goto no_more_hyphen;
!         if (chareq(TheSstring(token)->data[hyphen],ascii('-'))) # Hyphen gefunden?
!           break;
!         hyphen++; # nein - weitersuchen
!       }
!       # Hyphen bei Position hyphen gefunden
!       var uintL sub_len = hyphen-pos;
!       TheIarray(hstring)->dims[0] = pos; # Displaced-Offset := pos
!       TheIarray(hstring)->totalsize =
!         TheIarray(hstring)->dims[1] = sub_len; # Länge := hyphen-pos
!       # Jetzt ist hstring = (subseq token pos hyphen)
!       # Displaced-String hstring ist kein Bitname - Error
!       pushSTACK(*stream_); # Wert für Slot STREAM von STREAM-ERROR
!       pushSTACK(copy_string(hstring)); # Displaced-String kopieren
!       pushSTACK(*stream_); # Stream
!       pushSTACK(S(read));
!       fehler(stream_error,
!         GETTEXT("~ from ~: there is no character bit with name ~")
!       );
!       bit_ok: # Bitname gefunden, Bit gesetzt
!       # Mit diesem Bitnamen fertig.
!       pos = hyphen+1; # zum nächsten
!     }
      # einbuchstabiger Charactername
      { var chart code = TheSstring(token)->data[pos]; # (char token pos)
--- 3108,3114 ----
      TheIarray(hstring)->data = token; # Datenvektor := O(token_buff_1)
      token = TheIarray(token)->data; # Normal-Simple-String mit Token
      var uintL pos = 0; # momentane Position im Token
!     if (len-pos == 1) # einbuchstabiger Charactername?
      # einbuchstabiger Charactername
      { var chart code = TheSstring(token)->data[pos]; # (char token pos)

-- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: strcoll for utf-8
Paul Michel writes: But strtok() for instance does not handle utf-8 data properly. Sure strtok() handles UTF-8 strings properly. It only has the limitation that the 'delimiter' that you can pass must be an ASCII character. strtok() even works with strings encoded in weird encodings like BIG-5 and GB18030, as long as the 'delimiter' is an ASCII character in the range 0x00..0x2F. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: getting locale's charset from a script
Ulrich Drepper writes: I've implemented this: iconv -f utf-8 -t //TRANSLIT This was an undefined case which gave not very nice results before. Now an empty string (or an empty name before the second slash) means: use the locale's charset. The next release of GNU libiconv will interpret the empty encoding name and //TRANSLIT in the same way as glibc. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: implementation language choice
Juliusz Chroboczek writes: Finally, would people be willing to use a piece of code that requires Bruno Haible's CLISP to be installed? Or do you think that exclusive use of stone-age languages is a must? Nowadays Python makes a good alternative to Lisp. Roozbeh writes: For me, it's somehow a problem of distributions. Is the prerequisite available in major distributions? clisp ships with Debian, Suse, Mandrake, and is in RedHat contrib. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbrtowc
Markus Kuhn writes: mbstate_t ps; mbrtowc(NULL, NULL, 0, &ps); This is a bug in your program, not in glibc. You are right. I'll update the mbrtowc manual page to be clearer on this issue. Bruno -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
libiconv homepage moved
The GNU libiconv homepage is now at http://www.gnu.org/software/libiconv/ instead of http://clisp.cons.org/~haible/packages-libiconv.html Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Encoding conversions
Michael B. Allen writes: I gather that I can only assume that wchar_t is just a sequence of UCS codes of sizeof(wchar_t) in size. You cannot even assume that. wchar_t is locale dependent and OS/compiler/vendor dependent. It should never be used for binary file formats and network messages. Well, I have to normalize to something! wchar_t is a very wrong thing to normalize to, because it is OS and locale dependent. UTF-8 is a much better normalization for strings, both in-memory and on disk. UCS-4 is an alternative, good normalization for strings in memory. Your freshmeat link: http://clisp.cons.org/~haible/packages-libiconv.html is broken. Thanks for the note. I'm currently setting up a replacement. Can I use the latest libiconv as a shared library ... Yes, you can. So where do people discuss libiconv problems? With me, or on linux-utf8. iconv_open is giving me "No such file or directory". You should look at errno after iconv_open only if iconv_open returned (iconv_t)(-1). The manual page doesn't say anything about errno in the case of a successful return from iconv_open(). Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
RE: Encoding conversions
Carl W. Brown writes: But UTF-8 is not without its own problems. Take Oracle for example. Most of the world is not Oracle. If Oracle uses its own encodings, let Oracle deal with it. They designed UTF-8 to encode UCS-2 not UTF-16. No, Oracle did not design UTF-8 at all. RFC 2279 specifies UTF-8, and it encodes all characters from U+0000 to U+7FFFFFFF. I am not familiar with libiconv. ftp://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.7.tar.gz ICU has an invalid character callback handler. I use it for example to convert characters that are not in the code page to HTML/XML escape sequences. You can do that with iconv() as well. With iconv(), the processing simply stops at an invalid/unconvertible character, and the programmer can do any kind of error handling before restarting the conversion. Looking at iconv() I did not see any provisions for special invalid character handling. Do you have this kind of support in libiconv? Sure. It is even built-in. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Encoding conversions
Michael B. Allen writes: But it's not clear to me how this should be done correctly and in a portable way (or at least portable enough so that when it comes time to port I don't smack myself in the forehead). Use iconv. I mean the libc's iconv on GNU libc systems, and libiconv (also by GNU, but a different implementation) on other systems. libiconv is ported to most systems. I gather that I can only assume that wchar_t is just a sequence of UCS codes of sizeof(wchar_t) in size. You cannot even assume that. wchar_t is locale dependent and OS/compiler/vendor dependent. It should never be used for binary file formats and network messages. But is the in memory representation of a multi-byte string the equivalent of the UTF-8 encoding? Depends on where you got the string. In most cases, like when you got it from fgets(stdin), it will be in locale dependent encoding (LC_CTYPE environment variable dependent). Only in particular cases, like filenames read from 'pax' archives, or when you yourself converted it to UTF-8, or when you use a GNOME 2 API function, will the string be in UTF-8. So as an example case, to encode wchar_t to UTF-16LE I must convert each character to a definitive encoding such as UCS-4 and then use iconv to get to UTF-16LE. With the two aforementioned iconv implementations, you can also directly use iconv_open("UTF-16LE", "wchar_t"). PS: When encoding ASCII do I want to shave off the 8th bit? Removing the 8th bit is a garbage in - garbage out technique and causes endless grief to users. Instead call iconv_open(..., "ASCII"), and you'll get full error checking if a non-ASCII character is encountered. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: UTF-8 versus utf8
Markus Kuhn writes: In particular, the string that setlocale returns is this normalized form That was true in RedHat 7.0. But meanwhile Ulrich Drepper fixed it on 2000-10-30. The string returned by setlocale() contains .UTF-8 if the user's environment variables do. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: ISO 8859-16 is a national security threat :)
Markus Kuhn writes: I was delighted to read in ISO/IEC JTC 1/SC 2/WG 3/N 441 http://wwwold.dkuug.dk/JTC1/SC2/WG3/docs/n441.pdf how ISO 8859-16 is officially considered by the Kingdom of the Netherlands a threat to their national security. According to their explanation, Unicode is a threat of their national security as well :-) U+015F != U+0219 ... Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: UTF16 and GCC
Christoph Rohland writes: Yes, but perhaps we could try to make that standard? There is a chance to make the u... syntax(es) standard. Personally I don't think it is possible to standardize the way a compiler detects the encoding of an input file. Some, like gcc, will want to use UTF-8 as the default, some others will want to use the locale encoding. (Can't we use uint_least16_t instead of utf16_t?) No, I think one of the biggest mistakes in the C standard is that char/wchar_t is not fixed. We need an exact 16 bit type with a defined encoding. Joseph Myers explained why you won't get such a type (and why ISO C 99 section 7.18.1.1.(3) says that uint8_t, uint16_t and uint32_t are optional): Some hardware has a word size of 9, 16, 32, or 36 bit, and GCC and C99 support such hardware. Currently only on glibc systems. wchar_t == UCS-4 is only a recommendation in ISO C 99, not mandatory (unfortunately). No, it will be on all Unix systems we support: Solaris, Tru64, HPUX, AIX5L, Reliant. Did you get a firm confirmation from Sun people that in some version of Solaris, wchar_t will be UCS-4 in all locales and __STDC_ISO_10646__ will be defined? In which version of Solaris? Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: New Unifont release
Markus Kuhn writes: b) As a single (proportional) font, for use by applications which use a single font. Can't b) be solved with the help of fontsets instead of redundantly doubling the number of fonts? Not in the current state of affairs. Xlib doesn't do anything meaningful when an XFontSet has two fonts with the same encoding (here: ISO10646-1). The fontset only helps when all you have are fonts in different character sets (ISO8859-x, JISX0208, JISX0212, etc.); then the DrawString algorithm will cut the string into segments, based on the character sets. Other information from the fonts (e.g. width) is not used during this segmentation. And for new code, we use Xft instead of XFontSet. There also, it is helpful to have the entire Unicode repertoire in a single font. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: New Unifont release
Markus Kuhn writes: I strongly recommend that you follow the practice we established in XFree86 for the -misc-fixed-*-iso10646 fonts and split up GNU Unifont into two separate charcell font files, one 8x16 and one 16x16. No, please don't do that. We need *both* ways of packaging Unicode fonts: * As two separate charcell (fixed-width) fonts, for use by xterm and similar applications where width matters a lot. * As a single (proportional) font, for use by applications which use a single font. As a matter of fact, GNU unifont (as a single font) is very useful for use in cooledit or konqueror. Markus, please consider making a combined packaging of misc-fixed-medium-r-normal-- misc-fixed-medium-r-normal-ja- into a single font, that would be covered by the same license and which could therefore be an alternative to unifont, included with XFree86. Btw, what is the license of the unifont? Is it suitable for inclusion in XFree86? Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Luit and screen [was: anti-luit]
Tomohiro KUBOTA writes: However, software of the GNU Project will have to be assigned to the FSF. (Note the difference between merely GPL-ed software and GNU Project software.) This is the FSF's way of guarding itself legally. This is not true in this generality. There are packages in the GNU project whose copyright stays with the authors (like GNU clisp). There are also packages in the GNU project whose copyright is assigned to the FSF (like GNU GCC and glibc). The most important point for software that is part of the GNU project is that it cooperates well with the rest of the system, i.e. most importantly that it supports the --help and --version command line options, uses GNU infrastructure like autoconf where possible, imposes no arbitrary limitations on the users, and mentions the GNU project on its homepage. GPL-ed software cannot be included in the XFree86 source tree, as Juliusz said. Thus, I think Juliusz's way (luit under the X11 license) is reasonable. Still it seems strange to put a tty based filter program in the X11 distribution. This means that people who use a console and have no X installed cannot use it. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Emacs and nl_langinfo(CODESET)
Markus Kuhn wrote: I think Juliusz has already understood that naively using iconv() alone might not necessarily be well suited for luit, because it doesn't resynchronize all encodings cleverly. You need a bit of additional logic. If you press ^C in an application that spits out BIG5 at an unfortunate moment, or truncate a string by counting bytes, then you will lose BIG5 synchronization, and the terminal has to skip characters in the input stream until it finds two G0 characters in a row to be sure again where the next character starts. BIG5 is an example of a rather messy encoding, not only in that respect. iconv() itself doesn't resynchronize, but it is easy to resynchronize using iconv(). It needs less than 10 lines of code. Both the GNU Compiler for Java and a new gettext PO file lexer that I wrote last week are based on iconv() and do support resynchronization. The resynchronization is simple: Whenever iconv() returns -1/EILSEQ, skip 1 byte. ISO 2022 is far worse. Yes. How do you want to resynchronize when an escape sequence was dropped during transmission? You can only try an arbitrary ISO 2022 state and hope it's the correct one. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Locking Linux console to UTF-8
H. Peter Anvin writes: Personally I would suggest making this kind of user-space console software the default These consoles rely on the framebuffer console. But on my (quite new) PC I'm unable to get a framebuffer console with a frequency of more than 60 Hz. (Yes, I tried all possible VESA modes my BIOS offers.) Will KGI (the framebuffer console with arbitrary hardware timings, like X) get into the standard kernel? If so, when? Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Determine encoding from $LANG
Markus Kuhn writes: Add to that list many of the programming languages that use Unicode internally but that do not yet set the default i/o encoding correctly automatically based on LC_ALL || LC_CTYPE || LANG. For example TCL ... OTOH, Java (both the Sun JDK 1.3 and the GCC 3.0 libjava) and GNU CLISP already do respect LC_ALL || LC_CTYPE || LANG. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: file name encoding
H. Peter Anvin writes: Yes. This is the point. When users set the LANG variable, they expect all software to obey the variable. The issue is, however, what does that mean? In particular, strings in the filesystem are usually in the system-wide encoding scheme, not what that particular user happens to be processing at the time. Obeying LANG is important in two scenarios: 1) For the user who uses a single locale, and this locale's encoding is not ISO-8859-1. He sets LANG in $HOME/.profile. Such a user will in the long run use non-ASCII filenames. They will be stored in locale encoding on the disk. Programs should be able to display and use such filenames. 2) For the user who tries out a locale in a different encoding. He sets LANG on the command line. Such a user will have to be prepared for problems with non-ASCII filenames. But everything else should work without manual intervention.

  LANG=de_DE.UTF-8 xterm       - get a UTF-8 xterm
  LANG=ja_JP.EUC-JP gvim file  - edit an EUC-JP encoded file
  LANG=vi_VN emacs             - start emacs with a Vietnamese input method

etc. It's for the second case that it is important that no encodings are stored in $HOME/.* files. And it's for the first case that non-ASCII filenames must be supported. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: file name encoding
Juliusz Chroboczek writes: In a number of places, a program must interact with its environment in a locale-independent manner. This includes selection conversion, keyboard input, and arguably interaction with the file system. I agree that in _some_ places programs exchange text in locale independent formats. For example, strings in databases should better be stored in a locale independent format, so that users in different locales can access it. But we need to look at it case by case. Lack of understanding of this basic principle leads to absurdities such as Emacs' ``selection-coding-system'' variable. What led to 'selection-coding-system' is that some programs are ICCCM compliant (use locale independent format for the selection and cutbuffer) and some are not. So we'll get a mess every time it's not clear whether a mechanism uses a locale-dependent or -independent text representation.
* Selection: Here the ICCCM says it's locale independent.
* Keyboard input: An XKeyEvent is locale independent. Input read through XmbLookupString is locale dependent. Input read from /dev/tty is assumed to be locale dependent if the IEXTEN flag is set.
* Filenames: The POSIX spec for 'ls' implies that 'ls' treats filenames as locale (LC_CTYPE) dependent. This means all other programs must do the same.
Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: file name encoding
H. Peter Anvin writes: Actually, the conditions for non-ASCII filenames are even stricter: for the system to work consistently the way you describe, the ENTIRE SYSTEM needs to use the same locale. It need not. If the administrator/distribution files are in ASCII, and users don't need to access each other's files, there is no problem with user A having /home/A in EUC-JP encoding and user B having /home/B in UTF-8 encoding. FILENAME ENCODINGS IN DIFFERENT LOCALES DO NOT WORK. PERIOD. Sure. Therefore it's best to use non-ASCII filenames only after having switched one's system to UTF-8, not before. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
RE: __STDC_ISO_10646__ support under BSD
Markus Kuhn writes: The wchar_t encoding described on http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html has the advantage that functions such as wcwidth() still can be implemented Yes, but other user-written functions like

bool is_katakana (wchar_t wc)
{
  return (wc >= 0x30A1 && wc <= 0x30F6)
      || (wc >= 0x309B && wc <= 0x309C)
      || (wc >= 0x30FC && wc <= 0x30FE)
      || (wc >= 0xFF66 && wc <= 0xFF9F);
}

that assume __STDC_ISO_10646__ will not work with your iso2022-wc encoding. Thus __STDC_ISO_10646__ should be undefined when using a libc with this particular locale. But it is a compile-time constant. So it implies that the libc cannot define __STDC_ISO_10646__ at all. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Emacs and nl_langinfo(CODESET)
Markus Kuhn writes: Has someone written autoconf tests for the presence of nl_langinfo(CODESET)? Yes, GNU fileutils and GNU gettext use the following test.

======================== m4/codeset.m4 ========================
#serial AM1
dnl From Bruno Haible.
AC_DEFUN([AM_LANGINFO_CODESET],
[
  AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
    [AC_TRY_LINK([#include <langinfo.h>],
      [char* cs = nl_langinfo(CODESET);],
      am_cv_langinfo_codeset=yes,
      am_cv_langinfo_codeset=no)
    ])
  if test $am_cv_langinfo_codeset = yes; then
    AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
      [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
  fi
])
===============================================================

Has someone written a tiny nl_langinfo(CODESET) emulator for use until FreeBSD gets their locale support sorted out properly? Yes, it comes as the 'libcharset' subdirectory of GNU libiconv. You can find the newest release at ftp://ftp.ilog.fr/pub/Users/haible/gnu/libcharset-1.1.tar.gz You can find instructions for integrating this into Emacs in the libcharset-1.1/INTEGRATE file. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
RE: __STDC_ISO_10646__ support under BSD
Marco Cimarosti writes: As their name implies, Unicode Language Tags only change the language, NOT the character set (which remains Unicode, of course). The distinction is not relevant in this context. Remember why some people want to keep an ISO-2022 surface of the world. Because they have long ago invented the (mistaken) assumption that a character's rendition depends on the character set it is taken from. That is, a cyrillic character from ISO-8859-5 has width 1, whereas a cyrillic character from ISO-IR-165 has width 2. We are discussing how to make these people accept Unicode. I.e. how can a character with one given Unicode code point be represented with width 1 or 2, depending on context? Unicode 3.1 contains the means for that. A language tag is sufficient, because all Japanese charsets behave the same w.r.t. rendition of some specific characters. It's kind of a national custom. Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Comments on locale name guideline: CODESET names
Pablo Saratxaga writes: The standard Vietnamese encoding is TCVN-5712, not VISCII. Yes. And it has combining characters, which Markus wants to exclude... Bruno
Re: Again on mbrtowc()
Marco Cimarosti writes: I hope this is not too much off topic. Some time ago, Edmund Grimley Evans asked what should be the value of this expression: mbrtowc(wc, , 0, ps) I have two other similar questions for cases that seem unspecified: 1) What should the function do when passed a NULL as the last argument? Should it use an internal mbstate_t variable or not? Yes, it should. The manpage says it: In all of the above cases, if ps is a NULL pointer, a static anonymous state only known to the mbrtowc function is used instead. 2) What should it do and return if an mbstate_t is supplied that contains invalid state values? The same as may happen if you dereference an uninitialized char* variable: unspecified behaviour. SIGSEGV or toast your cat. Bruno
Re: Comments on locale name guideline
[EMAIL PROTECTED] writes: What was the original goal? Was it just for Linux, or aimed as a generic guideline for the benefit of any Unix variant (including non-Linux)? I was under the impression that it falls into the latter case (otherwise you wouldn't cc: the bsd-locale mailing list). Li18nux is about APIs for Linux. But since Linux standards are also likely to have an effect on *BSD in the future (at least because we share the same X11 and many applications), comments from BSD people are welcome. This particular subthread focused on how many locale encodings exist in POSIX systems, including *BSD and other Unices. My previous mail was an attempt to discourage you from spending time implementing ISO-2022-JP and SJIS locales. Bruno
Re: [li18nux2000:62] Comments on locale name guideline
Keld Simonsen writes: 4. Add '+' and ',' to the DELIMITERS. These are delimiters in ISO/IEC 15897 locale syntax. 5. Change or add the following syntax for locales: LANGUAGE_TERRITORY+MODIFIER1+MODIFIER2,SOURCE_VERSION.CODESET This is the format for locale names in the ISO standard (implemented in glibc). glibc supports this, but adding it to the spec makes the spec unnecessarily more complex. Why choose a complex spec when a simple one is sufficient? Just to support every existing (but unused) ISO standard? 7. For the CODESET repertoire, please add the specials : ( ) / _ . * No, please don't add : ( ) / . * as these may not occur in charset names according to RFC 2278. 8. In MODIFIER, you should remove the line with euro, as this is not a good example. The euro modifier is normally based on a dependency on special coding in the application to say whether this should be used, and as it has not removed the internationalization code from the program, it is a bad example of i18n. This is BS. The euro modifier designates locales with different contents for LC_MONETARY. It doesn't require special coding in applications. Bruno
Re: Comments on locale name guideline
Frank da Cruz writes: unless absolutely *everybody* agrees on *exactly* how at least the following things are handled: . Case mapping on case-insensitive file systems Not relevant for Unix. . Canonical composition or decomposition . Canonical ordering of combining characters These have been specified by the Unicode consortium, so everyone will have to implement them the same way. Nowadays users rarely type a full filename; filename completion and point-and-click GUIs make it less frequent. Not to mention issues of sorting and collation, e.g. for listing files in alphabetical order. French users can now sort their files according to French dictionary rules, and similarly for other languages. Actually, life gets easier for users than with the ASCII sorting rule, where German umlauts came after the entire alphabet. Even if Linux gets it right, then we have cross-platform issues such as NFS mounts, FTP, and so on. NFS is rarely used across different locales. For FTP we have a problem, right. For file archives, POSIX pax (the successor of 'tar') already specifies that filenames are stored in UTF-8 in the archive. Bruno
Re: locale names
Pablo Saratxaga writes: why not make it case insensitive? I think the problem is that the actual data is stored on disk. That is, on filesystems that are case sensitive, the locale name is case sensitive (unless you try all the possible case combinations when reading directory names, which would be a bit wasteful). This is definitely not the problem. The implementation could simply map the locale name to lower case _before_ accessing the disk. Implementations are allowed to do this; SUSv2 says: "If the [locale name] does not begin with a slash, the mechanism used to locate the locale is implementation-dependent." The problem is that Bram is the only person asking for that feature, and thus it hasn't found its way into glibc. Bruno