Re: JOE editor has just added UTF-8 support
Derek Martin [EMAIL PROTECTED]:
> in Gaim. =8^) Now if only Mutt will work properly with UTF-8... Err...
> I'm reading these messages inside mutt, which in turn runs under a
> UTF-8 enabled xterm (uxterm), with the el_GR.UTF-8 locale. And let me
> tell you, it works great, and in fact it's been supporting UTF-8 for a
> long time now. It seems to have problems with double-width Asian
> characters. It works fine with European character sets...

Mutt is supposed to work with double-width characters, and has been known
to, provided it has an appropriate terminal library, such as ncursesw or
a UTF-8 version of slang. Recent versions of Debian use ncursesw, but Red
Hat 9 seems to use slang:

$ cat /etc/redhat-release
Red Hat Linux release 9 (Shrike)
$ ldd /usr/bin/mutt
        libslang-utf8.so.1 => /usr/lib/libslang-utf8.so.1 (0x4002b000)
        ...

Judging by the library name this is supposed to work, so can you describe
a reproducible bug?

Edmund
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: Perl unicode weirdness.
Henry Spencer [EMAIL PROTECTED]:
> A conforming implementation of a function like my g(x), or the UTF-8
> encoding, includes the range check by definition.

Which definition? Are you sure validation is compulsory? Also, since
there's no point in checking for error conditions that you don't know how
to handle, I hope you have a clear idea of what to do with these illegal
high characters in various circumstances, because I don't.

Are you perhaps one of those people who thought it was a good idea for an
MTA to AND incoming message bodies with 0x7f? The standard didn't
officially allow non-US-ASCII data, so by ANDing the data with 0x7f you
make it more standards-compliant - and who cares if you make the message
completely useless to the recipient in the process?
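To make the disputed "range check" concrete, here is a minimal sketch (mine, not from the thread) of decoding one UTF-8 sequence with the checks a strictly conforming decoder applies: no stray continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n octets available) into *out.
 * Returns the number of octets consumed, or -1 on invalid input. */
int utf8_decode_checked(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp;
    int len, i;

    if (n == 0) return -1;
    if (s[0] < 0x80)      { cp = s[0];        len = 1; }
    else if (s[0] < 0xC0) { return -1; }  /* stray continuation byte */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
    else if (s[0] < 0xF8) { cp = s[0] & 0x07; len = 4; }
    else return -1;

    if (n < (size_t)len) return -1;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* The range checks: overlong forms, surrogates, and values
     * beyond U+10FFFF are invalid by definition. */
    if (len == 2 && cp < 0x80)    return -1;
    if (len == 3 && cp < 0x800)   return -1;
    if (len == 4 && cp < 0x10000) return -1;
    if (cp >= 0xD800 && cp <= 0xDFFF) return -1;
    if (cp > 0x10FFFF) return -1;

    *out = cp;
    return len;
}
```

Whether a decoder must apply these checks is exactly the point under dispute; the sketch just shows what it costs to do so.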
Re: Perl unicode weirdness.
Henry Spencer [EMAIL PROTECTED]:
> Yes, it would be better to call the more general encoding, say, UTF-P.

Surely they're the same encoding applied to a different set of points? Or
would you claim that the function f(x) = 1/x on the interval 0 < x < 1 is
a different function from f(x) = 1/x on the interval 0 < x < 2? In a
sense they are different functions, but it's convenient and natural to
give them the same name, and they can both have the same implementation
if you leave it to the caller to check that x is in range.
Re: CD Player
Jan Willem Stumpel [EMAIL PROTECTED]:
> During a short holiday in Greece, I bought some CD´s with Greek songs.
> xmcd, Workman, etc., cannot display the song titles correctly in Greek
> (only displaying a mess of accented Latin-1 characters) in my
> LANG=en_GB.UTF-8 environment.

A version of freedb (cddb) that supports UTF-8 with a new protocol level
6 was announced on December 3, so I don't suppose many clients or client
libraries support it yet. Maybe you could help update them.

(By the way, you wrote CD´s, using an acute accent instead of an
apostrophe.)
Perl in a UTF-8 locale
I have a problem here with Perl v5.8.0 on Red Hat 9. Simplified, my
script looks like this:

    while (<>) { s/ĉ/cx/g; print; }

This works with older versions of Perl, and it works in the C locale, but
it doesn't work here in a UTF-8 locale. I tried putting stuff like "use
bytes" or "no utf8" or "no locale", but it didn't help. Can anyone
suggest a good solution, ideally one that is portable between different
locales and different versions of Perl? Obviously I could use a wrapper.
Currently I'm using this work-around:

    unless ($ENV{LANG} eq 'C') {
        $ENV{LANG} = 'C';
        exec('/path/to/this/script', @ARGV);
    }
Re: Linux console internationalization
Beni Cherniavsky [EMAIL PROTECTED]:
> The first question has some reasonable answers:

One answer I didn't notice in your list was that applications might want
to display the shift state. For example, in one of my Emacs input methods
I use ";c" to type 'ĉ'. When I type ';' I see ';' underlined to remind me
that the ';' might be combined with the following character.

Back in the 1980s I had an Amstrad PCW running LocoScript 2. You switched
between Latin, Cyrillic, Greek and symbol keyboards using Alt-F1, etc,
and there was some kind of indication on the screen of which keyboard was
currently selected, if I remember correctly. (LocoScript 2 also let you
combine any diacritic with any base character and had more diacritics
than TeX ...)
Re: redhat 8.0 - using locales
Antoine Leca [EMAIL PROTECTED]:
> In addition, differences between zh_* in LC_MESSAGES are not trivial.
> AFAIK, Hong Kong is now part of CN. Still, they use Traditional
> Chinese. So what are we doing then? ;-)

Obsolete country codes might be useful for distinguishing a few language
varieties that could not otherwise be distinguished. Is anyone using
de_DD for German without the latest spelling reform? :-)
Re: redhat 8.0 - using locales
> A few files appear under LC_MESSAGES, but it seems they don't show up
> even when LANG=eo.

First, you need to have a locale, maybe eo_ES or so. I recommend eo_XX as
an unofficial way of not choosing a country. There's a locale definition
file here: http://rano.org/eo_XX
Re: NUL-transparent Java-UTF-8
Markus Kuhn [EMAIL PROTECTED]:
> Is there a proper full specification of this encoding somewhere online?
> Merely replacing 0x00 with its overlong UTF-8 equivalent 0xc0 0x80
> can't be the full story, because what you are interested in, in the
> end, must surely be binary transparency, not merely NUL-transparency. I
> don't see what NUL-transparency alone would be good for, as NUL is
> usually only a problem in arbitrary binary strings.

True, but pedantically correct handling of e-mail messages is an
exception. According to RFC 822 all 7-bit characters, including '\0', are
valid in a Subject line, for example. You are even allowed to have a bare
'\r' or a bare '\n'; only \r\n is special: it must be followed by ' ' or
'\t'. Of course, nobody really implements this.

Edmund
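The substitution being discussed is simple enough to sketch (this is my illustration, not code from the thread): each 0x00 octet is replaced by its overlong two-byte form 0xC0 0x80, and everything else passes through, which is why the result is NUL-transparent but not binary-transparent.

```c
#include <stddef.h>

/* Escape NULs the "Java UTF-8" way: replace each 0x00 octet with the
 * overlong two-byte encoding 0xC0 0x80, so the output contains no
 * 0x00 octet.  All other octets pass through unchanged.
 * Returns the output length; out must have room for 2*n octets. */
size_t nul_escape(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, j = 0;
    for (i = 0; i < n; i++) {
        if (in[i] == 0x00) {
            out[j++] = 0xC0;   /* overlong encoding of U+0000 */
            out[j++] = 0x80;
        } else {
            out[j++] = in[i];
        }
    }
    return j;
}
```

Decoding is the reverse: 0xC0 0x80 maps back to 0x00; a strict UTF-8 decoder would of course reject that pair as overlong.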
Re: filename and normalization (was gcc identifiers)
I don't think normalisation helps at all.

An ideal UTF-8 terminal should remember the actual octets that were
printed, so you can accurately copy and paste even random binary data
that is displayed as reverse-field question marks. The ls program should
have an option to display file names in a form in which they can be used
as shell arguments and with difficult octet sequences replaced by
numerical escapes.[*] Those two measures together should make it fairly
easy to copy and paste file names. However, if you add normalisation, it
will stop working.

It might be useful to have a program that looks for a file path on the
system that is similar to a given file path. This program could use
normalisation internally, but it would be better to use a fuzzy
comparison. For example, "guesspath foo" would return "Foo" if the only
files in the current directory are Foo and Bar, but it would return
"foo" if there is a file called foo, and I don't know what it would do
if there are files called foo and Foo.

Edmund

[*] Unfortunately, the Bourne shell doesn't have numerical escapes,
which rather spoils this plan. You could have a file called "\007"
displayed as $(printf '\x07'), while a file actually called
"$(printf '\x07')" is displayed as '$(printf \x07)', etc.
Re: readdir() on linux
marco [EMAIL PROTECTED]:
> I need to make a scan of all the files on a Linux system (independently
> of the type of filesystem and the options given at mount time) and
> record all the filenames. I'm using the readdir() syscall that returns
> a pointer to a struct dirent. My question is: what should I assume
> about the format/encoding of the d_name[] field?

Assume it's a null-terminated octet string. It shouldn't be empty, and it
shouldn't contain (ASCII) '/'. You can't assume the string is valid
character data in any particular encoding. However, if it is valid as
UTF-8, then it probably really is UTF-8, but it might not be printable,
so you'll still have to process it before displaying it.

Edmund
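The advice above can be sketched in code (my illustration, with a deliberately simplified validity check): treat each d_name as an opaque octet string and merely test whether it happens to be well-formed UTF-8 before deciding how to display it.

```c
#include <dirent.h>
#include <stdio.h>

/* Simplified well-formedness test: checks lead/continuation byte
 * structure only.  A real validator would also reject overlong
 * forms, surrogates and values above U+10FFFF. */
static int looks_like_utf8(const unsigned char *s)
{
    while (*s) {
        int cont;
        if (*s < 0x80) cont = 0;
        else if ((*s & 0xE0) == 0xC0) cont = 1;
        else if ((*s & 0xF0) == 0xE0) cont = 2;
        else if ((*s & 0xF8) == 0xF0) cont = 3;
        else return 0;
        while (cont--) {
            s++;
            if ((*s & 0xC0) != 0x80) return 0;  /* also catches early NUL */
        }
        s++;
    }
    return 1;
}

/* List a directory, flagging entries whose names are not valid UTF-8. */
int scan_dir(const char *path)
{
    DIR *d = opendir(path);
    struct dirent *e;
    if (!d) return -1;
    while ((e = readdir(d)) != NULL)
        printf("%s%s\n", e->d_name,
               looks_like_utf8((const unsigned char *)e->d_name)
                   ? "" : "  [not UTF-8]");
    closedir(d);
    return 0;
}
```

Even a name that passes this test may contain control characters, so it still needs escaping before being printed to a terminal.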
Re: readdir() on linux
marco [EMAIL PROTECTED]:
> Ok, does anybody know if the same applies to other unices (e.g.:
> AIX/Solaris)? I would like to understand how Linux compares to these
> commercial OS's.

I didn't notice any difference when I tried the following:

    mkdir t
    cd t
    x=0; while [ $x -lt 255 ]; do x=$[$x+1]; \
        > "$(printf "\x$(printf %02x $x)")"; done
    for x in ?; do echo -n "$x"; done | od -Ax -tx1

There were 252 files created: all octet values except '\0', '.', '/' and
'\n' - the latter due to a limitation of the shell, I assume. Shell
scripts don't work very well with file names containing a newline ...

Edmund
Re: Red Hat 8 now uses UTF-8 by default for all non-CJK users
Radovan Garabik [EMAIL PROTECTED]:
> > There has been surprisingly little user dissatisfaction, for one
> > reason or the other. Not sure why exactly. US-centric user base?
> > Techie user base that uses English anyway? Easy enough to switch
> > back?
> This one probably. Shortly after the new Red Hat came out,
> cz.comp.linux was flooded by users asking "How the f*ck can I turn
> this off". So I suspect everyone who was dissatisfied has already
> switched back to an ISO-8859-2 locale

Let's hope there were also a few people who bothered to report specific
bugs so that they can be fixed! Here's one bug I saw:

http://groups.yahoo.com/group/mutt-dev/message/16606

Mutt was working nicely in UTF-8, but Mutt invokes an external editor, in
this case Emacs, and apparently Emacs was not respecting the locale.
There might have been something in the user's .emacs that caused this,
but could someone please check that with Red Hat 8.0 Emacs will by
default create a UTF-8 file when invoked from a UTF-8 locale?

Edmund
Re: How to read mail with #nnnn
[EMAIL PROTECTED] [EMAIL PROTECTED]:
> Sometimes I receive mail in
>   Content-Type: text/html; charset=iso-8859-1
>   Content-Transfer-Encoding: quoted-printable

Your mail client should decode the quoted-printable and pass the decoded
HTML document to a web browser. I read e-mail with Mutt and I've set it
up to cope with HTML by putting

    text/html; /usr/bin/lynx -dump -force_html %s; copiousoutput

in ~/.mailcap and

    auto_view text/html

in ~/.mutt/muttrc. The muttrc bit is mutt-specific, obviously, but lots
of programs use ~/.mailcap.

Edmund
Re: ISO9660 UTF-8
Jungshik Shin [EMAIL PROTECTED]:
> However, I had to tell him that there's another hurdle to overcome. My
> patch hard-coded 'UTF-16LE' as the codeset name for 'UTF-16 Little
> Endian', but it's not very portable. There should be a way to detect
> the codeset name to use with iconv(3) on a given platform for
> UTF-16LE. Is there any autoconf macro written for this? An alternative
> is to just make it user-configurable at run-time. This is easier for
> programmers, but not so user-friendly...

Because of the way some people use libiconv with LD_PRELOAD, it makes
sense to decide at run time rather than build time. However, you probably
don't need to bother the typical user with configuration stuff; you can
try various possible names and run tests at run time.

Edmund
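The run-time probing suggested above might look like this sketch (mine; the candidate list is a guess, not an exhaustive survey of platforms): simply try iconv_open() with each plausible spelling until one is accepted.

```c
#include <iconv.h>
#include <stddef.h>

/* Probe for a codeset name this platform's iconv accepts for
 * little-endian UTF-16.  Returns the first accepted name, or NULL
 * if none works (fall back to user configuration in that case). */
const char *find_utf16le_name(void)
{
    static const char *candidates[] = {
        "UTF-16LE", "UTF16LE", "UTF-16le", "utf-16le", NULL
    };
    int i;
    for (i = 0; candidates[i] != NULL; i++) {
        iconv_t cd = iconv_open("UTF-8", candidates[i]);
        if (cd != (iconv_t)-1) {
            iconv_close(cd);
            return candidates[i];
        }
    }
    return NULL;
}
```

A more paranoid probe would also convert a known sample and check the output, since a platform could accept a name but interpret the byte order differently.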
Re: Paper size
Henry Spencer [EMAIL PROTECTED]:
> > For the exact same reason you should switch to the metric system...
> Unfortunately, there isn't the same incentive. Paper size is basically
> arbitrary; it doesn't impinge on everything else the way the units
> system does. There's nothing magic about 210x297mm that makes anything
> easier.

But there is! Firstly, if you cut a piece of A4 paper into two halves,
each has the same proportions as A4. Secondly, a piece of An paper has
area 1/2**n of a square metre. Standard photocopier paper weighs 80 grams
a square metre, so a piece of A4 weighs 5 g, and airmail postage rates go
in steps of 5 g or 10 g ...

Of course, it's not really 210x297mm; it's more like 210.224x297.302mm.

Edmund
Re: Paper size and locale
> > As for the actual physical paper format (as opposed to PDF document
> > layout), I'd like to warmly encourage people in North America to
> > start using A4 paper.
> Why would we?

Because you will eventually, so you might as well do it now to minimise
suffering. Well, I don't know how true that is for A4 paper, but that's
a generic reason for accepting a good standard. I have heard of a US
company using A4 for compatibility with its own offices in other
countries, but I don't suppose it happens very often yet.

I can still remember the old foolscap paper that preceded A4 in Britain.
I'm certainly glad they replaced it. Sorry, I'm now totally off topic ...

Edmund
Re: NFS4 requires UTF-8
Bruno Haible [EMAIL PROTECTED]:
> I just spotted in section 1.1.3 of RFC 3030 (NFS version 4 Protocol)
> the following requirement: file and directory names are encoded with
> UTF-8. Good, they got it right. Where is the conversion between the
> NFS filenames and the user-visible filenames (in locale encoding) to
> take place? Probably in the kernel, and the user-visible encoding will
> be given by a mount option?

We had a long and at times somewhat heated discussion about that on this
list some time last year, IIRC. I think it doesn't make sense for file
name arguments to fopen(), opendir(), etc, to be locale-dependent: too
many things will break if different processes see different file names.
The mount option makes sense, but it will be confusing if server file
names and client file names cannot be converted exactly. So there should
be a mount option for converting file names, but people would be well
advised not to use it and instead let applications convert file names,
if they want to.

It's RFC 3010, by the way.

Edmund
Re: NFS4 requires UTF-8
Pablo Saratxaga [EMAIL PROTECTED]:
> > Currently you can have a filename with bytes in 0x01-0x1F and
> > 0x7F-0x9F, however you cannot usually type those directly.
> Well, you can use those \x88 and the like representations, or use that
> lovely tab-completion feature (if the filename starts with a typable
> thing), or use a tool that allows you to pick the file in a menu (that
> is my preferred way to delete bizarre file names: select them in mc
> and press F8; it is much easier)

And the traditional last resort is to move everything with a sensible
name out of the directory and then rm -rf the directory.

Edmund
Re: Security
Markus Kuhn [EMAIL PROTECTED]:
> I still think there is a philosophical misunderstanding here about how
> digital signatures are to be interpreted in cases of legal dispute.
> What in most countries that have thought about the issue would count
> is what the human end user has seen on the display component of the
> device where the signature was generated. The actual bitstring signed
> is not as relevant here as you might believe. You do not need any
> reversibility, you just need a tightly standardized rendering process
> that produces the same readable text each time from the same bit
> string. That standardised rendering algorithm will be used as well in
> court to inspect the bitstring you have signed, not your hexdump
> editor or whatever alternative displaying process that you might come
> up with to provide a different text.

This can't be right, or blind people would not be able to communicate in
a legally recognised way. Also, a document might be passed round a
company and inspected by a large number of blind and seeing persons,
using a wide variety of different software, before it is passed to
another company to form part of a contract. The device where the
signature was generated might be a server with no display component.

I don't think you can get away from the bitstring being the
authoritative text. If different software displays bidirectional text
differently, then you have another kind of potential ambiguity to add to
all the kinds of ambiguity that already exist in any communication
between people. (But thinking about a blind person listening to the text
through a speech synthesiser probably gives a good idea of what the
correct interpretation should be: words should be spoken in the order
they appear in the bitstring, regardless of writing direction.)

Edmund
Re: [linux-utf8] UTF-8 in e-mail subject lines, To: headers, etc.
[EMAIL PROTECTED] [EMAIL PROTECTED]:
> A slight extra problem is that MIME::Words and Mail::Header don't
> really get along very well together. It seems that Mail::Header splits
> up some headers differently from others. If the header is mentioned in
> the magical internal hash %Mail::Header::STRUCTURE, then the header is
> split up on whitespace, commas and semi-colons, eg:
>
>   From: =?utf-8?Q?Richard Jones?= [EMAIL PROTECTED]
>
> But otherwise (eg. for Subject headers), Mail::Header will split at an
> arbitrary location based on length only. This has the effect of
> splitting the word across lines, which breaks things. Unfortunately
> adding %Mail::Header::STRUCTURE{subject} doesn't seem to be the
> answer, because I can't necessarily guarantee that the subject line
> will contain any whitespace. So it looks like I'll have to break the
> header up by hand by adding \n after words before calling
> MIME::Entity->build. I'm sure I can't be the first person to find this
> problem ... I'm also not sure why the RFC doesn't define that headers
> should be concatenated *first*, followed *second* by un-mimeifying.
> That would seem to be a much simpler way of doing things.

Because in general you don't want to unfold (concatenate) header fields.
I don't think Mail::Header should be folding (splitting up) headers at
all. RFC 822 merely says you can fold header lines, not that you should.
In the case of an unstructured field, such as Subject, splitting up and
concatenating the header may destroy deliberate layout, for example:

Subject: Awake! for Morning in the Bowl of Night
    Has flung the Stone that puts the Stars to Flight

Obviously a mail client may want to unfold the text in order to display
it in a summary list, but I don't see why Mail::Header has to mess with
it. So, I suggest you try complaining to the maintainer of Mail::Header.
Perhaps they would be willing to only split up structured header lines,
for example.

Edmund
Re: UTF-8 in e-mail subject lines, To: headers, etc.
[EMAIL PROTECTED] [EMAIL PROTECTED]:
> Hi: When sending an email with the following subject line to an MS
> Outlook email client, Outlook renders the Arabic letters as question
> marks. OTOH message bodies sent in UTF-8 render OK provided the
> Content-Type header is set as appropriate. Is this a problem with
> Outlook, or is the subject line itself badly formed?
>
> Subject: =?utf-8?Q?The next will be in Arabic: =D8=AA=D8=A7=D8=B9
>  =D9=84=D8=A7=D9=84=D8=BA=D8=B9=D9=81=D8=BA=D8=B6=D8=B5=D8=AB=D9=82=D9=81=D8=BA=D8=B9=D9=87=D8=AE=D8=AD=D8=AC=D8=AF=D8=B7=D9=83=D9=85=D9=86=D8=AA=D8=A7=D9=84=D8=A8=D9=8A=D8=A8=D8=B3=D8=B3=D8=B4=D9=84=D8=A7=D8=B1=D9=84=D8=A7
>  =D8=B1=D8=A1=D8=A4=D8=A4=D8=B1=D9=84=D8=A7=D8=A1=D8=A9=D9=89=D8=B2=D9=85=D9=85=D9=87=D9=84=D8=A7=D8=AA=D8=AE=D9=85=D8=AE=D8=AD=D9=83=D8=AA=D9=86=D9=85=D9=89=D8=B4=D8=B3=D9=8A=D8=A8=D8=A8=D9=84=D9=84=D8=AA=D8=A7=D9=84=D8=A8
>  =D9=8A=D9=8A=D8=B3=D8=A6=D8=A1=D8=A4=D8=B1=D9=84=D8=A7=D9=84=D8=A7=D9=89=D9=89=D8=A9=D9=88=D8=B1=D8=B1=D8=A4=D8=A4=D8=A1=D8=A1 ?=
>
> Cheers for any help with this.

The Subject line is badly formed. There shouldn't be any spaces in
encoded-text. See http://www.faqs.org/rfcs/rfc2047.html

I shall attempt to attach a message with a corrected version of that
Subject line ...

Edmund

---BeginMessage---
تاعلالغعفغضصثقفغعهخحجدطكمنتالبيبسسشلارلارءؤؤرلاءةىزممهلاتخمخحكتنمىشسيببللتالبييسئءؤرلالاىىةوررؤؤءء
---End Message---
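The fix being described is how RFC 2047 "Q" encoding forms encoded-text: a SPACE must become '_' (or =20), and anything outside the safe ASCII range becomes =XX, so no raw space ever appears between the ?= delimiters. A sketch of that rule (my illustration, not a complete RFC 2047 encoder; it ignores the 76-octet limit on an encoded-word):

```c
#include <stddef.h>

/* Q-encode n input octets into out (NUL-terminated).  SPACE becomes
 * '_'; printable ASCII other than '=', '?' and '_' passes through;
 * everything else becomes =XX.  Returns the encoded length. */
size_t q_encode(const unsigned char *in, size_t n, char *out, size_t outsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, j = 0;
    for (i = 0; i < n && j + 3 < outsize; i++) {
        unsigned char c = in[i];
        if (c == ' ') {
            out[j++] = '_';                     /* never a raw space */
        } else if (c >= 33 && c <= 126 &&
                   c != '=' && c != '?' && c != '_') {
            out[j++] = (char)c;
        } else {
            out[j++] = '=';
            out[j++] = hex[c >> 4];
            out[j++] = hex[c & 0x0F];
        }
    }
    out[j] = '\0';
    return j;
}
```

Applied to the subject above, the literal spaces in "The next will be in Arabic: " would come out as underscores, which is what makes the encoded-word well formed.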
Re: [I18n]Re: Li18nux Locale Name Guideline Public Review
Bram Moolenaar [EMAIL PROTECTED]:
> In principle, I agree though, case sensitive; work should be aimed at
> making a GUI simple to use, and the CLI consistent and simple. I still
> haven't heard a good reason why case sensitivity is useful.

Simplicity of implementation (of existing and future code) and avoiding
weird bugs have been mentioned as reasons for case sensitivity. Unless I
missed something, the only reason we've had for case insensitivity is
making the names very slightly easier to remember.

Edmund
Re: [I18n]Re: Li18nux Locale Name Guideline Public Review
> setenv LANG de_DE.iso-8859-1@euro
> setenv LANG DE_de.ISO-8859-1@euro
> setenv LANG de_DE.Iso-8859-1@EURO
>
> Do you think an average user can guess which one of these he has to
> type? No GUI available!

If the average user is having to choose between those 3 possibilities,
then presumably those 3 possibilities were presented by some program or
included in some list. That program, or that list, should be modified to
only give valid possibilities.

Edmund
Re: Squeeze one more bit into a UTF-8 sequence?
Michael B Allen [EMAIL PROTECTED]:
> I am in the process of modifying xterm to return keysyms for key
> *releases* (in addition to key presses, naturally). The keysyms would
> be looked up in a table by their osf code (or something :-). A program
> that wants to take advantage of this apparatus could then issue a
> control sequence to turn it on and off and use a normalized table of
> keycodes to work from. Aaaanyway, I would like to use UTF-8 to encode
> the keysym for sending to the program's stdin but there is a problem;
> how do I encode the extra bit of information necessary to indicate
> that a UTF-8 sequence is a key release as opposed to a key press? Is
> there a way to encode /one more bit/ of information into a UTF-8
> sequence in a way that is mostly orthogonal to the encoding itself?

I would have thought that it would be better to use some kind of escape
sequence than invalid UTF-8. For example, you could pick characters D
and U and use DX or just X to mean X pressed and UX to mean X released
(D=down, U=up). Normally, you would transmit just X rather than DX, but
you would have to use DD and DU for D and U themselves being pressed.
For efficiency you could choose D and U to be characters that don't
often get typed, but there's nothing to stop you using the characters
'D' and 'U' if you want. Using a character that isn't too rare has the
advantage of making bugs show up earlier.

Edmund
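The escape scheme above can be sketched directly (my illustration; the marker characters are the literal 'D' and 'U' discussed as one option):

```c
#include <stddef.h>

#define MARK_DOWN 'D'
#define MARK_UP   'U'

/* Encode a key press: X is sent as just X, except that presses of
 * the marker characters themselves are escaped as DD and DU.
 * out needs room for 2 chars; returns the number written. */
size_t encode_press(char key, char *out)
{
    if (key == MARK_DOWN || key == MARK_UP) {
        out[0] = MARK_DOWN;   /* escape the marker itself */
        out[1] = key;
        return 2;
    }
    out[0] = key;
    return 1;
}

/* Encode a key release: always UX. */
size_t encode_release(char key, char *out)
{
    out[0] = MARK_UP;
    out[1] = key;
    return 2;
}
```

The receiver decodes symmetrically: on seeing D, the next character is a literal press of that character; on seeing U, the next character is a release; anything else is an unescaped press. Every valid UTF-8 stream remains valid, which is the point of preferring this over invalid sequences.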
Re: Unicode, character ambiguities
Pablo Saratxaga [EMAIL PROTECTED]:
> > Why was Turkish unified, then?
> It has not. There are two kinds of i: with and without dots: two
> different letters, 4 different chars (upper and lower case of the 2
> letters). They are not unified. Now, the default pair used in almost
> all languages is the one with a dot for the lowercase, and the one
> without a dot for the uppercase. So the default pairing is that one;
> only for Turkish and Azerbaijani are the uppercasing and lowercasing
> rules different.

You've described the situation, but you haven't answered the question.
The obvious alternative would be to have 6 characters: upper and lower
case versions of ordinary I, Turkish/Azeri dotted I and Turkish/Azeri
dotless I. It would be interesting to know whether this alternative is
ever used, in some encoding, was ever considered for Unicode, etc.

Edmund
Re: Unicode, character ambiguities
Henry Spencer [EMAIL PROTECTED]:
> However, the point remains valid: the Fraktur fonts, which have at
> least a strong historical presence in Latin-alphabet texts, are
> unreadable to a lot of Latin-alphabet users, and were nevertheless
> unified.

This is (I assume intentionally) a funny way of putting it. They didn't
have to be unified, because they were never considered to be distinct.
It's hard to imagine why anyone would want to derive the Latin alphabet
by doing a new, independent survey of existing fonts when everyone, even
children, already knows the alphabet. In summary, I don't think
readability has anything to do with it.

Edmund
Re: Printing UTF-8
Juliusz Chroboczek [EMAIL PROTECTED]:
> Finally, would people be willing to use a piece of code that requires
> Bruno Haible's CLISP to be installed? Or do you think that exclusive
> use of stone-age languages is a must?

Hang on! LISP was invented in 1960. The only older language still in use
is FORTRAN (1957). Use of a compiled language might be helpful, to
reduce run-time dependencies. Is there a free Common Lisp compiler? You
could implement in Prolog (1970), Scheme (1975), Caml (1984) or Haskell
(1990). C (1972) is boring; don't use C. :-)

Edmund
Re: getting locale's charset from a script
Bruno:
> > If it doesn't already do so, perhaps the iconv command should have
> > an option to tell you the charset of the current locale, as one of
> > the most likely reasons for wanting to know it is in order to use it
> > as an argument to iconv. So you could also have a pseudo-charset
> > "locale", as in "iconv -f locale -t utf-8".
> A missing -f or -t argument to the iconv program already denotes the
> locale charset. This is true for both glibc iconv (since glibc-2.2.2)
> and libiconv iconv (since libiconv-1.6).

Thanks. But what if I want to convert to the locale charset with
transliteration? Is that possible with iconv?

Edmund
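For what it's worth, both glibc and GNU libiconv accept a //TRANSLIT suffix on the target charset name, which requests transliteration where the tables support it (coverage is implementation- and locale-dependent). A sketch of using it from C (my illustration; error handling is deliberately crude):

```c
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Convert a UTF-8 string to the given charset with transliteration
 * requested via the //TRANSLIT suffix (a glibc/libiconv extension).
 * Inconvertible input is replaced by '?'.  Returns 0 on success,
 * -1 if the conversion could not even be set up. */
int to_charset_translit(const char *tocode, const char *in,
                        char *out, size_t outsize)
{
    char target[64];
    iconv_t cd;
    size_t inleft = strlen(in), outleft = outsize - 1;
    char *inp = (char *)in, *outp = out;

    snprintf(target, sizeof target, "%s//TRANSLIT", tocode);
    cd = iconv_open(target, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;
    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            if (outleft == 0)
                break;              /* output buffer full */
            *outp++ = '?';          /* skip one bad octet, mark it */
            outleft--;
            inp++;
            inleft--;
        }
    }
    *outp = '\0';
    iconv_close(cd);
    return 0;
}
```

Whether an accented letter comes out as its base letter or as '?' depends on the library and, for glibc, on the LC_CTYPE transliteration data, so this only partly answers the question.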
UTF-8 support for freedb and Ogg Vorbis
I recently contributed some UTF-8 support to a couple of projects, which
I will describe in case anyone has any advice for me.

http://sourceforge.net/projects/freedb/

This is a cddbp database server. You give it the precise track lengths
of a CD and it will supply the track titles if someone has already
entered them, or you can contribute them yourself. For example, Debian
has a script called abcde for converting an entire CD to Ogg Vorbis
files which queries a cddbp server automatically so that it can add tags
to the Ogg Vorbis files for you. There are various ways of communicating
with the server, but all of them include an explicit protocol level
except e-mail, which has MIME. Up until now ISO-8859-1 has been
prescribed. My proposal is to define protocol level 6 to be the same as
5 but with UTF-8 prescribed. The server takes care of charset conversion
and can be configured to automatically detect the encoding of disc
files, so an existing database can be used without conversion but new
files can be added in UTF-8.

When UTF-8 data is supplied to an ISO-8859-1 client the server has to
transliterate. The first problem is to provide a good transliteration
table: glibc and libiconv don't transliterate Cyrillic, I think, so can
anyone recommend such a table? The second problem is to avoid
transliterated data being edited by a user and then recontributed as a
correction. Ideally we wouldn't accept an ISO-8859-1 update to a file
that contains non-ISO-8859-1, but unfortunately updates are merged
off-line by a different process, which means it would be messy to
implement, so we might just make do with including a warning in the CD
title when data has been transliterated approximately and trusting the
user to understand it.

http://www.xiph.org/ogg/vorbis/

This is the free replacement for MP3. The Ogg Vorbis format prescribes
UTF-8, but data has to be converted for the client.
My suggestion to require iconv was not welcomed, so I provided both a
converter using iconv and a simple built-in one with a config test to
choose between them. The built-in converter does UTF-8 and 8-bit
encodings. It would be useful if anyone could provide a list of 8-bit
encodings worth including. An encoding is worth including if it is
widely used by people who don't have iconv, and a name of such an
encoding is worth including if it might be returned by
nl_langinfo(CODESET) on a system without iconv.

At present the code uses nl_langinfo(CODESET), where available, to get
the user's charset. Otherwise it looks at the environment variable
CHARSET. Otherwise it assumes US-ASCII. In general, when converting,
illegal input bytes are replaced by '#' and unrepresentable characters
are replaced by '?'.

The function to convert a buffer using iconv is about 200 lines of C,
mainly because of faults in the design of iconv's API, which mean you
have to convert the data 3 times: you have to go via UTF-8 to
distinguish the '#' and '?' cases, and you have to convert from UTF-8
twice to avoid having E2BIG mask the return value telling you that the
conversion was inexact. Also, I have to support both the standard iconv
and the various versions provided by glibc/libiconv, so I'm not totally
happy with iconv.

Edmund
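The charset-detection order described above (nl_langinfo first, then CHARSET, then US-ASCII) can be sketched like this (my illustration; it assumes the caller has already done setlocale(LC_CTYPE, "")):

```c
#include <langinfo.h>
#include <stdlib.h>

/* Guess the user's charset: nl_langinfo(CODESET) where available,
 * then the CHARSET environment variable, then US-ASCII as the
 * last resort.  Assumes setlocale(LC_CTYPE, "") was called first,
 * since nl_langinfo reports on the current locale. */
const char *guess_charset(void)
{
    const char *cs;
#ifdef CODESET
    cs = nl_langinfo(CODESET);
    if (cs != NULL && *cs != '\0')
        return cs;
#endif
    cs = getenv("CHARSET");
    if (cs != NULL && *cs != '\0')
        return cs;
    return "US-ASCII";
}
```

The #ifdef mirrors the "where available" caveat: on systems without nl_langinfo, only the environment variable and the fallback remain.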
Re: wrong strcoll() result with different UTF locale setting
Markus Kuhn [EMAIL PROTECTED]:
> Only in phone books. The more modern German sorting order used in
> dictionaries and most other applications treats ö like o,
> distinguished only in the second sorting level (just like accents are
> sorted in English as well). I'd rather see the ö=oe sorting order
> disappear. It is confusing, user unfriendly, and makes looking up
> words in sorted lists more complicated. It has its place in phone
> books and name lists only, because there used to be a lot of German
> surnames that sounded identical but have ö/oe, ü/ue, ä/ae as spelling
> alternatives (Moeller versus Möller, etc.).

I also used a German library catalogue that had Ö = OE and also I = J
and U = V, presumably with the sound practical justification that I and
J were the same letter in Classical Latin, as were U and V.

Edmund
Re: Odd differences in locale sorting
David Starner [EMAIL PROTECTED]:
> It seems that at least all the non-Latin-script languages should sort
> Latin script the same way, or at least choose between a standard,
> language-neutral 'correct' sort and an efficient sort.

Probably by default each locale should start off by directly or
indirectly copying iso14651_t1 and then apply modifications that only
change the ordering of the letters used in that language. However,
national standards do sometimes describe how foreign letters should be
ordered, so there may be some justification for some of the apparently
eccentric variations.

Edmund
Re: New Unifont release
David Starner [EMAIL PROTECTED]:
> > It's not clear whether this license covers only your additions, or
> > also Roman's original font. What is Roman's original license?
> That was Roman's original license. I'm an American, and American laws
> do not allow copyright on bitmap fonts. Any work I do on the Unifont
> is therefore in the public domain.

If I recall correctly, an international treaty on copyright states that
a citizen of country X gets the same rights in country Y as a citizen of
country Y, so it doesn't make any difference that you're an American.
Your work won't be in the public domain everywhere unless you say so.

Edmund
Re: Arabic (was Re: [I18n]Syriac)
Pablo Saratxaga [EMAIL PROTECTED]: However, if that is not the case, if bdf/pcf fonts need to be created, there is the problem of creating a new font encoding for Syriac. But of course, don't invent anything new if something suitable already exists. At cl.cam.ac.uk I shared an office with George Kiraz, who is the author of some Syriac fonts. I don't have his e-mail to hand, but you can find him on Google with "george kiraz syriac fonts". But I don't think he's at Bell Labs any more, so try his private e-mail address. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Luit and screen [was: anti-luit]
Juliusz Chroboczek [EMAIL PROTECTED]: RB Tho I do agree that luit should be integrated into screen eventually. Impossible for licensing reasons. I should hope that luit will get into the XFree86 tree. What are those reasons? Why can't it be dual-licensed? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Luit and screen [was: anti-luit]
Markus Kuhn [EMAIL PROTECTED]: The GPL is an absolutely fabulous idea, but since there is so much unjustified phobia around it, I'd recommend to donate anything that you produce related to support the use of UTF-8 under POSIX to the public domain (as I did with all my font and other UCS things on my web pages). This seems to maximise impact in other projects as it takes away the fuel from any potential licence discussion. Another possibility is to write that your code may be distributed under "licence of your choice" or GPL. Then people don't have to waste time discussing whether "licence of your choice" is GPL-compatible or not. Perl is distributed this way. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Again on mbrtowc()
Tomohiro KUBOTA [EMAIL PROTECTED]: It may detect the problem and return EINVAL. The problem is, mbrtowc() returns a size_t value. Thus, any positive value cannot be used for error. If this is a discussion to determine a new standard, I would insist it should return some negative value, for example, -3. Yes, errno should be set to EINVAL. Don't worry: when I wrote "return EINVAL" this was just shorthand for "return (size_t)(-1) and set errno to EINVAL". By the way, UTF-8 is stateful as far as mbrtowc() is concerned, so what Markus wrote about calling abort() does not constitute further evidence of a UTF-8 conspiracy to reduce codeset-diversity. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
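The four-way return convention under discussion can be sketched in C. This is a hand-rolled, illustrative step function (the name utf8_step is invented; it handles only 1-4 byte forms and omits the overlong-form check for brevity), not the locale-dependent mbrtowc() itself, but it reports results the same way: a positive byte count, 0 for a null character, (size_t)(-1) for an invalid sequence, and (size_t)(-2) for an incomplete one.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 character from at most n bytes of s, using the
 * mbrtowc()-style return convention:
 *   >0           : bytes consumed, one character decoded into *wc
 *    0           : a null character was decoded
 *   (size_t)(-1) : invalid sequence (mbrtowc would set errno=EILSEQ)
 *   (size_t)(-2) : incomplete but possibly valid prefix
 */
static size_t utf8_step(uint32_t *wc, const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return (size_t)(-2);            /* nothing to examine yet */
    if (p[0] < 0x80)                 { c = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0)  { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0)  { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0)  { c = p[0] & 0x07; len = 4; }
    else
        return (size_t)(-1);            /* invalid lead byte */
    for (i = 1; i < len; i++) {
        if (i >= n)
            return (size_t)(-2);        /* ran out of input mid-sequence */
        if ((p[i] & 0xC0) != 0x80)
            return (size_t)(-1);        /* bad continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (wc)
        *wc = c;
    return c == 0 ? 0 : len;
}
```

Note how (size_t)(-1) and (size_t)(-2) are unambiguous precisely because a valid multibyte character can never be that long, which is the point Kubota-san raises about size_t having no room for negative error codes.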
Re: UTF-8 as the single common encoding everywhere
H. Peter Anvin [EMAIL PROTECTED]: But which is that? The one described in RFC 2279, the one in ISO 10646-1:2000, or the one in Unicode 3.1? These are different. The only difference is how permissive the standard is with respect to the handling of irregular sequences. No standard has ever required interpretation of irregular sequences (except perhaps as a specification bug), and the only safe answer has always been to reject them. But sometimes it is not possible to reject sequences; you have to do something with the data, even if that means replacing it by '?'s. So in some circumstances it might be better to accept and generate UTF-8 sequences corresponding to all of the integers from 0 to 2^31-1. That is, after all, the simplest and most logical behaviour, and it would be the standard behaviour if there were no endian and UTF-16 problems. It sort of irritates me that in a UCS-4/UTF-8 world we are expected to treat U+D800..U+DFFF and U+FFFE and U+FFFF as illegal. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
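The "UTF-8 over all integers from 0 to 2^31-1" scheme mentioned above is the original ISO 10646 formulation, with sequences of up to six bytes. A minimal encoder sketch (function name and interface invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode any 31-bit integer as original-style UTF-8 (1..6 bytes).
 * Returns the number of bytes written, or 0 if c >= 2^31. */
static size_t utf8_encode31(uint32_t c, unsigned char out[6])
{
    size_t len;
    if      (c < 0x80)        len = 1;
    else if (c < 0x800)       len = 2;
    else if (c < 0x10000)     len = 3;
    else if (c < 0x200000)    len = 4;
    else if (c < 0x4000000)   len = 5;
    else if (c < 0x80000000u) len = 6;
    else return 0;                       /* outside the 31-bit range */

    if (len == 1) {
        out[0] = (unsigned char)c;
    } else {
        /* Lead-byte prefixes for 2..6 byte sequences. */
        static const unsigned char lead[7] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};
        size_t i;
        for (i = len - 1; i > 0; i--) {  /* continuation bytes, low bits first */
            out[i] = 0x80 | (c & 0x3F);
            c >>= 6;
        }
        out[0] = lead[len] | (unsigned char)c;
    }
    return len;
}
```

The largest value, 0x7FFFFFFF, comes out as FD BF BF BF BF BF, which shows why FE and FF can never appear in any UTF-8 variant.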
Re: mbrtowc(&wc, "", 0, &ps)
Marco Cimarosti [EMAIL PROTECTED]: BTW, I see that Plauger's reference contradicts what Markus said in two points, and I have no way of determining who is more correct or up to date: 1) In http://www.dinkumware.com/htm_cl/wchar.html#mbrtowc, it says that mbrtowc() returns zero only when the next completed character is a null character, which cannot of course be the case when the size is zero. Plauger too does not specify what the function should return in this case, but -2 (incomplete mb character) seems a reasonable choice. It's the only reasonable choice, even if you can argue, legalistically, that according to some standard mbrtowc is entitled to return -42 and randomly corrupt memory when given size = 0. 2) In http://www.dinkumware.com/htm_cl/wchar.html#mbstate_t it says that mbstate_t can be initialized simply by setting its *first* member to zero (mbstate_t mbst = {0};), and this would imply that a memset() is only needed to *re*initialize it. I don't think you are allowed to assume that mbstate_t is a structure and has members, so memset is definitely better. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
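The memset() approach recommended above treats mbstate_t as fully opaque: a zeroed object describes the initial conversion state, which mbsinit() can confirm. A minimal sketch (the helper name is invented for illustration):

```c
#include <string.h>
#include <wchar.h>

/* Returns nonzero if a memset-zeroed mbstate_t describes the initial
 * conversion state.  This zeroes the whole object rather than
 * assuming mbstate_t is a struct whose first member may be set to 0. */
static int zeroed_state_is_initial(void)
{
    mbstate_t ps;
    memset(&ps, 0, sizeof ps);
    return mbsinit(&ps) != 0;
}
```

Because mbstate_t may be a scalar, a union, or a struct depending on the C library, memset over the whole object is the only initialization that is portable in all three cases.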
mbrtowc(&wc, "", 0, &ps)
I admit I haven't checked the latest glibc from CVS, and I haven't investigated any databases of bug reports, so I apologise if this is already well known. With glibc-2.2.3, mbrtowc(&wc, "", 0, &ps) seems to return 0 instead of (size_t)(-2). I think this is a bug. We noticed this because a program stopped working when we tried to use glibc instead of libutf8_plug. Edmund

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main()
{
    mbstate_t ps;
    wchar_t wc;
    memset(&ps, 0, sizeof(ps));
    printf("%d\n", (int)mbrtowc(&wc, "", 0, &ps));
    return 0;
}

- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbrtowc(&wc, "", 0, &ps)
Markus Kuhn [EMAIL PROTECTED]: With glibc-2.2.3, mbrtowc(&wc, "", 0, &ps) seems to return 0 instead of (size_t)(-2). I think this is a bug. It is a bug in your software. You should never call mbrtowc with 0 as the number n of bytes that mbrtowc is allowed to examine at most. Such a call seems useless, and the standard does not define the behaviour of mbrtowc in that case. One could argue - and I probably would agree - that (size_t)(-2) might be an aesthetically more pleasing return value in that situation, but that is not really a requirement of ISO/IEC 9899:1999(E), §7.24.6.3.2 on page 388. I don't have that document. Could you quote the bit that says that n mustn't be zero? Perhaps someone should write a tutorial on common pitfalls with the restartable multi-byte functions. A list of common mistakes would certainly be helpful, but the priority should be to provide correct man pages. The man page I looked at said nothing about n not being zero, so I assumed I didn't have to check n myself. The code in question is for boot floppies, so I deliberately avoid performing unnecessary checks, which, in another application, I might be happy to do for safety. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: a couple of glibc bugs
Bruno Haible [EMAIL PROTECTED]: Secondly, if you have LANG=fr LANGUAGE=de then you get German messages but nl_langinfo(YESEXPR) and nl_langinfo(NOEXPR) are French. This is confusing. LANGUAGE has an influence only on gettext. If you want to influence gettext() and nl_langinfo(YESEXPR), use LC_MESSAGES: LANG=fr_FR LC_MESSAGES=de_DE But the good thing about LANGUAGE is that it lets you specify a list of languages. LC_MESSAGES doesn't, as far as I know. Could YESEXPR be made to follow LANGUAGE without breaking some standard or convention? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: gettext-0.10.36 is released
Bram Moolenaar [EMAIL PROTECTED]: Yes. It is called bind_textdomain_codeset(), and is documented in the manual. Using this function I don't seem to be able to change the encoding once I have started using gettext. Is this a bug or a feature? I noticed that too. It was said to be fixed in the next version. It seems to be fixed in glibc's CVS, too. I patched my gettext-0.10.36 using the diffs from CVS and it seems to work now. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: gettext-0.10.36 is released
Bruno Haible [EMAIL PROTECTED]: Is there an official mechanism for telling gettext what the target charset is even when the locale is wrong, nl_langinfo is missing, or whatever? Yes. It is called bind_textdomain_codeset(), and is documented in the manual. Using this function I don't seem to be able to change the encoding once I have started using gettext. Is this a bug or a feature? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
reorder-after in locale definition
Can anyone help me with using reorder-after in the LC_COLLATE section of the locale definition? There aren't very many examples to copy, because only sv_SE seems to use it. I'm trying to say that ĉ should be treated like a separate letter between C and D, so I wrote this:

LC_COLLATE
copy "iso14651_t1"
collating-symbol <ccirc>
reorder-after <c>
<ccirc>
reorder-after <U0106>
<U0108> <ccirc>;<CIR>;<CAP>;IGNORE %
reorder-after <U0107>
<U0109> <ccirc>;<CIR>;<MIN>;IGNORE %
reorder-end
END LC_COLLATE

It seems to work for "eo_EO.UTF-8 UTF-8" in /etc/locale.gen, but it doesn't work for "eo_EO ISO-8859-3", because:

eo_EO:46: LC_COLLATE: cannot reorder after <U0106>: symbol not known
eo_EO:48: LC_COLLATE: cannot reorder after <U0107>: symbol not known

Presumably this is because U+0106 and U+0107 aren't present in ISO-8859-3. So, what should I do to make the same locale definition work in UTF-8 and ISO-8859-3? I admit that I don't really understand the purpose of the character specified on the same line as "reorder-after". Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: reorder-after in locale definition
Roozbeh Pournader [EMAIL PROTECTED]: http://anubis.dkuug.dk/jtc1/sc22/open/n2955.pdf Thanks for that. I was trying:

reorder-after <U0106>
<U0108> <ccirc>;<CIR>;<CAP>;IGNORE %
reorder-after <U0107>
<U0109> <ccirc>;<CIR>;<MIN>;IGNORE %

In fact I should have <U0043> and <U0063> instead of <U0106> and <U0107> to make [c-d] in regular expressions be equivalent to [cd]. As far as I can make out from a quick scan of the spec, only "<ccirc>;<CIR>;<CAP>;IGNORE" is used for collating strings, but the order of the lines matters for interpreting character ranges in regular expressions. Not all programs that use regular expressions are locale-sensitive in this way. I haven't investigated why. One program that does have locale-sensitive regular expressions is Mutt. At present a bug in hu_HU prevents [a-z] from working in that locale, but [a-z] seems to mean the same thing in all locales for egrep. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
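The locale-sensitivity of bracket ranges discussed above can be observed with POSIX regcomp()/regexec(). In the default "C" locale, [c-d] matches exactly c and d; under other locales the range follows the collation order, which is what the reorder-after lines influence. A small helper (the name matches is invented) for experimenting:

```c
#include <regex.h>
#include <stddef.h>

/* Compile an extended regex and test whether it matches anywhere in s.
 * Returns 1 on match, 0 on no match, -1 on a compile error.
 * Range interpretation inside [...] depends on the current LC_COLLATE. */
static int matches(const char *pattern, const char *s)
{
    regex_t re;
    int r;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    r = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return r == 0;
}
```

Calling setlocale(LC_ALL, "eo_EO.UTF-8") before matches() (if that locale is generated) is how one would check whether the ĉ reordering makes [c-d] match ĉ as well.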
Re: Unicode-HOWTO 1.0
[EMAIL PROTECTED] [EMAIL PROTECTED]: Unfortunately I am not quite sure what an ACM is. An ACM is "Application Charset Map", the same thing as the screen maps, but an ACM converts bytes to Unicode values. There must be a misunderstanding here about what a screen map is. and koi8r.uni is a unicode map, and contains ... You've confused me. As I understand it there are Application Charset Maps that map from an 8-bit character set to 16-bit UCS values. These are only used when the console is not in UTF-8 mode. And there are Screen Font Maps that map from 16-bit UCS values to font position (8 or 9 bits). I think "unimap" and "screen map" both mean the same as "SFM", but "SFM" is the preferred term nowadays. You have an ACM for each 8-bit charset/encoding you might want to use, and you have an SFM for each font. The font is then independent of the charset/encoding. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: iconv output utf-8 - utf-16, which one is wrong?
You could argue that putting a BOM is the application's duty, not iconv's business, but that would be painful for all applications which try to use iconv. And unlabelled data (e.g. files on a filesystem) shouldn't use UTF-16 or its variants in the first place; that's what UTF-8 is for. Well, the issue is that iconv() is also used for, say, text strings embedded in data. However, it sounds like the solution is simply to request UTF-16BE instead. So, UTF-16 gives you big-endian with BOM, UTF-16BE gives you big-endian without BOM and UTF-16LE gives you little-endian without BOM. How do I ask for the machine's native ordering with or without BOM? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
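The explicit-byte-order variants can be requested from iconv() directly; "UTF-16BE" yields big-endian output with no BOM. A minimal sketch, assuming a glibc-style iconv built into libc (the helper name to_utf16be is invented):

```c
#include <iconv.h>
#include <string.h>

/* Convert a UTF-8 string to UTF-16BE (big-endian, no BOM).
 * Returns the number of output bytes, or 0 on error. */
static size_t to_utf16be(const char *in, unsigned char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-16BE", "UTF-8");
    char *ip = (char *)in;               /* iconv wants non-const */
    char *op = (char *)out;
    size_t il = strlen(in), ol = outcap;

    if (cd == (iconv_t)-1)
        return 0;
    if (iconv(cd, &ip, &il, &op, &ol) == (size_t)-1) {
        iconv_close(cd);
        return 0;
    }
    iconv_close(cd);
    return outcap - ol;                  /* bytes actually produced */
}
```

With "UTF-16" as the target instead, glibc prepends a BOM and (at least traditionally) chooses big-endian order, which is exactly the behaviour being puzzled over in this thread.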
Re: iconv output utf-8 - utf-16, which one is wrong?
[EMAIL PROTECTED] [EMAIL PROTECTED]: Wprint (a postscript filter for Netscape/Mozilla printing output) is now, under FreeBSD, sending the "fffe" as a valid character because it does not expect it. Although it is easy to just skip it if it is present, I would like to know if it should be present at all. U+FEFF is the BOM (Byte Order Mark) or ZERO WIDTH NO-BREAK SPACE. It can in some circumstances be useful to have this at the beginning of a file or datastream to distinguish big-endian UTF-16 from little-endian UTF-16 (and from UTF-8, etc.). However, it can also be harmful, so I don't think iconv should be generating or interpreting BOMs by default. Should iconv perhaps have command-line arguments --bom-in and --bom-out or something similar? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
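Recognizing (or skipping) the mark amounts to checking the first two bytes of the data: FE FF indicates big-endian UTF-16, FF FE little-endian. A sketch (enum and function name invented for illustration):

```c
#include <stddef.h>

enum bom { BOM_NONE, BOM_BE, BOM_LE };

/* Inspect the first bytes of (possibly) UTF-16 data for a byte order
 * mark: U+FEFF serialized big-endian is FE FF, little-endian FF FE. */
static enum bom detect_bom(const unsigned char *p, size_t n)
{
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_LE;
    return BOM_NONE;
}
```

A consumer like Wprint could call this once at the start of the stream and, if a BOM is found, skip those two bytes before decoding; FF FE mid-stream would instead be the byte-swapped noncharacter U+FFFE, a sign the endianness guess is wrong.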
Re: [I18n] Default charset for locale (is UNICODE !)
Roozbeh Pournader [EMAIL PROTECTED]: Until then, I am looking forward to hearing reports from people who have already completely moved their Linux environment to UTF-8, i.e. who run their terminal emulators only in UTF-8 mode all day long. What does still break under UTF-8 and needs to be fixed? My main problem has been pine. First of all it doesn't pass 0x80-0x9F to the terminal, and second it doesn't have automatic charset conversion, so I have problems with messages in ISO-8859-x. In short, almost nothing works with pine. You could try Mutt (www.mutt.org) instead. Apart from reasonable handling of UTF-8 terminals Mutt has other advantages, too. See www.rano.org/mutt.html for the UTF-8 instructions ... Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: non-breaking space
Markus Kuhn [EMAIL PROTECTED]: Edmund GRIMLEY EVANS wrote on 2000-09-12 16:46 UTC: According to glibc's iswprint(160), a non-breaking space is not printable. Is this correct? Certainly not. NBSP is most definitely a printable character. Good. I'm glad to hear it. But even glibc-2.2 seems to think it's unprintable. Could this be fixed, please, Ulrich? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
non-breaking space
According to glibc's iswprint(160), a non-breaking space is not printable. Is this correct? Why is this so? To me, ' ' seems more similar to 'x' than ' ' is to 'x' ... Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: Unicode/UTF-8 support for man
Markus Kuhn [EMAIL PROTECTED]: My suggestion is that groff should offer a new -Twlocale, in which it formats a paragraph as a wchar_t text and then spits it out via wprintf() and friends. The C library will take care of converting this to UTF-8, Latin-1, ASCII, transliteration, etc. For each non-ASCII character in a paragraph, groff should query with wcwidth() how many ASCII character cells wide the character will be according to the locale. This should also take care of transliteration, i.e. wcwidth(0x2264) == 2 in case the locale includes ASCII transliteration and results in wputchar(0x2264) spitting out "<=". You seem to be suggesting that C library functions such as wprintf should do transliteration. But I thought these functions, like wcrtomb, only do reversible transformations between multibyte and wide character representations. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
ucs-fonts and Mozilla
Markus's UCS fonts seem to confuse Mozilla even if I rename the fonts.alias file. With ucs-fonts/ on the font path, Mozilla displays apparently double-width boxes instead of us-ascii chars in various places. One of those places is the box for the URL. This is with Mozilla M17 and a rather old version of Markus's fonts (but I don't suppose that makes any difference). It didn't seem to happen with M16, strangely enough. I'm using Debian 2.2 (potato). Has anyone else seen this problem? Has anyone seen M17 with Markus's fonts on the path without this problem? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn [EMAIL PROTECTED]: I see valuable binary data (PDF files, ZIP files, etc.) being destroyed almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF data conversion that supposedly smart software tries to perform on the fly. I foresee similar non-recoverable data conversion accidents if we try to establish software that wipes out malformed UTF-8 sequences without mercy and destroys all information that they might have contained. Here the problem is that the program is misconverting on the fly and not giving an error. If the program stopped with an error half way through, the user would know there was a problem and be able to do something about it. So, I don't think a UTF-8 decoder, as implemented in a library, should do anything other than give an error if it encounters malformed UTF-8. The user should be told that something has gone wrong. Clever reversible conversion of malformed sequences is more likely to hide a real problem, causing a bigger problem later, than to be helpful, I suspect. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn [EMAIL PROTECTED]: A) Emit a single U+FFFD per malformed sequence

We discussed this before. I can think of several ways of interpreting the phrase "malformed sequence". I think you probably mean either a single octet in the range 80..BF, or a single octet in the range FE..FF, or an octet in the range C0..FD followed by any number of octets in the range 80..BF such that it isn't correct UTF-8 and isn't followed by another octet in the range 80..BF. This is probably quite hard to implement consistently, and, as with semantics C, the UTF-8/UTF-16 length ratio is unbounded, which means in particular that you can't decode from a fixed-size buffer in the manner of mbrtowc.

B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

This is what I do in Mutt. It's easy to implement and works for any multibyte encoding; the program doesn't have to know about UTF-8. But you have to ask yourself: do I reset the mbstate_t when I replace a bad byte by U+FFFD? If you want consistency, you probably should, as otherwise the mbstate_t is undefined after mbrtowc gives EILSEQ.

C) Emit a U+FFFD only for the first malformed sequence in a run of malformed UTF-8 sequences

I don't think anyone will recommend this.

D) Emit a malformed UTF-16 sequence for every byte in a malformed UTF-8 sequence

Not much good if you're not converting to UTF-16.

So perhaps B should be the generally recommended way. However, I agree that a UTF-8 editor should be able to remember malformed UTF-8 sequences so that you can read in a file, edit part of it and write it out again without it all being rubbished. It's unfortunate that the current UTF-8 stuff for Emacs causes malformed UTF-8 files to be silently trashed. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
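Semantics B above can be sketched as a decode loop that emits one U+FFFD per bad byte and then restarts from the next byte. A tiny hand-rolled UTF-8 step function (1-4 byte forms only, no overlong check, names invented) stands in for mbrtowc() so the sketch does not depend on the locale; treating a truncated sequence at the end of the buffer as malformed is a simplification of the buffering a real program would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 character; (size_t)(-1) means malformed. */
static size_t step(uint32_t *wc, const unsigned char *p, size_t n)
{
    size_t len, i;
    uint32_t c;
    if (p[0] < 0x80)                { c = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
    else return (size_t)(-1);            /* invalid lead byte */
    if (len > n)
        return (size_t)(-1);             /* truncated: treated as malformed */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return (size_t)(-1);         /* bad continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    *wc = c;
    return len;
}

/* Semantics B: on a bad byte, emit U+FFFD, skip exactly one byte,
 * and restart decoding from a fresh state.  Returns output length. */
static size_t decode_replace(const unsigned char *p, size_t n, uint32_t *out)
{
    size_t i = 0, k = 0;
    while (i < n) {
        uint32_t wc;
        size_t r = step(&wc, p + i, n - i);
        if (r == (size_t)(-1)) { out[k++] = 0xFFFD; i++; }
        else                   { out[k++] = wc;     i += r; }
    }
    return k;
}
```

Because the loop advances by exactly one byte per error, output length is bounded by input length, so unlike semantics A it can decode from a fixed-size buffer in the manner of mbrtowc.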