Re: question on Linux UTF8 support
On Wed, Feb 01, 2006 at 07:58:58PM +0100, Danilo Segan wrote:
> Yesterday at 15:42, 問答無用 wrote:
> > You can prevent that just by only having UTF-8 locales on the machine.
> GNU systems allow users to install their own locales wherever they wish (even in $HOME) by setting the environment variable LOCPATH (and I18NPATH for locale source files themselves).

Yes, this is unfortunate. :) You may be interested in the C library replacement I'm working on. It provides only UTF-8 through the multibyte/wide character interface (although iconv is of course available for explicit conversions) and has highly optimized UTF-8 processing to minimize/eliminate the performance problems of using UTF-8 on older machines. I don't have regex working yet, but my estimate is that pattern matching will be at worst 5% slower than in 8-bit locales, and probably not measurably slower at all. Unlike other similar projects, it's under the LGPL and designed with long-term ABI stability in mind, so that it's actually a viable replacement for glibc on non-embedded systems. It's about 80% done as of now. I'll be sure to post announcements here when it's ready for testing, since serious UTF-8 users and LFS/DIY types are my main target audiences.

Rich

P.S. Apologies for trashing the non-ASCII characters in my reply. This is why I'm working on making a viable UTF-8 platform for myself...

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: question on Linux UTF8 support
On Wed, Feb 01, 2006 at 03:41:18PM -0500, [EMAIL PROTECTED] wrote:
> I don't think that's a problem for a fresh install. Are there any tools for converting existing file systems from one encoding to another? That's a non-trivial problem. Assuming that all of the characters in the source encoding map to distinct characters in the target encoding (let's assume for the moment that we're talking about ISO 8859-1 to UTF-8), then all of the file names can be converted. But here's the list of things that must happen:

I think we can safely assume the destination should always be UTF-8, at least in the view of people on this list. :) Are there source encodings that can't be mapped into UCS in a one-to-one manner? I thought compatibility characters had been added for all of those, even when the compatibility characters should not be needed.

> 1) All of the file names must be converted from the source encoding to the target encoding.
> 2) Any symbolic links must be converted such that they point to the renamed file or directory.

This is easy to automate.

> 3) Files that contain file or directory names will have to be converted. A couple of very obvious examples are /etc/passwd (for home directories) and /etc/fstab (for mount points).

This is even more difficult if users have non-ASCII characters in their passwords, since you'll need to crack all the passwords first. :) As for home directories, they should change if and only if usernames contain non-ASCII characters. It's at least obvious what to do. fstab? Do people really have mount points with non-ASCII names? I think it's rare enough that people who do can handle it manually. Unfortunately most people don't even know how to separate the basic Unix directory tree into partitions, much less make additional local structure.
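Retargeting the symlinks in step 2 really is easy to automate. A minimal sketch in Python, assuming (as above) an ISO 8859-1 to UTF-8 conversion and that every link target decodes cleanly in the source encoding:

```python
import os

def convert_symlink(link: bytes, src: str = "iso-8859-1", dst: str = "utf-8") -> None:
    """Re-point a symlink whose target has been renamed from src to dst encoding."""
    target = os.readlink(link)            # bytes in, bytes out
    new = target.decode(src).encode(dst)
    if new != target:
        os.remove(link)                   # a symlink's target cannot be edited in place
        os.symlink(new, link)
```

Since POSIX offers no way to rewrite a symlink's target, the link is removed and recreated; a real tool would also want to preserve ownership and do the replacement atomically (create the new link under a temporary name, then rename over the old one).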
What about the following: for each config directory (/etc, /usr/local/etc, etc.; pardon the pun) assume all files except a fixed list (such as ld.so.cache) are text files, and translate them from the old encoding to UTF-8 accordingly. Make backups, of course. This should cover all global config. Per-user config is much more difficult, yes. I would use a system like:

1. Back up all dotfiles from the user's homedir to ~/.old or such.
2. Use a heuristic to decide which dotfiles are text; 'file' would perhaps work well..?
3. Convert the ones identified as text.

This will require a little patience/tolerance from users who may need to fix things manually, but I would expect it to work alright in most cases. A much more annoying problem than config will be users' data files, which are likely to be in the old encoding (html, text, ...). For these the best thing to do would be to provide users with an easy script (or GUI tool if it's a system with X logins) to convert files.

> It's step 3 that's going to be the problem. While you can make a more or less complete list of system files that would have to be converted, each case would have to be considered for whether it was safe to convert the entire file or it was necessary to just convert file names.

I don't see why you would want to convert filenames but leave other data in the legacy encoding. Can you give examples? The only case I can see that would be difficult is text strings embedded in binary files.

> There is no way of identifying all of the scripts that might require conversion. And I don't want to think about going through each user's .bashrc, .profile and .emacs looking for all of the other files they load or run.

Any user who manually sources other files from their .profile or .emacs is savvy enough to convert their own files, I think. :)
BTW an alternate idea for the whole process may be for the conversion script to just make a TODO list for the sysadmin, listing things it finds that seem to need conversion, and leaving the actual changes to the admin (aside from the file renaming).

Rich
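The backup/identify/convert loop for dotfiles described above could be sketched like this in Python. The printable-byte ratio stands in for 'file', and the 0.95 threshold and the encoding names are my assumptions, not anything agreed in this thread:

```python
def looks_like_text(data: bytes) -> bool:
    """Crude stand-in for file(1): no NUL bytes and almost all bytes printable."""
    if b"\x00" in data:
        return False
    ok = sum(b in (9, 10, 13) or b >= 32 for b in data)   # tab/LF/CR or printable
    return not data or ok / len(data) > 0.95

def recode_file(path: bytes, src: str = "iso-8859-1", dst: str = "utf-8") -> bool:
    """Convert one dotfile in place if it looks like text; return True if converted."""
    with open(path, "rb") as f:
        data = f.read()
    if not looks_like_text(data):
        return False
    with open(path, "wb") as f:
        f.write(data.decode(src).encode(dst))
    return True
```

Note that every byte >= 32 counts as printable here, which is right for a Latin-1 source but too lenient in general; running 'file' on each candidate, as suggested, would be the more robust heuristic.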
Re: question on Linux UTF8 support
Yesterday at 15:42, 問答無用 wrote:
> You can prevent that just by only having UTF-8 locales on the machine.

GNU systems allow users to install their own locales wherever they wish (even in $HOME) by setting the environment variable LOCPATH (and I18NPATH for locale source files themselves). Basically, you want to ask all your users to use UTF-8 as the filesystem encoding.

Cheers, Danilo
Re: question on Linux UTF8 support
I don't think that's a problem for a fresh install. Are there any tools for converting existing file systems from one encoding to another? That's a non-trivial problem. Assuming that all of the characters in the source encoding map to distinct characters in the target encoding (let's assume for the moment that we're talking about ISO 8859-1 to UTF-8), then all of the file names can be converted. But here's the list of things that must happen:

1) All of the file names must be converted from the source encoding to the target encoding.
2) Any symbolic links must be converted such that they point to the renamed file or directory.
3) Files that contain file or directory names will have to be converted. A couple of very obvious examples are /etc/passwd (for home directories) and /etc/fstab (for mount points).

It's step 3 that's going to be the problem. While you can make a more or less complete list of system files that would have to be converted, each case would have to be considered for whether it was safe to convert the entire file or it was necessary to just convert file names. There is no way of identifying all of the scripts that might require conversion. And I don't want to think about going through each user's .bashrc, .profile and .emacs looking for all of the other files they load or run.

- Original Message -
From: Danilo Segan [EMAIL PROTECTED]
Date: Wednesday, February 1, 2006 1:58 pm
Subject: Re: question on Linux UTF8 support

> Basically, you want to ask all your users to use UTF-8 as the filesystem encoding.
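Step 1 is mechanical as long as the stated assumption holds (every name decodes cleanly in the source encoding). A sketch in Python, walking bottom-up so that files and subdirectories are renamed before their parent directories:

```python
import os

def convert_names(root: bytes, src: str = "iso-8859-1", dst: str = "utf-8") -> None:
    """Rename every file and directory under root from src to dst encoding."""
    # topdown=False visits leaves first, so each dirpath still has its old name
    # while its children are being renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new = name.decode(src).encode(dst)
            if new != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new))
```

A real tool would also have to cope with names that fail to decode, collisions after conversion, and the symlink and file-content problems of steps 2 and 3.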
Re: question on Linux UTF8 support
On Tue, 2006-01-31 at 06:40 +0000, Sheshadrivasan B wrote:
> So essentially, what this amounts to is that you cannot prevent junk being displayed when a user does an ls at the prompt. Essentially users are shooting each other in the foot as far as display of file names is concerned, right? Shesh.

You can prevent that just by only having UTF-8 locales on the machine.
Re: question on Linux UTF8 support
> No. Different users might be running different locales, and those mentioned old applications might assume filenames to be in users' locale encodings. Of course, if some user switches locales often, then all kinds of mess-ups might occur, unless she's consistently using UTF-8 (or another language-agnostic encoding) for naming files.

So essentially, what this amounts to is that you cannot prevent junk being displayed when a user does an ls at the prompt. Essentially users are shooting each other in the foot as far as display of file names is concerned, right?

Shesh.
Re: question on Linux UTF8 support
Bruno Haible [EMAIL PROTECTED] writes:
> > > The GNU tar maintainer is working on a GNU pax program. Maybe he will also provide a command-line option for GNU tar that would perform the same filename charset conversions (suitable for 'tar' archives with UTF-8 filenames)?
> > It has already been implemented. The current version of GNU tar (1.15.1) performs this conversion automatically when operating on an archive file in pax format.
> Thanks, indeed that works: when I create a .pax file (*) in a UTF-8 locale and use GNU tar 1.15.1 to unpack it in an ISO-8859-15 locale, the filenames are correctly converted. But it is hard to switch the general distribution of tar files to pax format, because - while a tar as old as GNU tar 1.11p supports pax files with just a warning, and AIX, HP-UX and IRIX tar similarly - the Solaris and OSF/1 /usr/bin/tar refuse to unpack them.

Possibly relevant: if you use GNU Automake 1.9 or greater, use the tar-pax option to force the creation of PAX archives. Other possibilities are tar-ustar and tar-v7. I've been using pax archives for quite a while now, with no complaints. If you still require portability to older systems, one of the other options is likely a better choice.

Regards, Roger

- --
Roger Leigh
Printing on GNU/Linux? http://gimp-print.sourceforge.net/
Debian GNU/Linux http://www.debian.org/
GPG Public Key: 0x25BFB848. Please sign and encrypt your mail.
Re: question on Linux UTF8 support
Sergey Poznyakoff wrote:
> > The GNU tar maintainer is working on a GNU pax program. Maybe he will also provide a command-line option for GNU tar that would perform the same filename charset conversions (suitable for 'tar' archives with UTF-8 filenames)?
> It has already been implemented. The current version of GNU tar (1.15.1) performs this conversion automatically when operating on an archive file in pax format.

Thanks, indeed that works: when I create a .pax file (*) in a UTF-8 locale and use GNU tar 1.15.1 to unpack it in an ISO-8859-15 locale, the filenames are correctly converted. But it is hard to switch the general distribution of tar files to pax format, because - while a tar as old as GNU tar 1.11p supports pax files with just a warning, and AIX, HP-UX and IRIX tar similarly - the Solaris and OSF/1 /usr/bin/tar refuse to unpack them. Could you add an option to GNU tar so that it performs the filename conversion _also_ when reading or creating archives in 'tar' format?

Bruno

(*) It's funny that to create a .pax file I have to use tar -H pax, because pax on my system is OpenBSD's pax, which rejects the option -x pax: it can only create cpio and tar archives, despite its name :-)
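The pax behaviour discussed here can also be seen from Python's tarfile module, which writes pax archives (format=tarfile.PAX_FORMAT) and stores non-ASCII member names in UTF-8 extended headers. A small round-trip sketch; this exercises the same archive format, not GNU tar itself:

```python
import io
import tarfile

buf = io.BytesIO()
# Write a pax archive containing one member with a non-ASCII name.
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
    info = tarfile.TarInfo(name="Süß.txt")
    payload = b"hello"
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

# Read it back: the name round-trips intact, independent of locale.
buf.seek(0)
with tarfile.open(fileobj=buf) as tf:
    print(tf.getnames())
```

Because pax fixes the on-archive name encoding to UTF-8, it is the reader's job (GNU tar 1.15.1 in the messages above) to convert names to the local LC_CTYPE encoding on extraction.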
Re: question on Linux UTF8 support
Bruno Haible [EMAIL PROTECTED] wrote:
> This problem should be solved by the 'pax' archive format. It specifies that file names in it are stored in UTF-8. This means the filenames are converted from/to LC_CTYPE's encoding during packing and unpacking. The GNU tar maintainer is working on a GNU pax program. Maybe he will also provide a command-line option for GNU tar that would perform the same filename charset conversions (suitable for 'tar' archives with UTF-8 filenames)?

It has already been implemented. The current version of GNU tar (1.15.1) performs this conversion automatically when operating on an archive file in pax format.

Regards, Sergey
Re: question on Linux UTF8 support
Last Wednesday at 20:36, Bruno Haible wrote:
> > Or even worse, what if the administrator provides some dirs for the user in an encoding different from the one the user wants to use? E.g. imagine having a global /Müsik in ISO-8859-1, and a user desires to use UTF-8 or ISO-8859-5.
> For this directory to be useful for different users, the files that it contains have to be in the same encoding. (If a user put the titles or lyrics of a song there in ISO-8859-5, and another user wants to see them in his UTF-8 locale, there will be a mess.) So a requirement for using a common directory is _anyway_ that all users are in locales with the same encoding.

Yeah, with the difference that in just ONE of those encodings all users will be able to use whatever characters they wish, provided everybody knows that file names use that encoding. That's what I was arguing anyway: the file name encoding should be per-system, not per-user, and the most suitable of all encodings for a per-system encoding is UTF-8.

> All that you say about the file names is also valid for the file contents. A lot of them are in plain text, and filenames are easily converted into plain text. But all POSIX compliant applications have their interpretation of plain text guided by LC_CTYPE et al.

Indeed. That's a big problem of external metadata which is not commonly transferred along with the data itself.

> However, when you recommend to an application author that his application should consider all filenames as being UTF-8, this is not an improvement. It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and KOI8-R users.

You're right that it's not an improvement for them, but what are we going to do with filenames extracted from e.g. a tar file? If I send them a tar file with UTF-8 (or KOI8-R) encoded filenames, they're going to see a mess (or get their terminal to hang).
Yeah, that's solving the wrong problem (I want metadata attached to everything :), but the brokenness is already there, and standardising on one encoding (if it's suitable for everybody) is still a step forward: we'll break some things in the process, but improve others. After a while, we'll be there. Recommending that LC_CTYPE use UTF-8 is an attempt to lower the brokenness rate once the switch is finally done. I don't know if such a thing will ever really be done (compatibility, the large amount of already-encoded data implying big transition costs, etc.), but I can at least dream about it ;-)

Cheers, Danilo
Re: question on Linux UTF8 support
Danilo Segan wrote:
> > 2. Is there any known application which still uses ISO-8859-XXX codesets for creating file names?
> Many old (and new?) applications use the current character set on the system (set through e.g. LC_CTYPE, or other LC_* variables). I'd suggest all new applications use UTF-8.

This will mess up users who have their LC_CTYPE set to a non-UTF-8 encoding. It is weird if a user, in an application, enters a new file name Süß, and then in a terminal, the filename appears as SÃ¼Ã (wow, it even hangs my xterm!). It is just as bad as those old Motif applications which assume that everything is ISO-8859-1. This makes these applications useless in UTF-8 locales. In summary, I'd suggest
- that ALL applications follow LC_ALL/LC_CTYPE/LANG, as POSIX specifies,
- that users switch to a UTF-8 locale when they want.

Bruno
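The Süß example is the ordinary UTF-8-written, Latin-1-displayed mismatch, easy to reproduce; a two-line Python sketch (the archived mail's own garbling of the mojibake is a casualty of exactly this):

```python
name = "Süß"
raw = name.encode("utf-8")         # bytes a UTF-8 application stores on disk
shown = raw.decode("iso-8859-1")   # how an ISO-8859-1 terminal interprets them
print(shown)                       # 'SÃ¼Ã' followed by the C1 control byte 0x9f
```

The trailing 0x9f is a C1 control code in ISO-8859-1, which is why a terminal can do worse than print garbage; control bytes in misdecoded UTF-8 are a plausible cause of the hanging xterm mentioned above.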
Re: question on Linux UTF8 support
Hi Bruno,

Today at 17:24, Bruno Haible wrote:
> This will mess up users who have their LC_CTYPE set to a non-UTF-8 encoding. It is weird if a user, in an application, enters a new file name Süß, and then in a terminal, the filename appears as SÃ¼Ã (wow, it even hangs my xterm!).

Oh, indeed. But what about a user deciding to change LC_CTYPE? Or even worse, what if the administrator provides some dirs for the user in an encoding different from the one the user wants to use? E.g. imagine having a global /Müsik in ISO-8859-1, and a user desires to use UTF-8 or ISO-8859-5. Now not only will it be weird (and possibly even hang your xterm!), you'd be in a mess if you tried to fix it. My point is that the filesystem encoding should be filesystem-wide (not per-user), because that's the only way to guarantee that it won't break. And in the sense of the POSIX API, UTF-8 makes the most sense as a single, backwards compatible filesystem encoding (well, it wasn't originally called UTF-FS for no reason :), which can work for everybody.

> It is just as bad as those old Motif applications which assume that everything is ISO-8859-1. This makes these applications useless in UTF-8 locales.

No, it's not. UTF-8 can encode all characters, so you'd be able to use whatever characters you wish, give or take a conversion step. ISO-8859-1 limits you not only at the implementation-details step, but also on features.

> In summary, I'd suggest - that ALL applications follow LC_ALL/LC_CTYPE/LANG, as POSIX specifies, - that users switch to a UTF-8 locale when they want.

That's not closer to ever solving the problem. It's the status quo. I think we should at least recommend improvements, if not require them (and nobody suggested requiring them). Basically, my recommendation was to set LC_CTYPE to UTF-8 on all new systems.

Cheers, Danilo
Re: question on Linux UTF8 support
Danilo Segan wrote:
> what about a user deciding to change LC_CTYPE?

A user who switches to a different LC_CTYPE, or works in two different LC_CTYPEs in parallel, will need to convert his plain text files when moving them from one world to the other. It is not much more effort to also convert the file names at the same moment.

> Or even worse, what if the administrator provides some dirs for the user in an encoding different from the one the user wants to use? E.g. imagine having a global /Müsik in ISO-8859-1, and a user desires to use UTF-8 or ISO-8859-5.

For this directory to be useful for different users, the files that it contains have to be in the same encoding. (If a user put the titles or lyrics of a song there in ISO-8859-5, and another user wants to see them in his UTF-8 locale, there will be a mess.) So a requirement for using a common directory is _anyway_ that all users are in locales with the same encoding.

> My point is that the filesystem encoding should be filesystem-wide (not per-user)

All that you say about the file names is also valid for the file contents. A lot of them are in plain text, and filenames are easily converted into plain text. But all POSIX compliant applications have their interpretation of plain text guided by LC_CTYPE et al.

> That's not closer to ever solving the problem. It's the status quo. I think we should at least recommend improvements, if not require them (and nobody suggested requiring them). Basically, my recommendation was to set LC_CTYPE to UTF-8 on all new systems.

We have the same goal, namely to let all users use UTF-8 and get rid of any user-visible character set conversions. I agree with the recommendations that you make to users and sysops. However, when you recommend to an application author that his application should consider all filenames as being UTF-8, this is not an improvement. It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and KOI8-R users.
Bruno
Re: question on Linux UTF8 support
> However, when you recommend to an application author that his application should consider all filenames as being UTF-8, this is not an improvement. It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and KOI8-R users.

Perhaps that is too conservative. Any effort spent supporting legacy encodings, or being prepared to perform charset conversions on input, seems wasteful to me (even to support alternative Unicode encodings). Locales are still useful, but I think locales should not specify an encoding. There are a lot of benefits to be gained, in the form of simplicity and interoperability, when applications are free to assume that all text they might encounter will be UTF-8 encoded. Common protocols and file formats shouldn't even have to specify what encoding text is in, imo; by specifying, they are allowing for the possibility that it might be different, and that an application may have to deal with charset conversion etc. System-wide messages, the login screen, the filesystem, GECOS fields, .plans, /etc/issue, /etc/motd, etc. are examples where I think a common enforced encoding would be beneficial. The alternative, such as tagging metadata onto the filesystem layer, individual inodes, individual file metadata descriptors, etc., seems far uglier in comparison (imagine a file whose name is in one encoding, metadata in a second, and content in yet a third :(

IDN URLs are another good example. It's clearly preferable to have a URL be both canonical (byte for byte) as well as in readable (i.e. non-punycode) form. If a user provides an IDN URI to the system or another user in an unexpected encoding, the resource would be unresolvable.
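The IDN point can be illustrated with Python's built-in 'idna' codec (IDNA2003; the domain below is just an example of mine): the punycode form is the canonical ASCII byte string used on the wire, while the Unicode form is the readable one, and the two round-trip only if everyone agrees on how the readable form is encoded:

```python
domain = "bücher.example"
wire = domain.encode("idna")   # canonical ASCII (punycode) form for DNS
print(wire)
print(wire.decode("idna"))     # back to the readable Unicode form
```

If the readable form had instead been handed over as Latin-1 bytes and misread as UTF-8 (or vice versa), the punycode produced would differ and the name would not resolve, which is exactly the failure described above.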
Re: question on Linux UTF8 support
Yesterday at 13:07, praveen kumar sivapuram wrote:
> I am developing an application, which needs to know the format of the file name. Based on the documents referenced on the web, I understood that the Linux file system does not impose any specific format for file names. Users can create files using either UTF-8 or ISO-X-X.

And even using any of the other hundred encodings compatible with ASCII (FWIW, you can probably use some non-ASCII-compatible encodings as well, but you may run into some problems with some special characters).

> Here are my questions:
> 1. Is there any way to identify whether the filename is UTF8 or non-UTF8?

If the entire file name is a valid UTF-8 string, then it's likely to be UTF-8 (it's possible that it's not really UTF-8, but very unlikely!). To check whether a string is valid UTF-8, just consult the UTF-8 RFC.

> 2. Is there any known application which still uses ISO-8859-XXX codesets for creating file names?

Many old (and new?) applications use the current character set on the system (set through e.g. LC_CTYPE, or other LC_* variables). I'd suggest all new applications use UTF-8.

> 3. Can I safely assume that all files on a particular system will be in the same format? i.e. all file names will be either UTF8 or any other ISO codeset?

No. Different users might be running different locales, and those mentioned old applications might assume filenames to be in users' locale encodings. Of course, if some user switches locales often, then all kinds of mess-ups might occur, unless she's consistently using UTF-8 (or another language-agnostic encoding) for naming files. In practice, for most single-user systems, you can make such an assumption.

Cheers, Danilo
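The validity check in answer 1 is a one-liner in most languages; a Python sketch, whose strict UTF-8 decoder enforces the RFC 3629 rules (including rejection of overlong forms):

```python
def is_valid_utf8(name: bytes) -> bool:
    """True if the byte string decodes as strict UTF-8 (per RFC 3629)."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# A Latin-1 name is rejected, its UTF-8 equivalent accepted:
print(is_valid_utf8(b"S\xfc\xdf"))          # False: Latin-1 "Süß"
print(is_valid_utf8(b"S\xc3\xbc\xc3\x9f"))  # True: UTF-8 "Süß"
```

As noted in the answer, a positive result is only probabilistic: a short name in another 8-bit encoding can accidentally be well-formed UTF-8, and any all-ASCII name is trivially valid in both.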
question on Linux UTF8 support
Hi,

I have the following question related to Linux file system support for UTF-8. I am not sure whether this is the right mailing list to post this question; if it is not, please advise me. I am developing an application which needs to know the format of the file name. Based on the documents referenced on the web, I understood that the Linux file system does not impose any specific format for file names. Users can create files using either UTF-8 or ISO-X-X. Here are my questions:

1. Is there any way to identify whether the filename is UTF8 or non-UTF8?
2. Is there any known application which still uses ISO-8859-XXX codesets for creating file names?
3. Can I safely assume that all files on a particular system will be in the same format? i.e. all file names will be either UTF8 or any other ISO codeset?

Any help is greatly appreciated.

Thanks,
Praveen