Re: question on Linux UTF8 support

2006-02-02 Thread Rich Felker
On Wed, Feb 01, 2006 at 03:41:18PM -0500, [EMAIL PROTECTED] wrote:
> I don't think that's a problem for a fresh install.  Are there any tools
> for converting existing file systems from one encoding to another? 
> That's a non-trivial problem.  Assuming that all of the characters in
> the source encoding map to distinct characters in the target encoding
> (let's assume for the moment that we're talking about ISO 8859-1 to
> UTF-8), then all of the file names can be converted.  But here's the
> list of things that must happen:

I think we can safely assume the destination should always be UTF-8,
at least in the view of people on this list. :) Are there source
encodings that can't be mapped into UCS in a one-to-one manner? I
thought compatibility characters had been added for all of those, even
when the compatibility characters should not be needed.

> 1) All of the file names must be converted from the source encoding to
> the target encoding.
> 
> 2) Any symbolic links must be converted such that they point to the
> renamed file or directory.

This is easy to automate.
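For step 1 there is already convmv, which converts names recursively
and defaults to a dry run (e.g. "convmv -f iso-8859-1 -t utf8 -r
--notest /home" to actually rename). Step 2 could then be a small pass
along these lines; a rough, untested sketch assuming an ISO-8859-1
source encoding and newline-free link names:

    # Re-point symlinks after the rename pass (ISO-8859-1 -> UTF-8).
    find / -xdev -type l | while IFS= read -r link; do
        old=$(readlink "$link")
        new=$(printf '%s' "$old" | iconv -f ISO-8859-1 -t UTF-8) || continue
        [ "$old" = "$new" ] && continue
        rm -- "$link" && ln -s -- "$new" "$link"
    done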

> 3) Files that contain file or directory names will have to be converted.
>  A couple of very obvious examples are /etc/passwd (for home
> directories) and /etc/fstab for mount points).

This is even more difficult if users have non-ASCII characters in
their passwords, since you'll need to crack all the passwords first. :)

As for home directories, they should change if and only if usernames
contain non-ASCII characters. At least it's obvious what to do there.
fstab? Do people really have mount points with non-ASCII names? I
think it's rare enough that the people who do can handle it manually.
Unfortunately most people don't even know how to separate the basic
Unix directory tree into partitions, much less create additional local
structure.

What about the following: for each config directory (/etc,
/usr/local/etc, etc.; pardon the pun) assume all files except a fixed
list (such as ld.so.cache) are text files, and translate them from the
old encoding to UTF-8 accordingly. Make backups, of course. This
should cover all global config.
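Something like this per directory, roughly (untested; the skip list
and the source encoding are illustrative only):

    # Convert presumed-text config files in place, keeping backups.
    for dir in /etc /usr/local/etc; do
        find "$dir" -type f | while IFS= read -r f; do
            case "$f" in
                */ld.so.cache) continue ;;   # known binary; extend as needed
            esac
            cp -p "$f" "$f.pre-utf8" &&      # the backup
            iconv -f ISO-8859-1 -t UTF-8 "$f.pre-utf8" > "$f"
        done
    done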

Per-user config is much more difficult, yes. I would use a system like
the following (rough sketch after the list):
1. Back up all dotfiles from the user's home directory to ~/.old or such.
2. Use a heuristic to decide which dotfiles are text; 'file' would
   perhaps work well?
3. Convert the ones identified as text.
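Roughly, and again assuming an ISO-8859-1 source (hypothetical and
untested; 'file' is only a heuristic, which is why the backups matter):

    # Per-user pass over dotfiles, following steps 1-3 above.
    cd "$HOME" || exit 1
    mkdir -p .old
    for f in .[!.]*; do
        [ -f "$f" ] || continue              # plain files only
        cp -p "$f" ".old/$f"                 # step 1: backup
        case $(file -b "$f") in
            *text*)                          # step 2: 'file' heuristic
                iconv -f ISO-8859-1 -t UTF-8 ".old/$f" > "$f" ;;   # step 3
        esac
    done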

This will require a little patience/tolerance from users, who may need
to fix things manually, but I would expect it to work all right in most
cases. A much more annoying problem than config will be users' data
files, which are likely to be in the old encoding (HTML, text, ...).
For these the best thing to do would be to provide users with an easy
script (or gui tool if it's a system with X logins) to convert files.

> It's step 3 that's going to be the problem.  While you can make a more
> or less complete list of system files that would have to be converted,
> each case would have to be considered for whether it was safe to convert
> the entire file or it was necessary to just convert file names.  There

I don't see why you would want to convert filenames but leave other
data in the legacy encoding. Can you give examples? The only case I
can see that would be difficult is text strings embedded in binary
files.

> is no way of identifying all of the scripts that might require
> conversion.  And I don't want to think about going through each user's
> .bashrc, .profile and .emacs looking for all of the other files they
> load or run.

Any user who manually sources other files from their .profile or
.emacs is savvy enough to convert their own files, I think. :)

BTW an alternate idea for the whole process may be for the conversion
script to just make a "TODO list" for the sysadmin, listing things it
finds that seem to need conversion, and leaving the actual changes to
the admin (aside from the file renaming).
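The name scan itself is mechanical: anything iconv refuses to read
back as UTF-8 still needs attention. A sketch (the output path is
made up):

    # Build a TODO list of paths whose names are not yet valid UTF-8.
    find / -xdev 2>/dev/null | while IFS= read -r p; do
        printf '%s' "$p" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 ||
            printf 'needs conversion: %s\n' "$p"
    done > /root/utf8-todo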

Rich





Re: question on Linux UTF8 support

2006-02-02 Thread Rich Felker
On Wed, Feb 01, 2006 at 07:58:58PM +0100, Danilo Segan wrote:
> Yesterday at 15:42, 問答無用 wrote:
> 
> > You can prevent it just by having only UTF-8 locales on the machine.
> 
> GNU systems allow users to install their own locales wherever they
> wish (even in $HOME) by setting environment variable LOCPATH (and
> I18NPATH for locale source files themselves).

Yes, this is unfortunate. :)

You may be interested in the C library replacement I'm working on.
It provides only UTF-8 through the multibyte/wide character interface
(although iconv is of course available for explicit conversions) and
has highly optimized UTF-8 processing to minimize/eliminate the
performance problems of using UTF-8 on older machines. I don't have
regex working yet, but my estimate is that pattern matching will be at
worst 5% slower than in 8-bit locales, and probably not measurably slower
at all.

Unlike other similar projects it's under LGPL and designed with
long-term ABI stability in mind so that it's actually a viable
replacement for glibc on non-embedded systems. It's about 80% done as
of now. I'll be sure to post announcements here when it's ready for
testing since serious UTF-8 users and LFS/DIY types are my main target
audiences.


Rich

P.S. Apologies for trashing the non-ASCII characters in my reply. This
is why I'm working on making a viable UTF-8 platform for myself...





Re: question on Linux UTF8 support

2006-02-01 Thread dsplat
I don't think that's a problem for a fresh install.  Are there any tools
for converting existing file systems from one encoding to another? 
That's a non-trivial problem.  Assuming that all of the characters in
the source encoding map to distinct characters in the target encoding
(let's assume for the moment that we're talking about ISO 8859-1 to
UTF-8), then all of the file names can be converted.  But here's the
list of things that must happen:

1) All of the file names must be converted from the source encoding to
the target encoding.

2) Any symbolic links must be converted such that they point to the
renamed file or directory.

3) Files that contain file or directory names will have to be converted.
 A couple of very obvious examples are /etc/passwd (for home
directories) and /etc/fstab for mount points).

It's step 3 that's going to be the problem.  While you can make a more
or less complete list of system files that would have to be converted,
each case would have to be considered for whether it was safe to convert
the entire file or it was necessary to just convert file names.  There
is no way of identifying all of the scripts that might require
conversion.  And I don't want to think about going through each user's
.bashrc, .profile and .emacs looking for all of the other files they
load or run.

- Original Message -
From: Danilo Segan <[EMAIL PROTECTED]>
Date: Wednesday, February 1, 2006 1:58 pm
Subject: Re: question on Linux UTF8 support

> Basically, you want to "ask" of all your users to use UTF-8 as
> filesystem encoding.





Re: question on Linux UTF8 support

2006-02-01 Thread Danilo Segan
Yesterday at 15:42, 問答無用 wrote:

> You can prevent it just by having only UTF-8 locales on the machine.

GNU systems allow users to install their own locales wherever they
wish (even in $HOME) by setting environment variable LOCPATH (and
I18NPATH for locale source files themselves).
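For example (glibc; the locale name and paths are illustrative):

    # Compile a private locale and point the C library at it.
    mkdir -p "$HOME/locale"
    localedef -i en_US -f UTF-8 "$HOME/locale/en_US.UTF-8"
    export LOCPATH="$HOME/locale"
    export LC_ALL=en_US.UTF-8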

Basically, you want to "ask" of all your users to use UTF-8 as
filesystem encoding.

Cheers,
Danilo




Re: question on Linux UTF8 support

2006-01-31 Thread 問答無用

On Tue, 2006-01-31 at 06:40 +, Sheshadrivasan B wrote:

> So essentially, what this amounts to is that: you cannot prevent
> junk from being displayed when a user does an "ls" at the prompt.
> Essentially, users are shooting each other in the foot as far as
> the display of file names is concerned.  Right?
> Shesh.

You can prevent it just by having only UTF-8 locales on the machine.





Re: question on Linux UTF8 support

2006-01-30 Thread Sheshadrivasan B

Danilo Segan wrote:

> No.  Different users might be running different locales, and the
> aforementioned "old" applications might assume filenames to be in the
> users' locale encodings.  Of course, if some user switches locales
> often, then all kinds of mess-ups might occur, unless she's
> consistently using UTF-8 (or another language-agnostic encoding) for
> naming files.

So essentially, what this amounts to is that: you cannot prevent
junk from being displayed when a user does an "ls" at the prompt.
Essentially, users are shooting each other in the foot as far as
the display of file names is concerned.  Right?
Shesh.






Re: question on Linux UTF8 support

2005-08-13 Thread Roger Leigh

Bruno Haible <[EMAIL PROTECTED]> writes:

> Sergey Poznyakoff wrote:
>> > The GNU tar maintainer is working on a GNU pax program. Maybe he will also
>> > provide a command-line option for GNU tar that would perform the same
>> > filename charset conversions (suitable for 'tar' archives with UTF-8
>> > filenames)?
>>
>> It has already been implemented.
>>
>> The current version of GNU tar (1.15.1) performs this conversion
>> automatically when operating on an archive file in pax format.
>
> Thanks, indeed that works: When I create a .pax file (*) in an UTF-8 locale
> and use GNU tar 1.15.1 to unpack it in an ISO-8859-15 locale, the filenames
> are correctly converted.
>
> But it is hard to switch the general distribution of tar files to pax format,
> because - while a tar as old as GNU tar 1.11p supports pax files with just
> a warning, as do the AIX, HP-UX and IRIX tars - the Solaris and OSF/1
> /usr/bin/tar refuse to unpack them.

Possibly relevant: If you use GNU Automake 1.9 or greater, use the
"tar-pax" option to force the creation of PAX archives.  Other
possibilities are "tar-ustar" and "tar-v7".  I've been using pax
archives for quite a while now, with no complaints.  If you still
require portability to older systems, one of the other options is
likely a better choice.
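For reference, that is a one-line change in configure.ac (the version
requirement shown is illustrative):

    # configure.ac: make "make dist" produce pax-format tarballs.
    AM_INIT_AUTOMAKE([1.9 tar-pax])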


Regards,
Roger

-- 
Roger Leigh
Printing on GNU/Linux?  http://gimp-print.sourceforge.net/
Debian GNU/Linux        http://www.debian.org/
GPG Public Key: 0x25BFB848.  Please sign and encrypt your mail.




Re: question on Linux UTF8 support

2005-08-10 Thread Bruno Haible
Sergey Poznyakoff wrote:
> > The GNU tar maintainer is working on a GNU pax program. Maybe he will also
> > provide a command-line option for GNU tar that would perform the same
> > filename charset conversions (suitable for 'tar' archives with UTF-8
> > filenames)?
>
> It has already been implemented.
>
> The current version of GNU tar (1.15.1) performs this conversion
> automatically when operating on an archive file in pax format.

Thanks, indeed that works: When I create a .pax file (*) in an UTF-8 locale
and use GNU tar 1.15.1 to unpack it in an ISO-8859-15 locale, the filenames
are correctly converted.
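Roughly, the round trip looks like this (the locale and file names are
illustrative):

    # Pack under a UTF-8 locale; -H selects the pax archive format.
    LC_ALL=de_DE.UTF-8 tar -H pax -cf songs.pax Müsik/
    # Unpack under a Latin-9 locale; GNU tar 1.15.1 converts the names.
    LC_ALL=de_DE.ISO-8859-15 tar -xf songs.pax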

But it is hard to switch the general distribution of tar files to pax format,
because - while a tar as old as GNU tar 1.11p supports pax files with just
a warning, as do the AIX, HP-UX and IRIX tars - the Solaris and OSF/1
/usr/bin/tar refuse to unpack them.

Could you add an option to GNU tar, so that it performs the filename
conversion _also_ when reading or creating archives in 'tar' format?

Bruno


(*) It's funny that to create a .pax file I have to use "tar -H pax", because
"pax" on my system is OpenBSD's pax, which rejects the option "-x pax": it
can only create cpio and tar archives, despite its name :-)





Re: question on Linux UTF8 support

2005-08-09 Thread Sergey Poznyakoff
Bruno Haible <[EMAIL PROTECTED]> wrote:

> This problem should be solved by the 'pax' archive format. It specifies that
> file names in it are stored in UTF-8. This means the filenames are converted
> from/to LC_CTYPE's encoding during packing and unpacking.
> 
> The GNU tar maintainer is working on a GNU pax program. Maybe he will also
> provide a command-line option for GNU tar that would perform the same
> filename charset conversions (suitable for 'tar' archives with UTF-8
> filenames)?

It has already been implemented.

The current version of GNU tar (1.15.1) performs this conversion
automatically when operating on an archive file in pax format.

Regards,
Sergey




Re: question on Linux UTF8 support

2005-08-08 Thread Bruno Haible
Danilo Segan wrote:
> what are we
> going to do with filenames extracted from eg. a tar file?  If I send
> them a tar file with UTF-8 (or KOI8-R) encoded filenames, they're
> going to see a mess (or get their terminal to hang).

This problem should be solved by the 'pax' archive format. It specifies that
file names in it are stored in UTF-8. This means the filenames are converted
from/to LC_CTYPE's encoding during packing and unpacking.

The GNU tar maintainer is working on a GNU pax program. Maybe he will also
provide a command-line option for GNU tar that would perform the same
filename charset conversions (suitable for 'tar' archives with UTF-8
filenames)?

Bruno





Re: question on Linux UTF8 support

2005-08-06 Thread Danilo Segan
Last Wednesday at 20:36, Bruno Haible wrote:

>> Or even
>> worse, what if the administrator provides some dirs for the user in an
>> encoding different from the one the user wants to use?
>>
>> E.g. imagine having a global "/Müsik" in ISO-8859-1, while the user
>> wants to use UTF-8 or ISO-8859-5.
>
> For this directory to be useful for different users, the files that it
> contains have to be in the same encoding. (If a user put the titles or
> lyrics of a song there in ISO-8859-5, and another user wants to see them
> in his UTF-8 locale, there will be a mess.) So a requirement for using
> a common directory is _anyway_ that all users are in locales with the
> same encoding.

Yeah, with the difference that in just ONE of those encodings all
users will be able to use whatever characters they wish, provided
everybody knows that file names use that encoding. 

That's what I was arguing anyway: the file name encoding should be
per-system, not per-user, and the most suitable per-system encoding is
UTF-8.

> All that you say about the file names is also valid for the file contents.
> A lot of them are in plain text, and filenames are easily converted into
> plain text. But all POSIX compliant applications have their interpretation
> of plain text guided by LC_CTYPE et al.

Indeed.  That's the big problem with external metadata: it is not
commonly transferred along with the data itself.

> However, when you recommend to an application author that his application
> should consider all filenames as being UTF-8, this is not an improvement.
> It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and
> KOI8-R users.

You're right that it's not an improvement for them, but what are we
going to do with filenames extracted from eg. a tar file?  If I send
them a tar file with UTF-8 (or KOI8-R) encoded filenames, they're
going to see a mess (or get their terminal to hang).

Yeah, that's solving the wrong problem (I want metadata attached to
everything :), but the brokenness is already there, and standardising on
one encoding (if it's suitable for everybody) is still a step forward:
we'll break some things in the process, but improve others.  After a
while, we'll get there.


Recommending that LC_CTYPE use UTF-8 is an attempt to lower the
brokenness rate once the switch is finally done.  I don't know if such
a thing will ever really be done (compatibility, the large amount of
already-encoded data making the transition costly, etc.), but I can at
least dream about it ;-)

Cheers,
Danilo




Re: question on Linux UTF8 support

2005-08-03 Thread srintuar

Bruno Haible wrote:

> However, when you recommend to an application author that his application
> should consider all filenames as being UTF-8, this is not an improvement.
> It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and
> KOI8-R users.

Perhaps that is too conservative.

Any effort spent supporting legacy encodings, or being prepared to
perform charset conversions on input, seems wasteful to me (even to
support alternative Unicode encodings).  Locales are still useful, but
I think locales should not specify encoding.

There are a lot of benefits to be gained, in the form of simplicity and
interoperability, when applications are free to assume that all text
they might encounter will be UTF-8 encoded.  Common protocols and file
formats shouldn't even have to specify what encoding text is in, IMO:
by specifying, they allow for the possibility that it might be
different, and that an application may have to deal with charset
conversion, etc.

System-wide messages, the login screen, the filesystem, GECOS fields,
.plans, /etc/issue, /etc/motd, etc. are examples where I think a common
enforced encoding would be beneficial.

The alternative, such as tagging metadata onto the filesystem layer,
individual inodes, individual file metadata descriptors, etc., seems
far uglier in comparison.  (Imagine a file whose name is in one
encoding, metadata in a second, and content in yet a third. :()

IDN URLs are another good example.  It's clearly preferable to have a
URL be both canonical (byte for byte) and readable (i.e. non-punycode).
If a user provides an IDN URI to the system or another user in an
unexpected encoding, the resource would be unresolvable.





Re: question on Linux UTF8 support

2005-08-03 Thread Bruno Haible
Danilo Segan wrote:
> what about a user deciding to change LC_CTYPE?

A user who switches to a different LC_CTYPE, or works in two different
LC_CTYPEs in parallel, will need to convert his plain text files when
moving them from one world to the other. It is not much more effort
to also convert the file names at the same moment.

> Or even
> worse, what if the administrator provides some dirs for the user in an
> encoding different from the one the user wants to use?
>
> E.g. imagine having a global "/Müsik" in ISO-8859-1, while the user
> wants to use UTF-8 or ISO-8859-5.

For this directory to be useful for different users, the files that it
contains have to be in the same encoding. (If a user put the titles or
lyrics of a song there in ISO-8859-5, and another user wants to see them
in his UTF-8 locale, there will be a mess.) So a requirement for using
a common directory is _anyway_ that all users are in locales with the
same encoding.

> My point is that the filesystem encoding should be filesystem-wide
> (not per-user)

All that you say about the file names is also valid for the file contents.
A lot of them are in plain text, and filenames are easily converted into
plain text. But all POSIX compliant applications have their interpretation
of plain text guided by LC_CTYPE et al.

> That doesn't get us any closer to solving the problem.  It's the status quo.  I
> think we should at least recommend improvements, if not require them
> (and nobody suggested requiring them).
>
> Basically, my recommendation was to set LC_CTYPE to UTF-8 on all new
> systems.

We have the same goal, namely to let all users use UTF-8, and get rid
of any user-visible character set conversions.

I agree with the recommendations that you make to users and sysops.

However, when you recommend to an application author that his application
should consider all filenames as being UTF-8, this is not an improvement.
It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and
KOI8-R users.

Bruno





Re: question on Linux UTF8 support

2005-08-03 Thread Danilo Segan
Hi Bruno,

Today at 17:24, Bruno Haible wrote:

> This will mess up users who have their LC_CTYPE set to a non-UTF-8 encoding.
> It is weird if a user, in an application, enters a new file name "Süß",
> and then in a terminal, the filename appears as "Süà " (wow, it even
> hangs my xterm!).

Oh, indeed.  But what about a user deciding to change LC_CTYPE?  Or even
worse, what if the administrator provides some dirs for the user in an
encoding different from the one the user wants to use?

E.g. imagine having a global "/Müsik" in ISO-8859-1, while the user
wants to use UTF-8 or ISO-8859-5.  Now not only will it look weird (and
possibly even hang your xterm!), but you'd be in a mess if you tried to
fix it.

My point is that the filesystem encoding should be filesystem-wide
(not per-user), because that's the only way to guarantee that it won't
break.  And in terms of the POSIX API, UTF-8 makes the most sense as a
single, backwards-compatible filesystem encoding (well, it wasn't
originally called "FSS-UTF" - File System Safe - for no reason :), one
which can work for everybody.

> It is just as bad as those old Motif applications which assume that
> everything is ISO-8859-1. This makes these applications useless in UTF-8
> locales.

No, it's not.  UTF-8 can encode all characters, so you'd be able to
use whatever characters you wish, give or take a conversion step.
ISO-8859-1 limits you not only at the "implementation details" level,
but also in features.

> In summary, I'd suggest
>   - that ALL applications follow LC_ALL/LC_CTYPE/LANG, like POSIX specifies,
>   - that users switch to UTF-8 locale when they want.

That doesn't get us any closer to solving the problem.  It's the status
quo.  I think we should at least recommend improvements, if not require them
(and nobody suggested requiring them).

Basically, my recommendation was to set LC_CTYPE to UTF-8 on all new
systems.

Cheers,
Danilo




Re: question on Linux UTF8 support

2005-08-03 Thread Bruno Haible
Danilo Segan wrote:
> > 2. Is there any known application which still uses ISO-8859XXX codesets
> > for creating file names?
>
> Many old (and new?) applications use the current character set on the
> system (set through e.g. LC_CTYPE or other LC_* variables).  I'd
> suggest that all new applications use UTF-8.

This will mess up users who have their LC_CTYPE set to a non-UTF-8 encoding.
It is weird if a user, in an application, enters a new file name "Süß",
and then in a terminal, the filename appears as "Süà " (wow, it even
hangs my xterm!).

It is just as bad as those old Motif applications which assume that
everything is ISO-8859-1. This makes these applications useless in UTF-8
locales.

In summary, I'd suggest
  - that ALL applications follow LC_ALL/LC_CTYPE/LANG, like POSIX specifies,
  - that users switch to UTF-8 locale when they want.

Bruno





Re: question on Linux UTF8 support

2005-07-31 Thread Danilo Segan
Yesterday at 13:07, praveen kumar sivapuram wrote:

> I am developing an application which needs to know the format of
> the file name.  Based on the documents referenced on the web, I
> understood that the Linux file system does not impose any specific
> format for file names.  Users can create files using either
> UTF-8 or ISO-X-X.

And even using any of the hundred other encodings compatible with
ASCII (FWIW, you can probably use some non-ASCII-compatible encodings
as well, but you may run into problems with some special characters).

> Here are my questions:
>  
> 1. Is there any way to identify whether the filename is UTF8 or non-UTF8?

If the entire file name is a valid UTF-8 string, then it's likely to
be UTF-8 (it's possible that it's not really UTF-8, but that is very
unlikely!).  To check whether a string is valid UTF-8, consult the
UTF-8 RFC (RFC 3629).
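In a script, iconv can serve as the validity test (a sketch; a C
state machine following the RFC would be faster):

    # Succeeds iff the argument is well-formed UTF-8.
    is_utf8() {
        printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
    }
    is_utf8 "$name" && echo "looks like UTF-8" || echo "legacy encoding?"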

> 2. Is there any known application which still uses ISO-8859XXX codesets for 
> creating file names?

Many old (and new?) applications use the current character set on the
system (set through e.g. LC_CTYPE or other LC_* variables).  I'd
suggest that all new applications use UTF-8.

> 3. Can I safely assume that all files on a particular system will be in the
> same format, i.e. that all file names will be either UTF-8 or in some other
> ISO codeset?

No.  Different users might be running different locales, and the
aforementioned "old" applications might assume filenames to be in the
users' locale encodings.  Of course, if some user switches locales
often, then all kinds of mess-ups might occur, unless she's
consistently using UTF-8 (or another language-agnostic encoding) for
naming files.


In practice, for most single-user systems, you can make such an
assumption.


Cheers,
Danilo
