Re: file name encoding

2001-06-27 Thread Florian Weimer

Bruno Haible [EMAIL PROTECTED] writes:

> The programs we are waiting for are:
>
>   - emacs. In an UTF-8 locale, it does not set the
>     keyboard-coding-system to UTF-8, thus when I type umlaut keys
>     strange things happen. And it does not set the default file
>     encoding to UTF-8,

I hope so!  Setting the default encoding to UTF-8 for random files is
harmful in the Emacs context, especially with the current fragile
UTF-8 implementation.

> thus I see mojibake every time I open a
> file which looks perfectly nice through cat or vi in xterm.
> But we heard the Emacs developers are working on this lately.

Yes, the specific problems are solved.  It isn't a big deal, actually:
apparently no one had tried to run Emacs on a multibyte terminal
until, a few months ago, some guy from Germany (not me, BTW)
triggered a general bug in the Emacs keyboard coding system in this
context.  That bug has reportedly been fixed in the development sources.

Anyway, you can run a suitably recent version of Emacs (probably
not the Emacs 21 branch, however) inside a UTF-8 xterm and it
mostly works as expected.  Actually, I've only got access to Emacs 20
with MULE-UCS, and the results are promising indeed.  I didn't
check whether the notions of full-width characters match, or other
sophisticated stuff, but the HELLO file displays quite nicely.
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: file name encoding

2001-06-27 Thread Bruno Haible

H. Peter Anvin writes:

> > Yes.  This is the point.  When users set the LANG variable, they
> > expect all software to obey the variable.
>
> The issue is, however, what does that mean?  In particular, strings in
> the filesystem are usually in the system-wide encoding scheme, not
> what that particular user happens to be processing at the time.

Obeying LANG is important in two scenarios:

  1) For the user who uses a single locale, and this locale's encoding
     is not ISO-8859-1. He sets LANG in $HOME/.profile.

     Such a user will in the long run use non-ASCII filenames. They
     will be stored in locale encoding on the disk. Programs should
     be able to display and use such filenames.

  2) For the user who tries out a locale in a different encoding.
     He sets LANG on the command line.

     Such a user will have to be prepared for problems with non-ASCII
     filenames. But everything else should work without manual
     intervention.
       LANG=de_DE.UTF-8 xterm         -> get an UTF-8 xterm
       LANG=ja_JP.EUC-JP gvim file    -> edit EUC-JP encoded file
       LANG=vi_VN emacs               -> start emacs with Vietnamese
                                         input method
       etc.

It's for the second case that it is important that no encodings are
stored in $HOME/.* files. And it's for the first case that non-ASCII
filenames must be supported.
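
To make "obeying LANG" concrete, here is a minimal sketch of how a C
program can find out which encoding a locale implies.  The helper name
codeset_for_locale() is made up for this illustration; setlocale() and
nl_langinfo(CODESET) are the standard calls.  Passing "" asks the C
library to consult the environment, which is the case discussed above.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Switch LC_CTYPE to `locale_name` and return the name of its character
 * encoding (for example "UTF-8" or "EUC-JP").  An empty string means
 * "take the locale from the environment".  Returns NULL if the locale
 * is not installed.  Note that this changes process-wide state.
 * (Hypothetical helper, not taken from any program in this thread.) */
const char *codeset_for_locale(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;                 /* unknown or uninstalled locale */
    return nl_langinfo(CODESET);
}
```

A program started as "LANG=ja_JP.EUC-JP prog" would call
codeset_for_locale("") once at startup and treat filenames and terminal
I/O as being in the returned encoding, provided that locale is installed.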

Bruno



Re: file name encoding

2001-06-27 Thread Juliusz Chroboczek

BH> He is not repeating it to make it more true. He is repeating it to
BH> make people aware that

BH> A program cannot be considered properly internationalized
BH> until it obeys the current locale (LC_ALL || LC_CTYPE || LANG).
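
For readers unfamiliar with the "(LC_ALL || LC_CTYPE || LANG)"
shorthand, the precedence can be sketched as below.  Both function
names are invented for this illustration; a real program would
normally just call setlocale(LC_CTYPE, "") and let the C library
apply the same rule.

```c
#include <assert.h>
#include <stdlib.h>

/* Pick the locale name that governs LC_CTYPE from the three candidate
 * values.  NULL or empty values count as unset; "C" is the fallback.
 * (Hypothetical helper written for this example.) */
const char *pick_ctype_locale(const char *lc_all, const char *lc_ctype,
                              const char *lang)
{
    const char *candidates[] = { lc_all, lc_ctype, lang };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (candidates[i] != NULL && candidates[i][0] != '\0')
            return candidates[i];
    }
    return "C";
}

/* The same decision, taken from the process environment. */
const char *effective_ctype_locale(void)
{
    const char *name = pick_ctype_locale(getenv("LC_ALL"),
                                         getenv("LC_CTYPE"),
                                         getenv("LANG"));
    assert(name != NULL);   /* pick_ctype_locale() always returns a name */
    return name;
}
```

The first non-empty variable wins, so a stray LC_ALL overrides whatever
LANG says.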

Tomohiro-san is trying to make this a universal rule.  Tomohiro has
oft expressed the opinion that ilka piece of software must absolutely
respect LC_CTYPE throughout its interface with its environment.

I do not believe this is true.  In a number of places, a program must
interact with its environment in a locale-independent manner.  This
includes selection conversion, keyboard input, and arguably
interaction with the file system.

Lack of understanding of this basic principle leads to absurdities
such as Emacs' ``selection-coding-system'' variable.

Regards,

Juliusz



Re: file name encoding

2001-06-27 Thread H. Peter Anvin

Followup to:  [EMAIL PROTECTED]
By author:    Bruno Haible [EMAIL PROTECTED]
In newsgroup: linux.utf8
 
> Obeying LANG is important in two scenarios:
>
>   1) For the user who uses a single locale, and this locale's encoding
>      is not ISO-8859-1. He sets LANG in $HOME/.profile.
>
>      Such a user will in the long run use non-ASCII filenames. They
>      will be stored in locale encoding on the disk. Programs should
>      be able to display and use such filenames.
>
>   2) For the user who tries out a locale in a different encoding.
>      He sets LANG on the command line.
>
>      Such a user will have to be prepared for problems with non-ASCII
>      filenames. But everything else should work without manual
>      intervention.
>        LANG=de_DE.UTF-8 xterm         -> get an UTF-8 xterm
>        LANG=ja_JP.EUC-JP gvim file    -> edit EUC-JP encoded file
>        LANG=vi_VN emacs               -> start emacs with Vietnamese
>                                          input method
>        etc.
>
> It's for the second case that it is important that no encodings are
> stored in $HOME/.* files. And it's for the first case that non-ASCII
> filenames must be supported.
 

Actually, the conditions for non-ASCII filenames are even stricter: for
the system to work consistently the way you describe, the ENTIRE
SYSTEM needs to use the same locale.  A user who sets a locale other
than the system standard locale (which may or may not be ISO-8859-1;
in fact, I claim the only sane default in the long run is UTF-8) and
then uses locale-specific encodings in the filesystem is going to be
fucked sooner or later.  Too many things will malfunction, be it
Samba or administrator/distribution-added files which are in the system
locale but not the locale expected by our (l)user.

FILENAME ENCODINGS IN DIFFERENT LOCALES DO NOT WORK.  PERIOD.

The reason is trivial: filename encoding is a systemwide property.
There is no possibility for adjusting filename encoding on a per-file
or per-user basis.  This is one of many mistakes the locale people
made when they set up their system.  It just doesn't work right.
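
A small sketch of what goes wrong, since the bytes on disk carry no
encoding tag: the function below does what a program assuming
ISO-8859-1 effectively does with a name, shown by re-encoding the
result as UTF-8 so it can be inspected.  The function name is made up
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Re-encode a NUL-terminated ISO-8859-1 byte string as UTF-8.
 * Returns the number of bytes written (excluding the NUL), or 0 if
 * `out` is too small or the input is empty.  Every Latin-1 byte maps
 * to the code point of the same value, so no input is ever rejected.
 * (Hypothetical helper written for this example.) */
size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    size_t in_len = strlen(in);
    if (out_size < 2 * in_len + 1)       /* worst case: 2 bytes per char */
        return 0;
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            out[n++] = (char)c;                      /* ASCII is unchanged */
        } else {
            out[n++] = (char)(0xC0 | (c >> 6));      /* lead byte */
            out[n++] = (char)(0x80 | (c & 0x3F));    /* continuation byte */
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the two bytes C3 A4 (a-umlaut as stored by a UTF-8 user)
yields the four bytes C3 83 C2 A4, i.e. two unrelated Latin-1
characters: the mojibake that the other user's tools will display for
that filename.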

-hpa

-- 
[EMAIL PROTECTED] at work, [EMAIL PROTECTED] in private!
Unix gives you enough rope to shoot yourself in the foot.
http://www.zytor.com/~hpa/puzzle.txt



Re: file name encoding

2001-06-27 Thread Bruno Haible

Juliusz Chroboczek writes:
> In a number of places, a program must interact with its environment
> in a locale-independent manner.  This includes selection conversion,
> keyboard input, and arguably interaction with the file system.

I agree that in _some_ places programs exchange text in
locale-independent formats. For example, strings in databases are
better stored in a locale-independent format, so that users in
different locales can access them.

But we need to look at it case by case.

> Lack of understanding of this basic principle leads to absurdities
> such as Emacs' ``selection-coding-system'' variable.

What led to 'selection-coding-system' is that some programs are ICCCM
compliant (they use a locale-independent format for the selection and
cutbuffer) and some are not.

So we'll get a mess every time it's not clear whether a mechanism uses
a locale-dependent or a locale-independent text representation.

* Selection: Here ICCCM says it's locale independent.

* Keyboard input: An XKeyEvent is locale independent. Input read
  through XmbLookupString is locale dependent.
  Input read from /dev/tty is assumed to be locale dependent if the
  IEXTEN flag is set.

* Filenames: The POSIX spec for 'ls' implies that 'ls' treats
  filenames as locale (LC_CTYPE) dependent. This means all other
  programs must do the same.

Bruno



Re: file name encoding

2001-06-27 Thread Bruno Haible

H. Peter Anvin writes:
> Actually, the conditions for non-ASCII filenames are even stricter: for
> the system to work consistently the way you describe, the ENTIRE
> SYSTEM needs to use the same locale.

It need not. If the administrator/distribution files are in ASCII,
and users don't need to access each other's files, there is no
problem with user A having /home/A in EUC-JP encoding and user B
having /home/B in UTF-8 encoding.

> FILENAME ENCODINGS IN DIFFERENT LOCALES DO NOT WORK.  PERIOD.

Sure. Therefore it's best to use non-ASCII filenames only after having
switched one's system to UTF-8, not before.

Bruno



Re: file name encoding

2001-06-27 Thread Tomohiro KUBOTA

Hi,

At Wed, 27 Jun 2001 20:51:31 +0200 (CEST),
Bruno Haible [EMAIL PROTECTED] wrote:

> I agree that in _some_ places programs exchange text in locale
(snip everything that follows)

This is just what I'd like to insist on.

Just one addition.

Since Juliusz's approach ("filenames in UTF-8, without conversion")
works only under UTF-8 locales, it is a subset of the "filenames in
locale encoding" approach (i.e., the present state).  (Note that if
you follow the "filenames in locale encoding" approach, you will use
UTF-8 filenames in UTF-8 locales.)  Thus, his approach does not bring
any technical improvement; it merely puts pressure on people who
don't use UTF-8 locales.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-26 Thread Juliusz Chroboczek

TH> Locale-dependency is mandatory.  All text-handling software
TH> which doesn't obey LC_CTYPE should be regarded as buggy.

Repeating this will not make it any more true.

Juliusz



Re: file name encoding

2001-06-26 Thread Tomohiro KUBOTA

Hi,

At Tue, 26 Jun 2001 22:11:06 +0200 (CEST),
Bruno Haible [EMAIL PROTECTED] wrote:

>  - Newbies should have only a single variable to set in their
>    $HOME/.profile, not dozens.

Yes.  This is the point.  When users set the LANG variable, they
expect all software to obey the variable.


>  - We want to make it easy for everyone to use an UTF-8 locale.
>    Users shouldn't be bothered to change various $HOME/.* files,
>    set .Xdefault resources etc.

Yes.  However, not only UTF-8 but also all other encodings.


>  - All X programs which set their default font to *-iso8859-1
>    independently of the locale. This includes nedit.

Of course such software is buggy.  However, software which uses
XDraw{Image}String() is also buggy.  (Software before X11R4
should use both XDraw{Image}String() and
XDraw{Image}String16().  Modern software after X11R5
should use X{mb,wc,(utf8?)}Draw{Image}String().)

Moreover, a default font of -adobe-helvetica-* is buggy as well,
since it excludes most non-Latin fonts.  -adobe-helvetica-*,* is
good.  Better still is a mechanism that appends ",*" before calling
XCreateFontSet(), as in my modification to twm.

in xc/programs/twm/util.c
    /* room for the original name plus ",*" and the terminating NUL */
    basename2 = (char *)malloc(strlen(font->name) + 3);
    if (basename2) sprintf(basename2, "%s,*", font->name);
    else basename2 = font->name;    /* out of memory: use the name as is */
    if( (font->fontset = XCreateFontSet(dpy, basename2,
                                        &missing_charset_list_return,
                                        &missing_charset_count_return,
                                        &def_string_return)) == NULL) {

Of course we can implement a better font-guessing mechanism, like
the one I implemented for IceWM, Blackbox, and Sawfish.  (I didn't
use that mechanism for twm because I thought it was too heavy for
twm.)
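
For anyone who wants to reuse the ",*" trick outside twm, it can be
isolated into a small helper that does not depend on Xlib.  This is
only a sketch; the function name is invented here and is not part of
twm.  The result is meant to be passed as the base font name list to
XCreateFontSet(), where the trailing "*" pattern is intended to let
Xlib pick some font for the charsets the first pattern does not cover.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of `base_name` with ",*" appended, or
 * NULL if allocation fails.  The caller frees the result.
 * (Hypothetical helper written for this example.) */
char *with_fallback_pattern(const char *base_name)
{
    assert(base_name != NULL);
    size_t size = strlen(base_name) + 3;     /* ",*" plus the NUL */
    char *pattern = malloc(size);
    if (pattern != NULL)
        snprintf(pattern, size, "%s,*", base_name);
    return pattern;
}
```

Unlike the twm excerpt, this version reports allocation failure to the
caller instead of silently falling back to the unmodified name; which
behaviour is preferable depends on the program.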

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-26 Thread H. Peter Anvin

Followup to:  [EMAIL PROTECTED]
By author:    Tomohiro KUBOTA [EMAIL PROTECTED]
In newsgroup: linux.utf8
 
> At Tue, 26 Jun 2001 22:11:06 +0200 (CEST),
> Bruno Haible [EMAIL PROTECTED] wrote:
>
> >  - Newbies should have only a single variable to set in their
> >    $HOME/.profile, not dozens.
>
> Yes.  This is the point.  When users set the LANG variable, they
> expect all software to obey the variable.
 

The issue is, however, what does that mean?  In particular, strings in
the filesystem are usually in the system-wide encoding scheme, not
what that particular user happens to be processing at the time.

The locale system was unfortunately misdesigned.  There are very few
reasonable answers when it comes to dealing with things like this.  It
does, however, seem pretty clear that fopen() and friends will *not*
convert the character string presented to them, but will treat it as a
string of bytes, period.  POSIX is adamant about this.
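
The byte-string behaviour is easy to check.  The sketch below creates a
file through fopen() and then looks for exactly the same bytes in the
directory listing; the function name is made up for this example, and
the fixed 4096-byte path buffer is an arbitrary choice.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Create an empty file called `name` inside directory `dir`, then scan
 * the directory and report whether an entry with exactly the same
 * bytes exists.  Returns 1 if found, 0 if not, -1 on I/O error.
 * The scratch file is removed again before returning.
 * (Hypothetical helper written for this example.) */
int name_survives_roundtrip(const char *dir, const char *name)
{
    assert(dir != NULL && name != NULL);
    char path[4096];
    int needed = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (needed < 0 || (size_t)needed >= sizeof path)
        return -1;

    FILE *file = fopen(path, "w");   /* no locale lookup, no conversion */
    if (file == NULL)
        return -1;
    fclose(file);

    DIR *handle = opendir(dir);
    if (handle == NULL) {
        remove(path);
        return -1;
    }
    int found = 0;
    struct dirent *entry;
    while ((entry = readdir(handle)) != NULL) {
        if (strcmp(entry->d_name, name) == 0) {   /* byte-wise comparison */
            found = 1;
            break;
        }
    }
    closedir(handle);
    remove(path);
    return found;
}
```

On the usual Linux filesystems this is expected to return 1 whatever
LANG is set to, even for a name such as "\xe4.txt" that is not valid
UTF-8.  Filesystems that validate or normalize names may behave
differently.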

-hpa
-- 
[EMAIL PROTECTED] at work, [EMAIL PROTECTED] in private!
Unix gives you enough rope to shoot yourself in the foot.
http://www.zytor.com/~hpa/puzzle.txt



Re: file name encoding

2001-06-26 Thread Tomohiro KUBOTA

Hi,

At 26 Jun 2001 13:49:10 -0700,
H. Peter Anvin [EMAIL PROTECTED] wrote:

> Incidentally, I believe there needs to be an easy way to set the
> default character set in use on a system.  This may of course be
> overridden by the user (possibly at their own peril), but it is
> nevertheless a useful concept.

This mechanism has been implemented since X11R5: XFontSet.

Why is XFontSet not very popular?  I can imagine some reasons.
 - People imagine from its name that it is only for CJK people
   who need multiple fonts.
 - People were accustomed to using the system without setting a locale.
   XFontSet-related functions assume ASCII without a locale setting.

Thus, when using XFontSet, I check the locale and use the conventional,
non-internationalized XFontStruct-related functions when the
check fails.  This avoids complaints from people who don't
know how to set a locale.  See the source code of twm I wrote
for details.

xc/programs/twm/twm.c

    loc = setlocale(LC_ALL, "");
    if (!loc || !strcmp(loc, "C") || !strcmp(loc, "POSIX") ||
        !XSupportsLocale()) {
        use_fontset = False;    /* fall back to XFontStruct functions */
    } else {
        use_fontset = True;
    }

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-26 Thread Tomohiro KUBOTA

Hi,

At 26 Jun 2001 16:37:05 -0700,
H. Peter Anvin [EMAIL PROTECTED] wrote:

> The issue is, however, what does that mean?  In particular, strings in
> the filesystem are usually in the system-wide encoding scheme, not
> what that particular user happens to be processing at the time.

Ah, I understand.  We were discussing different topics.
My point is not about the byte sequence for filenames in the filesystem.
It may or may not be UTF-8.  I don't care much because users have
little occasion to access the raw byte sequence on the filesystem.
My point is that user-level commands must obey the locale when they
communicate with users.  For example, 'ls' must display file names
in the locale encoding.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-22 Thread Markus Kuhn

On Fri, 22 Jun 2001 [EMAIL PROTECTED] wrote:
> > Would it be acceptable to change internals of functions like fopen()
> > so that the passed file name is converted to utf-8 through iconv() ?
>
> And how is the character set of the file name supposed to be guessed?

Trivial: Filenames would always be ASCII or UTF-8.

I think the most practical recommendation is that today nobody should be
using non-ASCII filenames (except for UTF-8 testing of course) until the
big switch to UTF-8. In practice, we are reasonably close to that
situation. Even in very non-Latin user communities, sort-of-English
filenames are currently the dominant practice, not only on web servers
but also because of the severe regexp hazards of some unsuitable
encodings (BIG5, GB18030, etc.).
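
Under the "ASCII or UTF-8" rule, a tool can at least check names
mechanically.  The following is a minimal sketch of such a check; the
function name is made up for this example, and it tests only
well-formedness, not whether the name is sensible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the NUL-terminated string `name` is well-formed UTF-8
 * (plain ASCII included), 0 otherwise.  Rejects overlong forms, UTF-16
 * surrogates and code points above U+10FFFF.
 * (Hypothetical helper written for this example.) */
int is_valid_utf8(const char *name)
{
    assert(name != NULL);
    const unsigned char *p = (const unsigned char *)name;
    size_t len = strlen(name);
    size_t i = 0;

    while (i < len) {
        unsigned char lead = p[i];
        size_t extra;            /* number of continuation bytes */
        unsigned long min_cp;    /* smallest code point for this length */
        unsigned long cp;

        if (lead < 0x80) {
            i++;                 /* ASCII byte */
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; min_cp = 0x80;    cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; min_cp = 0x800;   cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; min_cp = 0x10000; cp = lead & 0x07;
        } else {
            return 0;            /* stray continuation or invalid lead byte */
        }

        if (len - i <= extra)
            return 0;            /* sequence is cut off */
        for (size_t k = 1; k <= extra; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;        /* not a continuation byte */
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min_cp)
            return 0;            /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;            /* surrogate */
        if (cp > 0x10FFFF)
            return 0;            /* beyond the Unicode range */
        i += extra + 1;
    }
    return 1;
}
```

A name left over from an ISO-8859-1 session, such as the single byte
E4 followed by ASCII text, fails this check, so legacy names can be
found before the switch rather than after it.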

Adding locale-dependent encoding conversion functionality to fopen() etc.
really is completely out of the question. I don't even want to start
thinking about the huge number of obvious new devious security problems
that such a severe functionality change in the highly stable Unix/Linux
file system semantics would bring. For those with no imagination at all,
let me just mention lock files and file existence tests to start with.
Changing fopen here is an absolute no-go! Good Qwafu, we want to decrease
the locale-dependency of the C and X11 API, not increase it.

Let's focus on making the Linux environment suitable for smooth pure UTF-8
usage, not add layer after layer of redundant conversion and recoding
extensions to one API after another, until the system spends half of its
CPU cycles checking whether character encoding conversion is necessary.

Markus





Re: file name encoding

2001-06-22 Thread Tomohiro KUBOTA

Hi,

At Fri, 22 Jun 2001 18:58:25 +0100 (BST),
Markus Kuhn [EMAIL PROTECTED] wrote:

> Adding locale-dependent encoding conversion functionality to fopen() etc.
> really is completely out of the question. I don't even want to start
> thinking about the huge number of obvious new devious security problems
> that such a severe functionality change in the highly stable Unix/Linux
> file system semantics would bring. For those with no imagination at all,
> let me just mention lock files and file existence tests to start with.
> Changing fopen here is an absolute no-go! Good Qwafu, we want to decrease
> the locale-dependency of the C and X11 API, not increase it.

Locale-dependency is mandatory.  All text-handling software which
doesn't obey LC_CTYPE should be regarded as buggy.

I don't know if fopen() should implement encoding conversion or not.
However, if not, each application will have to convert UTF-8
from/to the locale encoding.
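
For reference, the per-application conversion would look roughly like
the sketch below, using iconv().  The wrapper name convert_name() and
its return convention are made up for this example; iconv_open(),
iconv() and iconv_close() are the standard calls.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from encoding `from` to
 * encoding `to`, writing a NUL-terminated result into `out`.
 * Returns 0 on success, -1 if the conversion is not supported, the
 * input cannot be converted, or `out` is too small.
 * (Hypothetical wrapper written for this example.) */
int convert_name(const char *from, const char *to,
                 const char *in, char *out, size_t out_size)
{
    assert(from != NULL && to != NULL && in != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    iconv_t cd = iconv_open(to, from);    /* note the order: to, from */
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;               /* iconv() only reads the input */
    size_t src_left = strlen(in);
    char *dst = out;
    size_t dst_left = out_size - 1;       /* keep room for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);   /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *dst = '\0';
    return 0;
}
```

An 'ls' that stores UTF-8 on disk but talks to an EUC-JP terminal would
call something like convert_name("UTF-8", nl_langinfo(CODESET), ...) on
every name it prints, and the reverse on every name the user types.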

UTF-8 is merely one of the encodings supported by locales.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-22 Thread H. Peter Anvin

Followup to:  [EMAIL PROTECTED]
By author:    Tomohiro KUBOTA [EMAIL PROTECTED]
In newsgroup: linux.utf8
 
> Locale-dependency is mandatory.  All text-handling software which
> doesn't obey LC_CTYPE should be regarded as buggy.
 

That is not the problem.  There is a *MAJOR* problem with the locale
API: it's global state.  There is a lot of software that has to switch
locales on the fly, and they generally don't expect things like open()
to be affected by the locale!
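
To illustrate the global-state problem: a library that needs a
different locale for a moment has to save, switch and restore by hand,
roughly as sketched below.  Both function names are invented for this
example; only setlocale() is the real API.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Run `action(arg)` with LC_CTYPE temporarily set to `temp_locale`,
 * then restore the previous setting.  Returns 0 on success, -1 if the
 * locale could not be switched or memory ran out.  Not thread-safe:
 * setlocale() changes state shared by the whole process.
 * (Hypothetical helper written for this example.) */
int with_temporary_ctype(const char *temp_locale,
                         void (*action)(void *), void *arg)
{
    assert(temp_locale != NULL && action != NULL);

    /* setlocale() may reuse its return buffer, so keep a private copy. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, temp_locale) == NULL) {
        free(saved);
        return -1;
    }
    action(arg);
    setlocale(LC_CTYPE, saved);           /* put the global state back */
    free(saved);
    return 0;
}

/* Example action: copy the name of the active LC_CTYPE locale into a
 * caller-supplied buffer of at least 64 bytes. */
void record_ctype_name(void *arg)
{
    char *buffer = arg;
    const char *name = setlocale(LC_CTYPE, NULL);
    strncpy(buffer, name != NULL ? name : "", 63);
    buffer[63] = '\0';
}
```

Everything called from inside action() sees the temporary locale, and
so does every other thread for the duration of the call.  If open()
converted filenames according to the locale, it would silently change
behaviour inside such a window.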

> I don't know if fopen() should implement encoding conversion or not.
> However, if not, each application will have to convert UTF-8
> from/to the locale encoding.
>
> UTF-8 is merely one of the encodings supported by locales.

The only sane combination of things is to use locale-dependent
encodings for interchange only.  Getting it into Unicode as soon as
possible is the only sane way to deal with it.  I say that as a
non-English-speaking European, who has a fairly large body of things
in ISO 8859-1...

Yes, it's a pain.  However, at least it will hopefully be like banging
your head against the wall: it feels really good when it stops.  If we
do this right this will be the last time we have to bang our heads
against the wall like this.

-hpa

-- 
[EMAIL PROTECTED] at work, [EMAIL PROTECTED] in private!
Unix gives you enough rope to shoot yourself in the foot.
http://www.zytor.com/~hpa/puzzle.txt