GNU libunistring 0.9 released

2009-04-27 Thread Bruno Haible
Hi,

GNU libunistring 0.9 was released this week. Find below the announcement.

There is a mailing list for this project at
  https://savannah.gnu.org/mail/?group=libunistring

You are invited to join this mailing list, in order to influence and
participate in future releases of this library.

Enjoy!

Bruno


===

GNU libunistring is a library that provides functions for manipulating
Unicode strings and for manipulating C strings according to the Unicode
standard.

It consists of the following parts:

  unistr.h     elementary string functions
  uniconv.h    conversion from/to legacy encodings
  unistdio.h   formatted output to strings
  uniname.h    character names
  unictype.h   character classification and properties
  uniwidth.h   string width when using nonproportional fonts
  uniwbrk.h    word breaks
  unilbrk.h    line breaking algorithm
  uninorm.h    normalization (composition and decomposition)
  unicase.h    case folding
  uniregex.h   regular expressions (not yet implemented)

libunistring is for you if your application involves non-trivial text
processing, such as upper/lower case conversions, line breaking, operations
on words, or more advanced analysis of text. Text provided by the user can,
in general, contain characters of all kinds of scripts. The text processing
functions provided by this library handle all scripts and all languages.

libunistring is for you if your application already uses the ISO C / POSIX
ctype.h, wctype.h functions and the text it operates on is provided by
the user and can be in any language.

libunistring is also for you if your application uses Unicode strings as
internal in-memory representation.

Download:
  http://ftp.gnu.org/gnu/libunistring/libunistring-0.9.tar.gz

This is the first public release.


Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: wcwidth update

2007-07-08 Thread Bruno Haible
Hello Markus,

  Could you update your wcwidth implementation at
  
  http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
  
  to latest unicode data?
 
 Done.

This code assigns width 2 to U+4DC0..U+4DFF. But they are marked as 'N' in
Unicode 5.0.0's ucd/EastAsianWidth.txt, therefore they should have width 1.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam ( other Indic?) chars and wcwidth

2006-10-16 Thread Bruno Haible
Hello Rich,

 These characters are combining marks that attach on both
 sides of a cluster, and have canonical equivalence to the two separate
 pieces from which they are built, but yet Markus' wcwidth
 implementation and GNU libc assign them a width of 1. It appears very
 obvious to me that there's no hope of rendering both of these parts
 using only 1 character cell on a character cell device, and even if it
 were possible, it also seems horribly wrong for canonically equivalent
 strings to have different widths.

What rendering do other terminal emulators produce for these characters,
especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit
a patch to glibc based on the data of just one terminal emulator.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: utf-8 and well-formed but illegal chars

2006-01-19 Thread Bruno Haible
Rich Felker wrote:
 hope this isn't too off-topic -- i'm working on a utf-8 implementation
 and trying to decide what to do with byte sequences that are
 well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
 0xfffe-0xffff, and 0x110000-0x1fffff. Should these be treated as
 illegal sequences (EILSEQ) or decoded as ordinary characters? is there
 a good reference on the precedents?

The three cases are probably best treated separately:

- The range 0xd800-0xdfff. You should catch and reject them as invalid when
  you are programming a conversion to UCS-2 or UTF-16, for example
UTF-8 -> UTF-16
  or
UCS-4 -> UTF-16
  Otherwise it becomes possible for malicious users to create non-BMP
  characters at a level of processing where earlier stages of processing
  did not see them.

  In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.

- For the other two ranges, the advice is dictated merely by consistency.

  Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
  therefore there is no need to catch them.

  The range >= 0x110000, I would catch and reject as invalid. Some time ago
  I had a crash in an application because the first level of processing
  rejected only values >= 0x80000000, with a reasonable error message, and
  later processing relied on valid Unicode and called abort() when a
  character code >= 0x110000 was seen. Making the first level as strict
  as the later one fixed this.
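
As a C sketch of these rules (the helper name and the UTF-16 flag are
illustrative, not from the original mail):

  #include <stdint.h>

  /* Validate a decoded UCS-4 value.  Values >= 0x110000 are always
     rejected; surrogates only need to be caught when the target of
     the conversion is UCS-2/UTF-16; 0xfffe/0xffff pass through like
     unassigned characters. */
  static int
  acceptable_ucs4 (uint32_t c, int target_is_utf16)
  {
    if (c >= 0x110000)
      return 0;                    /* beyond Unicode: reject always */
    if (target_is_utf16 && c >= 0xD800 && c <= 0xDFFF)
      return 0;                    /* would let callers forge non-BMP chars */
    return 1;                      /* includes 0xfffe, 0xffff */
  }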

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: i18n of shell scripts

2005-11-02 Thread Bruno Haible
Koblinger Egmont wrote:
  The Bash manual only mentions the $"..." facility, but I cannot recommend
  using this facility, as it has a security hole by design.

 I was just planning to use this feature. Could you please tell something
 (e.g. a link) about this security hole by design?

See the GNU gettext-0.14.5 manual, section bash - Bourne-Again Shell Script:

 A translator could - voluntarily or inadvertently - use backquotes
 "`...`" or dollar-parentheses "$(...)" in her translations.
 The enclosed strings would be executed as command lists by the
 shell.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: i18n of shell scripts

2005-10-31 Thread Bruno Haible
D. Dale Gulledge wrote:
 For what it's worth, according to the gettext manual, there is an
 interface to the gettext library for shell scripts.  It's documented here:

 http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC197

More info about this is found in the gettext-0.14.5 manual, section
sh - Shell Script.

 The Bash Reference Manual is similarly terse about how to use it:

 http://www.gnu.org/software/bash/manual/bashref.html#SEC13

The Bash manual only mentions the $"..." facility, but I cannot recommend
using this facility, as it has a security hole by design.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Using utf-8 in an application

2005-10-12 Thread Bruno Haible
 Here are the questions.

 1) In livido.h we #include wchar.c

 is this the right header for dealing with utf-8 ?

No. Wide characters are useless, because they differ in width and in
representation between platforms. On some platforms, wide character values
are even locale dependent.

 We want to keep the
 header file as light as possible, so it would be preferable to include as
 little code as possible. The only functions we need are to get a string
 length in bytes, so it can be stored, and then to add a terminating utf-8
 NULL when the string is retrieved, since NULL is not stored.

strlen() will do it.

 2) for getting the utf-8 string length in bytes, we use wcslen(). Is this
 the correct function ?

No, use strlen().

 3) when a string is retrieved, we must add a utf-8 terminating NULL to the
 end. How is this done ?

Like you add an ASCII '\0' to an 8-bit string.
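
A minimal C sketch of this store/retrieve round trip (the buffer handling
is hypothetical, just to make the point concrete):

  #include <stdlib.h>
  #include <string.h>

  /* UTF-8 never uses the byte 0x00 inside a multibyte character, so
     strlen() measures the stored length in bytes, and a plain '\0'
     terminates the string on retrieval. */
  static char *
  retrieve_utf8 (const char *stored, size_t stored_len)  /* from strlen() */
  {
    char *s = malloc (stored_len + 1);
    if (s != NULL)
      {
        memcpy (s, stored, stored_len);  /* the bytes, without a terminator */
        s[stored_len] = '\0';            /* the "terminating utf-8 NULL" */
      }
    return s;
  }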

 4) For testing purposes, I want to create a utf-8 string. Is there a
 simple way to convert a char *string to utf-8 ?

A char * is normally in a locale dependent encoding. To convert it to
UTF-8, you need to go through iconv(). Look for example
 - at the function u8_conv_from_locale() in
     libuniconv/localeconv.c
     libuniconv/uniconv.c
   in ftp://ftp.ilog.fr/pub/Users/haible/gnu/libunistring-0.0.tar.gz,
 - or at extras/iconv_string.c in libiconv-1.10.tar.gz,
 - or at the 'iconvme' module in gnulib (http://savannah.gnu.org/projects/gnulib).
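
For illustration, a rough C sketch of the iconv() route (error handling
abbreviated; the functions cited above do this properly, and setlocale()
must have been called beforehand):

  #include <iconv.h>
  #include <langinfo.h>
  #include <stdlib.h>
  #include <string.h>

  /* Convert a locale-encoded string to a freshly allocated UTF-8 string. */
  static char *
  to_utf8 (const char *s)
  {
    iconv_t cd = iconv_open ("UTF-8", nl_langinfo (CODESET));
    if (cd == (iconv_t) -1)
      return NULL;
    size_t inleft = strlen (s);
    size_t outsize = 4 * inleft + 1;          /* generous worst case */
    char *out = malloc (outsize);
    char *inp = (char *) s;
    char *outp = out;
    size_t outleft = outsize - 1;
    if (out != NULL
        && iconv (cd, &inp, &inleft, &outp, &outleft) != (size_t) -1)
      *outp = '\0';
    else
      {
        free (out);
        out = NULL;
      }
    iconv_close (cd);
    return out;
  }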

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Capitalisation of text, which library is it?

2005-09-05 Thread Bruno Haible
 Several applications allow you to convert text to all caps, such as
 Firefox and OpenOffice.org
 Do you know where this information is stored or which library deals this
 task?
 Is it CLDR?

Yes, it should be CLDR, because the glibc locale data files are only
accessible through the glibc API, and this API doesn't, for example, do
toupper(ß) = "SS", as needed for the German locale. Similarly in French,
where often toupper(é) = E and not É.

The libraries which exploit CLDR are ICU and GNU glocale ([1], work in
progress).

Bruno

[1] http://live.gnome.org/LocaleProject


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: question on Linux UTF8 support

2005-08-10 Thread Bruno Haible
Sergey Poznyakoff wrote:
  The GNU tar maintainer is working on a GNU pax program. Maybe he will also
  provide a command-line option for GNU tar that would perform the same
  filename charset conversions (suitable for 'tar' archives with UTF-8
  filenames)?

 It has already been implemented.

 Current version of GNU tar (1.15.1) performs this conversion
 automatically when operating on an archive file in pax format.

Thanks, indeed that works: When I create a .pax file (*) in an UTF-8 locale
and use GNU tar 1.15.1 to unpack it in an ISO-8859-15 locale, the filenames
are correctly converted.

But it is hard to switch the general distribution of tar files to pax format,
because - while a tar as old as GNU tar 1.11p supports pax files with just
a warning, and AIX, HP-UX and IRIX tar similarly - the Solaris and OSF/1
/usr/bin/tar refuse to unpack them.

Could you add to GNU tar an option, so that it performs the filename conversion
_also_ when reading or creating archives in 'tar' format?

Bruno


(*) It's funny that to create a .pax file I have to use tar -H pax, because
pax on my system is OpenBSD's pax, which rejects the option -x pax: it
can only create cpio and tar archives, despite its name :-)


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: question on Linux UTF8 support

2005-08-03 Thread Bruno Haible
Danilo Segan wrote:
  2. Is there any known application which still uses ISO-8859XXX codesets
  for creating file names?

 Many old (and new?) applications use current character set on the
 system (set through eg. LC_CTYPE, or other LC_* variables).  I'd
 suggest all new applications to use UTF-8.

This will mess up users who have their LC_CTYPE set to a non-UTF-8 encoding.
It is weird if a user, in an application, enters a new file name "Süß",
and then in a terminal, the filename appears as "SÃ¼Ã" (wow, it even
hangs my xterm!).

It is just as bad as those old Motif applications which assume that
everything is ISO-8859-1. This makes these applications useless in UTF-8
locales.

In summary, I'd suggest
  - that ALL applications follow LC_ALL/LC_CTYPE/LANG, like POSIX specifies,
  - that users switch to UTF-8 locale when they want.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: question on Linux UTF8 support

2005-08-03 Thread Bruno Haible
Danilo Segan wrote:
 what about user deciding to change LC_CTYPE?

A user who switches to a different LC_CTYPE, or works in two different
LC_CTYPEs in parallel, will need to convert his plain text files when
moving them from one world to the other. It is not much more effort
to also convert the file names at the same moment.

 Or even
 worse, what if administrator provides some dirs for the user in an
 encoding different from the one user wants to use?

 Eg. imagine having a global /Müsik in ISO-8859-1, and user desires
 to use UTF-8 or ISO-8859-5.

For this directory to be useful for different users, the files that it
contains have to be in the same encoding. (If a user put the titles or
lyrics of a song there in ISO-8859-5, and another user wants to see them
in his UTF-8 locale, there will be a mess.) So a requirement for using
a common directory is _anyway_ that all users are in locales with the
same encoding.

 My point is that the filesystem encoding should be filesystem-wide
 (not per-user)

All that you say about the file names is also valid for the file contents.
A lot of them are in plain text, and filenames are easily converted into
plain text. But all POSIX compliant applications have their interpretation
of plain text guided by LC_CTYPE et al.

 That's not closer to ever solving the problem.  It's status quo.  I
 think we should at least recommend improvements, if not require them
 (and nobody suggested requiring them).

 Basically, my recommendation was to set LC_CTYPE to UTF-8 on all new
 systems.

We have the same goal, namely to let all users use UTF-8, and get rid
of any user-visible character set conversions.

I agree with the recommendations that you make to users and sysops.

However, when you recommend to an application author that his application
should consider all filenames as being UTF-8, this is not an improvement.
It is a no-op for the UTF-8 users but breaks the world of the EUC-JP and
KOI8-R users.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: viewing UTF-8 encoded man pages

2005-07-08 Thread Bruno Haible
Jan Willem Stumpel wrote:
  In languages like Japanese or Chinese, there are line breaking
  opportunities not only at spaces. And there are fewer spaces
  than in European languages. I guess that groff is looking for
  spaces when deciding to do line breaking, and this line
  breaking algorithm doesn't produce satisfactory results when
  there are long runs of characters without spaces.

 Yes, this makes sense.. but does it display correctly in your
 case?

With groff-utf8 it doesn't display correctly: the linebreaks are not
well positioned. But it should be enough for a translator who wants to
proofread his/her translated man page.

 Wonder how FC 3 solves this.

groff on Fedora contains a 400 KB patch for Japanese, which includes
some adjustments to the line breaking algorithm.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: [Groff] Re: man page encoding

2005-07-08 Thread Bruno Haible
Andries Brouwer wrote:
 The very long pipeline contains invocations of
 refer, ideal, pic, tbl, eqn, ditroff
 but also lots of preprocessors of my own. If the groff version of refer
 or tbl decides to turn my Latin-1 into UTF-8, then my own preprocessors
 later on in the pipeline will no longer be able to handle the input.
 On the other hand, if they turn stuff into \[...] or \N[...] escape
 sequences, then again my preprocessors are confused since this syntax is
 not traditional troff syntax, and unexpected in the input.

Don't worry here: we don't plan to change 'refer' or 'tbl' to convert
Latin1 input to something else. The plan is that when a user invokes
groff, the constructed pipeline contains an invocation to 'gpreconv'.
A pipeline that you construct by yourself will continue to work.

 Now you say tough luck, and I don't mind, but if the idea is that groff
 has a compatibility mode ...

The compatibility mode is made for compatibility to AT&T UNIX troff.
At that time, Latin1 as an encoding didn't exist. Therefore it's hard to
argue that -C should imply interpretation of non-ASCII input as being Latin1.

2) We would have low acceptance from the people who produce man pages
  in EUC-JP, with the consequence that these -Tnippon hacks in groff (or
  equivalent hacks in man in some distributions) would need to stay
  forever.

 But you talk as if you are forced to change groff in ugly ways because
 man is set in stone. But it is very easy to change man.

It is not easy to change the opinion of many Japanese people, regarding the
issue of EUC-JP vs. Unicode.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



man page encoding

2005-07-06 Thread Bruno Haible
Andries,

Currently on a Linux system you find man pages in the following encodings:
  - ISO-8859-1 (German, Spanish, French, Italian, Brazilian, ...),
  - ISO-8859-2 (Hungarian, Polish, ...),
  - KOI8-R (Russian),
  - EUC-JP (Japanese),
  - UTF-8 (Vietnamese),
  - ISO-8859-7, ISO-8859-9, ISO-8859-15, ISO-8859-16 (man7/*),
and none of them contains an encoding marker.

The goal is that groff -T... -mandoc on any man page works, without
need to specify the encoding as an argument to groff.

There are two options:
  a) Recognize only UTF-8 encoded man pages. This is the simplest.
 groff will be changed to emit errors when it is fed a non-UTF-8
 input, so that the man page maintainers are notified that they need to
 convert their man page to UTF-8.
  b) Recognize the encoding according to a note in the first line
'\" -*- coding: EUC-JP -*-
 groff will then emit errors when it is fed input that is non-ASCII and
 without coding: marker, so that man page maintainers are notified that
 they need to add the coding: marker.

Which of the two would you, as Linux man pages maintainer, prefer?

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: viewing UTF-8 encoded man pages

2005-07-06 Thread Bruno Haible
Andries Brouwer wrote:
 Hmm. Long ago I added some code to man that sufficed to make some
 Russian users happy. Forgot all details. See man-iconv.c.
 (Maybe that threw in an invocation of iconv when reading the pages?)

That worked because KOI8-R, like ISO-8859-1, consists of only 256 characters,
and they all have width 1. For Unicode in general, you need the other
trick contained in groff-utf8.tar.gz.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: C source and execution encodings

2005-06-28 Thread Bruno Haible
Roger Leigh wrote:
 #include <locale.h>
 #include <stdio.h>
 #include <wchar.h>

 int
 main (void)
 {
   setlocale (LC_ALL, "");
   printf("‘Name1’\n");
   printf("%ls\n", L"‘Name2’");
   fwide(stderr, 1);
   fwprintf(stderr, L"‘Name3’\n");
   fwprintf(stderr, L"%s\n", "‘Name4’");
   printf("‘Name5’\n");
   return 0;
 }

 Try running this in a C locale!

 $ ./test
 'Name3'
 ‘Name1’
 ‘Name5’

I get this (on a glibc 2.3 system):

$ LC_ALL=C ./test
‘Name1’
???Name3???
‘Name5’

Since the encoding of the C locale is ASCII, you can see that none of the
outputs is suitable for the C locale.

Conclusion: use gettext().

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Gettext and UTF-8

2005-06-28 Thread Bruno Haible
Roger Leigh wrote:
 I created a C.po file, and this installed as schroot.mo under
 /usr/share/locale.  This po file simply converts the UTF-8 chars to
 the nearest ASCII equivalent, e.g. © -> (C).  However, when running
 under the C or POSIX locales, bindtextdomain() never even checks for
 the existence of a message catalogue (checked with strace).

 Is this correct?  If so, is this a gettext or libc bug?

gettext() does no conversion at all when running in the C or POSIX locale.
This is because the POSIX standard specifies the precise output of many
commands in the C locale, and no localization is allowed in this case.

You can get the desired behaviour by using an English locale (such as
en_US.US-ASCII - note: you have to create this locale first, using 'localedef').
You build the message catalog for this locale using the 'msgen' command.
It can contain UTF-8 in both the msgid and the msgstr; the gettext() library
function will take care of converting many common UTF-8 characters to ASCII
when the locale's encoding is ASCII.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: How to detect the encoding of a string?

2005-06-03 Thread Bruno Haible
Simos Xenitellis wrote:
 Is there a library or sample program that can do such an encoding
 detection based on short strings of unknown encoding
 (or to choose from encodings based on a smaller list than iconv --list)?

It's very unfortunate that the encoding of the filenames is not specified in the
central_directory_file_header in unzip.h. So the best you can do is to
fall back on heuristics, based on these three bits of information:

 1) the version_made_by[1] field, which contains the OS on which the zip
file was made.
 2) the locale (especially language) of the user who attempts to extract the
zip,
 3) the set of filenames in the zip file.

Here's how you can use this information to do something meaningful:

1) You know that AMIGA used the ISO-8859-1 encoding, ATARI used the ATARIST
   encoding, FS_NTFS and FS_VFAT preferably use Windows encodings, BEOS
   uses UTF-8, MAC uses the MAC-* specific encodings, MAC_OSX uses UTF-8 in
   decomposed normal form.

2) Assuming that the language of the person who extracts the zip often matches
   the language of the one who created it, you can set up a list of encodings
   to try:

   Afrikaans      UTF-8 ISO-8859-15 ISO-8859-1
   Albanian       UTF-8 ISO-8859-15 ISO-8859-1
   Arabic         UTF-8 ISO-8859-6 CP1256
   Armenian       UTF-8 ARMSCII-8
   Basque         UTF-8 ISO-8859-15 ISO-8859-1
   Breton         UTF-8 ISO-8859-15 ISO-8859-1
   Bulgarian      UTF-8 ISO-8859-5
   Byelorussian   UTF-8 ISO-8859-5
   Catalan        UTF-8 ISO-8859-15 ISO-8859-1
   Chinese        UTF-8 GB18030 CP936 CP950 BIG5 BIG5-HKSCS EUC-TW
   Cornish        UTF-8 ISO-8859-15 ISO-8859-1
   Croatian       UTF-8 ISO-8859-2
   Czech          UTF-8 ISO-8859-2
   Danish         UTF-8 ISO-8859-15 ISO-8859-1
   Dutch          UTF-8 ISO-8859-15 ISO-8859-1
   English        UTF-8 ISO-8859-15 ISO-8859-1
   Esperanto      UTF-8 ISO-8859-3
   Estonian       UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Faeroese       UTF-8 ISO-8859-15 ISO-8859-1
   Finnish        UTF-8 ISO-8859-15 ISO-8859-1
   French         UTF-8 ISO-8859-15 ISO-8859-1
   Frisian        UTF-8 ISO-8859-15 ISO-8859-1
   Galician       UTF-8 ISO-8859-15 ISO-8859-1
   Georgian       UTF-8 GEORGIAN-ACADEMY GEORGIAN-PS
   German         UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-2
   Greek          UTF-8 ISO-8859-7
   Greenlandic    UTF-8 ISO-8859-15 ISO-8859-1
   Hebrew         UTF-8 ISO-8859-8 CP1255
   Hungarian      UTF-8 ISO-8859-2
   Icelandic      UTF-8 ISO-8859-10 ISO-8859-15 ISO-8859-1
   Inuit          UTF-8 ISO-8859-10
   Irish          UTF-8 ISO-8859-14 ISO-8859-15 ISO-8859-1
   Italian        UTF-8 ISO-8859-15 ISO-8859-1
   Japanese       UTF-8 EUC-JP CP932
   Kazakh         UTF-8 PT154
   Korean         UTF-8 EUC-KR CP949 JOHAB
   Laotian        UTF-8 MULELAO-1 CP1133
   Latin          UTF-8 ISO-8859-15 ISO-8859-1
   Latvian        UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Lithuanian     UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Luxemburgish   UTF-8 ISO-8859-15 ISO-8859-1
   Macedonian     UTF-8 ISO-8859-5
   Maltese        UTF-8 ISO-8859-3
   Manx Gaelic    UTF-8 ISO-8859-14
   Norwegian      UTF-8 ISO-8859-15 ISO-8859-1
   Polish         UTF-8 ISO-8859-2 ISO-8859-13
   Portuguese     UTF-8 ISO-8859-15 ISO-8859-1
   Raeto-Romanic  UTF-8 ISO-8859-15 ISO-8859-1
   Romanian       UTF-8 ISO-8859-16
   Russian        UTF-8 KOI8-R ISO-8859-5 KOI8-RU
   Sami           UTF-8 ISO-8859-13 ISO-8859-10 ISO-8859-4
   Scottish       UTF-8 ISO-8859-15 ISO-8859-1 ISO-8859-14
   Serbian        UTF-8 ISO-8859-5
   Slovak         UTF-8 ISO-8859-2
   Slovenian      UTF-8 ISO-8859-2
   Sorbian        UTF-8 ISO-8859-2
   Spanish        UTF-8 ISO-8859-15 ISO-8859-1
   Swedish        UTF-8 ISO-8859-15 ISO-8859-1
   Tajik          UTF-8 KOI8-T
   Thai           UTF-8 ISO-8859-11 TIS-620 CP874
   Turkish        UTF-8 ISO-8859-9
   Ukrainian      UTF-8 KOI8-U ISO-8859-5
   Vietnamese     UTF-8 VISCII TCVN CP1258
   Welsh          UTF-8 ISO-8859-14

3) Look at the set of file names in the zip. If they _all_ happen to be
   valid UTF-8, you can assume that's it (because there are very few
   meaningful strings which look like UTF-8 but aren't); see the checker
   sketched after this list. Then go ahead similarly for the other encodings.

   Furthermore, for Chinese, you can use frequency-of-characters based
   techniques such as
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
 http://kamares.ucsd.edu/~arobert/hanziData.html
 http://www.mandarintools.com/codeguess.html
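
The step-3 check, spelled out as a C sketch (the helper is mine, not from
the mail; it rejects overlong forms, surrogates and values above 0x10FFFF):

  #include <stddef.h>

  /* Return 1 if the byte sequence is well-formed UTF-8. */
  static int
  is_valid_utf8 (const unsigned char *s, size_t n)
  {
    size_t i = 0;
    while (i < n)
      {
        unsigned int cp, min;
        size_t k;
        unsigned char c = s[i];
        if (c < 0x80) { i++; continue; }                      /* ASCII */
        else if ((c & 0xE0) == 0xC0) { k = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { k = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { k = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                                        /* bad lead byte */
        if (n - i < k + 1)
          return 0;                                           /* truncated */
        for (size_t j = 1; j <= k; j++)
          {
            if ((s[i + j] & 0xC0) != 0x80)
              return 0;                                       /* bad continuation */
            cp = (cp << 6) | (s[i + j] & 0x3F);
          }
        if (cp < min || cp > 0x10FFFF || (cp - 0xD800) < 0x800)
          return 0;                 /* overlong, out of range, or surrogate */
        i += k + 1;
      }
    return 1;
  }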

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: How to detect the encoding of a string?

2005-06-03 Thread Bruno Haible
Abel Cheung wrote:
 (because there are very few
 meaningful strings which look like UTF-8 but aren't).

 Yes, that's rare, though real world case has really happened before,
 especially for multibyte characters. Here is a sample:

 http://qa.mandrakesoft.com/show_bug.cgi?id=3935

Yes. It's a heuristic, and heuristics are always buggy. The programmer has
to weigh the benefit for the many users for whom it just works against
the problem it will cause for a few. In this case, when the heuristic
doesn't work, the result will be a filename that is garbage - just a
different garbage than if no heuristic had taken place.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: CSets 1.8 released

2005-05-19 Thread Bruno Haible
Michael B Allen wrote:
 I didn't realize there could be so many differences. Why is that? Are
 these just mistakes? I mean if Mac-Cyrillic is what it is on a Macintosh
 how can glibc-2.3 just decide to change the mapping for 0xB6?

Some of the differences are because the character sets evolve: A new version
of a Macintosh comes with new fonts, and suddenly a few particular, rarely
used code points correspond to different glyphs. Even standardized character
sets like ISO-8859-8 evolve over time.

Some of the differences are because the mapping to Unicode is done by
independent vendors, based on glyph tables. Characters like
OHM SIGN and GREEK CAPITAL LETTER OMEGA look very similar.

Some of the differences are because many vendors have to handle backward
compatibility problems that other vendors don't have.

Some of the differences are just mistakes and bugs: Many charset converters are
shipped without having been tested with a testsuite.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: CSets 1.8 released

2005-05-06 Thread Bruno Haible
Mark Leisher wrote:
 CSets is a collection of mapping tables between Unicode and 48 different
 character encodings.  ...
 http://crl.nmsu.edu/~mleisher/csets.html

A repository for more frequently used charset encoding tables, with
emphasis on the variations found in the various implementations, is at
  http://www.haible.de/bruno/charsets/conversion-tables/index.html

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Weird behaviour of emacs

2005-02-04 Thread Bruno Haible
David Sumbler wrote:
 If I save the file in emacs-mule format, a lower case 'alpha' appears
 as bytes [92 a6 c1] in case (a), and [9c f4 a7 b1] in case (b).  Other
 characters show similar differences.

 I've spent weeks trying to solve this, without success.  Can someone
 point me in the direction of an explanation and/or solution?

The explanation: This is a well-known design flaw of Mule in Emacs/XEmacs.
Possibly the solution: The emacs-unicode[-2] branch of the Emacs CVS.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: character width in terminal

2005-01-10 Thread Bruno Haible
Egmont Koblinger asked:
 - Where can I find specification about the terminal width of each and every
 Unicode character?

http://www.unicode.org/reports/tr11/ and the Unicode character database 4.1.

 - Is glibc's wcwidth() considered to be a good implementation?

Yes. Note that for characters with ambiguous width (where the width is 1
in European contexts and 2 in Japanese contexts) it returns 1.

 What about
 the cases where it returns -1, including U+0603 mentioned above?

-1 is returned for control characters and similar, where the cursor
movement is not predictable.

 - Is it clearly a bug in the terminal emulator (gnome-terminal/vte) if it
 moves the cursor for a character whose wcwidth is zero? (I guess it is, and
 I found it in gnome's bugzilla as #162262.)

Yes. A terminal emulator is supposed to display these zero-width and
combining characters in a way that doesn't move the cursor.

 - Is it documented somewhere what a terminal emulator should do if it
 receives a character whose wcwidth equals to -1?

These are control characters. For some, like U+000A, the semantics is
clear; for others, it is unknown.

 - What shall a terminal emulator do with the cursor position if it receives
 a character that is not assigned and known that won't be assigned?

Undefined behaviour.

 or when it receives a character that is not yet assigned?

It should assume that it is a normal graphic character whose width is
1, 2, or 0, depending on the numeric code of the character. For example,
the characters U+20000..U+2FFFD and U+30000..U+3FFFD all have width 2,
although many of them are not yet assigned.
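
As a small C sketch on top of wcwidth() (my own helper, shown only to make
the cell-counting concrete; characters where wcwidth() returns -1 are
counted as 0 here, while a real emulator must treat them separately):

  #define _XOPEN_SOURCE 600   /* for wcwidth() */
  #include <wchar.h>

  /* Number of terminal cells occupied by a wide string. */
  static int
  display_cells (const wchar_t *s)
  {
    int total = 0;
    for (; *s != L'\0'; s++)
      {
        int w = wcwidth (*s);   /* 0, 1 or 2 for graphic characters */
        if (w > 0)
          total += w;
      }
    return total;
  }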

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: wcsftime output encoding

2004-11-26 Thread Bruno Haible
Roger Leigh wrote:
 Viewed as hexadecimal (aligned for comparison):
 Narrow UTF-8:

 == d0 9f d1 82 d0 bd 

In UCS-4 these would be

  041F  0442  043D

 Wide (unknown):
 == 1f 42 3d

So you can see that it simply used the low 8 bits of every UCS-4 character.
Which is broken. Before reporting this as a bug to the GCC people, you
might want to find out whether it's a bug in std::wcsftime or a bug in
the std::wcout stream.
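
One way to narrow that down, as a plain C sketch that bypasses the C++
streams entirely (locale and format here are arbitrary): dump the raw
wchar_t values that wcsftime() produces. If they are the full UCS-4 codes
(041F 0442 043D ...), wcsftime() is innocent and the truncation happens
on output.

  #include <locale.h>
  #include <stdio.h>
  #include <time.h>
  #include <wchar.h>

  int
  main (void)
  {
    wchar_t buf[64];
    time_t t;
    setlocale (LC_ALL, "");
    t = time (NULL);
    size_t n = wcsftime (buf, sizeof buf / sizeof buf[0],
                         L"%B", localtime (&t));
    for (size_t i = 0; i < n; i++)
      printf ("%04lX ", (unsigned long) buf[i]);  /* raw wchar_t values */
    printf ("\n");
    return 0;
  }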

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: gcc and utf-8 source

2004-11-15 Thread Bruno Haible
srintuar wrote:
   1) For  printf("%s\n", "Schöne Grüße");
 ...
 Being that UTF-8 is sort of an endpoint in the evolution of encodings,
 I also consider option 1 to be perfectly valid.

I would be careful with such statements. We don't know what the successor
of UTF-8 might look like, nor when it will appear (in 6 years? 10 years?
15 years?). But predictions like "A personal computer will never need more
than 640 KB of RAM" have too frequently turned out to be wrong.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: gcc and utf-8 source

2004-11-12 Thread Bruno Haible
Egmont Koblinger wrote:
 I was reading Markus's page and found the example:
   printf("%ls\n", L"Schöne Grüße");
 and noticed that gcc always interprets the source code according to
 Latin-1.

gcc-3.4's documentation contains the following:

`-fexec-charset=CHARSET'
 Set the execution character set, used for string and character
 constants.  The default is UTF-8.  CHARSET can be any encoding
 supported by the system's `iconv' library routine.

`-fwide-exec-charset=CHARSET'
 Set the wide execution character set, used for wide string and
 character constants.  The default is UTF-32 or UTF-16, whichever
 corresponds to the width of `wchar_t'.  As with
 `-ftarget-charset', CHARSET can be any encoding supported by the
 system's `iconv' library routine; however, you will have problems
 with encodings that do not fit exactly in `wchar_t'.

`-finput-charset=CHARSET'
 Set the input character set, used for translation from the
 character set of the input file to the source character set used
 by GCC. If the locale does not specify, or GCC cannot get this
 information from the locale, the default is UTF-8. This can be
 overridden by either the locale or this command line option.
 Currently the command line option takes precedence if there's a
 conflict. CHARSET can be any encoding supported by the system's
 `iconv' library routine.

and these options work fine for me.

However, these gcc options are normally not usable for portable programs.
This is because

  1) For  printf("%s\n", "Schöne Grüße");

 Many Linux users work in an UTF-8 locale, many others work in a
 pre-Unicode locale. Do you want to ship two executables, one
 produced with -fexec-charset=UTF-8 and one with
 -fexec-charset=ISO-8859-2 ?

  2) For  printf("%ls\n", L"Schöne Grüße");

 On Solaris, FreeBSD and others, the wide character encoding is
 locale dependent and not documented. Therefore there is no good
 choice for the -fwide-exec-charset option that you could make.

The portable solution is to use gettext:

 printf("%s\n", gettext ("Schoene Gruesse"));
or   printf("%s\n", gettext ("Greetings"));

This works on all platforms, with all compilers, and furthermore allows
the program to be localized.

OTOH, if you limit yourself to Linux systems and don't want your
programs to be portable or internationalized, you can now use option 2.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: char * to unicode/UTF string

2004-05-19 Thread Bruno Haible
Tomohiro KUBOTA wrote:
 Please use nl_langinfo(CODESET) for encoding name of char* string,
 because the encoding of char* string depends on locale.
 On most GNU-based systems it is available.  You have to call
 setlocale() in advance.

 iconv_t ic = iconv_open("UTF-8", nl_langinfo(CODESET));

Right. And when you use GNU libc or GNU libiconv but your platform lacks
nl_langinfo(CODESET) (like for example FreeBSD 4), then you can use
the "" alias instead. It has the same meaning: the locale dependent char*
encoding:

iconv_t ic = iconv_open("UTF-8", "");

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Standardized encoding names for iconv_open()

2004-05-19 Thread Bruno Haible
Markus Kuhn wrote:
 In general, the POSIX definition of iconv_open() would become *much*
 more useful, if it actually specified a couple of encoding strings, and
 what exactly they mean.

I second that. JAVA has a similar minimal supported set of encodings
in its conversion facility.

   ""     multi-byte encoding of current LC_CTYPE locale
   UTF-8  UTF-8 (with overlong sequences being illegal)
   UTF-16 UTF-16 (same byte order as C's short)
   UTF-16BE   UTF-16 BigEndian
   UTF-16LE   UTF-16 LittleEndian
   UTF-32 UTF-32 (same byte order as C's long)
   ...

UTF-16 and UTF-32 are defined differently than "same byte order as
C's short", in RFC 2781. It's better to refer to their lengthy definition
in RFC 2781.

 and perhaps even

   EUC-JP, EUC-KR, EUC-TW, GB18030

I don't think there is a normative, widely used definition of EUC-TW.
And for GB18030, the fact that its official definition is in Chinese,
not English, doesn't prevent different implementations by different vendors.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: iconv limitations

2004-04-10 Thread Bruno Haible
srintuar wrote:
 The knowledge of how
 to detect a null in a stateful encoding is not necessarily trivial.

 If there was a function which could return the unit-word-size of
 any encoding accepted by iconv, ...

Here is how to write such a function: Given the unknown encoding,
1. convert "\000" from UTF-8 to the given encoding,
2. convert "\000\000" from UTF-8 to the given encoding,
3. return the difference of the lengths (measured in bytes) of the two
   results.
4. If the encoding is UTF-7, this does not work. Here return 1 instead.

The corresponding Clisp code:

(defun encoding-zeroes (encoding)
  (let ((name (ext:encoding-charset encoding))
        (table #.(make-hash-table :test #'equal
                                  :initial-contents '(("UTF-7" . 1))))
        (tester #.(make-string 2 :initial-element (code-char 0))))
    (or (gethash name table)
        (setf (gethash name table)
              (- (length (ext:convert-string-to-bytes tester encoding))
                 (length (ext:convert-string-to-bytes tester encoding :end 1)))))))
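
The same probe, sketched in C with iconv() (untested sketch; the buffer
size is arbitrary, and UTF-7 is special-cased as in the Lisp version):

  #include <iconv.h>
  #include <string.h>

  /* Bytes occupied by one NUL character in the given encoding. */
  static int
  encoding_zeroes (const char *encoding)
  {
    char in[2] = { 0, 0 };
    char out[64];
    char *inp, *outp;
    size_t inleft, outleft, len1, len2;

    if (strcmp (encoding, "UTF-7") == 0)
      return 1;
    iconv_t cd = iconv_open (encoding, "UTF-8");
    if (cd == (iconv_t) -1)
      return -1;

    inp = in; inleft = 1; outp = out; outleft = sizeof out;
    iconv (cd, &inp, &inleft, &outp, &outleft);     /* one NUL */
    len1 = sizeof out - outleft;

    iconv (cd, NULL, NULL, NULL, NULL);             /* reset shift state */

    inp = in; inleft = 2; outp = out; outleft = sizeof out;
    iconv (cd, &inp, &inleft, &outp, &outleft);     /* two NULs */
    len2 = sizeof out - outleft;

    iconv_close (cd);
    return (int) (len2 - len1);
  }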

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: iconv limitations

2004-04-10 Thread Bruno Haible
Michael B Allen wrote:
 Shift-JIS has embedded nulls,

I don't think this is true. Shift_JIS is a multibyte encoding. It has the
property that some bytes in the ASCII range (such as 'x' or '\') can occur
as part of non-ASCII characters. But 0x00 cannot occur as part of a double-
byte character.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1

2004-03-01 Thread Bruno Haible
Markus Kuhn wrote:
 I believe the only practical solution for this problem is to implement
 BACKSPACE in UTF-8 terminal emulators such that it moves one *character*
 to the left, not one *cell*.

I agree. The objects being displayed are characters. It does not make
sense for a user or for applications to position the cursor in the middle
of a character, or after 1/3 or 2/3 of a character.

 We have little choice if we want to keep the kernel free of
 locale-dependent monsters such as wcwidth().

There is also the problem of the TAB: Currently linux/drivers/char/n_tty.c
also transforms a TAB to a sequences of spaces, and an erase of a TAB to
a sequence of BACKSPACEs. If we keep it this way, the kernel must still
learn to distinguish single-width and double-width characters, in order
to keep a notion of current column number.

What is the reason for treating TAB at the TTY level? Why can't TAB be
treated like a graphic character of unknown width and be passed to the
device driver unchanged?

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: GB18030

2003-12-19 Thread Bruno Haible
Jan Willem Stumpel wrote:
 What was wrong
 with UTF-8, one wonders (rhetorical question, don't really want to
 know the answer because it is probably very complicated).

UTF-8 is upward compatible with ASCII, but the Chinese government wanted
something that is upward compatible with GB2312, and thus they created GB18030.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: grep is horriby slow in UTF-8 locales

2003-11-16 Thread Bruno Haible
Markus Kuhn wrote:
   b) relying entirely on ISO C's generic multi-byte functions, to make
  sure that even stateful monsters like the ISO 2022 encodings
  are supported equally.

Use of mbrlen is not done because of ISO 2022 encodings (which are not
usable as locale encodings!), but because of the non-UTF-8 multibyte
encodings: EUC-JP, Big5, GB18030 etc.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Uppercase string: broken tr?

2003-08-25 Thread Bruno Haible
Bob Proulx wrote:
 But sed and tr and other utilities just use the locale data provided
 on the system by glibc among other places.  These programs are table
 driven by tables that are not part of these programs.  This is why
 locale problems are global problems across the entire system of
 programs such as grep, sed, awk, tr, etc. or anything else that uses
 the locale data.

The glibc locale data for 'ABÇ' has been correct in all locales since 2000,
and is covered by glibc's testsuite. Before blaming glibc, you should make
up a standalone test program that shows the glibc problem.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Uppercase string: broken tr?

2003-08-24 Thread Bruno Haible
Alex J. Dam wrote:

   $ echo 'ABÇ' | tr [:upper:] [:lower:]
   gives me
   abÇ
   (the last character is an uppercase cedilla)
   I expected its output to be:
   abç

 Am I doing something wrong?

No, your expectations match what POSIX specifies.

 Is tr (version 2.1) broken?

Yes, and even the i18n patches from IBM
http://oss.software.ibm.com/developer/opensource/linux/patches/?patch_id=24
contain no fix for it.

 It happens with sed, too.

$ echo 'ABÇ' | sed -e 's,\(.*\),\L\1\E,'
abÇ

Yes this seems like a bug in GNU sed 4.0.3.

I'm CCing bug-coreutils and the sed maintainer, so the maintainers can do
something about it.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: To maintainer of the list

2003-07-05 Thread Bruno Haible
Wu Yongwei suggested that, to get rid of spam and worms, this list be
made subscriber-only. This is now implemented. Sorry for the inconvenience
that this will cause to well-behaved people who are not subscribed.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Wide character APIs

2003-07-03 Thread Bruno Haible
Michael B Allen said:
 Since Win32 is one of my target systems I need wide character support.

But Win32 doesn't have reasonable wide characters. They have a 16-bit
type called 'wchar_t' which cannot accomodate all characters since
Unicode 3.1. So what they will likely end up doing is to use UTF-16
as an encoding for 'wchar_t *' strings, which means that wchar_t doesn't
represent a *character* any more - it represents an UTF-16 memory unit.

 Is there a serious flaw with wchar_t on Linux?

wchar_t by itself is OK on Linux (it's 32-bit wide). But the functions
fgetwc() and fgetws() - as specified by ISO C 99 and POSIX:2001 - have a
big drawback: When you use them, and the input stream/file is not in the
expected encoding, you have no way to determine the invalid byte sequence
and do some corrective action. Using these functions has the effect that
your program becomes

 garbage in -> more garbage out
or
 garbage in -> abort

You need to use multibyte strings in order to get some decent program
behaviour in the presence of invalid multibyte contents of streams/files.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Strings in a programming language

2003-07-03 Thread Bruno Haible
Hi Marcin,

 Most languages take 3, as I understand Perl it takes the mix of 3 and 2,
 and Python has both 3 and 1. I think I will take 1, but I need advice: -

Don't look at Perl in this case - Perl has the handicap that for historical
reasons it cannot make a clear distinction between byte arrays (= binary
data) and character arrays (= strings = text).

Python's way of doing it - byte arrays are automatically converted to
character arrays when there is need to - is OK when you consider what
Python < 1.5 looked like. But for a freshly designed language it'd be
an unnecessary complexity.

In Lisp (Common Lisp - Scheme guys appear not to care about Unicode or
i18n) the common approach is to have one or two flavours of strings,
namely strings containing Unicode characters, and possibly a second
flavour, strings containing only ISO-8859-1 characters. Conversion
is done during I/O. The Lisp 'open' function has had an argument
'external-format' since 1984 or 1986 at least; nowadays a combination
of the encoding and the newline convention (Mac CR, Dos CRLF or Unix LF)
gets passed here.

You find details here:

  - GNU clisp
http://clisp.sourceforge.net/impnotes/encoding.html
http://clisp.sourceforge.net/impnotes/stream-dict.html#open

  - Allegro Common Lisp
http://www.franz.com/support/documentation/6.2/doc/iacl.htm

  - Liquid Common Lisp
http://www.lispworks.com/reference/lcl50/ics/ics-1.html

  - LispWorks
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-95.htm#pgfId-886156
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-101.htm#98500
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-76.htm#pgfId-902973

and some older details at
http://www.cliki.net/Unicode%20Support

 1. Strings are in UTF-32 on Unix

Since strings are immutable in your language, you can also represent
strings as UCS-2 or ISO-8859-1 if possible; this saves 75% of the memory
in many cases, at the cost of a little more expensive element access.

 ... and UTF-16 on Windows. They are recoded on the fly during I/O and
 communication with an OS (e.g. in filenames), with some recoding
 framework to be designed.

Why not use UTF-32 as internal representation on Windows as well?
I mean, once you have decided to put in place a conversion layer for
I/O, this conversion layer can convert to UTF-16 on Windows. What you
gain: you have the same internal representation on all platforms.

 2. Strings are in UTF-8, otherwise it's the same as the above. The
 programer can create malformed strings, they use byte offsets for indexing.

Unless you provide some built-in language constructs for safely iterating
across a string, like

  for (c across-string: str) statement

this would be too cumbersome for the user who is not aware of i18n.

 - How should the conversion API look like? Are there other such APIs
   which I can look at? It should permit interfacing with iconv and other
   platform-specific converters, and with C/C++ libraries which use various
   conventions (locale-based encoding in most, UTF-16 in Qt, UTF-8 in Gtk).

The API typically has a data type 'external-format', consisting of
EOL and encoding. Then you have some functions for creating streams
(to files, pipes, sockets) which all take an 'external-format' argument.

Furthermore you need some functions for converting a string from/to
a byte sequence using an 'external-format'. (These can be methods on
the 'external-format' object.)

 - What other issues I will encounter?

People will want to switch the 'external-format' of a stream on the fly,
because in some protocols like HTTP some part of the data is binary and
other parts are text in a given encoding.

 The language is most similar to Dylan, but let's assume its purpose will be
 like Python's. It will have a compiler which produces C code

The following article might be interesting for you.
http://www.elwoodcorp.com/eclipse/papers/lugm98/lisp-c.html

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Wide character APIs

2003-07-03 Thread Bruno Haible
Michael B Allen wrote:
 I didn't know wchar_t was supposed to be able to represent
 an entire character.

If wchar_t is not an entire character, the functions defined in wctype.h,
like iswprint(), make no sense. And indeed, on Windows with UTF-16 as
encoding of 'wchar_t *' strings, they make no sense.

 This is good to know. I have been avoiding those functions and converting
 to/from the locale encoding internally using mbstowcs() and wcstombs().

From the point of view of robustness versus malformed input, mbstowcs()
is just as bad as fgetwc(). The only function that really helps is mbrtowc().
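
For illustration, a decoding loop built on mbrtowc() (my own loop, not
from the original mail) that keeps 'n' equal to the bytes actually
remaining and recovers from invalid input instead of aborting:

  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  static void
  decode_buffer (const char *p, size_t len)
  {
    mbstate_t state;
    memset (&state, 0, sizeof state);
    while (len > 0)
      {
        wchar_t wc;
        size_t r = mbrtowc (&wc, p, len, &state);
        if (r == (size_t) -1)            /* invalid sequence */
          {
            fprintf (stderr, "invalid byte 0x%02x\n", (unsigned char) *p);
            memset (&state, 0, sizeof state);
            p++; len--;                  /* skip one byte and resync */
          }
        else if (r == (size_t) -2)       /* incomplete sequence at end */
          break;
        else
          {
            if (r == 0)
              r = 1;                     /* a null wide character */
            printf ("U+%04lX\n", (unsigned long) wc);
            p += r; len -= r;
          }
      }
  }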

 But no one answered my original question; why are the format specifiers
 for wide character functions different?

Here's the answer: So that a given format specifier corresponds to a
given argument type.

   Format specifier    Argument type

 %d                    int
 %s                    char *
 %ls                   wchar_t *
 %c                    int (promoted from char)
 %lc                   wint_t (promoted from wchar_t)

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: mbrtowc with dlopen doesn't work?

2003-07-01 Thread Bruno Haible
Michael B Allen wrote:
 I was using an 'n' limit parameter of INT_MAX.
 Limiting this to 0x appears to solve the problem.

... but it is still wrong. The ISO C and POSIX specification of mbrtowc()
[http://www.opengroup.org/onlinepubs/007904975/functions/mbrtowc.html]
implies that the mbrtowc() function is free to look at 'n' bytes, starting
from the beginning of the string. In other words, the caller of the function
has to guarantee that 'n' bytes can be accessed. Blindly passing a huge 'n'
can crash your program.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: [Translation-i18n] Re: Proposal for declinations in gettext

2003-06-15 Thread Bruno Haible
Yann Dirson wrote:
 it is difficult in some cases to
 find unique english strings that will be possible map one to one in
 all languages.

A common technique is to use a context marker in the msgid string,
like this:

my_gettext ("[menu item]Open")
my_gettext ("[combobox item]Open")

which translators can translate like this:

msgid "[menu item]Open"
msgstr "Ouvrir"

msgid "[combobox item]Open"
msgstr "Ouvert"

The my_gettext function calls gettext and, if it returns the
untranslated string, strips the "[...]" prefix.

See also the gettext documentation, section "GUI program problems".

The only problem (quite small, IMO) with this approach is that translators
must be made aware where the context marker ends.
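
A C sketch of the my_gettext wrapper described above (the exact signature
is a guess, not from the mail):

  #include <libintl.h>
  #include <string.h>

  /* If gettext() returns the msgid unchanged (no translation found),
     strip the "[...]" context prefix before handing it to the caller. */
  static const char *
  my_gettext (const char *msgid)
  {
    const char *translated = gettext (msgid);
    if (translated == msgid && msgid[0] == '[')
      {
        const char *close = strchr (msgid, ']');
        if (close != NULL)
          return close + 1;
      }
    return translated;
  }

  /* Usage: fputs (my_gettext ("[menu item]Open"), stdout); */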

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: [Translation-i18n] Proposal for declinations in gettext

2003-06-15 Thread Bruno Haible
Danilo Segan wrote:
 The usual practice among english-speaking programmers is to compose
 strings out of smaller parts.

You need to educate the programmer to use entire sentences. You can
refer them to the gettext documentation, section Preparing Translatable
Strings. http://www.gnu.org/manual/gettext/html_chapter/gettext_3.html#SEC15

The reason is that in most languages sentences are not composed by
juxtaposition, as in English:
   - For Serbian, you have given examples.
   - In many languages, a verb's form is spelled differently depending
 on the gender of the subject.
   - In Latin, the combiner and comes as a suffix -que.
   - Etc. etc.

 The translation for "Workspace %d" would look like:
 msgid "Workspace %d"
 msgstr0 "der Workspace %d"
 msgstr1 "das Workspace %d"
 msgstr2 "dem Workspace %d"
 msgstr3 "den Workspace %d"

 So, the title of "Workspace 5" would be "der Workspace 5", while the
 menu which allows switching to that workspace would read "Switch to den
 Workspace 5".

There are more bits of context that influence a translation than just a
declination. For example, the beginning of a sentence is special. To pursue
your example, an English programmer would be tempted to write

  "%0s is empty."

which would have the German translation

  "%0s ist leer."

and result in the final string

  "der Workspace %d ist leer."

which is wrong because, in German, all sentences must start with a capital letter.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Hello world in UTF-8/X11

2003-01-14 Thread Bruno Haible
Manel de la Rosa writes:
 I don't need a complex rendering system or anything killer. Simply
 display a label with a UTF-8 encoded string.

This is a contradiction in itself.

The purpose of UTF-8 is that it can be used for languages from Russian
over Vietnamese to Indic. This needs a complex rendering engine: for
Russian you already need fonts in non-ISO-8859-1 encoding; for
Vietnamese you need to attach multiple accents to a single letter, and
for Indic (Devanagari etc.) you need vowel reordering. Not to mention
right-to-left reordering (Hebrew, Arabic, Farsi), the problem of
choosing the right fonts, and dealing with the subtleties of these
fonts.

Only two free GUI toolkits have the rendering engines today: Qt/KDE
and GNOME. Also Mozilla and (to a more limited extent) GNU Emacs have
some rendering engines, but not embedded in a GUI toolkit.

With Motif/Lesstif you cannot go further than displaying Russian.
There are no internationalization efforts underway there. (Except
there is a complex rendering underway at the low X11 level, by Sun,
http://stsf.sourceforge.net/, but I have no idea how easy it will be
to use it when it will be finished, and whether the Motif adaptation
will be freely distributable.)

So my recommendation is: Drop Motif, and use KDE/Qt (if the GPL is
acceptable for your program) or GNOME. Qt has a module that helps in
migrating from Motif to Qt.

 with a short X11/UTF-8 Hello World example, for instance

Can't be done.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: redhat 8.0 - using locales

2003-01-10 Thread Bruno Haible
Maiorana, Jason writes:

 A few files appear under LC_MESSAGES, but it seems
 they don't show up even when LANG=eo.

First, you need to have a locale, maybe eo_ES or so. Second, in the
LANGUAGE environment variable, but not in the LANG environment
variable, LL_CC combinations can be abbreviated as LL to denote the
language's main dialect. So you should use LANG=eo_ES, not simply LANG=eo.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ``A Short Into ...'' - comments, suggestions?

2002-12-16 Thread Bruno Haible
Brian Foster writes:

 Suppose such a file is being opened.  What bytes are passed as the
 name of the file?  This is an unknown.  It obviously depends on the
 Java/JVM implementation.

The Sun Java 1.3 interprets the filenames on the file system according
to the locale. This means, in an UTF-8 locale the file names are UTF-8,
and in an ISO-8859-1 locale it replaces unencodable characters with
question marks, while doing the conversion from Java String to filename.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ``A Short Into ...'' - comments, suggestions?

2002-12-16 Thread Bruno Haible
Sandip Bhattacharya writes:

  The Sun Java 1.3 interprets the filenames on the file system according
  to the locale
 Can you explain what you mean by interprets? Any encoded filename is
 just a sequence of bytes. Why should apps be concerned any further
 than that?

On the filesystem the filename is just a sequence of bytes.
Inside Java, a filename is a String, i.e. a sequence of Unicode
characters. Which you can display, for example in a graphical file
chooser. So there must be some conversion between the Linux notion of
filename and the Java notion of filename. And this conversion works
perfectly according to the locale.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: filename and normalization

2002-12-05 Thread Bruno Haible
Mike FABIAN writes:

 .char - \N'45'

 because I found quite a few man pages which used just -o to write
 command line options of programs not \-o, for example the man page
 of less does this. Without that hack, groff translates - into
 yet another variant of -: U+2010 (HYPHEN).

It's better to fix the man pages instead. The groff input language has had
the distinction between - and \- for ages. In some cases (not in
command line options!) HYPHENs look better than MINUS signs, therefore
I want to be able to write man pages where - gives a HYPHEN.

Bruno

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: acroread in UTF-8 locale

2002-11-25 Thread Bruno Haible
Markus Kuhn writes:
 I did report the third issue (acroread breaking in UTF-8 locales) to
 Adobe multiple times, but no reaction yet. I suspect it might be an
 issue with the widget library they use and acroread ought in my opinion
 to ignore the locale entirely as it has no locale-dependnet
 functionality anyway.

In my experience, they have a problem only with the LC_NUMERIC part of
the locale, and only with some PDF documents. And it can be worked
around by adding a single line to the 'acroread' shell script.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Linux and UTF8 filenames

2002-09-20 Thread Bruno Haible

Glenn Maynard writes:
  convert its filenames using the kernel nls modules; yes,
  it could be done.

 But would be somewhat tricky, since filenames need to be 8-bit clean
 except for / and NULL.  It's a bag of worms with very little value ...

This is a non-issue. All locale encodings used on Linux, from
ISO-8859-* over BIG5 to GB18030, use the bytes 0x2f and 0x00 only
for '/' and '\0' respectively.

The '/' is a problem with ISO-2022 based encodings, but no one with
a brain in his head uses them as locale encodings.

Bruno





--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Lazy man's UTF8

2002-09-19 Thread Bruno Haible

Robert de Bath writes:

 Mr. Lazy knows about wide characters and thinks they're a pain,
 especially for already existing code.

Sure. And furthermore some of them are unreliable: when you use
wprintf you don't know whether it failed because the disk was full or
if there was a conversion error or because the stdio was byte
oriented.

 iconv() is _fairly_ easy to use, the problem isn't that's it's difficult
 just that there's a lot you have to remember to do for a function that
 appears (at first) to have a simple job.

Have a look at the libunistr part of
http://www.haible.de/bruno/gnu/libunistring-0.0.tar.gz
Its unistr.h file declares simple functions for simple tasks - even
though under the hood many of them are based on iconv.

 I don't think there's any support for 'character' counting as
 opposed to 'display cell' counting.

In libunistring: u8_strlen vs. u8_strwidth.
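
For example (a sketch against the current libunistring headers; the 0.0
snapshot cited above may name things differently):

  #include <stdio.h>
  #include <unistr.h>     /* u8_strlen, u8_mbsnlen */
  #include <uniwidth.h>   /* u8_strwidth */

  int
  main (void)
  {
    /* "日本語" as UTF-8 bytes, written as octal escapes. */
    static const uint8_t s[] = "\346\227\245\346\234\254\350\252\236";
    size_t bytes = u8_strlen (s);
    printf ("bytes:      %zu\n", bytes);
    printf ("characters: %zu\n", u8_mbsnlen (s, bytes));
    printf ("cells:      %d\n", u8_strwidth (s, "UTF-8"));
    return 0;
  }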

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Lazy man's UTF8

2002-09-19 Thread Bruno Haible

Glenn Maynard writes:

 Giving wchar_t to iconv isn't portable, though, is it?

It is supported by glibc and GNU libiconv, and libiconv is portable.

 Hmm.  Another thing, while we're on iconv: How do you get the number of
 non-reversible conversions when -1/E2BIG is returned?  It seems that
 converting blocks into a small output buffer (eg. taking advantage of
 E2BIG) means that count is lost.

Seems so, yes. But you can do one round of conversion to see how large
you have to make your buffer, and then in the second round you are
safe from E2BIG.
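
For illustration, a minimal sketch of that two-round technique, assuming a
glibc-style iconv() (the helper name and scratch size are invented, and a
stateful target encoding would additionally need a final flushing call):

  #include <errno.h>
  #include <iconv.h>
  #include <stdlib.h>

  char *convert_whole (iconv_t cd, char *in, size_t inlen, size_t *outlen)
  {
    char scratch[4096];
    char *inp = in;
    size_t inleft = inlen, total = 0;

    /* Round 1: only measure; E2BIG is expected and harmless here. */
    while (inleft > 0) {
      char *outp = scratch;
      size_t outleft = sizeof scratch;
      size_t r = iconv (cd, &inp, &inleft, &outp, &outleft);
      total += sizeof scratch - outleft;
      if (r == (size_t)(-1) && errno != E2BIG)
        return NULL;                      /* EILSEQ or EINVAL: give up */
    }
    iconv (cd, NULL, NULL, NULL, NULL);   /* back to the initial state */

    /* Round 2: the buffer is known to be large enough, no E2BIG. */
    char *out = malloc (total + 1);
    if (out == NULL)
      return NULL;
    char *outp = out;
    size_t outleft = total;
    inp = in;
    inleft = inlen;
    iconv (cd, &inp, &inleft, &outp, &outleft);
    out[total - outleft] = '\0';
    *outlen = total - outleft;
    return out;
  }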

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: very small idea

2002-09-19 Thread Bruno Haible

Mike Fabian writes:
 I tried to cut and paste between gvim, mlterm, xterm, XEmacs, kedit.
 Worked in all directions without problems with UTF-8 encoded
 Japanese text.
 Can you tell me how to reproduce a situation where it doesn't work and
 where the patch helps?

Try with Netscape Communicator. It's one of those clients which
support only UTF8_STRING and not COMPOUND_TEXT. Whereas Emacs is one
of those clients which support only COMPOUND_TEXT and not
UTF8_STRING.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Linux and UTF8 filenames

2002-09-18 Thread Bruno Haible

Martin Kochanski writes:

 how can a poor innocent server discover enough about the
 context in which it is running to know what filename it has to
 use so that a
 user who lists a file directory will see Rêve on his screen?

Since it depends on the user's locale, you'll have to convert the
filename from the given encoding to the user's locale encoding.
Start out with

 const char *given_encoding = "UTF-8";
 // or "UTF-16", depends on what you have
 const char *localedependent = "";
 // shortcut for glibc or libiconv
 iconv_t cd = iconv_open (localedependent, given_encoding);
 ...
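
One way the "..." might continue, as a hedged sketch: one conversion per
filename, falling back to the raw bytes on error (the function name and
the fallback policy are invented here):

  #include <errno.h>
  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  /* Convert one filename with the descriptor opened above and print it;
     on any conversion error, print the original bytes unchanged. */
  void show_filename (iconv_t cd, const char *name)
  {
    char buf[1024];
    char *inp = (char *) name;
    size_t inleft = strlen (name);
    char *outp = buf;
    size_t outleft = sizeof buf - 1;

    iconv (cd, NULL, NULL, NULL, NULL);       /* fresh state per name */
    if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t)(-1)) {
      fputs (name, stdout);                   /* EILSEQ/EINVAL/E2BIG */
      return;
    }
    *outp = '\0';
    fputs (buf, stdout);
  }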

Bruno



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: [Fwd: unicode conversions]

2002-08-16 Thread Bruno Haible

[EMAIL PROTECTED] asks:
 
 i was wondering about libiconv: is there any plan to support
 a fall-back character when performing conversions, as opposed to
 always stopping conversion when a character with no destination
 representation is encountered?

In general, providing fallback characters is the business of the
caller of iconv(). The iconv() function's role is only to determine
whether the input character is convertible to the output codeset, and
if so, how.

It would make sense to add a command line option to the iconv
_program_ to force a question mark for unconvertible characters. It
already has an option ('-c', most useful together with '-s') to omit
unconvertible characters from the output.

As a special case, glibc's iconv() function uses '?' as a fallback
character if conversion is performed with transliteration (i.e. the
target encoding has a //TRANSLIT suffix).

 Hrm, I was under the impression that converting from
 non-unicode to unicode was always possible.

Yes it is, except for a few border cases like Inuktitut characters or
some rare Chinese ideographs, which therefore are mapped to Unicode
private areas until they have been officially added to Unicode.

 Unfortunately, while experimenting with my system iconv, it appears
 to instead stop when there is no destination encoding for a character,
 rather than allowing a fallback to a default character.

Try iconv -c -s.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ASCII and JIS X 0201 Roman - the backslash problem

2002-05-10 Thread Bruno Haible

Tomohiro KUBOTA writes:

  3) For programs that interpret backslash as some kind of escape character
 and use Unicode internally but should work with text in Shift_JIS
 encoding, consider the multibyte character 0x5C as being the escape
 trigger, not [only] the Unicode character U+005C. This is already done
 in bash and gettext. For example, in GNU gettext, we have the code
 
 I think interpretation of
 U+00A5 as an additional escape character doesn't always work, because
 Unicode texts don't have information on their origin (converted from
 Shift_JIS or not).

These are particular kinds of text files, which are fed to such
programs that do backslash interpretation: shell scripts, awk scripts,
gettext PO files, etc. - yes if the Yen sign should appear there it
needs to be doubled.

 If U+00A5 would always be an escape character,
 it would be harmful for many softwares.

Why is it more harmful if U+00A5 is an escape character than if U+005C
is an escape character? In both cases you just double it to get the
original character.

 I am interested in how European people succeeded to migrate from ISO 646
 variants into ISO 8859.  Yen Sign Problem is exactly a problem of ISO 646,
 because 0x5c = YEN SIGN comes from JIS X 0201 Roman, which is Japanese
 variant of ISO 646.

For me, the migration occurred when I switched to using a different
computer with a different OS and a different character set. (From
ISO646-DE to CP437 at that time.) Few files were transported - there
is usually a lot of text files that you can just drop once in three
years. Among the remaining ones the disambiguation was usually easy,
depending on the type of file: In letters I only used umlauts and no
brackets, whereas in programs I mostly used brackets and no umlauts.
Only few programs contained both brackets and umlauts, and I had to do
the fixup by hand, usually the next time I needed the particular
program.

So it is a minor annoyance over the time of a few months, but by far
not the costs that you are estimating.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: readline (was: Switching to UTF-8)

2002-05-02 Thread Bruno Haible

Markus Kuhn writes:

 There is also bash/readline

SuSE 8.0 ships with a bash/readline that works fine with (at least)
width 1 characters in an UTF-8 locale.

There is also an alpha release of a readline version that attempts to
handle single-width, double-width and zero-width characters in all
multibyte locales. But it's alpha (read: it doesn't work for me yet).

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: JISX0213 mapping table

2002-04-09 Thread Bruno Haible

Gaspar Sinai writes:

  555c555
   < 0x12678 0x30D7
  ---
   > 0x12678 0x31F7  0x309A

 If we use 0x30D7 we will clash with:
 
 Table 5 row 4 column 8
 0x8376  0x2557  0x30D7  # 1-5-55 (55 == 0x37)

Yes, this character is a 'small' variant of 0x30D7. I concede. Let's
use  0x31F7 0x309A. It will be the task of the display engine to
position the small circle at the right position.

 But what shall we do with 0x12B65 0xFFFD?
 Maybe another symbol added to Unicode Yi radicals?

Can you move this issue to the unicode.org mailing list?

 7950c7951
 < 0x17624   0xFA3E
 ---
 > 0x17624   0x69EA

You are right. Let's use 0x69EA here. Also can you tell the
unicode.org people to add this one to Unihan.txt?

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: JISX0213 mapping table

2002-04-08 Thread Bruno Haible

Gaspar Sinai writes:

 I would be glad if we could reconcile these files and come up
 with a common format till it is undefined by Unicode.
 
 The diff is quite small now.

81,82c81,82
< 0x12171   0x00A2
< 0x12172   0x00A3
---
> 0x12171   0xFFE0
> 0x12172   0xFFE1
138c138
< 0x1224C   0x00AC
---
> 0x1224C   0xFFE2

These are due to differences in the JISX0208 mapping. I use the one
which was on unicode.org for years (now declared obsolete).

148,149c148,149
< 0x12256   0xFF5F
< 0x12257   0xFF60
---
> 0x12256   0x2985
> 0x12257   0x2986

Look at the glyphs. I used
  http://ftp.ora.com/cjkvinfo/pdf/jisx0208+0213.pdf
  http://www.itscj.ipsj.or.jp/ISO-IR/   228 and 229

214c214
< 0x1233A   0x2299
---
> 0x1233A   0x29BF
555c555
< 0x12678   0x30D7
---
> 0x12678   0x31F7  0x309A

These are indeed debatable.

996,997c996,997
< 0x12B65   0xFFFD
< 0x12B66   0xA4A3
---
> 0x12B65   0x02E9  0x02E5
> 0x12B66   0x02E5  0x02E9

I don't understand how the glyphs of 0x02E9 and 0x02E5 can combine to
the RISING SIGN or FALLING SIGN.

7765a7766
> 0x17427   ???

An unmapped code point. jisx0208+0213.pdf shows reserved at 0xEAA5.

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: 3.2 MAPPINGS/EASTASIA

2002-04-04 Thread Bruno Haible

Tomohiro KUBOTA writes:

 http://www.jca.apc.org/~earthian/aozora/0213.html
 http://www.jca.apc.org/~earthian/aozora/0213/jisx0213code.zip
 
 http://www.cse.cuhk.edu.hk/~irg/
 http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip

Thanks a lot for these pointers! With this information, I can write a
JISX0213 converter for glibc and libiconv.

 Strictly speaking, JIS X 0213:2000 *cannot* be defined as a mapping
 table against ISO 10646, because JIS X 0213's han unification rule
 is different from ISO 10646's one.  (You know, Unicode added several
 tens of compatibility ideographs which are different characters in
 JIS X 0213's point of view and different glyphs of the same
 character in Unicode's point of view.)

I'll make use of these 59 compatibility ideographs in the converter.
That's the whole reason why they were introduced in Unicode 3.2.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: 3.2 MAPPINGS/EASTASIA

2002-04-02 Thread Bruno Haible

Markus Kuhn writes:
 it is now up
 to the maintainers of legacy encoding standards to define the
 relationship of their respective encodings to Unicode properly. The
 ISO 8859 authors have already done this in their second editions, and I
 understand that the latest editions of the relavant JIS standards also
 contain official ISO 10646 cross-reference tables.

Does this also apply to JISX0213:2000? Do you know where to find the
conversion tables for this character encoding? The PDF file in the
ISO-IR registry contains only the pictures of each glyph, but no
conversion table.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Is there a UTF-8 regex library?

2002-04-02 Thread Bruno Haible

David Starner writes:
 Does anyone know of a UTF-8 regex engine, preferably one
 that can be plugged into a GPL'ed C program easily?

Yes, such a regex engine is contained in the glibc CVS
(:pserver:[EMAIL PROTECTED]:/cvs/glibc/libc/posix)
It works not only with UTF-8 but with all multibyte encodings.
It was contributed by Isamu Hasegawa.

An UTF-16 regex engine is available at
http://crl.NMSU.Edu/~mleisher/download.html

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




gettext-0.11.1 is released

2002-03-13 Thread Bruno Haible

It is at ftp.gnu.org (soon also its mirrors) in
gnu/gettext/gettext-0.11.1.tar.gz

New in 0.11.1:

* xgettext now also supports Python, Tcl, Awk and Glade.

* msgfmt can create (and msgunfmt can dump) Tcl message catalogs.

* msggrep has a new option -C that allows searching for strings in translator
  comments.

* Bug fixes in the gettext.m4 autoconf macros.

New in 0.11:

* New programs:
msgattrib - attribute matching and manipulation on message catalog,
msgcat - combines several message catalogs,
msgconv - character set conversion for message catalog,
msgen - create English message catalog,
msgexec - process translations of message catalog,
msgfilter - edit translations of message catalog,
msggrep - pattern matching on message catalog,
msginit - initialize a message catalog,
msguniq - unify duplicate translations in message catalog.

* msgfmt can create (and msgunfmt can dump) Java ResourceBundles.

* xgettext now also supports Lisp, Emacs Lisp, librep, Java, ObjectPascal,
  YCP.

* The tools now know about format strings in languages other than C.
  They recognize new message flags named lisp-format, elisp-format,
  librep-format, smalltalk-format, java-format, python-format, ycp-format.
  When such a flag is present, the msgfmt program verifies the consistency
  of the translated and the untranslated format string.

* The msgfmt command line options have changed.  Option -c now also checks
  the header entry, a check which was previously activated through -v.
  Option -C corresponds to the compatibility checks previously activated
  through -v -v.  Option -v now only increases verbosity and doesn't
  influence whether msgfmt succeeds or fails.  A new option
  --check-accelerators is useful for GUI menu item translations.

* msgcomm now writes its results to standard output by default. The options
  -d/--default-domain and -p/--output-dir have been removed.

* Manual pages for all the programs have been added.

* PO mode changes:
  - New key bindings for 'po-previous-fuzzy-entry',
'po-previous-obsolete-entry', 'po-previous-translated-entry',
'po-previous-untranslated', 'po-undo', 'po-other-window', and
'po-select-auxiliary'.
  - Support for merging two message catalogs, based on msgcat and ediff.

* A fuzzy attribute of the header entry of a message catalog is now ignored
  by the tools, i.e. it is used even if marked fuzzy.

* gettextize has a new option --intl which determines whether a copy of the
  intl directory is included in the package.

* The Makefile variable INTLLIBS is deprecated. It is replaced with
  LIBINTL (in projects without libtool) or LTLIBINTL (in projects with
  libtool).

* New packaging hints for binary package distributors. See file PACKAGING.

* New documentation sections:
  - Manipulating
  - po/LINGUAS
  - po/Makevars
  - lib/gettext.h
  - autoconf macros
  - Other Programming Languages


Happy internationalization! Bonne francisation! Frohes Eindeutschen!

  Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Statically link LGPL cp1252.h with MIT Licensed code?

2002-03-04 Thread Bruno Haible

Michael B Allen writes:
 Can I statically link one of the codepage headers (e.g. cp1252.h) from
 libiconv with an MIT Licensed module? I would not actually alter the
 file of course, so a user could not modify the LGPL files in my module
 any more than if they had used libiconv directly.

Legally speaking: cp1252.h is code, not a public header file. As long as
you don't distribute the resulting binaries/libraries, you can link it
with anything you want. If you want to distribute the result, however,
it must all fall under LGPL, which for binaries is roughly equivalent
to GPL. Namely, you must distribute the source of the whole
binary/library.

Practically speaking: It is on purpose that linking with libiconv as a
shared library is encouraged, whereas linking libiconv as a static
library is not so welcome. The reason is that some people in the
countries not yet well supported by character set standards (South
Asia and Africa, for example) should have an opportunity to adapt their
system to their needs.

 I need to be able to convert one character at a time and provide
 a substitution character if the conversion is invalid, or stop if some
 number of *characters* has been reached.

You can do that by using libiconv unmodified. There are even two ways
to do it:

1) You can make the conversion one character at a time, by offering
one input byte to iconv(), then two bytes, and so on. Kind of slow,
but works.

2) You can convert to an encoding where each character occupies a
fixed number of bytes, like UCS-4, and specify an output buffer of
precisely the size that can hold the number of characters that you
need.
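
For illustration, a sketch of way 2), assuming the input is UTF-8 (the
encoding name "UCS-4-INTERNAL" is GNU libiconv's spelling; other
implementations name it differently, and error handling is omitted):

  #include <iconv.h>

  /* Convert at most n characters: the output buffer holds exactly
     n * 4 bytes, so iconv() stops with E2BIG after n characters.
     Returns the number of characters actually produced. */
  size_t first_n_chars (const char *in, size_t inlen,
                        unsigned int *out, size_t n)
  {
    iconv_t cd = iconv_open ("UCS-4-INTERNAL", "UTF-8");
    char *inp = (char *) in;
    size_t inleft = inlen;
    char *outp = (char *) out;
    size_t outleft = n * 4;

    if (cd == (iconv_t)(-1))
      return 0;
    iconv (cd, &inp, &inleft, &outp, &outleft);
    iconv_close (cd);
    return (n * 4 - outleft) / 4;
  }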

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: mbscmp

2002-02-25 Thread Bruno Haible

Michael B Allen writes:
 Do the str* functions handle strings differently if the locale is
 different?

It depends on the functions.

strcpy strncpy strcat strncat strcmp strncmp strdup strchr strrchr
strcspn strspn strpbrk strstr strtok: NO

strcoll strxfrm: YES

strcasecmp: YES but doesn't work in multibyte locales.

 For example, does strcmp work on UTF-8 strings?

Not well. Better use strcoll.
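
For illustration, a tiny sketch of the difference (the locale is whatever
the environment provides, e.g. de_DE.UTF-8):

  #include <locale.h>
  #include <stdio.h>
  #include <string.h>

  int main (void)
  {
    /* Pick up e.g. de_DE.UTF-8 from the environment. */
    setlocale (LC_ALL, "");
    const char *a = "B\xc3\xa4r";   /* "Bär" in UTF-8 */
    const char *b = "Bz";
    /* Bytewise, 0xC3 sorts after 'z', so strcmp puts "Bär" last;
       strcoll applies the locale's dictionary order instead. */
    printf ("strcmp : %d\n", strcmp (a, b));
    printf ("strcoll: %d\n", strcoll (a, b));
    return 0;
  }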

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: mbscmp

2002-02-25 Thread Bruno Haible

Pablo Saratxaga writes:

 strcoll() doesn't have multibyte problems ?

No. In glibc-2.2 strcoll works fine for all multibyte encodings.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: mbscmp

2002-02-25 Thread Bruno Haible

Michael B Allen writes:

 What's the ultimate goal here? Are any of these functions *supposed*
 to work on multi-byte characters, or will there be mbs* functions?

strcpy strcat strdup
already work for multi-byte characters

strncpy strncat strncmp
cannot work for multi-byte characters because they truncate
characters

strcspn strspn strpbrk strstr
you can write multibyte aware analogs of these

strchr strrchr
use a multibyte aware strstr analog instead

Nothing is standardized in this area, but IMO an mbstring.h include
file which defines these for arbitrary encodings, and an unistring.h
which defines these for UTF-8 strings, would be very nice. I'm working
on an LGPL'ed implementation of the latter.

 /*
  * Returns a pointer to the character at off within the multi-byte string
^^
Emphasize: at _screen_position_ off.

  * src not examining more than sn bytes.
  */
 char *
 mbsnoff(char *src, int off, size_t sn)
 {
 wchar_t ucs;
 int w;  
 size_t n;
 mbstate_t ps;
 
 ucs = 1;
 memset(&ps, 0, sizeof(ps));
 
 if (sn > INT_MAX) {
 sn = INT_MAX;
 }
 if (off < 0) {
 off = INT_MAX;
 }
 
 while (ucs && (n = mbrtowc(&ucs, src, sn, &ps)) != (size_t)-2) {

Change that to:

  while (sn > 0 && (n = mbrtowc(&ucs, src, sn, &ps)) != (size_t)-2) {

 if (n == (size_t)-1) {
 return NULL;
 }
 if ((w = wcwidth(ucs)) > 0) {
 if (w > off) {
 break;
 }
 off -= w;
 }
 sn -= n;
 src += n;
 }
 
 return src;
 }

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: mbscmp

2002-02-25 Thread Bruno Haible

Jimmy Kaplowitz writes:
 based on looking at man pages, you can use one of three
 functions (mbstowcs, mbsrtowcs, or mbsnrtowcs) to convert your multibyte
 string to a wide character string (an array of type wchar_t, one wchar_t
 per *character*), and then use the many wcs* functions to do various
 tests. My recollection of the consensus on this list is that for
 internal purposes, wchar_t is the way to go, and conversion to multibyte
 strings of char is necessary only for I/O, and there only when you can't
 use functions like fwprintf.

That was my impression at the beginning as well. Until I realized that
all this idea leads to are unreliable programs. Because fgetwc, which
you would like to use for I/O, doesn't give you any chance of
correction when it encounters an invalid multibyte character in the
input file. And the output side of the streams are not better: fputwc
on a stream on which someone has already done an fputc call is
undefined behaviour (it can crash or do nothing).

For an example, take the 'rev' program, in the util-linux, and feed it
with ISO-8859-1 input while running in an UTF-8 locale. Simply
unreliable.

Also wchar_t[] occupies more memory. More memory means more cache
misses, means less speed.

Also wchar_t[] doesn't fulfill its promise of 1 character = 1 memory
unit. Because a Vietnamese character is usually composed from two
Unicode characters; the term complex character is used to denote
this multi-wchar_t unit. And you cannot separate these two units,
neither in truncation, regexp search, linebreaking or whatever
algorithm.

For this reason, wchar_t is only good to call wctype.h libc APIs,
not for in-memory representation of strings. The latter should still
be done with char*. And for iterating through characters in multibyte
strings, you can use the inline functions found at

   
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbchar.h?rev=1.3&content-type=text/vnd.viewcvs-markup
   
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbiter_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup
   
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbfile_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup

 However, wchar_t is only guaranteed to be Unicode (which encoding?) 
 when the macro __STDC_ISO_10646__ is defined, as is done with glibc 2.2.

Correct. But it does not mean that *every* Unicode character can be
used: You cannot use Hangul Unicode characters in an ISO-8859-1
locale. In glibc the wctype.h functions work on these characters (in
any locale, except the C locale), but when you convert a Hangul
character to multibyte in such a locale, all you get is a '?'.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-20 Thread Bruno Haible

Markus Kuhn writes:
 I just spotted in section 1.1.3 of RFC 3010 (NFS version 4 Protocol) the
 following requirement: file and directory names are encoded with
 UTF-8.

Good, they got it right.

Where is the conversion between the NFS filenames and the user visible
filenames (in locale encoding) to take place? Probably in the kernel,
and the user-visible encoding will be given by a mount option?

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: isprint() under utf-8 locale

2002-02-13 Thread Bruno Haible

Radovan Garabik writes:
 
 From my naive point of view, I would expect isprint()
 to return nonzero for utf-8 locale, since this would allow
 older non-multibyte aware programs using isprint() just to
 pass utf-8 characters to output, which at least has a chance
 of working, instead of not displaying them at all.

The purpose of calling isprint in such programs is to filter out
control characters, right? Now when such an old program calls
isprint on the individual bytes that constitute a multibyte character,
it cannot know whether that character is a graphic character (like
U+20AC) or a control character (like U+200E). Blindly returning 1
would work in some cases but not in others.

Better is to port the application to use mbrtowc and iswprint.
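
A sketch of such a port, printing only the printable characters of a
multibyte string (the function name is invented, and error recovery is
kept trivial):

  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>
  #include <wctype.h>

  void print_printable (const char *s)
  {
    mbstate_t st;
    size_t len = strlen (s);

    memset (&st, 0, sizeof st);                /* initial state */
    while (len > 0) {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, len, &st);
      if (n == (size_t)-1 || n == (size_t)-2)
        break;                                 /* invalid or incomplete */
      if (n == 0)
        break;                                 /* embedded NUL */
      if (iswprint ((wint_t) wc))
        fwrite (s, 1, n, stdout);              /* keep graphic chars */
      s += n;
      len -= n;
    }
  }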

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Updated: Security in Unicode

2002-02-05 Thread Bruno Haible

Gaspar Sinai writes:

 http://www.yudit.org/security/

About the first of your samples: what happens there in the first and
the third line is that inside the Java programs, the strings are
embedded in left-to-right text, whereas in the JTextArea they have no
preferred direction, and the Unicode bidi algorithm looks at the
direction of the first logical character that has a direction. You can
fix it by adding a left-to-right direction marker to the strings:

new JLabel("\u200e...");
or
new JLabel("\u202a...");
or
new JLabel("\u202d...");

I don't see this as a security problem, because programmers ought to
test their programs before releasing them.

Can't comment on the second sample, though.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Cedilla: a manic text printer

2002-01-31 Thread Bruno Haible

Juliusz Chroboczek writes:
 A first beta of Cedilla, the manic text printer, is available from
 
   http://www.pps.jussieu.fr/~jch/software/cedilla/

If you happen to run it in CLISP 2.26, you need to apply the following
bug fix to clisp, and also use ext:quit instead of lisp:quit.

Bruno


*** clisp-2.26/src/io.d.bak 2001-04-17 09:31:13.0 +0200
--- clisp-2.26/src/io.d 2002-01-31 04:12:46.0 +0100
***
*** 3108,3142 
  TheIarray(hstring)-data = token; # Datenvektor := O(token_buff_1)
  token = TheIarray(token)-data; # Normal-Simple-String mit Token
  var uintL pos = 0; # momentane Position im Token
! loop { # Suche nächstes Hyphen
!   if (len-pos == 1) # einbuchstabiger Charactername?
! break;
!   var uintL hyphen = pos; # hyphen := pos
!   loop {
! if (hyphen == len) # schon Token-Ende?
!   goto no_more_hyphen;
! if (chareq(TheSstring(token)-data[hyphen],ascii('-'))) # Hyphen gefunden?
!   break;
! hyphen++; # nein - weitersuchen
!   }
!   # Hyphen bei Position hyphen gefunden
!   var uintL sub_len = hyphen-pos;
!   TheIarray(hstring)-dims[0] = pos; # Displaced-Offset := pos
!   TheIarray(hstring)-totalsize =
! TheIarray(hstring)-dims[1] = sub_len; # Länge := hyphen-pos
!   # Jetzt ist hstring = (subseq token pos hyphen)
!   # Displaced-String hstring ist kein Bitname - Error
!   pushSTACK(*stream_); # Wert für Slot STREAM von STREAM-ERROR
!   pushSTACK(copy_string(hstring)); # Displaced-String kopieren
!   pushSTACK(*stream_); # Stream
!   pushSTACK(S(read));
!   fehler(stream_error,
!  GETTEXT(~ from ~: there is no character bit with name ~)
! );
!  bit_ok: # Bitname gefunden, Bit gesetzt
!   # Mit diesem Bitnamen fertig.
!   pos = hyphen+1; # zum nächsten
! }
  # einbuchstabiger Charactername
  {
var chart code = TheSstring(token)-data[pos]; # (char token pos)
--- 3108,3114 
  TheIarray(hstring)-data = token; # Datenvektor := O(token_buff_1)
  token = TheIarray(token)-data; # Normal-Simple-String mit Token
  var uintL pos = 0; # momentane Position im Token
! if (len-pos == 1) # einbuchstabiger Charactername?
  # einbuchstabiger Charactername
  {
var chart code = TheSstring(token)-data[pos]; # (char token pos)
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: strcoll for utf-8

2002-01-09 Thread Bruno Haible

Paul Michel writes:

 But strtok() for instance does not handle utf-8
 data properly.

Sure strtok() handles UTF-8 strings properly. It only has the
limitation that the 'delimiter' that you can pass must be an ASCII
character.

strtok() even works with strings encoded in weird encodings like
BIG-5 and GB18030, as long as the 'delimiter' is an ASCII character in
the range 0x00..0x2F.
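
For illustration, a minimal sketch (the sample text and delimiter are
arbitrary):

  #include <stdio.h>
  #include <string.h>

  int main (void)
  {
    /* UTF-8 text splits safely on an ASCII delimiter: no byte of a
       multibyte UTF-8 sequence can equal ','. */
    char line[] = "caf\xc3\xa9,Gr\xc3\xbc\xc3\x9f,\xe6\x97\xa5\xe6\x9c\xac";
    char *tok;
    for (tok = strtok (line, ","); tok != NULL; tok = strtok (NULL, ","))
      printf ("token: %s\n", tok);
    return 0;
  }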

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: getting locale's charset from a script

2001-12-06 Thread Bruno Haible

Ulrich Drepper writes:
 I've implemented this
 
 iconv -f utf-8 -t //TRANSLIT
 
 This was an undefined case which gave not very nice results before.
 Now an empty string (or empty before the second slash) means use the
 locale's charset.

The next release of GNU libiconv will interpret the empty encoding
name "" and "//TRANSLIT" in the same way as glibc.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: implementation language choice

2001-11-30 Thread Bruno Haible

Juliusz Chroboczek writes:

 Finally, would people be willing to use a piece of code that requires
 Bruno Haible's CLISP to be installed?  Or do you think that exclusive
 use of stone-age languages is a must?

Nowadays Python makes a good alternative to Lisp.

Roozbeh writes:
 For me, it's somehow a problem of distributions. Is the prerequisite 
 available in major distributions?

clisp ships with Debian, Suse, Mandrake, and is in RedHat contrib.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: mbrtowc

2001-11-22 Thread Bruno Haible

Markus Kuhn writes:

  mbstate_t ps;
  
  mbrtowc(NULL, NULL, 0, &ps);
 
 This is a bug in your program, not in glibc.

You are right. I'll update the mbrtowc manual page to be clearer on
this issue.
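
For the record, a corrected version of the fragment above: the mbstate_t
must be zeroed before its first use.

  #include <string.h>
  #include <wchar.h>

  void reset_example (void)
  {
    mbstate_t ps;
    memset (&ps, 0, sizeof ps);      /* ps now holds the initial state */
    mbrtowc (NULL, NULL, 0, &ps);    /* well-defined on an initialized state */
  }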

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




libiconv homepage moved

2001-09-14 Thread Bruno Haible

The GNU libiconv homepage is now at

   http://www.gnu.org/software/libiconv/

instead of

   http://clisp.cons.org/~haible/packages-libiconv.html


Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Encoding conversions

2001-09-09 Thread Bruno Haible

Michael B. Allen writes:
   I gather that I can only assume that wchar_t is just a sequence of UCS
   codes of sizeof(wchar_t) in size.
  
  You cannot even assume that. wchar_t is locale dependent and
  OS/compiler/vendor dependent. It should never be used for binary file
  formats and network messages.
 
 Well, I have to normalize to something!

wchar_t is a very wrong thing to normalize to, because it is OS and
locale dependent. UTF-8 is a much better normalization for strings,
both in-memory and on disk. UCS-4 is an alternative, good
normalization for strings in memory.

 Your freshmeat link:
 
 http://clisp.cons.org/~haible/packages-libiconv.html
 
 is broken.

Thanks for the note. I'm currently setting up a replacement.

 Can I use the latest libiconv as a shared library ...

Yes you can.

 So where do people discuss libiconv problems?

With me, or on linux-utf8.

 iconv_open is giving me "No such file or directory".

You should look at errno after iconv_open only if iconv_open returned
(iconv_t)(-1). The manual page doesn't say anything about errno in
the case of a successful return from iconv_open().
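
A minimal sketch of the correct check (the helper name is invented):

  #include <errno.h>
  #include <iconv.h>
  #include <stdio.h>
  #include <stdlib.h>

  iconv_t open_conversion (const char *to, const char *from)
  {
    iconv_t cd = iconv_open (to, from);
    if (cd == (iconv_t)(-1)) {
      perror ("iconv_open");          /* errno is meaningful only here */
      exit (1);
    }
    return cd;                        /* on success, ignore errno */
  }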

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



RE: Encoding conversions

2001-09-09 Thread Bruno Haible

Carl W. Brown writes:

 But UTF-8 is not without its own problems.  Take Oracle for example.

Most of the world is not Oracle. If Oracle uses its own encodings, let
Oracle deal with it.

 They designed UTF-8 to encode UCS-2 not UTF-16.

No, Oracle did not design UTF-8 at all. RFC 2279 specifies UTF-8,
and it encodes all characters from U+0000 to U+7FFFFFFF.

 I am not familiar with libiconv.

ftp://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.7.tar.gz

 ICU has an invalid character callback handler.  I use it for example to
 convert characters that are not in the code page to HTML/XML escape
 sequences.

You can do that with iconv() as well. With iconv(), the processing
simply stops at an invalid/unconvertible character, and the programmer
can do any kind of error handling before restarting the conversion.

 Looking at the iconv() I did not see any provisions for special invalid
 character handling.  Do you have this kind of support in libiconv?

Sure. It is even built-in.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Encoding conversions

2001-09-07 Thread Bruno Haible

Michael B. Allen writes:

 But it's not clear to me how this should be done correctly
 and in a portable way (or at least portable enough so that when it comes
 time to port I don't smack myself in the forehead).

Use iconv. I mean the libc's iconv on GNU libc systems, and the
libiconv (also by GNU, but a different implementation) on other
systems. libiconv is ported to most systems.

 I gather that I can only assume that wchar_t is just a sequence of UCS
 codes of sizeof(wchar_t) in size.

You cannot even assume that. wchar_t is locale dependent and
OS/compiler/vendor dependent. It should never be used for binary file
formats and network messages.

 But is the in memory representation
 of a multi-byte string the equivalent of the UTF-8 encoding

Depends where you got the string. In most cases, like when you got it
from fgets(stdin), it will be in locale dependent encoding (LC_CTYPE
environment variable dependent). Only in particular cases, like
filenames read from 'pax' archives, or when you yourself converted it
to UTF-8, or when you use a GNOME 2 API function, will the string be
in UTF-8.

 So as an example case, to encode wchar_t to UTF-16LE I must convert each
 character to a definative encoding such as UCS-4 and then use iconv
 to get to UTF-16LE.

With the two aforementioned iconv implementations, you can also
directly use iconv_open("UTF-16LE", "wchar_t").

 PS: When encoding ASCII do I want to shave off the 8th bit?

Removing the 8th bit is a garbage in - garbage out technique and
causes endless grief to users. Instead call iconv_open(..., "ASCII"),
and you'll get full error checking if a non-ASCII character is
encountered.
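
For illustration, a sketch of such a check (the function name is invented;
ISO-8859-1 is chosen as source encoding so that every input byte is a
valid character and only convertibility to ASCII is tested):

  #include <errno.h>
  #include <iconv.h>

  /* Return 1 if buf..buf+len is pure ASCII, 0 otherwise.  iconv()
     stops at the first non-ASCII byte with EILSEQ instead of
     silently mangling it. */
  int is_ascii (char *buf, size_t len)
  {
    iconv_t cd = iconv_open ("ASCII", "ISO-8859-1");
    char out[256];
    int ok = 1;

    if (cd == (iconv_t)(-1))
      return 0;
    while (len > 0) {
      char *outp = out;
      size_t outleft = sizeof out;
      if (iconv (cd, &buf, &len, &outp, &outleft) == (size_t)(-1)
          && errno != E2BIG) {
        ok = 0;                       /* EILSEQ: non-ASCII input */
        break;
      }
    }
    iconv_close (cd);
    return ok;
  }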

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: UTF-8 versus utf8

2001-09-03 Thread Bruno Haible

Markus Kuhn writes:
 In particular, the string that setlocale returns is this normalized form

That was true in RedHat 7.0. But meanwhile Ulrich Drepper fixed it on
2000-10-30. The string returned by setlocale() contains ".UTF-8" if
the user's environment variables do.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: ISO 8859-16 is a national security threat :)

2001-08-31 Thread Bruno Haible

Markus Kuhn writes:
 I was delighted to read in
 
   ISO/IEC JTC 1/SC 2/WG 3/N 441
   http://wwwold.dkuug.dk/JTC1/SC2/WG3/docs/n441.pdf
 
 how ISO 8859-16 is officially considered by the Kingdom of the
 Netherlands a threat to their national security.

According to their explanation, Unicode is a threat of their national
security as well :-)  U+015F != U+0219 ...

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: UTF16 and GCC

2001-07-13 Thread Bruno Haible

Christoph Rohland writes:

 Yes, but perhaps we could try to make that standard?

There is a chance to make the u... syntax(es) standard.

Personally I don't think it is possible to standardize the way a
compiler detects the encoding of an input file. Some, like gcc, will
want to use UTF-8 as the default, some others will want to use the
locale encoding.

  (Can't we use uint_least16_t instead of utf16_t?)
 
 No, I think one of the biggest mistakes in the C standard is that
 char/wchar_t is not fixed. We need an exact 16 bit type with a defined
 encoding.

Joseph Myers explained why you won't get such a type (and why ISO C 99
section 7.18.1.1.(3) says that uint8_t, uint16_t and uint32_t are
optional): Some hardware has a word size of 9, 16, 32, or 36 bit, and
GCC and C99 support such hardware.

  Currently only on glibc systems. wchar_t == UCS-4 is only a
  recommendation in ISO C 99, not mandatory (unfortunately).
 
 No, it will be on all Unix systems we support: Solaris, True64,
 HPUX, AIX5L, Reliant.

Did you get a firm confirmation from Sun people that in some version
of Solaris, wchar_t will be UCS-4 in all locales and __STDC_ISO_10646__
will be defined? In which version of Solaris?

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: New Unifont release

2001-07-11 Thread Bruno Haible

Markus Kuhn writes:

   b) As a single (proportional) font, for use by applications which use
  a single font.
 
 Can't b) be solved with the help of fontsets instead of redundantly
 doubling the number of fonts?

Not in the current state of affairs. Xlib doesn't do anything
meaningful when an XFontSet has two fonts with the same encoding
(here: ISO10646-1). The fontset only helps when all you have are fonts
in different character sets (ISO8859-x, JISX0208, JISX0212, etc.);
then the DrawString algorithm will cut the string into segments, based
on the character sets. Other information from the fonts (e.g. width)
is not used during this segmentation.

And for new code, we use Xft instead of XFontSet. There also, it is
helpful to have the entire Unicode repertoire in a single font.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: New Unifont release

2001-07-10 Thread Bruno Haible

Markus Kuhn writes:

 I strongly recommend that you follow the practice we established in
 XFree86 for the -misc-fixed-*-iso10646 fonts and split up GNU Unifont into
 two separate charcell font files, one 8x16 and one 16x16.

No, please don't do that. We need *both* ways of packaging Unicode fonts:

  * As two separate charcell (fixed-width) fonts, for use by xterm
and similar applications where width matters a lot.

  * As a single (proportional) font, for use by applications which use
a single font.

As a matter of fact, GNU unifont (as a single font) is very useful for
use in cooledit or konqueror.

Markus, please consider making a combined packaging of
   misc-fixed-medium-r-normal--
   misc-fixed-medium-r-normal-ja-
into a single font, that would be covered by the same license and
which could therefore be an alternative to unifont, included with
XFree86.

Btw, what is the license of the unifont? Is it suitable for inclusion
in XFree86?

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Luit and screen [was: anti-luit]

2001-07-05 Thread Bruno Haible

Tomohiro KUBOTA writes:

 However, softwares of GNU Project will have to be assigned to FSF.
 (Note the difference between merely GPL-ed softwares and GNU Project
 softwares.)  This FSF's way is to guard itself legally.

This is not true in this generality. There are packages in the GNU
project whose copyright stays with the authors (like GNU clisp). There
are also packages in the GNU project whose copyright is assigned to
the FSF (like GNU GCC and glibc).

The most important point for software that is part of the GNU project
is that it cooperates well with the rest of the system, i.e. most
importantly that it supports the --help and --version command line options,
uses GNU infrastructure like autoconf where possible, imposes no
arbitrary limitations on the users, and mentions the GNU project on
their homepage.

 GPL-ed softwares cannot be included in XFree86 source tree, as
 Juliusz said.
 
 Thus, I think Juliusz's way (luit in X11 license) is reasonable.

Still it seems strange to put a tty based filter program in the X11
distribution. This means that people who use a console and have no X
installed cannot use it.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Emacs and nl_langinfo(CODESET)

2001-07-04 Thread Bruno Haible

Markus Kuhn wrote:

 I think, Juliusz has already understood that naively using iconv() alone
 might not necessarily be well suited well for luit, because it doesn't
 resynchronize all encodings cleverly. You need a bit additional logic. If
 you press ^C in an application that spits out BIG5 in an unfortunate
 moment or truncate a string by counting bytes, then you will loose BIG5
 synchronization, and the terminal has to skip characters in the input
 stream until is finds two G0 characters in a row to be sure again where
 the next character starts. BIG5 is an example of a rather messy encoding,
 not only in that respect.

iconv() itself doesn't resynchronize, but it is easy to resynchronize
using iconv(). It needs less than 10 lines of code. Both the GNU
Compiler for Java and a new gettext PO file lexer that I wrote last
week are based on iconv() and do support resynchronization. The
resynchronization is simple: Whenever iconv() returns -1/EILSEQ, skip
1 byte.
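
Roughly those 10 lines, as a sketch (emitting a '?' per bad byte is one
substitution policy among several):

  #include <errno.h>
  #include <iconv.h>

  void convert_resync (iconv_t cd, char *in, size_t inleft,
                       char *out, size_t outleft)
  {
    while (inleft > 0) {
      if (iconv (cd, &in, &inleft, &out, &outleft) != (size_t)(-1))
        break;                        /* all input converted */
      if (errno != EILSEQ || outleft == 0)
        break;                        /* E2BIG/EINVAL: caller's problem */
      *out++ = '?';                   /* substitute for the bad byte */
      outleft--;
      in++;                           /* skip 1 byte: resynchronized */
      inleft--;
    }
  }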

 ISO 2022 is far worse.

Yes. How do you want to resynchronize when an Escape sequence was
dropped during transmission? You can only try an arbitrary ISO 2022
state and hope it's the correct one.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Locking Linux console to UTF-8

2001-06-29 Thread Bruno Haible

H. Peter Anvin writes:
 
 Personally I would suggest making this kind of user-space console
 software the default

These consoles rely on the framebuffer console. But on my (quite new)
PC I'm unable to get a framebuffer console with a frequency of more
than 60 Hz. (Yes, I tried all possible VESA modes my BIOS offers.)

Will KGI (the framebuffer console with arbitrary hardware timings,
like X) get into the standard kernel? If so, when?

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Determine encoding from $LANG

2001-06-28 Thread Bruno Haible

Markus Kuhn writes:

 Add to that list many of the programming languages that use Unicode
 internally but that do not yet set the default i/o encoding correctly
 automatically based on LC_ALL || LC_CTYPE || LANG.
 
 For example TCL ...

OTOH, Java (both the Sun JDK 1.3 and the GCC 3.0 libjava) and GNU CLISP
already do respect LC_ALL || LC_CTYPE || LANG.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: file name encoding

2001-06-27 Thread Bruno Haible

H. Peter Anvin writes:

  Yes.  This is the point.  When users set the LANG variable, they
  expect all softwares to obey the variable.
 
 The issue is, however, what that does mean?  In particular, strings in
 the filesystem are usually in the system-wide encoding scheme, not
 what that particular user happens to be processing at the time.

Obeying LANG is important in two scenarios:

  1) For the user who uses a single locale, and this locale's encoding
 is not ISO-8859-1. He sets LANG in $HOME/.profile.

 Such a user will in the long run use non-ASCII filenames. They
 will be stored in locale encoding on the disk. Programs should
 be able to display and use such filenames.

  2) For the user who tries out a locale in a different encoding.
 He sets LANG on the command line.

 Such a user will have to be prepared for problems with non-ASCII
 filenames. But everything else should work without manual
 intervention.
   LANG=de_DE.UTF-8 xterm   - get an UTF-8 xterm
   LANG=ja_JP.EUC-JP gvim file  - edit EUC-JP encoded file
   LANG=vi_VN emacs - start emacs with Vietnamese
   input method
   etc.

It's for the second case that it is important that no encodings are
stored in $HOME/.* files. And it's for the first case that non-ASCII
filenames must be supported.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: file name encoding

2001-06-27 Thread Bruno Haible

Juliusz Chroboczek writes:
 In a number of places, a program must interact with its environment
 in a locale-independent manner.  This includes selection conversion,
 keyboard input, and arguably interaction with the file system.

I agree that in _some_ places programs exchange text in locale
independent formats. For example, strings in databases should better
be stored in a locale independent format, so that users in different
locales can access it.

But we need to look at it case by case.

 Lack of understanding of this basic principle leads to absurdities
 such as Emacs' ``selection-coding-system'' variable.

What led to 'selection-coding-system' is that some programs are ICCCM
compliant (use locale independent format for the selection and
cutbuffer) and some are not.

So we'll get a mess every time it's not clear whether a mechanism uses
locale-dependent or -independent text representation.

* Selection: Here ICCCM says it's locale independent.

* Keyboard input: An XKeyEvent is locale independent. Input read
  through XmbLookupString is locale dependent.
  Input read from /dev/tty is assumed to be locale dependent if the
  IEXTEN flag is set.

* Filenames: The POSIX spec for 'ls' implies that 'ls' treats
  filenames as locale (LC_CTYPE) dependent. This means all other
  programs must do the same.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: file name encoding

2001-06-27 Thread Bruno Haible

H. Peter Anvin writes:
 Actually, the conditions for non-ASCII filenames is even stricter: for
 the system to work consistently the way you describe, the ENTIRE
 SYSTEM needs to use the same locale.

It needs not. If the administrator/distribution files are in ASCII,
and users don't need to access each other's files, there is no
problem with user A having /home/A in EUC-JP encoding and user B
having /home/B in UTF-8 encoding.

 FILENAME ENCODINGS IN DIFFERENT LOCALES DO NOT WORK.  PERIOD.

Sure. Therefore it's best to use non-ASCII filenames only after having
switched one's system to UTF-8, not before.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



RE: __STDC_ISO_10646__ support under BSD

2001-06-26 Thread Bruno Haible

Markus Kuhn writes:
 The wchar_t encoding described on
 
   http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
 
 has the advantage that functions such as wcwidth() still can be
 implemented

Yes, but other user-written functions like

   bool is_katakana (wchar_t wc)
   {
     return (wc >= 0x30A1 && wc <= 0x30F6
             || wc >= 0x309B && wc <= 0x309C
             || wc >= 0x30FC && wc <= 0x30FE
             || wc >= 0xFF66 && wc <= 0xFF9F);
   }

that assume __STDC_ISO_10646__ will not work with your iso2022-wc
encoding. Thus __STDC_ISO_10646__ should be undefined when using a
libc with this particular locale. But it is a compile-time
constant. So it implies the libc can not define __STDC_ISO_10646__ at
all.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Emacs and nl_langinfo(CODESET)

2001-06-25 Thread Bruno Haible

Markus Kuhn writes:

 Has someone written autoconf tests for the presence of
 nl_langinfo(CODESET)?

Yes, GNU fileutils and GNU gettext use the following test.

 m4/codeset.m4 
#serial AM1

dnl From Bruno Haible.

AC_DEFUN([AM_LANGINFO_CODESET],
[
  AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
[AC_TRY_LINK([#include <langinfo.h>],
  [char* cs = nl_langinfo(CODESET);],
  am_cv_langinfo_codeset=yes,
  am_cv_langinfo_codeset=no)
])
  if test $am_cv_langinfo_codeset = yes; then
AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
  [Define if you have langinfo.h and nl_langinfo(CODESET).])
  fi
])
=

 Has someone written a tiny nl_langinfo(CODESET) emulator for use until
 FreeBSD gets their locale support sorted out properly?

Yes, it comes as 'libcharset' subdirectory of GNU libiconv. You can
find the newest release at
ftp://ftp.ilog.fr/pub/Users/haible/gnu/libcharset-1.1.tar.gz

You can find instructions for integrating this into Emacs in the
libcharset-1.1/INTEGRATE file.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



RE: __STDC_ISO_10646__ support under BSD

2001-06-25 Thread Bruno Haible

Marco Cimarosti writes:

 As their name implies, Unicode Language Tags only change the language, NOT
 the character set (which remains Unicode, of course).

The distinction is not relevant in this context.

Remember why some people want to keep an ISO-2022 surface of the
world. Because they have long ago invented the (mistaken) assumption
that a character's rendition depends on the character set it is taken
from. That is, a cyrillic character from ISO-8859-5 has width 1,
whereas a cyrillic character from ISO-IR-165 has width 2.

We are discussing how to make these people accept Unicode. I.e. how
can a character with one given Unicode code point be represented with
width 1 or 2, depending on context? Unicode 3.1 contains the means for
that.

A language tag is sufficient, because all Japanese charsets behave
the same w.r.t. rendition of some specific characters. It's kind of a
national custom.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Comments on locale name guideline: CODESET names

2001-06-20 Thread Bruno Haible

Pablo Saratxaga writes:

 The standard Vietnamese encoding is TCVN-5712 not VISCII.

Yes. And it has combining characters, which Markus wants to exclude...

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Again on mbrtowc()

2001-06-20 Thread Bruno Haible

Marco Cimarosti writes:
 I hope this is not too much off topic.
 
 Some time ago, Edmund Grimley Evans asked what should be the value of this
 expression:
 
  mbrtowc(&wc, "", 0, &ps)
 
 I have two other similar questions for cases that seems unspecified:
 
 1) What should the function do when passed a NULL as the last argument?
 Should it use an internal mbstate_t variable or not?

Yes. The manpage says it:

   In all of the above cases, if ps  is  a  NULL  pointer,  a
   static  anonymous state only known to the mbrtowc function
   is used instead.

 2) What should it do and return if an mbstate_t is supplied that contains
 invalid state values?

The same as may happen if you dereference an uninitialized char* variable:
unspecified behaviour. SIGSEGV or toast your cat.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Comments on locale name guideline

2001-06-15 Thread Bruno Haible

[EMAIL PROTECTED] writes:
 
   what was the original goal?  was it just for
   linux, or aimed as a generic guideline for the benefit of any
   UNIX variants (including non-linux?)  i was under impression that
   it falls into the latter case (otherwise you wouldn't cc: to
   bsd-locale mailing list).

Li18nux is about APIs for Linux. But since Linux standards are also
likely to have an effect on *BSD in the future (at least because we
share the same X11 and many applications), comments from BSD people
are welcome. This particular subthread focused on how many locale
encodings exist in POSIX systems. Including *BSD and other Unices.

My previous mail was an attempt to discourage you from spending time
implementing ISO-2022-JP and SJIS locales.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: [li18nux2000:62] Comments on locale name guideline

2001-06-13 Thread Bruno Haible

Keld Simonsen writes:

 4. Add '+' and ',' to the DELIMITERS
 
These are delimiters in ISO/IEC 15897 locale syntax.
 
 5. Change or add the following syntax for locales:
 
LANGUAGE_TERRITORY+MODIFIER1+MODIFIER2,SOURCE_VERSION.CODESET
 
This is the format for locale names in the ISO standard (implemented in glibc).

glibc supports this, but adding this to the spec makes it
unnecessarily more complex. Why choose a complex spec when a simple
one is sufficient? Just to support every existing (but unused) ISO
standard?

 7. For the CODESET repertoire, please add the specials : ( ) / _ . *

No, please don't add : ( ) / . * as these may not occur in charset
names according to RFC 2278.

 8. In MODIFIER, you should remove the line with euro as this is not a good example.
The euro modifier is normally based on a dependency on special
coding in the application to say whether this should be used, and 
as it has not removed the internationalization code from the program,
it is a bad example of i18n.

This is BS. The euro modifier designates locales with a different
contents for LC_MONETARY. It doesn't require special coding in the
applications.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Comments on locale name guideline

2001-06-06 Thread Bruno Haible

Frank da Cruz writes:
 unless absolutely *everybody* agrees on *exactly* how at
 least the following things are handled:
 
  . Case mapping on case-insensitive file systems

not relevant for Unix.

  . Canonical composition or decomposition
  . Canonical ordering of combining characters

These have been specified by the Unicode consortium, so that everyone
will have to implement them the same way.

Nowadays users rarely type a full filename. Filename completion and
point-and-click GUIs make it less frequent.

 Not to mention issues of sorting and collation, e.g. for listing files
 in alphabetical order.

French users can now sort their files according to french dictionary
rules, and similar for the other languages. Actually life gets easier
for users than with the ASCII sorting rule, where German umlauts came
after the entire alphabet.

 Even if Linux gets it right, then we have cross-platform issues such as
 NFS mounts, FTP, and so on.

NFS is rarely used across different locales. For FTP we have a
problem, right. For file archives, POSIX pax (the successor of 'tar')
already specifies that the filenames are stored in UTF-8 in the
archive.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: locale names

2001-06-01 Thread Bruno Haible

Pablo Saratxaga writes:

  why not make it case insensitive?
 
 I think the problem is because the actual data is stored on disk.
 That is, on filesystems that are case sensitive, the locale name is
 case sensitive (unless you try all the possible case combinations when
 reading directory names; which would be a bit wasteful).

This is definitely not the problem. The implementation could simply
map the locale name to lower case _before_ accessing the disk.
Implementations are allowed to do this; SUSV2 says If the [locale
name] does not begin with a slash, the mechanism used to locate the
locale is implementation-dependent.

The problem is that Bram is the only person asking for that feature,
and thus it hasn't found its way into glibc.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


