Re: JOE editor has just added UTF-8 support
Derek Martin [EMAIL PROTECTED]:
> in Gaim. =8^) Now if only Mutt will work properly with UTF-8... Err...
> I'm reading these messages inside mutt, which in turn runs under a
> UTF-8 enabled xterm (uxterm), with the el_GR.UTF-8 locale. And let me
> tell you, it works great, and in fact it's been supporting UTF-8 for a
> long time now. It seems to have problems with double-width Asian
> characters. It works fine with European character sets...

Mutt is supposed to work with double-width characters, and has been known
to, provided it has an appropriate terminal library, such as ncursesw or
a UTF-8 version of slang. Recent versions of Debian use ncursesw, but Red
Hat 9 seems to use slang:

$ cat /etc/redhat-release
Red Hat Linux release 9 (Shrike)
$ ldd /usr/bin/mutt
        libslang-utf8.so.1 => /usr/lib/libslang-utf8.so.1 (0x4002b000)
        ...

Judging by the library name this is supposed to work, so can you describe
a reproducible bug?

Edmund
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: Perl unicode weirdness.
Henry Spencer [EMAIL PROTECTED]:
> A conforming implementation of a function like my g(x), or the UTF-8
> encoding, includes the range check by definition.

Which definition? Are you sure validation is compulsory? Also, since
there's no point in checking for error conditions that you don't know how
to handle, I hope you have a clear idea of what to do with these illegal
high characters in various circumstances, because I don't.

Are you perhaps one of those people who thought it was a good idea for an
MTA to AND incoming message bodies with 0x7f? The standard didn't
officially allow non-US-ASCII data, so by ANDing the data with 0x7f you
make it more standards-compliant - and who cares if you make the message
completely useless to the recipient in the process?
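To make the disputed "range check" concrete, here is a minimal sketch (mine, not from the thread) of decoding one UTF-8 sequence with the checks a strictly conforming decoder applies: no stray continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n octets available) into *out.
 * Returns the number of octets consumed, or -1 on invalid input. */
int utf8_decode_checked(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp;
    int len, i;

    if (n == 0) return -1;
    if (s[0] < 0x80)      { cp = s[0];        len = 1; }
    else if (s[0] < 0xC0) { return -1; }  /* stray continuation byte */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
    else if (s[0] < 0xF8) { cp = s[0] & 0x07; len = 4; }
    else return -1;

    if (n < (size_t)len) return -1;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* The range checks: overlong forms, surrogates, and values
     * beyond U+10FFFF are invalid by definition. */
    if (len == 2 && cp < 0x80)    return -1;
    if (len == 3 && cp < 0x800)   return -1;
    if (len == 4 && cp < 0x10000) return -1;
    if (cp >= 0xD800 && cp <= 0xDFFF) return -1;
    if (cp > 0x10FFFF) return -1;

    *out = cp;
    return len;
}
```

Whether a decoder must apply these checks is exactly the point under dispute; the sketch just shows what it costs to do so.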
Re: Perl unicode weirdness.
Henry Spencer [EMAIL PROTECTED]:
> Yes, it would be better to call the more general encoding, say, UTF-P.

Surely they're the same encoding applied to a different set of points? Or
would you claim that the function f(x) = 1/x on the interval 0 < x < 1 is
a different function from f(x) = 1/x on the interval 0 < x < 2? In a
sense they are different functions, but it's convenient and natural to
give them the same name, and they can both have the same implementation
if you leave it to the caller to check that x is in range.
Re: CD Player
Jan Willem Stumpel [EMAIL PROTECTED]:
> During a short holiday in Greece, I bought some CD´s with Greek songs.
> xmcd, Workman, etc., cannot display the song titles correctly in Greek
> (only displaying a mess of accented Latin-1 characters) in my
> LANG=en_GB.UTF-8 environment.

A version of freedb (cddb) that supports UTF-8 with a new protocol level
6 was announced on December 3, so I don't suppose many clients or client
libraries support it yet. Maybe you could help update them.

(By the way, you wrote CD´s, using an acute accent instead of an
apostrophe.)
Perl in a UTF-8 locale
I have a problem here with Perl v5.8.0 on Red Hat 9. Simplified, my
script looks like this:

    while (<>) { s/ĉ/cx/g; print; }

This works with older versions of Perl, and it works in the C locale, but
it doesn't work here in a UTF-8 locale. I tried putting stuff like "use
bytes" or "no utf8" or "no locale", but it didn't help. Can anyone
suggest a good solution, ideally one that is portable between different
locales and different versions of Perl? Obviously I could use a wrapper.
Currently I'm using this work-around:

    unless ($ENV{LANG} eq 'C') {
        $ENV{LANG} = 'C';
        exec('/path/to/this/script', @ARGV);
    }
Re: Linux console internationalization
Beni Cherniavsky [EMAIL PROTECTED]:
> The first question has some reasonable answers:

One answer I didn't notice in your list was that applications might want
to display the shift state. For example, in one of my Emacs input methods
I use ";c" to type 'ĉ'. When I type ';' I see ';' underlined to remind me
that the ';' might be combined with the following character.

Back in the 1980s I had an Amstrad PCW running LocoScript 2. You switched
between Latin, Cyrillic, Greek and symbol keyboards using Alt-F1, etc,
and there was some kind of indication on the screen of which keyboard was
currently selected, if I remember correctly. (LocoScript 2 also let you
combine any diacritic with any base character and had more diacritics
than TeX ...)
Re: redhat 8.0 - using locales
Antoine Leca [EMAIL PROTECTED]:
> In addition, differences between zh_* in LC_MESSAGES are not trivial.
> AFAIK, Hong Kong is now part of CN. Still, they use Traditional
> Chinese. So what are we doing then? ;-)

Obsolete country codes might be useful for distinguishing a few language
varieties that could not otherwise be distinguished. Is anyone using
de_DD for German without the latest spelling reform? :-)
Re: redhat 8.0 - using locales
> A few files appear under LC_MESSAGES, but it seems they don't show up
> even when LANG=eo.

First, you need to have a locale, maybe eo_ES or so. I recommend eo_XX as
an unofficial way of not choosing a country. There's a locale definition
file here: http://rano.org/eo_XX
Re: NUL-transparent Java-UTF-8
Markus Kuhn [EMAIL PROTECTED]:
> Is there a proper full specification of this encoding somewhere online?
> Merely replacing 0x00 with its overlong UTF-8 equivalent 0xc0 0x80
> can't be the full story, because what you are interested in, in the
> end, must surely be binary transparency, not merely NUL-transparency. I
> don't see what NUL-transparency alone would be good for, as NUL is
> usually only a problem in arbitrary binary strings.

True, but pedantically correct handling of e-mail messages is an
exception. According to RFC 822 all 7-bit characters, including '\0', are
valid in a Subject line, for example. You are even allowed to have a bare
'\r' or a bare '\n'; only \r\n is special: it must be followed by ' ' or
'\t'. Of course, nobody really implements this.

Edmund
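The substitution being discussed is simple enough to sketch (this is my illustration, not code from the thread): each 0x00 octet is replaced by its overlong two-byte form 0xC0 0x80, and everything else passes through, which is why the result is NUL-transparent but not binary-transparent.

```c
#include <stddef.h>

/* Escape NULs the "Java UTF-8" way: replace each 0x00 octet with the
 * overlong two-byte encoding 0xC0 0x80, so the output contains no
 * 0x00 octet.  All other octets pass through unchanged.
 * Returns the output length; out must have room for 2*n octets. */
size_t nul_escape(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, j = 0;
    for (i = 0; i < n; i++) {
        if (in[i] == 0x00) {
            out[j++] = 0xC0;   /* overlong encoding of U+0000 */
            out[j++] = 0x80;
        } else {
            out[j++] = in[i];
        }
    }
    return j;
}
```

Decoding is the reverse: 0xC0 0x80 maps back to 0x00; a strict UTF-8 decoder would of course reject that pair as overlong.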
Re: filename and normalization (was gcc identifiers)
I don't think normalisation helps at all.

An ideal UTF-8 terminal should remember the actual octets that were
printed, so you can accurately copy and paste even random binary data
that is displayed as reverse-field question marks. The ls program should
have an option to display file names in a form in which they can be used
as shell arguments and with difficult octet sequences replaced by
numerical escapes.[*] Those two measures together should make it fairly
easy to copy and paste file names. However, if you add normalisation, it
will stop working.

It might be useful to have a program that looks for a file path on the
system that is similar to a given file path. This program could use
normalisation internally, but it would be better to use a fuzzy
comparison. For example, "guesspath foo" would return "Foo" if the only
files in the current directory are Foo and Bar, but it would return
"foo" if there is a file called foo, and I don't know what it would do
if there are files called foo and Foo.

Edmund

[*] Unfortunately, the Bourne shell doesn't have numerical escapes,
which rather spoils this plan. You could have a file called "\007"
displayed as $(printf '\x07'), while a file actually called
"$(printf '\x07')" is displayed as '$(printf \x07)', etc.
Re: readdir() on linux
marco [EMAIL PROTECTED]:
> I need to make a scan of all the files on a Linux system (independently
> of the type of filesystem and the options given at mount time) and
> record all the filenames. I'm using the readdir() syscall that returns
> a pointer to a struct dirent. My question is: what should I assume
> about the format/encoding of the d_name[] field?

Assume it's a null-terminated octet string. It shouldn't be empty, and it
shouldn't contain (ASCII) '/'. You can't assume the string is valid
character data in any particular encoding. However, if it is valid as
UTF-8, then it probably really is UTF-8, but it might not be printable,
so you'll still have to process it before displaying it.

Edmund
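The advice above can be sketched in code (my illustration, with a deliberately simplified validity check): treat each d_name as an opaque octet string and merely test whether it happens to be well-formed UTF-8 before deciding how to display it.

```c
#include <dirent.h>
#include <stdio.h>

/* Simplified well-formedness test: checks lead/continuation byte
 * structure only.  A real validator would also reject overlong
 * forms, surrogates and values above U+10FFFF. */
static int looks_like_utf8(const unsigned char *s)
{
    while (*s) {
        int cont;
        if (*s < 0x80) cont = 0;
        else if ((*s & 0xE0) == 0xC0) cont = 1;
        else if ((*s & 0xF0) == 0xE0) cont = 2;
        else if ((*s & 0xF8) == 0xF0) cont = 3;
        else return 0;
        while (cont--) {
            s++;
            if ((*s & 0xC0) != 0x80) return 0;  /* also catches early NUL */
        }
        s++;
    }
    return 1;
}

/* List a directory, flagging entries whose names are not valid UTF-8. */
int scan_dir(const char *path)
{
    DIR *d = opendir(path);
    struct dirent *e;
    if (!d) return -1;
    while ((e = readdir(d)) != NULL)
        printf("%s%s\n", e->d_name,
               looks_like_utf8((const unsigned char *)e->d_name)
                   ? "" : "  [not UTF-8]");
    closedir(d);
    return 0;
}
```

Even a name that passes this test may contain control characters, so it still needs escaping before being printed to a terminal.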
Re: readdir() on linux
marco [EMAIL PROTECTED]:
> Ok, does anybody know if the same applies to other unices (e.g.:
> AIX/Solaris)? I would like to understand how Linux compares to these
> commercial OS's.

I didn't notice any difference when I tried the following:

    mkdir t
    cd t
    x=0; while [ $x -lt 255 ]; do x=$[$x+1]; \
        > "$(printf "\x$(printf %02x $x)")"; done
    for x in ?; do echo -n "$x"; done | od -Ax -tx1

There were 252 files created: all octet values except '\0', '.', '/' and
'\n' - the latter due to a limitation of the shell, I assume. Shell
scripts don't work very well with file names containing a newline ...

Edmund
Re: Red Hat 8 now uses UTF-8 by default for all non-CJK users
Radovan Garabik [EMAIL PROTECTED]:
> > There has been surprisingly little user dissatisfaction, for one
> > reason or the other. Not sure why exactly. US-centric user base?
> > Techie user base that uses English anyway? Easy enough to switch
> > back?
> This one probably. Shortly after the new Red Hat came out,
> cz.comp.linux was flooded by users asking "How the f*ck can I turn
> this off". So I suspect everyone who was dissatisfied has already
> switched back to an ISO-8859-2 locale

Let's hope there were also a few people who bothered to report specific
bugs so that they can be fixed! Here's one bug I saw:

http://groups.yahoo.com/group/mutt-dev/message/16606

Mutt was working nicely in UTF-8, but Mutt invokes an external editor, in
this case Emacs, and apparently Emacs was not respecting the locale.
There might have been something in the user's .emacs that caused this,
but could someone please check that with Red Hat 8.0 Emacs will by
default create a UTF-8 file when invoked from a UTF-8 locale?

Edmund
Re: How to read mail with #nnnn
[EMAIL PROTECTED] [EMAIL PROTECTED]:
> Sometimes I receive mail in
>   Content-Type: text/html; charset=iso-8859-1
>   Content-Transfer-Encoding: quoted-printable

Your mail client should decode the quoted-printable and pass the decoded
HTML document to a web browser. I read e-mail with Mutt and I've set it
up to cope with HTML by putting

    text/html; /usr/bin/lynx -dump -force_html %s; copiousoutput

in ~/.mailcap and

    auto_view text/html

in ~/.mutt/muttrc. The muttrc bit is mutt-specific, obviously, but lots
of programs use ~/.mailcap.

Edmund
Re: ISO9660 UTF-8
Jungshik Shin [EMAIL PROTECTED]:
> However, I had to tell him that there's another hurdle to overcome. My
> patch hard-coded 'UTF-16LE' as the codeset name for 'UTF-16 Little
> Endian', but it's not very portable. There should be a way to detect
> the codeset name to use with iconv(3) on a given platform for
> UTF-16LE. Is there any autoconf macro written for this? An alternative
> is to just make it user-configurable at run-time. This is easier for
> programmers, but not so user-friendly...

Because of the way some people use libiconv with LD_PRELOAD, it makes
sense to decide at run time rather than build time. However, you probably
don't need to bother the typical user with configuration stuff; you can
try various possible names and run tests at run time.

Edmund
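The run-time probing suggested above might look like this sketch (mine; the candidate list is a guess, not an exhaustive survey of platforms): simply try iconv_open() with each plausible spelling until one is accepted.

```c
#include <iconv.h>
#include <stddef.h>

/* Probe for a codeset name this platform's iconv accepts for
 * little-endian UTF-16.  Returns the first accepted name, or NULL
 * if none works (fall back to user configuration in that case). */
const char *find_utf16le_name(void)
{
    static const char *candidates[] = {
        "UTF-16LE", "UTF16LE", "UTF-16le", "utf-16le", NULL
    };
    int i;
    for (i = 0; candidates[i] != NULL; i++) {
        iconv_t cd = iconv_open("UTF-8", candidates[i]);
        if (cd != (iconv_t)-1) {
            iconv_close(cd);
            return candidates[i];
        }
    }
    return NULL;
}
```

A more paranoid probe would also convert a known sample and check the output, since a platform could accept a name but interpret the byte order differently.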
Re: Paper size
Henry Spencer [EMAIL PROTECTED]:
> > For the exact same reason you should switch to the metric system...
> Unfortunately, there isn't the same incentive. Paper size is basically
> arbitrary; it doesn't impinge on everything else the way the units
> system does. There's nothing magic about 210x297mm that makes anything
> easier.

But there is! Firstly, if you cut a piece of A4 paper into two halves,
each has the same proportions as A4. Secondly, a piece of An paper has
area 1/2**n of a square metre. Standard photocopier paper weighs 80 grams
a square metre, so a piece of A4 weighs 5 g, and airmail postage rates go
in steps of 5 g or 10 g ...

Of course, it's not really 210x297mm; it's more like 210.224x297.302mm.

Edmund
Re: Paper size and locale
> > As for the actual physical paper format (as opposed to PDF document
> > layout), I'd like to warmly encourage people in North America to
> > start using A4 paper.
> Why would we?

Because you will eventually, so you might as well do it now to minimise
suffering. Well, I don't know how true that is for A4 paper, but that's
a generic reason for accepting a good standard. I have heard of a US
company using A4 for compatibility with its own offices in other
countries, but I don't suppose it happens very often yet.

I can still remember the old foolscap paper that preceded A4 in Britain.
I'm certainly glad they replaced it. Sorry, I'm now totally off topic ...

Edmund
Re: NFS4 requires UTF-8
Bruno Haible [EMAIL PROTECTED]:
> I just spotted in section 1.1.3 of RFC 3030 (NFS version 4 Protocol)
> the following requirement: file and directory names are encoded with
> UTF-8. Good, they got it right. Where is the conversion between the
> NFS filenames and the user-visible filenames (in locale encoding) to
> take place? Probably in the kernel, and the user-visible encoding will
> be given by a mount option?

We had a long and at times somewhat heated discussion about that on this
list some time last year, IIRC. I think it doesn't make sense for file
name arguments to fopen(), opendir(), etc, to be locale-dependent: too
many things will break if different processes see different file names.
The mount option makes sense, but it will be confusing if server file
names and client file names cannot be converted exactly. So there should
be a mount option for converting file names, but people would be well
advised not to use it and instead let applications convert file names,
if they want to.

It's RFC 3010, by the way.

Edmund
Re: NFS4 requires UTF-8
Pablo Saratxaga [EMAIL PROTECTED]:
> > Currently you can have a filename with bytes in 0x01-0x1F and
> > 0x7F-0x9F, however you cannot usually type those directly.
> Well, you can use those \x88 and the like representations, or use that
> lovely tab-completion feature (if the filename starts with a typable
> thing), or use a tool that allows you to pick the file in a menu (that
> is my preferred way to delete bizarre file names: select them in mc
> and press F8; it is much easier)

And the traditional last resort is to move everything with a sensible
name out of the directory and then rm -rf the directory.

Edmund
Re: Security
Markus Kuhn [EMAIL PROTECTED]:
> I still think there is a philosophical misunderstanding here about how
> digital signatures are to be interpreted in cases of legal dispute.
> What in most countries that have thought about the issue would count
> is what the human end user has seen on the display component of the
> device where the signature was generated. The actual bitstring signed
> is not as relevant here as you might believe. You do not need any
> reversibility, you just need a tightly standardized rendering process
> that produces the same readable text each time from the same bit
> string. That standardised rendering algorithm will be used as well in
> court to inspect the bitstring you have signed, not your hexdump
> editor or whatever alternative displaying process that you might come
> up with to provide a different text.

This can't be right, or blind people would not be able to communicate in
a legally recognised way. Also, a document might be passed round a
company and inspected by a large number of blind and seeing persons,
using a wide variety of different software, before it is passed to
another company to form part of a contract. The device where the
signature was generated might be a server with no display component.

I don't think you can get away from the bitstring being the
authoritative text. If different software displays bidirectional text
differently, then you have another kind of potential ambiguity to add to
all the kinds of ambiguity that already exist in any communication
between people. (But thinking about a blind person listening to the text
through a speech synthesiser probably gives a good idea of what the
correct interpretation should be: words should be spoken in the order
they appear in the bitstring, regardless of writing direction.)

Edmund
Re: [linux-utf8] UTF-8 in e-mail subject lines, To: headers, etc.
[EMAIL PROTECTED] [EMAIL PROTECTED]:
> A slight extra problem is that MIME::Words and Mail::Header don't
> really get along very well together. It seems that Mail::Header splits
> up some headers differently from others. If the header is mentioned in
> the magical internal hash %Mail::Header::STRUCTURE, then the header is
> split up on whitespace, commas and semi-colons, eg:
>
>   From: =?utf-8?Q?Richard Jones?= [EMAIL PROTECTED]
>
> But otherwise (eg. for Subject headers), Mail::Header will split at an
> arbitrary location based on length only. This has the effect of
> splitting the word across lines, which breaks things. Unfortunately
> adding %Mail::Header::STRUCTURE{subject} doesn't seem to be the
> answer, because I can't necessarily guarantee that the subject line
> will contain any whitespace. So it looks like I'll have to break the
> header up by hand by adding \n after words before calling
> MIME::Entity->build. I'm sure I can't be the first person to find this
> problem ... I'm also not sure why the RFC doesn't define that headers
> should be concatenated *first*, followed *second* by un-mimeifying.
> That would seem to be a much simpler way of doing things.

Because in general you don't want to unfold (concatenate) header fields.
I don't think Mail::Header should be folding (splitting up) headers at
all. RFC 822 merely says you can fold header lines, not that you should.
In the case of an unstructured field, such as Subject, splitting up and
concatenating the header may destroy deliberate layout, for example:

Subject: Awake! for Morning in the Bowl of Night
    Has flung the Stone that puts the Stars to Flight

Obviously a mail client may want to unfold the text in order to display
it in a summary list, but I don't see why Mail::Header has to mess with
it. So, I suggest you try complaining to the maintainer of Mail::Header.
Perhaps they would be willing to only split up structured header lines,
for example.

Edmund
Re: UTF-8 in e-mail subject lines, To: headers, etc.
[EMAIL PROTECTED] [EMAIL PROTECTED]:
> Hi: When sending an email with the following subject line to an MS
> Outlook email client, Outlook renders the Arabic letters as question
> marks. OTOH message bodies sent in UTF-8 render OK provided the
> Content-Type header is set as appropriate. Is this a problem with
> Outlook, or is the subject line itself badly formed?
>
> Subject: =?utf-8?Q?The next will be in Arabic: =D8=AA=D8=A7=D8=B9
>  =D9=84=D8=A7=D9=84=D8=BA=D8=B9=D9=81=D8=BA=D8=B6=D8=B5=D8=AB=D9=82=D9=81=D8=BA=D8=B9=D9=87=D8=AE=D8=AD=D8=AC=D8=AF=D8=B7=D9=83=D9=85=D9=86=D8=AA=D8=A7=D9=84=D8=A8=D9=8A=D8=A8=D8=B3=D8=B3=D8=B4=D9=84=D8=A7=D8=B1=D9=84=D8=A7
>  =D8=B1=D8=A1=D8=A4=D8=A4=D8=B1=D9=84=D8=A7=D8=A1=D8=A9=D9=89=D8=B2=D9=85=D9=85=D9=87=D9=84=D8=A7=D8=AA=D8=AE=D9=85=D8=AE=D8=AD=D9=83=D8=AA=D9=86=D9=85=D9=89=D8=B4=D8=B3=D9=8A=D8=A8=D8=A8=D9=84=D9=84=D8=AA=D8=A7=D9=84=D8=A8
>  =D9=8A=D9=8A=D8=B3=D8=A6=D8=A1=D8=A4=D8=B1=D9=84=D8=A7=D9=84=D8=A7=D9=89=D9=89=D8=A9=D9=88=D8=B1=D8=B1=D8=A4=D8=A4=D8=A1=D8=A1 ?=
>
> Cheers for any help with this.

The Subject line is badly formed. There shouldn't be any spaces in
encoded-text. See http://www.faqs.org/rfcs/rfc2047.html

I shall attempt to attach a message with a corrected version of that
Subject line ...

Edmund

---BeginMessage---
تاعلالغعفغضصثقفغعهخحجدطكمنتالبيبسسشلارلارءؤؤرلاءةىزممهلاتخمخحكتنمىشسيببللتالبييسئءؤرلالاىىةوررؤؤءء
---End Message---
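The fix being described is how RFC 2047 "Q" encoding forms encoded-text: a SPACE must become '_' (or =20), and anything outside the safe ASCII range becomes =XX, so no raw space ever appears between the ?= delimiters. A sketch of that rule (my illustration, not a complete RFC 2047 encoder; it ignores the 76-octet limit on an encoded-word):

```c
#include <stddef.h>

/* Q-encode n input octets into out (NUL-terminated).  SPACE becomes
 * '_'; printable ASCII other than '=', '?' and '_' passes through;
 * everything else becomes =XX.  Returns the encoded length. */
size_t q_encode(const unsigned char *in, size_t n, char *out, size_t outsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, j = 0;
    for (i = 0; i < n && j + 3 < outsize; i++) {
        unsigned char c = in[i];
        if (c == ' ') {
            out[j++] = '_';                     /* never a raw space */
        } else if (c >= 33 && c <= 126 &&
                   c != '=' && c != '?' && c != '_') {
            out[j++] = (char)c;
        } else {
            out[j++] = '=';
            out[j++] = hex[c >> 4];
            out[j++] = hex[c & 0x0F];
        }
    }
    out[j] = '\0';
    return j;
}
```

Applied to the subject above, the literal spaces in "The next will be in Arabic: " would come out as underscores, which is what makes the encoded-word well formed.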
Re: [I18n]Re: Li18nux Locale Name Guideline Public Review
Bram Moolenaar [EMAIL PROTECTED]:
> In principle, I agree though, case sensitive; work should be aimed at
> making a GUI simple to use, and the CLI consistent and simple. I still
> haven't heard a good reason why case sensitivity is useful.

Simplicity of implementation (of existing and future code) and avoiding
weird bugs have been mentioned as reasons for case sensitivity. Unless I
missed something, the only reason we've had for case insensitivity is
making the names very slightly easier to remember.

Edmund
Re: [I18n]Re: Li18nux Locale Name Guideline Public Review
> setenv LANG de_DE.iso-8859-1@euro
> setenv LANG DE_de.ISO-8859-1@euro
> setenv LANG de_DE.Iso-8859-1@EURO
>
> Do you think an average user can guess which one of these he has to
> type? No GUI available!

If the average user is having to choose between those 3 possibilities,
then presumably those 3 possibilities were presented by some program or
included in some list. That program, or that list, should be modified to
only give valid possibilities.

Edmund
Re: Squeeze one more bit into a UTF-8 sequence?
Michael B Allen [EMAIL PROTECTED]:
> I am in the process of modifying xterm to return keysyms for key
> *releases* (in addition to key presses, naturally). The keysyms would
> be looked up in a table by their osf code (or something :-). A program
> that wants to take advantage of this apparatus could then issue a
> control sequence to turn it on and off and use a normalized table of
> keycodes to work from. Aaaanyway, I would like to use UTF-8 to encode
> the keysym for sending to the program's stdin but there is a problem;
> how do I encode the extra bit of information necessary to indicate
> that a UTF-8 sequence is a key release as opposed to a key press? Is
> there a way to encode /one more bit/ of information into a UTF-8
> sequence in a way that is mostly orthogonal to the encoding itself?

I would have thought that it would be better to use some kind of escape
sequence than invalid UTF-8. For example, you could pick characters D
and U and use DX or just X to mean X pressed and UX to mean X released
(D=down, U=up). Normally, you would transmit just X rather than DX, but
you would have to use DD and DU for D and U themselves being pressed.
For efficiency you could choose D and U to be characters that don't
often get typed, but there's nothing to stop you using the characters
'D' and 'U' if you want. Using a character that isn't too rare has the
advantage of making bugs show up earlier.

Edmund
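The escape scheme above can be sketched directly (my illustration; the marker characters are the literal 'D' and 'U' discussed as one option):

```c
#include <stddef.h>

#define MARK_DOWN 'D'
#define MARK_UP   'U'

/* Encode a key press: X is sent as just X, except that presses of
 * the marker characters themselves are escaped as DD and DU.
 * out needs room for 2 chars; returns the number written. */
size_t encode_press(char key, char *out)
{
    if (key == MARK_DOWN || key == MARK_UP) {
        out[0] = MARK_DOWN;   /* escape the marker itself */
        out[1] = key;
        return 2;
    }
    out[0] = key;
    return 1;
}

/* Encode a key release: always UX. */
size_t encode_release(char key, char *out)
{
    out[0] = MARK_UP;
    out[1] = key;
    return 2;
}
```

The receiver decodes symmetrically: on seeing D, the next character is a literal press of that character; on seeing U, the next character is a release; anything else is an unescaped press. Every valid UTF-8 stream remains valid, which is the point of preferring this over invalid sequences.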
Re: Unicode, character ambiguities
Pablo Saratxaga [EMAIL PROTECTED]:
> > Why was Turkish unified, then?
> It has not. There are two kinds of i: with and without dots: two
> different letters, 4 different chars (upper and lower case of the 2
> letters). They are not unified. Now, the default pair used in almost
> all languages is the one with a dot for the lowercase, and the one
> without a dot for the uppercase. So the default pairing is that one;
> only for Turkish and Azerbaijani are the uppercasing and lowercasing
> rules different.

You've described the situation, but you haven't answered the question.
The obvious alternative would be to have 6 characters: upper and lower
case versions of ordinary I, Turkish/Azeri dotted I and Turkish/Azeri
dotless I. It would be interesting to know whether this alternative is
ever used, in some encoding, was ever considered for Unicode, etc.

Edmund
Re: Unicode, character ambiguities
Henry Spencer [EMAIL PROTECTED]:
> However, the point remains valid: the Fraktur fonts, which have at
> least a strong historical presence in Latin-alphabet texts, are
> unreadable to a lot of Latin-alphabet users, and were nevertheless
> unified.

This is (I assume intentionally) a funny way of putting it. They didn't
have to be unified, because they were never considered to be distinct.
It's hard to imagine why anyone would want to derive the Latin alphabet
by doing a new, independent survey of existing fonts when everyone, even
children, already knows the alphabet. In summary, I don't think
readability has anything to do with it.

Edmund
Re: Printing UTF-8
Juliusz Chroboczek [EMAIL PROTECTED]:
> Finally, would people be willing to use a piece of code that requires
> Bruno Haible's CLISP to be installed? Or do you think that exclusive
> use of stone-age languages is a must?

Hang on! LISP was invented in 1960. The only older language still in use
is FORTRAN (1957). Use of a compiled language might be helpful, to
reduce run-time dependencies. Is there a free Common Lisp compiler? You
could implement in Prolog (1970), Scheme (1975), Caml (1984) or Haskell
(1990). C (1972) is boring; don't use C. :-)

Edmund
Re: getting locale's charset from a script
Bruno:
> > If it doesn't already do so, perhaps the iconv command should have
> > an option to tell you the charset of the current locale, as one of
> > the most likely reasons for wanting to know it is in order to use it
> > as an argument to iconv. So you could also have a pseudo-charset
> > "locale", as in "iconv -f locale -t utf-8".
> A missing -f or -t argument to the iconv program already denotes the
> locale charset. This is true for both glibc iconv (since glibc-2.2.2)
> and libiconv iconv (since libiconv-1.6).

Thanks. But what if I want to convert to the locale charset with
transliteration? Is that possible with iconv?

Edmund
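For what it's worth, both glibc and GNU libiconv accept a //TRANSLIT suffix on the target charset name, which requests transliteration where the tables support it (coverage is implementation- and locale-dependent). A sketch of using it from C (my illustration; error handling is deliberately crude):

```c
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Convert a UTF-8 string to the given charset with transliteration
 * requested via the //TRANSLIT suffix (a glibc/libiconv extension).
 * Inconvertible input is replaced by '?'.  Returns 0 on success,
 * -1 if the conversion could not even be set up. */
int to_charset_translit(const char *tocode, const char *in,
                        char *out, size_t outsize)
{
    char target[64];
    iconv_t cd;
    size_t inleft = strlen(in), outleft = outsize - 1;
    char *inp = (char *)in, *outp = out;

    snprintf(target, sizeof target, "%s//TRANSLIT", tocode);
    cd = iconv_open(target, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;
    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            if (outleft == 0)
                break;              /* output buffer full */
            *outp++ = '?';          /* skip one bad octet, mark it */
            outleft--;
            inp++;
            inleft--;
        }
    }
    *outp = '\0';
    iconv_close(cd);
    return 0;
}
```

Whether an accented letter comes out as its base letter or as '?' depends on the library and, for glibc, on the LC_CTYPE transliteration data, so this only partly answers the question.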
UTF-8 support for freedb and Ogg Vorbis
I recently contributed some UTF-8 support to a couple of projects, which
I will describe in case anyone has any advice for me.

http://sourceforge.net/projects/freedb/

This is a cddbp database server. You give it the precise track lengths
of a CD and it will supply the track titles if someone has already
entered them, or you can contribute them yourself. For example, Debian
has a script called abcde for converting an entire CD to Ogg Vorbis
files which queries a cddbp server automatically so that it can add tags
to the Ogg Vorbis files for you. There are various ways of communicating
with the server, but all of them include an explicit protocol level
except e-mail, which has MIME. Up until now ISO-8859-1 has been
prescribed. My proposal is to define protocol level 6 to be the same as
5 but with UTF-8 prescribed. The server takes care of charset conversion
and can be configured to automatically detect the encoding of disc
files, so an existing database can be used without conversion but new
files can be added in UTF-8.

When UTF-8 data is supplied to an ISO-8859-1 client the server has to
transliterate. The first problem is to provide a good transliteration
table: glibc and libiconv don't transliterate Cyrillic, I think, so can
anyone recommend such a table? The second problem is to avoid
transliterated data being edited by a user and then recontributed as a
correction. Ideally we wouldn't accept an ISO-8859-1 update to a file
that contains non-ISO-8859-1, but unfortunately updates are merged
off-line by a different process, which means it would be messy to
implement, so we might just make do with including a warning in the CD
title when data has been transliterated approximately and trusting the
user to understand it.

http://www.xiph.org/ogg/vorbis/

This is the free replacement for MP3. The Ogg Vorbis format prescribes
UTF-8, but data has to be converted for the client.
My suggestion to require iconv was not welcomed, so I provided both a
converter using iconv and a simple built-in one with a config test to
choose between them. The built-in converter does UTF-8 and 8-bit
encodings. It would be useful if anyone could provide a list of 8-bit
encodings worth including. An encoding is worth including if it is
widely used by people who don't have iconv, and a name of such an
encoding is worth including if it might be returned by
nl_langinfo(CODESET) on a system without iconv.

At present the code uses nl_langinfo(CODESET), where available, to get
the user's charset. Otherwise it looks at the environment variable
CHARSET. Otherwise it assumes US-ASCII. In general, when converting,
illegal input bytes are replaced by '#' and unrepresentable characters
are replaced by '?'.

The function to convert a buffer using iconv is about 200 lines of C,
mainly because of faults in the design of iconv's API, which mean you
have to convert the data 3 times: you have to go via UTF-8 to
distinguish the '#' and '?' cases, and you have to convert from UTF-8
twice to avoid having E2BIG mask the return value telling you that the
conversion was inexact. Also, I have to support both the standard iconv
and the various versions provided by glibc/libiconv, so I'm not totally
happy with iconv.

Edmund
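The charset-detection order described above (nl_langinfo first, then CHARSET, then US-ASCII) can be sketched like this (my illustration; it assumes the caller has already done setlocale(LC_CTYPE, "")):

```c
#include <langinfo.h>
#include <stdlib.h>

/* Guess the user's charset: nl_langinfo(CODESET) where available,
 * then the CHARSET environment variable, then US-ASCII as the
 * last resort.  Assumes setlocale(LC_CTYPE, "") was called first,
 * since nl_langinfo reports on the current locale. */
const char *guess_charset(void)
{
    const char *cs;
#ifdef CODESET
    cs = nl_langinfo(CODESET);
    if (cs != NULL && *cs != '\0')
        return cs;
#endif
    cs = getenv("CHARSET");
    if (cs != NULL && *cs != '\0')
        return cs;
    return "US-ASCII";
}
```

The #ifdef mirrors the "where available" caveat: on systems without nl_langinfo, only the environment variable and the fallback remain.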
Re: wrong strcoll() result with different UTF locale setting
Markus Kuhn [EMAIL PROTECTED]:
> Only in phone books. The more modern German sorting order used in
> dictionaries and most other applications treats ö like o,
> distinguished only in the second sorting level (just like accents are
> sorted in English as well). I'd rather see the ö=oe sorting order
> disappear. It is confusing, user unfriendly, and makes looking up
> words in sorted lists more complicated. It has its place in phone
> books and name lists only, because there used to be a lot of German
> surnames that sounded identical but have ö/oe, ü/ue, ä/ae as spelling
> alternatives (Moeller versus Möller, etc.).

I also used a German library catalogue that had Ö = OE and also I = J
and U = V, presumably with the sound practical justification that I and
J were the same letter in Classical Latin, as were U and V.

Edmund
Re: Odd differences in locale sorting
David Starner [EMAIL PROTECTED]:
> It seems that at least all the non-Latin-script languages should sort
> Latin script the same way, or at least choose between a standard,
> language-neutral 'correct' sort and an efficient sort.

Probably by default each locale should start off by directly or
indirectly copying iso14651_t1 and then apply modifications that only
change the ordering of the letters used in that language. However,
national standards do sometimes describe how foreign letters should be
ordered, so there may be some justification for some of the apparently
eccentric variations.

Edmund
Re: New Unifont release
David Starner [EMAIL PROTECTED]:
> > It's not clear whether this license covers only your additions, or
> > also Roman's original font. What is Roman's original license?
> That was Roman's original license. I'm an American, and American laws
> do not allow copyright on bitmap fonts. Any work I do on the Unifont
> is therefore in the public domain.

If I recall correctly, an international treaty on copyright states that
a citizen of country X gets the same rights in country Y as a citizen of
country Y, so it doesn't make any difference that you're an American.
Your work won't be in the public domain everywhere unless you say so.

Edmund
Re: Arabic (was Re: [I18n]Syriac)
Pablo Saratxaga [EMAIL PROTECTED]: However, if that is not the case, if bdf/pcf fonts need to be created, there is the problem of creating a new font encoding for Syriac. But of course, don't invent anything new if something suitable already exists. At cl.cam.ac.uk I shared an office with George Kiraz, who is the author of some Syriac fonts. I don't have his e-mail to hand, but you can find him on Google with "george kiraz syriac fonts". But I don't think he's at Bell Labs any more, so try his private e-mail address. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Luit and screen [was: anti-luit]
Juliusz Chroboczek [EMAIL PROTECTED]: RB Tho I do agree that luit should be integrated into screen eventually. Impossible for licensing reasons. I should hope that luit will get into the XFree86 tree. What are those reasons? Why can't it be dual-licensed? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Luit and screen [was: anti-luit]
Markus Kuhn [EMAIL PROTECTED]: The GPL is an absolutely fabulous idea, but since there is so much unjustified phobia around it, I'd recommend to donate anything that you produce related to support the use of UTF-8 under POSIX to the public domain (as I did with all my font and other UCS things on my web pages). This seems to maximise impact in other projects as it takes away the fuel from any potential licence discussion. Another possibility is to write that your code may be distributed under "licence of your choice" or GPL. Then people don't have to waste time discussing whether "licence of your choice" is GPL-compatible or not. Perl is distributed this way. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: Again on mbrtowc()
Tomohiro KUBOTA [EMAIL PROTECTED]: It may detect the problem and return EINVAL. The problem is, mbrtowc() returns a size_t value. Thus, any positive value cannot be used for error. If this is a discussion to determine a new standard, I would insist it should return some negative value, for example, -3. Yes, errno should be set to EINVAL. Don't worry: when I wrote "return EINVAL" this was just shorthand for "return (size_t)(-1) and set errno to EINVAL". By the way, UTF-8 is stateful as far as mbrtowc() is concerned, so what Markus wrote about calling abort() does not constitute further evidence of a UTF-8 conspiracy to reduce codeset-diversity. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
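The four-way return convention under discussion can be sketched in C. This is a hand-rolled, illustrative step function (the name utf8_step is invented; it handles only 1-4 byte forms and omits the overlong-form check for brevity), not the locale-dependent mbrtowc() itself, but it reports results the same way: a positive byte count, 0 for a null character, (size_t)(-1) for an invalid sequence, and (size_t)(-2) for an incomplete one.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 character from at most n bytes of s, using the
 * mbrtowc()-style return convention:
 *   >0           : bytes consumed, one character decoded into *wc
 *    0           : a null character was decoded
 *   (size_t)(-1) : invalid sequence (mbrtowc would set errno=EILSEQ)
 *   (size_t)(-2) : incomplete but possibly valid prefix
 */
static size_t utf8_step(uint32_t *wc, const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return (size_t)(-2);            /* nothing to examine yet */
    if (p[0] < 0x80)                 { c = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0)  { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0)  { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0)  { c = p[0] & 0x07; len = 4; }
    else
        return (size_t)(-1);            /* invalid lead byte */
    for (i = 1; i < len; i++) {
        if (i >= n)
            return (size_t)(-2);        /* ran out of input mid-sequence */
        if ((p[i] & 0xC0) != 0x80)
            return (size_t)(-1);        /* bad continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (wc)
        *wc = c;
    return c == 0 ? 0 : len;
}
```

Note how (size_t)(-1) and (size_t)(-2) are unambiguous precisely because a valid multibyte character can never be that long, which is the point Kubota-san raises about size_t having no room for negative error codes.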
Re: UTF-8 as the single common encoding everywhere
H. Peter Anvin [EMAIL PROTECTED]: But which is that? The one described in RFC 2279, the one in ISO 10646-1:2000, or the one in Unicode 3.1? These are different. The only difference is how permissive the standard is with respect to the handling of irregular sequences. No standard has ever required interpretation of irregular sequences (except perhaps as a specification bug), and the only safe answer has always been to reject them. But sometimes it is not possible to reject sequences; you have to do something with the data, even if that means replacing it by '?'s. So in some circumstances it might be better to accept and generate UTF-8 sequences corresponding to all of the integers from 0 to 2^31-1. That is, after all, the simplest and most logical behaviour, and it would be the standard behaviour if there were no endian and UTF-16 problems. It sort of irritates me that in a UCS-4/UTF-8 world we are expected to treat U+D800..U+DFFF and U+FFFE and U+FFFF as illegal. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
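The "UTF-8 over all integers from 0 to 2^31-1" scheme mentioned above is the original ISO 10646 formulation, with sequences of up to six bytes. A minimal encoder sketch (function name and interface invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode any 31-bit integer as original-style UTF-8 (1..6 bytes).
 * Returns the number of bytes written, or 0 if c >= 2^31. */
static size_t utf8_encode31(uint32_t c, unsigned char out[6])
{
    size_t len;
    if      (c < 0x80)        len = 1;
    else if (c < 0x800)       len = 2;
    else if (c < 0x10000)     len = 3;
    else if (c < 0x200000)    len = 4;
    else if (c < 0x4000000)   len = 5;
    else if (c < 0x80000000u) len = 6;
    else return 0;                       /* outside the 31-bit range */

    if (len == 1) {
        out[0] = (unsigned char)c;
    } else {
        /* Lead-byte prefixes for 2..6 byte sequences. */
        static const unsigned char lead[7] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};
        size_t i;
        for (i = len - 1; i > 0; i--) {  /* continuation bytes, low bits first */
            out[i] = 0x80 | (c & 0x3F);
            c >>= 6;
        }
        out[0] = lead[len] | (unsigned char)c;
    }
    return len;
}
```

The largest value, 0x7FFFFFFF, comes out as FD BF BF BF BF BF, which shows why FE and FF can never appear in any UTF-8 variant.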
Re: mbrtowc(&wc, "", 0, &ps)
Marco Cimarosti [EMAIL PROTECTED]: BTW, I see that Plauger's reference contradicts what Markus said in two points, and I have no way of determining who is more correct or up to date: 1) In http://www.dinkumware.com/htm_cl/wchar.html#mbrtowc, it says that mbrtowc() returns zero only when the next completed character is a null character, which cannot of course be the case when the size is zero. Plauger too does not specify what the function should return in this case, but -2 (incomplete mb character) seems a reasonable choice. It's the only reasonable choice, even if you can argue, legalistically, that according to some standard mbrtowc is entitled to return -42 and randomly corrupt memory when given size = 0. 2) In http://www.dinkumware.com/htm_cl/wchar.html#mbstate_t it says that mbstate_t can be initialized simply by setting its *first* member to zero (mbstate_t mbst = {0};), and this would imply that a memset() is only needed to *re*initialize it. I don't think you are allowed to assume that mbstate_t is a structure and has members, so memset is definitely better. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
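The memset() approach recommended above treats mbstate_t as fully opaque: a zeroed object describes the initial conversion state, which mbsinit() can confirm. A minimal sketch (the helper name is invented for illustration):

```c
#include <string.h>
#include <wchar.h>

/* Returns nonzero if a memset-zeroed mbstate_t describes the initial
 * conversion state.  This zeroes the whole object rather than
 * assuming mbstate_t is a struct whose first member may be set to 0. */
static int zeroed_state_is_initial(void)
{
    mbstate_t ps;
    memset(&ps, 0, sizeof ps);
    return mbsinit(&ps) != 0;
}
```

Because mbstate_t may be a scalar, a union, or a struct depending on the C library, memset over the whole object is the only initialization that is portable in all three cases.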
mbrtowc(&wc, "", 0, &ps)
I admit I haven't checked the latest glibc from CVS, and I haven't investigated any databases of bug reports, so I apologise if this is already well known. With glibc-2.2.3, mbrtowc(&wc, "", 0, &ps) seems to return 0 instead of (size_t)(-2). I think this is a bug. We noticed this because a program stopped working when we tried to use glibc instead of libutf8_plug. Edmund

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main()
{
    mbstate_t ps;
    wchar_t wc;
    memset(&ps, 0, sizeof(ps));
    printf("%d\n", (int)mbrtowc(&wc, "", 0, &ps));
    return 0;
}

- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: mbrtowc(&wc, "", 0, &ps)
Markus Kuhn [EMAIL PROTECTED]: With glibc-2.2.3, mbrtowc(&wc, "", 0, &ps) seems to return 0 instead of (size_t)(-2). I think this is a bug. It is a bug in your software. You should never call mbrtowc with 0 as the number n of bytes that mbrtowc is allowed to examine at most. Such a call seems useless, and the standard does not define the behaviour of mbrtowc in that case. One could argue - and I probably would agree - that (size_t)(-2) might be an aesthetically more pleasing return value in that situation, but that is not really a requirement of ISO/IEC 9899:1999(E), §7.24.6.3.2 on page 388. I don't have that document. Could you quote the bit that says that n mustn't be zero? Perhaps someone should write a tutorial on common pitfalls with the restartable multi-byte functions. A list of common mistakes would certainly be helpful, but the priority should be to provide correct man pages. The man page I looked at said nothing about n not being zero, so I assumed I didn't have to check n myself. The code in question is for boot floppies, so I deliberately avoid performing unnecessary checks, which, in another application, I might be happy to do for safety. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
Re: a couple of glibc bugs
Bruno Haible [EMAIL PROTECTED]: Secondly, if you have LANG=fr LANGUAGE=de then you get German messages but nl_langinfo(YESEXPR) and nl_langinfo(NOEXPR) are French. This is confusing. LANGUAGE has an influence only on gettext. If you want to influence gettext() and nl_langinfo(YESEXPR), use LC_MESSAGES: LANG=fr_FR LC_MESSAGES=de_DE But the good thing about LANGUAGE is that it lets you specify a list of languages. LC_MESSAGES doesn't, as far as I know. Could YESEXPR be made to follow LANGUAGE without breaking some standard or convention? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: gettext-0.10.36 is released
Bram Moolenaar [EMAIL PROTECTED]: Yes. It is called bind_textdomain_codeset(), and is documented in the manual. Using this function I don't seem to be able to change the encoding once I have started using gettext. Is this a bug or a feature? I noticed that too. It was said to be fixed in the next version. It seems to be fixed in glibc's CVS, too. I patched my gettext-0.10.36 using the diffs from CVS and it seems to work now. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: gettext-0.10.36 is released
Bruno Haible [EMAIL PROTECTED]: Is there an official mechanism for telling gettext what the target charset is even when the locale is wrong, nl_langinfo is missing, or whatever? Yes. It is called bind_textdomain_codeset(), and is documented in the manual. Using this function I don't seem to be able to change the encoding once I have started using gettext. Is this a bug or a feature? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
reorder-after in locale definition
Can anyone help me with using reorder-after in the LC_COLLATE section of the locale definition? There aren't very many examples to copy, because only sv_SE seems to use it. I'm trying to say that ĉ should be treated like a separate letter between C and D, so I wrote this:

LC_COLLATE
copy "iso14651_t1"
collating-symbol <ccirc>
reorder-after <c>
<ccirc>
reorder-after <U0106>
<U0108> <ccirc>;<CIR>;<CAP>;IGNORE %
reorder-after <U0107>
<U0109> <ccirc>;<CIR>;<MIN>;IGNORE %
reorder-end
END LC_COLLATE

It seems to work for "eo_EO.UTF-8 UTF-8" in /etc/locale.gen, but it doesn't work for "eo_EO ISO-8859-3", because:

eo_EO:46: LC_COLLATE: cannot reorder after <U0106>: symbol not known
eo_EO:48: LC_COLLATE: cannot reorder after <U0107>: symbol not known

Presumably this is because U+0106 and U+0107 aren't present in ISO-8859-3. So, what should I do to make the same locale definition work in UTF-8 and ISO-8859-3? I admit that I don't really understand the purpose of the character specified on the same line as "reorder-after". Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: reorder-after in locale definition
Roozbeh Pournader [EMAIL PROTECTED]: http://anubis.dkuug.dk/jtc1/sc22/open/n2955.pdf Thanks for that. I was trying:

reorder-after <U0106>
<U0108> <ccirc>;<CIR>;<CAP>;IGNORE %
reorder-after <U0107>
<U0109> <ccirc>;<CIR>;<MIN>;IGNORE %

In fact I should have <U0043> and <U0063> instead of <U0106> and <U0107> to make [c-d] in regular expressions be equivalent to [cd]. As far as I can make out from a quick scan of the spec, only "<ccirc>;<CIR>;<CAP>;IGNORE" is used for collating strings, but the order of the lines matters for interpreting character ranges in regular expressions. Not all programs that use regular expressions are locale-sensitive in this way. I haven't investigated why. One program that does have locale-sensitive regular expressions is Mutt. At present a bug in hu_HU prevents [a-z] from working in that locale, but [a-z] seems to mean the same thing in all locales for egrep. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
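The locale-sensitivity of bracket ranges discussed above can be observed with POSIX regcomp()/regexec(). In the default "C" locale, [c-d] matches exactly c and d; under other locales the range follows the collation order, which is what the reorder-after lines influence. A small helper (the name matches is invented) for experimenting:

```c
#include <regex.h>
#include <stddef.h>

/* Compile an extended regex and test whether it matches anywhere in s.
 * Returns 1 on match, 0 on no match, -1 on a compile error.
 * Range interpretation inside [...] depends on the current LC_COLLATE. */
static int matches(const char *pattern, const char *s)
{
    regex_t re;
    int r;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    r = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return r == 0;
}
```

Calling setlocale(LC_ALL, "eo_EO.UTF-8") before matches() (if that locale is generated) is how one would check whether the ĉ reordering makes [c-d] match ĉ as well.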
Re: Unicode-HOWTO 1.0
[EMAIL PROTECTED] [EMAIL PROTECTED]: Unfortunately I am not quite sure what an ACM is. An ACM is "Application Charset Map", the same thing as the screen maps, but an ACM converts bytes to Unicode values. There must be a misunderstanding here about what a screen map is. and koi8r.uni is a unicode map, and contains ... You've confused me. As I understand it there are Application Charset Maps that map from an 8-bit character set to 16-bit UCS values. These are only used when the console is not in UTF-8 mode. And there are Screen Font Maps that map from 16-bit UCS values to font position (8 or 9 bits). I think "unimap" and "screen map" both mean the same as "SFM", but "SFM" is the preferred term nowadays. You have an ACM for each 8-bit charset/encoding you might want to use, and you have an SFM for each font. The font is then independent of the charset/encoding. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: iconv output utf-8 - utf-16, which one is wrong?
You could argue that putting a BOM is the application's duty, not iconv's business, but that would be painful for all applications which try to use iconv. And unlabelled data (e.g. files on a filesystem) shouldn't use UTF-16 or its variants in the first place; that's what UTF-8 is for. Well, the issue is that iconv() is also used for, say, text strings embedded in data. However, it sounds like the solution is simply to request UTF-16BE instead. So, UTF-16 gives you big-endian with BOM, UTF-16BE gives you big-endian without BOM and UTF-16LE gives you little-endian without BOM. How do I ask for the machine's native ordering with or without BOM? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
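The explicit-byte-order variants can be requested from iconv() directly; "UTF-16BE" yields big-endian output with no BOM. A minimal sketch, assuming a glibc-style iconv built into libc (the helper name to_utf16be is invented):

```c
#include <iconv.h>
#include <string.h>

/* Convert a UTF-8 string to UTF-16BE (big-endian, no BOM).
 * Returns the number of output bytes, or 0 on error. */
static size_t to_utf16be(const char *in, unsigned char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-16BE", "UTF-8");
    char *ip = (char *)in;               /* iconv wants non-const */
    char *op = (char *)out;
    size_t il = strlen(in), ol = outcap;

    if (cd == (iconv_t)-1)
        return 0;
    if (iconv(cd, &ip, &il, &op, &ol) == (size_t)-1) {
        iconv_close(cd);
        return 0;
    }
    iconv_close(cd);
    return outcap - ol;                  /* bytes actually produced */
}
```

With "UTF-16" as the target instead, glibc prepends a BOM and (at least traditionally) chooses big-endian order, which is exactly the behaviour being puzzled over in this thread.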
Re: iconv output utf-8 - utf-16, which one is wrong?
[EMAIL PROTECTED] [EMAIL PROTECTED]: Wprint (a postscript filter for Netscape/Mozilla printing output) is now, under FreeBSD, sending the "fffe" as a valid character because it does not expect it. Although it is easy to just skip it if it is present, I would like to know if it should be present at all. U+FEFF is the BOM (Byte Order Mark) or ZERO WIDTH NO-BREAK SPACE. It can in some circumstances be useful to have this at the beginning of a file or datastream to distinguish big-endian UTF-16 from little-endian UTF-16 (and from UTF-8, etc.). However, it can also be harmful, so I don't think iconv should be generating or interpreting BOMs by default. Should iconv perhaps have command-line arguments --bom-in and --bom-out or something similar? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
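Recognizing (or skipping) the mark amounts to checking the first two bytes of the data: FE FF indicates big-endian UTF-16, FF FE little-endian. A sketch (enum and function name invented for illustration):

```c
#include <stddef.h>

enum bom { BOM_NONE, BOM_BE, BOM_LE };

/* Inspect the first bytes of (possibly) UTF-16 data for a byte order
 * mark: U+FEFF serialized big-endian is FE FF, little-endian FF FE. */
static enum bom detect_bom(const unsigned char *p, size_t n)
{
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_LE;
    return BOM_NONE;
}
```

A consumer like Wprint could call this once at the start of the stream and, if a BOM is found, skip those two bytes before decoding; FF FE mid-stream would instead be the byte-swapped noncharacter U+FFFE, a sign the endianness guess is wrong.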
Re: [I18n] Default charset for locale (is UNICODE !)
Roozbeh Pournader [EMAIL PROTECTED]: Until then, I am looking forward to hearing reports from people who have already completely moved their Linux environment to UTF-8, i.e. who run their terminal emulators only in UTF-8 mode all day long. What does still break under UTF-8 and needs to be fixed? My main problem has been pine. First of all it doesn't pass 0x80-0x9F to the terminal, and second it doesn't have automatic charset conversion, so I have problems with messages in ISO-8859-x. In short, almost nothing works with pine. You could try Mutt (www.mutt.org) instead. Apart from reasonable handling of UTF-8 terminals Mutt has other advantages, too. See www.rano.org/mutt.html for the UTF-8 instructions ... Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: non-breaking space
Markus Kuhn [EMAIL PROTECTED]: Edmund GRIMLEY EVANS wrote on 2000-09-12 16:46 UTC: According to glibc's iswprint(160), a non-breaking space is not printable. Is this correct? Certainly not. NBSP is most definitely a printable character. Good. I'm glad to hear it. But even glibc-2.2 seems to think it's unprintable. Could this be fixed, please, Ulrich? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
non-breaking space
According to glibc's iswprint(160), a non-breaking space is not printable. Is this correct? Why is this so? To me, ' ' seems more similar to 'x' than ' ' is to 'x' ... Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: Unicode/UTF-8 support for man
Markus Kuhn [EMAIL PROTECTED]: My suggestion is that groff should offer a new -Twlocale, in which it formats a paragraph as a wchar_t text and then spits it out via wprintf() and friends. The C library will take care of converting this to UTF-8, Latin-1, ASCII, transliteration, etc. For each non-ASCII character in a paragraph, groff should query with wcwidth() how many ASCII character cells wide the character will be according to the locale. This should also take care of transliteration, i.e. wcwidth(0x2264) == 2 in case the locale includes ASCII transliteration and results in wputchar(0x2264) spitting out "<=". You seem to be suggesting that C library functions such as wprintf should do transliteration. But I thought these functions, like wcrtomb, only do reversible transformations between multibyte and wide character representations. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
ucs-fonts and Mozilla
Markus's UCS fonts seem to confuse Mozilla even if I rename the fonts.alias file. With ucs-fonts/ on the font path, Mozilla displays apparently double-width boxes instead of us-ascii chars in various places. One of those places is the box for the URL. This is with Mozilla M17 and a rather old version of Markus's fonts (but I don't suppose that makes any difference). It didn't seem to happen with M16, strangely enough. I'm using Debian 2.2 (potato). Has anyone else seen this problem? Has anyone seen M17 with Markus's fonts on the path without this problem? Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn [EMAIL PROTECTED]: I see valuable binary data (PDF files, ZIP files, etc.) being destroyed almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF data conversion that supposedly smart software tries to perform on the fly. I foresee similar non-recoverable data conversion accidents if we try to establish software that wipes out malformed UTF-8 sequences without mercy and destroys all information that they might have contained. Here the problem is that the program is misconverting on the fly and not giving an error. If the program stopped with an error half way through, the user would know there was a problem and be able to do something about it. So, I don't think a UTF-8 decoder, as implemented in a library, should do anything other than give an error if it encounters malformed UTF-8. The user should be told that something has gone wrong. Clever reversible conversion of malformed sequences is more likely to hide a real problem, causing a bigger problem later, than to be helpful, I suspect. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn [EMAIL PROTECTED]: A) Emit a single U+FFFD per malformed sequence

We discussed this before. I can think of several ways of interpreting the phrase "malformed sequence". I think you probably mean either a single octet in the range 80..BF, or a single octet in the range FE..FF, or an octet in the range C0..FD followed by any number of octets in the range 80..BF such that it isn't correct UTF-8 and isn't followed by another octet in the range 80..BF. This is probably quite hard to implement consistently, and, as with semantics C, the UTF-8/UTF-16 length ratio is unbounded, which means in particular that you can't decode from a fixed-size buffer in the manner of mbrtowc.

B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

This is what I do in Mutt. It's easy to implement and works for any multibyte encoding; the program doesn't have to know about UTF-8. But you have to ask yourself: do I reset the mbstate_t when I replace a bad byte by U+FFFD? If you want consistency, you probably should, as otherwise the mbstate_t is undefined after mbrtowc gives EILSEQ.

C) Emit a U+FFFD only for the first malformed sequence in a run of malformed UTF-8 sequences

I don't think anyone will recommend this.

D) Emit a malformed UTF-16 sequence for every byte in a malformed UTF-8 sequence

Not much good if you're not converting to UTF-16.

So perhaps B should be the generally recommended way. However, I agree that a UTF-8 editor should be able to remember malformed UTF-8 sequences so that you can read in a file, edit part of it and write it out again without it all being rubbished. It's unfortunate that the current UTF-8 stuff for Emacs causes malformed UTF-8 files to be silently trashed. Edmund - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/
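Semantics B above can be sketched as a decode loop that emits one U+FFFD per bad byte and then restarts from the next byte. A tiny hand-rolled UTF-8 step function (1-4 byte forms only, no overlong check, names invented) stands in for mbrtowc() so the sketch does not depend on the locale; treating a truncated sequence at the end of the buffer as malformed is a simplification of the buffering a real program would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 character; (size_t)(-1) means malformed. */
static size_t step(uint32_t *wc, const unsigned char *p, size_t n)
{
    size_t len, i;
    uint32_t c;
    if (p[0] < 0x80)                { c = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
    else return (size_t)(-1);            /* invalid lead byte */
    if (len > n)
        return (size_t)(-1);             /* truncated: treated as malformed */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return (size_t)(-1);         /* bad continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    *wc = c;
    return len;
}

/* Semantics B: on a bad byte, emit U+FFFD, skip exactly one byte,
 * and restart decoding from a fresh state.  Returns output length. */
static size_t decode_replace(const unsigned char *p, size_t n, uint32_t *out)
{
    size_t i = 0, k = 0;
    while (i < n) {
        uint32_t wc;
        size_t r = step(&wc, p + i, n - i);
        if (r == (size_t)(-1)) { out[k++] = 0xFFFD; i++; }
        else                   { out[k++] = wc;     i += r; }
    }
    return k;
}
```

Because the loop advances by exactly one byte per error, output length is bounded by input length, so unlike semantics A it can decode from a fixed-size buffer in the manner of mbrtowc.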