Re: suit file

2009-05-04 Thread Rich Felker
On Mon, May 04, 2009 at 01:01:52PM +0200, Jan Willem Stumpel wrote:
> Ben Wiley Sittler wrote:
> 
> > It's a font "suitcase", and IIRC the font data is actually in
> > the "resource" fork. At least under Mac OS X, fontforge seems
> > to be able to deal with these. If you have the file on a
> > non-Mac OS machine it may well be corrupt, since non-Mac
> > filesystems do not preserve the resource fork data.
> 
> This file was sent to me by a friend, from a Mac computer, by
> e-mail, and then saved on my ext3 HD. Any danger that it was
> corrupted, or incomplete?

Old Mac email programs will often send both the data fork and the
resource fork as separate attachments. You might need a good mail
reader like mutt, which lets you select which MIME part you want to
save, in order to get the resource fork saved as its own file.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: suit file

2009-05-03 Thread Rich Felker
On Sun, May 03, 2009 at 08:02:40AM +0200, Jan Willem Stumpel wrote:
> I have a font for an exotic language (Javanese) that I want to
> convert to UTF-8 encoding. Problem is, the font file was made on a
> Macintosh using Fontographer, and it has a .suit file extension
> that Fontforge doesn't know how to handle.
> 
> Anyone knows of a conversion tool under Linux that can change a
> "*.suit" file to ttf?

Googling for the suit file format turns up lots of SEO-spam sites with
no details on what the format really looks like. I think it's just
some sort of primitive archive format that contains the ttf (or
several ttfs); you may be able to search for a ttf header within it
and then just throw away the suit crap at the beginning using dd.
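
Something along these lines might work (an untested sketch, doing in C
what the dd suggestion describes; it assumes the embedded data is a raw
TrueType stream, and the file names are placeholders):

/* Sketch: scan a Mac font suitcase for an embedded TrueType header
 * and dump everything from that point into a .ttf. The first match
 * may be a false positive; a more careful version would sanity-check
 * the sfnt table directory that follows. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *in = fopen(argc > 1 ? argv[1] : "font.suit", "rb");
    FILE *out = fopen(argc > 2 ? argv[2] : "font.ttf", "wb");
    unsigned char *buf;
    long size, i;

    if (!in || !out) return 1;
    fseek(in, 0, SEEK_END);
    size = ftell(in);
    rewind(in);
    buf = malloc(size);
    if (!buf || fread(buf, 1, size, in) != (size_t)size) return 1;

    /* TrueType data starts with sfnt version 0x00010000, or the tag
     * 'true' in old Mac fonts */
    for (i = 0; i + 4 <= size; i++) {
        if ((buf[i] == 0x00 && buf[i+1] == 0x01 && buf[i+2] == 0x00 && buf[i+3] == 0x00)
         || (buf[i] == 't' && buf[i+1] == 'r' && buf[i+2] == 'u' && buf[i+3] == 'e')) {
            fwrite(buf + i, 1, size - i, out);
            return 0;
        }
    }
    fprintf(stderr, "no TrueType header found\n");
    return 1;
}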

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: i18n fonts

2007-12-18 Thread Rich Felker
On Wed, Dec 19, 2007 at 02:01:26PM +1100, Russell Shaw wrote:
> Russell Shaw wrote:
> >Rich Felker wrote:
> >>On Mon, Dec 03, 2007 at 02:16:00PM +1100, Russell Shaw wrote:
> 
> >
> >Hi,
> >I can parse in the gsub tables. I was trying to do the gpos tables,
> >but the OpenType spec doesn't define "ValueRecord" in
> >"Single Adjustment Positioning: Format 1":
> >
> >  http://www.microsoft.com/typography/otspec/gpos.htm
> 
> I found it in there. For some reason, Ctrl-F "valuerecord" doesn't
> find it in firefox.

Props for researching this stuff. If there are ever going to be good
implementations, a lot more people need to know how it works, exchange
ideas, challenge and debate how to make it best, etc.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: i18n fonts

2007-12-02 Thread Rich Felker
On Mon, Dec 03, 2007 at 02:16:00PM +1100, Russell Shaw wrote:
> Hi,
> I was thinking of making a multilingual text editor.
> 
> I don't get how glyphs are done outside of english.
> 
> I've read the Unicode Standard book.
> 
> When a paragraph of unicode characters is processed, the glyphs
> are layed out according to the state contained in the unicode
> character sequence.
> 
> Depending on this state, the same unicode characters can map to
> multiple glyphs depending on context.
> 
> If multiple fonts exist for a language, then for all these font
> files to work with an editor, then all these glyphs must be indexed
> the same.
> 
> Where can i find the standard that specifies what glyphs are indexed
> by what number? Or are these glyphs created on the fly by the unicode
> paragraph layout processor?

The relevant standard is OpenType fonts, which contain the necessary
tables for mapping sequences of characters to glyphs. The glyph
indexing is specific to the particular font being used; there is no
standard across fonts, and in fact some fonts will use precomposed
glyphs while others will use constituent glyphs with positioning
information to achieve the same result.

OpenType was designed by Microsoft as an abstraction of TrueType and
Type1 fonts with the necessary features for proper Unicode rendering.
On Windows, Uniscribe/USP10.DLL is the code responsible for processing
these tables. Correctly multilingualized applications will use its
functions for text rendering (the standard Windows controls already do
this on the application's behalf).

The situation on Linux and *nix is a bit more diverse. Both GTK+ and
Qt widgets provide semi-correct OpenType handling, but with lots of
mistakes in handling scripts/languages their developers are not very
familiar with. Qt uses its own code for this, while GTK+ uses the
Pango library, an extremely slow “complex text layout” library which
does a lot more than is needed for most uses, and which duplicates
most of the font-specific logic in code, causing lots of headaches in
addition to bloat and bad performance (Firefox with Pango enabled is
many times slower than without; this is why many distributions still
have Pango support disabled by default, causing many languages not to
work...).

I’m very much hoping for a future direction of proper OpenType
rendering support without the need for Pango, but it requires someone
spending some time to understand the problem domain. Basically it’s
just a matter of applying substitution tables, and hard-coding lists
of which tables are needed for which scripts in Unicode and the order
in which they should be applied. (Originally they were intended to be
applied in the order they appear in the font files, but then MS went
and made their implementation hard-code the order, so other
implementations need to follow that in order to handle fonts properly
— or at least that’s my understanding.)

The OpenType specs themselves are available at Microsoft’s website,
but they’re very poorly written. Reading them alone is insufficient
to make an implementation unless you already know basically what the
implementation must do, IMO — something like RFC 1459 in quality...

There’s a (semi-)new library called Harfbuzz which, as I understand
it, is purely the OpenType logic, without all the bloat of Pango. I’m
not sure what stage it’s at these days, but it might be a good place
to begin your search. Of course if your app depends on GTK+ or Qt you
can just use their widgets and forget about the whole issue, but I
hope someone will move things forward for OpenType font support
without the need for these toolkits.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8

2007-04-27 Thread Rich Felker
On Fri, Apr 27, 2007 at 12:41:22PM -0700, Ben Wiley Sittler wrote:
> glad it was rejected. the only really sensible approach i have yet
> seen is utf-8b (see my take on it here:
> http://bsittler.livejournal.com/10381.html and another implementation
> here: http://hyperreal.org/~est/utf-8b/ )
> 
> the utf-8b approach is superior to many others in that binary is
> preserved, but it does not inject control characters. instead it is an
> extension to utf-8 that allows all byte sequences, both those that are
> valid utf-8 and those that are not. when converting utf-8 <-> utf-16,
> the bytes in invalid utf-8 sequences <-> unpaired utf-16 surrogates.
> the correspondence is 1-1, so data is never lost. valid paired
> surrogates are unaffected (and are used for characters outside the
> bmp.)

this approach is perhaps reasonable for applications that want to use
utf-16 internally without corrupting invalid sequences in utf-8, but
it has problems too. for example it's not stable under string
concatenation or substring operations.

the whole reason utf-8 is usable comes from its self-synchronizing
property and the property that one character is never a substring of
another character. this necessarily forces the encoding to treat some
strings as invalid; that is, it's provably impossible to make an
encoding with the required properties where all strings are valid. as
a consequence, any treatment of invalid sequences as if they were
'special characters', like utf-8b does, will break all of the
essential properties. for some applications this may not matter; for
others it would be disastrous. it's certainly not possible to do such
a thing at the C library level (mb*towc family) without causing all
sorts of breakage.

my view is that it's best to just leave the data in its original utf-8
form and not do conversions until 'just in time', for presentation,
character identification, etc. caching this 'presentation' form
alongside the data may be appropriate for many applications.

rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8

2007-04-27 Thread Rich Felker
On Fri, Apr 27, 2007 at 05:15:16PM +0600, Christopher Fynn wrote:
> N3266 was discussed and rejected by WG2 yesterday. As you pointed out
> there are all sorts of problems with this proposal, and accepting it
> would break many existing implementations.

That's good to hear. In followup, I think the whole idea of trying to
standardize error handling is flawed. What you should do when
encountering invalid data varies a lot depending on the application.
For filenames or text file contents you probably want to avoid
corrupting them at all costs, even if they contain illegal sequences,
to avoid catastrophic data loss or vulnerabilities. On the other hand,
when presenting or converting data, there are many approaches that are
all acceptable. These include dropping the corrupt data, replacing it
with U+FFFD, or even interpreting the individual bytes according to a
likely legacy codepage. This last option is popular for example in IRC
clients and works well to deal with the stragglers who refuse to
upgrade their clients to use UTF-8. Also, some applications may wish
to give fatal errors and refuse to process data at all unless it's
valid to begin with.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8

2007-04-26 Thread Rich Felker
On Thu, Apr 26, 2007 at 03:44:33PM +0600, Christopher Fynn wrote:
> N3266
> 
> UCS Transformation Formats summary, non-error and error sequences – 
> feedback on N3248
> 
> 

I must say this is a rather stupid looking proposal. The C0 controls
already have application-defined semantics; trying to give them a
universal meaning like this is a very bad idea. Keep in mind that
U+001A is ^Z, so for example if a terminal emulator converted bogus
UTF-8 from an X11 paste into this character, it would send (possibly
many) suspend commands to the application. Certainly not what the user
had in mind!!

Moreover, C0 and C1 control codes (minus newline and perhaps tab),
along with Unicode line/paragraph separator, should be considered
INVALID in plain text themselves. So generating them as a means of
error replacement is counterproductive as the ^Z's could be seen as
errors in themselves.

Also note that ^Z is DOS EOF. I bet some bad Windows software would
truncate files at the first ^Z...

Finally, I think the fact that this document was submitted in MS Word
form speaks for the author's qualifications (or lack thereof) to
design such a specification...

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8

2007-04-24 Thread Rich Felker
On Tue, Apr 24, 2007 at 04:43:59PM -0400, SrinTuar wrote:
> Basically, its a proposal to cap at 10.
> 
> I see no reason to cap utf-8 and utf-32 just to deal with the
> limitations of utf-16.
> 
> As long as you don't attempt to convert to utf-16, it should not be a
> problem. (and eventually, utf-16 should be phased out)

Capping is a good thing, and 21-bit is exactly the point you want to
cap at. Not only does it ensure that required table indices for UCS
support can't grow unmanagably large; it also ensures that UTF-8 is no
larger than UTF-32, so that conversion can be done in-place in
situations where storage space is limited.

Almost all present-day scripts have already been encoded, and plenty
of historical ones too. Even 18 or 19 bits would have been plenty. I
see no legitimate practical argument against a 21-bit limit; removing
it just increases the potential for implementation complexity with no
benefits.
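
For illustration, here is that size argument in code form (a bare-bones
encoder sketch; surrogate and noncharacter checks are omitted):

/* Sketch: bytes needed to encode a code point in UTF-8. With the
 * 21-bit cap (c <= 0x10FFFF) the worst case is 4 bytes, never more
 * than a single UTF-32 unit, so UTF-32 to UTF-8 conversion can be
 * done in place. Without the cap, the old 5- and 6-byte forms would
 * come back. */
#include <stddef.h>

size_t utf8_encode(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = 0xC0 | (c >> 6);
        out[1] = 0x80 | (c & 0x3F);
        return 2;
    }
    if (c < 0x10000) {
        out[0] = 0xE0 | (c >> 12);
        out[1] = 0x80 | ((c >> 6) & 0x3F);
        out[2] = 0x80 | (c & 0x3F);
        return 3;
    }
    if (c <= 0x10FFFF) {
        out[0] = 0xF0 | (c >> 18);
        out[1] = 0x80 | ((c >> 12) & 0x3F);
        out[2] = 0x80 | ((c >> 6) & 0x3F);
        out[3] = 0x80 | (c & 0x3F);
        return 4;
    }
    return 0;   /* beyond the 21-bit cap */
}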

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: terminal status [Re: wcwidth and locale]

2007-04-24 Thread Rich Felker
On Mon, Apr 23, 2007 at 12:16:29AM +0800, Abel Cheung wrote:
> On 4/17/07, Rich Felker <[EMAIL PROTECTED]> wrote:
> >What is the output of:
> >echo -e '日本語\b\bhello'
> 
> Wait. Quick question: how much should '\b' backstep when wide characters are
> encountered?
> 
> - a whole wide character?
> - a single byte?
> - a half of wide character?

One byte is obviously nonsense since the screen contents are not bytes
but characters. Between the other two options, there's always a
tradeoff: if you want to move by character positions and \b works in
columns or vice versa, then you need to know the width (wcwidth) of
the character you're moving over. However..

> Which is considered 'correct'?

Columns is considered the correct behavior. Otherwise it would be
impossible to position the cursor to a particular visual location
without already knowing the contents of the screen, which a program
might not even know. On the other hand, if you're moving by
characters, then presumably the program knows what the characters on
the screen are, so it can compute widths.
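
A sketch of the kind of computation a column-aware program has to do
(it assumes a UTF-8 locale and a wcwidth that knows the characters
involved):

/* Sketch: compute how many columns a multibyte string occupies, so a
 * program can emit the right number of \b's (or a cursor positioning
 * escape) to move back over it. */
#define _XOPEN_SOURCE 600   /* for wcwidth */
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int string_columns(const char *s)
{
    mbstate_t st = {0};
    wchar_t wc;
    size_t n, left = strlen(s);
    int cols = 0, w;

    while (left) {
        n = mbrtowc(&wc, s, left, &st);
        if (n == 0 || n == (size_t)-1 || n == (size_t)-2) break;
        w = wcwidth(wc);
        if (w > 0) cols += w;      /* nonspacing characters count as 0 */
        s += n;
        left -= n;
    }
    return cols;
}

int main(void)
{
    setlocale(LC_CTYPE, "");
    printf("%d\n", string_columns("日本語"));   /* 6 on a CJK-aware system */
    return 0;
}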

Some terminals (Apple's Terminal.app, I believe) allow you to select
the behavior. This has the benefit of allowing programs which are not
aware of wcwidth to function somewhat usably with wide and/or
nonspacing characters, but at the expense of trashing the column
alignment and visual layout of correct programs. It will also likely
cause serious problems if used with GNU screen, which is width-aware.

One slightly problematic issue is what happens if you position the
cursor 'in the middle' of a double width character and then overwrite
the second column of it. In general the results could be anything
bogus, but good terminals will either erase the character or just
leave half of it there.

uuterm does not yet handle this case, and by chance it will end up
looking for a double-width glyph for the newly written character
(which might exist, depending on the font). This behavior of course
should not be relied upon...

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Questions about Unicode-aware C programs under Linux

2007-04-17 Thread Rich Felker
On Tue, Apr 17, 2007 at 03:17:48PM +, Ali Majdzadeh wrote:
> Hi Rich
> Thanks for your attention. I do use UTF-8 but the files I am dealing with
> are encoded using a strange encoding system, I used iconv to convert them
> into UTF-8. By the way, another question, if all those stdio.h and
> string.h functions work well with UTF-8 strings, as they actually do,
> what would be the reason to use wchar_t and wchar_t-aware functions?

There is a mix of reasons, but most stem from the fact that the
Japanese designed some really bad encodings for their language prior
to UTF-8, which are almost impossible to use in a standard C
environment. At the time, the ANSI/ISO C committee thought that it
would be necessary to avoid using char strings directly for
multilingual text purposes, and was setting up to transition to
wchar_t strings; however, this was very incomplete. Note that C has no
support for using wchar_t strings as filenames, and likewise POSIX has
no support for using them for anything having to do with interfacing
with the system or library in places where strings are needed. Thus
there was going to be a dichotomy where multilingual text would be a
special case only available in some places, while system stuff,
filenames, etc. would have to be ASCII. UTF-8 does away with that
dichotomy.

The main remaining use of wchar_t is that, if you wish to write
portable C applications which work on many different text encodings
(both UTF-8 and legacy) depending on the system's or user's locale,
you can use mbrtowc/wcrtomb and related functions when it's necessary
to determine the identity of a particular character in a char string,
then use the isw* functions to ask questions like: Is it alphabetic?
Is it printable? etc.

On modern C systems (indicated by the presence of the
__STDC_ISO_10646__ preprocessor symbol), wchar_t will be Unicode
UCS-4, so if you're willing to sacrifice some degree of portability,
you can use the values from the mb/wc conversions functions directly
as Unicode character numbers for lookup in fonts, character data
tables, etc.
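
A rough sketch of that usage (error handling is minimal, and the U+
line only applies where __STDC_ISO_10646__ is defined):

/* Sketch: walk a locale-encoded string character by character with
 * mbrtowc, classify each character with the isw* functions, and on
 * __STDC_ISO_10646__ systems use the wchar_t value directly as a
 * Unicode code point. */
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    const char *s = "example text";   /* any text in the locale's encoding */
    mbstate_t st = {0};
    wchar_t wc;
    size_t n, left;

    setlocale(LC_CTYPE, "");
    left = strlen(s);
    while (left) {
        n = mbrtowc(&wc, s, left, &st);
        if (n == 0 || n == (size_t)-1 || n == (size_t)-2) break;
        printf("alpha=%d print=%d", iswalpha(wc), iswprint(wc));
#ifdef __STDC_ISO_10646__
        printf(" U+%04lX", (unsigned long)wc);   /* wc is the Unicode value */
#endif
        putchar('\n');
        s += n;
        left -= n;
    }
    return 0;
}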

Another option is to use the iconv() API (but this is part of the
Single Unix Specification, not the C standard) to convert between the
locale's
encoding and UTF-8 or UCS-4 if you need to make sure your data is in a
particular form.
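
For example, something along these lines (a minimal sketch; it assumes
the implementation accepts the encoding name reported by
nl_langinfo(CODESET) and knows how to convert it to UTF-8):

/* Sketch: convert a string from the locale's encoding to UTF-8 with
 * the iconv() API. Error handling is minimal; a real program would
 * loop on E2BIG and check for EILSEQ/EINVAL. */
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "text in the locale's encoding";
    char out[256];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;
    iconv_t cd;

    setlocale(LC_CTYPE, "");
    cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1) return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) return 1;
    *outp = 0;
    printf("%s\n", out);   /* now guaranteed to be UTF-8 */
    iconv_close(cd);
    return 0;
}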

However, for natural language work where there's a reasonable
expectation that any user of the software would be using UTF-8 as
their encoding already, IMO it makes sense to just assume you're
working in a UTF-8 environment. Some may disagree on this.

Hope this helps.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Questions about Unicode-aware C programs under Linux

2007-04-17 Thread Rich Felker
On Tue, Apr 17, 2007 at 08:47:19AM +, Ali Majdzadeh wrote:
> The program does not print the line read from the file to stdout (some junks
> are printed). I also used "cat ./persian.txt | iconv -t utf-8 > in.txt" to
> produce a UTF-8 oriented file.

If your native encoding is not UTF-8 then of course sending UTF-8 to
stdout is not going to result in something directly legible. I was
assuming you were using UTF-8 everywhere, which you should be doing on
any modern unix system...

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: terminal status [Re: wcwidth and locale]

2007-04-17 Thread Rich Felker
On Tue, Apr 17, 2007 at 11:08:44AM +0200, Egmont Koblinger wrote:
> On Mon, Apr 16, 2007 at 05:13:07PM -0400, Rich Felker wrote:
> 
> > Konsole and Xfce terminal: no support for nonspacing characters;
> > unsure about whether cjk wide characters are right.
> 
> CJK is fine in them AFAIK.

What is the output of:
echo -e '日本語\b\bhello'

It should be: “日本hello” and not “日hello”. I’m not sure which it
does. Also try with explicit cursor positioning escapes.

(I’d appreciate it if you could try and report since I no longer have
them installed and forgot to check this while I did.)

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Questions about Unicode-aware C programs under Linux

2007-04-16 Thread Rich Felker
On Tue, Apr 17, 2007 at 10:46:44AM +0430, Ali Majdzadeh wrote:
> Hello Rich
> Thanks for your response.
> About your question, I should say "yes", I need some text processing
> capabilities.

OK.

> Do you mean that I should use common stdio functions? (like, fgets(), ...)

Yes, they'll work fine.

> And what about UTF-8 strings? Do you mean that these strings should be
> stored in common char*

Yes.

> variables? So, what about the character size defference (Unicode and ASCII)?
> And also, string functions? (like, strtok())

strtok, strsep, strchr, strrchr, strpbrk, strspn, and strcspn will all
work just fine on UTF-8 strings as long as the separator characters
you're looking for are ASCII.

strstr always works on UTF-8, and can be used in place of strchr to
search for single non-ascii characters or longer substrings.
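
A small sketch of both points (it assumes the source file itself is
saved as UTF-8, so the literals below are exactly the byte sequences
being searched):

/* Sketch: byte-oriented string functions operating on UTF-8 data.
 * Tokenizing on ASCII delimiters is safe because UTF-8 never uses
 * ASCII byte values inside a multibyte character; a non-ASCII
 * character can be found with strstr by searching for its UTF-8
 * byte sequence. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[] = "hello سلام world";     /* UTF-8 text */
    const char *word = "naïve";           /* UTF-8: ï is 0xC3 0xAF */
    char *tok, *p;

    for (tok = strtok(line, " \t"); tok; tok = strtok(NULL, " \t"))
        printf("token: %s\n", tok);

    p = strstr(word, "\xC3\xAF");         /* find the ï */
    if (p)
        printf("found at byte offset %d\n", (int)(p - word));
    return 0;
}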

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: wcwidth and locale

2007-04-16 Thread Rich Felker
On Tue, Apr 17, 2007 at 02:04:32AM +0800, Abel Cheung wrote:
> >not all we like, but can you come up with things that should
> >legitimately be wide (i.e. ideographs) which have no chance to enter
> >Unicode?
> 
> Certain there are, say some belonging to Taiwan CNS11643, which
> is regarded as variation of existing character in Unicode. And there

If they're needed for round trip compatibility with a legacy charset,
it should be possible to encode them in one of the CJK compatibility
sections. Are there still characters missing?

> are other symbols and characters not accepted in unicode, not
> necessarily wide. Though I must admit usage of those would certainly
> be quite rare.

If they're not wide then the default wcwidth of 1 is ok, no?

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



terminal status [Re: wcwidth and locale]

2007-04-16 Thread Rich Felker
On Tue, Apr 17, 2007 at 02:04:32AM +0800, Abel Cheung wrote:
> >This is only an issue on character-cell devices which use wcwidth.
> 
> I'm exactly talking about those apps, like terminals.

Given how utterly abysmal current terminals' Unicode support is, this
seems like a relatively minor issue. I don't want to disparage concern
about getting it right, but rather investigate where we're at now and
what needs to be done. Along those lines, I recently evaluated some
terminals with the following results:


Konsole and Xfce terminal: no support for nonspacing characters;
unsure about whether cjk wide characters are right.

Gnome Terminal: I assume it's the same since Xfce uses the same
widget. Please correct me if I'm mistaken since I didn't try it.

urxvt and xterm: CJK and nonspacing character widths are correct, but
rendering is minimal overstrike for nonspacing characters. No bidi or
complex script support. xterm default of only 1 combining character
per cell is horribly deficient for any language that doesn't just use
precomposed characters anyway.

aterm/rxvt/Eterm/etc.: unmaintained; no UTF-8 support at all.

mlterm: CJK and nonspacing character widths are correct, bidi is
available (not sure how well it works) with correct Arabic shaping,
and Indic reordering/shaping is available but as a special case (not
sure how well it works either). Also, cursor position becomes
nonsensical (font-dependent too) with Indic shaping, making
screen-mode (my terminology, as opposed to line-mode) apps difficult
to use.

uuterm (experimental; by me): CJK and nonspacing character widths are
correct. Shaping/ligatures are supported and sufficient for all
scripts afaik, but using a nonstandard font system (ucf). Bidi and
reordering (for Indic vowel marks on left) are not available.


So as of now, here is the status of support for particular languages
I'm aware of:


European-script langs using precomposed forms only: any terminal
except legacy stuff lacking UTF-8 support should be fine.

European-script languages with multiple decomposed accents: uuterm is
probably the only one that works.

Languages of India: mlterm and some old, unmaintained Indic-specific
terminals (pre-Unicode I think) are the only ones that work.

CJK, Thai, Lao: urxvt, xterm, mlterm, and uuterm all work. uuterm is
the only one that supports decomposed Korean (Hangul Jamo) though.

Tibetan: uuterm is the only terminal that works correctly, but a
minimal degree of legibility can be obtained with an ugly tailored
font that does not require shaping, so that urxvt, xterm, and mlterm
are usable.

Burmese: not supported by anything.

Arabic and Hebrew: mlterm and perhaps some rtl-specific terminal
emulators I'm not aware of..?

Mongolian: unknown; probably only mlterm and I'm unsure whether it
even works acceptably well.


One additional issue I have not tested is support for characters
outside the BMP. I know GNU screen totally lacks support for these,
and I suspect many terminal emulators have the same problem.


~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Questions about Unicode-aware C programs under Linux

2007-04-16 Thread Rich Felker
On Mon, Apr 16, 2007 at 11:33:26AM +0330, Ali Majdzadeh wrote:
> Hello All
> Sorry, if my questions are elementary. As I know, the size of wchar_t data
> type (glibc), is compiler and platform dependent. What is the best practice
> of writing portable Unicode-aware C programs? Is it a good practice to use
> Unicode literals directly in a C program?

It depends on the degree of portability you want. Using them in wide
strings is not entirely portable (depends on the translation character
encoding), but using them in UTF-8 strings is (they're just byte
sequences).
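
As an illustration of the difference (the Persian word is just an
example; the \x escapes are its UTF-8 bytes written out explicitly, so
the source file's encoding doesn't matter):

/* Sketch: two ways of putting the Persian word "salaam" into a C
 * program. The first is just a byte sequence and is portable; the
 * second uses a wide string, whose values depend on the
 * implementation's wide character encoding. */
#include <stdio.h>
#include <wchar.h>

/* UTF-8 bytes for U+0633 U+0644 U+0627 U+0645 */
static const char salaam_utf8[] = "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85";

/* wide string via universal character names (C99) */
static const wchar_t salaam_wide[] = L"\u0633\u0644\u0627\u0645";

int main(void)
{
    printf("%s\n", salaam_utf8);   /* legible on a UTF-8 terminal */
    printf("wide length: %d\n",
           (int)(sizeof salaam_wide / sizeof *salaam_wide - 1));
    return 0;
}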

> I have experienced some problems
> with glibc's wide character string functions, I want to know is there any
> standard way of programming or standard template to write a Unicode-aware C
> program? By the way, my native language is Persian. I am working on a C
> program which reads a Persian text file, parses it and generates an XML
> document.

If your application is Persian-specific, then you're completely
entitled to assume the text encoding is UTF-8 and that the system is
capable of dealing with UTF-8 and Unicode. Will there be any
Persian-specific text processing though, or do you just want to be
able to pass through Persian text?

> For this, there exist lots of issues that need the use of library
> functions (eg. wcscpy(), wcsstr(), wcscmp(), fgetws(), wfprintf(), ...),
> and, as I mentioned earlier, I have experienced some odd problems using
> them. (eg. wcsstr() never succeeds in matching two wchar_t * Persian
> strings.)

wcsstr doesn't care about encoding or Unicode semantics or anything.
It just looks for binary substring matches, just like strstr but using
wchar_t instead of char as the unit.

Overall I'd suggest ignoring the wchar_t functions. Especially the
wide stdio functions are problematic. Using UTF-8 is just as easy and
then your strings are directly usable for input and output to/from
text files, commandline, etc.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: wcwidth and locale

2007-04-16 Thread Rich Felker
On Tue, Apr 17, 2007 at 12:11:12AM +0800, Abel Cheung wrote:
> On 4/11/07, Rich Felker <[EMAIL PROTECTED]> wrote:
> >Indeed, glibc's character data is horribly outdated and incorrect.
> >There are plenty of unsupported nonspacing characters, even characters
> >that were present in Unicode 4.0. It also considers nonspacing letters
> >to be non-alphabetic, which is a real problem for users of languages
> >which utilize nonspacing letters.
> 
> AFAIK Pablo Saraxtaga has done something about it [1], though I
> didn't intend to dig deeper and check what has been done.
> 
> [1] http://sourceware.org/bugzilla/show_bug.cgi?id=3885

This works, but UGH it's so disgusting. Someday people need to realize
that POSIX charmap/localedef format is utterly broken for use with
Unicode and replace it with something reasonable that doesn't take 200
megs of core..

> It really depends on the intended audience of the fonts. The original
> intention for those double width Greek and Cyrillic characters is to
> make them align nicely with all other CJK characters. Then there are
> no such thing as wide Greek/Cyrillic characters and wide version of
> some other symbols in Unicode, so font designers in Asia are forced
> to make them wide and map them to narrow ones, since they must
> support legacy encoding for commercial or whatever reason.
> They are doing this out of no choice (except discarding those
> glyphs, which would offend other users).

This is only an issue on character-cell devices which use wcwidth. For
GUI applications, the metrics of the font will govern layout and
alignment, so either can be used. I don't think it's such a big deal
to say these fonts with wide Greek, Cyrillic, etc. aren't suitable for
terminals. In fact they could be automatically used just by squeezing
the glyph horizontally and cropping off the excess spacing.

> I'm also bitten by this issue -- PUA codepoints always have wcwidth=1,
> and it would make CJK fonts suck again because characters keep
> overlapping against each other. Yes, PUA usage should be avoided
> whenever possible, but we would still see legacy systems in the
> short future.

Yes, PUA is very bad. I wouldn't be opposed to designating a certain
portion of the PUA as "wide", but I question whether using the PUA on
charcell devices is even needed.

> Not to mention some characters would never have the
> chance to enter Unicode.

We can debate whether things like the Apple™® symbol are characters or
not all we like, but can you come up with things that should
legitimately be wide (i.e. ideographs) which have no chance to enter
Unicode?

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: wcwidth and locale

2007-04-10 Thread Rich Felker
On Tue, Apr 10, 2007 at 12:36:28PM +0200, Egmont Koblinger wrote:
> Though I cannot answer your original question, I've just found recently that
> glibc's wcwidth database suffers from problems. There are a lot of letters
> or letter-like symbols that are unprintable according to glibc (wcwidth
> returns -1, iswprint returns 0). For example U+0221 (latin small letter d
> with curl) is the first such character. I think we should submit a bugreport
> for glibc...

Indeed, glibc's character data is horribly outdated and incorrect.
There are plenty of unsupported nonspacing characters, even characters
that were present in Unicode 4.0. It also considers nonspacing letters
to be non-alphabetic, which is a real problem for users of languages
which utilize nonspacing letters.

As for wcwidth and iswprint, I recently changed my libc implementation
to consider all Unicode codepoints except illegal/noncharacter/control
codepoints as printable, with a wcwidth of 1 for the BMP and plane 1,
and a wcwidth of 2 for planes 2 and 3. While this is still imperfect
(it won't account for added characters with width 0, for example), it
at least makes it so users with outdated libc/locale data can use the
new characters they might need in a minimal sort of way. I would
recommend that the glibc maintainers do something similar.
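
A sketch of that fallback heuristic (the function name is made up; a
real implementation would layer proper zero-width and East Asian Wide
tables on top of this):

/* Sketch: everything except controls, surrogates and noncharacters
 * is printable; width 1 in the BMP and plane 1, width 2 in planes 2
 * and 3, per the heuristic described above. */
int fallback_wcwidth(unsigned long c)
{
    if (c == 0) return 0;
    if (c < 0x20 || (c >= 0x7F && c < 0xA0)) return -1;  /* C0/C1 controls */
    if (c >= 0xD800 && c <= 0xDFFF) return -1;           /* surrogates */
    if (c >= 0xFDD0 && c <= 0xFDEF) return -1;           /* noncharacters */
    if ((c & 0xFFFE) == 0xFFFE) return -1;               /* U+xxFFFE, U+xxFFFF */
    if (c > 0x10FFFF) return -1;                         /* outside Unicode */
    if (c >= 0x20000 && c < 0x40000) return 2;           /* planes 2 and 3 */
    return 1;                                            /* BMP and plane 1 */
}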

> I don't know whether the width info varies or should vary between different
> utf-8 locales.

The ambiguous characters are wide in CJK locales and narrow in others.
This is probably annoying for some CJK users since the characters
(such as Greek and Cyrillic) obviously should be narrow
typographically; they're wide only for the sake of old programs and
ascii-art type stuff which were designed for legacy charsets. IMO they
should be made narrow by default in all locales with a modifier like
"@wide" or something for the users who actually need them wide.

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: wcwidth and locale

2007-04-10 Thread Rich Felker
On Mon, Apr 09, 2007 at 12:26:51PM -0400, SrinTuar wrote:
> Just a question:
> 
> Does anyone know of locales where ambiguous char-cell width
> characters, such as ※☠☢☣☤ ♀♂★☆ are treated as double 
> width rather than
> single width?

Ambiguous width from a Unicode perspective means just that the
characters did not exist in legacy CJK encodings, or that they were
wide in legacy CJK encodings but narrow in others (and should be
narrow), such as Greek.

> It seems they are double width in most fonts, but on my systems even
> in east asian locales they still return widths of 1. (so I get funny
> overlaps in my terminals )

I think this is a problem with the fonts. There’s no reason a
character like ♀ should be double-width. A few of the examples you
gave are hard to make look nice at 8x16 and could benefit from a
double-width cell, but all of them are legible and distinguishable at
8x16. If you’re using a smaller font size you shouldn’t expect
non-Latin characters to be particularly legible.

At times I’ve thought it would be beneficial to update and standardize
the wcwidth table to make certain characters wide, such as the em
dash and various letters in certain Indic and other scripts which
cannot adequately be represented in a single cell due to their
proportions and level of detail. But I’m not entirely sure how this
should be done, and even if it were done, I don’t think dingbats are
appropriate candidates.

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support [BACK OFF-TOPIC]

2007-04-07 Thread Rich Felker
On Sat, Apr 07, 2007 at 08:21:25PM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> > Using UTF-8 would have accomplished the same thing without
> > special-casing.
> 
> Then iterating over strings and specifying string fragments could not be
> done by code point indices, and it’s not obvious how a good interface
> should look like.

One idea is to have a 'point' in a string be an abstract data type
rather than just an integer index. In reality it would just be a UTF-8
byte offset.

> Operations like splitting on whitespace would no
> longer have simple implementations based on examining successive code
> points.

Sure it would. Accessing a character at a point would still evaluate
to a character. Instead of if/else for 8bit/32bit string, you'd just
have a UTF-8 operation.
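
A sketch of what such point-based access could look like in C terms
(assumes the string is valid UTF-8; no bounds or error checking):

/* Sketch: a "point" is just a byte offset into a UTF-8 string.
 * char_at decodes the code point at a point; next_point advances to
 * the following character. Assumes valid UTF-8 and that the point
 * is not at the terminating NUL. */
#include <stddef.h>

unsigned long char_at(const unsigned char *s, size_t point)
{
    unsigned long c = s[point];
    int extra = c >= 0xF0 ? 3 : c >= 0xE0 ? 2 : c >= 0xC0 ? 1 : 0;
    c &= 0x7Fu >> extra;                    /* strip the length prefix */
    while (extra--)
        c = (c << 6) | (s[++point] & 0x3F); /* fold in continuation bytes */
    return c;
}

size_t next_point(const unsigned char *s, size_t point)
{
    do point++;
    while ((s[point] & 0xC0) == 0x80);      /* skip continuation bytes */
    return point;
}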

> This still rules out bounds checking. If each s[i] among 4096 indexing
> operations has the cost of 4096-i, then 8M might become noticeable.

Indeed this is true. Here's a place where you're very right: an HLL
which does bounds checking will want to know (at the implementation
level) the size of arrays. On the other hand this information is
useless to C, and if you're writing C, it's your responsibility to
know whether the offset you access is safe before you access it.
Different languages and target audiences/domains have different
requirements.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support [BACK OFF-TOPIC]

2007-04-07 Thread Rich Felker
On Sat, Apr 07, 2007 at 01:46:22PM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> For example in my language Kogut a string is a sequence of Unicode code
> points. My implementation uses two string representations internally:
> if it contains no characters above U+00FF, then it’s stored as a
> sequence of bytes, otherwise it’s a sequence of 32-bit integers.

> This variation is not visible in the language. The narrow case has
> a redundant NUL appended. When a string is passed to some C function
> and the function expects the default encoding (normally taken from
> the locale), then — under the assumption that a default encoding
> is ASCII-compatible — if the string contains only ASCII characters
> excluding NUL, a pointer to the string data is passed. Otherwise

I hope you generate an exception (or whatever the appropriate error
behavior is) if the string contains a NUL byte other than the
terminator when it's passed to C functions. Otherwise you risk the
same sort of vuln that burned Firefox. Passing a string with embedded
NULs where a NUL-terminated string is expected is an incompatible type
error, and a self-respecting HLL should catch and protect you from
this.

> a recoded array of bytes is created. This is quite a practical reason
> to store the redundant NULs, even though NUL is not special as far as
> the string type is concerned. Most strings manipulated by average
> programs are ASCII-only.

Using UTF-8 would have accomplished the same thing without
special-casing. Then even non-ASCII strings would use less memory. As
discussed recently on this list, most if not all of the advantages of
UTF-32 over UTF-8 are mythical.

> > Also note that there's nothing "backwards" about using termination
> > instead of length+data. For example it's the natural way a string
> > would be represented in a pure (without special string type) lisp-like
> > language. (Of course using a list is still binary clean because the
> > terminator is in the cdr rather than the car.)
> 
> The parenthesized remark is crucial. Lisp lists use an out-of-band
> terminator, not in-band.

Indeed, the point was more about the O(n) thing not being a problem.

> > And like with lists, C
> > strings have the advantage that a terminal substring of the original
> > string is already a string in-place, without copying.
> 
> This is too small advantage to overcome the inability of storing NULs
> and the lack of O(1) length check (which rules out bounds checking on
> indexing), and it’s impractical with garbage collection anyway.

C strings are usually used for small strings for which O(n) is O(1)
because n is bounded by, say, 4096 (PATH_MAX). Whenever discussing
these issues, it's essential to be aware that a plain string, whether
NUL-terminated or pascal-style, is unsuitable for a large class of
uses including any data that will frequently be edited. This is
because insertion or deletion is O(n). A reasonable program working
with large textual datasets will keep small strings (maybe lines, or
maybe arbitrary chunks of a certain max size) in a list structure with
a higher level data structure indexing them.
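
As a sketch of the kind of structure meant here (names and the chunk
size are made up; a real editor would also keep an index over the
chunks rather than walking a plain list):

/* Sketch: keep editable text as a linked list of small chunks so an
 * insertion or deletion only rewrites one chunk instead of the whole
 * buffer. Offsets used for splitting must fall on UTF-8 character
 * boundaries, which the caller finds by skipping continuation bytes. */
#include <stdlib.h>
#include <string.h>

#define CHUNK_MAX 4096

struct chunk {
    size_t len;
    struct chunk *next;
    char text[CHUNK_MAX];      /* UTF-8 bytes, not NUL-terminated */
};

/* Split a chunk at byte offset 'at' so an edit can happen at a chunk
 * boundary; returns the new tail chunk. */
struct chunk *chunk_split(struct chunk *c, size_t at)
{
    struct chunk *tail = malloc(sizeof *tail);
    if (!tail) return NULL;
    tail->len = c->len - at;
    memcpy(tail->text, c->text + at, tail->len);
    tail->next = c->next;
    c->len = at;
    c->next = tail;
    return tail;
}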

Certainly very high level languages could do the same with ordinary
strings, providing efficient primitives for insertion, deletion,
splitting, etc. but for the majority of tiny strings the overhead may
be a net loss. I kinda prefer the Emacs Lisp approach of having
immutable string objects for small jobs and full-fledged emacs buffers
for heavyweight text processing.

At this point I'm not sure to what degree this thread is
off-/on-topic. If any list members are offended by its continuation,
please say so.

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support [BACK OFF-TOPIC]

2007-04-05 Thread Rich Felker
On Thu, Apr 05, 2007 at 12:54:54PM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> Dnia 05-04-2007, czw o godzinie 02:04 -0400, Rich Felker napisał(a):
> 
> > Just look how much that already happens anyway... the use
> > of : as a separator in PATH-type strings, the use of spaces to
> > separate command line arguments, the use of = to separate environment
> > variable names from values, etc..
> 
> Do you propose to replace them with NULs? This would make no sense.

Of course not.

> A single environment variable can contain a whole PATH-type string.
> You can’t use NUL to delimit the whole string *and* its components
> at the same time. Different contexts require different delimiters
> if a string from one context is to be able to contain a sequence of
> another one.

My point is that the first level of in-band signalling is already
standardized, making for one less.

> > Having a character you know can't
> > occur in text (not just by arbitrary rules, but because it's actually
> > impossible for it to be passed in a C string) is nice because there's
> > at least one character you know is always safe to use for app-internal
> > in-band signalling.
> 
> Here you contradict yourself:

No. Inflammatory accusations like this are rather hasty and
inappropriate...

> > Notice also how GNU find/xargs use NUL to cleanly
> > separate filenames, relying on the fact that it could never occur
> > embedded in a filename.
> 
> because you show an example where NUL *is* used in text, and it’s used
> not internally but in communication between two programs.

That's not text. It's binary data containing a sequence of text
strings. The assumption that pipes==text is one of the most common
incorrect perceptions about unix, caused most likely by bad experience
with DOS pipes.

> > > The other languages handle all 256 byte values consistently.
> > 
> > Which ones?
> 
> All languages besides C, except toy interpreters written in C by some
> students.

False.

> > There are plenty of languages which can't handle control characters in
> > strings well at all, much less NUL.
> 
> I don’t know any such language.

sed, awk, bourne shell, 

> > Because C was there first and C is essentially the only standardized
> > language.
> 
> Nonsense.

Like I said if you want to debate this email me off-list. It's quite
true, but mostly unrelated to the practical issues being discussed
here.

> > When your applications run on top of a system build upon C
> > and POSIX you have to play by the C and POSIX rules.
> 
> Only during communication with the system.
> 
> The only influence of C on string representation in other languages
> is that it’s common to redundantly have NUL stored after the string
> *in addition* to storing the length explicitly, so in cases the string
> doesn’t contain NUL itself it’s possible to pass the string to a C
> function without copying its contents.

This is bad design that leads to the sort of bugs seen in Firefox. If
we were living back in the 8bit codepage days, it might make sense for
these languages to try to unify byte arrays and character strings, but
we're not. There's no practical reason a character string needs to
store the NUL character (it's already not binary-clean due to UTF-8)
and thus no reason to introduce this blatant incompatibility (which
almost always turns into bugs and vulnerabilities) with the underlying
system.

Also note that there's nothing "backwards" about using termination
instead of length+data. For example it's the natural way a string
would be represented in a pure (without special string type) lisp-like
language. (Of course using a list is still binary clean because the
terminator is in the cdr rather than the car.) And like with lists, C
strings have the advantage that a terminal substring of the original
string is already a string in-place, without copying.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support [BACK ON-TOPIC]

2007-04-04 Thread Rich Felker
On Wed, Apr 04, 2007 at 11:56:35PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> 
> >
> > Null termination is not the security problem. Broken languages that
> > DON'T use null-termination are the security problem, particularly
> > mixing them with C.
> 
> C is the language that handles one out of 256 possible byte values
> inconsistently (with respect to the other 255) (in C strings).

Having a standard designated byte that can be treated specially is
very useful in practice. If there weren't such a powerful force
establishing NUL as the one, we'd have all sorts of different
conventions. Just look how much that already happens anyway... the use
of : as a separator in PATH-type strings, the use of spaces to
separate command line arguments, the use of = to separate environment
variable names from values, etc.. Having a character you know can't
occur in text (not just by arbitrary rules, but because it's actually
impossible for it to be passed in a C string) is nice because there's
at least one character you know is always safe to use for app-internal
in-band signalling. Notice also how GNU find/xargs use NUL to cleanly
separate filenames, relying on the fact that it could never occur
embedded in a filename.

You can ask what would have happened if C had used pascal-style
strings. I suspect we would have been forced to deal with ridiculously
small length limits, controversial ABI changes to correct for it, etc.
Certainly for many types of applications it's beneficial to use smarter
data structures for text internally (more complex even than just
pascal style strings), but I think C made a very good choice in using
the simplest possible representation for communicating reasonable-size
strings between the application, the system, and all the various
libraries that have followed the convention.

> The other languages handle all 256 byte values consistently.

Which ones? Now I think you're being hypocritical. One moment you're
applauding treating text as a sequence of Unicode codepoints in a way
that's not binary-clean for files containing invalid sequences, and
then you're complaining about C strings not being binary-clean because
NUL is a terminator. NUL is not text. Arguably other control
characters aside from newline (and perhaps tab) are not text either.
If you want to talk about binary data instead of text, then C isn't
doing anything inconsistent. The functions for dealing with binary
data (memcpy/memmove/memcmp/etc.) don't treat NUL specially of course.

There are plenty of languages which can't handle control characters in
strings well at all, much less NUL. I suspect most of the ones that
handle NUL the way you'd like them to also clobber invalid sequences
due to using UTF-16 internally.

> Why isn't it C that is a bit broken (that has irregular limitation)?  

Because C was there first and C is essentially the only standardized
language. When your applications run on top of a system built upon C
and POSIX, you have to play by the C and POSIX rules. Ignoring this
necessity is what got Firefox burned.

Rich

P.S. If you really want to debate what I said about C being the only
standardized language/the authority/whatever, let's take it off-list
because we've gotten way off-topic from utf-8 handling already. I have
reasons for what I say, but I really don't want to burden this list
with more off-topic sub-thread spinoffs.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-31 Thread Rich Felker
On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> > Again, software which does not handle corner cases correctly is crap.
> 
> Why are you confusing "special-case" with "corner case"?
> 
> I never said that software shouldn't handle corner cases such as illegal
> UTF-8 sequences.
> 
> I meant that an editor that handles illegal UTF-8 sequences other than
> by simply rejecting the edit request is a bit if a special case compared
> to general-purpose software, say a XML processor, for which some 
> specification requires (or recommends?) that the processor ignore or 
> reject any illegal sequences.  The software isn't failing to handle the 
> corner case; it is handling it--by explicitly rejecting it.

It is a corner case! Imagine a situation like this:

1. I open a file in my text editor for editing, unaware that it
contains invalid sequences.

2. The editor either silently clobbers them, or presents some sort of
warning (which, as a newbie, I will skip past as quickly as I can) and
then clobbers them.

3. I save the file, and suddenly I’ve irreversibly destroyed huge
amounts of data.

It’s simply not acceptable for opening a file and resaving it to not
yield exactly the same, byte-for-byte identical file, because it can
lead either to horrible data corruption or inability to edit when your
file has somehow gotten malformed data into it. If your editor
corrupts files like this, it’s broken and I would never even consider
using it.

As an example of broken behavior (but different from what you’re
talking about since it’s not UTF-8), XEmacs converts all characters to
its own nasty mule encoding when it loads the file. It proceeds to
clobber all Unicode characters which don’t also exist in legacy mule
character sets, and upon saving, the file is horribly destroyed. Yes
this situation is different, but the only difference is that UTF-8 is
a proper standard and mule is a horrible hack. The clobbering is just
as wrong either way.

(I’m hoping that XEmacs developers will fix this someday soon since I
otherwise love XEmacs, but this is pretty much a show-stopper since it
clobbers characters I actually use..)

> What I meant (given the quoted part below you replied before) was that 
> if you're dealing with a file that overall isn't valid UTF-8, how would 
> you know whether a particular part that looks like valid UTF-8, 
> representing some characters per the UTF-8 interpretation, really 
> represents those characters or is an erroneously mixed-in representation 
> of other characters in some other encoding?
> 
> Since you're talking about preserving what's there as opposed to doing
> anything more than that, I would guess you answer is that it really
> doesn't matter.  (Whether you treater 0xCF 0xBF as a correct the UTF-8 
> sequence and displayed the character U+03FF or, hypothetically, treated 
> it as an incorrectly-inserted Latin-1 encoding of U+00DF U+00BF and 
> displayed those characters, you'd still write the same bytes back out.) 

Yes, that’s exactly my answer. You might as well show it as the
character in case it really was supposed to be the character. Now it
sounds like we at least understand what one another are saying.

> > > example, if at one point you see the UTF-8-illegal byte sequence
> > > 0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
> > 
> > This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
> 
> You said you're talking about a text editor, that reads bytes, displays 
> legal UTF-8 sequences as the characters they represent in UTF-8, doesn't
> reject other UTF-8-illegal bytes, and does something with those bytes.
> 
> What does it do with such a byte?  It seems you were taking about 
> mapping it to some character to display it.  Are you talking about 
> something else, such as displaying the hex value of the byte?

Yes. Actually GNU Emacs displays octal instead of hex, but it’s the
same idea. The pager “less” displays hex, such as <BF>, in reverse
video, and shows legal sequences that make up illegal or unprintable
codepoints in the form <U+hhhh> (also reverse video).

> Yes someone did--they wrote about rejecting spam mail by detecting
> bytes/octets with the high bit set.

Oh that was me. I misunderstood what you meant, sorry.

> > If you’re going to do this, at least map into the PUA rather than to
> > Latin-1. At least that way it’s clear what the meaning is.
> 
> That makes it a bit less convenient, since then the numeric values of 
> the characters don't match the numeric values of the bytes.
> 
> But yes, doing all that is not something you'd want to escape into the
> wild (be seen outside the immediate code whether you need to fake
> byte-level regular expressions in Java).

*nod*

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-31 Thread Rich Felker
On Sat, Mar 31, 2007 at 06:36:06PM -0400, Daniel B. wrote:
>  wrote:
>  
> 
> > > > The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
> > > > instill faith...
> > >
> > > Maybe you should think more clearly.  I didn't write my mailer, so the
> > > quality of its behavior doesn't reflect my knowledge.
> > 
> > it does reflect your lack of interesting in getting your email utf-8 
> > compatible.
> 
> How the hell do you think you know what it reflects?  (Have you ever
> considered it might have something to do with bookmark management?)

Just because you insist on using an ancient, horribly broken,
proprietary web browser to manage your bookmarks doesn't mean you have
to use it for email too... especially when it breaks email so badly.
In any case it reflects priorities I think, and also indicates that
you're using backwards software, which goes along with discussing
the UTF-8 issue as if we were living in 1997 instead of 2007.

All of this is stuff you're entitled to do if you like, and it's not
really my business to tell you what you should be using. But it
does reframe the discussion.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-31 Thread Rich Felker
On Sat, Mar 31, 2007 at 06:56:05PM -0400, Daniel B. wrote:
> > > > Normally, you should not have to ever convert strings between
> > > > encodings.
> > >
> > > Then how do you process, say, a multi-part MIME body that has parts
> > > in different character encodings?
> > 
> > Excellent example. Email is absolutely something that you can work
> > with on a byte-by-byte basis and have no need for considering
> > characters. 
> 
> What operations are you excluding when you say "work with?"  You're
> being quite non-specific.  Maybe that's part of the cause of our
> arguing.

Indeed, that would be good to clarify.

> Certainly searching for a given character string across multiple
> MIME parts requires handling different encodings for different parts.

Not if it was all converted at load-time.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Perl Unicode support

2007-03-30 Thread Rich Felker
On Fri, Mar 30, 2007 at 07:06:52PM +0200, Egmont Koblinger wrote:
> On Fri, Mar 30, 2007 at 11:46:12AM -0400, Rich Felker wrote:
> 
> > What does “supports the encoding” mean? Applications cannot select the
> > locale they run in, aside from requesting the “C” or “POSIX” locale.
> 
> This isn't so. First of all, see the manual page setlocale(3), as well as

The documentation of setlocale is here:
http://www.opengroup.org/onlinepubs/009695399/functions/setlocale.html

As you’ll see, the only arguments with which you can portably call
setlocale are NULL, "", "C", "POSIX", and perhaps also a string
previously returned by setlocale.

I’m interested only in portable applications, not “GNU/Linux
applications”.

> the documentation of newlocale() and uselocale() and *_l() functions (no man
> page for them, use google). These will show you how to switch to arbitrary
> existing locale, no matter what your environment variables are.

These are nonstandard extensions and are a horrible mistake in design
direction. Having the character encoding even be selectable at runtime
is partly a mistake, and should be seen as a temporary measure during
the adoption of UTF-8 to allow legacy apps to continue working until
they can be fixed. In the future we should have much lighter, sleeker,
more maintainable systems without runtime-selectable character
encoding.

If you look into the GNU *_l() functions, the majority of them exist
primarily or only because of LC_CTYPE. The madness of having locally
bindable locale would not be so mad if these could all be thrown out,
and if only the ones that actually depend on cultural customs instead
of on character encoding could be kept.

However, I suspect even then it’s a mistake. Applications which just
need to present data to the user in a form that’s comfortable to the
user’s cultural expectations are fine with a single global locale.
Applications which need to deal with multinational cultural
expectations simultaneously probably need much stronger functionality
than the standard library provides anyway, and would do best to use
their own (possibly in library form) specialized machinery.

> Second, in order to perform charset conversion, you don't need locales at
> all, you only need the iconv_open(3) and iconv(3) library calls. Yes, glibc
> provides a function to convert between two arbitrary character sets, even if
> the locale in effect uses a third, different charset.

Yes, I’m well aware. This is not specific to glibc but part of the
standard. There is no standard on which character encodings should be
supported (which is a good thing, since eventually they can all be
dropped.. and even before then, non-CJK systems may wish to omit the
large tables for legacy CJK encodings), nor on the names for the
encodings (which is rather stupid; it would be very reasonable and
practical for SUS to mandate that, if an encoding is supported, it
must be supported under its standard preferred MIME name). The
standard also does not necessarily guarantee a direct conversion from
A to C, even if conversions from A to B and B to C exist.

> file to contain them in only one language, the one you want to see. Hence
> this program outputs plenty of configuration file, one for each window
> manager and each language (icewm.en, icewm.hu, windowmaker.en,
> windowmaker.hu and so on).

It would be nice if these apps would use some sort of message catalogs
for their menus, and if they would perform the sorting themselves at
runtime.

> Just in case you're interested, here's the source:
> ftp://ftp.uhulinux.hu/sources/uhu-menu/

You could use setlocale instead of the *_l() stuff so it would be
portable to non-glibc. For a normal user application I would say this
is an abuse of locales to begin with and that it should use its own
collation data tables, but what you’re doing seems reasonable for a
system-specific maintenance script. The code looks nice. Clean use of
plain C without huge bloated frameworks.

> > How would you deal with multiple browser windows or tabs, or even frames?
> 
> I can't see any problem here. Can you? Browsers work correctly, don't they?
> You ask me how I'd implement a feature that _is_ implemented in basically
> any browser. I guess your browser handles frames and tabs with different
> charset correctly, doesn't it? Even if you run it with an 8-bit locale.

I meant you run into trouble if you were going to change locale for
each page. Obviously it works if you don’t use the locale system.

> > Normal implementations work either by converting all data to the
> > user’s encoding, or by converting it all to some representation of
> > Unicode (UTF-8 or UTF-32, or something nonstandard like UTF-21).
> 
> Normal implementations work the 2nd way, that is, use a Unicode-compatible
> internal encoding.

Links w

Re: Perl Unicode support

2007-03-30 Thread Rich Felker
On Fri, Mar 30, 2007 at 06:44:49PM +0200, Egmont Koblinger wrote:
> On Fri, Mar 30, 2007 at 05:17:32PM +0200, Fredrik Jervfors wrote:
> 
> > If Y's computer supports the encoding X used [...]
> 
> Yes, I assumed in my examples that both computers support both encodings.
> Glibc supports all well-known 8-bit character sets since 2.1 (released in
> 1999), Unicode and its transcripts since 2.2 (2000). Fonts are also
> installed on any sane system.

You mean the iconv in glibc?

> > I think clipboards treat the data as bytes,
> 
> Try copy-pasting from a latin1 application to an utf8 app or vice versa and
> you'll see that luckily it's not the case. You'll get the same letters (i.e.
> different byte sequences) in the two apps.

But it doesn’t work the other way around. I’ve tried pasting from an
app respecting locale (UTF-8) into rxvt (with its head stuck in the
Latin-1 sand, no not urxvt) and the bytes of the UTF-8 get interpreted
as Latin-1 characters. :)

It should work, but Latin-1-oriented apps are usually dumb enough that
it doesn’t...

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Perl Unicode support

2007-03-30 Thread Rich Felker
On Fri, Mar 30, 2007 at 05:17:32PM +0200, Fredrik Jervfors wrote:
> > I say that his browser mush show è correctly, it doesn't matter what its
> > locale is.
> 
> That depends on the configuration of the browser.
> 
> The browser should by default (programmer's choice really) think in the
> encoding X used, since it's tagged with that encoding information.
> 
> If Y's computer supports the encoding X used (it doesn't have to be Y's
> preferred encoding), the browser should use X's encoding when showing Y

What does “supports the encoding” mean? Applications cannot select the
locale they run in, aside from requesting the “C” or “POSIX” locale.
It’s the decision of the user and/or the system implementor. In fact
it would be impossible to switch locales when visiting different pages
anyway. How would you deal with multiple browser windows or tabs, or
even frames?

> If Y's computer doesn't support the encoding X used, the browser should,
> as a fallback solution, try to convert the page to Y's encoding if
> possible.

This is why I’m confused about what you mean by “support the
encoding”. The app cannot switch it’s native encoding (the locale), so
supporting the encoding would have to mean supporting it as an option
for conversion... But then, if the system doesn’t “support” it in this
sense, how would you go about converting?

Normal implementations work either by converting all data to the
user’s encoding, or by converting it all to some representation of
Unicode (UTF-8 or UTF-32, or something nonstandard like UTF-21).

> I think clipboards treat the data as bytes, so if Y wants to copy from X's
> page and paste it into program P, Y has to make sure that the browser
> converts the data to Y's preferred encoding before copying, since P's
> input validation would (should) complain otherwise (when pasting).

X selection thinks in ASCII or UTF-8. Technically the ASCII mode can
also be used for Latin-1, but IMO it’s a bad idea to continue to
support this since it’s obviously a broken interface. There’s also a
nasty scheme based on ISO-2022 which should be avoided at all costs.
So, in order to communicate cleanly via the X selection, X apps need
to be able to convert their data to and from UTF-8.

In a way I think this is bad, because it makes things difficult for
apps, but the motivation seems to be at least somewhat correct.
There’s no reason to expect that other X clients are even running on
the same machine, and the machines they’re running on might use
different encodings, so a universal encoding is needed for
interchange. It would be nice if xlib provided an API to convert the
data to and from the locale’s encoding automatically upon sending and
receiving it, however. (This could be a no-op on UTF-8-only systems.)

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-30 Thread Rich Felker
On Fri, Mar 30, 2007 at 03:58:21PM +0200, Jan Willem Stumpel wrote:
> Marcin 'Qrczak' Kowalczyk wrote:
> 
> > There is still some software I have installed here which
> > doesn’t work with UTF-8. I switched from ekg to gaim and from
> > a2ps to paps because of this. UTF-8 support in some quite
> > popular programs still relies on unofficial patches: mc, pine,
> > fmt. There is still work to do.
> 
> Yes.. for instance texmacs and maxima. And a2ps -- doomed to be
> replaced by paps. But these examples are becoming rarer and rarer.
> 
> mc, for instance, is quite alright nowadays (well, in Debian it is).
> 
> Of course your point is quite correct. Until even a few years ago,
> UTF-8 was only practicable for hardy pioneers. But it is different
> now.

I agree. It’s amazing how much software I still fight with not
supporting UTF-8 correctly. Even bash/readline is broken in the
presence of nonspacing characters and long lines..

My point was that, had the mistake of introducing ISO-8859 support not
been made (i.e. if bytes 128-255 had remained considered as
“unprintable” at the time), there would have been both much more
incentive to get UTF-8 working quickly, and much less of an obstacle
(the tendency of applications to treat these bytes as textual
characters).

Obviously there were plenty of people who wanted internationalization
even back in 1996 and earlier. I’m just saying they should have done
it correctly in a way that supports multilingualization rather than
taking the provincial path of ‘codepages’ some 5 years after
UCS/Unicode had obsoleted them.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-30 Thread Rich Felker
On Fri, Mar 30, 2007 at 01:30:58PM +0200, Egmont Koblinger wrote:
> On Fri, Mar 30, 2007 at 05:07:55PM +0600, Christopher Fynn wrote:
> 
> Hi,
> 
> > IMO these days all browsers should come with their default encoding set 
> > to UTF-8
> 
> What do you mean by a browser's default encoding? Is it the encoding to be
> assumed for pages lacking charset specification? In this case iso-8859-1 is
> a much better choise -- there are far more pages out there in the wild
> encoded in latin1 that lack charset info than utf8 pages that lack this
> info. (Maybe an utf8 auto-detection would be nice, though.) So my argument
> for iso-8859-1 is not theoretical but practical.

Chris's argument (also mine) is practical too: It intentionally breaks
pages which are missing an explicit character set specification, so
that people making this broken stuff will have to fix it or deal with
lost visitors/lost sales/etc. :)

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support [off]

2007-03-30 Thread Rich Felker
On Fri, Mar 30, 2007 at 11:56:56AM +0200, Egmont Koblinger wrote:
> On Thu, Mar 29, 2007 at 04:46:14PM -0400, Rich Felker wrote:
> 
> > I am a mathematician
> 
> I nearly became a mathematican, too. Just a few weeks before I had to choose
> university I changed my mind and went to study informatics.
> 
> When I was younger, I had a philosophy closer to yours. Programming in

I’m not sure if this is a cheap ad hominem ;) or just an honest
storytelling..

> (more or less). Users don't care about implementation details, and actually
> they shouldn't need to care. They just care whether things work. They're not

Users should be presented with something that’s possible for ordinary
people to understand and which has reasonable explanations. Otherwise
the computer is a disempowering black box that requires them to look
to “experts” whenever something doesn’t make sense.

Here’s an interesting article that’s somehow related (though I don’t
necessarily claim it supports either of our view and don’t care to
argue over whether it does):

http://osnews.com/story.php?news_id=6282

> There's absolutely no way to explain any user that his browser isn't able to
> display some letters unless he quits it and sets a different locale, but

1. Sure there is. Simply telling the user he/she is working in an
environment that doesn’t support the character is clear and does make
sense. I’ve explained this sort of thing countless times doing user
help on IRC.

It’s much more difficult to explain to the user why they can see these
characters in their web browser but can’t paste them into a text file,
because it’s INCONSISTENT and DOESN’T MAKE SENSE. The only option
you’re left with is the Microsoft one: telling users that clean
applications which respect the standards are somehow “backwards”,
while hiding from them the fact that the standards provide a much
saner path to internationalization than hard-coding all sorts of
unicode stuff into each application.

2. You don’t have to explain anything. This is 2007 and the user’s
locale uses UTF-8. Period. Unless this is some oldschooler who already
knows the reasons and insists on using a legacy encoding anyway.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support [BACK ON-TOPIC]

2007-03-29 Thread Rich Felker
On Thu, Mar 29, 2007 at 11:53:01AM -0700, Larry Wall wrote:
> : I think a regex engine should, for example, match one binary byte to a
> : "." the same way it would match a valid sequence of unicode characters
> : and composing characters as a singe grapheme. This is a best effort to
> : work with the string as provided, and someone who does not want such
> : behavior would not run regex's over such strings.
> 
> How can it possibly know whether to match a binary byte or a grapheme
> if you've mixed UTF-8 and binary in the same string?

I agree that SrinTuar’s idea of matching . to a byte is insane. While
NFA/DFA is sometimes a nice tool even with binary data, using regex
character syntax for it is maybe a bit dubious. And surely, like you
said, they should not be mixed in the same string.

With that in mind, though, I think your emphasis on graphemes is also
a bit misplaced. The idea of a “grapheme” as the fundamental unit of
editing, instead of a character, is pretty much only appropriate when
writing Latin, Greek, and Cyrillic based languages with NFD. In most
Indian scripts, whole syllables get counted as “graphemes” for visual
presentation, yet users still expect to be able to edit, search, etc.
individual characters.

Even if you’re just considering a “grapheme” to be a base character
followed by a sequence of combining marks (Mn/Me/Cf), it’s
inappropriate for Tibetan where letters stack vertically (via
combining forms of class Mn) and yet each is considered a letter for
the purposes of editing, character counting, etc. A similar situation
applies for Hangul Jamo.

IMO, a regex pattern to match whole graphemes could be useful, but I
suspect character matching is almost always what’s wanted except for
NFD with European scripts.

> it might be.  And null termination has turned out to be a terrible
> workaround (in security terms as well as efficiency) for not knowing

Null termination is not the security problem. Broken languages that
DON'T use null-termination are the security problem, particularly
mixing them with C.

> the length.  C's head-in-the-sand approach to string processing is
> directly responsible for many of the security breaks on the net.

No, the incompetence of people writing C code is what’s directly
responsible for them. C’s approach might be indirectly responsible,
for being difficult or something, but certainly not directly. There
are examples of real-world C programs which are absolutely secure,
such as vsftpd.

> It's just my gut-level feeling that traditional world of C, Unix,
> locales, etc. simply does not provide appropriate abstractions to deal
> with internationalization.  Yes, you can get there if you throw enough
> libraries and random functions and macros and pipes and filters at it,
> but the basic abstractions leak like a seive.  It's time to clean it
> all up.

Mutt works right without any of that.. It’s as close as you’ll find to
the pinnacle of correct C application coding.

> I don't think it's Perl 6's place to force either utf-8 or utf-16 or
> utf-whatever on anyone.  If the abstractions are sane and properly
> encapsulated, the implementors can do whatever makes sense behind
> the scenes, and that very likely means different things in different
> contexts.

But the corner-case of handling “text” data with malformed sequences
in it will be very difficult and painful, no? With C and byte strings
it’s very easy..

> I try hard not to be a linguistic imperialist (when I try at all).  :-)

☺ ☻ ☺ ☻(happy multiracial smileys)

> Anyway, if anyone wants to give me specific feedback on the current
> design of Perl 6, that'd be cool.  Though perl6-language@perl.org would
> probably be a better forum for that.

The only feedback I’d like to give is ask that if the nasty warning
messages are kept, they should be applied to characters in the range
128-255 as well, not just characters >255.

Also.. is there a clean way to deal with the issue (aside from just
disabling warnings) on a perl build without PerlIO (and thus no
working binmode)?

Finally, I must admit I’m not at all a Perl fan, so maybe take what I
say with a grain of salt. I just wish Perl scripts I obtain from
others would work more comfortably without making me have to think
about the nonstandard (compared to the rest of a unix system)
treatment they’re giving to character encoding.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-29 Thread Rich Felker
On Thu, Mar 29, 2007 at 07:15:37PM +0200, Egmont Koblinger wrote:
> > or failing that ask the programmer to explicitly qualify them as one of
> > its supported encodings. I do not think the strings should have built in
> > machinery that does this work behind the scenes implicitly.
> 
> If you have the freedom of choosing the character set you use, you need to

You don’t. An application should assume that there is no such freedom;
the character encoding is dictated by the user or the host
implementation, and should on all modern systems be UTF-8 (but don’t
assume this).

Any text that’s encoded with another scheme needs to be treated as
non-text (binary) data (i.e. not suitable for use with regex). It
could be converted (to the dictated encoding) or left as binary data
depending on the application.

> tell the regexp matching function what charset you use. (It's a reasonable
> decision that the default is the charset of the current locale, but it has
> to be overridable.) There are basically two ways I think to reach this goal.

You can get by just fine without it being overridable. For instance,
mutt does just fine using the POSIX regex routines which do not have
any way of specifying a character encoding.
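
As a rough illustration of that point (a sketch added here, not code
from mutt): the POSIX regex routines take the encoding from LC_CTYPE,
so the same pattern matches multibyte characters with no encoding
parameter anywhere, assuming the program runs under a UTF-8 locale.

  /* Sketch: "A.B" matching the UTF-8 bytes 65 195 129 66 ("AÁB"),
   * assuming the process runs under a UTF-8 locale. */
  #include <locale.h>
  #include <regex.h>
  #include <stdio.h>

  int main(void)
  {
      setlocale(LC_CTYPE, "");              /* e.g. en_US.UTF-8 */
      regex_t re;
      if (regcomp(&re, "A.B", REG_EXTENDED | REG_NOSUB)) return 1;
      const char *s = "A\xc3\x81" "B";      /* "AÁB" encoded in UTF-8 */
      puts(regexec(&re, s, 0, NULL, 0) == 0 ? "match" : "no match");
      regfree(&re);
      return 0;
  }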

> 1st: strings are just byte sequences, and you may pass the charset
> information as external data.
> 
> 2nd: strings are either forced to a fixed encoding (UTF-8 in Gtk+, UCS-16 in
> Java) or carry meta-information about their encoding (utf8 flag in Perl).

Of these (neither of which is necessary), #1 is the more unix-like and
#2 is the mac/windows approach. Unix has a strong history of
intentionally NOT assigning types to data files etc., but instead
treating everything as streams of bytes. This leads to very powerful
combinations of tools where the same (byte sequence) of data is
interpreted in different ways by different tools/contexts. I am a
mathematician and I must say it’s comparable to what we do when we
allow ourselves to think of an operator on a linear space both as a
map between linear spaces and as an element of a larger linear space
of operators (and possibly also in many other ways) at the same time.

On the other hand, DOS/Windows/Mac have a strong history of assigning
fixed types to data files. On DOS/Windows it’s mostly just extensions,
but Mac goes much farther with the ‘resource fork’, not only typing
the file but also associating it with a creating application. This
sort of mechanism is, in my opinion, deceptively convenient to
ignorant new users, but also fosters an unsophisticated, uneducated,
less powerful way of thinking about data.

Of course in either case there are ways to override things and get
around the limitations. Even on unix files tend to have suffixes to
identify the ‘type’ a user will most often want to consider the file
as, and likewise on Mac you can edit the resource forks or ignore
them. Still, I think the approach you take says a lot about your
system philosophy.

> Using the 1st approach I still can't see how you'd imagine Perl to work.
> Let's go back to my earlier example. Suppose perl read's a file's content
> into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then
> you do this:
> 
>   print "Hooray\n" if $filecontents =~ m/A.B/;
> 
> Should it print Hooray or not if you run this program under an UTF-8 locale?

Of course.

> On one hand, when running with a Latin1 locale it didn't print it. So it
> mustn't print Hooray otherwise you brake backwards compatibility.

No, the program still does the same thing if run in a Latin-1 locale,
regardless of your perl version. There’s no reason to believe that
text processing code should behave byte-identically under different
locales.

> On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
> we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.

So you need to make your data and your locale consistent. If you want
to set the locale to UTF-8, the string “AÁB” needs to be in UTF-8. If
you want to use the legacy Latin-1 data, your locale needs to be set
to something Latin-1-based.

> How would you design Perl's Unicode support to overcome this contradiction?

I don’t see it as any contradiction. The code does exactly what it’s
supposed to in either case, as long as your locale and data are
consistent.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-29 Thread Rich Felker
On Thu, Mar 29, 2007 at 07:43:54PM +0200, Egmont Koblinger wrote:
> On Thu, Mar 29, 2007 at 01:05:57PM -0400, Rich Felker wrote:
> 
> > Gtk+-2’s approach is horribly incorrect and broken. By default it
> > writes UTF-8 filenames into the filesystem even if UTF-8 is not the
> > user’s encoding. 
> 
> There's an environment variable that tells Gtk+-2 to use legacy encoding in
> filenames. Whether or not forcing UTF-8 on filenames is a good idea is
> really questionable, you're right.

Well the real solution is forcing UTF-8 in filenames by forcing
everyone who wants to use multilingual text to switch to UTF-8
locales.

> But I'm not just talking about filenames, there are many more strings
> handled inside Glib/Gtk+. Strings coming from gettext that will be displayed
> on the screen, error messages originating from libc's strerror, strings
> typed by the user into entry widgets and so on. Gtk+-2 uses UTF-8
> everywhere, and (except for the filenames) it's clearly a wise decision.

Not if it will also be reading/writing text to stdout or text-based
config files, etc..

> I think this is just plain wrong. Since when do you browse the net and read
> acccented pages? Since when do you use UTF-8 locale?

Using accented characters in your own language has always been
possible with legacy codepage locales, and is still possible with what
I consider the correct implementation. The only thing that's not
possible in legacy codepage locales is handling text from other
languages that need characters not present in your codepage.

> I used Linux with a Latin-2 locale since 1996. It's been around 2003 that I
> began using UTF-8 sometimes and it was last year that I finally managed to
> switch fully to UTF-8. There are still several applications that are
> nightmare with UTF-8 (midnight commander for example). A few years ago
> software were even much worse, many of them were not ready for UTF-8, it
> would have been nearly impossible to switch to UTF-8.

But now we’re living in 2007, not 2003 or 1996. Maybe your approaches
had some merit then, but that’s no reason to continue to use them now.
At this point anyone who wants multilingual text support should be
using UTF-8 natively, and if they have a good reason they’re not (e.g.
a particular piece of broken software) that software should be quickly
fixed.

> When did you switch to
> unicode? Probably a few years earlier than I did, but I bet you also had
> those old-fashioned 8-bit days...

I’ve always used UTF-8 since I started with Linux; until recently it
was just restricted to the first 128 characters of Unicode, though. :)
I never used 8bit codepages except to draw stuff on DOS way back.

> So, I have used Linux for 10 years with an 8-bit locale set up. Still I
> could visit French, Japanese etc. pages and the letters appeared correctly.

UTF-8 has been around for almost 15 years now, longer than any real
character-aware 8bit locale support on Linux. It was a mistake that
8bit locales were ever implemented on Linux. If things had been done
right from the beginning we wouldn't even be having this discussion.

I’m sure you did have legitimate reasons to use Latin-2 when you did,
namely broken software without proper support for UTF-8. Here’s where
we have to agree to disagree I think: you’re in favor of workarounds
which get quick results while increasing the long-term maintenance
cost and corner-case usability, while I’m in favor of omitting
functionality (even very desirable functions) until someone does it
right, with the goal of increasing the incentive for someone to do it
right.

> Believe me, I would have switched to Windows or whatever if Linux browsers
> weren't be able to perform this pretty simple job.

Your loss, not mine.

> It's not about workarounds or non-issues. If a remote server tells my
> browser to display a kanji then my browser _must_ display a kanji, even if

Nonsense. If you don’t have kanji fonts installed then it can’t
display kanji anyway. Not having a compatible encoding is a comparable
obstacle to not having fonts. I see no reason that a system without
support for _doing_ anything with Japanese text should be able to
display it. What happens if you copy and paste it from your browser
into a terminal or text editor???

Even the Unicode standards talk about “supported subset” and give
official blessing to displaying characters outside the supported
subset as a ? or replacement glyph or whatever.

> > > Show me your code that you think "just works" and I'll show you where 
> > > you're
> > > wrong. :-)
> > 
> > Mutt is an excellent example.
> 
> As you might see from the header of my messages, I'm using Mutt too. In this
> regard mutt is a nice piece of software that handles accented characters
> correctly (n

Re: perl unicode support

2007-03-29 Thread Rich Felker
On Thu, Mar 29, 2007 at 12:24:43PM +0200, Egmont Koblinger wrote:
> On Wed, Mar 28, 2007 at 05:57:35PM -0400, SrinTuar wrote:
> 
> > The regex library can ask the locale what encoding things are in, just
> > like everybody else
> 
> The locale tells you which encoding your system uses _by default_. This is
> not necessarily the same as the data you're currently working with.

The word “default” does not appear in any standard regarding LC_CTYPE.
It determines THE encoding of text. Foreign character data from other
systems obviously cannot be treated directly as text under this view.

> write a console mp3 id3v2 editor if you completely ignored the console's
> charset

The console charset uses text and text is encoded according to
LC_CTYPE. The tags are encoded according to the encoding specified by
the file and may be converted via iconv or similar library calls.

> or the charset used within the id3v2 tags? How would you write a
> database frontend if you completely ignored the local charset as well as the
> charset used in the database? (Someone inserts some data, someone else
> queries it and receives different letters...)

The same problem exists on the filesystem. The solution locally is to
mandate a policy of a single encoding for all users sharing data. For
remote protocols, the protocol usually specifies an encoding by which
the data is delivered, so again you convert according to iconv or
similar.

Nowhere have SrinTuar nor myself said that encoding is always
something you can ignore. My point is that consideration of it can be
fully isolated to the point at which badly-encoded data is received
(from text embedded in a binary file, from http, from mime mail, etc.)
such that the other 99% of your software never has to think about it.

> > >There _are_ many character sets out there, and it's _your_ job, the
> > >programmer's job to tell the compiler/interpreter how to handle your bytes
> > >and to hide all these charset issues from the users. Therefore you have to
> > >be aware of the technical issues and have to be able to handle them.
> > 
> > If that was true then the vast majority of programs would not be i18n'd..
> 
> That's false. Check for example the bind_textdomain_codeset call. In Gtk+-2
> apps you call it with an UTF-8 argument. This happens because you _know_
> that you'll need this data encoded in UTF-8.

Then what do you do when you want to print text to stdout, or generate
filenames, etc.? You can’t use your localized text anymore because the
encoding may not match. This is evidence that gtk’s approach is
flawed.

> > I wish perl would let me do that- it works so well in C.
> 
> I already wrote twice. Just in case you haven't seen it, I write it for the
> third time. Perl _lets_ you think/work in bytes. Just ignore everything
> related to UTF-8. Just never set the utf8 mode. You'll be back at the world
> of bytes. It's so simple!

I don’t know about SrinTuar but this is not what I meant at all. I
want (NEED!) regex to work correctly, etc. Thus Perl needs to respect
the character encoding, which thankfully matches the host encoding,
UTF-8. No problem so far. However, as soon as I try to send these Perl
character strings (which are equally valid as host character strings)
to stdout, it spews warnings, and does so in an inconsistent way!
(i.e. it complains about characters above 255 but not characters
128-255)

> > Their internal utf-16 mandate was a mistake, imo.
> 
> That was not utf-16 but ucs-2 at that time and imo those days it was a
> perfectly reasonable decision.

It was not. UCS-2 was already obsolete at the time Java was released
to the public in 1995. UTF-8 was invented in September 1992.

> > (and the locale should always say utf-8)
> 
> Should, but doesn't. It's your choice to decide whether you want your
> application to work everywhere, or only under utf-8 locales.

Having limited functionality (displaying ??? for all characters not
available in the locale) under broken legacy locales is perfectly
acceptable behavior. If someone wants to use/display/write a
character, they need to use a character encoding where that character
is encoded!!!

> I admit that in an ideal world everything would be encoded in UTF-8. Just
> don't forget: our world is not ideal. My browser has to display web pages
> encoded in Windows-1250 correctly. My e-mail client has to display messages
> encoded in iso-8859-2 correctly. And so on...

As you can read above, none of this is contrary to what I said. My
system does all of this quite well.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-29 Thread Rich Felker
On Thu, Mar 29, 2007 at 12:01:28PM +0200, Egmont Koblinger wrote:
> On Wed, Mar 28, 2007 at 02:35:32PM -0400, Rich Felker wrote:
> 
> > > matches or not _does_ depend on the character set that you use. It's not
> > > perl's flaw that it couldn't decide, it's impossible to decide in theory
> > > unless you know the charset.
> > 
> > It is perl's flaw. The LC_CTYPE category of the locale determines the
> > charset. This is how all sane languages work.
> 
> LC_CTYPE determines the system charset. This is used when reading from /
> writing to a terminal, to/from text files by default; this is the charset
> you expect messages coming from glibc to be encoded in; etc...
> 
> But this is not necessarily the charset you want your application to work
> with. Think of Gtk+-2 for example, internally it always uses UTF-8, no
> matter what your locale is.

Gtk+-2’s approach is horribly incorrect and broken. By default it
writes UTF-8 filenames into the filesystem even if UTF-8 is not the
user’s encoding. 

> So it _has_ to tell every external regexp
> routine (if it uses any) to work with UTF-8, not with the charset implied by
> LC_CTYPE.

This is their fault for designing it wrong. If they correctly used the
requested encoding, there would be no problem.

> And you can think of any web browser, mail client and so on, they have to
> cope with the charset that particular web page or message uses, yet again
> independently from the system locale.

Not independently. All they have to do is convert it to the local
encoding. And yes I’m quite aware that a lot of information might be
lost in the process. That’s fine. If users want to be able to read
multilingual text, they NEED to migrate to a character encoding that
supports multilingual text. Trying to “work around” this [non-]issue
by mixing encodings and failing to respect LC_CTYPE is a huge hassle
for negative gain.

> > I don't have to be aware of it in any other language. It just works.
> 
> Show me your code that you think "just works" and I'll show you where you're
> wrong. :-)

Mutt is an excellent example.

> > Perl is being unnecessarily difficult here.
> 
> You forget one very important thing: Compatibility. In the old days Perl
> used 8-bit strings and there many people created many perl programs that
> handled 8-bit (most likely iso-8859-1) data. These programs must continue to
> work correctly with newer Perls. This implies that perl mustn't assume UTF-8
> charset for the data flows (even if your locale says so) since in this case
> it would produce different output.

Such programs could just as easily be run in a legacy locale, if
available on the system. But unless the data they’re processing
actually contains Latin-1 (in which case you’re in a Latin-1
environment!), there’s no reason that treating the strings as UTF-8
should cause any harm. ASCII is the same either way of course. The
only possible exception is if a perl program is using regex on true
binary data, which is a bit dubious to begin with.

> > Nonsense. As long as all the length variables are in the SAME unit,
> > your program has absolutely no reason to care whatsoever exactly what
> > that unit it. Any unit is just as good as long as it's consistent.
> 
> If you don't know what unit is used, then you're unable to answer questions
> whether that man is most likely healthy, whether he's extremely tall or
> extremely small.

Thresholds/formulae for what height is tall/small/healthy/whatever
just need to be written using whatever unit you’ve selected as the
global units.

> If you don't know what unit is used, how do you fill up your structures from
> external data source? What if you are supposed to store cm but the data
> arrives in inches? How would you know that you need to convert?

Same way it works with character encodings. The code importing
external data knows what format the internal data must be in. The
internal code has no knowledge or care what the unit/encoding is. This
keeps the internal code clean and simple.

> I guess you've heard several stories about million (billion?) dollar
> projects failing due to such stupid mistakes - one developer sending the
> data in centimeters, the other expecting them to arrive in inches.

Yes.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: orthographic imperialism

2007-03-28 Thread Rich Felker
On Thu, Mar 29, 2007 at 12:41:06AM -0400, William J Poser wrote:
> [EMAIL PROTECTED] has made several claims about writing systems
> for indigenous languages that I, as a linguist with a strong
> interest in writing systems and substantial experience working
> with indigenous people, not only as a linguist
> studying their languages but as a staff member of indigenous
> organizations, believe to be false.

I apologize if some of the claims I made were offensive to you as a
linguist, or to other linguists. I was more offended by David
Starner’s “Euro-centricism” as I called it, than by the activities of
linguists and orthographers in themselves. Thank you for speaking up
and for the detailed accounts and anecdotes.

One situation I particularly had in mind was the colonial times in
Africa. From my naive knowledge (mostly derived from bits and pieces
I’ve heard here and there, and mention of African languages in
documents on character set coverage and Unicode), it seems like
there’s a lot of Latin orthography for African languages. Can you fill
us in on any particular cases, and whether they were developed/imposed
by white colonists or developed alongside and embraced by Africans at
the time? As is quite apparent by now, I’m pretty ignorant on the
matter and interested in learning.

Another interesting example to look at is the use of Latin in writing
Indian (India, not Native American) languages. My understanding is
that now there are various scholarly standards for doing so, and
perhaps Indian government standards for how to write names of places,
etc., but my experience while in India was that spellings were
extremely inconsistent and based on naive “phonetic English” spellings
probably invented during the British occupation. The book you
mentioned would surely be an interesting read.

One example on which I’m not ignorant is systems for writing Tibetan
in Latin script. These days there are primarily three systems, none of
which seem to be used much by Tibetans except for language scholars.
One, the Wylie transliteration, is a direct systematic transliteration
of the Tibetan orthography. While it comes across very logical to me,
it’s difficult to read and pronounce without being accomplished in
both Tibetan orthography and the Wylie scheme, and I’ve met very few
Tibetans who find it natural at all. For purposes where preservation
of the original orthography is not important, members of the THDL
project (www.thdl.org) have proposed a standard which seems somewhat
reasonable, but which discards some phonetic data that’s meaningful to
Tibetans for the sake of being easy for Westerners. Finally, there’s a
Chinese-imposed system which they call “Tibetan Pinyin”, which is the
worst of all. It basically preserves only the parts of Tibetan which
fit into Chinese phonetics, resulting in horrible mispronunciation and
confusion about word identity unless you can ‘guess’ which Tibetan
word a “Tibetan Pinyin” word came from. In this latter case it’s a
clear instance of imperial (albeit not Western) imposition of a
Romanization, though thankfully without much success. Amusingly, I’ve
hardly ever met Tibetan people who use any of these systems, even
when writing in Latin script. My experience has been that most just
write according to whatever “English-like” phonetics come most easily.
:)

So, this is where some of my sentiment that linguist-designed Latin
orthographies don’t work so well comes from. Obviously it’s not
extensive data, just my own experience in a limited field.

Again, I apologize for “mis-stating the ideology of linguists” as you
quite nicely put it. I’d be happy if you have more information on
these subjects to share, both insomuch as it relates to m17n and i18n
and issues that developers should be aware of, and for its own sake.

Best,

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support [BACK ON-TOPIC]

2007-03-28 Thread Rich Felker
On Mon, Mar 26, 2007 at 05:28:43PM -0400, SrinTuar wrote:
> I frequenty run into problems with utf-8 in perl, and I was wondering
> if anyone else
> had encountered similar things.
[...]

Can we get back on-topic with this, and look for solutions to the
problems? Maybe Larry has some thoughts for us?

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-28 Thread Rich Felker
On Wed, Mar 28, 2007 at 10:46:01PM -0400, Daniel B. wrote:
> > For example, the unix "cut" program works automatically with UTF-8
> > text as long as the delimiter is a single byte, 
> 
> By "single byte," do you mean a character whose UTF-8 representation
> is a single byte?  (If you gave it the byte 0xBF, would it reject it
> as an invalid UTF-8 sequence, or would it then possibly cut in the middle
> of the byte sequence for a character (e.g., 0xEF 0xBF 0x00)?)

Apologies for omitting the word “character” after single byte. Yes, I
meant ASCII.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-28 Thread Rich Felker
On Wed, Mar 28, 2007 at 11:05:56PM -0400, Daniel B. wrote:
>  wrote:
> > 
> > 2007/3/28, Egmont Koblinger <[EMAIL PROTECTED]>:
> > 
> > > ...f you only handle _texts_ then
> > > probably the best approach is to convert every string as soon as they 
> > > arrive
> > > at your application to some Unicode representation (UTF-8 for Perl, 
> > > "String"
> > > (which uses UTF-16) for Java and so on)
> > 
> > Hrm, I think Java needs to be fixed. Their internal utf-16 mandate was
> > a mistake, imo.
> 
> Are you aware that Java was created (or frozen) when Unicode required
> 16 bits?  (It wasn't a mistake at the time.)

Java was introduced in May 1995. UTF-8 existed since September 1992.
There was never any excuse for UCS-2/UTF-16 existing at all.

Read Thompson & Pike’s UTF-8 paper for details.

〜Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-28 Thread Rich Felker
On Wed, Mar 28, 2007 at 10:39:49PM -0400, Daniel B. wrote:
> > > Well of course you need to think in bytes when you're interpreting the
> > > stream of bytes as a stream of characters, which includes checking for
> > > invalid UTF-8 sequences.
> > 
> > And what do you do if they're present? 
> 
> Of course, it depends where they are present.  You seem to be addressing
> relative special cases.

I’m addressing corner cases. Robust systems engineering is ALWAYS
about handling the corner cases. Any stupid codemonkey can write code
that does what’s expected when you throw the expected input at it. The
problem is that coding like this blows up and gives your attacker root
as soon as they throw something unexpected at it. :)

> > Under your philosophy, it would
> > be impossible for me to remove files with invalid sequences in their
> > names, since I could neither type the filename nor match it with glob
> > patterns (due to the filename causing an error at the byte to
> > character conversion phase before there’s even a change to match
> > anything). ...
> 
> If the file name contains illegal byte sequences, then either they’re
> not in UTF-8 to start with or, if they’re supposed to be, something
> else let invalid sequences through.

Several likely scenarios:

1. Attacker intentionally created invalid filenames. This might just
   be annoying vandalism but on the other hand might be trying to
   trick non-robust code into doing something bad (maybe throwing away
   or replacing the invalid sequences so that the name collides with
   another filename, or interpreting overlong UTF-8 sequences, etc.).

2. Foolish user copied filenames from a foreign system (e.g. scp or
   rsync) with a different encoding, without conversion.

3. User (yourself or other) extracted files from a tar or zip archive
   with names encoded in a foreign encoding, without using software
   that could detect and correct the situation.

> If they're supposed to be UTF-8 and aren't, then certainly normal
> tools shouldn't have to deal with malformed sequences.

This is nonsense. Regardless of what they’re supposed to be, someone
could intentionally or unintentionally create files whose names are
not valid UTF-8. While it would be a nice kernel feature to make such
filenames illegal, you have to consider foreign removable media (where
someone might have already created such bad names), and since POSIX
makes no guarantee that strings which are illegal sequences in the
character encoding are illegal as filenames, any robust and portable
code MUST account for the fact that they could exist. Thus
filenames, commandlines, etc. MUST always be handled as bytes or in a
way that preserves invalid sequences.
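
A minimal sketch of what that looks like in practice (illustrative
only; the helper name for_display() is made up): the byte string stays
intact for all filesystem operations, and decoding per LC_CTYPE happens
only at display time, with invalid bytes shown as '?'.

  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  /* Print a filename for human consumption; the byte string itself is
   * never altered, so the file can still be matched and removed. */
  static void for_display(const char *name)
  {
      mbstate_t st; memset(&st, 0, sizeof st);
      size_t n = strlen(name);
      while (n) {
          wchar_t wc;
          size_t r = mbrtowc(&wc, name, n, &st);
          if (r == (size_t)-1 || r == (size_t)-2) {
              putchar('?');                /* invalid/truncated byte: placeholder */
              memset(&st, 0, sizeof st);   /* resynchronize the decoder */
              name++; n--;
          } else if (r == 0) {
              break;                       /* embedded NUL: cannot occur in names */
          } else {
              printf("%lc", (wint_t)wc);
              name += r; n -= r;
          }
      }
  }

  int main(void)
  {
      setlocale(LC_CTYPE, "");
      for_display("ok-\xff\xfe-still-removable.txt");  /* invalid bytes survive */
      putchar('\n');
      return 0;
  }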

> If you write
> a special tool to fix malformed sequences somehow (e.g., delete files
> with malformed sequences), then of course you're going to be dealing
> with the byte level and not (just) the character level.

Why should I need a special tool to do this?? Something like:
rm *known_ascii_substring*
should work, as long as the filename contains a unique ascii (or valid
UTF-8) substring.

> > Other similar problem: I open a file in a text editor and it contains
> > illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
> 
> Again, you seem to be dealing with special cases.

Again, software which does not handle corner cases correctly is crap.

> If a UTF-8 decoder test file contains illegal byte UTF-8 sequences, why
> would you expect a UTF-8 text editor to work on it? 

I expect my text editor to be able to edit any file without corrupting
it. Perhaps you have lower expectations... If you’re used to Windows
Notepad, that would be natural, but I’m used to GNU Emacs.

> For the data that is parseable as a valid UTF-8 encoding of characters, 
> how do you propose to know whether it really is characters encoded as
> UTF-8 or is characters encoded some other way?   

It’s neither. It’s bytes, which when they are presented for editing,
are displayed as a character according to their interpretation as
UTF-8. :)

If I receive a foreign file in a legacy encoding and wish to interpret
it as characters in that encoding, then I’ll convert it to UTF-8 with
iconv (which deals with bytes) or using the C-x RET c prefix in Emacs to
visit the file with a particular encoding. What I absolutely do NOT
want is for a file to “magically” be interpreted as Latin-1 or some
other legacy codepage as soon as invalid sequences are detected. This
is clobbering the functionality of my system to edit its own native
data for the sake of accommodating foreign data.

I respect that others do want and regularly use such auto-detection
functionality, however.

> (If you see the byte sequence 0xDF 0xBF, how do you know whether that 
> means the character U+003FF

It never means U+03FF in any case because U+03FF is 0xCF 0xBF...

> or the two characters U+00DF U+00BF?  For

It never means this in text on my system because the text encoding is
UTF-8. It would mean

Re: perl unicode support

2007-03-28 Thread Rich Felker
On Wed, Mar 28, 2007 at 02:24:26PM -0500, David Starner wrote:
> On 3/27/07, Rich Felker <[EMAIL PROTECTED]> wrote:
> >On Tue, Mar 27, 2007 at 06:44:42PM -0500, David Starner wrote:
> >> On 3/27/07, Rich Felker <[EMAIL PROTECTED]> wrote:
> >This is one of the very few
> >places where a computer should ever perform case mappings: in a
> >powerful editor or word processor
> 
> Just about any program that deals with text is going to have a need to
> merge distinctions that the user considers irrelevant, which often
> includes case. I use grep -i, even when searching the output of my own
> programs sometimes. I could go back and check the case I used in the
> messages, but I'd rather let the tools do that.

This is not case mapping but equivalence classes. A completely
different issue. Matching equivalence classes (including case and
other equivalences) is trivial and mostly language-independent. Case
mapping is ugly (think German “SS/ß”) and language-dependent (think
Turkish “I/ı” and “İ/i”).

> >Same thing. North American civilization is all European-derived.
> 
> The civilization on North America, South America, Europe, Australia
> and Antartica is European-derived, but I find it horribly hard to
> dismiss something that's universal in five of the seven continents as
> "disgustingly euro-centric".

It’s not universal. It’s universal among the european-descended
colonizers. In many of these places there are plenty of indigenous
populations which do not use the colonizer’s script because it’s not
suitable for their language, because the latin phonetic systems are
designed for pompous linguists rather than based on the way people see
their own languages. Often there is a colonial language (English,
Spanish, French, etc.) alongside an indigenous language, and while the
latter may often be written in latin letters, the orthography is often
inconsistent and should be perceived as a “foreign” spelling system
rather than something native.

> >> In fact, I think you'd find that
> >> most of the world's languages are written in scripts that have a
> >> concept of case.
> >
> >This is a very dubious assertion. Technically it depends on how you
> >measure "most" (language count vs speaker count... also the whole
> >dialect vs language debate), but otherwise I think it's bogus.
> 
> The English meaning of "Most of the world's languages" is the number
> of languages. All of the languages spoken in North and South America,
> with the exception of Cherokee and some Canadian languages written in
> the UCAS, are written in Latin. All of the languages spoken in Africa,
> with the exception of a few languages written in Ethiopian and Arabic,
> are written in Latin.

Written by whom? European-descended scholars who imposed a Latin
alphabet for studying the language. Many of the speakers of many of
these languages don’t even write the language at all..

I maintain that you have a very euro-centric-imperialist view of the
world. It’s not to say that latin isn’t important or in widespread
use, but pretending like latin is the pinnacle of importance and like
frills for latin keep the world happy is something i find extremely
annoying.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-28 Thread Rich Felker
On Wed, Mar 28, 2007 at 07:49:57PM +0200, Egmont Koblinger wrote:
> matches or not _does_ depend on the character set that you use. It's not
> perl's flaw that it couldn't decide, it's impossible to decide in theory
> unless you know the charset.

It is perl's flaw. The LC_CTYPE category of the locale determines the
charset. This is how all sane languages work.

> > I don't care what the regex library does under the covers, and I
> > shouldnt have to care...
> 
> >From a user's point of view, it's a good expectation against any program:
> they should _just work_ without requiring any charset knowledge from me.
> 
> In an ideal world where no more than one character set (and one
> representation) is used, a developer could expect the same from any
> programming language or development environment. But our world is not ideal.
> There _are_ many character sets out there, and it's _your_ job, the
> programmer's job to tell the compiler/interpreter how to handle your bytes
> and to hide all these charset issues from the users. Therefore you have to
> be aware of the technical issues and have to be able to handle them.

I don't have to be aware of it in any other language. It just works.
Perl is being unnecessarily difficult here.

> Having a variable in your code that stores sequence of bytes, without you
> being able to tell what encoding is used there, is just like having a
> variable to store the height of people, without knowing whether it's
> measured in cm or meter or feet... The actions you may take are very limited
> (e.g. you can add two of these to calculate how large they'd be if one would
> stand on the top of the other (btw the answer would also lack the unit)),
> but there are plenty of things you cannot answer.

Nonsense. As long as all the length variables are in the SAME unit,
your program has absolutely no reason to care whatsoever exactly what
that unit is. Any unit is just as good as long as it's consistent. The
same goes for character sets. There is a well-defined native character
encoding, which should be UTF-8 on any modern system. When importing
data from foreign encodings, it should be converted. This is just the
same as if you stored all your lengths in a database. As long as
they're all consistent (e.g. all in meters) then you don't have to
grossly increase complexity and redundancy by storing a unit with each
value. Instead, you just convert foreign values when they're input,
and assume all local data is already in the correct form. The same
applies to character encoding.

> your application, and convert (if necessary) when you output them. If you
> must be able to handle arbitrary byte sequences, then (as Rich pointed out)
> you should keep the array of bytes but you might need to adjust a few
> functions that handle them, e.g. regexp matching might be a harder job in
> this case (e.g. what does a dot (any character) mean in this case?).

Regex matching is also easier with bytes, even if your bytes represent
multibyte characters. The implementation that converts to UTF-32 or
similar is larger, slower, klunkier, and more error-prone.

> > If it knows how to match "Á" to ".", then I dont have to know how it
> > goes about doing so.
> 
> Recently you asked why perl didn't just simply work with bytes. Now you talk
> about the "Á" letter. But you seem to forget about one very important step:
> how should perl know that your sequence of bytes represents "Á" and not some
> other letter(s)?

Because the system defines this as part of LC_CTYPE.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-28 Thread Rich Felker
On Wed, Mar 28, 2007 at 04:03:23PM +0200, Egmont Koblinger wrote:
> On Tue, Mar 27, 2007 at 01:51:59PM -0400, SrinTuar wrote:
> 
> > I'm not quite sure how "thinking in characters" helps an application,
> > in general. I'd be interested if you had a concrete example...
> 
> dealing with. For example it's impossible to implement a regexp matching
> routine if you have no idea what encoding is being used.
> 
> > It's probably advisable to use a library regex engine than to re-write 
> > custom regex engines all the time.
> 
> Sure.

I think SrinTuar has made it clear that he agrees that a
regular expression engine needs to be able to interpret characters.
His point is that the calling code does not have to know anything
about characters, only strings.

> > Once you have a regex library that handles codepoints, the code that uses
> > it doesnt have to care about them in particular.
> 
> It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
> 66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
> UTF-8). Suppose you want to test whether it matches to the regexp 65 46 66
> ("A.B"). Does it match? It depends. If the byte sequence really denotes AÁB
> (i.e. it is encoded in UTF-8) then it does. If it has different semantics (a
> different character sequence encoded in some other 8-bit encoding) then it
> doesn't. How do you think perl is supposed to overcome this problem if it
> didn't have Unicode support?
> 
> You have to make sure that the string to test and the regexp itself are
> encoded in the same charset, and in turn this also matches the charset the
> regexp library routine expects.

When interpreting bytes as characters, you do so according to the
system's character encoding, as exposed by the C multibyte character
handling functions. On systems which allow the user to choose an
encoding, the user then selects it via the LC_CTYPE category. On my
system, it's always UTF-8 and not a runtime option.

If you want to process foreign encodings (not the system/locale native
encoding) then you should convert them to your native encoding first
(via iconv or a similar library). If your native encoding is not able
to represent all the characters in the foreign encoding then you're
out of luck and you should give up your legacy codepage and switch to
UTF-8 if you want multilingual support.
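
A tiny sketch of “interpreting bytes as characters according to the
system’s encoding” via the standard C multibyte functions; it assumes,
as above, that the program runs under a UTF-8 locale.

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
      setlocale(LC_CTYPE, "");
      const char *s = "A\xc3\x81" "B";   /* "AÁB" in UTF-8 */
      size_t total = strlen(s), n = total, chars = 0;
      while (n) {
          int r = mblen(s, n);           /* bytes in the next character, per LC_CTYPE */
          if (r <= 0) break;             /* NUL or invalid sequence: stop */
          s += r; n -= r; chars++;
      }
      printf("%zu characters in %zu bytes\n", chars, total);  /* 3 in 4 under UTF-8 */
      return 0;
  }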

> Otherwise things will go plain wrong sooner
> or later. In some languages regexp matching is done via functions, and
> probably you may have an 8-bit match() and a Unicode-aware mbmatch() as
> well.

I don't know which languages do this, but it's wrong. mbmatch() would
cover both cases just fine (i.e. it would work even if the native
encoding is 8bit). If you want a BYTE-based regex engine, that's
another matter, and AFAIK few languages provide such a thing
explicitly. (If they do, it's by misappropriating an
8bit-codepage-targeted matcher.) But treating bytes and 8bit codepage
encodings as the same thing is wrong. Bytes represent numbers in the
range 0-255. 8bit codepages represent 256-character subsets of
Unicode. These are not the same.

> > The problem soon as you use a library routine that is utf-8 aware, it sets
> > the utf-8 flag on a string and problems start to result. If there was no 
> > utf-8
> > flag on the scalar strings to be set, then you could stay in byte world all 
> > the
> > time, while still using unicode functionality where you needed it.
> 
> As I've already said, there's absolutely nothing preventing you from _not_
> using the Unicode features of Perl at all. But then I'm just curious how you
> would match accented characters to regexps for example.

Regex would always match characters...

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-27 Thread Rich Felker
On Tue, Mar 27, 2007 at 11:53:15PM -0400, SrinTuar wrote:
> 2007/3/27, Daniel B. <[EMAIL PROTECTED]>:
> >What about when it breaks a string into substrings at some delimiter,
> >say, using a regular expression?  It has to break the underlying byte
> >string at a character boundary.
> 
> 
> Unless you pass invalid utf-8 
> sequences to your regular 

Haha, was it your intent to use this huge japanese wide ascii? :)
Sadly I don't think Daniel can read anything but Latin-1...
Here's an ascii transliteration...
~Rich


On Tue, Mar 27, 2007 at 11:53:15PM -0400, SrinTuar wrote:
> 2007/3/27, Daniel B. <[EMAIL PROTECTED]>:
> >What about when it breaks a string into substrings at some delimiter,
> >say, using a regular expression?  It has to break the underlying byte
> >string at a character boundary.
> 
> Unless you pass invalid utf-8 sequences to your regular expression
> library, that should be impossible. breaking strings works great as
> long as you pattern match for boundaries.
> 
> The only time it fails is if you break it at arbitrary byte
> indexes.note that breaking utf-32 strings at arbirtrary indicies also
> destroys the text.
> 
> >In fact, what about interpreting an underlying string of bytes as
> >as the right individual characters in that regular expression?
> 
> The regular expression engine should be utf-8 aware. The code that
> uses and calls it has no need to.
> 
> >Any time a program uses the underlying byte string as a character
> >string other than simply a whole string (e.g., breaking it apart,
> >interpreting it), it needs to consider it at the character level,
> >not the byte level.
> 
> Only the most fancy intepretations require any knowledge of unicode
> code points.Any substring match on valid sequences will produce valid
> boundaries in utf-8,and thats the whole point.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-27 Thread Rich Felker
On Tue, Mar 27, 2007 at 10:07:11PM -0400, Daniel B. wrote:
>  wrote:
> > 
> > > That would be contradictory to the whole concept of Unicode. A
> > > human-readable string should never be considered an array of bytes, it is 
> > > an
> > > array of characters!
> > 
> > Hrm, that statement I think I would object to. For the overwhelming
> > vast majority of programs, strings are simply arrays of bytes.
> > (regardless of encoding) The only time source code needs to care about
> > characters is when it has to layout or format them for display.
> 
> What about when it breaks a string into substrings at some delimiter,
> say, using a regular expression?  It has to break the underlying byte 
> string at a character boundary.

Searching for the delimeter already gives you a character boundary.
There is no need to think further about it.

For example, the unix "cut" program works automatically with UTF-8
text as long as the delimiter is a single byte, and if you want
multibyte delimiters, all you need to do is make it accept a multibyte
delimeter character and then do a substring search instead of a byte
search. There is no need to ever treat the input string as characters,
and in fact doing so just makes it slow and bloated.

> In fact, what about interpreting an underlying string of bytes as
> as the right individual characters in that regular expression?  
> 
> Any time a program uses the underlying byte string as a character
> string other than simply a whole string (e.g., breaking it apart, 
> interpreting it), it needs to consider it at the character level,
> not the byte level.

You're mistaken. Most times, you can avoid thinking about characters
totally. Not always, but much more often than you think.

> > When I write a basic little perl script that reads in lines from a
> > file, does trivial string operations on them, then prints them back
> > out, there should be absolutely no need for my code to make any
> > special considerations for encoding.
> 
> It depends how trivial the operations are.
> 
> (Offhand, the only things I think would be safe are copying and
> appending.)

This is because you don't understand UTF-8..

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-27 Thread Rich Felker
On Tue, Mar 27, 2007 at 10:55:32PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> > ...
> > None of this is relevant to most processing of text which is just
> > storage, retrieval, concatenation, and exact substring search.
> 
> It might be true that more-complicated processing is not relevant to those
> operations.  (I'm not 100% sure about exact substring matches, but maybe 
> if the byte string given to search for is proper (e.g., doesn't have any
> partial representations of characters), it's okay).

No character is a substring of another character in UTF-8. This is an
essential property of any sane multibyte encoding (incidentally, the
only other one is EUC-TW).

> Well of course you need to think in bytes when you're interpreting the
> stream of bytes as a stream of characters, which includes checking for 
> invalid UTF-8 sequences.

And what do you do if they're present? Under your philosophy, it would
be impossible for me to remove files with invalid sequences in their
names, since I could neither type the filename nor match it with glob
patterns (due to the filename causing an error at the byte to
character conversion phase before there's even a change to match
anything). I'd have to write specialized tools to do it...

Other similar problem: I open a file in a text editor and it contains
illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
or a file with mixed encodings (e.g. a mail spool) or with mixed-in
binary data. I want to edit it anyway and save it back without
trashing the data that does not parse as valid UTF-8, while still
being able to edit the data that is valid UTF-8 as UTF-8.

This is easy if the data is kept as bytes and the character
interpretation is only made "just in time" when performing display,
editing, pattern searches, etc. If I'm going to convert everything to
characters, it requires special hacks for encoding the invalid
sequences in a reversible way. Markus Kuhn experimented with ideas for
this a lot back in the early linux-utf8 days and eventually it was
found to be a bad idea as far as I could tell.

Also, I've found performance and implementation simplicity to be much
better when data is kept as UTF-8. For example, my implementation of
the POSIX fnmatch() function (used by glob() function) is extremely
light and fast, due to performing all the matching as byte strings and
only considering characters "just in time" during bracket expression
matching (same as regex brackets). This also allows it to accept
strings with illegal sequences painlessly.

> > Hardly. A byte-based regex for all case matches (e.g. "(ä|�)") will

The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...

> > work just as well even for case-insensitive matching, and literal
> > character matching is simple substring matching identical to any other
> > sane encoding. I get the impression you don't understand UTF-8..
> 
> How do you match a single character?  Would you want the programmer to 
> have to write an expression that matches a byte 0x00 through 0x7F, a
> sequence of two bytes from 0xC2 0x80 through 0xDF 0xBF, a sequence of
> three bytes from 0xE1 0xA0 0x80 through 0xEF 0xBF 0xBF, etc. [hoping I 
> got those bytes right] instead of simply "."?

No, this is the situation where a character-based regex is wanted.
Ideally, a single regex system could exist which could do both
byte-based and character-based matching together in the same string.
Sadly that's not compatible with POSIX BRE/ERE, nor Perl AFAIK.

> >... Sometimes a byte-based regex is also useful. For
> > example my procmail rules reject mail containing any 8bit octets if
> > there's not an appropriate mime type for it. This kills a lot of east
> > asian spam. :)
> 
> Yep.
> 
> Of course, you can still do that with character-based strings if you
> can use other encodings.  (E.g., in Java, you can read the mail
> as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
> Then you can write the regular expression in terms of Unicode characters
> 0-255.  The only disadvantage there is probably some time spent
> decoding the byte stream into the internal representation of characters.)

The biggest disadvantage of it is that it's WRONG. The data is not
Latin-1, and pretending it's Latin-1 is a hideous hack. The data is
bytes with either no meaning as characters, or (more often) an
interpretation as characters that's not available to the software
processing it. I've just seen wy too many bugs from pretending
that bytes are characters to consider doing this reasonable. It also
perpetuates the (IMO very bad) viewpoint among new users that UTF-8 is
"sequences of Latin-1 characters making up a character" in

Re: perl unicode support

2007-03-27 Thread Rich Felker
On Tue, Mar 27, 2007 at 06:44:42PM -0500, David Starner wrote:
> On 3/27/07, Rich Felker <[EMAIL PROTECTED]> wrote:
> >This is not a simple task at all, and in fact it's a task that a
> >computer should (almost) never do...
> 
> Of course. Why shouldn't an editor go through and change 257 headings
> to titlecase by hand? Humans are known for their abilities to do such
> tedious
> things without error, aren't they?

There was a reason I wrote "almost". This is one of the very few
places where a computer should ever perform case mappings: in a
powerful editor or word processor. Another I can think of is
linguistic software (e.g. machine based translation, or anything
that's performing semantic analysis or synthesis of human language
text). These comprise a tiny minority of computing applications and
certainly do not warrant punishing the rest; such functionality and
special handling should be isolated to the programs that use it.

> >The whole idea of case conversion in programming languages is
> >digustingly euro-centric. The rest of the world doesn't have such a
> >stupid thing as case...
> 
> Really? Funny, I'm from North America, and we have a concept of case

Same thing. North American civilization is all European-derived.

> here. 90% of the languages native to the continent are written in a
> script that has a concept of case.

Is that so? I don't think so. Rather, most of the languages native to
the continent have no native writing system, or use a writing system
that was long ago lost/extincted. Perhaps you should look up the
meaning of the word native.. :)

> In fact, I think you'd find that
> most of the world's languages are written in scripts that have a
> concept of case.

This is a very dubious assertion. Technically it depends on how you
measure "most" (language count vs speaker count... also the whole
dialect vs language debate), but otherwise I think it's bogus. I
believe a majority of the world's population has as their native
language a language that does not use case.

Just take India and China and you're already almost there. Now throw
in the rest of South Asia and East Asia, all of the Arabic speaking
countries, 

> Furthermore, the whole reason for Unicode is because
> you have to accomadate every single script's idiosyncracities; you
> have to include case conversion because certain scripts demand it.

No, you only have to deal with the idiosyncracies of the subset you
support. A good multilingual application will have sufficient support
for acceptable display and editing of most or all languages, but
there's no reason it should have lots of language-specific features
for each language. Why should all apps be forced to have
(Euro-centric) case mappings, but not also mappings between (for
example) the corresponding base-character and subjoined-character
forms of Tibetan letters, or transliteration mappings between Latin
and Cyrillic for East European languages?

My answer (maybe others disagree) is that most apps need none of this,
while editor/OS hybrids like GNU emacs probably want all of it. :) But
each app is free to choose which language-specific frills it wants to
include support for. I see no reason that case mappings should be
given such a special place aside from the forces of linguistic
imperialism.

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: perl unicode support

2007-03-27 Thread Rich Felker
On Tue, Mar 27, 2007 at 06:31:11PM +0200, Egmont Koblinger wrote:
> On Tue, Mar 27, 2007 at 11:16:58AM -0400, SrinTuar wrote:
> > >That would be contradictory to the whole concept of Unicode. A
> > >human-readable string should never be considered an array of bytes, it is 
> > >an
> > >array of characters!
> > 
> > Hrm, that statement I think I would object to. For the overwhelming
> > vast majority of programs, strings are simply arrays of bytes.
> > (regardless of encoding)
> 
> In order to be able to write applications that correctly handle accented
> letters, Unicode taught us the we clearly have to distinguish between bytes
> and characters,

No, accented characters have nothing to do with the byte/character
distinction. That applies to any non-ascii character. However, it only
matters when you'll be performing display, editing, and pattern-based
(not literal-string-based, though) searching.

Accents and combining marks have to do with the character/grapheme
distinction, which is pretty much relevant only for display.

None of this is relevant to most processing of text which is just
storage, retrieval, concatenation, and exact substring search.

> and when handling texts we have to think in terms of
> characters. These characters are eventually stored in memory or on disk as
> several bytes, though. But in most of the cases you have to _think_ in
> characters, otherwise it's quite unlikely that your application will work
> correctly.

It's the other way around, too: you have to think in terms of bytes.
If you're thinking in terms of characters too much you'll end up doing
noninvertable transformations and introduce vulnerabilities when data
has been maliciously crafted not to be valid utf-8 (or just bugs due
to normalizing data, etc.).

> > The only time source code needs to care about
> > characters is when it has to layout or format them for display.
> 
> No, there are many more situations. Even if your job is so simple that you
> only have to convert a text to uppercase, you already have to know what
> encoding (and actually what locale) is being used.

This is not a simple task at all, and in fact it's a task that a
computer should (almost) never do... Case-insensitivity is bad enough,
but case conversion is a horrible horrible mistake. Create your data
in the case you want it in.

The whole idea of case conversion in programming languages is
digustingly euro-centric. The rest of the world doesn't have such a
stupid thing as case...

> Finding a particular
> letter (especially in case insentitive mode),

Hardly. A byte-based regex for all case matches (e.g. "(ä|Ä)") will
work just as well even for case-insensitive matching, and literal
character matching is simple substring matching identical to any other
sane encoding. I get the impression you don't understand UTF-8..

> performing regexp matching,
> alphabetical sorting etc. are just a few trivial examples where you must
> think in characters.

Character-based regex (which posix BRE/ERE is) needs to think in terms
of characters. Sometimes a byte-based regex is also useful. For
example my procmail rules reject mail containing any 8bit octets if
there's not an appropriate mime type for it. This kills a lot of east
asian spam. :)

> > If perl did not have a "utf-8" bit on its scalars, it would probably
> > handle utf-8 alot better and more naturally, imo.
> 
> Probably. Probably not. I'm really unable to compare an existing programming
> language with a hypothetical one. For example in PHP a string is simply a
> sequence of bytes, and you have mb...() functions that handle them according
> to the selected locale. I don't think it's either better or worse than perl,
> it's just a different approach.

Well it's definitely worse for someone who just wants text to work on
their system without thinking about encoding. And it WILL just work
(as evidenced by my disabling of the warning and still getting correct
behavior) as long as the whole system is consistent, regardless of
what encoding is used.

Yes, strings need to distinguish byte/character data. But streams
should not. A stream should accept bytes, and a character string
should always be interpreted as bytes according to the machine's
locale when read/written to a stream, or when incorporated into byte
strings.

> > When I write a basic little perl script that reads in lines from a
> > file, does trivial string operations on them, then prints them back
> > out, there should be absolutely no need for my code to make any
> > special considerations for encoding.
> 
> If none of these trivial string operations depend on the encoding then you
> don't have to use this feature of perl, that's all. Simply make sure that
> the file descriptors are not set to utf8, neither are the strings that you
> concat or match to. etc, so you stay in world of pure bytes.

But it should work even with strings interpreted as characters!
There's no legitimate reason for it not to.

Moreover, the warning is fundamentally 

Re: perl unicode support

2007-03-26 Thread Rich Felker
On Mon, Mar 26, 2007 at 05:28:43PM -0400, SrinTuar wrote:
> I frequenty run into problems with utf-8 in perl, and I was wondering
> if anyone else
> had encountered similar things.
> 
> One thing I've noticed is that when processing characters, I often get
> warnings about
> "wide characters in print", or have input/output get horribly mangled.
> 
> Ive been trying to work around it in various ways, commonly doing thing 
> such as:
> binmode STDIN,":utf8";
> binmode STDOUT,":utf8";
> 
> or using functions such as :
> sub unfunge_string
> {
>foreach my $ref (@_)
>{
>$$ref = Encode::decode("utf8",$$ref,Encode::FB_CROAK);
>}
> }
> 
> 
> but this feels wrong to me.
> 
> For a language that really goes out of its way to support encodings, I
> wonder if it
> wouldnt have been better off it it just ignored the entire concept
> alltogether and treated
> strings as arrays of bytes...

Read the ancient linux-utf8 list archives and you should get a good
feel for Larry Wall's views on the matter.

> Ive found pages wherin people complain of similar problems, such as:
> http://ahinea.com/en/tech/perl-unicode-struggle.html
> 
> And I'm wondering if in its attempt to be a good i18n citizen, perl
> hasnt gone overboard and made a mess of things instead.

I agree, but maybe there are workarounds. I have a system that's
completely UTF-8-only. I don't have or want support for any legacy
encodings except in a few isolated tools (certainly nothing related to
perl) for converting legacy data I receive from outside.

With that in mind, I built perl without PerlIO, wanting to take
advantage of my much smaller and faster stdio implementation. But now,
binmode doesn't work, so the only way I can get rid of the nasty
warning is by disabling it explicitly.

Is there any way to get perl to behave sanely in this regard? I don't
really use perl much (mainly for irssi) so if not, I guess I'll just
leave it how it is and hope nothing seriously breaks..

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Non-ASCII characters in file names

2007-03-18 Thread Rich Felker
On Sun, Mar 18, 2007 at 08:41:48AM -0700, Ben Wiley Sittler wrote:
> awesome, and thank you! however, utf-8 filenames given on the command
> line still do not work... the get turned into iso-8859-1, which is
> then utf-8 encoded before saving (?!)
> 
> here's my (partial) utf-8 workaround for emacs so far:
> 
> (if (string-match "XEmacs\\|Lucid" emacs-version)
>nil
>  (condition-case nil (eval
>   (if
>   (string-match "\\.\\(UTF\\|utf\\)-?8$"
> (or (getenv "LC_CTYPE")
> (or (getenv "LC_ALL")
> (or (getenv "LANG")
> "C"
>   '(concat (set-terminal-coding-system 'utf-8)
>(set-keyboard-coding-system 'utf-8)
>(set-default-coding-systems 'utf-8)
>(setq file-name-coding-system 'utf-8)
>(set-language-environment "UTF-8"
>((error "Language environment not defined: \"UTF-8\"") nil)))

Here are all my relevant emacs settings. They work in at least
emacs-21 and later; however, emacs-21 seems to be having trouble with
UTF-8 on the command line and I don’t know any way around that.

; Force unix and utf-8
(setq inhibit-eol-conversion t)
(prefer-coding-system 'utf-8)
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(setq file-name-coding-system 'utf-8)
(setq coding-system-for-read 'utf-8)
(setq coding-system-for-write 'utf-8)

Note that the last two may be undesirable; they force ALL files to be
treated as UTF-8, skipping any detection. This allows me to edit files
which may have invalid sequences in them (like Kuhn’s decoder test
file) or which are a mix of binary data and UTF-8.

I use the experimental unicode-2 branch of GNU emacs, and with it,
forcing UTF-8 does not corrupt non-UTF-8 files. The invalid sequences
are simply shown as octal byte codes and saved back to the file as
they were in the source. I cannot confirm that this will not corrupt
files on earlier versions of GNU emacs, however, and XEmacs ALWAYS
corrupts files visited as UTF-8 (it converts any unicode character for
which it does not have a corresponding emacs-mule character into a
replacement character) so it’s entirely unsuitable for use with UTF-8
until that’s fixed (still broken in latest cvs as of a few months
ago..).

BTW looking for “UTF-8” in the locale string is a bad idea since UTF-8
is not necessarily a “special” encoding but may be the “native”
encoding for the selected language. nl_langinfo(CODESET) is the only
reliable determination and I doubt emacs provides any direct way of
accessing it. :(

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-17 Thread Rich Felker
On Sat, Mar 17, 2007 at 06:25:59PM +0600, Christopher Fynn wrote:
> Colin Paul Adams wrote:
> 
> >>>>>>"Rich" == Rich Felker <[EMAIL PROTECTED]> writes:
> >
> >Rich> Indeed, this was what I was thinking of. Thanks for
> >Rich> clarifying. BTW, any idea WHY they brought the UTF-16
> >Rich> nonsense to DOM/DHTML/etc.?
> 
> >I don't know for certain, but I can speculate well, I think.
> 
> >DOM was a micros**t invention (and how it shows!). NT was UCS-2
> >(effectively).
> 
> AFAIK Unicode was originally only planned to be a 16-bit encoding.
> the The Unicode Consortium and ISO 10646 then agreed to synchronize the
> two standards - though originally Unicode was only going to be a 16-bit 
> subset of the UCS. A little after that Unicode decided to support UCS 
> characters beyond plane 0.
> 
> Anyway at the time NT was being designed (late eighties) Unicode was 
> supposed to be limited to < 65536 characers and UTF-8 hadn't been 
> thought of, so 16-bits probably seemed like a good idea.

While this is probably true, it's also aside from the point. I wasn't
asking why Windows used UCS-2, but why JavaScript remained stuck on
the 16bit idea even after the character set expanded -- since JS is a
pretty high level lang and the size of types is largely irrelevant,
redefining characters to be 32bit integers shouldn't have broken
anything.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Non-ASCII characters in file names

2007-03-17 Thread Rich Felker
On Sat, Mar 17, 2007 at 08:25:43AM +, Colin Paul Adams wrote:
> Now this is where it gets interesting.
> My URI resolver translates the file name (the URI is relative to a
> base file: URI) into a UTF-8 byte sequence which gets passed to the
> fopen call (the program is supposed to work on other O/Ses too, not
> just Linux, but I'll worry about that later).
> 
> The test suite is currently distributed as a zip file. It so happens
> that the file concerned is named using ISO-8859-1 on the distributors
> system. On my system, doing ls from the GNOME console shows the name
> as xgespr?ch.xml. Whereas Emacs dired shows the name as
> xgespräch.xml.
> 
> I'm not sure exactly how fopen is supposed to handle the situation.

It's not. You should not create files in your filesystem with the
wrong encoding. If you do, then the only way to access them is via
whatever the (invalid) byte sequence is.

> Anyway, the test failed - not surprisingly.
> I looked at the unzip man page, to see if there was any filename
> translation option. I couldn't find one.

Yes, the problem here is the unzip command. It should provide a way to
translate filenames...

> So I tried unzipping the distrbution afresh, but this time with
> LANG=en_GB.

That won't help. You can't mix encodings in the filesystem and expect
any reasonable behavior.

> Emacs still showed the same name, ls however showed a completely
> different character (it loked like it might be arabic to me - I don't
> know).
> 
> The test still failed.
> 
> So I went back to LANG=en_GB.UTF-8, unzipped the distribution again,
> and re-named the file, thanks to your help.

Yep, this is the only reasonable fix until the unzip command is fixed
to handle foreign encodings.

> ls now shows the correct file name. Emacs shows
> xgespräch.xml. And the test works.

(setq file-name-coding-system 'utf-8)

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Non-ASCII characters in file names

2007-03-17 Thread Rich Felker
On Sat, Mar 17, 2007 at 09:51:53AM -0700, Ben Wiley Sittler wrote:
> emacs seems not to handle utf-8 filenames at all, regardless of locale.

(setq file-name-coding-system 'utf-8)

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: How to enter accented UTF-8 character on GNOME terminal

2007-03-16 Thread Rich Felker
On Sat, Mar 17, 2007 at 07:05:01AM +, Colin Paul Adams wrote:
> I can't find this in the GNOME help, so I thought I'd try asking here.
> 
> I want to be rename a file so it has an a-umlaut (lower case) in the
> name.
> 
> My LANG is en_GB.UTF-8.
> 
> I don't know how to type the accented character.

One sure way is to copy-and-paste it from a file already containing
the character. I keep around a copy of UnicodeData.txt with the
literal UTF-8 character added to each line for exactly this purpose.

Another method that might work is the ISO 14755 entry method, holding
control and shift and typing the character number in hex. Not sure if
GNOME terminal supports this or not. On the Linux console, if you have
an appropriate keymap loaded, holding AltGr and typing the character
number will do the same.

Of course for characters that you want to enter often, all of these
methods are rather inconvenient. For this purpose you can customize
the X keyboard tables with xkb or xmodmap. I have xkb configured so
that capslock toggles between two mappings. Then I run the command:
setxkbmap us,xx with xx replaced with whatever secondary mapping I
want to use. If you just want accented characters though you probably
don't need a whole secondary mapping; just enabling 'dead keys' or
setting up altgr+something to enter the characters you need is
probably sufficient.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-16 Thread Rich Felker
On Fri, Mar 16, 2007 at 07:16:55PM -0700, Ben Wiley Sittler wrote:
> I believe it's more "DHTML" that is the problem.
> 
> DOMString is specified to be UTF-16. Likewise for ECMAScript strings,
> IIRC, although they may still be officially UCS-2.

Indeed, this was what I was thinking of. Thanks for clarifying. BTW,
any idea WHY they brought the UTF-16 nonsense to DOM/DHTML/etc.? As
far as I can tell there's no reason JS and such were restricted to
16bit types for characters; changing it to 32bit (or 21bit or
whatever) shouldn't be able to break anything... It's not like JS is a
systems programming language with pointers and type casts between
pointer types. It's just working with abstract character numbers.

I wonder if there's any hope that the madness will eventually be
fixed, or if we'll be stuck with UTF-16 forever here..

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-15 Thread Rich Felker
On Thu, Mar 15, 2007 at 01:28:58PM -0700, Rob Cameron wrote:
> Simon,
> 
> You asked about relevance.   The UTF-8 to UTF-16 bottleneck
> is widely cited in literature on XML processing performance.

And why would you do this? Simply keep the data as UTF-8. There's no
good reason for using UTF-16 at all; it's just a bad implementation
choice. IIRC either HTML or XML (yes I know they're different but I
forget which does it..) specifies that everything is UTF-16
internally, which is naturally a stupid thing to specify, but this can
in almost all cases simply be ignored because it's an internal
implementation detail that's not testably different.

> For example, in SAX processing, Psaila finds that transcoding
> takes > 50% of XML processing time.

But isn't XML processing something like 1-5% of your total time for
whatever you're ultimately trying to do?

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-15 Thread Rich Felker
On Thu, Mar 15, 2007 at 11:43:51AM -0700, Rob Cameron wrote:
> Rich,
> 
> I would agree that the abuse of software patents is fundamentally
> wrong and that patent reform is highly overdue. I am doing 
> something about it.

The use of software patents is the abuse of software patents. There is
no difference. Any software patent is fundamentally wrong.

> However, I would prefer to see that discussion taken back to Groklaw.
> We have already had two rounds with respect to International Characters 
> draft Covenant Not to Assert as well as my patent-based open source model.
> http://www.cs.sfu.ca/~cameron/tech-transfer.html

No matter how much you try to "help" 'open source' (a movement to
which I do not belong and do not want to belong) with patents,
reinforcing the software patent system and giving it legitimacy will
only hurt Free Software (and all programmers) more in the long run.
Why are you pursuing patents anyway? Do you even have a reason that
you're willing to share with us?

> The "obvious" application of vectorization to UTF-8 doesn't work,
> because UTF-8 comes in variable length chunks.

Without reading your source, my "obvious" implementation would be a
sort of nondeterministic model of computing the decoding in all
alignments at once, and ensuring that an error flag accumulates for
invalid ones to allow only the valid ones to be kept. While UTF-8
chunks are variable length, UTF-8 has the very nice property that a
misaligned decode will never emulate a valid sequence. I thought this
out in under a minute, being moderately experienced in writing UTF-8
decoders. It's not rocket science.

I'm sure there are other approaches too. Maybe you have a somewhat
better one. In any case the madness of patenting "applying
vectorization to problem X", or in general "applying known tool Y to
problem X", has got to stop!

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-15 Thread Rich Felker
On Wed, Mar 14, 2007 at 02:01:04PM -0700, Rob Cameron wrote:
> u8u16-0.9  is available as open source software under an OSL 3.0 license
> at http://u8u16.costar.sfu.ca/

On second thought, I will not offer any further advice on this. The
website refers to "patent-pending technology". Software patents are
fundamentally wrong and unless you withdraw this nonsense you are an
enemy of Free Software, of programmers, and users in general, and
deserve to be ostracized by the community. Even if you intend to
license the patents obtained freely for use in Free Software, it's
still wrong to obtain them because it furthers a precedent that
software patents are valid, particularly stupid patents like "applying
vectorization in the obvious way to existing problem X".

Sorry for the harsh language but this is what you should expect when
you ask for advice from the Linux/Free Software community on
leveraging patents against them.

Sincerely,
Rich Felker

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-15 Thread Rich Felker
On Wed, Mar 14, 2007 at 02:01:04PM -0700, Rob Cameron wrote:
> As part of my research program into high-speed XML/Unicode/text
> processing using SIMD techniques, I have experimented extensively
> with the UTF-8 to UTF-16 conversion problem.I've generally been
> comparing performance of my software with that of iconv under
> Linux and Mac OS X.Are there any substantially faster implementations
> available?   Currently our u8u16-0.9 software runs about 3X to 25X faster 
> than iconv depending on platform and data characteristics.

GNU iconv is an extremely bad implementation to test for performance.
It has high overhead per call (so it will only be remotely fast on
very large runs, not individual character conversions), and even then
I don't suspect it would be very fast.

Why not just write the naive conversion algorithm yourself? For the
UTF-8 decoding, refer to uClibc's implementation of mbrtowc for UTF-8
locales, which is probably the fastest I've seen. I also have an
implementation in i386 asm which might be slightly faster.

> u8u16-0.9  is available as open source software under an OSL 3.0 license
> at http://u8u16.costar.sfu.ca/

Thanks. I'll take a look.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: c++ strings and UTF-8 (other charsets)

2007-03-08 Thread Rich Felker
On Thu, Mar 08, 2007 at 10:18:55PM -0500, Daniel B. wrote:
>  wrote:
> 
> > I have yet to encounter a case where a "character" count is useful.
> 
> Well, if an an editor the user tries to move forward three characters,
> you probably want to increment a character count (an offset from
> the beginning of the string).  

1. Normally you want to move locally by a (very) small integer number
of characters, e.g. 1, not to a particular character offset a long way
away. While the latter is a valid operation and is expensive in UTF-8
it has no practical applications that I know of except when all
characters occupy exactly one column and you’re trying to line up
columns. Relative seeking by n characters in UTF-8 is O(n),
independent of string length, so no problem for small relative cursor
motion like your example.

2. Even in such an editor, normally the unit by which you want to move
by is “graphemes” and not “characters”. That is, if the cursor is
positioned prior to ‘ã’ (LATIN LETTER SMALL A + COMBINING TILDE) and
you press the right arrow, you probably want it to move past both
characters and not “between” the two. The concept of graphemes is
slightly more complex in Indic scripts. There’s also the cases of
Korean (decomposed Jamo), Tibetan (stacking letters), etc. which can
be treated logically just like the A-TILDE example above.

> (No, I don't know how dealing with glyphs instead of just characters
> adds to that.)

Hopefully the above answers a little bit of that uncertainty..

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: c++ strings and UTF-8 (other charsets)

2007-03-01 Thread Rich Felker
On Thu, Mar 01, 2007 at 07:53:52PM +0100, Marcel Ruff wrote:
> Are you thinking of Java's _modified_ version of UTF-8
> (http://en.wikipedia.org/wiki/UTF-8#Java)?
> The first sentence from the above wiki says:
> 
> "In normal usage, the Java programming language 
>  supports 
> standard UTF-8 when reading and writing strings through 
> |InputStreamReader 
> | 
> and |OutputStreamWriter 
> "|
> 
> and this is what i do to access sockets, so no problems here.
> 
> But then it states that 'Supplementary multilingual plane' is encoded 
> incompatible.

Oh, you're talking about that part, not the NUL issue. Then yes, it's
a major problem. Java generates and processes bogus illegal UTF-8
(surrogates). I don't know if there are any easy workarounds except to
flame Sun to hell for being so stupid..

> So must i assume if i send 'mathematical alphanumeric symbols'
> http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
> like 'ℝ' from C to java they will be corrupted?

ℝ is in the BMP, so no problem with it. It's just the huge pages of
random letters in every single font/style imaginable that are outside
the BMP. Of course various important CJK characters (needed for
writing certain names) and historical scripts are also outside the
BMP.

> Both applications work with what they think is 'UTF-8' ...

Yes. And Java is wrong. However, according to the Wikipedia article
referenced, Java _does_ do the right thing in input and output
streams. It's only the object serialization stuff that uses the bogus
UTF-8. So I don't think you're likely to have problems in practice as
long as you don't try to pass this data off (which would be in binary
files anyway, I think...?) as UTF-8.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: c++ strings and UTF-8 (other charsets)

2007-03-01 Thread Rich Felker
On Thu, Mar 01, 2007 at 09:41:44AM +0100, Marcel Ruff wrote:
> 
> >>Are you thinking of Java's _modified_ version of UTF-8
> >>(http://en.wikipedia.org/wiki/UTF-8#Java)?
> >>
> >
> >Uhg, disgusting...
> >  
> Yes - this is an open & serious issue for my approach!
> 
> Has anybody some practical advice on this?

Just treat the sequence c0 80 according to the spec, as an invalid
sequence. Neither it (because it's illegal utf-8) nor a real NUL
(because it's illegal in text) should appear. If your problem is more
specific and there's a real reason you need to handle such data
differently, please describe what you're doing so we can offer better
advice.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: c++ strings and UTF-8 (other charsets)

2007-02-28 Thread Rich Felker
On Tue, Feb 27, 2007 at 07:49:17PM -0500, Daniel B. wrote:
> Marcel Ruff wrote:
> > 
> 
> > As UTF-8 may not contain '\0' ...
> 
> Yes it can.

No, I think he just meant to say "a string of non-NUL _characters_ may
not contain a 0 _byte_". The NUL character is not valid "text" or a
valid part of a "string" in the POSIX sense of "text" or the C/POSIX
sense of "string".

> Are you thinking of Java's _modified_ version of UTF-8
> (http://en.wikipedia.org/wiki/UTF-8#Java)?

Uhg, disgusting...

BTW, note that ill-advised programs allowing NUL characters in text
where they do not belong often leads to vulnerabilities, like the
Firefox vuln just a few days ago.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: c++ strings and UTF-8 (other charsets)

2007-02-27 Thread Rich Felker
On Tue, Feb 27, 2007 at 09:49:50AM -0500, SrinTuar wrote:
> On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
> >> Old code doesn't need to be ported.
> >
> >Very strange advice, indeed.
> 
> You might want to read up on the history of UTF-8.

Here are some references for anyone wanting to do so:
http://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

> Not needed to make any code changes at all to most applications was in
> fact one of the primary design goal of the encoding.

I'd like to expand on and strengthen this statement a bit: the goal
was not just to avoid making code changes, but to avoid requirements
on text that would be fundamentally incompatible with some of the most
powerful tools in the unix model. UTF-16 (or at that time, UCS-2) not
only broke the API of standard C and unix; it also broke the
statelessness and robustness of text and the ability to treat it as
binary byte streams in pipes, etc. due to byte order issues and BOM.
This could have been avoided only by redefining the atomic data unit
(byte) to be 16 (or later 21 :) bits, which would in turn have
required scrapping and replacing every octet-based internet protocol..

Hopefully a good understanding of the history and motivations behind
UTF-8 makes it clear that UTF-8 is not (as Windows and Java fans try
to portrary it) a backwards-compatibility hack, but instead a
fundamentally better encoding scheme which allows powerful unix data
processing principles to continue to be used with text. It's a shame
the history isn't better-known.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: c++ strings and UTF-8 (other charsets)

2007-02-26 Thread Rich Felker
On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
> On Mon, Feb 26, 2007 at 08:10:59AM +0100,
>  Marcel Ruff <[EMAIL PROTECTED]> wrote 
>  a message of 65 lines which said:
> 
> > As UTF-8 may not contain '\0' you can simply use all functions as
> > before (strcmp(), std::string etc.).
> 
> As long as you just store or retrieve strings. If you compare them
> (strcmp), you HAVE TO take normalization into account.

No you don't. Nothing in Unicode says that you must treat canonically
equivalent strings as identical, and in fact doing so is a bad idea in
most of the situations I've worked with. Unicode only says that you
should not assume that another process (in the Unicode sense of the
word "process") will treat them as being distinct.

If your particular application has a special need for normalization,
then yes you need to take it into account. But if you're doing
something like passing around filenames you most surely should not be
normalizing anything.

> If you measure
> them (strlen), you HAVE TO use a character semantic, not a byte
> semantic. And so on.

Huh? Length in characters is basically useless to know. Length in
bytes and width of the text when rendered to a visual presentation are
both useful, but the only place where knowing length in number of
characters is useful is for fields that are limited to a fixed number
of characters. If the limit is for the sake of using a fixed-size
storage object, then this limit should just be changed to a limit in
bytes instead of in characters..

> > Old code doesn't need to be ported.
> 
> Very strange advice, indeed.

?? Hardly strange.. It depends on what the code does. See Markus
Kuhn's UTF-8 FAQ.

But Marcel is right about a lot of old code (just not all). Most code
doesn't care at all about the contents of the text, just that it's a
string.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: A call for fixing aterm/rxvt/etc...

2007-02-25 Thread Rich Felker
On Sat, Feb 24, 2007 at 01:39:25AM -0500, Rich Felker wrote:
> > using luit for this sounds appealing, but in my experience luit (a)
> > crashes frequently and (b) is easily confused by escape sequences and
> > has no user interface for resetting all its iso-2022 state, so in
> > practice it works for only a few apps.
> 
> Hmm, maybe a replacement for luit is in order then.. If I omit
> iso-2022 support (which IMO is a big plus) then it should just be ~100
> lines of C.. I'll see if I can whip up a prototype sometime soon.

And here it is. Ugly but simple. Syntax is:
tconv [-i inner_encoding] [-o outer_encoding] [-e command ...]

Both encodings default to nl_langinfo(CODESET). Command defaults to
$SHELL. Bad things(tm) may happen if you set either encoding to
something stateful or ascii-incompatible (e.g. non-EUC legacy CJK
encodings) or a transliterating converter.

Actual usage to fix rxvt:
rxvt -e ./tconv -o iso-8859-1

Known bugs: termios handling is somewhat wrong and something should be
done to ensure that replacements made by iconv match the column width
of the correct character, to avoid corrupting the terminal. Maybe
deadlock situations when terminal blocks..? Other bugs?

Rich
/* Written in 2007 by Rich Felker; released to the public domain */

#define _XOPEN_SOURCE 500

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static void dummy(int sig)
{
}

static void print(int fd, ...)
{
va_list ap;
const char *s;
va_start(ap, fd);
while ((s = va_arg(ap, const char *)))
write(fd, s, strlen(s));
}

int main(int argc, char **argv)
{
int i, j;
const char *o_enc, *i_enc;
char **cmd = 0;
int pty;
fd_set rfds, wfds;
char buf[512], buf2[1536];
static struct termios tio, tio_old;
iconv_t itoo, otoi;
char *in, *out;
size_t inb, outb;

#ifdef TIOCSWINSZ
struct winsize ws = { };

signal(SIGWINCH, dummy);
ioctl(0, TIOCGWINSZ, &ws);
#endif

tcgetattr(0, &tio);
tio_old = tio;
tio.c_cflag &= CBAUD;
tio.c_cflag |= CS8 | CLOCAL | CREAD;
tio.c_iflag = 0;
tio.c_oflag = 0;
tio.c_lflag = 0;
tcsetattr(0, TCSANOW, &tio);

setlocale(LC_CTYPE, "");
i_enc = o_enc = nl_langinfo(CODESET);

for (i=1; i 2) close(i);
if (cmd) execvp(cmd[0], cmd);
else {
const char *s = getenv("SHELL");
if (!s) s = "/bin/sh";
execl(s, s, (char *)0);
}
_exit(1);
}

goto resize;
for (;;) {
FD_ZERO(&rfds);
FD_ZERO(&wfds);
FD_SET(0,&rfds);
FD_SET(pty,&rfds);
switch (select(pty+1, &rfds, &wfds, NULL, 0)) {
case 0:
continue;
case -1:
if (errno != EINTR) {
print(2, argv[0], ": error: ",
strerror(errno), "\n", (char *)0);
goto die;
}
resize:
#ifdef TIOCSWINSZ
ioctl(0, TIOCGWINSZ, &ws);
ioctl(pty, TIOCSWINSZ, &ws);
#endif
continue;
}
if (FD_ISSET(pty, &rfds)) {
ssize_t l = read(pty, buf, sizeof buf);
if (l <= 0) exit(0);
in = buf; inb = l;
out = buf2; outb = sizeof buf2;
while (inb && outb) {
iconv(itoo, &in, &inb, &out, &outb);
if (inb) { in++; inb--; }
}
write(1, buf2, out-buf2);
}
/* fixme: account for blocked pty..? */
if (FD_ISSET(0, &rfds)) {
ssize_t l = read(0, buf, sizeof buf);
if (l <= 0) exit(0);
in = buf; inb = l;
out = buf2; outb = sizeof buf2;
while (inb && outb) {
iconv(otoi, &in, &inb, &out, &outb);
if (inb) { in++; inb--; }
}
write(pty, buf, l);
}
}
die:
tcsetattr(0, TCSAFLUSH, &tio_old);
return 1;
}


Re: c++ strings and UTF-8 (other charsets)

2007-02-25 Thread Rich Felker
On Sat, Feb 24, 2007 at 06:13:37PM +0100, Julien Claassen wrote:
> Hi!
>   What I meant about UTF-8-strings in c++: I mean in c and c++ they're not 
> standard like in Java.

UTF-16, used by Java, is also variable-width. It can be either 2 bytes
or 4 bytes per character. Support for the characters that use 4 bytes
is generally very poor due to the misconception that it's
fixed-width.. :(

> I think UTF-8 is a variable width multibyte charset, so 
> there are specific problems in handling them allocating the right space. I 
> mean the Glib contains something like UString and QT has its QStrings, which 
> I think are also UTF-8 capable.

All strings are UTF-8 capable; the unit of data is simply bytes
instead of characters. If you're looking for a class that treats
strings as a sequence of abstract characters rather than a sequence of
bytes, you could look for a library to do this or write your own.
However I suspect the most useful way to do this on C++ would be to
extend whatever standard byte-based string class you're using with a
derived class.

Maybe there's something like this built in to the C++ STL classes
already that I'm not aware of. As I said I don't know much of (modern)
C++. Can someone who knows the language better provide an answer?

It would also be easier to provide you answers if we knew better what
you're trying to do with the strings, i.e. whether you just need to
store them and spit them back in output, or whether you need to do
higher-level unicode processing like line breaks, collation,
rendering, etc.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: A call for fixing aterm/rxvt/etc...

2007-02-23 Thread Rich Felker
On Fri, Feb 23, 2007 at 04:24:29PM -0800, Ben Wiley Sittler wrote:
> just two cents: i did this some years back for the links and elinks
> web browsers (it's the "utf-8 i/o" option available in some versions

FWIW: ELinks has since been fixed (in the development versions, not
yet released but working great) to have true UTF-8 support. Proper
Unicode support/m17n is still a ways off tho (bidi, line breaking,
combining characters, cjk-wide behavior, UTF-8 text search, etc.).

> of each) and the results are fairly mixed -- copy-n-paste fails
> horribly in an app converted in this way, and i assume the same would
> be true of a terminal emulator in a window system like X11. on the

Well, copy-n-paste will work fine as long as the characters you want
to copy/paste are in the user's selected legacy codepage. Other
characters naturally are lost, but presumably the user doesn't really
care about characters aside from the ones in their own language or
else they'd get a better terminal..

> using luit for this sounds appealing, but in my experience luit (a)
> crashes frequently and (b) is easily confused by escape sequences and
> has no user interface for resetting all its iso-2022 state, so in
> practice it works for only a few apps.

Hmm, maybe a replacement for luit is in order then.. If I omit
iso-2022 support (which IMO is a big plus) then it should just be ~100
lines of C.. I'll see if I can whip up a prototype sometime soon.

> that said, it would probably be better  thanthe current state of affairs.

Yeah, that was the main thing I wanted to say, I suppose. Of course it
would be nice if someone wants to add proper UTF-8 support, but that's
a lot more work.. IMO, if there were at least minimal UTF-8 support,
it might allow people with modern systems and UTF-8 locales to use
these terminal emulators again, and then they might get interested in
improving them to have real support...

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



A call for fixing aterm/rxvt/etc...

2007-02-23 Thread Rich Felker
These days we have at least xterm, urxvt, mlterm, gnome-terminal, and
konsole which support utf-8 fairly well, but on the flip side there's
still a huge number of terminal emulators which do not respect the
user's encoding at all and always behave in a legacy-8bit-codepage
way.

Trying to help users in #irssi, etc. with charset issues, I've come to
believe that it's a fairly significant problem: users get frustrated
with utf-8 because the terminal emulator they want to use (which might
be chosen based on anti-bloat sentiment or, quite the opposite, on a
desire for specialized eye candy only available in one or two
programs) forces their system into a mixed-encoding scenario where
they have both utf-8 and non-utf-8 data in the filesystem and text
files.

How hard would it be to go through the available terminal emulators,
evaluate which ones lack utf-8 support, and provide at least minimal
fixes? In particular, are there any volunteers?

What I'm thinking of as a minimal fix is just putting utf-8 conversion
into the input and output layers. It would still be fine for most
users of these apps if the terminal were limited to a 256-character
subset of UCS, didn't support combining characters or CJK, etc. as
long as the data sent and received over the PTY device is valid UTF-8,
so that the (valid and correct) assumption of applications running on
the terminal that characters are encoded in the locale's encoding is
satisfied.

Perhaps this could be done via a "reverse luit" -- that is, a program
like luit or an extension to luit that assumes the physical terminal
is using an 8bit legacy codepage rather than UTF-8. Then these
terminals could simply be patched to run luit if the locale's encoding
is not single-byte.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: c++ strings and UTF-8 (other charsets)

2007-02-20 Thread Rich Felker
On Mon, Feb 19, 2007 at 06:49:20PM +0100, Julien Claassen wrote:
> Hello!
>   I've got one question. I'm writing a library in c++, which needs to handle 
> different character sets. I suppose for internal purposes UTF-8 is quite 
> sufficient. So is there a standard string class in the libstdc++ which 
> supports it?
>   Can I use something like:
>   printw(0,0,"%s",my_utf8_string.c_str());
>   with it?

The whole point of UTF-8 is that it's usable directly as a normal
string. You don't need any special classes, just a normal string
class. If you want to add extra UTF-8-specific functionality you could
perhaps make a derived class.

>   Is there some kind of good, small example code of how to use libiconv most 
> efficiently with strings in c++?

Not sure what you mean by most efficiently. If you're converting from
another encoding to UTF-8, I would just initially allocate some small
constant times the original size in the legacy encoding (3 times
should be sufficient; 4 times surely is), then use iconv to convert
into the allocated buffer, and subsequently resize it to free the
unused space if you care about space.

Sorry my suggestions aren't very C++-specific. I only use C and am not
very fond of C++ so I'm not particularly familiar with it.

>   Any good hints are appreciated! Thanks!

Hope this helps a little bit. If you have more specific questions feel
free to ask (on-list please).

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



New Tibetan glyphs for GNU unifont -- is there a maintainer?

2006-12-14 Thread Rich Felker
Is anyone maintaining the GNU unifont? I just completed a set of
8x16 Tibetan glyphs which, via what could be called either clever
pixel-painting or a horrible hack, allow the display of legible
Tibetan text with pure-overstrike combining marks (no contextual
substitutions or positioning required). Needless to say Sanskrit
transliteraion stacks don't come out very nice, but all "standard"
Tibetan stacks work, meaning one can read and write Tibetan or
Dzongkha language text.

I'd like to see these added to GNU unifont if possible. The existing
glyphs there, last I checked, weren't at all appropriate for actual
overstrike rendering, only for use as nominal Unicode glyphs and not
even very good in that regard.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Do combinations need to be defined in advance?

2006-12-12 Thread Rich Felker
On Tue, Dec 12, 2006 at 08:56:06PM +0600, Christopher Fynn wrote:
> 
> Rich Felker wrote:
> 
> >Whether it's possible to support all combinations efficiently, I don't
> >know. The OpenType system is very poorly designed from what I can
> >tell. In the Tibetan fonts I've examined, rather than just saying
> >"character U+0F62 needs to use an alternate glyph when followed by any
> >of {list here} combining characters", there are individual ligature
> >combination tables for each pairing. Whether this is just lack of
> >understanding on the font designer's part or fundamental limitations
> >of OpenType, I'm not sure.
> 
> Although you can build Tibetan stacks using contextual substitutions
> I've found through trial and error that it is generally much more 
> efficient to have pre-composed consonant stacks and simple 
> (non-contextual) GSUB lookups. You will probably still need some 
> contextual lookups for vowel marks and for a few variant forms of stacks 
> - especially in cursive style Tibetan - but having a lot of contextual 
> substitution lookups in a Tibetan font seems to slow everything to a 
> crawl especially with long documents.

OK, so basically it's a workaround for poor OpenType implementations.
Got it. Thanks for the explanation.

BTW, I heard there was a mailing list specifically for Tibetan
font/script issues. Is that still active, and if so, how can I
subscribe?

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Do combinations need to be defined in advance?

2006-12-11 Thread Rich Felker
On Mon, Dec 11, 2006 at 05:49:22PM +0100, Andries Brouwer wrote:
> On Mon, Dec 11, 2006 at 05:06:23PM +0100, Jan Willem Stumpel wrote:
> 
> > I am beginning to think that the responsibility for correct
> > "combining accents" behaviour rests primarily with the rendering
> > engine, rather than with the fonts. The fonts must, of course,
> > include the combining accents, otherwise the accents will be
> > borrowed from other fonts; but I doubt that they really need
> > anchors or GPOS.
> > 
> > E.g. say I am a rendering engine; I see a character which, from
> > its Unicode range, is either
> > 
> > -- a "top" accent
> > -- a "bottom" accent
> > [-- a left accent if such things exist, a right accent, etc.,]
> 
> In Hebrew, a dagesh is a dot centered in the glyph to double
> the consonant or change the pronunciation.
> 
> The precise place where it should go must be indicated by the font.
> If one just centers a dot in the same area, it may well be
> (and in practice, in my Java experiments, is) invisible
> because it overlaps part of the glyph.

Then in principle you just need a 'center point' anchor for Hebrew
consonants. The point is that rendering combining marks should require
roughly O(nk) information (where n is the number of characters and k
is a small number of classes) as opposed to O(nm) or even O(nm^j)
(where m is the number of combining characters and j is the maximum
combining stack length).

Whether it's possible to support all combinations efficiently, I don't
know. The OpenType system is very poorly designed from what I can
tell. In the Tibetan fonts I've examined, rather than just saying
"character U+0F62 needs to use an alternate glyph when followed by any
of {list here} combining characters", there are individual ligature
combination tables for each pairing. Whether this is just lack of
understanding on the font designer's part or fundamental limitations
of OpenType, I'm not sure.

On the other hand, I've successfully implemented a O(nk) system with
UCF/uuterm, so I know it's possible. From what I've read, Apple's AAT
tables also sound like they're O(nk) and don't suffer from the
horrible "leave it to the rendering engine to decide what to do, and
decide incorrectly" syndrome of OpenType/Uniscribe/pango/etc.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Xlib UTF-8 support

2006-12-07 Thread Rich Felker
On Thu, Dec 07, 2006 at 01:36:01PM +0900, Jiro SEKIBA wrote:
> 
> At Thu, 07 Dec 2006 03:17:32 +0100,
> Mirco Bakker wrote:
> 
> > The programm (written in C) uses only the standard Xlib. The
> > writing is done using XmbDrawString() (AFAIK function of choice).
> > I also tried Xutf8DrawString (X_HAVE_UTF8_STRING is set) with the
> > same effect. After Googeling for hours I found a few outdated
> > reports that Xlib has a Bug handling UTF-8 Strings (or Fonts). Is
> > this still true or is my code crap?
> 
>  X UTF-8 supports is ok, but only a few fonts have all glyphs.

A few? Actually no fonts have all glyphs. :(
Part of this is just incompleteness, but part of it is the
insufficiency of 1-1 character/glyph mapping.

>  Or you can use font sets instead of single iso10646-1 font.
> Try to specify legacy fonts separated by ',' comma.  
> Like "a14,k14,*", ('*' is wild card).

Hm? What programs will use this?

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Xlib UTF-8 support

2006-12-06 Thread Rich Felker
On Wed, Dec 06, 2006 at 10:06:09PM -0500, Michael B Allen wrote:
> Two things. First, I believe Pango is becoming the defacto method for
> rendering non-Latin1 text in general purpose applications (I've never

I'm hoping we can remedy this situation. Xft/pango is extremely slow
compared to the core X font system, and there's nothing wrong with the
core system as long as the X/font server could communicate
OpenType/AAT/etc. tables to Xlib for Xlib to use in correctly choosing
glyphs.

Unfortunately we're a long way from having something like this
working, but in the mean time Xlib and core fonts should work fine for
UTF-8 as long as you don't need context-sensitive glyphs.

> used it but from installing apps I can see more and more apps depend on
> it). Second, make sure you're in the UTF-8 locale. If you're not,
> UTF-8 text will not be rendered properly.

Also make sure a font with iso10646-1 encoding is selected... Any
other ideas?

BTW I don't know what policy on this list is, but in general it's
considered bad to top-post on lists I think.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: utf8 and solaris

2006-11-18 Thread Rich Felker
On Sat, Nov 18, 2006 at 07:43:54PM +0530, Balaji.Ramdoss wrote:
> Folks, well this is not on linux. I have an issue in Sun Solaris box 
> where octal values gets displayed instead of symbols like "^","|" as 
> \136, \075.
> This happens if I set my LC_CTYPE to en_US.UTF-8 locale and I have the 
> "set verbose" on.
> The OS/hardware is
> SunOS irvhomer 5.9 Generic_118558-23 sun4u sparc SUNW,Ultra-250
> and the tcsh is
> tcsh 6.14.00 (Astron) 2005-03-25 (sparc-sun-solaris) options 
> wide,nls,dl,al,kan,rh,color,filec
> 
> Will really appreciate if someone has any possible work around to this 
> solaris/tcsh bug ?

1. Does your version of Solaris actually have a locale named
   en_US.UTF-8? If not you need to figure out how to create it.

2. What program is displaying octal codes, etc.? If it's tcsh the
   problem is most likely just that tcsh sucks. :) As far as I know it
   doesn't support UTF-8. You might be able to get it to display the
   characters with the same options that make 8bit encodings work, but
   I suspect it will be hard to interactively edit the commandline.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-11-09 Thread Rich Felker
On Tue, Nov 07, 2006 at 01:13:24AM -0800, rajeev joseph sebastian wrote:
> Well, I think I misunderstood ...

No problem.

> ---
> In the first para, I asked whether it was possible to use TrueType
> in the terminal. If we cannot, then we need to use some hybrid of
> bitmap fonts and OT fonts, such that, the OT features can be used
> (atleast the GSUB if nothing else) and the Bitmap features can be
> used (i.e., using a bitmap instead of outlines).

Yes, UCF also solves the problem of character->glyph mapping in a way
that's more cell-oriented, but an application (e.g. mlterm) using
OpenType fonts could use the OT tables instead and get the same
effect.

> ---
> 
> In the last para, I said that I would try (or rather the Typographer
> and I could try) the following:
> 
> 1) Since you are assigning widths to characters, and since each
> logical cluster would get a width = sum of the widths of the
> characters in that cluster, ...
> 
> 2) ... all we need to do is design the font in such a way that, the
> glyph corresponding to a logical cluster would use as much space as
> available to it.
> 
> E.g., 
> 
> kra cluster consists of ka + chandrakkala + ra
> so, when a software (say ls or cat) outputs a sequence ka +
> chandrakkala + ra, the kra logical cluster will get widthC =
> width(ka) + width(chandrakkala) + width(ra) allocated to it. In the
> font, we make sure that the kra *glyph* which corresponds to the kra
> *logical cluster* uses as much as possible of widthC.
> 
> With this, characters have a width specification, and glyphs can be
> moulded to use as much of the space as possible/necessary as per the
> widths assigned to each *character*.
> 
> --
> 
> I hope I have set things right ?

Yep, this is right! Maybe you or your typographer friend could try
sketching out a few glyphs and see if it seems to work out well or not
(and what character width assignments would be required). The
character cell size I'm working with for my font with widespread
coverage of lots of scripts is 8x16, but larger or smaller font sizes
could of course be made too. In assigning widths. my inclination is
never to assume that more than 3 (or 4?) vertical strokes can fit in a
single cell, since 3 is the number in the latin characters "m" and "w"
and since a cell size too small to represent latin characters is
probably not useful anywhere.

In terms of simplifying font design, it helps if conjunct forms can be
reduced as much as possible to 'glueing together pieces'. UCF allows
the shape of the pieces to vary depending on the adjacent pieces. For
example a latin "fi" ligature is made not by creating a single wide
"fi" glyph but instead a special glyph for "f when it is followed by
i" and a special glyph for "i when it follows f". In conjunct
formation for many scripts (including diacritic placement for western
scripts, stacking for Tibetan, and various others) this model works
out nicer and greatly reduces the number of glyphs needed (and the
amount of maintainence/font design work). However, if needed, it's
possible to convert whole predrawn "conjunct glyphs" to the UCF rules
format -- it just might require a lot of glyphs. For Malayalam, a mix
of the two approaches is probably appropriate, depending on whether
the particular conjunct is formed by putting together 'reusable' parts
or whether it's highly unique to the character sequence it represents.

Hopefully this information is helpful to you or anyone else thinking
about designing fonts.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-11-06 Thread Rich Felker
On Mon, Nov 06, 2006 at 10:14:20AM -0800, rajeev joseph sebastian wrote:
> I can say that you have done a good job. My point has so far been
> that some kind of special font system should be created. In any
> case, the use of straight TTF or OTF is not possible. (is it?). in
> that case, it may be worthwhile to investigate a kind of OpenType
> Bitmap font :)

It's not a question of the font system not being powerful enough. It's
a question of font-specific spacing not being available. It's much
more fundamental, the information just isn't there. If I do:

cat foo.txt

on a terminal, how does the text file query a font and decide how to
align itself? It's not a program. Even if it were a program, for
example ls, the columnar output would only be correct for one run. If
you did:

ls -C > listing.txt

should ls adapt its output to the current terminal and font it's
running on? What if you then do

cat listing.txt

on a different terminal or with a different font? This is why the
notion of column width must be font-independent. If you're talking
about making a system where spacing is font-dependent, that's
something you can do, but it's a sort of graphic layout language and
not a charactercell terminal anymore, and it won't be useful for
running any existing terminal apps (their output will corrupt,
especially if it causes automargins to wrap in unexpected places) and
loses many of the nice properties of a terminal.

Note that this is an entirely separate issue from the "excessive
spacing" issue. Correction for excessive spacing (with an api more
powerful than wcwidth() that takes context into consideration) is one
possible design direction for a terminal, but the width would still
have to be specified in a font-independent manner.

BTW there are also lots of nice things that can be done to get rid of
the excessive space "problem", for example pushing all the space
forward to the next place where two or more consecutive spaces, or a
tab, or end of line occurs. This can be done entirely at display time
so that it does not desynchronize with the application's idea of the
terminal contents or lead to corruption. The only important thing is
to maintain a concept of cells containing characters, without which
character-based applications cannot work (and I already explained in
the last email why any application running on a terminal must be
character-based and not glyph-based).
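
A rough sketch of that display-time trick, under assumptions that are not
uuterm's actual representation (a cells[] array with one character per
column, padding cells owed to a preceding wide cluster marked with L'\0',
and tabs ignored for brevity):

#include <wchar.h>

/* draw() emits one character cell to the display, left to right. */
static void render_line(const wchar_t *cells, int ncols, void (*draw)(wchar_t))
{
    int owed = 0;                             /* deferred padding columns */
    for (int i = 0; i < ncols; i++) {
        if (cells[i] == L'\0') {              /* padding cell: push it forward */
            owed++;
            continue;
        }
        if (cells[i] == L' ' && (i + 1 == ncols || cells[i + 1] == L' ')) {
            while (owed) { draw(L' '); owed--; }   /* flush padding at a blank run */
        }
        draw(cells[i]);
    }
    while (owed) { draw(L' '); owed--; }      /* anything left goes at end of line */
}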

> In this case, since each designer will know exactly how much space
> is available, he can *design* conjuncts to fill as much space as
> possible. I can talk to the typographer who makes Malayalam fonts
> for us on this matter, whether he can think about the problem.

Last time I checked even typographers for Latin fonts weren't very
fond of character cell terminals... :(

Rich



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-11-05 Thread Rich Felker
On Sun, Nov 05, 2006 at 12:59:03PM -0800, rajeev joseph sebastian wrote:
> Well, most correctly implemented Unicode-aware applicatons do this also:
> have 2 backing stores, one for text and the other for glyphs. Use
> the glyph representation for display. When a selection is done, the
> map between the 2 stores is used to derive the correct text for the
> selected glyphs.

Yes, this is roughly what uuterm does (except it doesn't keep a glyph
representation, it just dynamically-generates it). However
applications running on the terminal don't have any way to know about
glyphs; all they can access are the characters.

> Currently, most apps I have seen use the precomposed Latin
> characters, which is allowed only because of the stability policy.
> Most apps do not implement complex layout of latin glyphs which
> causes no-end of problems for Latin transliterations of Indic/other
> text. Although most of the required characters for Indic
> transliteration are already available precomposed, the policy of
> Unicode and the combining mark model do not allow the rest to be
> encoded. Hence the proliferation of PUA codepoints for this purpose.
> (I hope the situation changes for GNU/Linux, but I think it is
> unlikely).

uuterm already has full support for combining marks, including varied
placement of the diacritics. It doesn't use precomposed glyphs even if
they're available; it always decomposes to NFD (with some additional
decompositions necessary because of stupid Unicode policies) for
rendering.

> --
> The other issue here is that there's no standard for glyph numbering,
> and Unicode doesn't represent glyphs, so there's really no way an
> application running on a terminal could directly print glyphs. Even if
> it could, just "cat file_with_indic_text.txt" on the terminal, or
> something simple like "ls", would probably not work as expected.
> 
> There is no need for glyph numbers and that is one the strong points
> of Unicode.

I agree totally. However it does mean that applications running on a
terminal don't have any way to operate in terms of glyphs. Everything
they do must be in terms of characters. This is why we're only able to
consider character width and not glyph width for the purposes of
spacing.

> I would strongly suggest to look over the HarfBuzz
> library which is slowly evolving which will allow you to use the
> work of the best minds in the community. It will transform
> codepoints into glyphs, which you can then use. (You can also use
> Pango if need be).

uuterm is based entirely on bitmap fonts, so these are not appropriate
solutions for it and probably not for kernel-level console drivers
either. However, any character-width tables agreed upon should be able
to be used reasonably with OpenType fonts too of course. It would be
silly to try to adopt a standard that excludes a popular modern
technology. However just like with Latin, fonts whose metrics don't
fit well with the cell widths wouldn't look very good in a terminal
emulator.

IMO, in a way this is part of an argument for the "excessive" spacing
too -- if there's extra space you can fit almost any font in there...
and optionally scale it to try to fill up the space if desired, or
distribute the extra spacing equally spread-out, etc.

> My (naive)
> understanding is that Kannada conjuncts take place mostly as a
> "subscript" to the bottom-right of the initial consonant and vowel
> mark, so perhaps they'll look fairly proper in such a scheme.
> 
> ---
> This is not always true. For Kannada, I will try to confirm that.

I have a friend I can check with too, but going from the sparse
information in the Unicode specs and sites like Omniglot and
Wikipedia, it seems to be true that even 'subjunct' conjunct
characters use some of their own horizontal space. Sometimes
characters that would definitely need 2 cells on their own are simple
enough to fit in one cell when they are a subjunct character though,
so spacing is not entirely ideal, but the glyphs I experimented with
drawing seemed to fit legibly anyway. I can send you the xbm files if
you're interested in seeing. (They're not hideously ugly like the
ascii art below.. :)

> If you mean to say that each logical cluster will be allocated
> enough width equal to the sum of the widths of each character in
> that cluster, then I think you will allocate much too much space :)

Yes, I know. :) But given the choice between too much and not enough,
too much is better.

Can I ask you if something like the following (aside from the bad
ascii art :) is horribly offensive:

pa:
   #
   #
  ##   #
 #  #  #
 #  #  #
   #

ppa:
  #
   ## #
  #  ##
  #  ##
###
  #
###
   #  #   #
 ##
(became wide because it was allocated 2 spaces due to two "pa"
characters..)

Hopefully these pictures explain a bit of one way that excess space
could be filled up. Whether it looks reasonable or not is something I'd
welcome your opinion on.

Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-11-02 Thread Rich Felker
On Wed, Nov 01, 2006 at 01:34:14PM +0600, Christopher Fynn wrote:
> Yes, Indic scripts like Malayalam need specific console fonts. I think 
> for console applications legibility is more important that beauty.
> 
> Why not use the typefaces used in old-fashioned Indian typewriters as a 
> starting point? Most of the popular mono-with fonts for Latin (Courier 
> etc.) are based on typewriter faces.
> 
> Manual mechanical typewriters had a fixed advance width and the 
> "resolution" was fairly low - a lot of care and expertise went into 
> designing typefaces that were legible within these constraints.

Thanks for the constructive ideas. Of course you're totally right,
this approach makes sense. There is still the character/glyph issue
with regard to width, since typewriters of course work with glyphs
rather than characters, but that's unavoidable.

> I know typewriters made by companies like Remington were manufactured 
> for most Indian scripts - and I suspect a lot of these machines are 
> still around - so it shouldn't be too hard to come up with some type 
> samples to use as a starting point.

Yes, I'm sure they are. I suppose now I just have to find someone who
has one and who can explain it well.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-10-31 Thread Rich Felker
On Tue, Oct 31, 2006 at 09:37:34AM -0800, rajeev joseph sebastian wrote:
> Hi Rich Felker,
> 
> I find your work to provide support for Indic text on
> console/terminal to be admirable, and yes, any kind of display is
> far better than none at all (and I do not consider your statement
> insulting) :)
> 
> What I was referring to was a comment along the lines of "... have a
> set of wcwidth classes (say, 1, 2, and 3) and assign - glyphs - to
> one of those classes... ". (Please forgive me if I misunderstood the
> last few posts.) The word to note is "glyph". What I'm saying is you
> cannot in advance specify the width of any given conjunct. It may be
> different in different fonts.

Yes, my use of the word character rather than glyph was intentional
however. I know that the typographically correct way to do spacing
would be to measure the width of glyphs, but for better or worse the
only standardized api (wcwidth) works in terms of characters, and
terminals work in terms of characters. Sometimes this has benefits;
for example it makes it so you can hilight text that was printed to
the terminal and paste it into other apps or back into the terminal,
with exact results which are suitable for filenames and such. This
might not be possible if the app running in the terminal had converted
the text to a glyph representation. So in a way it's nice that the
character->glyph conversion is done at the last step, in the terminal,
since it keeps the data in the logical representation instead of the
presentation form. Of course it also has downsides too as I'm sure
we're all aware.
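
For reference, this is the whole extent of what an application running on
a terminal can portably know about width -- essentially what wcswidth()
already does, shown here as a tiny self-contained helper:

#define _XOPEN_SOURCE 700
#include <wchar.h>

/* Count the display columns a wide-character string should occupy,
 * using only the standardized per-character widths. */
static int columns(const wchar_t *s)
{
    int total = 0;
    for (; *s; s++) {
        int w = wcwidth(*s);      /* 0 for combining marks, 2 for wide CJK */
        if (w < 0)
            return -1;            /* non-printable character in the string */
        total += w;
    }
    return total;
}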

The other issue here is that there's no standard for glyph numbering,
and Unicode doesn't represent glyphs, so there's really no way an
application running on a terminal could directly print glyphs. Even if
it could, just "cat file_with_indic_text.txt" on the terminal, or
something simple like "ls", would probably not work as expected.

My hope is to work out a set of width assignments for characters so
that reasonable glyph presentations of the character sequence always
fit in the spacing provided by the sum of the "character widths".
Unfortunately this may result in excess spacing in some (many?) cases,
but I hope it can be made usable if not elegant. My (naive)
understanding is that Kannada conjuncts take place mostly as a
"subscript" to the bottom-right of the initial consonant and vowel
mark, so perhaps they'll look fairly proper in such a scheme.

> I suppose, we need to develop console specific fonts which can make
> proper use of the available width classes (or the structure you
> propose), however, I don't think any research has occurred in this
> regard.

Well, as long as a reasonable font size were chosen, any font that
fits into the (possibly excessive) width allocation could be used in
principle. For uuterm I'm working on 8x16-cell (and later other larger
sizes) bitmap fonts, which I find much more usable, but there's no
reason other terminal emulators like mlterm couldn't use truetype
fonts in this framework.

> So, a proper answer to your question: how many width classes, really
> needs a lot of work both artistic as well as technical. (Malayalam
> has about 950 conjuncts, so it has to be seen how they can fit into
> those classes).

Well my question is much simpler I think: given a character, what's
the "most space" it can take up in any conjunct it forms?

> Speaking of curses, doesnt Debian/(K)ubuntu use curses for its
> installer ? I remember telling the Kubuntu devels to remove Hindi
> from the list of languages, because looking at the rendering is
> really horrible (misplaced vowels, and so many other things,
> unrelated to spacing/width).

Yes.. it's not really a curses problem though. As long as the terminal
supports reordering and ligatures, using curses should not be much of
a problem. I still need to write the reordering stuff for uuterm
though.

> It is unfortunate, that many developers think that by using
> widestrings for each character is equivalent to support for all
> languages under Unicode. I guess some even think that the
> dotted-circle is a part of the script ;)

Haha yeah. I still can't believe Roman Czyborra drew the original GNU
Unifont with those hideous dotted circles in it... (Yes he knew they
weren't part of the script, but...) My hope is to make it so that
using multibyte char functions + wcwidth is sufficient for _usable_
support for all langs in apps that run on terminals. Then, as more
users of these langs use the apps in question, hopefully other things
(like line folding in scripts without word spacing, better spacing,
integration with input methods, etc.) will come. Unlike most of the
GUI projects working on these issues my goal is not to put
word-processor-type layout in every app, just to fix what's broken and
make them usable with more languages.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-10-30 Thread Rich Felker
On Mon, Oct 30, 2006 at 04:17:54AM -0800, rajeev joseph sebastian wrote:
> Hello Rich Felker,
> 
> It is impossible to fit Malayalam "glyphs" into a given width class,
> if you want even barely aesthetic text. This is because a given
> sequence of Unicode characters may map into somewhat different
> conjunct styles depending on the font: either proper top to bottom
> (subjoining), or left to right (adjoining) or something in between
> as well :)

Yes, I'm aware of the aesthetic considerations but between the choice
of seeing nothing at all and seeing something with excessive spacing
(still correctly subjoining, but with extra width/spacing to make up
for the second character not using horizontal space), wouldn't the
latter be preferable? I don't claim it will be pretty but I believe
one can put together something which at least avoids being hideously
ugly. I also don't mean to insult your script by presenting it in an
ugly way (even having "i" and "m" the same width is ugly although much
less severely so), but a terminal and the apps that can be run on it
are quite useful IMO and it seems a shame for many people to be unable
to use them on account of language.

BTW the situation for Kannada seems much less severe... do you know
enough about the script to confirm this?

Thanks for the comments.

Rich


P.S. There's also the possibility of treating syllable clusters as the
fundamental unit of display and requiring a context-sensative function
rather than wcwidth to measure width; however from my experience
getting application maintainers just to fix their handling of
nonspacing characters is difficult enough without asking them to add
script-specific processing. Also the curses library (which is a bad
library anyway but many apps use it) doesn't support this model. :(
IMO the best long-term solution is to support both, with a terminal
escape to switch the terminal between "dumb" wcwidth-based spacing for
compatibility with apps that are not specifically Indic-script aware,
and "smart" context-sensitive spacing.


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-10-29 Thread Rich Felker
In addition to the issues I raised before about consistency of width
under canonical equivalence, I've found additional problems in the
width definitions which are not technical issues like before, but just
feasibility-of-presentation issues. Specifically, several Indic
scripts including Kannada and Malayalam have several characters which
require 6 or 7 vertical strokes for their standard presentation
glyphs, and numerous characters that require 4 or 5. Moreover, the
standard glyph shapes for these characters are roughly twice as wide
(sometimes more than twice) as they are tall.

This puts their horizontal complexity on par with most ideographic
characters, and makes it impossible to render them legibly in a single
character cell without huge font size. The possible courses of action
are:

1. Leave them with wcwidth of 1 anyway and assume everyone will use
   huge font sizes or else put up with completely illegible glyphs.

2. Assign a global wcwidth of 2 to the affected scripts.

3. Perform "a careful analysis not only of each Unicode character,
   but also of each presentation form", as Markus suggested in his
   wcwidth.c comments, assigning width of 1/2[/3??] on a per-character
   basis.

IMO course 1 is ridiculous. The only argument for it is compatibility,
but obviously no one has ever tried using wcwidth with these scripts
since it just plain doesn't work.

Course 3 is difficult but might give the most visually pleasing
results. On the other hand, it may tend to lock one into a particular
style of presentation forms. If preferred glyph forms change due to
"reforms" or just stylistic preferences, users could be left with a
mess. Part of the analysis for #3 would have to include making sure
that the width assignments could remain reasonable under such
variations, as opposed to being font-specific, but this is probably
not infeasible as long as the amount of "width>1" characters is kept
to a minimum.

Finally there's course 2. In a way it's sort of a cop-out, taking the
easy approach of "fixed width", but that's what character cell widths
have done ever since "i" and "m" received the same width of 1 column.
It's font-independent and ensures that text in a single script can
align well in columns regardless of which characters are used.
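
In practice course 2 would amount to something like the following local
wrapper (a sketch only, not a libc patch; the block ranges are just the
two scripts named above, and combining marks keep their zero width):

#define _XOPEN_SOURCE 700
#include <wchar.h>

/* "Course 2" as a local override: printable Kannada and Malayalam
 * characters all report 2 columns; combining marks stay at width 0. */
static int wcwidth_course2(wchar_t c)
{
    if ((c >= 0x0C80 && c <= 0x0CFF) ||       /* Kannada block */
        (c >= 0x0D00 && c <= 0x0D7F)) {       /* Malayalam block */
        int w = wcwidth(c);
        return w > 0 ? 2 : w;
    }
    return wcwidth(c);
}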

I can prepare example bitmaps if anyone is interested in seeing what
the choices might look like, and probably will do this soon anyway.
Again, my goal is revising the wcwidth data (which Markus labelled as
incomplete in the original version) to account for scripts for which
it is not currently being used and for which it does not currently
provide reasonable results. But it's useless for me to just say what I
think it should be. There should be some sort of sane process here, by
which we arrive at a de facto standard which glibc and other
implementations can adopt.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-10-16 Thread Rich Felker
Sorry I originally replied off-list to Bruno because the list mail was
slow coming thru and I thought he was just mailing me in private..

On Mon, Oct 16, 2006 at 05:38:45PM -0700, Ben Wiley Sittler wrote:
> just tried this in a few terminals, here are the results:
> 
> GNOME Terminal 2.16.1:
> U+0D30 U+0D4A displayed with width 3
> U+0D30 U+0D46 U+0D3E displayed with width 3
> NOTE: displays very differently in each case
> 
> Konsole 1.6.5:
> U+0D30 U+0D4A displayed with width 3
> U+0D30 U+0D46 U+0D3E displayed with width 4
> NOTE: displays very differently in each case
> 
> mlterm 2.9.3:
> U+0D30 U+0D4A displayed with width 2
> U+0D30 U+0D46 U+0D3E displayed with width 2
> NOTE: displays identically in each case

As we can see, _none_ of these agrees with the current wcwidth
implementation. In fact I'm pretty sure they all ignore wcwidth and
use their own (possibly font-specific) interpretation of width, which
fundamentally prevents the terminal from being usable for anything
involving columns or cursor positioning.

If they don't even agree with the current wcwidth, and the current
wcwidth cannot reasonably be used for Indic scripts, I see no good
reason why wcwidth tables shouldn't be fixed to at least match values
that _could_ be used for reasonable rendering...
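
For anyone wanting to reproduce these measurements, the usual probe is a
plain VT100/xterm cursor-position query; a sketch with error handling
trimmed:

#include <stdio.h>
#include <termios.h>
#include <unistd.h>

/* Print utf8_seq at column 1, then ask the terminal where the cursor
 * ended up with the DSR query ESC [ 6 n; the reported column tells us
 * how many cells the terminal really used.  Assumes stdin/stdout are
 * the terminal. */
static int measured_width(const char *utf8_seq)
{
    struct termios saved, raw;
    int row = 0, col = 0;

    tcgetattr(STDIN_FILENO, &saved);
    raw = saved;
    raw.c_lflag &= ~(ICANON | ECHO);          /* read the reply unbuffered */
    tcsetattr(STDIN_FILENO, TCSANOW, &raw);

    printf("\r%s\033[6n", utf8_seq);
    fflush(stdout);
    if (scanf("\033[%d;%dR", &row, &col) != 2)
        col = 0;

    tcsetattr(STDIN_FILENO, TCSANOW, &saved);
    printf("\r\033[K");                       /* erase the test line */
    return col - 1;
}

Feeding it the UTF-8 bytes for U+0D30 U+0D4A ("\xe0\xb4\xb0\xe0\xb5\x8a")
should reproduce the 3/3/2 spread reported above.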

> >What rendering to other terminal emulators produce for these characters,
> >especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit
> >a patch to glibc based on the data of just 1 terminal emulator.

As I commented in private to Bruno, Apple's Terminal.app even has
broken cursor positioning behavior for CJK and nonspacing characters,
so I think it's hopeless to try to use it for Indic scripts...

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Proposed fix for Malayalam (& other Indic?) chars and wcwidth

2006-10-13 Thread Rich Felker
Working on uuterm[1], I've run into a problem with the characters
0D4A-0D4C and possibly others like them, in regards to wcwidth(3)
behavior. These characters are combining marks that attach on both
sides of a cluster, and have canonical equivalence to the two separate
pieces from which they are built, but yet Markus' wcwidth
implementation and GNU libc assign them a width of 1. It appears very
obvious to me that there's no hope of rendering both of these parts
using only 1 character cell on a character cell device, and even if it
were possible, it also seems horribly wrong for canonically equivalent
strings to have different widths.

I propose amending the wcwidth definitions to assign these characters
(and any like them) a width of 2. Furthermore, I would suggest that
any characters with canonical decompositions be assigned a width that
is the sum of the widths of the decomposition into NFD. This would
avoid similar unfortunate situations in the future that might not yet
have been found. It may also be desirable to do this for compatibility
decompositions (like "dz", etc.); however I suspect it's unlikely that
anyone would use such characters in non-legacy data anyway.
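
The proposed rule is mechanical enough to sketch. canonical_decomposition()
below is a hypothetical helper (no such libc function exists) standing in
for a lookup of the UnicodeData canonical decomposition field:

#define _XOPEN_SOURCE 700
#include <wchar.h>

/* Hypothetical helper: return the canonical decomposition of c as a
 * NUL-terminated string, or NULL if c has none (data from UnicodeData.txt). */
const wchar_t *canonical_decomposition(wchar_t c);

/* A character's width is the sum of the widths of its full (NFD)
 * decomposition, so canonically equivalent strings always agree. */
static int proposed_width(wchar_t c)
{
    const wchar_t *d = canonical_decomposition(c);
    if (!d)
        return wcwidth(c);
    int sum = 0;
    for (; *d; d++)
        sum += proposed_width(*d);   /* recurse until fully decomposed */
    return sum;
}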

BTW I don't think there's any harm here in breaking compatibility with
existing practice, since obviously no one is using the results of
wcwidth on these characters or they would already have run into this
problem.

Rich


[1] http://svn.mplayerhq.hu/uuterm/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Announcing uuterm and ucf (universal charcell font)

2006-10-09 Thread Rich Felker
On Mon, Oct 09, 2006 at 12:37:24PM -0600, Wesley J. Landaker wrote:
> On Thursday 05 October 2006 16:03, Rich Felker wrote:
> > A few comments on "Why not just use OpenType??":
> >
> > - The GSUB model does not adapt well to a character cell device where
> >   characters are organized into cells and where arbitrary string
> >   replacements don't make sense.
> >
> > - The glyph metric data is as large as the actual glyphs, doubling
> >   font size. Charcell fonts don't need any glyph metrics.
> >
> > - I don't think you can implement OpenType in less than 100 lines of
> >   C. The UCF char-to-glyph mapping algorithm is easy to implement and
> >   tiny.
> >
> > - Personally I like solutions that are adapted to the nature of the
> >   particular problem (character cell device) rather than trying to
> >   apply an overly general solution that will be awkward at best.
> >
> > - Something like UCF has a chance of getting into *NIX console drivers
> >   someday. I doubt anything OpenType-based would ever pass the
> >   necessary bloat tests to get integrated at such a low level.
> 
> The main point here, which I don't argue against, is that OpenType is complex 
> and bloated when applied to minimally simple charcell devices. So, say I want 
> to go implement this right away... I code it up and... ah, no fonts!

Actually we have all of "GNU Unifont" plus plenty of other bitmap
fonts, all of which are easy to convert. Unfortunately the
European/Western glyphs in GNU Unifont are extremely ugly; if it
weren't for that I would just have converted them already.

I'll be working on the scripts that are interesting to me, but my
viewpoint here is as follows: there are VERY MANY scripts with
absolutely no terminal emulator that can display them, or with only
one locale-specific terminal emulator with very poor features. If a
terminal emulator implements UCF support, it _automatically_ supports
these scripts as soon as someone makes a font. No coding is required
by users wanting to get their script supported; just font drawing.
While this doesn't help so much with the goal of getting a complete
font for all scripts, it does make it very easy to achieve the local
goal of supporting just one or two scripts you need, as the need
arises.

> To help create UCF fonts, it seems like having an OpenType to UCF converter 
> would be a *really* big help.

Well, I mostly disagree. TrueType/OpenType fonts simply do not make
legible character cell fonts, between not being designed for fixed
width and the classic problem of poor rendering at small sizes.

In any case, if you can make bitmaps from your OpenType fonts, it's
trivial to use the glyphs in a UCF font, and programs to make bitmaps
(e.g. BDF) out of OpenType fonts already exist. However, the OpenType
tables for substitutions and positioning are built on an entirely
different framework of layout that's about character sequences,
baselines, and anchor points as opposed to character cells, so IMO
there's very little hope of converting such tables in a meaningful
way. If you have an idea for how this could be done, I'd be very
interested in hearing it!

Keep in mind that most glyphs don't even need any such tables. The
vast majority of glyphs are CJK. Also, the fact that UCF doesn't need
precomposed glyphs for accented characters cuts down vastly on the
number of glyphs needed. As an example, my current Tibetan UCF font
has only 113 glyphs because it makes powerful use of combining. Fonts
with precomposed glyphs can have well over 1000. The situation for
Latin is similar.

> Even if you still had to tweak it manually 
> afterward for >75% of the glyphs, it would still be a big win, reduce a lot 
> of manual labor, and would help tide the "gee, UCF sounds like a good idea; 
> too bad there will never be any fonts" arguments.

Well, we'll see. :)

> Not to distract from your work here,

No problem, comments are welcome.

> but you implied that you are going to 
> work on converting fonts manually. Even just for your own use, wouldn't it 
> save you time in the long run to get a minimal OpenType to UCF converter 
> working?

Well, my plan right now for fonts is split into several parts:

For Latin, Cyrillic, Greek, etc. I plan to compile fonts in several
styles: one that's classic VGA-style glyphs, one with a more standard
modern non-bold terminal look, and one based on the font I personally
use, which I designed for Latin-only a long time ago (extending it to
non-Latin alphabets). The first two are matters of importing; the
latter is a matter of drawing.

For other scripts, I'm converting glyphs if there are nice existing
ones (for instance the Thai font in GNU Unifont seems decent) and
drawing new glyphs from scratch where no suitable ones exist.

Re: Announcing uuterm and ucf (universal charcell font)

2006-10-06 Thread Rich Felker
[cc'ing the list since i think it's relevant]

On Fri, Oct 06, 2006 at 04:55:51PM -0400, Daniel Glassey wrote:
> btw there is discussion about trying to integrate as much as possible on
> http://live.gnome.org/UnifiedTextLayoutEngine that you might like to
> contribute to.

well sadly i think the only thing i could contribute to this is
detracting from it. i'm strongly against pushing common apis. what
needs to happen in this area is not for everyone to agree on a single
codebase to use (which will invariably be ill-suited to many people's
needs), but instead to move the topic of layout _out_ of the code and
into data or standards -- either new tables in fonts or generic tables
that apply to all fonts, much like the unicode tables are generic.
then, everyone can use whatever implementation (choice of language,
etc.) suits them while still agreeing on a common expected behavior.
but whatever is standardized _must_ always be behavior. not code. a
single codebase, free/libre or not, is not a standard but an
implementation!

graphite might be the solution we're looking for, or it might be
ridiculously overcomplex and bloated. i'd need to research it more to
have an opinion but i'm quite interested in it. basically it's like a
much more powerful version of what i did with ucf (whereas ucf is
extremely simple because the task it needs to accomplish is simple).

one thing i'm sure of though, from working on uuterm and ucf: there
are two _very_ different issues people are trying to solve, and i
think many of the people working on them don't understand the
difference. "complex" stacking of diacritic marks is absolutely not a
layout issue. the solution can be fully specified in terms of simple
substitution tables, or substitution+positioning. ligatures can also
be entirely handled in this way -- even the notoriously-"complex"
indic scripts. i find it appalling that most apps don't support these
correctly and then claim it's because of complex layout issues. part
of my intent in the experiment of uuterm is demonstrating that
combining stacks, shaping, and ligatures are not a complex layout
issue.

rendering bidi text, diagonal urdu, mixed horizontal and vertical text
flows, etc. is complex (and except for bidi these things probably only
belong in word processing, desktop publishing, web browsers, etc. --
not your average plaintext textbox). on the other hand getting
combining stacks and ligatures right is _not_ complex. having done it
in less than 100 lines of c, i can now say this with confidence...

rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Announcing uuterm and ucf (universal charcell font)

2006-10-05 Thread Rich Felker
After much work, I finally have a working (but still experimental)
version of uuterm and the "ucf" bitmap font format I proposed in
August. Source for uuterm is browsable at
http://svn.mplayerhq.hu/uuterm/ and a sample ucf font is linked from
the included README.

Since ucf is probably more interesting to members of this list than
particular software, I'll skip the stuff about uuterm and just get to
the point of ucf. I based the design loosely on Markus Kuhn's old
proposal for a bitmap font format that recognizes the difference
between glyphs and characters. "Source code" for a ucf font looks
like:

# sa+la
:7B1129650300 0F66+0FB3

# sa+*
:7B294503 0F66+[0F90-0FAC] 0F66+[0FAE-0FB0] 0F66+[0FB4-0FBC]

...

# ra la sha ssa sa
:3E08081C22010100 0F62 0F6A
:394545491D030100 0F63
:0709096F391109010100 0F64
:7048487B4E4448404000 0F65
:7B11294563130100 0F66

The long hex number is a glyph bitmap, which can be edited easily with
a program like Roman Czyborra's "hexdraw" (from the GNU unifont
project), or imported/exported from other formats. Unlike unifont
however there is no limitation on character cell size.

The numbers that follow are the characters that the glyph can
represent, and in which contexts. In the above example, the first
glyph is used for the Tibetan consonant "sa" (U+0F66) when a combining
"la" (U+0FB3) is attached to it. The second glyph is used for "sa"
when any of the listed ranges of combining characters is attached, and
the third glyph is used in any case not matching previous ones.

Aside from the WITH_ATTACHED rule (represented by "+"), the format
also has ATTACHED_TO (for shaping combining marks depending on the
base character or previous combining mark) as well as rules for
examining the character(s) in the previous/next cell (in visual
order). Together with application of visual reordering rules by the
application, I believe this is sufficient for nice (not perfect, but
on a comparable level to rendering English text monospaced)
presentation of Indic text.
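
To give a feel for how little machinery the format needs, here is a rough
reader for glyph lines like the ones above. It is not uuterm's parser; it
only understands a bare character or a single A+B WITH_ATTACHED pair, not
the bracketed ranges or the other rule types:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

struct ucf_glyph {
    unsigned char rows[64];   /* raw bitmap bytes, two hex digits per byte */
    int nrows;
    long ch;                  /* character the glyph represents */
    long attached;            /* WITH_ATTACHED partner, or -1 if none */
};

/* Parse one ":<hexbitmap> <char>[+<char>]" line; returns 0 on success. */
static int parse_glyph_line(const char *line, struct ucf_glyph *g)
{
    const char *p = line;
    if (*p++ != ':')                          /* skip comments and blank lines */
        return -1;
    g->nrows = 0;
    while (g->nrows < 64 &&
           isxdigit((unsigned char)p[0]) && isxdigit((unsigned char)p[1])) {
        unsigned byte;
        sscanf(p, "%2x", &byte);
        g->rows[g->nrows++] = (unsigned char)byte;
        p += 2;
    }
    g->ch = strtol(p, (char **)&p, 16);       /* first character spec */
    g->attached = (*p == '+') ? strtol(p + 1, NULL, 16) : -1;
    return 0;
}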



I will be converting GNU unifont and/or other free 8x16-cell fonts to
make a fairly complete UCF font with all the necessary contextual
glyph replacements, but it will be a slow process and I'm in no hurry.
I'd welcome others who get interested in it to work on such a thing.
I'd also be interested in studying the feasibility of getting support
for UCF in various *NIX consoles.



A few comments on "Why not just use OpenType??":

- The GSUB model does not adapt well to a character cell device where
  characters are organized into cells and where arbitrary string
  replacements don't make sense.

- The glyph metric data is as large as the actual glyphs, doubling
  font size. Charcell fonts don't need any glyph metrics.

- I don't think you can implement OpenType in less than 100 lines of
  C. The UCF char-to-glyph mapping algorithm is easy to implement and
  tiny.

- Personally I like solutions that are adapted to the nature of the
  particular problem (character cell device) rather than trying to
  apply an overly general solution that will be awkward at best.

- Something like UCF has a chance of getting into *NIX console drivers
  someday. I doubt anything OpenType-based would ever pass the
  necessary bloat tests to get integrated at such a low level.


Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Bidi considered harmful? :)

2006-09-05 Thread Rich Felker
On Tue, Sep 05, 2006 at 08:07:14AM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >On Mon, Sep 04, 2006 at 08:19:02PM -0600, Mark Leisher wrote:
> 
> My last gasp on this conversation: I don't think you really understand 
> what you are talking about and won't until you get some hands-on 
> experience.

I'm not sure how to take this but whatever it is, it sounds
condescending and impolite. Was that the intent? What makes you think
I lack hands-on experience? The fact that my code is "too small" and
going to stay that way? Or just that it's not yet checked in for you
to view?

I'm sorry if my long messages to this list have offended, but my
intent was to seek input and discussion. I don't think anything I said
was any more offensive than similar things which Markus and other
people respected in this community have said. If it's just that you
don't have time to deal with this thread anymore, no problem, I won't
take offense.

> Goodbye and good luck.

Thanks I suppose..

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Bidi considered harmful? :)

2006-09-05 Thread Rich Felker
On Tue, Sep 05, 2006 at 12:57:08AM -0500, David Starner wrote:
> On 9/5/06, Rich Felker <[EMAIL PROTECTED]> wrote:
> >In all seriousness, though, unless you're dealing with image, music,
> >or movie files, text weighs in quite heavy in size.
> 
> As opposed to what? The vast majority of content is one of the four,
> and what's left--say, Flash files--don't seem particularly small
> compared to text.

I wasn't thinking of a website but rather a complete computer system.
I have several gigabytes of email which is larger than even a very
bloated OS and several hundred thousand times bigger than a
non-bloated OS. Multiply this by a factor of 3 or more and it could
quite easily go from "feasible to store" to "infeasible to store".

> >If you're making a website
> >without fluff and with lots of information, text size will be the
> >dominant factor in traffic. It's quite unfortunate that native
> >language text is 3 to 6(*) times larger in countries where bandwidth
> >is very expensive.
> 
> Welcome to HTTP 1.1. There's no reason not to compress the data while
> you're sending it across the network, which will fix the vast majority
> of this problem.

Here you have the issue of compression performance versus bandwidth,
especially relevant on a heavily loaded server (of course you can
precompress static texts). Also gzip doesn't perform so well on UTF-8
so bzip2 would be better but also much more cpu-hungry and I doubt any
clients support it.

Anyway all of this discussion is in a sense pointless since none of us
have the power to change any of the problem and since there's no real
solution even if we could. But sometimes you just have to bitch about
the stuff the Unicode folks messed up on..

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Bidi considered harmful? :)

2006-09-04 Thread Rich Felker
On Mon, Sep 04, 2006 at 11:44:26PM -0500, David Starner wrote:
> On 9/1/06, Rich Felker <[EMAIL PROTECTED]> wrote:
> >IMO the answer is common sense. Languages that have a low information
> >per character density (lots of letters/marks per word, especially
> >Indic) should be in 2-byte range and those with high information
> >density (especially ideographic) should be in 3-byte range. If it
> >weren't for so many legacy Latin blocks near the beginning of the
> >character set, most or all scripts for low-density languages could
> >have fit in the 2-byte range.
> 
> Once you compress the data with a decent compression scheme, you may
> as well store the data by writing out the full Unicode name (e.g.
> "LATIN CAPITAL LETTER OU"); the final result will be about the same
> size.

With some compression methods this is true, particularly bz2.

> Furthermore, you can fit a decent sized novel on a floppy
> uncompressed and a decent sized library on a DVD uncompressed.

Yet somehow the firefox source code is still 36 megs (bz2), and god
only knows how large OOO is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.

> The
> only application I've seen where text data size was really crucial was
> text messaging. Hence, common sense tells _me_ that we should put
> scripts used by heavily text-messaging cultures in the 2-byte range;
> that is, Latin, Hiragana and Katakana.

ROTFL! :)

In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size. It's true that in
html 75-90% of the size is usually tags (in ASCII) but that's due to
incompetence of the web designers and their inability to use CSS
correctly, not anything fundamental. If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.

Rich


(*) 6 because a large number of characters in Indic scripts will have
the virama (a combining character) attached to them to remove the
inherent vowel and attach them into clusters.


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Bidi considered harmful? :)

2006-09-04 Thread Rich Felker
On Mon, Sep 04, 2006 at 08:19:02PM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >
> >It went farther because it imposed language-specific semantics in
> >places where they do not belong. These semantics are correct with
> >sentences written in human languages which would not have been hard to
> >explicitly mark up, especially with a word processor doing it for you.
> >On the other hand they're horribly wrong in computer languages
> >(meaning any text file meant to be computer-read and -interpreted, not
> >just programming languages) where explicit markup is highly
> >undesirable or even illegal.
> 
> The Unicode Consortium is quite correctly more concerned with human
> languages than programming languages. I think you are arguing yourself
> into a dead end. Programming languages are ephemeral and some might 
> argue they are in fact slowly converging with human languages.

Arrg, C is not going away anytime soon. C is THE LANGUAGE as far as
POSIX is concerned. The reason I said "arrg" is that I feel like this
gap between the core values of the "i18n bloatware crowd" and the
"hardcore lowlevel efficient software crowd" is what keeps good i18n
out of the best software. When you talk about programming languages
converging with human languages, somehow all I can think of is Perl...
yuck! Larry Wall's been great about pushing Unicode and UTF-8, but
Perl itself is a horrible mess. The implementation is hopelessly bad
and there's little hope of there ever being a reimplementation.

Anyway as I've said again and again, it's no problem for human
language text to have explicit embedding tagging. It doesn't need to
conform to syntax rules (oh yeah Perl code doesn't need to either ;)).
Fancy editors can even insert tags for you. On the other hand,
stuffing extra control characters into machine-read texts with
specific syntactical and semantic rules is not possible. You can't
even just strip these characters when processing because, depending on
the semantics of the file, they may either be controlling the display
of the file or literal embedding controls to be used when the strings
from the file are printed to their final destination.

> >Or I could just ask: should we write C code in MS Word .doc format?
> 
> No reason to. Programming editors work well as they are and will
> continue to work well after being adapted for Unicode.

No, if they perform the algorithm in UAX#9 they will display garbled
unreadable code. Or does C somehow qualify as a "higher level
protocol" for formatting?

> You don't appear to have any experience writing lexical scanners for
> programming languages. If you did, you would know how utterly trivial it
> is to ignore embedded bidi codes an editor might introduce.

I'm quite aware that it's simple to code, but also illegal according
to the specs. Also you're ignoring the more troublesome issues...
Obviously you can't remove them inside strings. :) Issues with
comments too..

> Though I haven't checked myself, I wouldn't be surprised if Perl,
> Python, PHP, and a host of other programming languages weren't already
> doing this, making your concerns pointless.

I doubt it, but even it they do, these are toy languages with one
implementation and no specification (and in Perl's case it's hopeless
to even try to write one). It's easy to hack
whatever you want and break compatibility with every new release of
the language when your implementation is the only one. It's much
harder when you're working with an international standard for a
language that's been around (and rather stable!) approaching-40-years
and intended to have multiple interoperable implementations.

> You can't seriously expect readers of RTL
> languages to just throw away everything they've learned since childhood
> and learn to read their mathematical expressions backwards? Or simply
> require that their scripts never appear in a plain text file? That is
> ignorant at best and arrogant at worst.

I've seen examples that show that UAX#9 just butchers mathematical
expressions in the absence of explicit bidi control.

> You really need to start looking at code and stop pontificating from a
> poorly understood position. Just about every programming editor out
> there is already aware of programming language syntax. Many different
> programming languages in most cases.

Cheap regex-based syntax hilighting is not the same thing at all. But
this is aside from the point, that it's fundamentally WRONG to need a
special tool that knows about the syntax of your computer language in
order to edit it. What if you've designed your own language to solve a
particular problem? Do you have to go and modify your editor to teach it
the new syntax before you can even read your own code?

Re: Bidi considered harmful? :)

2006-09-01 Thread Rich Felker
On Fri, Sep 01, 2006 at 03:46:44PM -0600, Mark Leisher wrote:
> Did it every occur to you that it wasn't the "word processing mentality" 
> of the Unicode designers that led to ambiguities surviving in plain 
> text? It is simply the fact that there is no nice neat solution. Unicode 
> went farther than just about anyone else in solving the general case of 
> reordering plain bidi text for display without explicit directional codes.

It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct with
sentences written in human languages which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.

> Why does plain text still exist?

Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.

Or I could just ask: should we write C code in MS Word .doc format?

> >A bidi algorithm with minimal/no
> >implicit behavior works fine as long as you are not mixing
> >languages/scripts, and when mixing scripts it makes sense to use
> >explicit embedding -- especially since the cases of mixed scripts that
> >MUST work without formatting controls are files that are meant to be
> >machine-interpreted as opposed to pretty-printed for human
> >consumption.
> 
> I'm not quite sure what point you are trying to make here. Do away with 
> plain text?

No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human language
texts and increasing in use for computer language texts, and because
the display issues in human language texts can be solved with explicit
embeddign markers (which an editor or word processor could even
auto-insert for you) while the same marks are unwelcome in computer
languages.

> >In particular, an algorithm that only applies reordering within single
> >'words' would give the desired effects for writing numbers in an RTL
> >context and for writing single LTR words in a RTL context or single
> >RTL words in a LTR context. Anything more than that (with unlimited
> >long range reordering behavior) would then require explicit embedding.
> 
> You are aware that numeric expressions can be written differently in 
> Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions 
> differ (i.e. 1/2 in  Latin and Hebrew would be presented as 2/1 in 
> Arabic). This also affects other characters often used with numbers such 
> as percent and dollar sign. So even within strictly RTL scripts, 
> different reordering is required depending on which script is being 
> used. But if you know a priori which script is in use, reordering is 
> trivial.

This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.

> This is the choice of each programming language designer: either allow 
> directional override codes in the source or ban them. Those than ban 
> them obviously assume that knowledge of the language's syntax is 
> sufficient to allow an editor to present the source code text reasonably 
> well.

It's simply not acceptable to need an editor that's aware of language
syntax in order to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to prevent the
bidi algo from taking effect but that's really ugly and essentially
makes RTL scripts unusable in programming if the editor applies
Unicode bidi algo to the display.

> >>You left out the part where Unicode says that none of these things is 
> >>strictly required.
> >
> >This is blatently false. UAX#9 talks about PARAGRAPHS. Text files
> >consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
> >1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
> >files as it is written because text files do not define paragraphs.
> 
> How is a line ending with newline in a text file not a paragraph? A 
> poorly formatted paragraph, to be sure, but a paragraph nonetheless. The 
> Unicode Standard says paragraph separators are required for the 
> reordering algorithm. There is no reason why a line can't be viewed as a 
> paragraph. And it even works reasonably well most of the time.

Actually it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the bidi embedding levels will be
broken and cannot be processed by the UAX#9 algorithm without trying
to reconstruct an idea of what the whole "paragraph" meant. Also, a
problem occurs if the first strong character of a continuation line has a
different direction from the real paragraph's base direction, since
treating each line as its own paragraph then picks the wrong embedding
level.

Re: Bidi considered harmful? :)

2006-09-01 Thread Rich Felker
On Fri, Sep 01, 2006 at 09:36:44AM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >
> >If that were the problem it would be trivial. The problems are much
> >more fundamental. The key examples you should look at are things like:
> >printf("%s %d %d %s\n", string1, number2, number3, string4); where the
> >output is intended to be columnar. Everything is fine until someone
> >puts in data where string1 ends in RTL text and string4 begins with
> >RTL text, in which case the numbers switch places. This kind of
> >instability is not just awkward; it shows that implicit bidi is
> >fundamentally broken.
> 
> I can say with certainty born of 10+ years of trying to implement an 
> implicit bidi reordering routine that "just does the right thing," there 
> are ambiguities that simply can't be avoided. Like your example.
> 
> Are one or both numbers associated with the RTL text or the LTR text? 
> Simple question, multiple answers. Some answers are simple, some are not.

Exactly. Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
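
Just to make that burden concrete, here's roughly the kind of "special
consideration" Unicode expects of a program, sketched in C: wrapping
each string field in explicit embedding marks (U+202A LEFT-TO-RIGHT
EMBEDDING ... U+202C POP DIRECTIONAL FORMATTING) so a later bidi
display pass can't drag the numeric columns around. This is only a
sketch of the idea, not code from any real program; choosing LRE vs
RLE correctly needs per-field knowledge of each string's
directionality, and the marks then pollute any file that's meant to be
machine-parsed, which is the whole point.

#include <stdio.h>

/* UTF-8 encodings of the explicit embedding controls. */
#define LRE "\xe2\x80\xaa"   /* U+202A LEFT-TO-RIGHT EMBEDDING */
#define PDF "\xe2\x80\xac"   /* U+202C POP DIRECTIONAL FORMATTING */

/* Sketch: wrap each string field in an embedding so its
   directionality can't leak out and reorder the numbers. */
void print_row(const char *s1, int n1, int n2, const char *s2)
{
    printf(LRE "%s" PDF " %d %d " LRE "%s" PDF "\n", s1, n1, n2, s2);
}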

> The Unicode bidi reordering algorithm is not fundamentally broken, it 
> simply provides a result that is correct in many, but not all cases. If 
> you can defy 30 years of experience in implicit bidi reordering 
> implementations and come up with one that does the correct thing all the 
> time, you could be a very rich man.

Why is implicit so important? A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.

In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
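
Here's a rough sketch (mine, not from any existing implementation) of
the kind of word-limited reordering I mean, for an LTR-dominant
display; the RTL-dominant case is symmetric. Word order is never
touched; only a run of RTL characters inside a single
whitespace-delimited word is reversed for display. is_rtl() stands in
for a real classifier (a lookup of the character's Bidi_Class);
combining marks, shaping, and error handling are all ignored.

#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Stand-in for a real classifier: nonzero for R/AL characters. */
extern int is_rtl(wchar_t c);

static void reverse(wchar_t *s, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++, n--) {
        wchar_t t = s[i]; s[i] = s[n-1]; s[n-1] = t;
    }
}

/* Rewrite a line from logical order into display order, in place,
   reversing only RTL runs that lie within a single word. */
void word_limited_reorder(wchar_t *line)
{
    size_t i = 0;
    while (line[i]) {
        if (iswspace(line[i]) || !is_rtl(line[i])) { i++; continue; }
        size_t start = i;
        while (line[i] && !iswspace(line[i]) && is_rtl(line[i]))
            i++;
        reverse(line + start, i - start);
    }
}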

> So you have a choice, adapt your config file reader to ignore a few 
> characters or come up with an algorithm that displays plain text 
> correctly all the time.

What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)

> >One of the unacceptable things that the Unicode consortium has done
> >(as opposed to ISO 10646 which, after their initial debacle, has been
> >quite reasonable and conservative in what they specify) is to presume
> >they can redefine what a text file is. This has included BOMs,
> >paragraph break character, implicit(?) deprecation of newline
> >character as a line/paragraph break, etc. Notice that all of these
> >redefinitions have been universally rejected by *NIX users because
> >they are incompatible with the *NIX notion of a text file. My view is
> >that implicit bidi is equally incompatible with text files and should
> >be rejected for the same reasons.
> >
> 
> You left out the part where Unicode says that none of these things is 
> strictly required.

This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.

> The *NIX community didn't reject anything. They 
> didn't need to. You also seem unaware of how much effort was made by 
> ISO, the Unicode Consortium, and all the national standards bodies to 
> avoid breaking a lot of existing practice.

I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2 which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as 

Re: Bidi considered harmful? :)

2006-09-01 Thread Rich Felker
On Fri, Sep 01, 2006 at 04:32:40PM +1000, George W Gerrity wrote:
> I did try to tell you that doing a terminal emulation properly would  
> be complex. I don't know if the algorithm is broken: I doubt it. But  
> it is difficult getting it to work properly and it essentially  
> requires internal tables for every glyph describing its direction and  
> orientation.

If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken. Even if it can be handled at the terminal
emulator level with special escapes and whatnot (and I believe it can,
albeit in very ugly ways) it simply cannot be handled in a plain text
file, for reasons like:

columna COLUMNB 1234 5678 columnc
columna COLUMNB 1234 5678 COLUMNC

Implicit bidi requires interpreting a flow of plain text as
sentence/paragraph content which is simply not a reasonable
assumption. Consider also what would happen if your text file is two
preformatted 32-character-wide paragraph columns side-by-side. Now
imagine the kind of havoc that could result if this sort of insanity
took place in the presentation of configuration files with critical
security settings, for instance where the strings are usernames (which
MUST be able to contain any letter character from any language) and
the numbers are permission levels. And certainly you can't just throw
explicit direction markers into a config file like that because they'd
alter the semantics (which should be purely byte-oriented; there's no
reason any program not displaying text should include code to process
the contents).

One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs,
paragraph break character, implicit(?) deprecation of newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.

This does not mean that storing text in 'visual order' is acceptable
either; that's just disgusting and makes correct ligatures/shaping
impossible. It just means that you cannot create a bidirectional
presentation from a text file without higher level markup. Instead you
can use a vertical presentation or either LTR or RTL presentation with
the opposite-directionality glyphs rotated 180°.

My observations were that this sort of presentation is much easier to
edit and quite possibly easier to read than a format where your eyes
have to switch scanning directions.

I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.

> > [...]
> >[1] There is a small problem that even without LTR scripts mixed in,
> >most RTL scripts are "bidirectional" due to numbers being written LTR.
> >However supporting reversed display of individual numbers (or even
> >individual words) is a trivial problem compared to full bidi text flow
> >and can be done without compromising reversibility and without complex
> >algorithms that cause misinterpretation of adjacent text.
> 
> No one using arabic script would accept reading it top to bottom: it  
> is simply never done (to the best of my knowledge), and so any  
> terminal emulator claiming to work with any script had better be able  
> to render the text correctly, including mixing rtl and ltr.

You misread the above. Of course no one using LTR scripts would want
to read top-to-bottom either. The intent is that users of RTL scripts
could use an _entirely_ RTL terminal with the LTR characters' glyphs
rotated 180° while LTR users could use an _entirely_ LTR terminal with
RTL glyphs rotated 180°. The exception noted in the footnote is that
RTL scripts actually require "bidi" for numbers, but I comment that
this is trivial compared to bidi and suffers from none of the
fundamental problems of bidi.

The vertical orientation thing is mostly of interest to Mongolian
users and perhaps some East Asian users, but it could also be
interesting to (a very few) users of both LTR and RTL scripts who use
both frequently and who want a more equal treatment of both,
especially if they find reading upside-down difficult.

Rich


P.S. Do you have any good screenshots with RTL or LTR embedded

Bidi considered harmful? :)

2006-08-31 Thread Rich Felker
I read an old thread on the XFree86 i18n list started by Markus Kuhn
suggesting (rather strongly) that bidi should not be supported at the
terminal level, as well as accusations (from other sources) by the author
of Yudit that UAX#9 bidi algo results in serious security issues due
to the irreversibility of the transformation and that it inevitably
butchers mathematical formulae.

I've also considered examples on my own, such as a program (not
necessarily terminal-aware, just text output) that prints lines of the
form "%s %d %d %s" without any special treatment (such as putting
explicit embedding marks around the %s fields) for bidi text, or a
terminal-based program that draws interface elements over top of
existing RTL text, resulting in nonsense.

In all cases, my personal opinion has been not just that UAX#9 is
broken, but that there's no way to implement any sort of implicit bidi
in a terminal emulator or in the display of text/plain data without
every single program having to go _far_ out of its way to ensure that
it won't give incorrect output when the input contains RTL characters,
which simply isn't going to happen, especially since it would
interfere with use in non-RTL scenarios. Other people may have
different opinions but I have not seen any viable solutions.



At the same time, I'm also very dissatisfied with the lack of proper
support for RTL scripts/languages in most applications and especially
at the terminal level, especially since Arabic is in such widespread
use and has great political importance in world affairs these days. I
do not accept that the solution is to just to print characters in the
wrong visual order.

.eerga ll'uoy tcepxe I ylbatrofmoc ecnetnes siht daer nac uoy sselnU



I experimented with the idea of mirroring glyphs to improve
readability, and was fairly surprised with how little it helped my
perception. Reading English text that had been graphically mirrored
remained almost as difficult as reading the above line, with the b/d
and p/q pairs causing significant pause in comprehension.

So then, reading UAX#9 again, I stumbled across the only section
that's not completely stupid (IMO of course):

5.4 Vertical Text

In the case of vertical line orientation, the bidirectional
algorithm is still used to determine the levels of the text.
However, these levels are not used to reorder the text, since the
characters are usually ordered uniformly from top to bottom.
Instead, the levels are used to determine the rotation of the
text. Sometimes vertical lines follow a vertical baseline in which
each character is oriented as normal (with no rotation), with
characters ordered from top to bottom whether they are Hebrew,
numbers, or Latin. When setting text using the Arabic script in
vertical lines, it is more common to employ a horizontal baseline
that is rotated by 90° counterclockwise so that the characters are
ordered from top to bottom. Latin text and numbers may be rotated
90° clockwise so that the characters are also ordered from top to
bottom.

What this provides is a suggested formatting that makes RTL and LTR
scripts both readable in a single-directional context, a vertical one.
Combined with the recent Mongolian script discussion on this list, I
believe this offers an alternate presentation form for documents that
mix LTR and RTL text without using bidi.

I'm not suggesting that everyone should switch to vertically-oriented
terminals or text-file presentation, although Mongolian users might
like such a setup, and it can certainly be one presentation option
that's fair to both RTL and LTR users by making both scripts quite
readable.

The key idea to take from the Mongolian discussion and from UAX#9 5.4
is that, by having glyphs for LTR and RTL scripts rotated 180°
relative to one another, both can appear legible in a common
directionality. Thus, perhaps LTR users could present legible
RTL-script text by rotating all glyphs 180° and displaying them in LTR
order, and likewise RTL users could use a dominant RTL direction with
LTR glyphs rotated 180° [1]. Like with Mongolian, directionality could
become a localized user preference, rather than a property of the
script.

Does this actually work?

I repeated my experiment with English text reading, rotating the
graphic representation by 180° rather than mirroring it left-right. I
was pleased to find that I could read it with similar ease (but not
quite as fast) as ordinary LTR English text. Surprisingly, p/d and b/q
confusion did not arise, perhaps due to the obvious visual distinction
between the ascent/descent space of the glyphs.

I do not claim these tests are scientific, since the only subject
participating was myself. :) But they are suggestive of an alternative
possible presentation form for mixed LTR/RTL scripts without utilizing
bidirectionality. I consider bidirectionality harmful because:

- It is inherently slow for one's eyes to jump back and forth
  

Re: Next Generation Console Font?

2006-08-20 Thread Rich Felker
On Sat, Aug 19, 2006 at 11:20:55AM -0700, Ben Wiley Sittler wrote:
> sorry, cat-typing sent that email a bit early. here's the rest:
> 
> for indic scripts and arabic having triple-cell ligatures is really
> indispensible for readable text.
> 
> for east asian text a ttb, rtl columnar display mode is really, really
> nice.

For a terminal? Why? Do you want to see:

 l
 s
  
 -
 l
 [...]

??? I suspect not. If anyone really does want this behavior, then by
all means they can make a terminal with different orientation. But
until I hear about someone really wanting this I'll assume such claims
come from faux-counter-imperial chauvinism where western academics in
ivory towers tell people in other cultures that they must "preserve
their traditions" for their own sake with no regard for practicality,
and end up doing nothing but _disadvantaging_ people.

> a passable job at least for CJK. how to handle single-cell vs.
> double-cell vs. triple-cell glyphs in vertical presentation is a

I've never heard of a triple-cell glyph. Certainly the "standard"
wcwidth (Kuhn's version) has no such thing.

> tricky problem - short runs (<= 2 cells) should probably be displayed
> as horizontal inclusions, longer runs should probably be rotated.

Nonsense. A terminal does not have the luxury to decide such things.
You're confusing "terminal" with "word processor" or maybe even with
TeX...

> why don't we have escape sequences for switching between the DBCS and
> non-DBCS cell behaviors, and for rotating the terminal display for
> vertical text vs. horizontal text?

Because it's not useful. Applications will not use it. All the
terminal emulator needs to do is:

1. display raw text in a form that's not offensive -- this is
   necessary so that terminal-unaware programs just writing to stdout
   will work.

2. provide cursor positioning functions (minimal) and (optionally)
   scrolling/insert/delete and other small optimizations.

Anything more is just pure bloat because it won't be supported by
curses and applications are written either to curses or to vt102.
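
And to be clear about how small item 2 really is: absolute cursor
positioning on a vt102-style terminal is just the CUP escape
(ESC [ row ; col H, 1-based), e.g.:

#include <stdio.h>

/* Move the cursor to a 1-based (row, col) position with the standard
   CUP sequence understood by vt102-compatible terminals. */
void move_cursor(int row, int col)
{
    printf("\033[%d;%dH", row, col);
    fflush(stdout);
}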

> Note that mixing vertical and
> horizontal is sometimes done in the typographic world but is probably
> not needed for terminal emulators (this requires a layout engine much
> more advanced than the unicode bidi algorithm, capable of laying out

This most certainly does not belong in a terminal emulator. Apps
(such as text based web browsers) wishing to do elegant
multi-orientation formatting can do the cursor positioning and such
themselves. Users preferring a vertical orientation can configure
their terminals as such. This is a matter of user preference, not
application control, and thus there should NOT be a way for
applications to control or override it.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Indic scripts and wcwidth: comments?

2006-08-18 Thread Rich Felker
On Fri, Aug 18, 2006 at 08:10:14PM +0200, Andries Brouwer wrote:
> On Fri, Aug 18, 2006 at 02:29:34PM +0200, Werner LEMBERG wrote:
> > 
> > Since I have no idea about Indic scripts, I won't and can't give a
> > comment.  I just want to note that Emacs supports Devanagari with
> > single, double, and triple width glyphs (IIRC); you may have a look
> > how they've done it -- from a technical point, not from an encoding
> > point.
> 
> There is a bug (that I have not investigated) in the use of
> "emacs -nw" on a uxterm. When symbols occur on the line
> of which no glyph is available, then emacs and uxterm have different
> ideas about the width of displayed strings, and corruption results.

This is exactly why it's a bug for the terminal emulator to use the
font's idea of glyph width whatsoever. The only correct implementation
is for the terminal emulator to use wcwidth and demand that the font
matches (and in fact render individual glyphs in cells, not use
string-rendering functions).
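
Roughly, the drawing loop has to look like this: column advance comes
from wcwidth(), never from the font's metrics. draw_glyph_in_cells()
is a hypothetical hook for whatever actually paints one glyph into a
cell span; combining characters and wcwidth() failures are glossed
over in this sketch.

#define _XOPEN_SOURCE 600
#include <wchar.h>

/* Hypothetical renderer: paint one glyph into ncells cells at (row, col). */
extern void draw_glyph_in_cells(wchar_t c, int row, int col, int ncells);

void draw_line(const wchar_t *s, int row)
{
    int col = 0;
    for (; *s; s++) {
        int w = wcwidth(*s);
        if (w <= 0)
            continue;          /* combining, control, or unknown */
        draw_glyph_in_cells(*s, row, col, w);
        col += w;              /* advance by cells, not pixel widths */
    }
}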

But yes, Werner was talking about Emacs GUI, not -nw. I doubt the GUI
is really relevant to character-cell stuff though unless they found a
way to coerce Indic scripts into character cells well...

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


