RE: Nicest UTF
D. Starner wrote:
> > Some won't convert any and will just start using UTF-8
> > for new ones. And this should be allowed.
>
> Why should it be allowed? You can't mix items with
> different unlabeled encodings willy-nilly. All you're going
> to get, all you can expect to get is a mess.

Easy for you to say. You're not the one who is going to answer the support calls. They WILL do it. You can jump up and down as much as you like, but they will. If I tell users what you are telling me, they will think I am mad and will stop using my application.

Lars
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> Some won't convert any and will just start using UTF-8 for new ones. And this should be allowed. Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess. When you say "you can't", it's excessive when speaking about filesystems, which DO NOT label their encoding, and allow multiple users to use and create files on shared filesystems with different locales having each a differnt encoding. So it does happen that the same filesystem stores multiple encodings for its filenames. It also happens that systems allow mounting remote filesystems shared on systems using distinct system encodings (so even if a filesystem is consistent, these filenames appear with various encodings, and this goes to more complex situations when they are crosslinked with soft links or URLs. Think about the web: it's a filesystem in itself, which uses names (URLs) include inconsistent encodings. Although there's a recommandation to use UTF-8 in URLs, this is not mandatory, and there are lots of hosts that use URLs created with some ISO-8859 charsets, or even Windows or Macintosh codepages. To resolve some problems, HTML specifications allow additional (but out-of-band) attributes to resolve the encoding used for resource contents, but this has no impact on URLs themselves. The current solution is to use "URL-encoding" and treat them as binary sequences with a restricted set of byte values, but this time it means transforming what was initially plain-text into some binary moniker. Unfortunately, many web search engines do use the URLs to qualify the pertinence of search keywords, instead of treating them only as blind monikers. Lots has been done to internationalize the domain names for use in IRIs, but URLs remain a mess and a mixture of various charsets, and IRIs are still rarely supported on browsers. The problem with URLs is that they must be allowed to contain any valid plain-text, notably for Form-Data, submitted with a GET method, because this plain-text data becomes part of a query-string, itself part of the URL. HTML does allow specifiying in the HTML form which encoding should be used for this form data, because servers won't always expect a single and consistent encoding; the absence of this specification is often interpreted in browsers as meaning that form-data must be encoded with the same charset as the HTML form itself, but not all browsers observe this rule (in addition many web pages are incorrectly labelled, simply because of incorrect or limited HTTP server configurations, and the standards specify that the charset specified in the HTTP headers have priority to the charset specified in encoded documents themselves; this was a poor decision, which is inconsistent with the usage of the same HTML documents on filesystems that do not store the charset used for the file content)... So don't think that this is simple. It is legitimate to be able to refer to some documents which we know are plain-text, but have unknown or ambiguous encodings (and there are many works related to the automated identification of lguage/charset pairs used in documents; none of these method are 100% exempt of false guesses). For clients trying to use these resources with ambiguous or unknown encodings, but that DO know that this is effectly plain-text (such as a filename), the solution to eliminate (ignore, not show, discard...) 
all filenames or documents that look incorrectly encoded may be the worst solution: it gives no information to the user that these documents are missing, and this does not allow these users to even determine (even if characters are incorrectly displayed) which alternate encoding to try. It's legitimate to think about solution allowing at least partial representation of these texts, so that the user can look at how it is effectively encoded and get hints about how to select the appropriate charset. Also, very lossy conversions (with U+FFFD) are not satisfying enough.
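To make the URL-encoding point concrete, here is a minimal Python sketch (the filename is hypothetical): percent-escaping is defined on raw bytes, so it round-trips the bytes faithfully, but the charset question simply reappears the moment the bytes have to be shown as text.

    # Percent-escaping works on bytes, independently of any charset.
    from urllib.parse import quote, unquote_to_bytes

    raw = b"sm\xe6k.txt"                     # bytes as stored by some filesystem
    url_form = quote(raw)                    # 'sm%E6k.txt'
    recovered = unquote_to_bytes(url_form)   # the same bytes come back

    print(recovered.decode("iso-8859-1"))    # smæk.txt  (one plausible reading)
    print(recovered.decode("iso-8859-2"))    # smćk.txt  (another reading, different text)
    # recovered.decode("utf-8") raises UnicodeDecodeError: these bytes are not
    # valid UTF-8, which is exactly the "unknown or ambiguous encoding" case.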
RE: Nicest UTF
> Some won't convert any and will just start using UTF-8
> for new ones. And this should be allowed.

Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess.
Re: Nicest UTF
Lars Kristan scripsit:
> > I'm using ISO-8859-2.
> In fact you're lucky. Many ISO-8859-1 filenames display correctly in
> ISO-8859-2. Not all users are so lucky.

It was a design point of ISO-8859-{1,2,3,4}, but not any other variants, that every character appears either at the same codepoint or not at all.

--
John Cowan  [EMAIL PROTECTED]           At times of peril or dubitation,
http://www.ccil.org/~cowan              Perform swift circular ambulation,
http://www.reutershealth.com            With loud and high-pitched ululation.
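A quick way to check that design point for the 8859-1/8859-2 pair, as a Python sketch (the expectation, not a guarantee stated here, is that no character shared by the two sets sits at a different byte value):

    m1 = {b: bytes([b]).decode("iso-8859-1") for b in range(256)}
    m2 = {b: bytes([b]).decode("iso-8859-2") for b in range(256)}
    pos1 = {c: b for b, c in m1.items()}
    # Characters present in both repertoires but at different byte values:
    violations = [c for b, c in m2.items() if c in pos1 and pos1[c] != b]
    print(violations)   # expected to be empty if the design point holds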
RE: Nicest UTF
D. Starner wrote:
> "Lars Kristan" writes:
> > > A system administrator (because he has access to all files).
> > My my, you are assuming all files are in the same encoding. And what about
> > all the references to the files in scripts? In configuration files? Soft
> > links? If you want to break things, this is definitely the way to do it.
>
> Was it ever really wise to use non-ASCII file names in scripts and
> configuration files?

It goes beyond that. Please see my reply to Marcin 'Qrczak' Kowalczyk.

> It's not very hard to convert soft links at the same time.

Please see my reply to Marcin 'Qrczak' Kowalczyk.

> Even if you can't do a system-wide change, it's easy enough to change the
> system files, and post a message about switching to UTF-8, and offering to
> assist any users with the change.

That's perfectly fine. But I started talking about this because I claimed that you are likely to end up having UTF-8 filenames alongside legacy-encoded filenames. If you do it gradually, that is precisely what is going to happen, at least for a certain period. But this period could be longer than expected. And as it turns out things are not simple; some users may never convert all the filenames. Some won't convert any and will just start using UTF-8 for new ones. And this should be allowed. Assuming that all filenames should be valid UTF-8 is a bad argument against my claims that applications should be able to process filenames with invalid UTF-8 sequences.

Lars
RE: Nicest UTF
Marcin 'Qrczak' Kowalczyk wrote:
> > My my, you are assuming all files are in the same encoding.
>
> Yes. Otherwise nothing shows filenames correctly to the user.

UNIX is a multi-user system. One user can use one locale and might never see files from another user who uses a different locale. And users can even have filenames in wrong locales in their own home directory. Copied from somewhere. Perhaps only a letter here and there does not display correctly, but this doesn't mean the user can't use the file.

> > And what about all the references to the files in scripts?
> > In configuration files?
>
> Such files rarely use non-ASCII characters. Non-ASCII characters are
> primarily used in names of documents created explicitly by the user.

Rarely. So only rare systems will not boot after the conversion. And only rare programs will no longer work. Is that acceptable? Plus, it might not be as rare as you think. It might be far more common in a country where not many people understand English and are not using Latin letters on top of it. Also, a script (a UNIX batch file) may have an ASCII name, but what if it processes some user documents for some purpose, and has a set of filenames hardcoded in it? What about MRU lists? What about documents that link to other documents? Mass renaming is a dangerous thing. It should be done gradually and with utmost care. And during this period, everything should keep working. If not, users won't even start the process.

> > Soft links?
>
> They can be fixed automatically.

Uh, yes, not a good example. Except in case one decides to allow the user to select an option to use U+FFFD instead of failing the conversion. Then you need to be extra careful, rename any files that convert to a single name, and keep track of everything so you can use the right names for the soft links. But yes, it can be done. If, on the other hand, you adopt the 'broken' conversion concept, you can convert all filenames in a single pass, and you don't need to build lists of soft links since you can convert them directly.

> > If you want to break things, this is definitely the way to do it.
>
> Using non-ASCII filenames is risky to begin with. Existing tools don't
> have a good answer to what should happen with these files when the
> default encoding used by the user changes, or when a user using a
> different encoding tries to access them.

Not really. On UNIX, it is all very well defined. A filename is a sequence of bytes which is only interpreted when it is displayed. You can place a filename in a script or a configuration file and the file will be identified and opened regardless of your locale setting. People like you and me avoid non-ASCII filenames. But not all users do.

> Mozilla doesn't show such filenames in a directory listing. You
> may consider it a bug, but this is a fact. Producing non-UTF-8 HTML
> labeled as UTF-8 would be wrong too. There is no good solution to
> the problem of filenames encoded in different encodings.

There is no good solution. True. And I am trying to find one. And yes, I would consider that a bug. They should probably use some escaping technique. And, funny thing, you would probably accept the escaping technique. But if you think about it, it is again representing invalid data with valid Unicode characters. And if un-escaping needs to be done, it introduces all the problems that you are pointing out for my 'broken' conversion. So, think of my 128 codepoints as an escaping technique. One with no overhead. One with little possibility of confusion. One that can be standardized, so that whoever comes across it will know exactly what it is. Which is definitely not true if we let each application devise its own escaping, with no way for them to interoperate.

> > As soon as you realize you cannot convert filenames to UTF-8, you
> > will see that all you can do is start adding new ones in UTF-8.
> > Or forget about Unicode.
>
> I'm not using a UTF-8 locale yet, because too many programs don't
> support it.

Like Mozilla. I am showing you the way programs can be made to work with UTF-8 faster and more easily. And really by fixing them, not by rewriting them. At least some programs, or some portions of programs. Then developers can concentrate on the things that do require extra attention, like strupr or isspace (or their equivalents).

> I'm using ISO-8859-2.

In fact you're lucky. Many ISO-8859-1 filenames display correctly in ISO-8859-2. Not all users are so lucky.

> But almost all filenames are ASCII.

Basically, you are avoiding the problem altogether. A wise decision. But it also means you don't know as much about this problem as I do.

Lars
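For readers trying to picture the escaping idea, here is a minimal Python sketch of it. The code points used (U+EE80..U+EEFF, in the private use area) are only stand-ins chosen for illustration; the proposal being discussed asks for a dedicated, standardized block, which does not exist.

    ESCAPE_BASE = 0xEE80   # hypothetical: one escape code point per byte 0x80..0xFF

    def decode_with_escapes(data: bytes) -> str:
        out, i = [], 0
        while i < len(data):
            # Decode the longest valid UTF-8 run starting at position i.
            for j in range(len(data), i, -1):
                try:
                    out.append(data[i:j].decode("utf-8"))
                    i = j
                    break
                except UnicodeDecodeError:
                    continue
            else:
                # data[i] starts an invalid sequence: escape that single byte.
                out.append(chr(ESCAPE_BASE + (data[i] - 0x80)))
                i += 1
        return "".join(out)

    def encode_with_escapes(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0x7F:
                out.append(0x80 + (cp - ESCAPE_BASE))   # restore the original byte
            else:
                out.extend(ch.encode("utf-8"))
        return bytes(out)

    name = b"valid \xc3\xa9 then invalid \xe9 byte"
    assert encode_with_escapes(decode_with_escapes(name)) == name   # round-trips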
Re: infinite combinations, was Re: Nicest UTF
On 11/12/2004 16:53, Peter R. Mueller-Roemer wrote:

... For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite.

In Hebrew it is actually possible to have up to 9 combining marks with a single base character: shin + sin/shin dot + dagesh + rafe + 2 vowel points + 2 accents + dot above + masora circle. SBL Hebrew and Ezra SIL both make a valiant attempt to display this lot but don't quite get there. But I think 5 is the maximum number which actually occurs with any one base character in the Hebrew Bible.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
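For readers who want to poke at such a sequence themselves, here is a small Python sketch building a base letter with several of the marks mentioned (the selection below is illustrative, not the full nine-mark example):

    import unicodedata

    seq = (
        "\u05E9"   # HEBREW LETTER SHIN (base)
        "\u05C1"   # HEBREW POINT SHIN DOT
        "\u05BC"   # HEBREW POINT DAGESH OR MAPIQ
        "\u05BF"   # HEBREW POINT RAFE
        "\u05B8"   # HEBREW POINT QAMATS (vowel point)
        "\u0591"   # HEBREW ACCENT ETNAHTA
    )
    print([unicodedata.name(c) for c in seq])
    # Six code points, all but the first with a non-zero combining class,
    # so the whole thing forms a single combining character sequence.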
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > It's hard to create a general model that will work for all scripts > encoded in Unicode. There are too many differences. So Unicode just > appears to standardize a higher level of processing with combining > sequences and normalization forms that are better approaching the > linguistic and semantic of the scripts. Consider this level as an > intermediate tool that will help simplify the identification of > processing units. While rendering and user input may use evolving rules with complex specifications and implementations which depend on the environment and user's configuration (actually there is no other choice: this is inherently complicated for some scripts), string processing in a programming language should have a stable base with well-defined and easy to remember semantics which doesn't depend on too many settable preferences and version variations. The more complex rules a protocol demands (case-insensitive programming language identifiers, compared after normalization, after bidi processing, with soft hyphens removed etc.), the more tools will implement it incorrectly. Usually with subtle errors which don't manifest until someone tries to process an unusual name (e.g. documentation generation tool will produce hyperlinks with dangling links, because a WWW server does not perform sufficient transformations of addresses). -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: >> But demanding that each program which searches strings checks for >> combining classes is I'm afraid too much. > > How is it any different from a case-insenstive search? We started from string equality, which somehow changed into searching. Default string equality is case-sensitive. Searching for an arbitrary substring entered by a user should use user-friendly rules which fold various minor differences like decomposition and case and soft hyphens, but it's a rare task and changing rules generally affects convenience rather than correctness. String equality is used for internal and important operations like lookup in a dictionary (not necessarily of strings ever viewed by the user), comparing XML tags, filenames, mail headers, program identifiers, hyperlink addresses etc. They should be unambiguous, simple and fast. Computing approximate equivalence by folding "minor" differenes must be done explicitly when needed, as mandated by relevant protocols and standards, not forced as the default. >> >> Does "\n" followed by a combining code point start a new line? >> > >> > The Standard says no, that's a defective combining sequence. >> >> Is there *any* program which behaves this way? > > I misstated that; it's a new line followed by a defective combining > sequence. What is the definition of combining sequences? >> It doesn't matter that accented backslashes don't occur practice. >> I do care for unambiguous, consistent and simple rules. > > So do I; and the only unambiguous, consistent and simple rule that > won't give users hell is that "ba" never matches "bä". Any programs > for end-users must follow that rule. Please give a precise definition of string equality. What representation of strings it needs - a sequence of code points or something else? Are all strings valid and comparable? Are there operations which give different results for "equal" strings? If string equality folded the difference between precomposed and decomposed characters, then the API should hide that difference in other places as well, otherwise string equality is not the finest distinction between string values but some arbitrary equivalence relation. >> My current implementation doesn't support filenames which can't be >> encoded in the current default encoding. > > The right thing to do, IMO, would be to support filenames as byte > strings, and let the programmer convert them back and forth between > character strings, knowing that it won't roundtrip. Perhaps. Unfortunately it makes filename processing harder, e.g. you can't store them in *text* files processed through a transparent conversion between its encoding and Unicode. In effect we must go back from manipulating context-insensitive character sequences to manipulating byte sequences with context-dependent interpretation. We can't even sort filenames using Unicode algorithms for collation but must use some algorithms which are capable of processing both strings in the locale's encoding and arbitrary byte sequences at the same time. This is much more complicated than using Unicode algorithms alone. What is worse, in Windows filenames the primary representation of filenames is Unicode, so programs which carefully use APIs based on byte sequences for processing filenames will be less general than Unicode-based APIs when the program is ported to Windows. 
The computing world is slowly migrating from processing byte sequences in ambiguous encodings to processing Unicode strings, often represented by byte sequences in explicitly labeled encodings. There are relics when the new paradigm doesn't fit well, like Unix filenames, but sticking to the old paradigm means that programs will continue to support mixing scripts poorly or not at all. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
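A sketch of the two options in Python, which this thread predates; Python 3 later adopted a scheme much like the escaping idea, smuggling undecodable filename bytes through strings as lone surrogates (the "surrogateescape" error handler):

    import os

    # Byte-oriented: lossless, but gives up Unicode string processing
    # (sorting and collation fall back to byte-wise comparison).
    names_as_bytes = os.listdir(b".")       # bytes in, bytes out

    # Character-oriented with an escape hatch: invalid bytes become lone
    # surrogates U+DC80..U+DCFF and round-trip back to the original bytes.
    raw = b"caf\xe9.txt"                    # not valid UTF-8
    as_text = raw.decode("utf-8", "surrogateescape")
    assert as_text.encode("utf-8", "surrogateescape") == raw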
Re: Nicest UTF
Lars Kristan <[EMAIL PROTECTED]> writes:

> My my, you are assuming all files are in the same encoding.

Yes. Otherwise nothing shows filenames correctly to the user.

> And what about all the references to the files in scripts?
> In configuration files?

Such files rarely use non-ASCII characters. Non-ASCII characters are primarily used in names of documents created explicitly by the user.

> Soft links?

They can be fixed automatically.

> If you want to break things, this is definitely the way to do it.

Using non-ASCII filenames is risky to begin with. Existing tools don't have a good answer to what should happen with these files when the default encoding used by the user changes, or when a user using a different encoding tries to access them. As long as everybody uses the same encoding and files use it too, things work. When the assumption is false, something will break.

>> You mean, various programs will break at various points of time,
>> instead of working correctly from the beginning?
>
> So far nothing broke. Because all the programs are in UTF-8.

This doesn't imply that they won't break. You are talking about filenames which are *not* UTF-8, with the locale set to UTF-8. Mozilla doesn't show such filenames in a directory listing. You may consider it a bug, but this is a fact. Producing non-UTF-8 HTML labeled as UTF-8 would be wrong too. There is no good solution to the problem of filenames encoded in different encodings. Handling such filenames is incompatible with using Unicode to process strings. You have to go back to passing arrays of bytes with ambiguous interpretation of non-ASCII characters, and live with inconveniences like displaying garbage for non-ASCII filenames and broken sorting.

>> Mixing any two incompatible filename encodings on the same file system
>> is a bad idea.
>
> As soon as you realize you cannot convert filenames to UTF-8, you
> will see that all you can do is start adding new ones in UTF-8.
> Or forget about Unicode.

I'm not using a UTF-8 locale yet, because too many programs don't support it. I'm using ISO-8859-2. But almost all filenames are ASCII.

--
 __("<         Marcin Kowalczyk
 \__/       [EMAIL PROTECTED]
  ^^     http://qrnik.knm.org.pl/~qrczak/
RE: Nicest UTF
"Lars Kristan" writes: > > A system administrator (because he has access to all files). > My my, you are assuming all files are in the same encoding. And what about > all the references to the files in scripts? In configuration files? Soft > links? If you want to break things, this is definitely the way to do it. Was it ever really wise to use non-ASCII file names in scripts and configuration files? It's not very hard to convert soft links at the same time. Nor, really should it be too hard to figure out the encodings; /home/foo/.bashrc probably tells you, as well as simple logic. Even if you can't do a system-wide change, it's easy enough to change the system files, and post a message about switching to UTF-8, and offering to assist any users with the change. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" writes: > But demanding that each program which searches strings checks for > combining classes is I'm afraid too much. How is it any different from a case-insenstive search? > >> Does "\n" followed by a combining code point start a new line? > > > > The Standard says no, that's a defective combining sequence. > > Is there *any* program which behaves this way? I misstated that; it's a new line followed by a defective combining sequence. > It doesn't matter that accented backslashes don't occur practice. I do > care for unambiguous, consistent and simple rules. So do I; and the only unambiguous, consistent and simple rule that won't give users hell is that "ba" never matches "bä". Any programs for end-users must follow that rule. > My current implementation doesn't support filenames which can't be > encoded in the current default encoding. The right thing to do, IMO, would be to support filenames as byte strings, and let the programmer convert them back and forth between character strings, knowing that it won't roundtrip. > If the > program assumed that an accented slash is not a directory separator, > I expect possible security holes (the program thinks that a string > doesn't include slashes, but from the OS point of view it does). If the program assumes that an accented slash is not a directory separator, then it's wrong. Any way you go is going to require sensitivity. > > The rules you are offering are only simple and unambiguous to the > > programmer; they appear completely random to the end user. > > And yours are the opposite :-) Programmers get to spend a lot of time dealing with the "random" requirements of users, not the other way around. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: infinite combinations, was Re: Nicest UTF
From: "Peter R. Mueller-Roemer" <[EMAIL PROTECTED]> For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertore is still finite. I do think that you are underestimating the repertoire. Also Unicode does NOT define an upper bound for the length of combining sequences, and also not on the length of default grapheme clusters (which can be composed of multiple combining sequences, for example in the Hangul or Tibetan scripts) Your estimations also ignores various layouts found in Asian texts, and the particular structures of historic texts which can use many "diacritics" on top of a single base letter starting a combining sequence. The model of these scripts (for example Hebrew) imply the justaposition of up to 13 or 15 levels of diacritics for the same base letter! In practice, it's impossible to enumerate all existing combinations (and ensure that they will be assigned a unique code within a reasonnably limited code point), and that's why a simpler model based on more basic but combinable code points is used in Unicode: it frees Unicode from having to encode all of them (this is already a difficult task for the Han script which could have been encoded with combining sequences, if the algorithms needed to create the necesssary layout had not needed the use of so many complex rules and so many exceptions...)
RE: Nicest UTF
Missed this one the other day, but cannot let it go...

Marcin 'Qrczak' Kowalczyk wrote:
> > filenames, what is one supposed to do? Convert all
> > filenames to UTF-8?
>
> Yes.
>
> > Who will do that?
>
> A system administrator (because he has access to all files).

My my, you are assuming all files are in the same encoding. And what about all the references to the files in scripts? In configuration files? Soft links? If you want to break things, this is definitely the way to do it.

> > If you keep all processing in UTF-8, then this is a decision you can
> > postpone.
>
> You mean, various programs will break at various points of time,
> instead of working correctly from the beginning?

So far nothing broke. Because all the programs are in UTF-8. If you would try to write it in UTF-16, it would break. So nobody does it. Except those that must.

> > I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames.
> > Do you want to discourage them?
>
> Mixing any two incompatible filename encodings on the same file system
> is a bad idea.

As soon as you realize you cannot convert filenames to UTF-8, you will see that all you can do is start adding new ones in UTF-8. Or forget about Unicode.

Lars
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Philippe Verdy wrote:
> This is a known caveat even for Unix, when you look at the tricky details of
> the support of Windows file sharing through Samba, when the client requests
> a file with a "short" 8.3 name, that a partition used by Windows is supposed
> to support.

Do you know how Samba is configured to present UTF-8 filenames properly to Windows? What happens to Latin 1 filenames? Are the invalid sequences escaped? How?

Lars
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character sequences. I'm afraid that anything other than a mixture of 1 and 3 is too complicated to be widely used. Almost everybody is representing strings either as code points, or as even lower-level units like UTF-16 units. And while 2 is nice from the user's point of view, it's a nightmare from the programmer's point of view: Consider that the normalized forms are trying to approach the choice number 2, to create more predictable combining character sequences which can still be processed with algorithms just streams of code points. Remember that the total number of possible code points is finite; but not the total number of possible combining sequences, meaning that text handling will necessarily have to make decisions based on a limited set of properties. Note however that for most Unicode strings, the "composite" character properties are those of the base character in the sequence. Note also that for some languages/scripts, the linguistically correct unit of work is the grapheme cluster; Unicode just defines "default grapheme clusters", which can span several combining sequences (see for example the Hangul script, written with clusters made of multiple combining sequences, where the base character is a Unicode jamo, itself made somtimes of multiple simpler jamos that Unicode do not allow to decompose as canonically equivalent strings, despite this decomposition is inherent of the script itself in its structure, and not bound to the language which Unicode will not standardize). It's hard to create a general model that will work for all scripts encoded in Unicode. There are too many differences. So Unicode just appears to standardize a higher level of processing with combining sequences and normalization forms that are better approaching the linguistic and semantic of the scripts. Consider this level as an intermediate tool that will help simplify the identification of processing units. The reality is that a written language is actually more complex than what can be approached in a single definition of processing units. For many other similar reasons, the ideal working model will be with "simple" and enumerable abstract characters with a finite number of code points, and with which actual and non-enumerable characters can be composed. But the situation is not ideal for some scripts, notably ideographic ones due to their very complex and often "inconsistent" composition rules or layout and that require allocating many code points, one for each combination. Working with ideographic scripts requires much more character properties than with other scripts (see for example the huge and various properties defined in UniHan, which are still not standardized due to the difficulty to represent them and the slow discovery of errors, omissions, or contradictions found in various sources for this data...)
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.

Technically, I am not asking anything. I am just trying to discuss an approach which I think can be used to solve certain problems. And this approach does not need to be conformant at this point. If someone finds it suitable to make it conformant, even better, but at this point this is irrelevant to the discussion. Unless it is proven that it cannot be made conformant (by changing or amending the standard) because I have missed an important fact. But so far, I have not seen such a proof.

But suppose I am asking, therefore proposing - it would be several separate items:

1 - To assign codepoints for 128 (or 256) new surrogates(*), used for:
  1.1 - Representing unassigned values when converting from an encoding to Unicode (optional).
  1.2 - Representing invalid sequences when interpreting UTF-8 (optional).
The use of these would not be mandatory. Existing handling is still an option and can be preserved wherever it suits the needs, or changed where the new behavior is beneficial. Representation of these codepoints in UTF-8 would be as per the current standard.

2 - An alternative conversion from Unicode to, say, UTF-8E (UTF-8E is _NOT_ Unicode(*)). This conversion would reconstruct the original byte sequence from a Unicode string obtained by 1.2. This conversion pair is intended for use on platform or interface boundaries if/where it is determined that they are suitable. For example, interfacing a UNIX filesystem and a UTF-8 pipe would require UTF-8E<=>UTF-8 conversion. Interfacing a UNIX filesystem and a Windows filesystem would require UTF-8E<=>UTF-16 conversion.

(*) If proposal #2 were not accepted, then the codepoints in proposal #1 would actually not be surrogates, but simply codepoints and nothing else. Even if proposal #2 is accepted, it is still not clear whether those should really be called surrogates, since they would convert among all UTFs just as any other codepoint, and only their representation in UTF-8E would differ. Note that UTF-8E is not Unicode, but would be standardized in Unicode. If the U in UTF is a problem, then any other name can be chosen. Consider it a working name and be aware of what it is and is not.

3 - If the UTC cannot agree that the BMP should be used for proposal #1, I would advise against a decision to assign non-BMP codepoints for the purpose. I believe less damage would be done by postponing the decision than by making a wrong decision. It is not just about how much disk space or bandwidth is used. For example, if both filesystems have a 256-character limit for a filename, limitations are consistent (at least in one direction) if the BMP is used, and not if any other plane is used.

4 - If neither of the proposals is accepted, it would be beneficial if the UTC would manage to preserve at least one suitable block (for example U+A4xx or U+ABxx) of 256 codepoints intact, to facilitate a future decision.

Lars Kristan
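To see the difference between items 1 and 2 in code, here is a Python sketch. It reuses the hypothetical escape block U+EE80..U+EEFF from the earlier sketch as a stand-in for the proposed codepoints, and "UTF-8E" here is only the proposal's working name, not an existing encoding.

    ESCAPE_BASE = 0xEE80                             # hypothetical stand-in block

    escape_cp = chr(ESCAPE_BASE + (0xE9 - 0x80))     # stands for the invalid byte 0xE9

    # Item 1: in plain UTF-8 the escape codepoint is encoded like any other
    # codepoint (three bytes for this one).
    print(escape_cp.encode("utf-8"))                 # b'\xee\xbb\xa9'

    # Item 2: the hypothetical UTF-8E conversion would regenerate the byte.
    def utf8e_encode_char(ch: str) -> bytes:
        cp = ord(ch)
        if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0x7F:
            return bytes([cp - ESCAPE_BASE + 0x80])
        return ch.encode("utf-8")

    print(utf8e_encode_char(escape_cp))              # b'\xe9', the original byte back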
infinite combinations, was Re: Nicest UTF
Philippe Verdy wrote:
> The repertoire of all possible combining characters sequences is already
> infinite in Unicode, as well as the number of "default grapheme clusters"
> they can represent.

For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite.

I am enthused about some nicely distinguishable sequences: e.g. u + macron + diaeresis shows nicely as a long long vowel u-Umlaut, whereas u + diaeresis + macron displays as a long vowel u with trema above, to be spoken as a separate vowel. BRAVO! I do not see a good reason why this does not work for all other base characters, particularly all vowels (e, i, o combine in an undesirable fashion, a only in one newest version of a Unicode font). I can add an acute accent to each sequence, but the accent is smudged into the previous complex character in an ugly default overtype mode.

Another GOOD solution: the single combining Hebrew dagesh point 'finds' the right 'inner' place in all the Hebrew consonants and some Latin base characters, so why should overtype ugliness be allowed in many other cases? There seems to be no difficulty in implementing composition of a complex character from the inside out.

Can't we join forces to request a default graphical representation, so that legible, distinguishable complex symbols must be generated by future Unicode fonts? The technical details are not too complex, and the expressiveness and ease of use of Unicode would be greatly enhanced. The Greek acute and grave accents should by themselves combine centered over any base character; and when combined with a spiritus asper or lenis, the two should be minimally separated horizontally and displayed centered over the base character. Hebrew vowel points and accents also need to be fitted under any single base character. Samaritan complex characters should be composable from short combining sequences.

Peter R. Mueller-Roemer
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
> Lars responded:
> > > ... Whatever the solutions
> > > for representation of corrupt data bytes or uninterpreted data
> > > bytes on conversion to Unicode may be, that is irrelevant to the
> > > concerns on whether an application is using UTF-8 or UTF-16
> > > or UTF-32.
>
> > The important fact is that if you have an 8-bit based program, and you
> > provide a locale to support UTF-8, you can keep things working (unless you
>                                                  ^^^
> You can keep *some* things *sorta* working.

I didn't say that this is all that needs to be done. But the way you say it makes one think that this is not even the right track.

> > prescribe validation). But you cannot achieve the same if you try to base
> > your program on 16 or 32 bit strings.
>
> Of course you can. You just have to rewrite the program to handle
> 16-bit or 32-bit strings correctly. You can't pump them through
> 8-bit pipes or char* API's, but it's just silly to try that, because
> they are different animals to begin with.

Correctly? Strings? There are no strings and no encodings in a UNIX filesystem. Please clarify.

> By the way, I participated as an engineer in a multi-year project
> that shifted an advanced, distributed data analysis system
> from an 8-bit character set to 16-bit Unicode. *All* user-visible string
> processing was converted over -- and that included proprietary
> file servers, comm servers, database gateways, networking code,
> a proprietary 32-bit workstation GUI implementation, and a suite
> of object-oriented application tools, including a spreadsheet,
> plotting tool, query and database reporting tools, and much more.
> It worked cross-platform, too.
>
> It was completed, running, and *delivered* to customers in 1994,
> a decade ago.

OK, was this a fresh development, or was this an upgrade of an existing system? Did the existing system contain user data that needed to be converted? Was this data all in ASCII? Was this data all in a single code page? Latin 1 perhaps? How much of that data was in UTF-8?

> You can't bamboozle me with any of this "it can't be done with
> 16-bit strings" BS.

BS? Bamboozle? One learns all sorts of new words here on this mailing list. Frankly, I find it interesting to read many historical and cultural facts in off-topic discussions, but I have a feeling I am not the only one and that many people prefer to engage in those. And that often the original questions remain unanswered. And interesting ideas unexplored. I know it is hard to follow someone else's ideas, spread over many mails, already sidetracked by those who think they understand what is being discussed and by those who can't distinguish between following a standard and changing or extending it. In the end, statements torn out of context do in fact look as if they're nonsense. Much of your response (in this particular mail, not in general) is just that. One misinterpretation after another. And detailed explanations of things that are not even being discussed. Non-conformances being pointed out, where the consequences of proposed changes should in fact be discussed. I am disappointed by this attitude, even more so because it comes from one of the most respected people on this mailing list. Examples:

> Yes you can.
> No, you need not -- that is non-conformant, besides.
> http://www.unicode.org/Public/UNIDATA/
> Utterly non-conformant.
> Also utterly nonconformant.

I suppose surrogates were also non-conformant at the time they were proposed. Can I interpret your responses as meaning that surrogates should never have been accepted into the Unicode standard?

> I just don't understand these assertions at all.

I have given plenty of examples.

> First of all it isn't "UNIX data" or "Windows data" -- it is
> end user's data, which happens to be processed in software
> systems which in turn are running on a UNIX or Windows OS.

This is resorting to a philosophical answer, picking on words.

> I work for a company that *routinely* runs applications that
> cross the platform barriers in all sorts of ways. It works
> because character sets are handled conformantly, and conversions
> are done carefully at platform boundaries -- not because some
> hack has been added to UTF-8 to preserve data corruptions.

Sybase, yes. A very controlled environment. The fact that validity of data *can* be guaranteed in your particular environment gives you not more, but less right to make judgements about other environments and claim the problems can be solved 'by doing things correctly'.

> > If the purpose of
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan wrote:
> However, although they are *technically* octet sequences, they
> are *functionally* character strings. That's the issue.

Nicely put! But the UTC does not seem to care.

> > The point I'm making is that *whatever* you do, you are still
> > asking for implementers to obey some convention on conversion
> > failures for corrupt, uninterpretable character data.
> > My assessment is that you'd have no better success at making
> > this work universally well with some set of 128 magic bullet
> > corruption pills on Plane 14 than you have with the
> > existing Quoted-Unprintable as a convention.
>
> It doesn't have to work universally; indeed, it becomes a QOI issue.
> Allocating representations of bytes with "bits that are high" makes
> it possible to do something recoverable, at very little expense to the
> Unicode Consortium.

Except that the expense should be slightly higher. The importance of these replacement codepoints is still underestimated. They belong in the BMP. And at least there is no way anyone can blame the UTC for a cultural bias in this case; these codepoints are universal.

> > Further, as it turns out that Lars is actually asking for
> > "standardizing" corrupt UTF-8, a notion that isn't going to
> > fly even two feet, I think the whole idea is going to be
> > a complete non-starter.
>
> I agree that that part won't fly, absolutely.

Then I'll have to restructure it.

Lars
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Arcane Jill responded:
> >> Windows filesystems do know what encoding they use.
> >
> > Err, not really. MS-DOS *need to know* the encoding to use, a bit like a
> > *nix application that displays filenames need to know the encoding to use
> > the correct set of glyphs (but constraints are much more heavy.)
>
> Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's not
> like MS-DOS is still terrifically popular these days.

I don't know what Antoine meant by MS-DOS, but since he mentioned it in the Windows context, I thought it was about Windows console applications (the console is still often referred to as the DOS box, I think).

> The fact that applications can still open files using the legacy fopen()
> call (which requires char*, hence 8-bit-wide, strings) is kind of
> irrelevant. If the user creates a file using fopen() via a code page
> translation, AND GETS IT WRONG, then the file will be created with Unicode
> characters other than those she intended - but those characters will
> still be Unicode and unambiguous, no?

Funny thing. Nobody cares much if a Latin 2 string is misinterpreted and a Latin 1 conversion is used instead. As long as they can create the file. But if a Latin 2 string is misinterpreted and a UTF-8 conversion is used? You won't just get a filename with characters other than those you expected. Either the file won't open at all (depending on where and how the validation is done), or you risk that two files you create one after another will overwrite each other. Note that I am talking about files you create from within this scenario, not files that existed on the disk before.

Second thing: OK, you say fopen is a legacy call. True, you can use _wfopen. So, you can have a console application in Unicode and all problems are solved? No. Standard input and standard output are 8-bit, and a code page is used. And it has to remain so, if you want the old and the new applications to be able to communicate. So, the logical conclusion is that UTF-8 needs to be used instead of a code page. Unfortunately, Windows has problems with that. Try MODE CON: CP SELECT=65001. Much of it works, but batch files don't run.

Now suppose Windows does work correctly with the code page set to UTF-8. You create an application that reads stdin, counts the words longer than 10 codepoints, and passes the input unmodified to stdout. What happens:
* set CP to Latin 1, process Latin 1: correct result
* set CP to Latin 1, process UTF-8: wrong result
* set CP to UTF-8, process UTF-8: correct result
* set CP to UTF-8, process Latin 1: wrong result, corrupted output

Now, I wonder why Windows is not supporting UTF-8 as much as one would want.

Lars
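A sketch of that filter in Python (illustrative only; it does not model Windows console code pages). Counting requires assuming an encoding, but doing the pass-through on raw bytes keeps at least the output from being corrupted when the assumption is wrong:

    import sys

    ASSUMED_ENCODING = "utf-8"        # what we assume the console code page is

    data = sys.stdin.buffer.read()    # raw bytes in
    long_words = 0
    try:
        text = data.decode(ASSUMED_ENCODING)
        long_words = sum(1 for w in text.split() if len(w) > 10)
    except UnicodeDecodeError:
        pass                          # input was in some other code page; no count
    sys.stdout.buffer.write(data)     # pass the bytes through unmodified
    print("words longer than 10 codepoints:", long_words, file=sys.stderr)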
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: >> > This implies that every programmer needs an indepth knowledge of >> > Unicode to handle simple strings. >> >> There is no way to avoid that. > > Then there's no way that we're ever going to get reliable Unicode > support. This is probably true. I wonder whether things could have been done significantly better, or it's an inherent complexity of text. Just curious, it doesn't help with the reality. >> If the runtime automatically performed NFC on input, then a part of a >> program which is supposed to pass a string unmodified would sometimes >> modify it. Similarly with NFD. > > No. By the same logic you used above, I can expect the programmer to > understand their tools, and if they need to pass strings unmodified, > they shouldn't load them using methods that normalize the string. That's my point: if he normalizes, he does this explicitly. If a standard (a programming language, XML, whatever) specifies that identifiers should be normalized before comparison, a program should do this. If it specifies that Cf characters are to be ignored, then a program should comply. A standard doesn't have to specify such things however, so a programming language shouldn't do too much automatically. It's easier to apply a transformation than to undo a transformation applied automatically. > Sometimes things get ambiguous if one day ŝ is matched by s and one > day ŝ isn't? That's absolutely wrong behavior; the program must serve > the user, not the programmer. If I use grep to search for a combining acute, I bet it will currently match cases where it's a separate combining character but will not match precomposed characters. Do you say that this should be changed? Hey, Linux grep matches only a single byte by ".", even in UTF-8 locale. Now, I can agree that this should be changed. But demanding that each program which searches strings checks for combining classes is I'm afraid too much. >> Does "\n" followed by a combining code point start a new line? > > The Standard says no, that's a defective combining sequence. Is there *any* program which behaves this way? How useful is a rule in a standard which nobody obeys to? >> Does a double quote followed by a combining code point start a >> string literal? > > That would depend on your language. I'd prefer no, but it's obvious > many have made other choices. Since my language is young and almost doesn't have users, I can even change decisions made earlier: I'm not constrained by compatibility yet. But if lexical structure of the program worked in terms of combining character sequences, it would have to be somehow supported by generic string processing functions, and it would have to consistely work for all lexical features. For example */ followed by a combining accent would not end a comment, accented backslash would not need escaping in a string literal, and something unambiguous would have to be done with an accented newline. Such rules would be harder to support with most text processing tools. I know no language in which searching for a backslash in a string would not find an accented backslash. It doesn't matter that accented backslashes don't occur practice. I do care for unambiguous, consistent and simple rules. >> Does a slash followed by a combining code point separate >> subdirectory names? > > In Unix, yes; that's because filenames in Unix are byte streams with > the byte 0x2F acting as a path seperator. 
My current implementation doesn't support filenames which can't be encoded in the current default encoding. The encoding can be changed from within a program (perhaps locally during execution of some code). So one can process any Unix filename by temporarily setting the encoding to Latin1. It's unfortunate that the default setting is more restrictive than the OS, but I have found no sensible alternative other than encouraging processing strings in their transportation encoding. Anyway, if a string *is* accepted as a file name, the program's idea about directory separators is the same as the OS (as long as we assume Unix; I don't yet provide any OS-generic pathname handling). If the program assumed that an accented slash is not a directory separator, I expect possible security holes (the program thinks that a string doesn't include slashes, but from the OS point of view it does). > The rules you are offering are only simple and unambiguous to the > programmer; they appear completely random to the end user. And yours are the opposite :-) -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
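The grep point is easy to reproduce at the code point level; a Python sketch (regular expressions here stand in for grep):

    import re, unicodedata

    precomposed = "caf\u00e9"      # é as one code point
    decomposed = "cafe\u0301"      # e + COMBINING ACUTE ACCENT

    accent = "\u0301"              # searching for a lone combining acute
    print(bool(re.search(accent, precomposed)))   # False: no separate accent there
    print(bool(re.search(accent, decomposed)))    # True
    # Folding the difference has to be asked for explicitly:
    print(unicodedata.normalize("NFD", precomposed) ==
          unicodedata.normalize("NFD", decomposed))   # True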
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: [...] > This was later amended in an errata for XML 1.0 which now says that > the list of code points whose use is *discouraged* (but explicitly > *not* forbidden) for the "Char" production is now: [...] Ugh, it's a mess... IMHO Unicode is partially to blame, by introducing various kinds of holes in code point numbering (non-characters, surrogages), by not being clear when the unit of processing should be a code point and when a combining character sequence, and earlier by pushing UTF-16 as the fundamental representation of the text (which led to such horrible descriptions as http://www.xml.com/axml/notes/Surrogates.html). XML is just an example of a standard which must decide: A. What is the unit of text processing? (code point? combining character sequence? something else? hopefully it would not be UTF-16 unit) B. Which (sequences of) characters are valid when present in the raw source, i.e. what UTF-n really means? C. Which (sequences of) characters can be formed by specifying a character number? A programming language must do the same. The language Kogut I'm designing and developing uses Unicode as string representation, but the details can still be changed. I want to have rules which are "correct" as far as Unicode is concerned, and which are simple enough to be practical (e.g. if a standard forced me to make the conversion from code point number to actual character contextual, or if it forced me to unconditionally unify precomposed and decomposed characters, then I quit and won't support a broken standard). Internal text processing in a programming language can be more permissive than an application of such processing like XML parsing: if a particular character is valid in UTF-8 but XML disallows it, everything is fine, it can be rejected at some stage. It must not be more restrictive however, as it would make impossible to implement XML parsing in terms of string processing. Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character sequences. I'm afraid that anything other than a mixture of 1 and 3 is too complicated to be widely used. Almost everybody is representing strings either as code points, or as even lower-level units like UTF-16 units. And while 2 is nice from the user's point of view, it's a nightmare from the programmer's point of view: - Unicode character properties (like general category, character name, digit value) are defined in terms of code points. Choosing 2 would immediately require two-stage processing: a string is a sequence of sequences of code points. - Unicode algorithms (like collation, case mapping, normalization) are specified in terms of code points. - Data exchange formats (UTF-n) are always closer to code points than to combining character sequences. - Code points have a finite domain, so you can make dictionaries indexed by code points; for combining character sequences we would be forced to make functions which *compute* the relevant property basing on the structure of such a sequence. I don't believe 2 is workable at all. The question is how to make 3 convenient enough to be used more often. Unfortunately it's much harder than 1, unless strings used some completely different iteration protocols than other sequences. I don't have an idea how to make 3 convenient. 
Regarding B in the context of a programming language (not XML), chapter 3.9 of the Unicode standard version 4.0 excludes only surrogates: it does not exclude non-characters like U+FFFF. But non-characters must be excluded somewhere, because otherwise U+FFFE at the beginning would be mistaken for a BOM. I'm confused.

Regarding C, I'm confused too. Should a function which returns the character of the given number accept surrogates? I guess no. Should it accept non-characters? I don't know. I only know that it should not accept values above 0x10FFFF.

--
 __("<         Marcin Kowalczyk
 \__/       [EMAIL PROTECTED]
  ^^     http://qrnik.knm.org.pl/~qrczak/
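One way to picture option 3 is a thin grouping layer over a code point string; a rough Python sketch (this only groups by non-zero combining class, which is a simplification of both combining character sequences and UAX #29 grapheme clusters):

    import unicodedata

    def combining_sequences(text):
        group = ""
        for ch in text:
            if group and unicodedata.combining(ch):
                group += ch            # attach a combining mark to the current group
            else:
                if group:
                    yield group
                group = ch             # start a new group at a base character
        if group:
            yield group

    print(list(combining_sequences("cafe\u0301s")))   # ['c', 'a', 'f', 'e\u0301', 's']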
Re: Nicest UTF
Philippe Verdy scripsit:

> And I disagree with you about the fact that U+0000 can't be used in XML
> documents. It can be used in a URI through the URI escaping mechanism, as
> explicitly indicated in the XML specification...

You have a hold of the right stick but at the wrong end. U+0000 can be encoded in a URI as %00, but that does not mean that the IRIs in system ids and namespace names (and potentially other places) can contain explicit U+0000 characters or &#0; escapes either. Both of those are illegal, and documents that contain them are not well-formed. In character content and attribute values, U+0000 is not possible.

> And the fact that the various character productions, which are normally
> normative, have been changed so often, sometimes through errata that
> were forgotten in the text of the next edition of the standard,

Do you have evidence for this claim?

> The only thing about which I can agree is that XML will forbid surrogates
> and U+FFFE and U+FFFF, but I won't say that an XML parser that does not
> reject NULs or other non-characters or "disallowed" C0 controls is really
> so buggy.

You are of course entitled to your uninformed opinion.

> But all this is also proof that XML documents are definitely NOT
> plain-text documents, so you can't use Unicode encoding rules at the
> encoded XML document level, only at the finest plain-text nodes (these
> are the levels that the productions in the XML standard are trying, with
> more or less success, to standardize).

You can't blindly do *normalization* of XML documents as if they were plain text. *Encoding* XML documents according to Unicode is of course possible and desirable.

> As a consequence any process that blindly applies a plain-text
> normalization to a complete XML document is bogus, because it breaks the
> most basic XML conformance, i.e. the core document structure...

In one extraordinarily unlikely case, yes: the appearance of a combining overlay slash following the ">" that closes a tag will damage the document if it is NFC-normalized.

--
You are a child of the universe no less         John Cowan
than the trees and all other acyclic            http://www.reutershealth.com
graphs; you have a right to be here.            http://www.ccil.org/~cowan
  --DeXiderata by Sean McGrath                  [EMAIL PROTECTED]
Re: Nicest UTF
Philippe Verdy scripsit:

> > Okay, I'm confused. Does ≮ open a tag? Does it matter if it's
> > composed or decomposed?
>
> It does not open an XML tag.
> It does matter if it's composed (won't open a tag) or decomposed (will
> open a tag, but with a combining character, invalid as an identifier
> start)

Let's be precise here. If the 7-character character sequence "&#8814;" appears in an XML document, it never opens a tag and it is never changed by normalization. If the 1-character sequence consisting of a single U+226E appears in an XML document, and that document is put through NF(K)D, it will become not well-formed. However, NF(K)D is not recommended for XML documents, which should be in NFC.

--
First known example of political correctness:   John Cowan
"After Nurhachi had united all the other        http://www.reutershealth.com
Jurchen tribes under the leadership of the      http://www.ccil.org/~cowan
Manchus, his successor Abahai (1592-1643)       [EMAIL PROTECTED]
issued an order that the name Jurchen should    --S. Robert Ramsey,
be banned, and from then on, they were all      The Languages of China
to be called Manchus."
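The decomposed case is easy to check; a Python sketch:

    import unicodedata

    not_less_than = "\u226E"                      # U+226E, '≮'
    decomposed = unicodedata.normalize("NFD", not_less_than)
    print(decomposed == "<\u0338")                # True: '<' + COMBINING LONG SOLIDUS OVERLAY
    # NFD therefore injects a raw '<' into character data, which is why
    # NF(K)D can break well-formedness while NFC recomposes the character:
    print(unicodedata.normalize("NFC", "<\u0338") == not_less_than)   # True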
Re: Nicest UTF
Philippe Verdy scripsit:

> If you look at the XML 1.0 Second Edition

The Second Edition has been superseded by the Third.

> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That is normative.

> But the comment following it specifies:

That comment is not normative and not meant to be precise.

> the restrictive
> definition of "Char" above also includes the whole range of C1 controls

By oversight.

> (#x80..#x9F), so I can't understand why the Char definition is so
> restrictive on controls; in addition the definition of Char also
> *includes* many non-characters (it only excludes surrogates, and U+FFFE
> and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and
> U+2FFFF, ..., U+10FFFE and U+10FFFF).

By oversight again.

> Note however that nearly all XML parsers don't seem to honor this
> constraint (like SGML parsers...)!

Please specify the parsers that do and don't honor this. Any which don't honor it are buggy, and any documents which exploit those bugs are not XML.

> What is even worse is that XML 1.1 now reallows NUL for system
> identifiers and URIs, through escaping mechanisms.

Not true. U+0000 is absolutely excluded in both XML 1.0 and XML 1.1.

--
"I could dance with you till the cows           John Cowan
come home.  On second thought, I'd              http://www.ccil.org/~cowan
rather dance with the cows when you             http://www.reutershealth.com
came home."  --Rufus T. Firefly                 [EMAIL PROTECTED]
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> Okay, I'm confused. Does ≮ open a tag? Does it matter if it's composed or decomposed? It does not open an XML tag. It does matter if it's composed (won't open a tag) or decomposed (will open a tag, but with a combining character, invalid as an identifier start). Conclusion 1: blind normalizations of XML documents, as if they were plain-text documents, can break the XML well-formedness of these documents. This is caused by the fact that plain-text documents can be parsed by units of grapheme clusters or combining sequences. But XML parsing stops at the one-codepoint character level, and ignores canonical equivalences. Conclusion 2: XML documents are not plain-text documents.
Re: Nicest UTF
From: "John Cowan" <[EMAIL PROTECTED]> Marcin 'Qrczak' Kowalczyk scripsit: http://www.w3.org/TR/2000/REC-xml-20001006#charsets implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of characters. XML processors are required to process UTF-8 and UTF-16, and may process other character encodings or not. But the internal model is that of characters. Thus surrogate code points are not allowed. I have a different reading, because the "character" in XML is not the same as the "character" in Unicode. For XML, U+10FFFF is a valid character (even if its use is explicitly not recommended, it is perfectly valid), for Unicode it's a non-character... For XML, U+0001 is *sometimes* a valid character, sometimes not. And I disagree with you about the fact that U+0000 can't be used in XML documents. It can be used in URI through URI escaping mechanism, as explicitly indicated in the XML specification... And the fact that the various character productions, that are normally normative, have been changed so often, sometimes through errata that were forgotten in the text of the next edition of the standard, then reintroduced in an errata, shows that these productions are less reliable than the descriptive *definitions* which ARE normative in XML... The only thing about which I can agree is that XML will forbid surrogates and U+FFFE and U+FFFF, but I won't say that an XML parser that does not reject NULs or other non-characters or "disallowed" C0 controls is so much buggy. I do think that these restrictions are a defect of XML... But all this is also proof that XML documents are definitely NOT plain-text documents, so you can't use Unicode encoding rules at the encoded XML document level, only at the finest plain-text nodes (these are the levels that the productions in the XML standard are trying, with more or less success, to standardize). As a consequence any process that blindly applies a plain-text normalization to a complete XML document is bogus, because it breaks the most basic XML conformance, i.e. the core document structure...
Re: Nicest UTF
John Cowan writes: > You are reading the XML Recommendation incorrectly. It is not defined > in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of > characters. XML processors are required to process UTF-8 and UTF-16, > and may process other character encodings or not. But the internal > model is that of characters. Thus surrogate code points are not > allowed. Okay, I'm confused. Does ≮ open a tag? Does it matter if it's composed or decomposed? -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit: > http://www.w3.org/TR/2000/REC-xml-20001006#charsets > implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of characters. XML processors are required to process UTF-8 and UTF-16, and may process other character encodings or not. But the internal model is that of characters. Thus surrogate code points are not allowed. -- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, The day and hour soon are coming / When all the IT folks say "Gosh!" It isn't from a clever lawsuit / That Windowsland will finally fall, But thousands writing open source code / Like mice who nibble through a wall. --The Linux-nationale by Greg Baker
Re: Nicest UTF
From: "Philippe Verdy" <[EMAIL PROTECTED]> From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character sequences? Something else? See the XML character model document... XML ignores combining sequences. But for Unicode and for XML a character is an abstract character with a single code allocated in a *finite* repertoire. The repertoire of all possible combining character sequences is already infinite in Unicode, as well as the number of "default grapheme clusters" they can represent. Note there are some differently relaxed definitions of what constitutes a "character" for XML. If you look at the XML 1.0 Second Edition, it specifies that the document is a "text" (defined only as a sequence of "characters", which may represent markup or character data) that will only contain characters in this set: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] But the comment following it specifies: "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF." which is considerably weaker (because it would include ALL basic controls in the range #x0 to #x1F, and not only TAB, LF, CR); the restrictive definition of "Char" above also includes the whole range of C1 controls (#x80..#x9F), so I can't understand why the Char definition is so restrictive on controls; in addition the definition of Char also *includes* many non-characters (it only excludes surrogates, and U+FFFE and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and U+2FFFF, ..., U+10FFFE and U+10FFFF). So XML does allow Unicode/ISO10646 non-characters... But not all. Apparently many XML parsers seem to ignore the restriction of Char above, notably in CDATA sections. The alternative is then to use numeric character references, as defined by this even weaker production (in 4.1. Character and Entity References): CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';' but with this definition: "A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices." Which is exactly the purpose of encoding something like "&#1;" to encode a SOH character U+0001 (which after all is a valid Unicode/ISO/IEC 10646 "character"), or even a NUL character. The "CharRef" production however is annotated by a Well-Formedness Constraint, "Legal Character": "Characters referred to using character references must match the production for Char." Note however that nearly all XML parsers don't seem to honor this constraint (like SGML parsers...)! This was later amended in an errata for XML 1.0 which now says that the list of code points whose use is *discouraged* (but explicitly *not* forbidden) for the "Char" production is now: [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
This clause is not really normative, but just adds to the confusion... Then comes XML 1.1, which extends the restrictive "Char" production: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] with the same comment "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF." So in XML 1.0, the comment was accurate, not the formal production... In XML 1.1, all C0 and C1 controls (except NUL) are now allowed, but the use of some of them is restricted in some cases: RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F] What is even worse is that XML 1.1 now reallows NUL for system identifiers and URIs, through escaping mechanisms. Clearly, the XML specification is inconsistent there, and this would explain why most XML parsers are more permissive than what is given in the "Char" production of the XML specification, and that they simply refer to the definition of valid codepoints for Unicode and ISO/IEC 10646, excluding only surrogate code points (a valid code point can be a non-character, and can also be a NUL...): the XML parser will accept those code points, but will leave the validity control to the application using the parsed XML data, or will offer some tuning options to enable this "Char" filter (that depends on XML version...). See also the various errata for XML 1.1, related to "RestrictedChar"... Or to the list of characters whose use is discouraged (meaning explicitly not forbidden, so allowed...): [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FF
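For anyone who wants to check the ranges being argued about, here is a small sketch (Python; the helper names are mine) of the XML 1.0 "Char" production quoted above, next to the Unicode noncharacter rule:

    def is_xml10_char(cp: int) -> bool:
        # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    def is_noncharacter(cp: int) -> bool:
        # Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    print(is_xml10_char(0x85), is_noncharacter(0x85))        # True False -- a C1 control passes Char
    print(is_xml10_char(0xFFFF), is_noncharacter(0xFFFF))    # False True -- excluded by the production
    print(is_xml10_char(0x1FFFF), is_noncharacter(0x1FFFF))  # True True  -- one of the "forgotten" noncharacters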
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" writes: > "D. Starner" writes: > > > This implies that every programmer needs an indepth knowledge of > > Unicode to handle simple strings. > > There is no way to avoid that. Then there's no way that we're ever going to get reliable Unicode support. > If the runtime automatically performed NFC on input, then a part of a > program which is supposed to pass a string unmodified would sometimes > modify it. Similarly with NFD. No. By the same logic you used above, I can expect the programmer to understand their tools, and if they need to pass strings unmodified, they shouldn't load them using methods that normalize the string. > You can't expect each and every program which compares strings to > perform normalization (e.g. Linux kernel with filenames). As has been pointed out here, Posix filenames are not character strings; they are byte strings. They quite likely aren't even valid UTF-8 strings. > > So S should _sometimes_ match an accented S? Again, I feel extended misery > > of explaining to people why things aren't working right coming on. > > Well, otherwise things get ambiguous, similarly to these XML issues. Sometimes things get ambiguous if one day ŝ is matched by s and one day ŝ isn't? That's absolutely wrong behavior; the program must serve the user, not the programmer. 's' cannot, should, must not match 'ŝ'; and if it must, then it absolutely always must match 'ŝ' and someway to make a regex that matches s but not ŝ must be designed. It doesn't matter what problems exist in the world of programming; that is the entirely reasonable expectation of the end user. > Does "\n" followed by a combining code point start a new line? The Standard says no, that's a defective combining sequence. > Does > a double quote followed by a combining code point start a string > literal? That would depend on your language. I'd prefer no, but it's obvious many have made other choices. > Does a slash followed by a combining code point separate > subdirectory names? In Unix, yes; that's because filenames in Unix are byte streams with the byte 0x2F acting as a path seperator. > It's hard enough to convince them that a > character is not the same as a byte. That contradicts you above statement, that every programmer needs an indepth knowledge of Unicode. > In case I want to circumvent security or deliberately cause a piece of > software to misbehave. Robustness require unambiguous and simple rules. The rules you are offering are only simple and unambiguous to the programmer; they appear completely random to the end user. To have ≮ sometimes start a tag means that a user can't look at the XML and tell whether something opens a tag or is just text. You might be able to expect all programmers, but you can't expect all end users to. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining character sequences? Something else? > > Neither. Unicode characters. http://www.w3.org/TR/2000/REC-xml-20001006#charsets implies that the appropriate level for parsing XML is code points. In particular XML allows a combining character directly after ">". -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
- Original Message - From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 8:35 PM Subject: Re: Nicest UTF "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character sequences? Something else? See the XML character model document... XML ignores combining sequences. But for Unicode and for XML a character is an abstract character with a single code allocated in a *finite* repertoire. The repertoire of all possible combining character sequences is already infinite in Unicode, as well as the number of "default grapheme clusters" they can represent.
Re: Nicest UTF
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining character sequences? Something else? > > Neither. Unicode characters. What does "Unicode characters" mean? -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit: > > The XML/HTML core syntax is defined with fixed behavior of some > > individual characters like '&', '<', quotation marks, and with special > > behavior for spaces. > > The point is: what "characters" mean in this sentence. Code points? > Combining character sequences? Something else? Neither. Unicode characters. -- "May the hair on your toes never fall out!" John Cowan --Thorin Oakenshield (to Bilbo) [EMAIL PROTECTED]
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The XML/HTML core syntax is defined with fixed behavior of some > individual characters like '&', '<', quotation marks, and with special > behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character sequences? Something else? -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > This implies that every programmer needs an indepth knowledge of > Unicode to handle simple strings. There is no way to avoid that. If the runtime automatically performed NFC on input, then a part of a program which is supposed to pass a string unmodified would sometimes modify it. Similarly with NFD. You can't expect each and every program which compares strings to perform normalization (e.g. Linux kernel with filenames). Perhaps if there was a single normalization format which everybody agreed to, and unnormalized strings were never used for data interchange (if UTF-8 was specified such that to disallow unnormalized data, etc.), things would be different. But Unicode treats both composed and decomposed representations as valid. >> IMHO splitting into graphemes is the job of a rendering engine, not of >> a function which extracts a part of a string which matches a regex. > > So S should _sometimes_ match an accented S? Again, I feel extended misery > of explaining to people why things aren't working right coming on. Well, otherwise things get ambiguous, similarly to these XML issues. Does "\n" followed by a combining code point start a new line? Does a double quote followed by a combining code point start a string literal? Does a slash followed by a combining code point separate subdirectory names? An iterator which delivers whole combining character sequences out of a sequence of code points can be used. You can also manipulate strings as arrays of combining character sequences. But if you insist that this is the primary string representation, you become incompatible with most programs which have different ideas about delimited strings. You can't expect each and every program to check combining classes of processed characters. It's hard enough to convince them that a character is not the same as a byte. >> I expect breakage of XML-based protocols if implementations are >> actually changed to conform to these rules (I bet they don't now). > > Really? In what cases are you storing isolated combining code points > in XML as text? In case I want to circumvent security or deliberately cause a piece of software to misbehave. Robustness require unambiguous and simple rules. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Philippe Verdy wrote: >> Please start adding spaces to your entity references or >> something, because those of us reading this through a web interface >> are getting very confused. > > No confusion possible if using any classic mail reader. > > Blame your ISP (and other ISPs as well like AOL that don't respect the > interoperable standards for plain-text emails) for its poor webmail > interface, that does not properly escape the characters... No harm done in following David's suggestion, though, to help accommodate the mail readers that do this. It's just an e-mail, after all. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
From: "Antoine Leca" <[EMAIL PROTECTED]> Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constrainst are much more heavy.) Also Windows NT Unicode applications know it, because it can't be changed :-). But when it comes to other Windows applications (still the more common) that happen to operate in 'Ansi' mode, they are subject to the hazard of codepage translations. Even if Windows 'knows' the encoding used for the filesystem (as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases it does not even know it, much like with *nix kernels), the only usable set is the _intersection_ of the set used to write and the set used to read; that is, usually, it is restricted to US ASCII, very much like the usable set in *nix cases... True, but this applies to FAT-only filesystems, which happen to store filenames with a "OEM" charset which is not stored explicitly on the volume. This is a known caveat even for Unix, when you look at the tricky details of the support of Windows file sharing through Samba, when the client requests a file with a "short" 8.3 name, that a partition used by Windows is supposed to support. In fact, this nightmare comes from the support in Windows of the compatibility with legacy DOS applications which don't know the details and don't use the Win32 APIs with Unicode support. Note that DOS applications use a "OEM" charset which is part of the user settings, not part of the system settings (see the effects of the command CHCP in a DOS command prompt). FAT32 and NTFS help reconciliate these incompatible charsets because these filesystems also store a "LFN" (Long File Name) for the same files (in that case the short name, encoded in some ambiguous OEM charset, is just an alias, acting exactly like a hard link on Unix created in the same directory that references the same file). "LFN" names are UTF-16 encoded and support mostly the same names as in NTFS volumes. However, on FAT32 volumes, the short names are mandatory, unlike on NTFS volumes where they can be created "on the fly" by the filesystem driver, according to the current user settings for the selected OEM charset, without storing them explicitly on the volume. Windows contains, in CHKDSK, a way to verify that short names of FAT32 filesystems are properly encoded with a coherent OEM charset, using the UTF-16 encoded LFN names as a reference. If needed, corrections for the OEM charset can be applied... This nightmare of incompatible OEM charsets do happen on Windows 98/98SE/ME, when the "autoexec.bat" file that defines the current user profile is not executing as it should the proper "CHCP" command, or when this autoexec.bat file has been modified or erased: in that case, the default OEM charset (codepage 437) is used, and short filenames are incorrectly encoded. Another complexity is that Win32 applications, that use a fixed (not user-settable) "ANSI" charset, and that don't use the Unicode API depend on the conversion from the ANSI charset to the current OEM charset. But if a file is handled through some directory shares via multiple hosts, that have distinct ANSI charsets (i.e. Windows hosts running different localization of Windows, such as a US installation and a French version in the same LAN), the charsets viewed by these hosts will create incompatible encodings on the same shared volume. 
So the only "stable" subset for short names, that is not affected by OS localization or user settings is the intersection of all possible ANSI and OEM charsets that can be set in all versions of Windows! No need to say, this designates only the printable ASCII charset for short 8.3 names. Long filenames are not affected by this problem. Conclusion: to use international characters out of ASCII in filenames used by Windows, make sure that the the name is not in a 8.3 short format, so that a long filename, in UTF-16, will be created on FAT32 filesystems or on SMBFS shares (Samba on Unix/Linux, Windows servers)... Or use NTFS (but then resolve the interoperability problems with Linux/Unix client hosts that can't access reliably, for now, to these filesystems, and that are not completely emulated by Unix filesystems used by Samba, due to the limitation on the LanMan sharing protocol, and limitations of Unix filesystems as well that rarely use UTF-8 as their prefered encoding...)
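A small illustration of the short-name problem in Python (the stored bytes are invented for the example): the same bytes come back as different names under an OEM code page and an "ANSI" code page, which is why only the ASCII intersection is stable.

    raw = b"CAF\x82"             # hypothetical 8.3 name bytes as stored on a FAT volume

    print(raw.decode("cp437"))   # CAFé -- 0x82 is e-acute in OEM code page 437
    print(raw.decode("cp1252"))  # CAF‚ -- 0x82 is a low quotation mark in Windows code page 1252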
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: If it's a broken character reference, then what about Á (769 is the code for combining acute if I'm not mistaken)? Please start adding spaces to your entity references or something, because those of us reading this through a web interface are getting very confused. No confusion possible if using any classic mail reader. Blame your ISP (and other ISPs as well like AOL that don't respect the interoperable standards for plain-text emails) for its poor webmail interface, that does not properly escape the characters used in plain-text emails you receive (and that do NOT contain any HTML entities), but that get inserted blindly within the HTML page they create in their webmail interface. Not only is such a webmail interface bogus, but it is also dangerous, as it allows arbitrary HTML code to run from plain-text emails. Ask for support and press your ISP to correct its server-side scripts so that it will correctly support plain-text emails!
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about < with combining acute, which doesn't have a precomposed form? A broken opening tag or a valid text character? Also a broken opening tag for HTML/XML documents (which are NOT plain text documents, and must be first parsed as HTML/XML, before parsing the many text sections contained in text elements, element names, attribute names, attribute values (etc...) as plain-text under the restrictions specified in the HTML or XML specifications, which contain restrictions, for example, on which characters are allowed in names). The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. This core structure is not plain-text, and cannot be overridden, even by Unicode grapheme clusters. Note that HTML/XML do NOT mandate the use or even the support of Unicode, just the support of a character repertoire that contains some required characters, and the acceptance of at least the ISO/IEC 10646 repertoire under some conditions, however the encoding to code points itself is not required for anything other than numeric character references, which are more symbolic in a way similar to other named character entities in SGML, than absolute as implying the required support of the repertoire with a single code! So you can as well create fully conforming HTML or XML documents using a character set which includes characters not even defined in Unicode/ISO/IEC 10646, or characters defined only symbolically with just a name. Whether this name maps or not to one or more Unicode characters does not change the validity of the document itself. And all the XML/HTML behavior ignores almost all Unicode properties (including normalization properties, because XML and HTML treat different strings, which are still canonically equivalent, as completely distinct; an important feature for cases like XML Signatures, where normalization of documents should not be applied blindly as it would break the data signature). If you want to normalize XML documents, you should not do it with a normalizer working on the whole document as if it was plain-text. Instead you must normalize the individual strings that are in the XML InfoSet, as accessible when browsing the nodes of its DOM tree, and then you can serialize the normalized tree to create a new document (using CDATA sections and/or character references, if needed to escape some syntactic characters reserved by XML that would be present in the string data of DOM tree nodes). Note also that an XML document containing references to Unicode non-characters would still be well-formed, because these characters may be part of a non-Unicode charset. XML document validation is a separate and optional problem from XML parsing which checks well-formedness and builds a DOM tree: validation is only performed when matching the DOM tree according to a schema definition, DTD or XSD, in which additional restrictions on allowed characters may be checked, or in which additional symbolic-only "characters" may be defined and used in the XML document with parsable named entities similar to: "&gt;".
(An example: the schema may contain a definition for a "character" representing a private company logo, mapped to a symbolic name; the XML document can contain such references, but the DTD may also define an encoding for it in a private charset, so that the XML document will directly use that code; the Apple logo in Macintosh charsets is an example, for which an internal mapping to Unicode PUAs is not sufficient to allow correct processing of multiple XML documents, where PUAs used in each XML documents have no equivalence; the conversion of such documents to Unicode with these PUAs is a lossy conversion, not suitable for XML data processing).
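A minimal sketch of that approach with Python's bundled DOM: normalize the character data held in the parsed tree, never the serialized document. This is deliberately incomplete -- attribute values, element and attribute names, and re-escaping on output are left out.

    import unicodedata
    from xml.dom import minidom

    def normalize_text_nodes(node):
        # Apply NFC only to character data; the markup itself is never touched.
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            node.data = unicodedata.normalize("NFC", node.data)
        for child in node.childNodes:
            normalize_text_nodes(child)

    doc = minidom.parseString("<a>e\u0301</a>")
    normalize_text_nodes(doc.documentElement)
    print(doc.documentElement.toxml())    # <a>é</a> -- only the text node was normalized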
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Antoine Leca Sent: 09 December 2004 11:29 To: Unicode Mailing List Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF) Windows filesystems do know what encoding they use. Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constrainst are much more heavy.) Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's not like MS-DOS is still terrifically popular these days. But when it comes to other Windows applications (still the more common) that happen to operate in 'Ansi' mode, they are subject to the hazard of codepage translations. Sure, but this has got nothing to do with the filesystem. The Windows filesystem(s) store filenames in those disk sectors which are reserved for file headers, and in these locations they are stored using sixteen-bit wide code units. (I assume this can only be UTF-16?). Thus, "Windows file systems do know what encodings they use" seems to me to be a correct statement. The fact that applications can still open files using the legacy fopen() call (which requires char*, hence 8-bit-wide, strings) is kind of irrelevant. If the user creates a file using fopen() via a code page translation, AND GETS IT WRONG, then the file will be created with Unicode characters other than those she intended - but those characters will still be Unicode and unambiguous, no? that is, usually, it is restricted to US ASCII, very much like the usable set in *nix cases... [OFF TOPIC] Why do so many people call it "US ASCII" anyway? Since "ASCII" comprises that subset of Unicode from U+0000 to U+007F, it is not clear to me in what way "US-ASCII" is different from ASCII. It's bad enough for us non-Americans that the A in ASCII already stands for "American", but to stick "US" on the front as well is just... Anyway, back to the discussion on US-Unicode...
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
On Monday, December 6th, 2004 20:52Z John Cowan va escriure: > Doug Ewell scripsit: > >>> Now suppose you have a UNIX filesystem, containing filenames in a >>> legacy encoding (possibly even more than one). If one wants to >>> switch to UTF-8 filenames, what is one supposed to do? Convert all >>> filenames to UTF-8? >> >> Well, yes. Doesn't the file system dictate what encoding it uses for >> file names? How would it interpret file names with "unknown" >> characters from a legacy encoding? How would they be handled in a >> directory search? > > Windows filesystems do know what encoding they use. Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constrainst are much more heavy.) Also Windows NT Unicode applications know it, because it can't be changed :-). But when it comes to other Windows applications (still the more common) that happen to operate in 'Ansi' mode, they are subject to the hazard of codepage translations. Even if Windows 'knows' the encoding used for the filesystem (as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases it does not even know it, much like with *nix kernels), the only usable set is the _intersection_ of the set used to write and the set used to read; that is, usually, it is restricted to US ASCII, very much like the usable set in *nix cases... Antoine
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Lars responded: > > ... Whatever the solutions > > for representation of corrupt data bytes or uninterpreted data > > bytes on conversion to Unicode may be, that is irrelevant to the > > concerns on whether an application is using UTF-8 or UTF-16 > > or UTF-32. > The important fact is that if you have an 8-bit based program, and you > provide a locale to support UTF-8, you can keep things working (unless you ^^^ You can keep *some* things *sorta* working. If you don't make the effort to actually upgrade software to use the standard *conformantly*, then it is no real surprise when data corruptions creep in, characters get mislaid, and some things don't work the way they should. > prescribe validation). But you cannot achieve the same if you try to base > your program on 16 or 32 bit strings. Of course you can. You just have to rewrite the program to handle 16-bit or 32-bit strings correctly. You can't pump them through 8-bit pipes or char* API's, but it's just silly to try that, because they are different animals to begin with. By the way, I participated as an engineer in a multi-year project that shifted an advanced, distributed data analysis system from an 8-bit character set to 16-bit Unicode. *All* user-visible string processing was converted over -- and that included proprietary file servers, comm servers, database gateways, networking code, a proprietary 32-bit workstation GUI implementation, and a suite of object-oriented application tools, including a spreadsheet, plotting tool, query and database reporting tools, and much more. It worked cross-platform, too. It was completed, running, and *delivered* to customers in 1994, a decade ago. You can't bamboozle me with any of this "it can't be done with 16-bit strings" BS. > Or, again, you really cannot with 16 > bit (UTF-16), Yes you can. > and you sort of can with 32 bit (UTF-32), but must resort to > values above 21 bits. No, you need not -- that is non-conformant, besides. > Again, nothing standardized there, nothing defined for > how functions like isspace should react and so on. That is wrong, too. The standard information that people seek is in the Unicode Character Database: http://www.unicode.org/Public/UNIDATA/ And there are standard(*) libraries such as ICU that public API's for programs to use to get the kind of behavior they need. (*) Just because a library isn't an International Standard does not mean that it is not a de facto standard that people can and do rely upon for such program behavior. You can't expect to just rely upon the C or C++ standards and POSIX to solve all your application problems, but there are perfectly good solutions working out there, in UTF-8, in UTF-16, and in UTF-32. (Or in combinations of those.) > And it's about the fact that it is far more likely that this > happens to UTF-8 data (or that some legacy data is mistakenly labelled or > assumed to be UTF-8). > UTF-16 data is far cleaner than 8-bit data. Basically because you had to > know the encoding in order to store the data in UTF-16. Actually, I think this should be characterized as software engineers writing software for UTF-16 are likely to do a better job of handling characters, because they have to, whereas a lot of stuff using UTF-8 just slides by, because people think they can ignore character set issues long enough, so that when the problem occurs, it can no longer be traced to mistakes they made or that they are still held responsible for. ;-) > UTF-8 is what solved the problems on UNIX. It allowed UNIX to process > Windows data. 
Alongside its own. > It is Windows that has problems now. And I think roundtripping is the > solution that will allow Windows to process UNIX data. Without dropping data > or raising exceptions. Alongside its own. I just don't understand these assertions at all. First of all it isn't "UNIX data" or "Windows data" -- it is end user's data, which happens to be processed in software systems which in turn are running on a UNIX or Windows OS. I work for a company that *routinely* runs applications that cross the platform barriers in all sorts of ways. It works because character sets are handled conformantly, and conversions are done carefully at platform boundaries -- not because some hack has been added to UTF-8 to preserve data corruptions. > > There's more to it, of course, but this is, I believe, as the > > bottom of the reason why, for 12 years now, people have been > > fundamentally misunderstanding each other about UTF-8. > Is it 12? Thought it was far less. Yes. The precursor of UTF-8 was dreamed up around 1992. > Off topic, when was UTF-8 added to > Unicode standard? In Unicode 1.1, Appendix F, then known as "FSS-UTF", in 1993. > Quite close. Except for the fact that: > * U+EE93 is represented in UTF-32 as 0xEE93 > * U+EE93 is represented in UTF-16 as 0xEE93 > * U+EE93 is represented in UTF
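For comparison only (this is not what Lars proposes, and it postdates this thread): Python eventually standardized a similar round-trip idea as the "surrogateescape" error handler, which smuggles uninterpretable bytes through str values and restores them on encoding.

    raw = b"caf\xe9.txt"                                   # Latin-1 bytes mislabeled as UTF-8
    name = raw.decode("utf-8", errors="surrogateescape")
    print(ascii(name))                                     # 'caf\udce9.txt'
    print(name.encode("utf-8", errors="surrogateescape") == raw)   # True: the original bytes come back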
Re: Nicest UTF
Marcin asked: > The general trouble is that numeric character references can only > encode individual code points By design. > rather than graphemes (is this a correct > term for a non-combining code point with a sequence of combining code > points?). No. The correct term is "combining character sequence". TUS 4.0, p. 70, D17. The correct NCR representation of a combining character sequence is a sequence of NCR's. -- Not too surprisingly. --Ken > So if XML is supposed to be treated as a sequence of > graphemes, weird effects arise in the above boundary cases...
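In other words (a tiny Python sketch; the helper name is mine), the NCR form of a combining character sequence is simply one NCR per code point:

    def to_ncrs(s: str) -> str:
        return "".join(f"&#{ord(c)};" for c in s)

    print(to_ncrs("a\u0301"))   # &#97;&#769;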
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: > If it's a broken character reference, then what about Á (769 is > the code for combining acute if I'm not mistaken)? Please start adding spaces to your entity references or something, because those of us reading this through a web interface are getting very confused. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" writes: > String equality in a programming language should not treat composed > and decomposed forms as equal. Not this level of abstraction. This implies that every programmer needs an indepth knowledge of Unicode to handle simple strings. The concept makes me want to replace Unicode; spending the rest of my life explaining to programmers, and people who use their programs, why a search for "Römishe Elegien" isn't bringing the book is not my idea of happiness. > IMHO splitting into graphemes is the job of a rendering engine, not of > a function which extracts a part of a string which matches a regex. So S should _sometimes_ match an accented S? Again, I feel extended misery of explaining to people why things aren't working right coming on. > They are supposed to be equivalent when they are actual characters. > What if they are numeric character references? Should "≮" > (7 characters) represent a valid plain-text character or be a broken > opening tag? Which 7 characters? My email "client" turned them into the actual characters. But I think it's fairly obvious that XML added entities in part so you could include '<'s and other characters without them getting interpreted as part of the text of the document. Similarly, a combining character entity following an actual < should be the start of a tag. >Note that if it's a valid plain-text character, it's impossible >to represent isolated combining code points in XML, No more then it's impossible to represent '<' in the text. > I expect breakage of XML-based protocols if implementations are > actually changed to conform to these rules (I bet they don't now). Really? In what cases are you storing isolated combining code points in XML as text? I can think of hypothetical cases, but most real-world use isn't going to be affected. If I were designing such an XML protocol, I'd probably store it as a decimal number anyway; XML is designed to be human-readable, and an isolated combining character that randomly combines with other characters that it's not logically associated with when displayed isn't particularly human readable. > Implementing an API which works in terms of graphemes over an API > which works in terms of code points is more sane than the converse, > which suggests that the core API should use code points if both APIs > are sometimes needed at all. Implementing an API which works in terms of lists over an API which works in terms of pointers is more sane than the converse, which suggests that the core API should use pointers if both APIs are sometimes needed at all. > While I'm not obsessed with efficiency, it would be nice if changing > the API would not slow down string processing too much. Who knows how much it would slow down string processing? If I get around to writing the test code, I'll try and see how much it slows stuff down, but right now we don't know. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
John Cowan <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > Well, that assumes that there's a special "string equality" predicate, > as distinct from just having various predicates that DWIM. No, I meant the default generic equality predicate when applied to two strings. > It's a broken opening tag. Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about < with combining acute, which doesn't have a precomposed form? A broken opening tag or a valid text character? What about &#65;ACUTE where ACUTE stands for combining acute? Is this A with acute, or a broken character reference which ends with an accented semicolon? If it's a broken character reference, then what about &#65;&#769; (769 is the code for combining acute if I'm not mistaken)? If *this* is A with acute, then it's inconsistent: here combining accents are processed after resolving numeric character references, and previously it was in the opposite order. OTOH if this is something else, then it's impossible to represent letters without precomposed forms with numeric character references. The general trouble is that numeric character references can only encode individual code points rather than graphemes (is this a correct term for a non-combining code point with a sequence of combining code points?). So if XML is supposed to be treated as a sequence of graphemes, weird effects arise in the above boundary cases... -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit: > String equality in a programming language should not treat composed > and decomposed forms as equal. Not this level of abstraction. Well, that assumes that there's a special "string equality" predicate, as distinct from just having various predicates that DWIM. In a Unicode Lisp implementation, e.g., equal might be char-by-char equality and equalp might not. > They are supposed to be equivalent when they are actual characters. > What if they are numeric character references? Should "≮" > (7 characters) represent a valid plain-text character or be a broken > opening tag? It's a broken opening tag. > Note that if it's a valid plain-text character, it's impossible > to represent isolated combining code points in XML, It's problematic to represent the *specific* combining code point when it appears immediately after a tag. -- Don't be so humble. You're not that great. John Cowan --Golda Meir[EMAIL PROTECTED]
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: > The semantics there are surprising, but that's true no matter what you > do. An NFC string + an NFC string may not be NFC; the resulting text > doesn't have N+M graphemes. Which implies that automatically NFC-ing strings as they are processed would be a bad idea. They can be NFC-ed at the end of processing if the consumer of this data will demand this. Especially if other consumers would want NFD. String equality in a programming language should not treat composed and decomposed forms as equal. Not this level of abstraction. IMHO splitting into graphemes is the job of a rendering engine, not of a function which extracts a part of a string which matches a regex. > If you do so with an language that includes <, you violate the Unicode > standard, because ≮ (not <) and ≮ are canonically equivalent. I think that Unicode tries to push implications of "equivalence" too far. They are supposed to be equivalent when they are actual characters. What if they are numeric character references? Should "≮" (7 characters) represent a valid plain-text character or be a broken opening tag? Note that if it's a valid plain-text character, it's impossible to represent isolated combining code points in XML, and thus it's impossible to use XML for transportation of data which allows isolated combining code points (except by introducing custom escaping of course, e.g. transmitting decimal numbers instead of characters). I expect breakage of XML-based protocols if implementations are actually changed to conform to these rules (I bet they don't now). OTOH if it's not a valid plain-text character, then conversion between numeric character references and actual characters is getting more hairy. > I'll see if I have time after finals to pound out a basic API that > implements this, in Ada or Lisp or something. My language is quite similar to Lisp semantically. Implementing an API which works in terms of graphemes over an API which works in terms of code points is more sane than the converse, which suggests that the core API should use code points if both APIs are sometimes needed at all. While I'm not obsessed with efficiency, it would be nice if changing the API would not slow down string processing too much. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler scripsit: > A Sybase ASE database has the same behavior running on Windows as > running on Sun Solaris or Linux, for that matter. Fair enough. > UNIX filenames are just one instance of this. However, although they are *technically* octet sequences, they are *functionally* character strings. That's the issue. > Failing that, then BINARY fields *are* the appropriate > way to deal with arbitrary arrays of bytes that cannot > be interpreted as characters. This is purism. All the filenames on my Unix system, for example, can be interpreted as character strings; the potential to create filenames that can't be is unutilized, and sensibly so. For that matter, the potential to create files containing C0 controls is also unutilized. > > in the same way that it would > > be overkill to encode all 8-bit strings in XML using Base-64 > > just because some of them may contain control characters that are > > illegal in well-formed XML. > > Dunno about the XML issue here -- you're the expert on what > the expected level of illegality in usage is there. XML's policy is zero tolerance, both for illegal encodings and for illegal characters such as U+0001. So in order to be *100% sure* that a character string (ASCII, Latin-1, or UTF-*, it matters not) can be put into an XML document, one must treat it as binary and encode it as such, using QP or Base64 or what have you. But nobody does. XML 1.1 allows the representation of every Unicode character except U+0000, which materially reduces the problem, but there is little support for XML 1.1 as yet. In any case, this case is only an analogy, not an exact equivalent: the problem of representing illegal *characters* in an XML document is closely analogous to the problem of representing illegal *bytes* in a character string. > The point I'm making is that *whatever* you do, you are still > asking for implementers to obey some convention on conversion > failures for corrupt, uninterpretable character data. > My assessment is that you'd have no better success at making > this work universally well with some set of 128 magic bullet > corruption pills on Plane 14 than you have with the > existing Quoted-Unprintable as a convention. It doesn't have to work universally; indeed, it becomes a QOI issue. Allocating representations of bytes with "bits that are high" makes it possible to do something recoverable, at very little expense to the Unicode Consortium. > Further, as it turns out that Lars is actually asking for > "standardizing" corrupt UTF-8, a notion that isn't going to > fly even two feet, I think the whole idea is going to be > a complete non-starter. I agree that that part won't fly, absolutely. -- In politics, obedience and support John Cowan <[EMAIL PROTECTED]> are the same thing. --Hannah Arendt http://www.ccil.org/~cowan
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: > "D. Starner" <[EMAIL PROTECTED]> writes: > > > You could hide combining characters, which would be extremely useful if we > > were just using Latin > > and Cyrillic scripts. > > It would need a separate API for examining the contents of a combining > character. You can't avoid the sequence of code points completely. Not a seperate API; a function that takes a character and returns an array of integers. > It would yield to surprising semantics: for example if you concatenate > a string with N+1 possible positions of an iterator with a string with > M+1 positions, you don't necessarily get a string with N+M+1 positions > because there can be combining characters at the border. The semantics there are surprising, but that's true no matter what you do. An NFC string + an NFC string may not be NFC; the resulting text doesn't have N+M graphemes. Unless you're explicitly adding a combining character, a combining character should never start a string. This could be fixed several ways, including by inserting a dummy character to hold the combining character, and "normalizing" the string by removing the dummy characters. That would, for the most part, only hurt pathological cases. > It would impose complexity in cases where it's not needed. Most of the > time you don't care which code points are combining and which are not, > for example when you compose a text file from many pieces (constants > and parts filled by users) or when parsing (if a string is specified > as ending with a double quote, then programs will in general treat a > double quote followed by a combining character as an end marker). If you do so with an language that includes <, you violate the Unicode standard, because ≮ (not <) and ≮ are canonically equivalent. You've either got to decompose first or look at the individual characters as a whole instead of looking at code points. Has anyone considered this while defining a language? How about the official standards bodies? Searching for XML in the archives is a bit unhelpful, and UTR #20 doesn't mention the issue. Your solution is just fine if you're considering the issue on the bit level, but it strikes me as the wrong answer, and I would think that it would surprising to a user that didn't understand Unicode, especially in the ≮ case. A warning either way would be nice. I'll see if I have time after finals to pound out a basic API that implements this, in Ada or Lisp or something. It's not going to be the most efficient thing, but I doubt it's going to be a big difference for most programs, and if you want C, you know where to find it. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan responded: > > Storage of UNIX filenames on Windows databases, for example, ^^ O.k., I just quoted this back from the original email, but it really is a complete misconception of the issue for databases. "Windows databases" is a misnomer to start with. There are some databases, like Access, that are Windows-only applications, but most serious SQL databases in production (DB2, Oracle, Sybase ASE and ASA, and so on) are crossplatform from the get go, and have their *own* rules for what can and cannot legitimately be stored in data fields, independent of what platform you are running them on. A Sybase ASE database has the same behavior running on Windows as running on Sun Solaris or Linux, for that matter. > > can be done with BINARY fields, which correctly capture the > > identity of them as what they are: an unconvertible array of > > byte values, not a convertible string in some particular > > code page. > > This solution, however, is overkill, Actually, I don't think it is. One of the serious classes of fundamental errors that database administrators and database programmers run into when creating global applications is ignoring or misconstruing character set issues. In a database, if I define the database (or table or field) as containing UTF-8 data, it damn well better have UTF-8 data in it, or I'm just asking for index corruptions, data corruptions or worse -- and calls from unhappy customers. When database programmers "lie" to the database about character sets, by setting a character set to Latin-1, say, and then pumping in data which is actually UTF-8, for instance, expecting it to come back out unchanged with no problems, they are skating on very thin ice ... which usually tends to break right in the middle of some critical application during a holiday while your customer service desk is also down. ;-) Such "lying to the database" is generally the tactic of first resort for "fixing" global applications when they start having to deal with mixed Japanese/European/UTF-8 data on networks, but it is clearly a hack for not understanding and dealing with the character set architecture and interoperability problems of putting such applications together. UNIX filenames are just one instance of this. The first mistake is to network things together in ways that create a technical mismatch between what the users of the localized systems think the filenames mean and what somebody on the other end of such a system may end up interpreted the bag o' bytes to mean. The application should be constructed in such a way that the locale/charset state can be preserved on connection, with the "filename" interpreted in terms of characters in the realm that needs to deal with it that way, and restored to its bag o' bytes at the point that needs it that way. If you can't do that reliably with a "raw" UNIX set of applications, c'est la vie -- you should be building more sophisticated multi-tiered applications on top of your UNIX layer, applications which *can* track and properly handle locale and character set identities. Failing that, then BINARY fields *are* the appropriate way to deal with arbitrary arrays of bytes that cannot be interpreted as characters. Trying to pump them into UTF-8 text data fields and processing them as such when they *aren't* UTF-8 text data is lying to the database and basically forfeiting your warranty that the database will do reasonable things with that data. 
It's as stupid as trying to store date or numeric types in text data fields without first converting them to formatted strings of text data. > in the same way that it would > be overkill to encode all 8-bit strings in XML using Base-64 > just because some of them may contain control characters that are > illegal in well-formed XML. Dunno about the XML issue here -- you're the expert on what the expected level of illegality in usage is there. But for real database applications, there are usually mountains and mountains of stuff going on, most of it completely orthogonal to something as conceptually straightforward as maintaining the correct interpretation of a UNIX filename. It isn't really overkill, in my opinion, to design the appropriate tables and metadata needed for ensuring that your filename handling doesn't blow up somewhere because you've tried to do an UPDATE on a UTF-8 data field with some random bag o' bytes that won't validate as UTF-8 data. > > > In my opinion, trying to do that with a set of encoded characters > > (these 128 or something else) is *less* likely to solve the > > problem than using some visible markup convention instead. > > The trouble with the visible markup, or even the PUA, is that > "well-formed filenames", those which are interpretable as > UTF-8 text, must also be encoded so as to be sure any > markup or PUA that naturally appears in the filename is > escaped properly. This is essentially the Quoted-Printable > encoding, whic
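The "don't lie to the database" rule comes down to a very small gate at the boundary; a sketch in Python (the column labels are invented):

    def classify_filename(raw: bytes):
        # Decide whether a UNIX filename can go into a UTF-8 text column or must stay binary.
        try:
            return ("utf8_text_column", raw.decode("utf-8"))
        except UnicodeDecodeError:
            return ("binary_column", raw)

    print(classify_filename(b"caf\xc3\xa9"))   # ('utf8_text_column', 'café')
    print(classify_filename(b"caf\xe9"))       # ('binary_column', b'caf\xe9')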
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Kenneth Whistler wrote: > I'm going to step in here, because this argument seems to > be generating more heat than light. I agree, and I thank you for that. > First, I'm going to summarize what I think Lars Kristan is > suggesting, to test whether my understanding of the proposal > is correct or not. > > I do not think this is a proposal to amend UTF-8 to allow > invalid sequences. So we should get that off the table. At least until we all understand everything else about this issue. > > What I think this suggestion is is for adding 128 characters > to represent byte values in conversion to Unicode when the > byte values are uninterpretable as characters. Why 128 instead > of 256 I find a little mysterious, but presumably the intent > is to represent 0x80..0xFF as raw, uninterpreted byte values, > unconvertible to Unicode characters otherwise. Indeed, the full 256 codepoints could and perhaps even should be assigned for this purpose. The low 128 may in fact have a different purpose, and different handling. But I would delay this discussion also. > > This is suggested by Lars' use case of: > > > Storing UNIX filenames in a Windows database. > > ... since UNIX filenames are simply arrays of bytes, and cannot, > on interconnected systems, necessarily be interpreted in terms > of well-defined characters. > > Apparently Lars is currently using PUA U+E080..U+E0FF > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping > of byte values uninterpretable as characters to be converted, and > is asking for standard Unicode values for this purpose, instead. Yes. And, yes, it's U+EE80..U+EEFF. > > The other use case that Lars seems to be talking about are > existing documents containing data corruptions in them, which > can often happen when Latin-1 data gets dropped into UTF-8 data > or vice versa due to mislabeled email or whatever. Yes. One could argue that the need for the first use will gradually go away, that's why I also use this second example. Although, I think the first problem is underestimated. And is not limited to my example. And can have much more serious consequences. And might not go away anytime soon. > And I am assuming this is referring primarily to the second case, > where the extreme scenario Lars is envisioning would be, for > example, where each point in a system was hyper-alert to > invalid sequences and simply tossed or otherwise sequestered > entire documents if they got these kinds of data corruptions > in them. And in such a case, I can understand the concern about > angry users. How many people on this list would be cursing if > every bit of email that had a character set conversion error in > it resulting in some bit hash or other, simply got tossed in the > bit bucket instead of being delivered with the glorious hash > intact, at least giving you the chance to see if you could > figure out what was intended? The two aspects of the problem are not always clearly distinct. But yes, let's say it's the second one. I had the need to solve the first problem, not the second one. So some of what I say about this second one is somewhat theoretical. But also realistic, I hope. Or fear. > > This is, I think the basic point at which people are talking past each > other. > > Notionally, Doug is correct that UTF-8 and UTF-16 are equivalent > encoding forms, and anything represented (correctly) in one can > be represented (correctly) in the other. 
In that sense, there is > no difference between representation of text in UTF-8 or UTF-16, > and no reason to postulate that a "UTF-8 based program" will have > any advantages or disadvantages over a "UTF-16 based program" when > it comes to dealing with corrupted data. > > What Lars is talking about is a broad class of UNIX-based software > which is written to handle strings essentially as > opaque bags of bytes, not caring what they contain for many > purposes. Such software generally keeps working just fine if you > pump UTF-8 at it, which is by design for UTF-8 -- precisely because > UTF-8 leaves untouched all the 0x00..0x7F byte values that may > have particular significance for those processes. Most of that > software treats 0x80..0xFF just as bit hash from the get-go, and > neither cares nor has any way of knowing if the particular > sequence of bit hash is valid UTF-8 or Shift-JIS or Latin-1 or > EUC-JIS or some mix or whatever. Yes. With a couple of additions. It is not true that most of that software doesn't care about the encoding. Copy or cat really don't need to, but more does, to count the lines properly (needs to know the number of outputted glyphs or whatever they are, in
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: > You could hide combining characters, which would be extremely useful if > we were just using Latin and Cyrillic scripts. It would need a separate API for examining the contents of a combining character. You can't avoid the sequence of code points completely. It would yield to surprising semantics: for example if you concatenate a string with N+1 possible positions of an iterator with a string with M+1 positions, you don't necessarily get a string with N+M+1 positions because there can be combining characters at the border. It's simpler to overlay various grouping styles on top of a sequence of code points than to start with automatically combined combining characters and process inwards and outwards from there (sometimes looking inside characters, sometimes grouping them even more). It would impose complexity in cases where it's not needed. Most of the time you don't care which code points are combining and which are not, for example when you compose a text file from many pieces (constants and parts filled by users) or when parsing (if a string is specified as ending with a double quote, then programs will in general treat a double quote followed by a combining character as an end marker). I believe code points are the appropriate general-purpose unit of string processing. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) > Needless to say, these systems were badly designed at their > origin, and > newer filesystems (and OS APIs) offer much better > alternatives, by either > storing explicitly on volumes which encoding it uses, or by > forcing all > user-selected encodings to a common kernel encoding such as > Unicode encoding > schemes (this is what FAT32 and NTFS do on filenames created > under Windows, > since Windows 98 or NT). > The UNIX (I also call it variant) principle has a problem of not knowing the encoding. The Windows (I also call it invariant) principle has a problem that it HAS to know the encoding. The Windows principle has another problem: it can store data from any encoding, and it also does a good job of trying to represent the data in any encoding, but it cannot guarantee identification in just any encoding. An invariant store can be implemented as UTF-8 or UTF-16. Windows uses UTF-16, and guaranteed identification used to be only possible in UTF-16. Due to UTF-8, now it can also be done in 8-bit (console, telnet). But for some reason, support for UTF-8 is still limited in some areas. And the missing roundtrip capability may have something to do with it. I basically agree that the variant approach is not a good one. But the invariant one is not an easy path. It was easier for Windows to take it, because at the time the transition was made, those systems were still single user. Hence, typically all data was in a single encoding. Lars
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Doug Ewell wrote: > How do file names work when the user changes from one SBCS to another > (let's ignore UTF-8 for now) where the interpretation is > different? For > example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, > but U+0102, > A with breve (Ă) in ISO 8859-2. If a file name contains byte > C3, is its > name different depending on the current locale? It displays differently, but compares the same. Whether or not it is the same name is a philosophical question. > Is it > accessible in all > locales? Typically, yes for all SBCS, but not really guaranteed for all MBCS. Depends on whether you validate the string or not. The way UNIX is being developed, those files are typically still accessible since the programs are still working with 8-bit strings. And that is what I am saying. A UTF-8 program (a hypothetical 'UNIX Commander 8') would have no problems accessing the files. A UTF-16 program (a hypothetical 'UNIX Commander 16') on the other hand would have problems. > (Not every SBCS defines a character at every code point. > There's no C3 in ISO 8859-3, for example.) It works just like unassigned codepoints in Unicode work. How they are displayed is not defined, but they can be passed around and compared for equality. Collation is again not defined, but simple sorting does give useful results. > > Does this work with MBCS other than UTF-8? I know you said > other MBCS, > like Shift-JIS, are not often used alongside other encodings except > ASCII, but we can't guarantee that since we're not in a perfect world. > :-) What if they were? I don't know if and how much they were. But I am assuming UTF-8 would be used alongside other encodings on a much larger scale. At least that's what we are hoping for, aren't we? Of course it would be even better if we were only using UTF-8 (or any other Unicode format), but the transition has to come first. > I fear Ken is not correct when he says you are not arguing for the > legalization of invalid UTF-8 sequences. I am arguing for a mechanism that allows processing invalid UTF-8 sequences. For those who need to do so. You can still think of them as invalid. Exactly what they will be called and to what extent they will be discouraged still needs to be investigated and defined. > This isn't about UTF-8 versus other encoding forms. UTF-8-based > programs will reject these invalid sequences because they don't map to > code points, and because they are supposed to reject them. The problem is, until now a text editor typically preserved all data if a file was opened and saved immediately. Even binary data. And the data could be interpreted as Latin 1, Latin 2, ... But you cannot interpret the data as UTF-8 and preserve all the data at the same time. Well, actually it is possible, which is exactly what I am saying is the advantage of UTF-8. But if you insist on validation, you break it. Fine, you get your Unicode world, and UTF-16 is then just as good as UTF-8. But you are now losing data where previously it wasn't lost. Well, you'd better remember to put a disclaimer in your license agreement... > > Besides, surrogates are not completely interchangeable. > Frankly, they > are, but do not need to be, right? > > They are not completely. In UTF-8 and UTF-32, they are not allowed at > all. In UTF-16, they may only occur in the proper context: a high > surrogate may only occur before a low surrogate, and a low > surrogate may > only appear after a high surrogate.
No other usage of surrogates is > permitted, because if unpaired surrogates could be interpreted, the > interpretation would be ambiguous. Well, yes, that's the theory. But as usual, I look at how things that are not defined yet work. From the algorithms, unpaired surrogates convert pretty well. Unless they start to pair up, of course. But there are cases where one knows they cannot (no concatenation is done). Let me bring up one issue again. I want to standardize a mechanism that allows a roundtrip for 8-bit data. And I already stated that by doing that, you lose the roundtrip for 16-bit data. Now I ask myself again, is that true? Yes and no. For the case I mentioned above (no concatenation), roundtrip is currently really possible. But generally speaking, it is not always possible. And last but not least, you don't even care for it, right? Good, because that means my proposal doesn't make anything worse. > I admit my error with regard to the handling of file names by > Unix-style > file systems, and I appreciate being set straight. Sorry for rubbing it in, but ... could it be that a lot of conclusions you have about what Unicode should or should not be are also wrong if they were based on such incorrect assumptions?
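Doug's 0xC3 question above is easy to check concretely (Python codecs used purely as a convenient illustration; nothing here is specific to any particular filesystem): the stored byte is the name, only its display changes with the locale, and it is a strict UTF-8 decoder that first refuses it.

    name = b"\xc3"
    print(name.decode("iso-8859-1"))    # 'Ã'  (U+00C3, A with tilde)
    print(name.decode("iso-8859-2"))    # 'Ă'  (U+0102, A with breve)
    print(name == b"\xc3")              # True: comparison happens on the bytes,
                                        # so the file is found under either locale
    try:
        name.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not valid UTF-8:", e)    # a lone 0xC3 is an incomplete sequence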
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Lars Kristan wrote: > I never said it doesn't violate any existing rules. Stating that it > does, doesn't help a bit. Rules can be changed. Assuming we understand > the consequences. And that is what we should be discussing. By stating > what should be allowed and what should be prohibited you are again > defending those rules. I agree, rules should be defended, but only up > to a certain point. Simply finding a rule that is offended is not > enough to prove something is bad or useless. In my opinion, these are rules that should not be broken or changed, NOT because changing the rules is inherently bad but because these particular changes would cause more problems than they would solve. In my opinion. > Defining Unicode as the world of codepoints is a complex task on its > own. It seems that you are afraid of stepping out of this world, since > you do not know what awaits you there. So, it is easier to find an > excuse within existing rules, especially if a proposed change > threatens to shake everything right down to the foundation. If I would > be dealing with Unicode (as we know it), I would probably be doing the > same thing. I ask you to step back and try to see the big picture. My objection to this has nothing to do with being some kind of conservative fuddy-duddy who is afraid to think outside the box. >> Do you have a use case for this? > > Yes, I definitely have. I am the one accusing you of living in a > perfect world, remember? Yes, I remember. Thank you. > Do you think I would do that if I wasn't dealing with this problem in > real life? The problem seems to be that you have file names in a Unix or Unix-like file system, where names are stored as uninterpreted bytes (thanks to everyone who pointed this out; I have learned something), and these bytes need to remain valid if the locale specifies UTF-8 and the bytes don't make a valid UTF-8 sequence. Right? How do file names work when the user changes from one SBCS to another (let's ignore UTF-8 for now) where the interpretation is different? For example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, but U+0102, A with breve (Ă) in ISO 8859-2. If a file name contains byte C3, is its name different depending on the current locale? Is it accessible in all locales? (Not every SBCS defines a character at every code point. There's no C3 in ISO 8859-3, for example.) Does this work with MBCS other than UTF-8? I know you said other MBCS, like Shift-JIS, are not often used alongside other encodings except ASCII, but we can't guarantee that since we're not in a perfect world. :-) What if they were? If you have a UTF-8 locale, and file names that contain invalid UTF-8 sequences, how would you address those files in a locale-aware way? This is similar to the question about the file with byte C3, which is Ã in one locale, Ă in another, and an unassigned code point in a third. > It is the current design that is unfair. A UTF-16 based program will > only be able to process valid UTF-8 data. A UTF-8 based program will > in many cases preserve invalid sequences even without any effort. I fear Ken is not correct when he says you are not arguing for the legalization of invalid UTF-8 sequences. > Let me guess, you will say it is a flaw in the UTF-8 based program. Good guess. Unicode and ISO/IEC 10646 say it is, and I say it is. > If validation is desired, yes. But then I think you would want all > UTF-8 based programs to do that. That will not happen.
What will > happen is that UTF-8 based programs will be better text editors > (because they will not lose data or constantly complain), while UTF-16 > based programs will produce cleaner data. You will opt for the latter. > And I for the former. But will users know exactly what they've got? > Will designers know exactly what they're gonna get? This is where all > this started. I stated that there is an important difference between > deciding for UTF-8 or for UTF-16 (or UTF-32). This isn't about UTF-8 versus other encoding forms. UTF-8-based programs will reject these invalid sequences because they don't map to code points, and because they are supposed to reject them. > BTW, you have mixed up source and target. Or I don't understand what > you're trying to say. You are right. I spoke of translating German to French, when the example was about going the other way. I made a mistake. > Besides, surrogates are not completely interchangeable. Frankly, they > are, but do not need to be, right? They are not completely. In UTF-8 and UTF-32, they are not allowed at all. In UTF-16, they may only occur in the proper context: a high surrogate may only occur before a low surrogate, and a low surrogate may only appear after a high surrogate. No other usage of s
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote: > I do not think this is a proposal to amend UTF-8 to allow > invalid sequences. So we should get that off the table. I hope you are right. > Apparently Lars is currently using PUA U+E080..U+E0FF > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping > of byte values uninterpretable as characters to be converted, and > is asking for standard Unicode values for this purpose, instead. If I understand correctly, he is using these PUA values when the data is in UTF-16, and using bare high-bit bytes (i.e. invalid UTF-8 sequences) when the data is in UTF-8, and expecting to convert between the two. That has at least two bad implications: (1) the PUA characters would not round-trip from UTF-8 to UTF-16 to UTF-8, but would be converted to the bare high-bit bytes, and (2) the bare high-bit bytes might or might not accidentally form valid UTF-8 sequences, which means they might not round-trip either. > Say a process gets handed a "UTF-8" string that contains the > byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>. > ^^ ^^ > > The 93 and 94 are just corrupt data -- it cannot be interpreted > as UTF-8, and may have been introduced by some process that > screwed up smart quotes from Code Page 1252 and UTF-8, for > example. Interpreting the string, we have: > > <U+0061, U+0062, U+0063, [0x93], U+004D, U+0430, U+4E8C, U+10302, [0x94]> > > Now *if* I am interpreting Lars correctly, he is using 128 > PUA code points to *validly* contain any such byte, so that > it can be retained. If the range he is using is U+EE80..U+EEFF, > then the string would be reinterpreted as: > > <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302, U+EE94> > > which in UTF-8 would be the byte sequence: > > <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94> > > > This is now well-formed UTF-8, which anybody could deal with. And if you interpret U+EE93 as meaning "a placeholder for the uninterpreted or corrupt byte 0x93 in the original source", and so on, you could use this representation to exactly preserve the original information, including corruptions, which you could feed back out, byte-for-byte, if you reversed the conversion. Oh, how I hope that is all he is asking for. > Now moving from interpretation to critique, I think it unlikely > that the UTC would actually want to encode 128 such characters > to represent byte values -- and the reasons would be similar to > those adduced for rejecting the earlier proposal. Effectively, > in either case, these are proposals for enabling representation > of arbitrary, embedded binary data (byte streams) in plain text. > And that concept is pretty fundamentally antithetical to the > Unicode concept of plain text. Isn't this an excellent use for the PUA? These characters are private anyway; they are defined by some standard other than Unicode, which is not evident in the Unicode data. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
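Ken's byte sequence makes a convenient test case. Below is a minimal sketch of the escape/unescape he outlines, assuming the U+EE80..U+EEFF range Lars reports using; the function names and the use of a Python error handler are illustrative choices, not anyone's actual implementation.

    import codecs

    def escape_invalid(err):
        # Each byte of the ill-formed sequence (always 0x80..0xFF for the
        # UTF-8 codec) becomes one placeholder in U+EE80..U+EEFF.
        bad = err.object[err.start:err.end]
        return "".join(chr(0xEE00 + b) for b in bad), err.end

    codecs.register_error("pua-escape", escape_invalid)

    def bytes_to_text(raw: bytes) -> str:
        return raw.decode("utf-8", "pua-escape")

    def text_to_bytes(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if 0xEE80 <= cp <= 0xEEFF:
                out.append(cp - 0xEE00)        # restore the original raw byte
            else:
                out += ch.encode("utf-8")
        return bytes(out)

    raw = bytes.fromhex("61 62 63 93 4d d0 b0 e4 ba 8c f0 90 8c 82 94")
    text = bytes_to_text(raw)
    print(" ".join(f"U+{ord(c):04X}" for c in text))
    # U+0061 U+0062 U+0063 U+EE93 U+004D U+0430 U+4E8C U+10302 U+EE94
    assert text_to_bytes(text) == raw          # byte-for-byte round trip

Note that a well-formed name which already contains U+EE80..U+EEFF would be silently rewritten by the reverse mapping, which is exactly the escaping problem raised elsewhere in the thread.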
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Philippe Verdy wrote: > An alternative can then be a mixed encoding selection: > - choose a legacy encoding that will most often be able to represent > valid filenames without loss of information (for example ISO-8859-1, > or Cp1252). > - encode the filename with it. > - try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8 > encoded. > - if there's no failure, then you must reencode the filename with > UTF-8 instead, even if the result is longer. > - if the strict UTF-8 decoding fails, you can keep the filename in the > first 8-bit encoding... > When parsing files: > - try decoding filenames with *strict* UTF-8 rules. If this does not > fail, then the filename was effectively encoded with UTF-8. > - if the decoding failed, decode the filename with the legacy 8-bit > encoding. > > But even with this scheme, you will find interoperability problems > because some applications will only expect the legacy encoding, or > only the UTF-8 encoding, without deciding... This technique was described as "adaptive UTF-8" by Dan Oscarsson in August 1998: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML012/0738.html although he did not go as far as Philippe did, in actually checking the "adaptively" encoded string to make sure it would be decoded correctly. All the same, it was decided not to go this route, partly because the auto-detection capability of UTF-8 would be lost, partly because having multiple context-dependent encodings of the same code points would have been a Bad Thing (<99 C9> could be encoded adaptively but <C9 99> could not), and partly for the reason Philippe mentions -- most existing decoders would expect either Latin-1 or UTF-8, and would choke if handed a mixture of the two. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler scripsit: > Storage of UNIX filenames on Windows databases, for example, > can be done with BINARY fields, which correctly capture the > identity of them as what they are: an unconvertible array of > byte values, not a convertible string in some particular > code page. This solution, however, is overkill, in the same way that it would be overkill to encode all 8-bit strings in XML using Base-64 just because some of them may contain control characters that are illegal in well-formed XML. > In my opinion, trying to do that with a set of encoded characters > (these 128 or something else) is *less* likely to solve the > problem than using some visible markup convention instead. The trouble with the visible markup, or even the PUA, is that "well-formed filenames", those which are interpretable as UTF-8 text, must also be encoded so as to be sure any markup or PUA that naturally appears in the filename is escaped properly. This is essentially the Quoted-Printable encoding, which is quite rightly known to those stuck with it as "Quoted-Unprintable". > Simply > encoding 128 characters in the Unicode Standard ostensibly to > serve this purpose is no guarantee whatsoever that anyone would > actually implement and support them in the universal way you > envision, any more than they might a "=93", "=94" convention. Why not, when it's so easy to do so? And they'd be *there*, reserved, unassignable for actual character encoding. Plane E would be a plausible location. -- John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan han mathon ne chae, a han noston ne 'wilith. --Galadriel, LOTR:FOTR
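For comparison, here is the visible-markup route in its crudest form (a sketch only: it escapes every non-ASCII byte rather than just ill-formed sequences, and the '=XX' convention is merely modeled on Quoted-Printable, not any standard). The point is John's: even a perfectly ordinary name containing '=' has to be rewritten so that the escape character itself stays unambiguous.

    def qp_escape(raw: bytes) -> str:
        out = []
        for b in raw:
            if 0x20 <= b < 0x7F and b != 0x3D:   # printable ASCII except '='
                out.append(chr(b))
            else:
                out.append("=%02X" % b)
        return "".join(out)

    print(qp_escape(b"plain.txt"))           # plain.txt       (unchanged)
    print(qp_escape(b"a=b.txt"))             # a=3Db.txt       (harmless name, still rewritten)
    print(qp_escape(b"smart\x93quote"))      # smart=93quote   (the corrupt byte survives visibly)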
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Lars, I'm going to step in here, because this argument seems to be generating more heat than light. > I never said it doesn't violate any existing rules. Stating that it does, > doesn't help a bit. Rules can be changed. > I ask you to step back and try to see the big picture. First, I'm going to summarize what I think Lars Kristan is suggesting, to test whether my understanding of the proposal is correct or not. I do not think this is a proposal to amend UTF-8 to allow invalid sequences. So we should get that off the table. What I think this suggestion is is for adding 128 characters to represent byte values in conversion to Unicode when the byte values are uninterpretable as characters. Why 128 instead of 256 I find a little mysterious, but presumably the intent is to represent 0x80..0xFF as raw, uninterpreted byte values, unconvertible to Unicode characters otherwise. This is suggested by Lars' use case of: > Storing UNIX filenames in a Windows database. ... since UNIX filenames are simply arrays of bytes, and cannot, on interconnected systems, necessarily be interpreted in terms of well-defined characters. Apparently Lars is currently using PUA U+E080..U+E0FF (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping of byte values uninterpretable as characters to be converted, and is asking for standard Unicode values for this purpose, instead. The other use case that Lars seems to be talking about are existing documents containing data corruptions in them, which can often happen when Latin-1 data gets dropped into UTF-8 data or vice versa due to mislabeled email or whatever. > So you would drop the data. There are only two options with current designs. > Dropping invalid sequences, or storing it separately (which probably means > the whole document is dead until manually decoded). Dropping invalid > sequences is actually a better choice. And would even be justifiable (but > still sometimes inconvenient) if we were living in world where everything is > in UTF-8. In a world, trying to transition from legacy encodings to Unicode, > there could be a lot of data lost and a lot of angry users. And I am assuming this is referring primarily to the second case, where the extreme scenario Lars is envisioning would be, for example, where each point in a system was hyper-alert to invalid sequences and simply tossed or otherwise sequestered entire documents if they got these kinds of data corruptions in them. And in such a case, I can understand the concern about angry users. How many people on this list would be cursing if every bit of email that had a character set conversion error in it resulting in some bit hash or other, simply got tossed in the bit bucket instead of being delivered with the glorious hash intact, at least giving you the chance to see if you could figure out what was intended? > A UTF-16 based program will only be able to process valid UTF-8 > data. A UTF-8 based program will in many cases preserve invalid sequences > even without any effort. Let me guess, you will say it is a flaw in the > UTF-8 based program. If validation is desired, yes. But then I think you > would want all UTF-8 based programs to do that. That will not happen. What > will happen is that UTF-8 based programs will be better text editors > (because they will not lose data or constantly complain), while UTF-16 based > programs will produce cleaner data. You will opt for the latter. This is, I think the basic point at which people are talking past each other. 
Notionally, Doug is correct that UTF-8 and UTF-16 are equivalent encoding forms, and anything represented (correctly) in one can be represented (correctly) in the other. In that sense, there is no difference between representation of text in UTF-8 or UTF-16, and no reason to postulate that a "UTF-8 based program" will have any advantages or disadvantages over a "UTF-16 based program" when it comes to dealing with corrupted data. What Lars is talking about is a broad class of UNIX-based software which is written to handle strings essentially as opaque bags of bytes, not caring what they contain for many purposes. Such software generally keeps working just fine if you pump UTF-8 at it, which is by design for UTF-8 -- precisely because UTF-8 leaves untouched all the 0x00..0x7F byte values that may have particular significance for those processes. Most of that software treats 0x80..0xFF just as bit hash from the get-go, and neither cares nor has any way of knowing if the particular sequence of bit hash is valid UTF-8 or Shift-JIS or Latin-1 or EUC-JIS or some mix or whatever. > And I for > the former. But will users know exactly what they've got? Will designers > know exactly what they're gonna get? This is where all this started. I > stated that there is an important difference between deciding for UTF-8 or > for UTF-16 (or UTF-32). This is where this is all getting derailed. Whatever the solutions for representation of corrupt data bytes or un
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
Philippe continued: > As if Unicode had to be bound on > architectural constraints such as the requirement of representing code units > (which are architectural for a system) only as 16-bit or 32-bit units, Yes, it does. By definition. In the standard. > ignoring the fact that technologies do evolve and will not necessarily keep > this constraint. 64-bit systems already exist today, and even if they have, > for now, the architectural capability of handling efficiently 16-bit and > 32-bit code units so that they can be addressed individually, this will > possibly not be the case in the future. This is just as irrelevant as worrying about the fact that 8-bit character encodings may not be handled efficiently by some 32-bit processors. > When I look at the encoding forms such as UTF-16 and UTF-32, they just > define the value ranges in which code units will be valid, but not > necessarily their size. Philippe, you are wrong. Go reread the standard. Each of the encoding forms is *explicitly* defined in terms of code unit size in bits. "The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form." If there is something ambiguous or unclear in wording such as that, I think the UTC would like to know about it. > You are mixing this with encoding schemes, which is > what is needed for interoperability, and where other factors such as bit or > byte ordering is also important in addition to the value range. I am not mixing it up -- you are, unfortunately. And it is most unhelpful on this list to have people waxing on, with apparently authoritative statements about the architecture of the Unicode Standard, which on examination turn out to be flat wrong. > I won't see anything wrong if a system is set so that UTF-32 code units will > be stored in 24-bit or even 64-bit memory cells, as long as they respect and > fully represent the value range defined in encoding forms, Correct. And I said as much. There is nothing wrong with implementing UTF-32 on a 64-bit processor. Putting a UTF-32 code point into a 64-bit register is fine. What you have to watch out for is handing me a 64-bit array of ints and claiming that it is a UTF-32 sequence of code points -- it isn't. > and if the system > also provides an interface to convert them with encoding schemes to > interoperable streams of 8-bit bytes. No, you have to have an interface which hands me the correct data type when I declare it uint_32, and which gives me correct offsets in memory if I walk an index pointer down an array. That applies to the encoding *form*, and is completely separate from provision of any streaming interface that wants to feed data back and forth in terms of byte streams. > Are you saying that UTF-32 code units need to be able to represent any > 32-bit value, even if the valid range is limited, for now to the 17 first > planes? Yes. > An API on a 64-bit system that would say that it requires strings being > stored with UTF-32 would also define how UTF-32 code units are represented. > As long as the valid range 0 to 0x10FFFF can be represented, this interface > will be fine. No, it will not. Read the standard. An API on a 64-bit system that uses an unsigned 32-bit datatype for UTF-32 is fine. It isn't fine if it uses an unsigned 64-bit datatype for UTF-32. > If this system is designed so that two or three code units > will be stored in a single 64-bit memory cell, no violation will occur in > the valid range.
You can do whatever the heck crazy thing you want to do internal to your data manipulation, but you cannot surface a datatype packed that way and conformantly claim that it is UTF-32. > More interestingly, there already exist systems where memory is addressable > by units of 1 bit, and on these systems, ... [excised some vamping on the future of computers] > Nothing there is impossible for the future (when it will become more and > more difficult to increase the density of transistors, or to reduce further > the voltage, or to increase the working frequency, or to avoid the > inevitable and random presence of natural defects in substrates; escaping > from the historic binary-only systems may offer interesting opportunities > for further performance increase). Look, I don't care if the processors are dealing in qubits on molecular arrays under the covers. It is the job of the hardware folks to surface appropriate machine instructions that compiler makers can use to surface appropriate formal language constructs to programmers to enable hooking the defined datatypes of the character encoding standards into programming language datatypes. It is the job of the Unicode Consortium to define the encoding forms for representing Unicode code points, so that people manipulating Unicode digital text representation can do so reliably using general purpose programming languages with wel
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded and, depending on local user preferences when that user created that file, filenames on such systems seem to have unpredictable encodings. However the problem comes, most often, when interchanging data from one system to another, through removable volumes or shared volumes. Needless to say, these systems were badly designed at their origin, and newer filesystems (and OS APIs) offer much better alternatives, by either storing explicitly on volumes which encoding it uses, or by forcing all user-selected encodings to a common kernel encoding such as Unicode encoding schemes (this is what FAT32 and NTFS do on filenames created under Windows, since Windows 98 or NT). I understand that there may exist situations, such as Linux/Unix UFS-like filesystems, where it will be hard to decide which encoding was used for filenames (or simply for the content of plain-text files). For plain-text files, which have long-enough data in them, automatic identification of the encoding is possible, and used with success in many applications (notably in web browsers). But for filenames, which are generally short, automatic identification is often difficult. However, UTF-16 remains easy to identify, most often, due to the very unusual frequency of low values in byte sequences on every even or odd position. UTF-8 is also easy to identify due to its strict rules (without these strict rules, which forbid some sequences, automatic identification of the encoding becomes very risky). If the encoding cannot be identified precisely and explicitly, I think that UTF-16 is much better than UTF-8 (and it also offers a better compromise for total size for names in any modern language). However, it's true that UTF-16 cannot be used on Linux/Unix due to the presence of null bytes. The alternative is then UTF-8, but it is often larger than legacy encodings. An alternative can then be a mixed encoding selection: - choose a legacy encoding that will most often be able to represent valid filenames without loss of information (for example ISO-8859-1, or Cp1252). - encode the filename with it. - try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8 encoded. - if there's no failure, then you must reencode the filename with UTF-8 instead, even if the result is longer. - if the strict UTF-8 decoding fails, you can keep the filename in the first 8-bit encoding... When parsing files: - try decoding filenames with *strict* UTF-8 rules. If this does not fail, then the filename was effectively encoded with UTF-8. - if the decoding failed, decode the filename with the legacy 8-bit encoding. But even with this scheme, you will find interoperability problems because some applications will only expect the legacy encoding, or only the UTF-8 encoding, without deciding...
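A sketch of the mixed selection described above (illustration only; ISO-8859-1 stands in for "the legacy encoding" and the function names are invented):

    LEGACY = "iso-8859-1"

    def store_name(name: str) -> bytes:
        try:
            raw = name.encode(LEGACY)
        except UnicodeEncodeError:
            return name.encode("utf-8")    # the legacy charset cannot hold it at all
        try:
            raw.decode("utf-8")            # would a strict UTF-8 reader accept these bytes?
        except UnicodeDecodeError:
            return raw                     # no: the short legacy form is unambiguous
        return name.encode("utf-8")        # yes: must re-encode as UTF-8 to stay unambiguous

    def read_name(raw: bytes) -> str:
        try:
            return raw.decode("utf-8")     # strict success means it really was UTF-8
        except UnicodeDecodeError:
            return raw.decode(LEGACY)      # otherwise fall back to the legacy charset

    for name in ("café", "Đurđević", "Ã©"):    # the last one forces the re-encoding rule
        raw = store_name(name)
        assert read_name(raw) == name
        print(name, "->", raw)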
If only MS Word was coded this well (was Re: Nicest UTF)
From: "D. Starner" <[EMAIL PROTECTED]> (Sorry for sending this twice, Marcin.) "Marcin 'Qrczak' Kowalczyk" writes: UTF-8 is poorly suitable for internal processing of strings in a modern programming language (i.e. one which doesn't already have a pile of legacy functions working of bytes, but which can be designed to make Unicode convenient at all). It's because code points have variable lengths in bytes, so extracting individual characters is almost meaningless Same with UTF-16 and UTF-32. A character is multiple code-points, remember? (decomposed chars?) (unless you care only about the ASCII subset, and sequences of all other characters are treated as non-interpreted bags of bytes). Nope. I've done tons of UTF-8 string processing. I've even done a case insensitive word-frequency measuring algorithm on UTF-8. It runs blastingly fast, because I can do the processing with bytes. It just requires you to understand the actual logic of UTF-8 well enough to know that you can treat it as bytes, most of the time. And the times you can't treat it as bytes, usually you can't even treat UTF-32 as bytes! If you are talking about creating an editfield or text control or something, that is true that UTF-32 is better. However, UTF-16 is the worst of all cases, you'd be better off using UTF-8 as the native encoding of an editfield. The thing is, very very very few people write editfields. I've seen tons of XML parsers in my lifetime (at least 3 I wrote myself), but only a few editfield libraries. Its a shame that very few people understand the different UTFs properly. As for isspace... sure there is a UTF-8 non-byte space. My case insensitive utf-8 word frequency counter (which runs blastingly fast) however didn't find this to be any problem. It dealt with non-single byte all sorts of word breaks :o) It appears to run at about 3MB/second on my laptop, which involves for every word, doing a word check on the entire previous collection of words. Thats like having MS Word spell-check 3MB of pure Unicode text (no style junk bloating up the file-size) in one second, for you. (The words would all be spelt correctly though, so as to not require expensive RAM copying when doing the replacements.) Yes, I do know how to code ;o) Too bad so few others do. -- Theodore H. Smith - Software Developer - www.elfdata.com/plugin/ Industrial strength string processing code, made easy. (If you believe that's an oxymoron, see for yourself.)
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
From: "Kenneth Whistler" <[EMAIL PROTECTED]> Yes, and pigs could fly, if they had big enough wings. Once again, this is a creative comment. As if Unicode had to be bound on architectural constraints such as the requirement of representing code units (which are architectural for a system) only as 16-bit or 32-bit units, ignoring the fact that technologies do evolve and will not necessarily keep this constraint. 64-bit systems already exist today, and even if they have, for now, the architectural capability of handling efficiently 16-bit and 32-bit code units so that they can be addressed individually, this will possibly not be the case in the future. When I look at the encoding forms such as UTF-16 and UTF-32, they just define the value ranges in which code units will be be valid, but not necessarily their size. You are mixing this with encoding schemes, which is what is needed for interoperability, and where other factors such as bit or byte ordering is also important in addition to the value range. I won't see anything wrong if a system is set so that UTF-32 code units will be stored in 24-bit or even 64-bit memory cells, as long as they respect and fully represent the value range defined in encoding forms, and if the system also provides an interface to convert them with encoding schemes to interoperable streams of 8-bit bytes. Are you saying that UTF-32 code units need to be able to represent any 32-bit value, even if the valid range is limited, for now to the 17 first planes? An API on a 64-bit system that would say that it requires strings being stored with UTF-32 would also define how UTF-32 code units are represented. As long as the valid range 0 to 0x10 can be represented, this interface will be fine. If this system is designed so that two or three code units will be stored in a single 64-bit memory cell, no violation will occur in the valid range. More interestingly, there already exists systems where memory is adressable by units of 1 bit, and on these systems, an UTF-32 code unit will work perfectly if code units are stored by steps of 21 bits of memory. On 64-bit systems, the possibility of addressing any groups individual bits will become an interesting option, notably when handling complex data structures such as bitfields, data compressors, bitmaps, ... No more need to use costly shifts and masking. Nothing would prevent such system to offer interoperability with 8-bit byte based systems (note also that recent memory technologies use fast serial interfaces instead of parallel buses, so that the memory granularity is less important). The only cost for bit-addressing is that it just requires 3 bits of address, but in a 64-bit address, this cost seems very low becaue the global addressable space will still be... more than 2.3*10^18 bytes, much more than any computer will manage in a single process for the next century (according to the Moore's law which doubles the computing capabilities every 3 years). Even such scheme would not limit the performance given that memory caches are paged, and these caches are always increasing, eliminating most of the costs and problems related to data alignment experimented today on bus-based systems. 
Other territories are also still unexplored in microprocessors, notably the possibility of using non-binary numeric systems (think about optical or magnetic systems which could outperform the current electric systems due to reduced power and heat caused by currents of electrons through molecular substrates, replacing them by shifts of atomic states caused by light rays, and the computing possibilities offered by light diffraction through crystals). The lowest granularity of information in some future may be larger than a dual-state bit, meaning that today's 8-bit systems would need to be emulated using other numerical systems... (Note for example that to store the range 0..0x10FFFF, you would need 13 digits on a ternary system, and to store the range of 32-bit integers, you would need 21 ternary digits; memory technologies for such systems may use byte units made of 6 ternary digits, so programmers would have the choice between 3 "ternary bytes", i.e. 18 ternary digits, to store our 21-bit code units, or 4 "ternary bytes", i.e. 24 ternary digits or more than 34 binary bits, to be able to store the whole 32-bit range.) Nothing there is impossible for the future (when it will become more and more difficult to increase the density of transistors, or to reduce further the voltage, or to increase the working frequency, or to avoid the inevitable and random presence of natural defects in substrates; escaping from the historic binary-only systems may offer interesting opportunities for further performance increase).
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
> Yes, and pigs could fly, if they had big enough wings. An 8-foot wingspan should do it. For picture of said flying pig see: http://www.cincinnati.com/bigpiggig/profile_091700.html http://www.cincinnati.com/bigpiggig/images/pig091700.jpg Rick
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
Philippe stated, and I need to correct: > UTF-24 already exists as an encoding form (it is identical to UTF-32), if > you just consider that encoding forms just need to be able to represent a > valid code range within a single code unit. This is false. Unicode encoding forms exist by virtue of the establishment of them as standard, by actions of the standardizing organization, the Unicode Consortium. > UTF-32 is not meant to be restricted on 32-bit representations. This is false. The definition of UTF-32 is: "The Unicode encoding form which assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value." It is true that UTF-32 could be (and is) implemented on computers which hold 32-bit numeric types transiently in 64-bit registers (or even other size registers), but if an array of 64-bit integers (or 24-bit integers) were handed to some API claiming to be UTF-32, it would simply be nonconformant to the standard. UTF-24 does not "already exist as an encoding form" -- it already exists as one of a large number of more or less idle speculations by character numerologists regarding other cutesy ways to handle Unicode characters on computers. Many of those cutesy ways are mere thought experiments or even simply jokes. > However it's true that UTF-24BE and UTF-24LE could be useful as encoding > schemes for serializations to byte-oriented streams, suppressing one > unnecessary byte per code point. "Could be", perhaps, but is not. Implementers using UTF-32 for processing efficiency, but who have bandwidth constraints in some streaming context should simply use one of the CES's with better size characteristics or use a compression on their data. > Note that 64-bit systems could do the same: 3 code points per 64-bit unit, > requires only 63 bits, that are stored in a single positive 64-bit integer > (the remaining bit would be the sign bit, always set to 0, avoiding problems > related to sign extensions). And even today's system could use such > representation as well, given that most 32-bit processors of today also have > the internal capabilities to manage 64-bit integers natively. This is just an incredibly bad idea. Packing instructions in large-word microprocessors is one thing. You have built-in microcode which handles that, hidden away from application-level programming, and carefully architected for maximal processor efficiency. But attempting to pack character data into microprocessor words, just because you have bits available, would just detract from the efficiency of handling that data. Storage is not the issue -- you want to get the characters in and out of the registers as efficiently as possible. UTF-32 works fine for that. UTF-16 works almost as well, in aggregate, for that. And I could care less that when U+0061 goes in a 64-bit register for manipulation, the high 57 bits are all set to zero. > Strings could be encoded as well using only 64-bit code units that would > each store 1 to 3 code points, Yes, and pigs could fly, if they had big enough wings. > the unused positions being filled with > invalid codepoints outside the Unicode space (for example by setting all 21 bits > to 1, producing the out-of-range code point 0x1FFFFF, used as a filler for > missing code points, notably when the string to encode is not an exact > multiple of 3 code points). Then, these 64-bit code units could be > serialized on byte streams as well, multiplying the number of possibilities: > UTF-64BE and UTF-64LE?
One interest of such scheme is that it would be more > compact than UTF-32, because this UTF-64 encoding scheme would waste only 1 > bit for 3 codepoints, instead 1 byte and 3 bits for each codepoint with > UTF-32! Wow! > You can imagine many other encoding schemes, depending on your architecture > choices and constraints... Yes, one can imagine all sorts of strange things. I myself imagined UTF-17 once. But there is a difference between having fun imagining strange things and filling the list with confusing misinterpretations of the status and use of UTF-8, UTF-16, and UTF-32. --Ken
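To make the arithmetic of that hypothetical packing concrete (this is nobody's standard; the filler value and the three-per-unit layout come straight from the quoted description, everything else is invented):

    FILLER = 0x1FFFFF                        # all 21 bits set: outside the Unicode space

    def pack_utf64(codepoints):
        units = []
        for i in range(0, len(codepoints), 3):
            group = list(codepoints[i:i+3])
            group += [FILLER] * (3 - len(group))
            units.append((group[0] << 42) | (group[1] << 21) | group[2])   # 63 bits, sign bit 0
        return units

    def unpack_utf64(units):
        out = []
        for unit in units:
            for shift in (42, 21, 0):
                cp = (unit >> shift) & 0x1FFFFF
                if cp != FILLER:
                    out.append(cp)
        return out

    cps = [ord("a"), ord("b"), 0x10302]      # two ASCII characters and one supplementary
    units = pack_utf64(cps)
    assert unpack_utf64(units) == cps
    print([hex(u) for u in units])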
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Doug Ewell replied: > Actually the Unicode Technical Committee. But you are > correct: it is up > to the UTC to decide whether they want to redefine UTF-8 to permit > invalid sequences, which are to be interpreted as unknown characters > from an unknown legacy coding standard, and to prohibit > conversion from > this redefined UTF-8 to other encoding schemes, or directly to Unicode > code points. We will have to wait and see what UTC members think of > this. I never said it doesn't violate any existing rules. Stating that it does, doesn't help a bit. Rules can be changed. Assuming we understand the consequences. And that is what we should be discussing. By stating what should be allowed and what should be prohibited you are again defending those rules. I agree, rules should be defended, but only up to a certain point. Simply finding a rule that is offended is not enough to prove something is bad or useless. > > > But this decision should not be based solely on theory and ideal > > worlds. > > Right. Uh-huh. Defining Unicode as the world of codepoints is a complex task on its own. It seems that you are afraid of stepping out of this world, since you do not know what awaits you there. So, it is easier to find an excuse within existing rules, especially if a proposed change threatens to shake everything right down to the foundation. If I would be dealing with Unicode (as we know it), I would probably be doing the same thing. I ask you to step back and try to see the big picture. > > Of course not. That is not at all the same as INTENTIONALLY storing > invalid sequences in UTF-8 and expecting the decoding mechanism to > preserve the invalid bytes for posterity. So you would drop the data. There are only two options with current designs. Dropping invalid sequences, or storing it separately (which probably means the whole document is dead until manually decoded). Dropping invalid sequences is actually a better choice. And would even be justifiable (but still sometimes inconvenient) if we were living in world where everything is in UTF-8. In a world, trying to transition from legacy encodings to Unicode, there could be a lot of data lost and a lot of angry users. > > > And do what with it, Lars? Keep it on a shelf indefinitely > in case some > archaeologist unearths a new legacy encoding that might unlock the > mystery data? > > Is this really worth the effort of redefining UTF-8 and > disallowing free > conversion between UTF-8 and Unicode code points? > > Do you have a use case for this? Yes, I definitely have. I am the one accusing you of living in a perfect world, remember?. Do you think I would do that if I wasn't dealing with this problem in real life? > > > So with your plan, you have invalid sequence #1, invalid sequence #2, > and so forth. Now, what do the sequences mean? Is there any way to > interpret them? No, there isn't, because by definition these > sequences > represent characters from an unknown coding standard. Either > (a) nobody > has gone to the trouble to find out what characters they truly > represent, (b) the original standard is lost and we will *never* know, > or (c) we are waiting for the archaeologist to save the day. > > In the meantime, the UTF-8 data with invalid sequences must be kept > isolated from all processes that would interpret the sequences as code > points, and raise an exception on invalid sequences-- in other words, > all existing processes that handle UTF-8. On the contrary. 
If those invalid sequences can (well, may) be translated into codepoints, then you can stop worrying about them. Or at least all the worrying is done within the conversion. It is the current design that is unfair. A UTF-16 based program will only be able to process valid UTF-8 data. A UTF-8 based program will in many cases preserve invalid sequences even without any effort. Let me guess, you will say it is a flaw in the UTF-8 based program. If validation is desired, yes. But then I think you would want all UTF-8 based programs to do that. That will not happen. What will happen is that UTF-8 based programs will be better text editors (because they will not lose data or constantly complain), while UTF-16 based programs will produce cleaner data. You will opt for the latter. And I for the former. But will users know exactly what they've got? Will designers know exactly what they're gonna get? This is where all this started. I stated that there is an important difference between deciding for UTF-8 or for UTF-16 (or UTF-32). > > > Let's compare UTF-8 to UTF-16 conversion to an automated translation > > from German to French. What Unicode standard says can be interpreted > > as follows: > > > > * All in
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Doug Ewell wrote: > John Cowan wrote: > > > Windows filesystems do know what encoding they use. But a > filename on > > a Unix(oid) file system is a mere sequence of octets, of > which only 00 > > and 2F are interpreted. (Filenames containing 20, and > especially 0A, > > are annoying to handle with standard tools, but not illegal.) > > > > How these octet sequences are translated to characters, if at all, > > is no concern of the file system's. Some higher-level > tools, such as > > directory listers and shells, have hardwired assumptions, > others have > > changeable assumptions, but all are assumptions. > > OK, fair enough. Under a Unixoid file system, a file name > consists of a > more or less arbitrary sequence of bytes, essentially > unregulated by the > OS. > > If interpreted as UTF-8, some of these sequences may be > invalid, and the > files may be inaccessible. > > This is *exactly* the same scenario as with GB 2312, or > Shift-JIS, or KS > C 5601, or ISO 6937, or any other multibyte character encoding ever > devised. > > This is not a problem that needs to be solved within Unicode, any more > than it needed to be solved within those other encodings. > Shift-JIS was typically not mixed with other encodings, except for pure 7-bit ASCII. UTF-8 will be. And Shift-JIS had other serious problems, like the trailing backslash byte. UTF-8 has learned a lot from Shift-JIS. If there is anything still to learn, then let's welcome that. Also, Shift-JIS (and other MBCS encodings) were a must for those cultures. UTF-8 is not a must. If there will be problems, there will be complaints. And resistance. Lars
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary indexing, but arbitrary indexing is rarely important. I fully concur to this point of view. Almost all (if not all) string processing can be performed in terms of sequential enumerators, instead of through random indexing (which has also the big disavantage of not allowing with rich context dependant processing behaviors, something you can't ignore when handling international texts). So internal storage of string does not matter for the programming interface of parsable string objects. In terms of efficiency and global application performance, using compressed encoding schemes is highly recommanded for large databases of text, because the negative impact of the decompressing overhead is extremely small face to the huge benefits you get when reducing the load on system resources, on data locality and on memory caches, on the system memory allocator, on the memory fragmentation level, on reduced VM swaps and on file or database I/O (which will be the only effective limitation for large databases).
Re: Nicest UTF
(Sorry for sending this twice, Marcin.) "Marcin 'Qrczak' Kowalczyk" writes: > UTF-8 is poorly suitable for internal processing of strings in a > modern programming language (i.e. one which doesn't already have a > pile of legacy functions working of bytes, but which can be designed > to make Unicode convenient at all). It's because code points have > variable lengths in bytes, so extracting individual characters is > almost meaningless (unless you care only about the ASCII subset, and > sequences of all other characters are treated as non-interpreted bags > of bytes). You can't even have a correct equivalent of C isspace(). That's assuming that the programming language is similar to C and Ada. If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary indexing, but arbitrary indexing is rarely important. You could hide combining characters, which would be extremely useful if we were just using Latin and Cyrillic scripts. You'd have to be flexible, since it would be natural to step through a Hebrew or Arabic string as if the vowels were written inline, and people might want to look at the combining characters (which would be incredibly rare if your language already provided most standard Unicode functions.) -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
- Original Message - From: "Arcane Jill" <[EMAIL PROTECTED]> Probably a dumb question, but how come nobody's invented "UTF-24" yet? I just made that up, it's not an official standard, but one could easily define UTF-24 as UTF-32 with the most-significant byte (which is always zero) removed, hence all characters are stored in exactly three bytes and all are treated equally. You could have UTF-24LE and UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly brilliant idea, but I just wonder why no-one's suggested it before. UTF-24 already exists as an encoding form (it is identical to UTF-32), if you just consider that encoding forms just need to be able to represent a valid code range within a single code unit. UTF-32 is not meant to be restricted on 32-bit representations. However it's true that UTF-24BE and UTF-24LE could be useful as encoding schemes for serializations to byte-oriented streams, suppressing one unnecessary byte per code point. (And then of course, there's UTF-21, in which blocks of 21 bits are concatenated, so that eight Unicode characters will be stored in every 21 bytes - and not to mention UTF-20.087462841250343, in which a plain text document is simply regarded as one very large integer expressed in radix 1114112, and whose UTF-20.087462841250343 representation is simply that number expressed in binary. But now I'm getting /very/ silly - please don't take any of this seriously.) :-) I don't think that UTF-21 would be useful as an encoding form, but possibly as an encoding scheme where 3 always-zero bits would be stripped, providing a tiny compression level, which would only be justified for transmission over serial or network links. However I do think that such "optimization" would have the effect of removing byte alignments, on which more powerful compressors are working. If you really need a more effective compression, use SCSU or apply some deflate or bzip2 compression to UTF-8, UTF-16, or UTF-24/32... (there's not much difference between compressing UTF-24 or UTF-32 with generic compression algorithms like deflate or bzip2). The "UTF-24" thing seems a reasonably sensible question though. Is it just that we don't like it because some processors have alignment restrictions or something? There do exist, even still today, 4-bit processors and 1-bit processors, where the smallest addressable memory unit is smaller than 8 bits. They are used for low-cost micro-devices, notably to build automated robots for industry, or even for many home/kitchen devices. I don't know whether they need Unicode to represent international text, given that they often have a very limited user interface, incapable of inputting or outputting text, but who knows? Maybe they are used in some mobile phones, or within "smart" keyboards or tablets or other input devices connected to PCs... There also exist systems where the smallest addressable memory cell is a 9-bit byte. This is more of an issue here, because the Unicode standard does not specify whether encoding schemes (that serialize code points to bytes) should set the 9th bit of each byte to 0, or should fill every 8 bits of memory, even if this means that 8-bit bytes of UTF-8 will not be synchronized with memory 9-bit bytes. Somebody already introduced UTF-9 in the past for 9-bit systems.
A 36-bit processor could just as well address memory in cells of 36 bits, where the 4 highest bits would be used either for CRC control bits (generated and checked automatically by the processor or a memory bus interface within memory regions where this behavior would be allowed), or to store supplementary bits of actual data (in unchecked regions that fit in reliable and fast memory, such as the internal memory cache of the CPU, or static CPU registers). For such things, the impact of the transformation of addressable memory widths through interfaces is for now not discussed in Unicode, which supposes that internal memory is necessarily addressed in a power of 2 and a multiple of 8 bits, and then interchanged or stored using this byte unit. Today we are witnessing a constant expansion of bus widths to allow parallel processing instead of raising the working frequency (and the energy consumption and heat, which create other environmental problems), so why should the 8-bit byte remain the most efficient universal unit? If you look at IEEE floating-point formats, they are often implemented in FPUs working on 80-bit units, and an 80-bit memory cell could well become a standard tomorrow (compatible with the increasingly used 64-bit architectures of today) which would no longer be a power of 2 (even if it stays a multiple of 8 bits). On an 80-bit system, the easiest solution for handling UTF-32 without using too much space would be a 40-bit unit (i.e. two code points per 80-bit memory cell). But if you consider that 21 bits only are used
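For what it's worth, the byte-level serialization that a hypothetical UTF-24BE would need is tiny: each code point is just the low three bytes of its UTF-32 value in big-endian order. The sketch below is illustrative only; no such encoding scheme is defined by Unicode, and the function names are made up.

    #include <stdint.h>

    /* Serialize one scalar value (<= 0x10FFFF) as three big-endian bytes:
       UTF-32BE with the always-zero top byte dropped. */
    static void utf24be_put(uint32_t cp, unsigned char out[3])
    {
        out[0] = (cp >> 16) & 0xFF;
        out[1] = (cp >> 8)  & 0xFF;
        out[2] = cp & 0xFF;
    }

    /* Read one code point back from three big-endian bytes. */
    static uint32_t utf24be_get(const unsigned char in[3])
    {
        return ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
    }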
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan wrote: > Windows filesystems do know what encoding they use. But a filename on > a Unix(oid) file system is a mere sequence of octets, of which only 00 > and 2F are interpreted. (Filenames containing 20, and especially 0A, > are annoying to handle with standard tools, but not illegal.) > > How these octet sequences are translated to characters, if at all, > is no concern of the file system's. Some higher-level tools, such as > directory listers and shells, have hardwired assumptions, others have > changeable assumptions, but all are assumptions. OK, fair enough. Under a Unixoid file system, a file name consists of a more or less arbitrary sequence of bytes, essentially unregulated by the OS. If interpreted as UTF-8, some of these sequences may be invalid, and the files may be inaccessible. This is *exactly* the same scenario as with GB 2312, or Shift-JIS, or KS C 5601, or ISO 6937, or any other multibyte character encoding ever devised. This is not a problem that needs to be solved within Unicode, any more than it needed to be solved within those other encodings. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
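The check implied above -- deciding whether a filename, taken as raw bytes, happens to be well-formed UTF-8 -- can be sketched as follows. This version enforces shortest-form sequences and rejects surrogates and values above U+10FFFF; it is an illustration under those assumptions, not a vetted validator.

    #include <stddef.h>

    /* Return 1 if the byte string is well-formed UTF-8, 0 otherwise. */
    static int is_valid_utf8(const unsigned char *s, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char b = s[i];
            if (b < 0x80) { i += 1; continue; }

            int extra;
            unsigned long cp, min;
            if      (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; min = 0x80; }
            else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; min = 0x800; }
            else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; min = 0x10000; }
            else return 0;                     /* 0x80..0xC1, 0xF5..0xFF: never valid leads */

            if (i + extra >= len) return 0;    /* truncated sequence */
            for (int k = 1; k <= extra; k++) {
                unsigned char c = s[i + k];
                if ((c & 0xC0) != 0x80) return 0;   /* not a continuation byte */
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < min) return 0;                        /* overlong form */
            if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    /* surrogate */
            if (cp > 0x10FFFF) return 0;
            i += 1 + extra;
        }
        return 1;
    }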
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Doug Ewell scripsit: > > Now suppose you have a UNIX filesystem, containing filenames in a > > legacy encoding (possibly even more than one). If one wants to switch > > to UTF-8 filenames, what is one supposed to do? Convert all filenames > > to UTF-8? > > Well, yes. Doesn't the file system dictate what encoding it uses for > file names? How would it interpret file names with "unknown" characters > from a legacy encoding? How would they be handled in a directory > search? Windows filesystems do know what encoding they use. But a filename on a Unix(oid) file system is a mere sequence of octets, of which only 00 and 2F are interpreted. (Filenames containing 20, and especially 0A, are annoying to handle with standard tools, but not illegal.) How these octet sequences are translated to characters, if at all, is no concern of the file system's. Some higher-level tools, such as directory listers and shells, have hardwired assumptions, others have changeable assumptions, but all are assumptions. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan No man is an island, entire of itself; every man is a piece of the continent, a part of the main. If a clod be washed away by the sea, Europe is the less, as well as if a promontory were, as well as if a manor of thy friends or of thine own were: any man's death diminishes me, because I am involved in mankind, and therefore never send to know for whom the bell tolls; it tolls for thee. --John Donne
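A small illustration of the point: the only bytes the path layer itself interprets are 0x00 (the terminator) and 0x2F ('/'), so a component scan works on raw bytes with no notion of the encoding used for the names in between. The function name is illustrative.

    #include <stdio.h>
    #include <string.h>

    /* Print the components of a path, treating the name purely as bytes.
       Only '/' and the terminating NUL are interpreted; the bytes in
       between may be UTF-8, Latin-1, Shift-JIS, or anything else. */
    static void print_components(const char *path)
    {
        const char *p = path;
        for (;;) {
            const char *slash = strchr(p, '/');
            size_t n = slash ? (size_t)(slash - p) : strlen(p);
            if (n > 0)
                printf("component: %.*s\n", (int)n, p);
            if (!slash)
                break;
            p = slash + 1;
        }
    }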
Invalid UTF-8 sequences (was: Re: Nicest UTF)
RE: Nicest UTFLars Kristan wrote: >> I could not disagree more with the basic premise of Lars' post. It >> is a fundamental and critical mistake to try to "extend" Unicode with >> non-standard code unit sequences to handle data that cannot be, or >> has not been, converted to Unicode from a legacy standard. This is >> not what any character encoding standard is for. > > What a standard is or is not for is a decision. And Unicode consortium > is definitely the body that makes the decision in this case. Actually the Unicode Technical Committee. But you are correct: it is up to the UTC to decide whether they want to redefine UTF-8 to permit invalid sequences, which are to be interpreted as unknown characters from an unknown legacy coding standard, and to prohibit conversion from this redefined UTF-8 to other encoding schemes, or directly to Unicode code points. We will have to wait and see what UTC members think of this. > But this decision should not be based solely on theory and ideal > worlds. Right. Uh-huh. >> This is simply what you have to do. You cannot convert the data into >> Unicode in a way that says "I don't know how to convert this data >> into Unicode." You must either convert it properly, or leave the >> data in its original encoding (properly marked, preferably). > > Here lies the problem. Suppose you have a document in UTF-8, which > somehow got corrupted and now contains a single invalid sequence. Are > you proposing that this document needs to be stored separately? Of course not. That is not at all the same as INTENTIONALLY storing invalid sequences in UTF-8 and expecting the decoding mechanism to preserve the invalid bytes for posterity. > Everything else in the database would be stored in UTF-16, but now one > must add the capability to store this document separately. And > probably not index it. Regardless of any useful data in it. But if you > use UTF-8 storage instead, you can put it in with the rest (if you can > mark it, even better, but you only need to do it if that is a > requirement). And do what with it, Lars? Keep it on a shelf indefinitely in case some archaeologist unearths a new legacy encoding that might unlock the mystery data? Is this really worth the effort of redefining UTF-8 and disallowing free conversion between UTF-8 and Unicode code points? Do you have a use case for this? > I can reinterprete your example. Using the French word is exactly the > solution I am proposing, and I see your solution is to replace the > word with a placeholder which says "a word that does not exist in > German". Even worse, you want to use the same placeholder for all the > unknown words. Numbering them would be better, but awkward, since you > don't know how to assign numbers. Fortunetely, with bytes in invalid > sequences, the numbering is trivial and has a meaning. So with your plan, you have invalid sequence #1, invalid sequence #2, and so forth. Now, what do the sequences mean? Is there any way to interpret them? No, there isn't, because by definition these sequences represent characters from an unknown coding standard. Either (a) nobody has gone to the trouble to find out what characters they truly represent, (b) the original standard is lost and we will *never* know, or (c) we are waiting for the archaeologist to save the day. 
In the meantime, the UTF-8 data with invalid sequences must be kept isolated from all processes that would interpret the sequences as code points, and raise an exception on invalid sequences-- in other words, all existing processes that handle UTF-8. > Let's compare UTF-8 to UTF-16 conversion to an automated translation > from German to French. What Unicode standard says can be interpreted > as follows: > > * All input text must be valid German language. > * All output text must be valid French language. > * Any unknown words shall be replaced by a (single) 'unknown word' > placeholder. If you have French words that cannot be translated into German at all, and nobody in the target audience is capable of understanding French, then what you have is an inscrutable collection of mystery data, perhaps suitable for research and examination by linguists, but not something that the audience can make any sense of. In that case, converting all the mystery data to a single "unknown word" placeholder is no worse than any other solution, and in particular, no worse than a solution that converts 100 different mystery words into 100 different placeholders, *none* of which the audience can decipher. > And that last statement goes for German words missing in your > dictionary, misspelled words, Spanish words, proper nouns... The underlying assumption is that somebody, somewhere, will be able to recognize these "foreign" or "unrecognized" words and make some sense of them. But in your character encoding example, the premise is that we DON'T know what the original encoding was, and it's too difficult or impossible to find out, so we just shoeho
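The conventional behavior being defended here can be sketched as a decoder that substitutes U+FFFD REPLACEMENT CHARACTER whenever the bytes are not valid UTF-8; the original byte values are unrecoverable after that point. The resynchronization policy below (skip one byte per error) is a simplification, and real decoders vary on that detail.

    #include <stdint.h>
    #include <stddef.h>

    #define REPLACEMENT 0xFFFDu

    /* Decode one code point starting at s[*i]; on malformed input emit
       U+FFFD and skip a single byte. */
    static uint32_t decode_or_replace(const unsigned char *s, size_t len, size_t *i)
    {
        unsigned char b = s[*i];
        int extra;
        uint32_t cp, min;

        if (b < 0x80) { (*i)++; return b; }
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else { (*i)++; return REPLACEMENT; }          /* invalid lead byte */

        if (*i + extra >= len) { (*i)++; return REPLACEMENT; }   /* truncated */
        for (int k = 1; k <= extra; k++) {
            if ((s[*i + k] & 0xC0) != 0x80) { (*i)++; return REPLACEMENT; }
            cp = (cp << 6) | (s[*i + k] & 0x3F);
        }
        if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            (*i)++; return REPLACEMENT;               /* overlong or out of range */
        }
        *i += 1 + extra;
        return cp;
    }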
Re: Nicest UTF
Lars Kristan <[EMAIL PROTECTED]> writes: >> This is simply what you have to do. You cannot convert the data >> into Unicode in a way that says "I don't know how to convert this >> data into Unicode." You must either convert it properly, or leave >> the data in its original encoding (properly marked, preferably). > > Here lies the problem. Suppose you have a document in UTF-8, which > somehow got corrupted and now contains a single invalid sequence. > Are you proposing that this document needs to be stored separately? He is not proposing that. > Everything else in the database would be stored in UTF-16, but now > one must add the capability to store this document separately. No, it can be be stored in UTF-16 or whatever else is used. Except the corrupted part of course, but it's corrupted, and thus useless, so it doesn't matter what happens with it. > Now suppose you have a UNIX filesystem, containing filenames in a legacy > encoding (possibly even more than one). If one wants to switch to UTF-8 > filenames, what is one supposed to do? Convert all filenames to UTF-8? Yes. > Who will do that? A system administrator (because he has access to all files). > And when? When the owners of the computer system decide to switch to UTF-8. > Will all users agree? It depends on who decides about such things. Either they don't have a voice, or they agree and the change is made, or they don't agree and the change is not made. What's the point? > Should all filenames that do not conform to UTF-8 be declared invalid? What do you mean by "invalid"? They are valid from the point of view of the OS, but they will not work with reasonable applications which use Unicode internally. > If you keep all processing in UTF-8, then this is a decision you can > postpone. You mean, various programs will break at various points of time, instead of working correctly from the beginning? If it's broken, fix it, instead of applying patches which will sometimes hide the fact that it's broken, or sometimes not. > I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames. > Do you want to discourage them? Mixing any two incompatible filename encodings on the same file system is a bad idea. > IMHO, preserving data is more important, but so far it seems it is > not a goal at all. With a simple argument - that Unicode only > defines how to process Unicode data. Understandably so, but this > doesn't mean it needs to remain so. If you don't know the encoding and want to preserve the values of bytes, then don't convert it to Unicode. > Well, you may have a wrong assumption here. You probably think that > I convert invalid sequences into PUA characters and keep them as > such in UTF-8. That is not the case. Any invalid sequences in UTF-8 > are left as they are. If they need to be converted to UTF-16, then > PUA is used. If they are then converted to UTF-8, they are converted > back to their original bytes, hence the incorrect sequences are > re-created. This does not make sense. If you want to preserve the bytes instead of working in terms of characters, don't convert it at all - keep the original byte stream. > One more example of data loss that arises from your approach: If a > single bit is changed in UTF-16 or UTF-32, that is all that will > happen (in more than 99% of the cases). If a single bit changes in > UTF-8, you risk that the entire character will be dropped or > replaced with the U+FFFD. But funny, only if it ever gets converted > to the UTF-16 or UTF-32. 
Not that this is a major problem on its > own, but it indicates that there is something fishy in there. If you change one bit in a file compressed by gzip, you might not be able to recover any part of it. What's the point? UTF-x were not designed to minimize the impact of corruption of encoded bytes. If you want to preserve the text despite occasional corruption, use a higher level protocol for this (if I remember correctly, RAR can add additional information to an archive which allows to recover the data even if parts of the archive, entire blocks, have been lost). > There was a discussion on nul characters not so long ago. Many text > editors do not properly preserve nul characters in text files. > But it is definitely a nice thing if they do. While preserving nul > characters only has a limited value, preserving invalid sequences > in text files could be crucial. An editor should alert the user that the file is not encoded in a particular encoding or that it's corrupted, instead of trying to guess which characters were supposed to be there. If it's supposed to edit binary files too, it should work on the bytes instead of decoded characters. > A UTF-8 based editor can easily do this. A UTF-16 based editor > cannot do it at all. If you say that UTF-16 is not intended for such > a purpose, then so be it. But this also means that UTF-8 is superior. It's much easier with CP-1252, which shows that it's superior to UTF-8 :-) > Yes, it is not related
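For concreteness, one way to read the scheme Lars describes (and which is being objected to here) is this: the 128 byte values 0x80-0xFF, when they occur in invalid positions, are mapped onto 128 fixed code points on conversion to UTF-16, and mapped back to the original bytes on conversion to UTF-8. Lars does not say which code points he uses; the PUA base below is purely an assumption for illustration. Note that the round trip is lossless only if every converter in the chain knows the convention, which is exactly the interoperability problem raised in this thread.

    #include <stdint.h>

    /* Assumed, purely illustrative PUA base for the 128 "escaped byte"
       code points; the actual choice is not specified in the thread. */
    #define ESCAPE_BASE 0xE000u

    /* Byte from an invalid UTF-8 sequence -> escape code point (for UTF-16). */
    static uint32_t escape_byte(unsigned char b)   /* b is 0x80..0xFF */
    {
        return ESCAPE_BASE + (b - 0x80u);
    }

    /* Escape code point -> original byte, when converting back to UTF-8. */
    static int unescape_cp(uint32_t cp, unsigned char *out)
    {
        if (cp >= ESCAPE_BASE && cp < ESCAPE_BASE + 0x80u) {
            *out = (unsigned char)(0x80u + (cp - ESCAPE_BASE));
            return 1;       /* emit this raw byte instead of encoding cp */
        }
        return 0;           /* not an escape; encode cp as normal UTF-8 */
    }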
Re: Nicest UTF
Asmus Freytag wrote: A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider 1) 1 extra test per character (to see whether it's a surrogate) In my experience with tuning a fair amount of utf-16 software, this test takes pretty close to zero time. All modern processors have branch and pipeline trickery that fairly effectively disappears the cost of a predictable branch within a tight loop. Occurrences of supplementary characters should generally be rare enough that the extra time to process them when they are encountered is not statistically significant. 2) special handling every 100 to 1000 characters (say 10 instructions) 3) additional cost of accessing 16-bit registers (per character) 4) reduction in cache misses (each the equivalent of many instructions) This is a big deal. The costs in plowing through lots of text data with relatively simple processing appear to be heavily related to the required memory bandwidth. Assuming reasonably carefully written code, that is. 5) reduction in disk access (each the equivaletn of many many instructions) For many operations, e.g. string length, both 1, and 2 are no-ops, so you need to apply a reduction factor based on the mix of operations you do perform, say 50%-75%. For many processors, item 3 is not an issue. For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each occurrence depending on the architecture. Their relative weight depends not only on cache sizes, but also on how many other instructions per character are performed. For text scanning operations, their cost does predominate with large data sets. -- Andy Heninger [EMAIL PROTECTED]
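The "1 extra test per character" mentioned above is simply the lead-surrogate check in a loop like the following sketch; because supplementary characters are rare, the branch is highly predictable.

    #include <stdint.h>
    #include <stddef.h>

    /* Count code points in a UTF-16 code unit buffer.  The lead-surrogate
       check is the extra per-character test being discussed. */
    static size_t count_code_points(const uint16_t *u, size_t n_units)
    {
        size_t i = 0, count = 0;
        while (i < n_units) {
            uint16_t w = u[i];
            if (w >= 0xD800 && w <= 0xDBFF &&        /* lead surrogate */
                i + 1 < n_units &&
                u[i + 1] >= 0xDC00 && u[i + 1] <= 0xDFFF)
                i += 2;                              /* surrogate pair */
            else
                i += 1;
            count++;
        }
        return count;
    }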
Re: Nicest UTF
Asmus Freytag wrote: > A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider > 3) additional cost of accessing 16-bit registers (per character) > For many processors, item 3 is not an issue. I do not know, I only know of a few of them; for example, I do not know how Alpha or Sparc or PowerPC handle 16-bit data (I have heard differing reports). I agree this was not an issue for the 80386-80486 or Pentium. However, for the more recent processors, P6, Pentium 4, or AMD K7 or K8, I am unsure, and I would appreciate insights. I remember reading that in the case of the AMD K7, for instance, 16-bit instructions (all? a few of them? only ALU-related, i.e. excluding load and store, which is the point here? I do not know) are handled in a different way from the 32-bit ones, e.g. with a reduced number of decoders. The impact could be quite significant. I also remember that when the P6 was launched (1995, known as the PentiumPro), there was a good deal of criticism of Intel because the performance of 16-bit code was actually worse than on an equivalent Pentium (though there was an advantage for 32-bit code); of course this should be considered in context, where 16-bit (DOS/Windows 3.x) code was important, something that has since faded. But I believe the reasoning behind the arguments should still hold. Finally, there is certainly an issue with the need to add an instruction prefix on x86 processors. The issue is reduced for the Pentium 4 (because the prefix does not consume space in the L1 cache), but it still holds for the L2 cache. And the impact is noticeable; I do not have figures for access to UTF-16 data, but I know that when using 64-bit mode (with the AMD K8), the need for a prefix to access 64-bit data, which consumes code cache space, was cited as the cause of a 1-3% penalty in execution time. Of course, such a tiny penalty is easily hidden by other factors, such as the others Dr. Freytag mentioned. > Given this little model and some additional assumptions about your > own project(s), you should be able to determine the 'nicest' UTF for > your own performance-critical case. My point was that the variability of these factors leads to keeping all three UTFs as possible candidates when one considers writing a "perfect-world" library. Can we say we are in agreement? By the way, this also means that the optimisations to be considered inside the library could be very different, since the optimal uses can be significantly different. For example, use of UTF-32 might signal a user bias toward easy management of code points, disregarding memory use, so the code used in the library should favour time over space (so unrolling loops and similar things could be considered). UTF-8 /might/ be the reverse. Antoine
Re: Nicest UTF
Arcane Jill wrote: > Probably a dumb question, but how come nobody's invented "UTF-24" yet? > I just made that up, it's not an official standard, but one could > easily define UTF-24 as UTF-32 with the most-significant byte (which > is always zero) removed, hence all characters are stored in exactly > three bytes and all are treated equally. You could have UTF-24LE and > UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting > this is a particularly brilliant idea, but I just wonder why no-one's > suggested it before. It has been suggested before, by Pim Blokland on April 3, 2003, in a message titled "UTF-24." If you get the digest, it's in Digest V3 #79. > The "UTF-24" thing seems a reasonably sensible question though. Is it > just that we don't like it because some processors have alignment > restrictions or something? Almost all do. In addition, no programming language I know of has a 3-byte-wide integer data type (maybe INTERCAL does), so the efficiency of UTF-24 would be wasted in software as well as in hardware. Besides that, there were the usual protests that supplementary characters would be vanishingly rare in the context of "normal" text, and that one should use compression (SCSU/BOCU or GP tools) if size is an issue. None of this stopped me from experimentally implementing it, of course, but I haven't touched it since finishing the implementation. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Nicest UTF
Title: RE: Nicest UTF Doug Ewell wrote: > RE: Nicest UTFLars Kristan wrote: > > >> I think UTF8 would be the nicest UTF. > > > > I agree. But not for reasons you mentioned. There is one other > > important advantage: UTF-8 is stored in a way that permits storing > > invalid sequences. I will need to elaborate that, of course. > > I could not disagree more with the basic premise of Lars' > post. It is a > fundamental and critical mistake to try to "extend" Unicode with > non-standard code unit sequences to handle data that cannot be, or has > not been, converted to Unicode from a legacy standard. This > is not what > any character encoding standard is for. What a standard is or is not for is a decision. And Unicode consortium is definitely the body that makes the decision in this case. But this decision should not be based solely on theory and ideal worlds. > > > 1.2 - Any data for which encoding is not known can only be > stored in a > > UTF-16 database if it is converted. One needs to choose a conversion > > (say Latin-1, since it is trivial). When a user finds out that the > > result is not appealing, the data needs to be converted back to the > > original 8-bit sequence and then the user (or an algorithm) can try > > various encodings until the result is appealing. > > This is simply what you have to do. You cannot convert the data into > Unicode in a way that says "I don't know how to convert this data into > Unicode." You must either convert it properly, or leave the > data in its > original encoding (properly marked, preferably). Here lies the problem. Suppose you have a document in UTF-8, which somehow got corrupted and now contains a single invalid sequence. Are you proposing that this document needs to be stored separately? Everything else in the database would be stored in UTF-16, but now one must add the capability to store this document separately. And probably not index it. Regardless of any useful data in it. But if you use UTF-8 storage instead, you can put it in with the rest (if you can mark it, even better, but you only need to do it if that is a requirement). > > It is just as if a German speaker wanted to communicate a > word or phrase > in French that she did not understand. She could find the correct > German translation and use that, or she could use the French word or > phrase directly (moving the translation burden onto the > listener). What > she cannot do is "extend" German by creating special words that are > placeholders for French words whose meaning she does not know. I can reinterprete your example. Using the French word is exactly the solution I am proposing, and I see your solution is to replace the word with a placeholder which says "a word that does not exist in German". Even worse, you want to use the same placeholder for all the unknown words. Numbering them would be better, but awkward, since you don't know how to assign numbers. Fortunetely, with bytes in invalid sequences, the numbering is trivial and has a meaning. Let's compare UTF-8 to UTF-16 conversion to an automated translation from German to French. What Unicode standard says can be interpreted as follows: * All input text must be valid German language. * All output text must be valid French language. * Any unknown words shall be replaced by a (single) 'unknown word' placeholder. And that last statement goes for German words missing in your dictionary, misspelled words, Spanish words, proper nouns... > > > 2.2 - Any data for which encoding is not known can simply be stored > > as-is. > > NO. 
Do not do this, and do not encourage others to do this. > It is not valid UTF-8. I never said it is valid UTF-8. The fact remains that I can store legacy data in the same store as UTF-8 data. But I cannot do that if storage is UTF-16 based. Now suppose you have a UNIX filesystem, containing filenames in a legacy encoding (possibly even more than one). If one wants to switch to UTF-8 filenames, what is one supposed to do? Convert all filenames to UTF-8? Who will do that? And when? Will all users agree? Should all filenames that do not conform to UTF-8 be declared invalid? And those files inaccessible? If you keep all processing in UTF-8, then this is a decision you can postpone. But if you start using UTF-32 applications for processing filenames, invalid sequences will be dropped and those files can in fact become inaccessible. And then you'll be wondering why users don't want to start using Unicode. I didn't encourage users to mix UTF-8 filenames and Latin-1 filenames. Do you want to discourage them? > > Among other things, you run the risk that the mystery data happens to form a valid UTF-8 sequence, by sheer coi
Re: Nicest UTF
Probably a dumb question, but how come nobody's invented "UTF-24" yet? I just made that up, it's not an official standard, but one could easily define UTF-24 as UTF-32 with the most-significant byte (which is always zero) removed, hence all characters are stored in exactly three bytes and all are treated equally. You could have UTF-24LE and UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly brilliant idea, but I just wonder why no-one's suggested it before. (And then of course, there's UTF-21, in which blocks of 21 bits are concatenated, so that eight Unicode characters will be stored in every 21 bytes - and not to mention UTF-20.087462841250343, in which a plain text document is simply regarded as one very large integer expressed in radix 1114112, and whose UTF-20.087462841250343 representation is simply that number expressed in binary. But now I'm getting /very/ silly - please don't take any of this seriously.) :-) The "UTF-24" thing seems a reasonably sensible question though. Is it just that we don't like it because some processors have alignment restrictions or something? Arcane Jill -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Marcin 'Qrczak' Kowalczyk Sent: 02 December 2004 16:59 To: [EMAIL PROTECTED] Subject: Re: Nicest UTF "Arcane Jill" <[EMAIL PROTECTED]> writes: Oh for a chip with 21-bit wide registers! Not 21-bit but 20.087462841250343-bit :-) -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Philippe Verdy wrote: > Only the encoder may be a bit complex to write (if one wants to > generate the optimal smallest result size), but even a moderate > programmer could find a simple and working scheme with a still > excellent compression rate (around 1 to 1.2 bytes per character on > average for any Latin text, and around 1.2 to 1.5 bytes per character > for Asian texts which would still be a good application of SCSU face > to UTF-32 or even UTF-8). If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul, or Yi syllables, you have no chance of doing better than 2 bytes per character. This is because it is not possible in SCSU to set a dynamic window to any range between U+3400 and U+DFFF, where these characters reside. Such a window would be of little use anyway, because real-world texts using these characters would draw from so many windows that single-byte mode would be less efficient than Unicode mode, where 2 bytes per character is the norm. Of course, this is still better than UTF-32 or UTF-8 for these characters. For Katakana and Hiragana, you can get the same efficiency with SCSU as for other small scripts, but very few texts are written in pure kana except for young children. Sorry for missing this point in my earlier post. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/ (*) No, I'm not interested in arguing over this word.
Re: Nicest UTF
Philippe Verdy wrote: >> Here is a string, expressed as a sequence of bytes in SCSU: >> >> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E >> M o s s o v SP i s SP . > > Without looking at it, it's easy to see that this stream is separated > in three sections, initiated by 05 1C, then 05 1D, then 12. I can't > remember without looking at the UTN what they perform (i.e. which > Unicode code points range they select), but the other bytes are simple > offsets relative to the start of the selected ranges. Also the third > section is ended by a regular dot (2E) in the ASCII range selected for > the low half-page, and the other bytes are offsets for the script > block initiated by 12. 05 is a static-quote tag which modifies only the next byte. It doesn't really initiate a new section; it's intended for isolated characters where initiating a new section would be wasteful. The sequences <05 1C> and <05 1D> encode the matching double-quote characters U+201C and U+201D respectively. 12 switches to a new dynamic window -- in this case, window 2, which is predefined to point to the Cyrillic block -- so it does select a range as you said. Also, the ASCII bytes do represent Basic Latin characters. > Immediately I can identify this string, without looking at any table: > > "Mossov?" is ??. > > where each ? replaces a character that I can't decipher only through > my defective memory. (I don't need to remember the details of the > standard table of ranges, because I know that this table is complete > in a small and easily available document). Actually "Moscow," not "Mossov" -- but as you said, this is not important because a computer would have gotten this arithmetic right. The actual string is: “Moscow” is Москва. > The decoder part of SCSU still remains extremely trivial to implement, given the small but complete list of codes that can alter the state of > the decoder, because there's no choice in its interpretation and > because the set of variables to store the decoder state is very > limited, as well as the number of decision tests at each step. This is > a "finite state automata". I think "extremely trivial" is overstating the case a bit. It is straightforward and not very difficult, but still somewhat more complex than a UTF. (There had better not be any choice in interpretation, if we want lossless decompression!) BTW, the singular is "automaton." > Only the encoder may be a bit complex to write (if one wants to > generate the optimal smallest result size), but even a moderate > programmer could find a simple and working scheme with a still > excellent compression rate (around 1 to 1.2 bytes per character on > average for any Latin text, and around 1.2 to 1.5 bytes per character > for Asian texts which would still be a good application of SCSU face > to UTF-32 or even UTF-8). UTN #14 contains pseudocode for an encoder that beats the Japanese example in UTS #6 (by one byte, big deal) and can be easily translated into working code. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
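For readers who want to follow the decoding, here is a sketch of a decoder restricted to the features this example actually uses: single-byte mode, the default dynamic windows, SQn quoting (tags 0x01-0x08) and SCn window selection (tags 0x10-0x17). A real SCSU decoder also has to handle window definition, Unicode mode, supplementary characters and the remaining control bytes, so treat this as an illustration of the example, not an implementation of UTS #6.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Static and default dynamic window offsets from UTS #6. */
    static const uint32_t static_win[8] =
        { 0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000 };
    static const uint32_t dynamic_win[8] =
        { 0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00 };

    static void decode_scsu_subset(const unsigned char *s, size_t len)
    {
        int win = 0;                        /* active dynamic window, initially 0 */
        for (size_t i = 0; i < len; i++) {
            unsigned char b = s[i];
            if (b >= 0x01 && b <= 0x08 && i + 1 < len) {   /* SQn: quote one char */
                unsigned char c = s[++i];
                uint32_t cp = (c < 0x80) ? static_win[b - 1] + c
                                         : dynamic_win[b - 1] + (c - 0x80);
                printf("U+%04X\n", cp);
            } else if (b >= 0x10 && b <= 0x17) {           /* SCn: select window */
                win = b - 0x10;
            } else if (b < 0x80) {                         /* ASCII passes through */
                printf("U+%04X\n", (uint32_t)b);
            } else {                                       /* high byte: use window */
                printf("U+%04X\n", dynamic_win[win] + (b - 0x80));
            }
        }
    }

    /* Doug's byte sequence decodes to: U+201C, M, o, s, c, o, w, U+201D,
       " is ", then the Cyrillic letters of "Москва", then "." */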
SCSU as internal encoding (was: Re: Nicest UTF)
Philippe Verdy wrote: >> The point is that indexing should better be O(1). > > SCSU is also O(1) in terms of indexing complexity... simply because it > keeps the exact equivalence with codepoints, and requires a *fixed* > (and small) number of steps to decode it to code points, but also > because the decoder states uses a *fixed* (and small) number of > variables for the internal context (unlike more powerful compression > algorithms like dictionnary-based, Lempel-Ziv-Welsh-like, algorithms > such as deflate). As Marcin said, SCSU is O(n) in terms of indexing complexity, because you have to decode the first (n - 1) characters before you can decode the n'th. Even when you have a run of "ASCII" bytes between 0x20 and 0x7E, there is no guarantee that the characters are Basic Latin. There might have been a previous SCU tag that switched into Unicode mode. >> No, individual characters are immutable in almost every language. > > But individual characters do not always have any semantic. For > languages, the relevant unit is almost always the grapheme cluster, > not the character (so not its code point...). As grapheme clusters > need to be represented on variable lengths, an algorithm that could > only work with fixed-width units would not work internationaly or > would cause serious problems for correct analysis or transformation of > true languages. This is beside the point, as I said at the outset. In programming, you have to deal with individual characters in a string on a regular basis, even if some characters depend on others from a linguistic standpoint. > Code points are probably the easiest thing to describe what an text > algorithm is supposed to do, but this is not a requirement for > applications (in fact many libraries have been written that correctly > implement the Unicode algorithms, without even dealing with code > points, but only with in-memory code units of UTF-16 or even in UTF-8 > or GB18030, or directly with serialization bytes of UTF-16LE or UTF-8 > or SCSU or ether encoding schemes). Algorithms that operate on CES-specific code units are what lead to such "wonderful" innovations as CESU-8. All text operations, except for encoding and decoding, should work with code points. Marcin responded: > UTF-8 is much better for interoperability than SCSU, because it's > already widely supported and SCSU is not. True, but not really Philippe's point. Philippe again: > The question is why you would need to extract the nth codepoint so > blindly. If you have such reasons, because you know the context in > which this index is valid and usable, then you can as well extract a > sequence using an index in the SCSU encoding itself using the same > knowledge. > > Linguistically, extracting a substring or characters at any random > index in a sequence of code points will only cause you problems. In > general, you will more likely use index as a way to mark a known > position that you have already parsed sequentially in the past. You have to do this ALL THE TIME in programming. Example: searching and replacing text. To search a string for a substring, you would normally write a function that would not only give a yes/no answer (i.e. "this string does/does not contain the substring"), but would also indicate *where* the substring was found within the string. That's because the world needs not only search tools, but also search-and-replace tools, and you need to know where the substring is in order to replace it with another. "Linguistically" has nothing to do with it. 
Nothing prevents the user of a search-and-replace tool from doing something linguistically unsound, nor should it. If you do this in SCSU, you have to keep track of the state of the decoder within the string (single-byte vs. Unicode mode, current dynamic window, and position of all dynamic windows). If you lose track of the decoder state, you run the risk of corrupting the data. (Philippe acknowledged this in his next paragraph.) You really need to convert internally to code points in order to do this. I'm a believer in SCSU as an efficient storage and transfer encoding, but not as an internal process code. > All those are not demonstration: decoding IRC commands or similar > things does not constitute the need to encode large sets of texts. In > your examples, you show applications that need to handle locally some > strings made for computer languages. One of the main stated goals of SCSU was to provide good compression for small strings. > Texts of human languages, or even a collection of person names, or > places are not like this, and have a much wider variety, but with huge > possibilities for data compression (inherent to the phonology of human > languages and their overall structure, but also due to repetitive > conventions spread throughout the text to allow easier reading and > understanding). This is where general-purpose compression schemes excel, and should be considered. (You might want to read UTN #1
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Now consider scanning forwards. We want to strip a beginning of a string. For example the string is an irc message prefixed with a command and we want to take the message only for further processing. We have found the end of the prefix and we want to produce a string from this position to the end (a copy, since strings are immutable). All those are not demonstration: decoding IRC commands or similar things does not constitute the need to encode large sets of texts. In your examples, you show applications that need to handle locally some strings made for computer languages. Texts of human languages, or even a collection of person names, or places are not like this, and have a much wider variety, but with huge possibilities for data compression (inherent to the phonology of human languages and their overall structure, but also due to repetitive conventions spread throughout the text to allow easier reading and understanding). Scanning backward a person name or human text is possibly needed locally, but such text has a strong forward directionality without which it does not make sense. Same thing if you scan such text starting at random positions: you could make many false interpretations of this text by extracting random fragments like this. Anyway, if you have a large database of texts to process or even to index, you will, in fine, need to scan this text linearily first from the beginning to the end, should it be only to create an index for accessing it later randomly. You will still need to store the indexed text somewhere, and in order to maximize the performance, or responsiveness of your application, you'll need to minimize its storage: that's where compression takes place. This does not change the semantic of the text, does not remove its semantics, but this is still an optimization, which does not prevent a further access with more easily parsable representation as stateless streams of characters, through surjective (sometimes bijective) converters between the compressed and uncompressed forms. My conclusion: there's no "best" representation to fit all needs. Each representation has its merits in its domain. The Unicode UTFs are excellent only for local processing of limited texts, but they are not necessarily the best for long term storage or for large text sets. And even for texts that will be accessed frequently, compressed schemes can still constitute optimizations, even if these texts need to be decompressed repeatedly each time they are needed. I am clearly against the arguments with "one scheme fits all needs", even if you think that UTF-32 is the only viable long-term solution.
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The question is why you would need to extract the nth codepoint so > blindly. For example I'm scanning a string backwards (to remove '\n' at the end, to find and display the last N lines of a buffer, to find the last '/' or last '.' in a file name). SCSU in general supports traversal only forwards. > But remember the context in which this discussion was introduced: > which UTF would be the best to represent (and store) large sets of > immutable strings. The discussion about indexes in substrings is not > relevevant in that context. It is relevant. A general purpose string representation should support at least a bidirectional iterator, or preferably efficient random access. Neither is possible with SCSU. * * * Now consider scanning forwards. We want to strip a beginning of a string. For example the string is an irc message prefixed with a command and we want to take the message only for further processing. We have found the end of the prefix and we want to produce a string from this position to the end (a copy, since strings are immutable). With any stateless encoding a suitable library function will compute the length of the result, allocate memory, and do an equivalent of memcpy. With SCSU it's not possible to copy the string without analysing it because the prefix might have changed the state, so the suffix is not correct when treated as a standalone string. If the stripped part is short and the remaining part is long, it might pay off to scan the part we want to strip and perform a shortcut of memcpy if the prefix did not change the state (which is probably a common case). But in general we must recompress the whole copied part! We can't even precalculate its physical size. Decompressing into temporary memory will negate benefits of a compressed encoding, so we should better decompress and compress in parallel into a dynamically resizing buffer. This is ridiculously complex compared to a memcpy. The *only* advantage of SCSU is that it takes little space. Although in most programs most strings are ASCII, and SCSU never beats ISO-8859-1 which is what the implementation of my language is using for strings which no characters above U+00FF, so it usually does not have even this advantage. Disadvantages are everywhere else: every operation which looks at the contents of a string or produces contents of a string is more complex. Some operations can't be supported at all with the same asymptotic complexity, so the API would have to be changed as well to use opaque iterators instead of indices. It's more complicated both for internal processing and for interoperability (unless the other end understands SCSU too, which is unlikely). Plain immutable character arrays are not completely universal either (e.g. they are not sufficient for a buffer of a text editor), but they are appropriate as the default representation for common cases; for representing filenames, URLs, email addresses, computer language identifiers, command line option names, lines of a text file, messages in a dialog in a GUI, names of columns of a database table etc. Most strings are short and thus performing a physical copy when extracting a substring is not disastrous. But the complexity of SCSU is too bad. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. The question is why you would need to extract the nth codepoint so blindly. If you have such reasons, because you know the context in which this index is valid and usable, then you can as well extract a sequence using an index in the SCSU encoding itself using the same knowledge. Linguistically, extracting a substring or characters at any random index in a sequence of code points will only cause you problems. In general, you will more likely use index as a way to mark a known position that you have already parsed sequentially in the past. However it is true that if you have determined a good index position to allow future extraction of substrings, SCSU will be more complex because you not only need to remember the index, but also the current state of the SCSU decoder, to allow decoding characters encoded starting at that index. This is not needed for UTF's and most legacy character encodings, or national standards, or GB18030 which looks like a valid UTF, even though it is not part of the Unicode standard itself. But remember the context in which this discussion was introduced: which UTF would be the best to represent (and store) large sets of immutable strings. The discussion about indexes in substrings is not relevevant in that context.
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: >> The point is that indexing should better be O(1). > > SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. > But individual characters do not always have any semantic. For > languages, the relevant unit is almost always the grapheme cluster, > not the character (so not its code point...). How do you determine the semantics of a grapheme cluster? Answer: by splitting it into code points. A code point is atomic, it's not split any more, because there is a finite number of them. When a string is exchanged with another application or network computer or the OS, it always uses some encoding which is closer to code points than to grapheme clusters, no matter if it's UTF-8 or UTF-16 or ISO-8859-something. If the string was originally stored as an array of grapheme clusters, it would have to be translated to code points before further conversion. > Which represent will be the best is left to implementers, but I really > think that compressed schemes are often introduced to increase the > application performances and reduce the needed resources both in > memory and for I/O, but also in networking where interoperability > across systems and bandwidth optimization are also important design > goals... UTF-8 is much better for interoperability than SCSU, because it's already widely supported and SCSU is not. It's also easier to add support for UTF-8 than for SCSU. UTF-8 is stateless, SCSU is stateful - this is very important. UTF-8 is easier to encode and decode. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
- Original Message - From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Sunday, December 05, 2004 1:37 AM Subject: Re: Nicest UTF "Philippe Verdy" <[EMAIL PROTECTED]> writes: There's nothing that requires the string storage to use the same "exposed" array, The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... simply because it keeps the exact equivalence with codepoints, and requires a *fixed* (and small) number of steps to decode it to code points, but also because the decoder states uses a *fixed* (and small) number of variables for the internal context (unlike more powerful compression algorithms like dictionnary-based, Lempel-Ziv-Welsh-like, algorithms such as deflate). Not having a constant side per code point requires one of three things: 1. Using opaque iterators instead of integer indices. 2. Exposing a different unit in the API. 3. Living with the fact that indexing is not O(1) in general; perhaps with clever caching it's good enough in common cases. Altough all three choices can work, I would prefer to avoid them. If I had to, I would probably choose 1. But for now I've chosen a representation based on code points. Anyway, each time you use an index to access to some components of a String, the returned value is not an immutable String, but a mutable character or code unit or code point, from which you can build *other* immatable Strings No, individual characters are immutable in almost every language. But individual characters do not always have any semantic. For languages, the relevant unit is almost always the grapheme cluster, not the character (so not its code point...). As grapheme clusters need to be represented on variable lengths, an algorithm that could only work with fixed-width units would not work internationaly or would cause serious problems for correct analysis or transformation of true languages. Assignment to a character variable can be thought as changing the reference to point to a different character object, even if it's physically implemented by overwriting raw character code. When you do that, the returned character or code unit or code point does not guarantee that you'll build valid Unicode strings. In fact, such character-level interface is not enough to work with and transform Strings (for example it does not work to perform correct transformation of lettercase, or to manage grapheme clusters). This is a different issue. Indeed transformations like case mapping work in terms of strings, but in order to implement them you must split a string into some units of bounded size (code points, bytes, etc.). Yes, but why do you want that this intermediate unit be the code point? Such algorithm can be developped with any UTF, or even with compressed encoding schemes through accessor or enumerator methods... All non-trivial string algorithms boil down to working on individual units, because conditionals and dispatch tables must be driven by finite sets. Any unit of a bounded size is technically workable, but they are not equally convenient. Most algorithms are specified in terms of code points, so I chose code points for the basic unit in the API. "Most" is the right term here: this is not a requirement, and it's not because it is the simplest way to implement such algorithm that it will be the most efficient in terms of performance or resource allocations. Most experiences prove that the most efficient algorithms are also complex to implement. 
Code points are probably the easiest thing to describe what an text algorithm is supposed to do, but this is not a requirement for applications (in fact many libraries have been written that correctly implement the Unicode algorithms, without even dealing with code points, but only with in-memory code units of UTF-16 or even in UTF-8 or GB18030, or directly with serialization bytes of UTF-16LE or UTF-8 or SCSU or ether encoding schemes). Which represent will be the best is left to implementers, but I really think that compressed schemes are often introduced to increase the application performances and reduce the needed resources both in memory and for I/O, but also in networking where interoperability across systems and bandwidth optimization are also important design goals...
Re: Nicest UTF
Philippe Verdy wrote: >> I appreciate Philippe's support of SCSU, but I don't think *even I* >> would recommend it as an internal storage format. The effort to >> encode and decode it, while by no means Herculean as often perceived, >> is not trivial once you step outside Latin-1. > > I said: "for immutable strings", which means that these Strings are > instanciated for long term, and multiple reuses. In that sense, what > is really significant is its decoding, not the effort to encode it > (which is minimal for ISO-8859-1 encoded source texts, or Unicode > UTF-encoded texts that only use characters from the first page). > > Decoding SCSU is very straightforward, even if this is stateful (at > the internal character level). But for immutable strings, there's no > need to handle various initial states, and the states associated with > each conponent character of the string has no importance (strings > being immutable, only the decoding of the string as a whole makes > sense). Here is a string, expressed as a sequence of bytes in SCSU: 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E See how long it takes you to decode this to Unicode code points. (Do not refer to UTN #14; that would be cheating. :-) It may not be rocket science, but it is not trivial. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Nicest UTF
Lars Kristan wrote: >> I think UTF8 would be the nicest UTF. > > I agree. But not for reasons you mentioned. There is one other > important advantage: UTF-8 is stored in a way that permits storing > invalid sequences. I will need to elaborate that, of course. I could not disagree more with the basic premise of Lars' post. It is a fundamental and critical mistake to try to "extend" Unicode with non-standard code unit sequences to handle data that cannot be, or has not been, converted to Unicode from a legacy standard. This is not what any character encoding standard is for. > 1.2 - Any data for which encoding is not known can only be stored in a > UTF-16 database if it is converted. One needs to choose a conversion > (say Latin-1, since it is trivial). When a user finds out that the > result is not appealing, the data needs to be converted back to the > original 8-bit sequence and then the user (or an algorithm) can try > various encodings until the result is appealing. This is simply what you have to do. You cannot convert the data into Unicode in a way that says "I don't know how to convert this data into Unicode." You must either convert it properly, or leave the data in its original encoding (properly marked, preferably). It is just as if a German speaker wanted to communicate a word or phrase in French that she did not understand. She could find the correct German translation and use that, or she could use the French word or phrase directly (moving the translation burden onto the listener). What she cannot do is "extend" German by creating special words that are placeholders for French words whose meaning she does not know. > 2.2 - Any data for which encoding is not known can simply be stored > as-is. NO. Do not do this, and do not encourage others to do this. It is not valid UTF-8. Among other things, you run the risk that the mystery data happens to form a valid UTF-8 sequence, by sheer coincidence. The example of "NESTLÉ™" in Windows CP1252 is applicable here. The last two bytes are C9 99, a valid UTF-8 sequence for U+0259. By applying the concept of "adaptive UTF-8" (as Dan Oscarsson called it in 1998), this sequence would be interpreted as valid UTF-8, and data loss would occur. > 2.4 - Any data that was stored as-is may contain invalid sequences, > but these are stored as such, in their original form. Therefore, it is > possible to raise an exception (alert) when the data is retrieved. > This warns the user that additional caution is needed. That was not > possible in 1.4. This is where the fatal mistake is made. No matter what Unicode encoding form is used, its entire purpose is to encode *Unicode code points*, not to implement a two-level scheme that supports both Unicode and non-Unicode data. What sort of "exception" is to be raised? What sort of "additional caution" should the user take? What if this process is not interactive, and contains no user intervention? > 3.1 - Unfortunately we don't live in either of the two perfect worlds, > which makes it even worse. A database on UNIX will typically be (or > can be made to be) 8-bit. Therefore perfectly able to handle UTF-8 > data. On Windows however, there is a lot of support for UTF-16, but > trying to work in UTF-8 could prove to be a handicap, if not close to > impossible. UTF-8 and UTF-16, used correctly, are perfectly interchangeable. It is not in any way a fault of UTF-16 that it cannot be used to store arbitrary binary data. > 3.3 - For the record: other UTF formats CAN be made equally useful to > UTF-8. 
It requires 128 codepoints. Back in 2002, I have tried to > convince people on the Unicode mailing list that this should be done, > but have failed. Because it is an incredibly bad idea. > I am now using the PUA for this purpose. And I am even tempted to hope > nobody will never realize the need for these 128 codepoints, because > then all my data will be non-standard. You *should* use the PUA for this purpose. It is an excellent application of the PUA. But do not be surprised if someone else, somewhere, decides to use the same 128 PUA code points for some other purpose. That does not make your data "non-standard," because all PUA data, by definition, is "non-standard." What you are doing with the PUA is far more standard, and far more interoperable, than writing invalid UTF-8 sequences and expecting parsers to interpret them as "undeciphered 8-bit legacy text of some sort." > 4.1 - UTF-32 is probably very useful for certain string operations. > Changing case for example. You can do it in-place, like you could > with ASCII. Perhaps it can even be done in UTF-8, I am not sure. But > even if it is possible today, it is definitely not guaranteed that it > will always remain so, so one shouldn't rely on it. Not only is this not 100% true, as others have pointed out, but it is completely irrelevant to your other points. > 4.2 - But UTF-8 is superior. You can make UTF-8 functions ignore > inv
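The byte arithmetic behind the C9 99 example above is easy to verify: 0xC9 is a two-byte UTF-8 lead byte and 0x99 a continuation byte, and combining their payload bits yields U+0259.

    #include <stdio.h>

    int main(void)
    {
        unsigned lead = 0xC9, cont = 0x99;
        /* 0xC9 = 110 01001 (lead of a 2-byte sequence), 0x99 = 10 011001 */
        unsigned cp = ((lead & 0x1F) << 6) | (cont & 0x3F);
        printf("U+%04X\n", cp);   /* prints U+0259 (LATIN SMALL LETTER SCHWA) */
        return 0;
    }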
Re: Nicest UTF
Asmus Freytag wrote: > Given this little model and some additional assumptions about your > own project(s), you should be able to determine the 'nicest' UTF for > your own performance-critical case. This is absolutely correct. Each situation may have different needs and constraints, and these should govern which UTF is best suited for the task. No one UTF is better than the others in all cases. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > There's nothing that requires the string storage to use the same > "exposed" array, The point is that indexing should better be O(1). Not having a constant side per code point requires one of three things: 1. Using opaque iterators instead of integer indices. 2. Exposing a different unit in the API. 3. Living with the fact that indexing is not O(1) in general; perhaps with clever caching it's good enough in common cases. Altough all three choices can work, I would prefer to avoid them. If I had to, I would probably choose 1. But for now I've chosen a representation based on code points. > Anyway, each time you use an index to access to some components of a > String, the returned value is not an immutable String, but a mutable > character or code unit or code point, from which you can build > *other* immatable Strings No, individual characters are immutable in almost every language. Assignment to a character variable can be thought as changing the reference to point to a different character object, even if it's physically implemented by overwriting raw character code. > When you do that, the returned character or code unit or code point > does not guarantee that you'll build valid Unicode strings. In fact, > such character-level interface is not enough to work with and > transform Strings (for example it does not work to perform correct > transformation of lettercase, or to manage grapheme clusters). This is a different issue. Indeed transformations like case mapping work in terms of strings, but in order to implement them you must split a string into some units of bounded size (code points, bytes, etc.). All non-trivial string algorithms boil down to working on individual units, because conditionals and dispatch tables must be driven by finite sets. Any unit of a bounded size is technically workable, but they are not equally convenient. Most algorithms are specified in terms of code points, so I chose code points for the basic unit in the API. In fact in my language there is no separate character type: a code point extracted from a string is represented by a string of length 1. It doesn't change the fact that indexing a string by code point index should run in constant time, and thus using UTF-8 internally would be a bad idea unless we implement one of the three points above. > Once you realize that, which UTF you use to handle immutable String > objects is not important, because it becomes part of the "blackbox" > implementation of String instances. The black box must provide enough tools to implement any algorithm specified in terms of characters, an algorithm which was not already provided as a primitive by the language. Algorithms generally scan strings sequentially, but in order to store positions to come back to them later you must use indices or some iterators. Indices are simpler (and in my case more efficient). > Using SCSU for such String blackbox can be a good option if this > effectively helps in store many strings in a compact (for global > performance) but still very fast (for transformations) representation. I disagree. SCSU can be a separate type to be used explicitly, but it's a bad idea for the default string representation. Most strings are short, and thus constant factors and simplicity matter more than the amount of storage. And you wouldn't save much storage anyway: as I said, in my representation strings which contain only characters U+..U+00FF are stored one byte per character. 
The majority of strings in average programs are ASCII. In general, what I don't like about SCSU is that there is no obvious compression algorithm which makes good use of its various features. Each compression algorithm is either not as powerful as it could be, or is extremely slow (trying various choices), or is extremely complicated (trying only sensible paths).

> Unfortunately, the immutable String implementations in Java or C#
> or Python do not allow the application designer to decide which
> representation will be the best (they are implemented as concrete
> classes instead of virtual interfaces with possible multiple
> implementations, as they should; the alternative to interfaces would
> have been class-level methods allowing the application to negotiate
> the tuning parameters with the blackbox class implementation).

Some functions accept any sequence of characters. Other functions accept only standard strings. The question is how often to use each style. Choosing the first option increases flexibility but adds an overhead in the common case. For example, case mapping of a string would have to either perform a dispatching call at each step, or be implemented twice. Currently it's implemented for strings only, in C, and thus avoids calling a generic indexing function and other overheads. At some point I will probably implement it again, to work for arbitrary sequences of characters, but it's more work for effects that I don't currently need.
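The two styles can be pictured in Java roughly as follows (a hedged sketch with hypothetical helper names; the poster's actual implementation is in C and in his own language): the generic routine accepts any CharSequence and pays a virtual charAt() dispatch on every step, while the String-only routine can hand the whole job to one specialized call. The per-character mapping shown here is deliberately simplified and is not a full Unicode case mapping.

    // Generic style: flexible, but dispatches through the CharSequence interface per unit.
    static String upperGeneric(CharSequence s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            out.append(Character.toUpperCase(s.charAt(i))); // simplified per-unit mapping
        }
        return out.toString();
    }

    // String-only style: one specialized call, no per-unit dispatch in user code.
    static String upperStringOnly(String s) {
        return s.toUpperCase(java.util.Locale.ROOT);
    }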
Re: Nicest UTF
On Dec 3, 2004, at 2:54 AM, Andrew C. West wrote:

> I strongly agree that all Unicode implementations should cover all of
> Unicode, and not just the BMP, and it really annoys me when they
> don't; but suggesting that you need to implement supra-BMP characters
> because they are going to start popping up all over the place is
> wrong in my opinion (not that Doug suggested that, but that's my
> extrapolation of his point). Software developers need to implement
> supra-BMP characters because some users (probably very few) will from
> time to time want to use them, and software should allow people to do
> what they want.

Actually, about 10% of the glyphs in the Japanese fonts that ship with Mac OS X are represented by characters in plane 2. The main reason they are there is that they are used in names (people, places, and companies). So there are real customers who want to use characters outside the BMP. I would not characterize it as "very few". That's true of the vast majority of SMP characters, but not all of them.

Deborah Goldsmith
Internationalization, Unicode Liaison
Apple Computer, Inc.
[EMAIL PROTECTED]
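A quick check with standard Java APIs shows why supra-BMP support cannot be faked with char-based code: a single plane-2 ideograph such as U+20B9F (a CJK Unified Ideographs Extension B character used in Japanese) occupies two UTF-16 code units, so counting chars gives the wrong answer.

    // Standard Java APIs: a plane-2 character is one code point but two UTF-16 code units.
    public class SupplementaryDemo {
        public static void main(String[] args) {
            String name = new String(Character.toChars(0x20B9F)); // U+20B9F, CJK Extension B
            System.out.println(name.length());                          // prints 2 (code units)
            System.out.println(name.codePointCount(0, name.length()));  // prints 1 (code points)
        }
    }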
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: Random access by code point index means that you don't use strings as immutable objects, No. Look at Python, Java and C#: their strings are immutable (don't change in-place) and are indexed by integers (not necessarily by code points, but it doesn't change the point). Those strings are not indexed. They are just accessible through methods or accessors, that act *as if* they were arrays. There's nothing that requires the string storage to use the same "exposed" array, and in fact you can as well work on immutable strings, as if they were vectors of code points, or vectors of code units, and sometimes vectors of bytes. Note for example the difference between the .length property of Java arrays, and the .length() method of java String instances... Note also the fact that the "conversion" of an array of bytes or code units or code points to a String requires distinct constructors, and that the storage is copied rather than simply referenced (the main reason being that indexed vectors or arrays are mutable in their indexed content, but not String instances which become sharable). Anyway, each time you use an index to access to some components of a String, the returned value is not an immutable String, but a mutable character or code unit or code point, from which you can build *other* immatable Strings (using for example mutable StringBuffers or StringBuilder or similar objects in other languages). When you do that, the returned character or code unit or code point does not guarantee that you'll build valid Unicode strings. In fact, such character-level interface is not enough to work with and transform Strings (for example it does not work to perform correct transformation of lettercase, or to manage grapheme clusters). The most powerful (and universal) transformations are those that don't use these interfaces directly, but that work on complete Strings and return complete Strings. The character-level APIs are convenience for very basic legacy transformations, but they do not solve alone most internationalization problems; or they are used as a "protected" interface that allow building more powerful String to String transformations. Once you realize that, which UTF you use to handle immutable String objects is not important, because it becomes part of the "blackbox" implementation of String instances. If you consider then the UTF as a blackbox, then the real arguments for an UTF or another depends on the set of String-to-String transformations you want to use (because it conditions the implmentation of these transformations), but more importantly it affects the efficiency of the String storage allocation. For this reason, the blackbox can determine itself which UTF or internal encoding is the best to perform those transformations: the total volume of immutable string instances to handle in memory and the frequency of their instanciation determines which representation to use (because large String volumes will sollicitate the memory manager, and will seriously impact the overall application performance). Using SCSU for such String blackbox can be a good option if this effectively helps in store many strings in a compact (for global performance) but still very fast (for transformations) representation. 
Unfortunately, the immutable String implementations in Java or C# or Python do not allow the application designer to decide which representation will be the best (they are implemented as concrete classes instead of virtual interfaces with possible multiple implementations, as they should; the alternative to interfaces would have been class-level methods allowing the application to negotiate the tuning parameters with the blackbox class implementation). There are other classes or libraries within which such multiple representations are possible, and easily and transparently convertible from one to the other.

(Note that this discussion is related to the UTF used to represent code points, but today there are also needs to work on strings within grapheme cluster boundaries, including the various normalization forms, and a few libraries do exist for which the various normalizations can be changed without changing the "immutable" aspect of Strings, the complexity being that Strings do not always represent plain text...)
Re: Nicest UTF
From: "Theo" <[EMAIL PROTECTED]> From: Asmus Freytag <[EMAIL PROTECTED]> So, despite it being UTF-8 case insensitive, it was totally blastingly fast. (One person reported counting words at 1MB/second of pure text, from within a mixed Basic / C environment). You'll need to keep in mind, that the counter must look up through thousands of words (Every single word its come across in the text), on every single word lookup. Anyhow, from my experience, UTF-8 is great for speed and RAM. Probably true for English or most Western European Latin-based languages (plus Greek and Coptic). But for other languages that still use lots of characters in the range U+ to U+03FF (C0 and C1 controls, Basic Latin, Latin-1 suplement, Latin Extended-A and -B, IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek and Coptic) UTF-8 and UTF-16 may be nearly as efficient. For all others, that need lots of characters out of the range U+ to U+03FF (Cyrillic, Armenian, Hebrew, Arabic, and all Asian or Native-American or African scripts, or even PUAs), UTF-16 is better (more compact in memory, so faster). UTF-32 will be better only for historic texts written nearly completely with characters out of the BMP (for now, only Old Ialic, Gothic, Ugaritic, Deseret, Shavian, Osmanya, Cypriot Syllabary), if C0 controls (such as TAB, CR and LF), or ASCII SPACE, or NBSP are a minority.
Re: Nicest UTF
From: "Asmus Freytag" <[EMAIL PROTECTED]> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider 1) 1 extra test per character (to see whether it's a surrogate) 2) special handling every 100 to 1000 characters (say 10 instructions) 3) additional cost of accessing 16-bit registers (per character) 4) reduction in cache misses (each the equivalent of many instructions) 5) reduction in disk access (each the equivaletn of many many instructions) (...) For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each occurrence depending on the architecture. Their relative weight depends not only on cache sizes, but also on how many other instructions per character are performed. For text scanning operations, their cost does predominate with large data sets. I tend to disagree with you on points 4 and 5: cache misses, and disk accesses (more commonly refered to as "data locality" in computing performances) really favors UTF-16 face to UTF-32, simply because UTF-16 will be more compact for almost every text you need to process, unless you are working on texts that only contain characters from a script *not present at all* in the BMP (this sentence excludes Han, even if there are tons of ideographs out of the BMP, because these ideographs are almost never used alone, but used seldomly within tons of other conventional Han characters in the BMP). Given that these scripts are all historic ones, or were encoded for technical purpose with very specific usage, a very large majority of texts will not use significant numbers of characters out of the BMP, so the use of surrogates in UTF-16 will remain a minority. In all cases, even for texts made only of characters out of the BMP, UTF-16 can't be larger than UTF-32. The only case where it would be worse than UTF-32 is for the internal representation of strings in memory, where 16-bit code units can't be represented with 16-bit only, for example if memory cells are not individually addressable below units of at least 32 bits, and the CPU architecture is very inefficient when working with 16-bit bitfields within 32-bit memory units or registers, due to extra shifts and masking operations needed to pack and unpack 16-bit bitfields into a single 32-bit memory cell. I doubt that such architecture would be very successful, given that too many standard protocols depend on being able to work with datastreams made of 8-bit bytes: with such architecture, all data I/O would need to store 8-bit bytes in separate but addressable 32-bit memory cells, which would really be a poor usage of available central memory (such architecture would require much more RAM to work with equivalent performances for data I/O, and even the very costly fast RAM caches would need to be increased a lot, meaning higher hardware construction costs). So even on such 32-bit only (or 64-bit only...) architectures (where for example the C datatype "char" would be 32-bit or 64-bit), there would be efficient instructions in the CPU to allow packing/unpacking bytes in 32-bit (or 64-bit) memory cells (or at least at the register level, with instructions allowing to work efficiently with such bitfields).
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > Decoding SCSU is very straightforward, But not for random access by code point index, which is needed by many string APIs. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/