Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Mike Ayers scripsit:

>   I thought that URLs were specified to be in Unicode.  Am I mistaken?

You are.  URLs are specified to be in *ASCII*.  There is a %-encoding
hack that allows you to represent random-octet filenames as ASCII.
Some people (including me) think it's a good idea to use this hack
to specify non-ASCII characters with double encoding (first as UTF-8,
then with the %-hack), but the URI Syntax RFC doesn't say.
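
A rough illustration of that double encoding (a sketch in present-day Python 3;
the name is just an example, and urllib.parse is the stdlib module assumed):

    from urllib.parse import quote, unquote

    name = "déclaration"                    # a non-ASCII name component
    encoded = quote(name.encode("utf-8"))   # first UTF-8, then the %-hack
    print(encoded)                          # d%C3%A9claration
    print(unquote(encoded, encoding="utf-8") == name)  # True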

-- 
John Cowan  [EMAIL PROTECTED]
http://www.reutershealth.com    http://www.ccil.org/~cowan
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues.  --Cousin James


Unicode Version 4.1.0 Beta Release

2004-12-14 Thread Rick McGowan
The next version of the Unicode Standard will be Version 4.1.0, due for  
release in March, 2005.

A BETA version of the updated Unicode Character Database files is  
available for public comment. We strongly encourage implementers to  
download these files and test them with their programs, well before the end  
of the beta period. These files are located at the following URL:

http://www.unicode.org/Public/4.1.0/

(or ftp://www.unicode.org/Public/4.1.0/)

A detailed description of the beta is located here:

http://www.unicode.org/versions/beta.html

Any comments on the beta Unicode Character Database should be reported  
using the Unicode reporting form. The comment period ends January 31, 2005.  
All substantive comments must be received by that date for consideration  
at the next UTC meeting. Editorial comments (typos, etc) may be submitted  
after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other  
files at any time. The beta files will be discarded once Unicode 4.1.0 is  
final. It is inappropriate to cite these files as other than a work in  
progress.

Testers should not commit any product or implementation to the code points  
in the current beta data files. Testers should also be ready for retesting  
based on updated data files which will be posted after the February, 2005  
UTC meeting.



If you have comments for official consideration, please post them by  
submitting your comments through our feedback & reporting page:

  http://www.unicode.org/reporting.html

If you wish to discuss beta issues on the Unicode mail list, then please  
use the following link to subscribe (if necessary). Please be aware that  
discussion comments on the Unicode mail list are not automatically recorded  
as beta comments. You must use the reporting link above to generate  
comments for official consideration.

  http://www.unicode.org/consortium/distlist.html


Regards,
Rick McGowan
Unicode, Inc.



RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers

> From: Peter Kirk [mailto:[EMAIL PROTECTED]] 
> Sent: Tuesday, December 14, 2004 3:37 PM


> Thanks for the clarification. Perhaps the bifurcation could 
> be better expressed as into "strings of characters as defined 
> by the locale" and "strings of non-null octets". Then I could 
> re-express this as "the only safe way out of this mess is 
> never to process filenames as strings of characters as 
> defined by the locale".


    That would not be correct for ISO 8859 locales, though (amongst others).  That's why I specified UTF-8.  Although other locales may have the problem of invalid sequences, we're only interested in UTF-8 here.
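
A quick way to see the difference (a sketch in present-day Python 3; the byte
values are arbitrary examples):

    octets = b"\xe9\xfc\xdf"             # "éüß" when read as ISO-8859-1
    print(octets.decode("iso-8859-1"))   # every octet maps to some character
    octets.decode("utf-8")               # raises UnicodeDecodeError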

> Well, I was assuming that when John Cowan implied that 0x08 
> was permitted, and Jill wrote "Unix filenames consist of an 
> arbitrary sequence of octets, excluding 0x00 and 0x2F", they 
> were speaking from the appropriate orifices.


    Correct, and my bad.  I got thrown off by John's:


>>(A private correspondent has come up with an ingenious trick which 
>>depends on being able to create files named 0x08 and 0x7F, but it truly 
>>is a trick, and in any case depends only on an ASCII interpretation.)


    which I misinterpreted to mean that 0x08 was a forbidden character.  It isn't - just real hard to type!



/|/|ike




"Tumbleweed E-mail Firewall " made the following
 annotations on 12/14/04 16:24:51
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


Unicode filenames and other external strings on Unix - existing practice

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
I describe here languages which exclusively use Unicode strings.
Some languages have both byte strings and Unicode strings (e.g. Python)
and then byte strings are generally used for strings exchanged with
the OS, the programmer is responsible for the conversion if he wishes
to use Unicode.
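
A minimal sketch of that byte-string style (present-day Python 3, not the
2004 interpreter; the ISO-8859-1 fallback is just an example policy):

    import os

    for raw in os.listdir(b"."):             # bytes in, bytes out
        try:
            name = raw.decode("utf-8")       # conversion is the programmer's job
        except UnicodeDecodeError:
            name = raw.decode("iso-8859-1")  # a fallback that never fails
        print(name)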

I consider situations when the encoding is implicit. For I/O of file
contents it's always possible to set the encoding explicitly somehow.

Corrections are welcome. This is mostly based on experimentation.


Java (Sun)
--

Strings are UTF-16.

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.
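
The two substitution behaviours can be mimicked like this (a Python 3 sketch
offered only as an analogy to the Java behaviour described above):

    raw = b"caf\xb1.txt"                                      # not valid UTF-8
    print(raw.decode("utf-8", errors="replace"))              # caf<U+FFFD>.txt
    print("caf\u0117.txt".encode("ascii", errors="replace"))  # b'caf?.txt'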


Java (GNU)
--

Strings are UTF-16.

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8.
   Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by "?".


C# (mono)
-

Strings are UTF-16.

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.
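
A rough sketch of that lookup order (Python 3 pseudo-equivalent, not mono's
actual code; the colon-separated list is an assumption):

    import os

    def decode_filename(raw):
        encodings = [e for e in
                     os.environ.get("MONO_EXTERNAL_ENCODINGS", "").split(":") if e]
        for enc in encodings + ["utf-8"]:    # UTF-8 implicitly added at the end
            try:
                return raw.decode(enc)
            except (UnicodeDecodeError, LookupError):
                continue
        return None                          # undecodable name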

a) Interpreting. If a filename cannot be converted, it's skipped in
   a directory listing.

   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
   The reality seems to not match this (mono-1.0.5).

b) Creating. If UTF-8 is used, non-characters are converted to
   pseudo-UTF-8, U+0000 throws an exception (System.ArgumentException:
   Path contains invalid chars), paired surrogates are treated
   correctly, and an isolated surrogate causes an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion 
failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, non-characters and unpaired surrogates are converted to
pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.


Perl
----

Depending on the convention used by a particular function and on
imported packages, a Perl string is treated either as Perl-modified
Unicode (with character values up to 32 bits or 64 bits depending on
the architecture) or as an unspecified locale encoding. It has two
internal representations: ISO-8859-1 and Perl-modified UTF-8 (with
an extended range).

If every Perl string is assumed to be a Unicode string, then filenames
are effectively ISO-8859-1.

a) Interpreting. Characters up to 0xFF are used.

b) Creating. If the filename has no characters above 0xFF, it is
   converted to ISO-8859-1. Otherwise it is converted to Perl-modified
   UTF-8 (all characters, not just those above 0xFF).

Command line arguments and standard I/O are treated in the same way,
i.e. ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on
output, depending on the contents.

This behavior is modifiable by importing various packages and using
interpreter invocation flags. When Perl is told that command line
arguments are UTF-8, the behavior for strings which cannot be
converted is inconsistent: sometimes it's treated as ISO-8859-1,
sometimes an error is signalled.


Haskell
---

Haskell nominally uses Unicode. There is no conversion framework
standardized or implemented yet though. Implementations which support
more than 256 characters currently assume ISO-8859-1 for filenames,
command line arguments and all I/O, taking the lowest 8 bits of a
character code on output.


Common Lisp: Clisp
--

The Common Lisp standard doesn't say anything about string encoding.
In Clisp strings are UTF-32 (internally optimized as UCS-2 and
ISO-8859-1 when possible). Any character code up to U+10FFFF is
allowed, including non-characters and isolated surrogates.

Filenames are assumed to be in the locale encoding.

a) Interpreting. If a byte cannot be converted, an exception is thrown.

b) Creating. If a character cannot be converted, an exception is thrown.


Kogut (my language; this is the current state - can be changed)
-

Strings are UTF-32 (internally optimized as ISO-8859-1 when possible).
Currently any

RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers

> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Ayers
> Sent: Tuesday, December 14, 2004 3:29 PM


> The rule is "No zero, no eight". 


    "No zero, no forty seven".


    My bad.




/|/|ike




"Tumbleweed E-mail Firewall " made the following
 annotations on 12/14/04 16:25:28
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers

> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
> Sent: Tuesday, December 14, 2004 2:47 PM


> More simply, I think that it's an error to have the encoding 
> part of any locale... The system should not depend on them, 
> and for critical things like filesystem volumes, the encoding 
> should be forced by the filesystem itself, and applications 
> should mandatorily follow the filesystem rules.


    It doesn't, it is, and they do.


    The rule is "No zero, no eight".


    The problem is that these valid filenames can't all be translated as valid UTF-8 Unicode.


> Now think about the web itself: it's really a filesystem, 


    No.  It isn't.


> with billions users, or trillion applications using 
> simultaneously hundreds or thousands of incompatible 
> encodings... Many resources on the web seem to have valid 
> URLs for some users but not for others, until URLs are made 
> independant to any user locale, and then not considered as 
> encoded plain-text but only as strings of bytes.


    I thought that URLs were specified to be in Unicode.  Am I mistaken?



/|/|ike



P.S.  [OT] Note the below autoattachment.  I recall that we discussed such clauses on the list some time ago with regard to their legal standing.  Does anyone have a pointer to substantive material on the subject?  I've gotten curious again, 'natch.



"Tumbleweed E-mail Firewall " made the following
 annotations on 12/14/04 15:31:51
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


Re: Roundtripping in Unicode

2004-12-14 Thread Doug Ewell
> Unicode did not invent the notion of conformance to character
> encoding standards. What is new about Unicode is that it has
> *3* interoperable character encoding forms, not just one, and
> all of them are unusual in some way, because they are designed
> for a very, very large encoded character repertoire, and
> involve multibyte and/or non-byte code unit representations.

Geez, even when I was going through my stage of inventing wild and crazy
new UTF's, I made sure they were 100% convertible to and from code
points.  How could they not be?

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
> Lars Kristan <[EMAIL PROTECTED]> writes:
>> Hm, here lies the catch. According to UTC, you need to keep
>> processing the UNIX filenames as BINARY data. And, also according
>> to UTC, any UTF-8 function is allowed to reject invalid sequences.
>> Basically, you are not supposed to use strcpy to process filenames.
> No: strcpy passes raw bytes, it does not interpret them according to
> the locale. It's not "an UTF-8 function".
Correct: [wc]strcpy() handles "string" instances, but not all string 
instances are plain-text, so they don't need to obey UTF encoding rules 
(they just obey the convention of null-byte termination, with no 
restriction on the string length, measured as a size in [w]char[_t] but not 
as a number of Unicode characters).

This is true for the whole standard C/C++ string libraries, as well as in 
Java (String and Char objects or "native" char datatype), and as well in 
almost all string handling libraries of common programming languages.

A "locale" defined as "UTF-8" will experiment lots of problems because of 
the various ways applications will behave face to encoding "errors" 
encountered in filenames: exceptions thrown aborting the program, 
substitution by "?" or U+FFFD causing wrong files to be accessed, some files 
not treated because their name was considered "invalid" althoug they were 
effectively created by some user of another locale...

Filenames are identifiers coded as strings, not as plain-text (even if most 
of these filename strings are plain-text).

The solution is then to use a locale based on a "relaxed version of UTF-8" 
(some spoke about defining a "NOT-UTF-8" and "NOT-UTF-16" encodings to allow 
any sequence of code units, but nobody has thought about how to make 
"NOT-UTF-8" and "NOT-UTF-16" mutually fully reversible; now add "NOT-UTF-32" 
to this nightmare and you will see that "NOT-UTF-32" needs to encode 2^32 
distinct NOT-Unicode-codepoints, and that they must map bijectively to 
exactly all 2^32 sequences possible in NOT-UTF-16 and NOT-UTF-8; I have not 
found a solution to this problem, and I don't know if such solution even 
exists; if such solution exists, it should be quite complex...).




Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
"Arcane Jill" <[EMAIL PROTECTED]> writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
More simply, I think that it's an error to have the encoding part of any 
locale... The system should not depend on them, and for critical things like 
filesystem volumes, the encoding should be forced by the filesystem itself, 
and applications should mandatorily follow the filesystem rules.

Now think about the web itself: it's really a filesystem, with billions of 
users, or trillions of applications simultaneously using hundreds or thousands 
of incompatible encodings... Many resources on the web seem to have valid 
URLs for some users but not for others, until URLs are made independent of 
any user locale, and then not considered as encoded plain-text but only as 
strings of bytes.




Re: Roundtripping in Unicode

2004-12-14 Thread Kenneth Whistler
Marcin Kowalczyk noted:

> Unicode has the following property. Consider sequences of valid
> Unicode characters: from the range U+0000..U+10FFFF, excluding
> non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
> U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
> in any UTF-n, and nothing else is expected from UTF-n.

Actually not quite correct. See Section 3.9 of the standard.

The character encoding forms (UTF-8, UTF-16, UTF-32) are defined
on the range of scalar values for Unicode: 0..D7FF, E000..10FFFF.

Each of the UTF's can represent all of those scalar values, and
can be converted accurately to either of the other UTF's for
each of those values. That *includes* all the code points used
for noncharacters.

U+FFFF is a noncharacter. It is not assigned to an encoded
abstract character. However, it has a well-formed representation
in each of the UTF-8, UTF-16, and UTF-32 encoding forms,
namely:

UTF-8:  <EF BF BF>
UTF-16: <FFFF>
UTF-32: <0000FFFF>
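
This is easy to check mechanically (a sketch in present-day Python 3, whose
codecs accept noncharacters but reject surrogates):

    ch = "\uFFFF"                       # a noncharacter, but a valid scalar value
    print(ch.encode("utf-8").hex())     # efbfbf
    print(ch.encode("utf-16-be").hex()) # ffff
    print(ch.encode("utf-32-be").hex()) # 0000ffff
    "\ud800".encode("utf-8")            # a lone surrogate raises UnicodeEncodeError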

> With the exception of the set of non-characters being irregular and
> IMHO too large (why to exclude U+FDD0..U+FDEF?!), and a weird top
> limit caused by UTF-16, this gives a precise and unambiguous set of
> values for which encoders and decoders are supposed to work.

Well, since conformant encoders and decoders must work for all
the noncharacter code points as well, and since U+10FFFF, however
odd numerologically, is itself precise and unambiguous, I don't
think you even need these qualifications. 

> Well,
> except non-obvious treatment of a BOM (at which level it should be
> stripped? does this include UTF-8?).

The handling of BOM is relevant to the character encoding *schemes*,
where the issues are serialization into byte streams and interpretation
of those byte streams. Whether you include U+FEFF in text or not
depends on your interpretation of the encoding scheme for a Unicode
byte stream.

At the level of the character encoding forms (the UTF's), the
handling of BOM is just as for any other scalar value, and is
completely unambiguous:

UTF-8:  <EF BB BF>
UTF-16: <FEFF>
UTF-32: <0000FEFF>

> 
> A variant of UTF-8 which includes all byte sequences yields a much
> less regular set of abstract string values. Especially if we consider
> that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
> 0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
> order for a BOM to fulfill its role).

This is incorrect. <EF BF BE> *is* valid UTF-8, just as <FFFE> is
valid UTF-16. In both cases these are valid representations of
a noncharacter, which should not be used in public interchange,
but that is a separate issue from the fact that the code unit
sequences themselves are "well-formed" by definition of the
Unicode encoding forms.

> 
> Question: should a new programming language which uses Unicode for
> string representation allow non-characters in strings? 

Yes.

> Argument for
> allowing them: otherwise they are completely useless at all, except
> U+FFFE for BOM detection. Argument for disallowing them: they make
> UTF-n inappropriate for serialization of arbitrary strings, and thus
> non-standard extensions of UTF-n must be used for serialization.

Incorrect. See above. No extensions of any of the encoding forms
are needed to handle noncharacters correctly.

--Ken




RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers

> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]] On Behalf Of Peter Kirk
> Sent: Tuesday, December 14, 2004 11:32 AM


> This is a design flaw in Unix, or in how it is explained to 
> users. Well, Lars wrote "Basically, you are not supposed to 
> use strcpy to process filenames." I'm not sure if that is his 
> opinion or someone else's, but the only safe way out of this 
> mess is never to process filenames as strings.


    As mentioned by Kenneth, Lars was speaking from the wrong orifice when he said that.


    Also, it appears that the term "string" is being used too much and without qualification.  The entire focus of this thread is on what happens when unqualified bytes (filenames) get qualified (by locale), so it would behoove us all to qualify all the strings we're talking about.  For instance, Peter's last clause above bifurcates into:

    "...but the only safe way out of this mess is never to process filenames as UTF-8 strings."


    and:


    "...but the only safe way out of this mess is always to process filenames as opaque C strings."


    which was mentioned early on in this thread, but Lars does not wish to do this.


> This may be called a "trick" but it looks like it could very 
> easily be a security hole. For example, a filename 0x41 0x08 
> 0x42 will be displayed the same as just 0x42, in a Latin-1 or 
> UTF-8 locale. Your friend's trick has become an open door for 
> spoofers.


    Exactly why 0x08 was banned in filenames, as I recall.



/|/|ike




"Tumbleweed E-mail Firewall " made the following
 annotations on 12/14/04 13:16:29
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes:

> If so, Marcin, what exactly is the error, and whose fault is it?

It's an error to use locales with different encodings on the same
system.

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)

2004-12-14 Thread Kenneth Whistler
Lars asked:

> BTW, what are the properties of U+FFFD? In English please, do not point me
> to the standard. 

?!

It has the general category of "Symbol Other" [gc=So].

> Like, can it be a part of an identifier, 

It does not have the ID_Start or the ID_Continue property, which
you could determine for yourself by referring to the standard:

http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

That doesn't prevent a formal syntax definition for a language
from including it within the BNF for defining an identifier,
but in general, no, it would not appear in identifiers, just as
most other symbols would not.

> is it an 'alphanumeric'? 

No.
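
These answers are easy to confirm (a Python 3 sketch using the stdlib
unicodedata module):

    import unicodedata

    print(unicodedata.category("\ufffd"))  # 'So' (Symbol, other)
    print("\ufffd".isidentifier())         # False: not an identifier start character
    print("x\ufffd".isidentifier())        # False: not an identifier continue character
    print("\ufffd".isalnum())              # False: not alphanumeric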

> Let me speculate. It should be a letter 

No.

> (it probably more
> often originally was than wasn't). 

You are referring here to speculation regarding what uninterpretable
sequence in some other character encoding was *converted* to U+FFFD
on conversion to Unicode. But that is irrelevant to the properties
of U+FFFD itself.

That is tantamount, for example, to claiming that the C0 control
code 0x1A SUBSTITUTE should be defined as a "letter", simply because
it is often used in signalling a conversion substitution in
8-bit tables.

> I would accept it for identifiers (variables, filenames). 

If you are defining your own language, that would be your
prerogative, of course. But if you are using standard languages
like C, C++, Java, C#, SQL, etc., it is unlikely that you would
be correct in that approach.

> It has no case properties. And it is obviously not a
> space.

True.

There is much, much more to know about Unicode character properties
than just what can be inferred from an attempt to apply the
POSIX model to UTF-8. A good place to start would be Unicode
Technical Report #23, The Unicode Character Property Model:

http://www.unicode.org/reports/tr23/

And after that, yes, I would point you to the standard.

--Ken







RE: Roundtripping in Unicode

2004-12-14 Thread Kenneth Whistler
Lars said:

> According to UTC, you need to keep processing
> the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
> function is allowed to reject invalid sequences. Basically, you are not
> supposed to use strcpy to process filenames.

This is a very misleading set of statements.

First of all, the UTC has not taken *any* position on the
processing of UNIX filenames. That is an implementation issue
outside the scope of what the UTC normally deals with, and I
doubt that it will take a position on the issue.

It is erroneous to imply that the UTC has indicated that "you
are not supposed to use strcpy to process filenames." It has
done nothing of the kind, and I don't know of any reason why
anyone should think otherwise. I certainly use strcpy to process
filenames, UTF-8 or not, and expect that nearly every implementer
on the list has done so, too.

Any process *interpreting* a UTF-8 code unit sequence as
characters can and should recognize invalid sequences, but
that is a different matter.

If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
a process claiming conformance to UTF-8 and ask it to interpret
that as Unicode characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.
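
The distinction between copying bytes and interpreting them is easy to
demonstrate (sketched in present-day Python 3 rather than C):

    garbage = bytes([0x80, 0xFF, 0x80, 0xFF, 0x80, 0xFF])
    copied = bytes(garbage)     # strcpy-style copying never looks inside
    assert copied == garbage
    garbage.decode("utf-8")     # interpretation raises UnicodeDecodeError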

But there is *nothing* new here.

If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
a process claiming conformance to Shift-JIS and ask it to interpret
that as JIS characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.

Unicode did not invent the notion of conformance to character
encoding standards. What is new about Unicode is that it has
*3* interoperable character encoding forms, not just one, and
all of them are unusual in some way, because they are designed
for a very, very large encoded character repertoire, and
involve multibyte and/or non-byte code unit representations.

> Well, I just hope noone will listen to them and modify strcpy and strchr to
> validate the data when running in UTF-8 locale and start signalling
> something (really, where and how?!). The two statements from UTC don't make
> sense when put together. Unless we are really expected to start building
> everything from scratch.

This is bogus. The UTC has never asked anyone to modify strcpy
and strchr. What anyone implementing UTF-8 using a C runtime
library (or similar set of functions) has to do is completely
comparable to what they have to do for supporting any other
multibyte character encoding on such systems. If your system
handles euc-kr, euc-tw, and/or euc-jp correctly, then adding
UTF-8 support is comparable, in principle and in practice.

--Ken




Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 11:32, Arcane Jill wrote:
> I've been following this thread for a while, and I've pretty much got
> the hang of the issues here. To summarize:

I haven't followed everything, but here is my 2 cents worth.
I note that there is a real problem. I have had significant problems in 
Windows with files copied from other language systems. Sometimes for 
example these files are listed fine in Explorer but when I try to copy 
or delete them they are not found, presumably because the filename is 
being corrupted somewhere in the system and doesn't match.

> Unix filenames consist of an arbitrary sequence of octets, excluding
> 0x00 and 0x2F. How they are /displayed/ to any given user depends on
> that user's locale setting. In this scenario, two users with different
> locale settings will see different filenames for the same file, but
> they will still be able to access the file via the filename that they
> see. These two filenames will be spelt identically in terms of octets,
> but (apparently) differently when viewed in terms of characters.
>
> At least, that's how it was until the UTF-8 locale came along. If we
> consider only one-byte-per-character encodings, then any octet
> sequence is "valid" in any locale. But UTF-8 introduces the
> possibility that an octet sequence might be "invalid" - a new concept
> for Unix. So if you change your locale to UTF-8, then suddenly, some
> files created by other users might appear to you to have invalid
> filenames (though they would still appear valid when viewed by the
> file's creator).

This is not in fact a new concept. Some octet sequences which are valid 
filenames are invalid in a Latin-1 locale - for example, those which 
include octets in the range 0x80-0x9F, if "Latin-1" means ISO 8859-1. 
Some of these octets are of course defined in Windows CP1252 etc, so a 
Unix Latin-1 system may have some interpretation for some of them; but 
others e.g. 0x81 have no interpretation in any flavour of Latin-1 as far 
as I know. So there is by no means a guarantee that every non-Unicode 
Unix locale has an interpretation of every octet, which implies that 
other octets are invalid.

Now no doubt many Unix filename handling utilities ignore the fact that 
some octets are invalid or uninterpretable in the locale, because they 
handle filenames as octet strings (with 0x00 and 0x2F having special 
interpretations) rather than as locale-dependent character strings. But 
these routines should continue to work in a UTF-8 locale, as they make 
no attempt to interpret any octets other than 0x00 and 0x2F.
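
For example, splitting a path into components needs nothing but the 0x2F
octet (a Python 3 sketch; the path is an arbitrary example):

    raw = b"/home/lars/caf\xe9.txt"   # not valid UTF-8, but a perfectly good path
    print(raw.split(b"/"))            # [b'', b'home', b'lars', b'caf\xe9.txt']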

> A specific example: if a file F is accessed by two different users, A
> and B, of whom A has set their locale to Latin-1, and B has set their
> locale to UTF-8, then the filename may appear to be valid to user A,
> but invalid to user B.
>
> Lars is saying (and he's probably right, because he knows more about
> Unix than I) that user B does not necessarily have the right to change
> the actual octet sequence which is the filename of F, just to make it
> appear valid to user B, because doing so would stop a lot of things
> working for user A (for instance, A might have created the file, the
> filename might be hardcoded in a script, etc.). So Lars takes a
> Unix-like approach, saying "retain the actual octet sequence, but feel
> free to try to display and manipulate it as if it were some UTF-8-like
> encoding in which all octet sequences are valid". And all this seems
> to work fine for him, until he tries to roundtrip to UTF-16 and back.

I think the problem here is that a Unix filename is a string of octets, 
not of characters. And so it should not be converted into another 
encoding form as if it is characters; it should be processed at a quite 
different level of interpretation.

Of course a system is free to do what it wants internally.
> I'm not sure why anyone's arguing about this though - Phillipe's
> suggestion seems to be the perfect solution which keeps everyone
> happy. So...
>
> ...allow me to construct a specific example of what Phillipe suggested
> only generally:
>
> ...
>
> This would appear to solve Lars' problem, and because the three
> encodings, NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be
> UTFs, no-one need get upset.

All of this is ingenious, and may be useful for internal processing 
within a Unix system, and perhaps even for interaction between 
cooperating systems. But NOT-Unicode is not Unicode (!) and so Unicode 
should not be expected to standardise it.

I can see that there may be a need for a protocol for open exchange of 
Unix-like filenames. But these filenames should be treated as binary 
data (which may or may not be interpretable in any one locale) and 
encoded as such, rather than forced into the mould of Unicode characters 
which it does not fit.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes:

> Hm, here lies the catch. According to UTC, you need to keep
> processing the UNIX filenames as BINARY data. And, also according
> to UTC, any UTF-8 function is allowed to reject invalid sequences.
> Basically, you are not supposed to use strcpy to process filenames.

No: strcpy passes raw bytes, it does not interpret them according to
the locale. It's not "an UTF-8 function".

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes:

> OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
> NOT-UTF-16 -> NOT-UTF-8

But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would happen to exclude those subsequences of
non-characters which would form a valid UTF-8 fragment.
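
A concrete instance of the failure, using the 0xDC00+octet escape convention
from Jill's NOT-UTF-16 definition (sketched in Python 3; variable names are
invented for the example):

    not_utf16 = [0xDC00 + 0xC3, 0xDC00 + 0xA9]        # two escaped raw bytes
    not_utf8 = bytes(u - 0xDC00 for u in not_utf16)   # -> b'\xc3\xa9'
    # b'\xc3\xa9' is valid UTF-8 for U+00E9, so decoding it as NOT-UTF-8
    # yields [0xE9], not the original two escape code units.
    print([hex(ord(c)) for c in not_utf8.decode("utf-8")])   # ['0xe9']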

Unicode has the following property. Consider sequences of valid
Unicode characters: from the range U+0000..U+10FFFF, excluding
non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
in any UTF-n, and nothing else is expected from UTF-n.

With the exception of the set of non-characters being irregular and
IMHO too large (why to exclude U+FDD0..U+FDEF?!), and a weird top
limit caused by UTF-16, this gives a precise and unambiguous set of
values for which encoders and decoders are supposed to work. Well,
except non-obvious treatment of a BOM (at which level it should be
stripped? does this include UTF-8?).

A variant of UTF-8 which includes all byte sequences yields a much
less regular set of abstract string values. Especially if we consider
that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
order for a BOM to fulfill its role).

Question: should a new programming language which uses Unicode for
string representation allow non-characters in strings? Argument for
allowing them: otherwise they are completely useless at all, except
U+FFFE for BOM detection. Argument for disallowing them: they make
UTF-n inappropriate for serialization of arbitrary strings, and thus
non-standard extensions of UTF-n must be used for serialization.

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 17:47, John Cowan wrote:
> Peter Kirk scripsit:
>
>> I think the problem here is that a Unix filename is a string of octets,
>> not of characters. And so it should not be converted into another
>> encoding form as if it is characters; it should be processed at a quite
>> different level of interpretation.
>
> Unfortunately, that is simply a counsel of perfection.
> Unix filenames are in general input as character strings, output as character
> strings, and intended to be perceived as character strings.  The corner
> cases in which this does not work are not sufficient to overthrow the
> power and generality to be achieved by assuming it 99% of the time.

This is a design flaw in Unix, or in how it is explained to users. Well, 
Lars wrote "Basically, you are not supposed to use strcpy to process 
filenames." I'm not sure if that is his opinion or someone else's, but 
the only safe way out of this mess is never to process filenames as strings.

> (A private correspondent has come up with an ingenious trick which
> depends on being able to create files named 0x08 and 0x7F, but it
> truly is a trick, and in any case depends only on an ASCII interpretation.)

This may be called a "trick" but it looks like it could very easily be a 
security hole. For example, a filename 0x41 0x08 0x42 will be displayed 
the same as just 0x42, in a Latin-1 or UTF-8 locale. Your friend's trick 
has become an open door for spoofers.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Peter Kirk scripsit:

> I think the problem here is that a Unix filename is a string of octets, 
> not of characters. And so it should not be converted into another 
> encoding form as if it is characters; it should be processed at a quite 
> different level of interpretation.

Unfortunately, that is simply a counsel of perfection.

Unix filenames are in general input as character strings, output as character
strings, and intended to be perceived as character strings.  The corner
cases in which this does not work are not sufficient to overthrow the
power and generality to be achieved by assuming it 99% of the time.

(A private correspondent has come up with an ingenious trick which
depends on being able to create files named 0x08 and 0x7F, but it
truly is a trick, and in any case depends only on an ASCII interpretation.)

-- 
Income tax, if I may be pardoned for saying so, John Cowan
is a tax on income.  --Lord Macnaghten (1901)   [EMAIL PROTECTED]



UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

2004-12-14 Thread Edward H. Trager
On Tuesday 2004.12.14 12:50:43 -, Arcane Jill wrote:
> If I have understood this correctly, filenames are not "in" a locale, they 
> are absolute. Users, on the other hand, are "in" a locale, and users view 
> filenames. The same filename can "look" different to two different users. 
> To user A (whose locale is Latin-1), a filename might look valid; to user B 
> (whose locale is UTF-8), the same filename might look invalid.

Correct. The problem will however be limited to the accented Latin characters
present in ISO-8859-1 beyond the ASCII set.  The basic Latin alphabet in the
ASCII set at the beginning of both ISO-8859-1 and UTF-8 will appear unchanged
to both users (the UTF-8 user looking at the Latin-1 user's home directory, or
the Latin-1 user looking at the UTF-8 user's home directory).  So both users
could probably guess the filename they were looking at.  For example, here is
a file on my local machine, a Linux box with the locale set to LANG=en_US.UTF-8:

  déclaration_des_droits.utf8

The accented "e" in "déclaration" appears correctly under the UTF-8 locale.

I then copied this file (using scp) over to an older Sun Solaris box which
I do not administer, so I have to live with the "C" POSIX locale that they
have got that machine set to.  Now, when I view the file names in a terminal
(where the terminal emulator is set to the same locale), I see:

  d??claration_des_droits.utf8

The terminal, being set to interpret the legacy locale, does not know 
how to interpret the two bytes that are used for the UTF-8 "é".
Still, I can guess that the first word should be "déclaration".
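
The two question marks correspond to the two bytes of the UTF-8 "é" (a quick
check in present-day Python 3, using just the first word of the name):

    name = "déclaration"
    data = name.encode("utf-8")
    print(len(name), len(data))   # 11 12: the "é" occupies two bytes
    print(data)                   # b'd\xc3\xa9claration' -> "d??..." in a C locale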

The solution, as has been pointed out, is for everyone to move to
UTF-8 locales.  In the Linux and Unix world, this is already happening
for the most part.  Solaris 10 now defaults to a UTF-8 locale, at least
when set to English.  Both SuSE and Redhat default to UTF-8 locales
for most language and script environments.  And (open source) tools exist for
converting file names from one encoding to another encoding on Linux
and Unix systems.  A group of Japanese developers is working on an NLS
implementation for the BSDs like OpenBSD, which are currently "stuck" with
nothing but the "C" POSIX locale.  I think the name of that project is "Citrus".

-- Ed Trager

   

> 
> Is that right, Lars?
> 
> If so, Marcin, what exactly is the error, and whose fault is it?
> 
> Jill
> 
> -Original Message-
> 
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> 
> Behalf Of Marcin 'Qrczak' Kowalczyk
> 
> Sent: 13 December 2004 14:59
> 
> To: [EMAIL PROTECTED]
> 
> Subject: Re: Roundtripping in Unicode
> 
> Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
> 
> 
> 
> 
> 
> 


RE: Roundtripping in Unicode

2004-12-14 Thread Lars Kristan

Arcane Jill wrote:
> I've been following this thread for a while, and I've pretty 
Thanks for bearing with me. And I hope my response will not discourage you from continuing to do so. That is, until I am banned from the list for heresy.

> much got the 
> hang of the issues here. To summarize:
> 
> Unix filenames consist of an arbitrary sequence of octets, 
> excluding 0x00 
> and 0x2F. How they are /displayed/ to any given user depends 
> on that user's 
> locale setting. In this scenario, two users with different 
> locale settings 
> will see different filenames for the same file, but they will 
> still be able 
> to access the file via the filename that they see. These two 
> filenames will 
> be spelt identically in terms of octets, but (apparently) 
> differently when 
> viewed in terms of characters.
> 
> At least, that's how it was until the UTF-8 locale came along. If we 
I think such problems were already present with Shift-JIS. But I already stated once why this was not noticed and will not repeat myself, unless explicitly asked to do so.

> consider only one-byte-per-character encodings, then any 
> octet sequence is 
> "valid" in any locale. But UTF-8 introduces the possibility 
> that an octet 
> sequence might be "invalid" - a new concept for Unix. So if 
> you change your 
> locale to UTF-8, then suddenly, some files created by other 
> users might 
> appear to you to have invalid filenames (though they would 
> still appear 
> valid when viewed by the file's creator).
> 
> A specific example: if a file F is accessed by two different 
> users, A and B, 
> of whom A has set their locale to Latin-1, and B has set 
> their locale to 
> UTF-8, then the filename may appear to be valid to user A, 
> but invalid to 
> user B.
> 
> Lars is saying (and he's probably right, because he knows 
> more about Unix 
> than I) that user B does not necessarily have the right to 
> change the actual 
> octet sequence which is the filename of F, just to make it 
> appear valid to 
> user B, because doing so would stop a lot of things working 
> for user A (for 
> instance, A might have created the file, the filename might 
> be hardcoded in 
> a script, etc.). So Lars takes a Unix-like approach, saying 
> "retain the 
> actual octet sequence, but feel free to try to display and 
> manipulate it as 
> if it were some UTF-8-like encoding in which all octet 
> sequences are valid". 
> And all this seems to work fine for him, until he tries to 
> roundtrip to 
> UTF-16 and back.
> 
> I'm not sure why anyone's arguing about this though - 
> Phillipe's suggestion 
> seems to be the perfect solution which keeps everyone happy. So...
Well, it doesn't. The rest of my comments will show you why.


> 
> ...allow me to construct a specific example of what Phillipe 
> suggested only 
> generally:
> 
> DEFINITION - "NOT-Unicode" is the character repertoire 
> consisting of the 
> whole of Unicode, and 128 additional characters representing 
> integers in the 
> range 0x80 to 0xFF.
As long as we agree that the codepoints used to store the NOT-Unicode data are valid Unicode codepoints. You noticed yourself that NOT-Unicode should roundtrip through UTF-16. Only valid Unicode codepoints can be safely passed through UTF-16.

> 
> OBSERVATION - Unicode is a subset of NOT-Unicode
But unfortunately data can pass from NOT-Unicode to Unicode. Some people think that this is terribly bad. One would think that storing NOT-UTF-8 in NOT-UTF-16 would prevent data from crossing the boundary, but that is not so.

> 
> DEFINITION - "NOT-UTF-8" is a bidirectional encoding between 
> a NOT-Unicode 
> character stream and an octet stream, defined as follows: if 
> a NOT-Unicode 
> character is a Unicode character then its encoding is the 
> UTF-8 encoding of 
> that character; else the NOT-Unicode character must represent 
> an integer, in 
> which case its encoding is itself. To decode, assume the next 
> NOT-Unicode 
> character is a Unicode character and attempt to decode from 
> the octet stream 
> using UTF-8; if this fails then the NOT-Unicode character is 
> an integer, in 
> which case read one single octet from the stream and return it.
More or less. You have not defined how to return the octet. It must be returned as a valid Unicode codepoint. And if a Unicode character is decoded, one must check if it is any of the codepoints used for this purpose and escape it. But only when decoding NOT-UTF-8. Decoding from UTF-8 remains unchanged.

> 
> OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
Yes, that's the sanity check, because this is what we wanted to get.


> 
> OBSERVATION - NOT-Unicode characters which are Unicode 
> characters will be 
> encoded identically in UTF-8 and NOT-UTF-8
Unfortunately not so. Because you started with the wrong assumption that NOT-UTF-8 data will not be stored in valid codepoints. But the fact that this observation is not true is not really a problem.

RE: Roundtripping in Unicode

2004-12-14 Thread Arcane Jill
I've been following this thread for a while, and I've pretty much got the 
hang of the issues here. To summarize:

Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 
and 0x2F. How they are /displayed/ to any given user depends on that user's 
locale setting. In this scenario, two users with different locale settings 
will see different filenames for the same file, but they will still be able 
to access the file via the filename that they see. These two filenames will 
be spelt identically in terms of octets, but (apparently) differently when 
viewed in terms of characters.

At least, that's how it was until the UTF-8 locale came along. If we 
consider only one-byte-per-character encodings, then any octet sequence is 
"valid" in any locale. But UTF-8 introduces the possibility that an octet 
sequence might be "invalid" - a new concept for Unix. So if you change your 
locale to UTF-8, then suddenly, some files created by other users might 
appear to you to have invalid filenames (though they would still appear 
valid when viewed by the file's creator).

A specific example: if a file F is accessed by two different users, A and B, 
of whom A has set their locale to Latin-1, and B has set their locale to 
UTF-8, then the filename may appear to be valid to user A, but invalid to 
user B.

Lars is saying (and he's probably right, because he knows more about Unix 
than I) that user B does not necessarily have the right to change the actual 
octet sequence which is the filename of F, just to make it appear valid to 
user B, because doing so would stop a lot of things working for user A (for 
instance, A might have created the file, the filename might be hardcoded in 
a script, etc.). So Lars takes a Unix-like approach, saying "retain the 
actual octet sequence, but feel free to try to display and manipulate it as 
if it were some UTF-8-like encoding in which all octet sequences are valid". 
And all this seems to work fine for him, until he tries to roundtrip to 
UTF-16 and back.

I'm not sure why anyone's arguing about this though - Phillipe's suggestion 
seems to be the perfect solution which keeps everyone happy. So...

...allow me to construct a specific example of what Phillipe suggested only 
generally:

DEFINITION - "NOT-Unicode" is the character repertoire consisting of the 
whole of Unicode, and 128 additional characters representing integers in the 
range 0x80 to 0xFF.

OBSERVATION - Unicode is a subset of NOT-Unicode
DEFINITION - "NOT-UTF-8" is a bidirectional encoding between a NOT-Unicode 
character stream and an octet stream, defined as follows: if a NOT-Unicode 
character is a Unicode character then its encoding is the UTF-8 encoding of 
that character; else the NOT-Unicode character must represent an integer, in 
which case its encoding is itself. To decode, assume the next NOT-Unicode 
character is a Unicode character and attempt to decode from the octet stream 
using UTF-8; if this fails then the NOT-Unicode character is an integer, in 
which case read one single octet from the stream and return it.

OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
OBSERVATION - NOT-Unicode characters which are Unicode characters will be 
encoded identically in UTF-8 and NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot 
be represented in UTF-8

DEFINITION - "NOT-UTF-16" is a bidirectional encoding between a NOT-Unicode 
character stream and a 16-bit word stream, defined as follows: if a 
NOT-Unicode character is a Unicode character then its encoding is the UTF-16 
encoding of that character; else the NOT-Unicode character must represent an 
integer, in which case its encoding is 0xDC00 plus the integer. To decode, 
if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the 
NOT-Unicode character is the integer whose value is (word16 - 0xDC00), else 
the NOT-Unicode character is the Unicode character obtained by decoding as 
if UTF-16.

OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> 
NOT-UTF-16 -> NOT-UTF-8
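
A minimal sketch of this roundtrip under the definitions above (present-day
Python 3; the function names are invented, and escaped octets are carried
directly as their NOT-UTF-16 code units 0xDC00+byte):

    def not_utf8_decode(data):
        # Returns NOT-UTF-16 code units: ordinary scalar values for valid
        # UTF-8, and 0xDC00+byte for octets that cannot be decoded.
        units, i = [], 0
        while i < len(data):
            for width in (1, 2, 3, 4):            # try each UTF-8 length
                try:
                    ch = data[i:i + width].decode("utf-8")
                except UnicodeDecodeError:
                    continue
                units.append(ord(ch))
                i += width
                break
            else:
                units.append(0xDC00 + data[i])    # escape the stray octet
                i += 1
        return units

    def not_utf16_to_not_utf8(units):
        out = bytearray()
        for u in units:
            if 0xDC80 <= u <= 0xDCFF:
                out.append(u - 0xDC00)            # escape unit -> raw octet
            else:
                out += chr(u).encode("utf-8")
        return bytes(out)

    raw = b"caf\xe9 r\xc3\xa9sum\xc3\xa9"         # mixed Latin-1 and UTF-8 octets
    assert not_utf16_to_not_utf8(not_utf8_decode(raw)) == raw

Present-day Python 3 ships essentially this escaping as its "surrogateescape"
error handler (PEP 383), which postdates this discussion.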

OBSERVATION - NOT-Unicode characters which are Unicode characters will be 
encoded identically in UTF-16 and NOT-UTF-16

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot 
be represented in UTF-16

DEFINITION - "NOT-UTF-32" is a bidirectional encoding between a NOT-Unicode 
character stream and a 32-bit word stream, defined as follows: if a 
NOT-Unicode character is a Unicode character then its encoding is the UTF-32 
encoding of that character; else the NOT-Unicode character must represent an 
integer, in which case its encoding is 0xDC00 plus the integer. To 
decode, if the next 32-bit word is in the range 0xDC80 to 0xDCFF 
then the NOT-Unicode character is the integer whose value is (word32 - 
0xDC00), else the NOT-Unicode character is the Unicode character 
obtained by decoding as if UTF-32.

OBSERVATION - Roundtripping is possible in the directions NOT-UTF-8 ->

RE: Nicest UTF

2004-12-14 Thread Lars Kristan

D. Starner wrote:
> > Some won't convert any and will just start using UTF-8 
> > for new ones. And this should be allowed. 
> 
> Why should it be allowed? You can't mix items with
> different unlabeled encodings willy-nilly. All you're going
> to get, all you can expect to get is a mess.


Easy for you to say. You're not the one that is going to answer the support calls. They WILL do it. You can jump up and down as much as you like, but they will. If I tell users what you are telling me, they will think I am mad and will stop using my application.


Lars