Re: Roundtripping in Unicode
Mike Ayers scripsit:

> I thought that URLs were specified to be in Unicode. Am I mistaken?

You are. URLs are specified to be in *ASCII*. There is a %-encoding hack that allows you to represent random-octet filenames as ASCII. Some people (including me) think it's a good idea to use this hack to specify non-ASCII characters with double encoding (first as UTF-8, then with the %-hack), but the URI Syntax RFC doesn't say.

-- 
John Cowan  [EMAIL PROTECTED]
http://www.reutershealth.com  http://www.ccil.org/~cowan

Humpty Dump Dublin squeaks through his norse / Humpty Dump Dublin hath a horrible vorse / But for all his kinks English / And his irismanx brogues / Humpty Dump Dublin's grandada of all rogues. --Cousin James
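The double encoding John describes (characters encoded first as UTF-8 octets, then each non-ASCII octet %-escaped into pure ASCII) can be sketched with Python's `urllib`, whose `quote()` happens to implement exactly this convention; the filename "café" is purely illustrative:

```python
from urllib.parse import quote, unquote

# Step 1 (inside quote): encode the character as UTF-8 octets.
# Step 2: %-escape each non-ASCII octet, yielding pure ASCII.
name = "café"              # hypothetical filename
encoded = quote(name)      # quote() applies UTF-8 before %-escaping
print(encoded)             # caf%C3%A9

# Both steps reverse cleanly, so the round trip is lossless:
assert unquote(encoded) == name
```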
Unicode Version 4.1.0 Beta Release
The next version of the Unicode Standard will be Version 4.1.0, due for release in March 2005. A BETA version of the updated Unicode Character Database files is available for public comment. We strongly encourage implementers to download these files and test them with their programs, well before the end of the beta period.

These files are located at the following URL: http://www.unicode.org/Public/4.1.0/ (or ftp://www.unicode.org/Public/4.1.0/)

A detailed description of the beta is located here: http://www.unicode.org/versions/beta.html

Any comments on the beta Unicode Character Database should be reported using the Unicode reporting form. The comment period ends January 31, 2005. All substantive comments must be received by that date for consideration at the next UTC meeting. Editorial comments (typos, etc.) may be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 4.1.0 is final. It is inappropriate to cite these files as other than a work in progress. Testers should not commit any product or implementation to the code points in the current beta data files. Testers should also be ready for retesting based on updated data files which will be posted after the February 2005 UTC meeting.

If you have comments for official consideration, please post them by submitting your comments through our feedback & reporting page: http://www.unicode.org/reporting.html

If you wish to discuss beta issues on the Unicode mail list, then please use the following link to subscribe (if necessary). Please be aware that discussion comments on the Unicode mail list are not automatically recorded as beta comments. You must use the reporting link above to generate comments for official consideration. http://www.unicode.org/consortium/distlist.html

Regards,
Rick McGowan
Unicode, Inc.
RE: Roundtripping in Unicode
> From: Peter Kirk [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, December 14, 2004 3:37 PM
> Thanks for the clarification. Perhaps the bifurcation could be better expressed as into "strings of characters as defined by the locale" and "strings of non-null octets". Then I could re-express this as "the only safe way out of this mess is never to process filenames as strings of characters as defined by the locale".

That would not be correct for ISO 8859 locales, though (amongst others). That's why I specified UTF-8. Although other locales may have the problem of invalid sequences, we're only interested in UTF-8 here.

> Well, I was assuming that when John Cowan implied that 0x08 was permitted, and Jill wrote "Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 and 0x2F", they were speaking from the appropriate orifices.

Correct, and my bad. I got thrown off by John's:

>> (A private correspondent has come up with an ingenious trick which depends on being able to create files named 0x08 and 0x7F, but it truly is a trick, and in any case depends only on an ASCII interpretation.)

which I misinterpreted to mean that 0x08 was a forbidden character. It isn't - just real hard to type!

/|/|ike

"Tumbleweed E-mail Firewall" made the following annotations on 12/14/04 16:24:51 -- This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed. If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
Unicode filenames and other external strings on Unix - existing practice
I describe here languages which exclusively use Unicode strings. Some languages have both byte strings and Unicode strings (e.g. Python); in those, byte strings are generally used for strings exchanged with the OS, and the programmer is responsible for the conversion if he wishes to use Unicode. I consider situations where the encoding is implicit. For I/O of file contents it's always possible to set the encoding explicitly somehow. Corrections are welcome. This is mostly based on experimentation.

Java (Sun)
--
Strings are UTF-16. Filenames are assumed to be in the locale encoding.
a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.
b) Creating. Characters which cannot be converted are replaced by "?".
Command line arguments and standard I/O are treated in the same way.

Java (GNU)
--
Strings are UTF-16. Filenames are assumed to be in Java-modified UTF-8.
a) Interpreting. If a filename cannot be converted, a directory listing contains a null instead of a string object.
b) Creating. All Java characters are representable in Java-modified UTF-8. Obviously not all potential filenames can be represented.
Command line arguments are interpreted according to the locale. Bytes which cannot be converted are skipped. Standard I/O works in ISO-8859-1 by default. Obviously all input is accepted. On output, characters above U+00FF are replaced by "?".

C# (mono)
--
Strings are UTF-16. Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS environment variable, with UTF-8 implicitly added at the end. These encodings are tried in order.
a) Interpreting. If a filename cannot be converted, it's skipped in a directory listing. The documentation says that if a filename, a command line argument etc. looks like valid UTF-8, it is treated as such first, and MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases. The reality seems not to match this (mono-1.0.5).
b) Creating. If UTF-8 is used, non-characters are converted to pseudo-UTF-8, U+0000 throws an exception (System.ArgumentException: Path contains invalid chars), paired surrogates are treated correctly, and an isolated surrogate causes an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL) aborting...
Command line arguments are treated in the same way, except that if an argument cannot be converted, the program dies at start:
[Invalid UTF-8] Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea). Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.
Console.WriteLine emits UTF-8. Paired surrogates are treated correctly; non-characters and unpaired surrogates are converted to pseudo-UTF-8. Console.ReadLine interprets text as UTF-8. Bytes which cannot be converted are skipped.

Perl
--
Depending on the convention used by a particular function and on imported packages, a Perl string is treated either as Perl-modified Unicode (with character values up to 32 or 64 bits depending on the architecture) or as an unspecified locale encoding. It has two internal representations: ISO-8859-1 and Perl-modified UTF-8 (with an extended range). If every Perl string is assumed to be a Unicode string, then filenames are effectively ISO-8859-1.
a) Interpreting. Characters up to 0xFF are used.
b) Creating. If the filename has no characters above 0xFF, it is converted to ISO-8859-1. Otherwise it is converted to Perl-modified UTF-8 (all characters, not just those above 0xFF).
Command line arguments and standard I/O are treated in the same way, i.e. ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on output, depending on the contents. This behavior is modifiable by importing various packages and using interpreter invocation flags. When Perl is told that command line arguments are UTF-8, the behavior for strings which cannot be converted is inconsistent: sometimes they are treated as ISO-8859-1, sometimes an error is signalled.

Haskell
--
Haskell nominally uses Unicode. There is no conversion framework standardized or implemented yet, though. Implementations which support more than 256 characters currently assume ISO-8859-1 for filenames, command line arguments and all I/O, taking the lowest 8 bits of a character code on output.

Common Lisp: Clisp
--
The Common Lisp standard doesn't say anything about string encoding. In Clisp strings are UTF-32 (internally optimized as UCS-2 and ISO-8859-1 when possible). Any character code up to U+10FFFF is allowed, including non-characters and isolated surrogates. Filenames are assumed to be in the locale encoding.
a) Interpreting. If a byte cannot be converted, an exception is thrown.
b) Creating. If a character cannot be converted, an exception is thrown.

Kogut (my language; this is the current state - can be changed)
--
Strings are UTF-32 (internally optimized as ISO-8859-1 when possible). Currently any
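The error-handling strategies catalogued above reduce to three basic policies: replace with U+FFFD (Sun Java interpreting), skip the bytes (GNU Java command line arguments), or raise an error (Clisp). These correspond directly to codec error handlers in Python, used here only to illustrate the three behaviors:

```python
raw = b"abc\xb1\xe6\xea"   # the undecodable byte tail from the mono example

# Policy 1: replace each undecodable byte with U+FFFD (Sun Java style).
assert raw.decode("utf-8", errors="replace") == "abc\ufffd\ufffd\ufffd"

# Policy 2: silently skip undecodable bytes (GNU Java argv style).
assert raw.decode("utf-8", errors="ignore") == "abc"

# Policy 3: refuse the input outright (Clisp style).
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    print("strict decoding raises, as in Clisp")
```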
RE: Roundtripping in Unicode
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Ayers
> Sent: Tuesday, December 14, 2004 3:29 PM
> The rule is "No zero, no eight".

"No zero, no forty seven". My bad.

/|/|ike
RE: Roundtripping in Unicode
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
> Sent: Tuesday, December 14, 2004 2:47 PM
> More simply, I think that it's an error to have the encoding part of any locale... The system should not depend on them, and for critical things like filesystem volumes, the encoding should be forced by the filesystem itself, and applications should mandatorily follow the filesystem rules.

It doesn't, it is, and they do. The rule is "No zero, no eight". The problem is that these valid filenames can't all be translated as valid UTF-8 Unicode.

> Now think about the web itself: it's really a filesystem,

No. It isn't.

> with billions of users, or trillions of applications simultaneously using hundreds or thousands of incompatible encodings... Many resources on the web seem to have valid URLs for some users but not for others, until URLs are made independent of any user locale, and then not considered as encoded plain text but only as strings of bytes.

I thought that URLs were specified to be in Unicode. Am I mistaken?

/|/|ike

P.S. [OT] Note the below autoattachment. I recall that we discussed such clauses on the list some time ago with regard to their legal standing. Does anyone have a pointer to substantive material on the subject? I've gotten curious again, 'natch.

"Tumbleweed E-mail Firewall" made the following annotations on 12/14/04 15:31:51 -- This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed. If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
Re: Roundtripping in Unicode
> Unicode did not invent the notion of conformance to character > encoding standards. What is new about Unicode is that it has > *3* interoperable character encoding forms, not just one, and > all of them are unusual in some way, because they are designed > for a very, very large encoded character repertoire, and > involve multibyte and/or non-byte code unit representations. Geez, even when I was going through my stage of inventing wild and crazy new UTF's, I made sure they were 100% convertible to and from code points. How could they not be? -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Roundtripping in Unicode
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>

> Lars Kristan <[EMAIL PROTECTED]> writes:
>> Hm, here lies the catch. According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically, you are not supposed to use strcpy to process filenames.
> No: strcpy passes raw bytes, it does not interpret them according to the locale. It's not "an UTF-8 function".

Correct: [wc]strcpy() handles "string" instances, but not all string instances are plain text, so they don't need to obey UTF encoding rules (they just obey the convention of null-byte termination, with no restriction on the string length, measured as a size in [w]char[_t] rather than as a number of Unicode characters). This is true for the whole standard C/C++ string libraries, as well as in Java (String and Char objects or the "native" char datatype), and likewise in almost all string-handling libraries of common programming languages.

A "locale" defined as "UTF-8" will experience lots of problems because of the various ways applications behave in the face of encoding "errors" encountered in filenames: exceptions thrown that abort the program, substitution by "?" or U+FFFD causing wrong files to be accessed, some files not processed because their name was considered "invalid" although they were effectively created by some user of another locale... Filenames are identifiers coded as strings, not as plain text (even if most of these filename strings are plain text).

The solution is then to use a locale based on a "relaxed version of UTF-8". Some spoke about defining "NOT-UTF-8" and "NOT-UTF-16" encodings to allow any sequence of code units, but nobody has thought about how to make "NOT-UTF-8" and "NOT-UTF-16" mutually fully reversible. Now add "NOT-UTF-32" to this nightmare and you will see that "NOT-UTF-32" needs to encode 2^32 distinct NOT-Unicode code points, and that they must map bijectively to exactly all sequences possible in NOT-UTF-16 and NOT-UTF-8. I have not found a solution to this problem, and I don't know if such a solution even exists; if it does, it should be quite complex.
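The "strcpy passes raw bytes" point can be demonstrated outside C as well: any byte-level path API copies and compares filenames without consulting the locale. A minimal sketch, assuming a POSIX filesystem that accepts arbitrary non-null, non-slash octets (the filename bytes are illustrative):

```python
import os
import tempfile

# A filename that is a perfectly legal octet string but invalid as
# UTF-8 (it is the Latin-1 spelling of "café").
name = b"caf\xe9"
d = tempfile.mkdtemp().encode()
path = os.path.join(d, name)

# Creating and listing the file by raw bytes involves no decoding at
# all, just as strcpy involves no interpretation of the bytes it copies.
open(path, "wb").close()
assert name in os.listdir(d)   # the octets round-trip exactly

os.remove(path)
os.rmdir(d)
```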
Re: Roundtripping in Unicode
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>

> "Arcane Jill" <[EMAIL PROTECTED]> writes:
>> If so, Marcin, what exactly is the error, and whose fault is it?
> It's an error to use locales with different encodings on the same system.

More simply, I think that it's an error to have the encoding part of any locale... The system should not depend on them, and for critical things like filesystem volumes, the encoding should be forced by the filesystem itself, and applications should mandatorily follow the filesystem rules.

Now think about the web itself: it's really a filesystem, with billions of users, or trillions of applications simultaneously using hundreds or thousands of incompatible encodings... Many resources on the web seem to have valid URLs for some users but not for others, until URLs are made independent of any user locale, and then not considered as encoded plain text but only as strings of bytes.
Re: Roundtripping in Unicode
Marcin Kowalczyk noted:

> Unicode has the following property. Consider sequences of valid Unicode characters: from the range U+0000..U+10FFFF, excluding non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10, and U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded in any UTF-n, and nothing else is expected from UTF-n.

Actually not quite correct. See Section 3.9 of the standard. The character encoding forms (UTF-8, UTF-16, UTF-32) are defined on the range of scalar values for Unicode: 0..D7FF, E000..10FFFF. Each of the UTFs can represent all of those scalar values, and can be converted accurately to either of the other UTFs for each of those values. That *includes* all the code points used for noncharacters.

U+FFFF is a noncharacter. It is not assigned to an encoded abstract character. However, it has a well-formed representation in each of the UTF-8, UTF-16, and UTF-32 encoding forms, namely:

UTF-8: <EF BF BF>
UTF-16: <FFFF>
UTF-32: <0000FFFF>

> With the exception of the set of non-characters being irregular and IMHO too large (why to exclude U+FDD0..U+FDEF?!), and a weird top limit caused by UTF-16, this gives a precise and unambiguous set of values for which encoders and decoders are supposed to work.

Well, since conformant encoders and decoders must work for all the noncharacter code points as well, and since U+10FFFF, however odd numerologically, is itself precise and unambiguous, I don't think you even need these qualifications.

> Well, except non-obvious treatment of a BOM (at which level it should be stripped? does this include UTF-8?).

The handling of BOM is relevant to the character encoding *schemes*, where the issues are serialization into byte streams and interpretation of those byte streams. Whether you include U+FEFF in text or not depends on your interpretation of the encoding scheme for a Unicode byte stream. At the level of the character encoding forms (the UTFs), the handling of BOM is just as for any other scalar value, and is completely unambiguous:

UTF-8: <EF BB BF>
UTF-16: <FEFF>
UTF-32: <0000FEFF>

> A variant of UTF-8 which includes all byte sequences yields a much less regular set of abstract string values. Especially if we consider that <11101111 10111111 10111110> binary is not valid UTF-8, as much as 0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in order for a BOM to fulfill its role).

This is incorrect. <EF BF BE> *is* valid UTF-8, just as <FFFE> is valid UTF-16. In both cases these are valid representations of a noncharacter, which should not be used in public interchange, but that is a separate issue from the fact that the code unit sequences themselves are "well-formed" by definition of the Unicode encoding forms.

> Question: should a new programming language which uses Unicode for string representation allow non-characters in strings?

Yes.

> Argument for allowing them: otherwise they are completely useless at all, except U+FFFE for BOM detection. Argument for disallowing them: they make UTF-n inappropriate for serialization of arbitrary strings, and thus non-standard extensions of UTF-n must be used for serialization.

Incorrect. See above. No extensions of any of the encoding forms are needed to handle noncharacters correctly.

--Ken
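Ken's point that noncharacters are well-formed while lone surrogates are not is directly checkable in any conformant implementation; a quick sketch using Python's codecs (which follow the standard here):

```python
# Noncharacters U+FFFE and U+FFFF encode to well-formed code unit
# sequences in every encoding form, exactly as stated above:
assert "\uffff".encode("utf-8") == b"\xef\xbf\xbf"      # <EF BF BF>
assert "\ufffe".encode("utf-8") == b"\xef\xbf\xbe"      # <EF BF BE>
assert "\ufffe".encode("utf-16-be") == b"\xff\xfe"      # <FFFE>

# An isolated surrogate, by contrast, is not a scalar value and is
# rejected by the encoder:
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```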
RE: Roundtripping in Unicode
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Peter Kirk
> Sent: Tuesday, December 14, 2004 11:32 AM
> This is a design flaw in Unix, or in how it is explained to users. Well, Lars wrote "Basically, you are not supposed to use strcpy to process filenames." I'm not sure if that is his opinion or someone else's, but the only safe way out of this mess is never to process filenames as strings.

As mentioned by Kenneth, Lars was speaking from the wrong orifice when he said that. Also, it appears that the term "string" is being used too much and without qualification. The entire focus of this thread is on what happens when unqualified bytes (filenames) get qualified (by locale), so it would behoove us all to qualify all the strings we're talking about. For instance, Peter's last clause above bifurcates into:

"...but the only safe way out of this mess is never to process filenames as UTF-8 strings."

and:

"...but the only safe way out of this mess is always to process filenames as opaque C strings."

which was mentioned early on in this thread, but Lars does not wish to do this.

> This may be called a "trick" but it looks like it could very easily be a security hole. For example, a filename 0x41 0x08 0x42 will be displayed the same as just 0x42, in a Latin-1 or UTF-8 locale. Your friend's trick has become an open door for spoofers.

Exactly why 0x08 was banned in filenames, as I recall.

/|/|ike
Re: Roundtripping in Unicode
"Arcane Jill" <[EMAIL PROTECTED]> writes: > If so, Marcin, what exactly is the error, and whose fault is it? It's an error to use locales with different encodings on the same system. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)
Lars asked:

> BTW, what are the properties of U+FFFD? In English please, do not point me to the standard.

?! It has the general category of "Symbol, Other" [gc=So].

> Like, can it be a part of an identifier,

It does not have the ID_Start or the ID_Continue property, which you could determine for yourself by referring to the standard: http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

That doesn't prevent a formal syntax definition for a language from including it within the BNF for defining an identifier, but in general, no, it would not appear in identifiers, just as most other symbols would not.

> is it an 'alphanumeric'?

No.

> Let me speculate. It should be a letter

No.

> (it probably more often originally was than wasn't).

You are referring here to speculation regarding what uninterpretable sequence in some other character encoding was *converted* to U+FFFD on conversion to Unicode. But that is irrelevant to the properties of U+FFFD itself. That is tantamount, for example, to claiming that the C0 control code 0x1A SUBSTITUTE should be defined as a "letter", simply because it is often used in signalling a conversion substitution in 8-bit tables.

> I would accept it for identifiers (variables, filenames).

If you are defining your own language, that would be your prerogative, of course. But if you are using standard languages like C, C++, Java, C#, SQL, etc., it is unlikely that you would be correct in that approach.

> It has no case properties. And it is obviously not a space.

True. There is much, much more to know about Unicode character properties than just what can be inferred from an attempt to apply the POSIX model to UTF-8. A good place to start would be Unicode Technical Report #23, The Unicode Character Property Model: http://www.unicode.org/reports/tr23/

And after that, yes, I would point you to the standard.

--Ken
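The properties Ken lists can also be queried programmatically rather than speculated about; a sketch using Python's `unicodedata` module, whose data comes from the same Unicode Character Database:

```python
import unicodedata

# U+FFFD REPLACEMENT CHARACTER has general category So (Symbol, other):
assert unicodedata.category("\ufffd") == "So"

# It is neither alphanumeric nor whitespace:
assert not "\ufffd".isalnum()
assert not "\ufffd".isspace()

# And lacking ID_Continue, it cannot extend an identifier:
assert not ("x" + "\ufffd").isidentifier()
```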
RE: Roundtripping in Unicode
Lars said:

> According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically, you are not supposed to use strcpy to process filenames.

This is a very misleading set of statements. First of all, the UTC has not taken *any* position on the processing of UNIX filenames. That is an implementation issue outside the scope of what the UTC normally deals with, and I doubt that it will take a position on the issue. It is erroneous to imply that the UTC has indicated that "you are not supposed to use strcpy to process filenames." It has done nothing of the kind, and I don't know of any reason why anyone should think otherwise. I certainly use strcpy to process filenames, UTF-8 or not, and expect that nearly every implementer on the list has done so, too.

Any process *interpreting* a UTF-8 code unit sequence as characters can and should recognize invalid sequences, but that is a different matter. If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to a process claiming conformance to UTF-8 and ask it to interpret that as Unicode characters, it should tell me that it is garbage. *How* it tells me that it is garbage is a matter of API design, code design, and application design. But there is *nothing* new here. If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to a process claiming conformance to Shift-JIS and ask it to interpret that as JIS characters, it should tell me that it is garbage. *How* it tells me that it is garbage is a matter of API design, code design, and application design.

Unicode did not invent the notion of conformance to character encoding standards. What is new about Unicode is that it has *3* interoperable character encoding forms, not just one, and all of them are unusual in some way, because they are designed for a very, very large encoded character repertoire, and involve multibyte and/or non-byte code unit representations.

> Well, I just hope no one will listen to them and modify strcpy and strchr to validate the data when running in a UTF-8 locale and start signalling something (really, where and how?!). The two statements from UTC don't make sense when put together. Unless we are really expected to start building everything from scratch.

This is bogus. The UTC has never asked anyone to modify strcpy and strchr. What anyone implementing UTF-8 using a C runtime library (or similar set of functions) has to do is completely comparable to what they have to do for supporting any other multibyte character encoding on such systems. If your system handles euc-kr, euc-tw, and/or euc-jp correctly, then adding UTF-8 support is comparable, in principle and in practice.

--Ken
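The distinction Ken draws (copying bytes is interpretation-free, while *interpreting* them as characters must detect garbage) can be sketched with his own example byte stream, here in Python for illustration:

```python
garbage = b"\x80\xff\x80\xff\x80\xff"   # Ken's example byte stream

# Copying the bytes involves no interpretation and always succeeds,
# just as strcpy succeeds on any null-terminated buffer:
copy = bytes(garbage)
assert copy == garbage

# Asking a decoder to *interpret* the bytes as UTF-8 must fail, just
# as it would for Shift-JIS or any other multibyte encoding:
try:
    garbage.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```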
Re: Roundtripping in Unicode
On 14/12/2004 11:32, Arcane Jill wrote:

> I've been following this thread for a while, and I've pretty much got the hang of the issues here. To summarize:

I haven't followed everything, but here is my 2 cents' worth. I note that there is a real problem. I have had significant problems in Windows with files copied from other language systems. Sometimes, for example, these files are listed fine in Explorer, but when I try to copy or delete them they are not found, presumably because the filename is being corrupted somewhere in the system and doesn't match.

> Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 and 0x2F. How they are *displayed* to any given user depends on that user's locale setting. In this scenario, two users with different locale settings will see different filenames for the same file, but they will still be able to access the file via the filename that they see. These two filenames will be spelt identically in terms of octets, but (apparently) differently when viewed in terms of characters. At least, that's how it was until the UTF-8 locale came along. If we consider only one-byte-per-character encodings, then any octet sequence is "valid" in any locale. But UTF-8 introduces the possibility that an octet sequence might be "invalid" - a new concept for Unix. So if you change your locale to UTF-8, then suddenly, some files created by other users might appear to you to have invalid filenames (though they would still appear valid when viewed by the file's creator).

This is not in fact a new concept. Some octet sequences which are valid filenames are invalid in a Latin-1 locale - for example, those which include octets in the range 0x80-0x9F, if "Latin-1" means ISO 8859-1. Some of these octets are of course defined in Windows CP1252 etc., so a Unix Latin-1 system may have some interpretation for some of them; but others, e.g. 0x81, have no interpretation in any flavour of Latin-1 as far as I know. So there is by no means a guarantee that every non-Unicode Unix locale has an interpretation of every octet, which implies that other octets are invalid. Now no doubt many Unix filename handling utilities ignore the fact that some octets are invalid or uninterpretable in the locale, because they handle filenames as octet strings (with 0x00 and 0x2F having special interpretations) rather than as locale-dependent character strings. But these routines should continue to work in a UTF-8 locale, as they make no attempt to interpret any octets other than 0x00 and 0x2F.

> A specific example: if a file F is accessed by two different users, A and B, of whom A has set their locale to Latin-1, and B has set their locale to UTF-8, then the filename may appear to be valid to user A, but invalid to user B. Lars is saying (and he's probably right, because he knows more about Unix than I) that user B does not necessarily have the right to change the actual octet sequence which is the filename of F, just to make it appear valid to user B, because doing so would stop a lot of things working for user A (for instance, A might have created the file, the filename might be hardcoded in a script, etc.). So Lars takes a Unix-like approach, saying "retain the actual octet sequence, but feel free to try to display and manipulate it as if it were some UTF-8-like encoding in which all octet sequences are valid". And all this seems to work fine for him, until he tries to roundtrip to UTF-16 and back.

I think the problem here is that a Unix filename is a string of octets, not of characters. And so it should not be converted into another encoding form as if it is characters; it should be processed at a quite different level of interpretation. Of course a system is free to do what it wants internally.

> I'm not sure why anyone's arguing about this though - Philippe's suggestion seems to be the perfect solution which keeps everyone happy. So... allow me to construct a specific example of what Philippe suggested only generally: ... This would appear to solve Lars' problem, and because the three encodings, NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be UTFs, no-one need get upset.

All of this is ingenious, and may be useful for internal processing within a Unix system, and perhaps even for interaction between cooperating systems. But NOT-Unicode is not Unicode (!) and so Unicode should not be expected to standardise it. I can see that there may be a need for a protocol for open exchange of Unix-like filenames. But these filenames should be treated as binary data (which may or may not be interpretable in any one locale) and encoded as such, rather than forced into the mould of Unicode characters which it does not fit.

-- 
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
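One concrete instance of the "retain the octets, display what you can" approach under discussion: map each invalid byte 0xNN to the lone surrogate U+DCNN. Python later standardized exactly this handler as 'surrogateescape' (PEP 383); it is used below purely as an illustration (the filename is hypothetical), and, as the discussion notes, the result is not a UTF:

```python
# A filename containing octets that are not valid UTF-8.
raw = b"report-\xb1\xe6.txt"

# Decoding maps each invalid byte 0xNN to lone surrogate U+DCNN,
# so the "name" can be carried around as a string:
s = raw.decode("utf-8", errors="surrogateescape")
assert s == "report-\udcb1\udce6.txt"

# The original octet string is recovered bit-for-bit, which is the
# roundtripping property Lars wants:
assert s.encode("utf-8", errors="surrogateescape") == raw
```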
Re: Roundtripping in Unicode
Lars Kristan <[EMAIL PROTECTED]> writes: > Hm, here lies the catch. According to UTC, you need to keep > processing the UNIX filenames as BINARY data. And, also according > to UTC, any UTF-8 function is allowed to reject invalid sequences. > Basically, you are not supposed to use strcpy to process filenames. No: strcpy passes raw bytes, it does not interpret them according to the locale. It's not "an UTF-8 function". -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Roundtripping in Unicode
"Arcane Jill" <[EMAIL PROTECTED]> writes:

> OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8

But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an awkward way which would happen to exclude those subsequences of non-characters which would form a valid UTF-8 fragment.

Unicode has the following property. Consider sequences of valid Unicode characters: from the range U+0000..U+10FFFF, excluding non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10, and U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded in any UTF-n, and nothing else is expected from UTF-n.

With the exception of the set of non-characters being irregular and IMHO too large (why exclude U+FDD0..U+FDEF?!), and a weird top limit caused by UTF-16, this gives a precise and unambiguous set of values for which encoders and decoders are supposed to work. Well, except the non-obvious treatment of a BOM (at which level should it be stripped? does this include UTF-8?).

A variant of UTF-8 which includes all byte sequences yields a much less regular set of abstract string values. Especially if we consider that <11101111 10111111 10111110> binary is not valid UTF-8, as much as 0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in order for a BOM to fulfill its role).

Question: should a new programming language which uses Unicode for string representation allow non-characters in strings? Argument for allowing them: otherwise they are completely useless, except U+FFFE for BOM detection. Argument for disallowing them: they make UTF-n inappropriate for serialization of arbitrary strings, and thus non-standard extensions of UTF-n must be used for serialization.

-- 
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
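Marcin's asymmetry can be made concrete with the byte-escaping scheme Python later standardized as 'surrogateescape' (PEP 383; purely illustrative here). Octets survive a trip through strings, but an arbitrary string of escape code points can collide with a valid encoded sequence on the way back:

```python
# octets -> string -> octets always round-trips:
raw = b"ok\xb1"
s = raw.decode("utf-8", "surrogateescape")
assert s.encode("utf-8", "surrogateescape") == raw

# ...but string -> octets -> string does not, for arbitrary strings:
# the escapes for bytes 0xC3 and 0xA9 encode to b'\xc3\xa9', which is
# the *valid* UTF-8 for U+00E9, so decoding cannot recover the original.
t = "\udcc3\udca9"
b = t.encode("utf-8", "surrogateescape")
assert b == b"\xc3\xa9"
assert b.decode("utf-8", "surrogateescape") == "\u00e9"   # not t
```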
Re: Roundtripping in Unicode
On 14/12/2004 17:47, John Cowan wrote:

> Peter Kirk scripsit:
>
>> I think the problem here is that a Unix filename is a string of octets,
>> not of characters. And so it should not be converted into another
>> encoding form as if it is characters; it should be processed at a quite
>> different level of interpretation.
>
> Unfortunately, that is simply a counsel of perfection. Unix filenames
> are in general input as character strings, output as character strings,
> and intended to be perceived as character strings. The corner cases in
> which this does not work are not sufficient to overthrow the power and
> generality to be achieved by assuming it 99% of the time.

This is a design flaw in Unix, or in how it is explained to users. Well, Lars wrote "Basically, you are not supposed to use strcpy to process filenames." I'm not sure if that is his opinion or someone else's, but the only safe way out of this mess is never to process filenames as strings.

> (A private correspondent has come up with an ingenious trick which
> depends on being able to create files named 0x08 and 0x7F, but it truly
> is a trick, and in any case depends only on an ASCII interpretation.)

This may be called a "trick" but it looks like it could very easily be a security hole. For example, a filename 0x41 0x08 0x42 will be displayed the same as just 0x42, in a Latin-1 or UTF-8 locale. Your friend's trick has become an open door for spoofers.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
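Peter's spoofing example is easy to make concrete: the two names differ on disk but render identically on a terminal, because 0x08 is the backspace control character. A minimal sketch:

```python
# Two distinct filenames that a terminal renders the same way.
spoof = b"A\x08B"    # the octets 0x41 0x08 0x42 from Peter's example
plain = b"B"

assert spoof != plain        # different files as far as the kernel cares
assert b"\x08" in spoof      # the backspace survives inside the name

# When echoed to a terminal, the backspace erases the "A", so both
# names appear to the user as a bare "B" - the spoofing hazard above.
```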
Re: Roundtripping in Unicode
Peter Kirk scripsit: > I think the problem here is that a Unix filename is a string of octets, > not of characters. And so it should not be converted into another > encoding form as if it is characters; it should be processed at a quite > different level of interpretation. Unfortunately, that is simply a counsel of perfection. Unix filenames are in general input as character strings, output as character strings, and intended to be perceived as character strings. The corner cases in which this does not work are not sufficient to overthrow the power and generality to be achieved by assuming it 99% of the time. (A private correspondent has come up with an ingenious trick which depends on being able to create files named 0x08 and 0x7F, but it truly is a trick, and in any case depends only on an ASCII interpretation.) -- Income tax, if I may be pardoned for saying so, John Cowan is a tax on income. --Lord Macnaghten (1901) [EMAIL PROTECTED]
UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)
On Tuesday 2004.12.14 12:50:43 -, Arcane Jill wrote:

> If I have understood this correctly, filenames are not "in" a locale,
> they are absolute. Users, on the other hand, are "in" a locale, and
> users view filenames. The same filename can "look" different to two
> different users. To user A (whose locale is Latin-1), a filename might
> look valid; to user B (whose locale is UTF-8), the same filename might
> look invalid.

Correct. The problem will however be limited to the accented Latin characters present in ISO-8859-1 beyond the ASCII set. The basic Latin alphabet in the ASCII set at the beginning of both ISO-8859-1 and UTF-8 will appear unchanged to both users (UTF-8 user looking at Latin-1's home directory, or Latin-1 looking at UTF-8's home directory). So both users could probably guess the filename they were looking at.

For example, here is a file on my local machine, a Linux box with the locale set to LANG=en_US.UTF-8:

    déclaration_des_droits.utf8

The accented "e" in "déclaration" appears correctly under the UTF-8 locale. I then copied this file (using scp) over to an older Sun Solaris box which I do not administer, so I have to live with the "C" POSIX locale that they have got that machine set to. Now, when I view the file names in a terminal (where the terminal emulator is set to the same locale), I see:

    d??claration_des_droits.utf8

The terminal, being set to interpret the legacy locale, does not know how to interpret the two bytes that are used for the UTF-8 "é". Still, I can guess that the first word should be "déclaration".

The solution, as has been pointed out, is for everyone to move to UTF-8 locales. In the Linux and Unix world, this is already happening for the most part. Solaris 10 now defaults to a UTF-8 locale, at least when set to English. Both SuSE and Redhat default to UTF-8 locales for most language and script environments. And (open source) tools exist for converting file names from one encoding to another encoding on Linux and Unix systems.
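Ed's two-question-mark display is explained by the byte count: "é" (U+00E9) occupies two bytes in UTF-8, and a terminal in the "C" locale renders each unknown byte as a separate placeholder. A quick check:

```python
# The UTF-8 filename from the example above, one code point at a time.
name = "déclaration_des_droits.utf8"
encoded = name.encode("utf-8")

# U+00E9 becomes the two-byte sequence 0xC3 0xA9, which the legacy
# terminal shows as two '?' glyphs.
assert "é".encode("utf-8") == b"\xc3\xa9"

# 27 characters, but 28 bytes: exactly one character needs two bytes.
assert len(name) == 27 and len(encoded) == 28
```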
A group of Japanese developers is working on an NLS implementation for the BSDs like OpenBSD which are currently "stuck" with nothing but the "C" POSIX locale. I think the name of that project is "Citrus".

-- Ed Trager

> > Is that right, Lars?
> >
> > If so, Marcin, what exactly is the error, and whose fault is it?
> >
> > Jill
> >
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Marcin 'Qrczak' Kowalczyk
> > Sent: 13 December 2004 14:59
> > To: [EMAIL PROTECTED]
> > Subject: Re: Roundtripping in Unicode
> >
> > Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
RE: Roundtripping in Unicode
Arcane Jill wrote:

> I've been following this thread for a while, and I've pretty

Thanks for bearing with me. And I hope my response will not discourage you from continuing to do so. That is, until I am banned from the list for heresy.

> much got the hang of the issues here. To summarize:
>
> Unix filenames consist of an arbitrary sequence of octets, excluding
> 0x00 and 0x2F. How they are /displayed/ to any given user depends on
> that user's locale setting. In this scenario, two users with different
> locale settings will see different filenames for the same file, but
> they will still be able to access the file via the filename that they
> see. These two filenames will be spelt identically in terms of octets,
> but (apparently) differently when viewed in terms of characters.
>
> At least, that's how it was until the UTF-8 locale came along. If we

I think such problems were already present with Shift-JIS. But I already stated once why this was not noticed and will not repeat myself, unless explicitly asked to do so.

> consider only one-byte-per-character encodings, then any octet sequence
> is "valid" in any locale. But UTF-8 introduces the possibility that an
> octet sequence might be "invalid" - a new concept for Unix. So if you
> change your locale to UTF-8, then suddenly, some files created by other
> users might appear to you to have invalid filenames (though they would
> still appear valid when viewed by the file's creator).
>
> A specific example: if a file F is accessed by two different users, A
> and B, of whom A has set their locale to Latin-1, and B has set their
> locale to UTF-8, then the filename may appear to be valid to user A,
> but invalid to user B.
> Lars is saying (and he's probably right, because he knows more about
> Unix than I) that user B does not necessarily have the right to change
> the actual octet sequence which is the filename of F, just to make it
> appear valid to user B, because doing so would stop a lot of things
> working for user A (for instance, A might have created the file, the
> filename might be hardcoded in a script, etc.). So Lars takes a
> Unix-like approach, saying "retain the actual octet sequence, but feel
> free to try to display and manipulate it as if it were some UTF-8-like
> encoding in which all octet sequences are valid". And all this seems to
> work fine for him, until he tries to roundtrip to UTF-16 and back.
>
> I'm not sure why anyone's arguing about this though - Phillipe's
> suggestion seems to be the perfect solution which keeps everyone happy.
> So...

Well, it doesn't. The rest of my comments will show you why.

> ...allow me to construct a specific example of what Phillipe suggested
> only generally:
>
> DEFINITION - "NOT-Unicode" is the character repertoire consisting of
> the whole of Unicode, and 128 additional characters representing
> integers in the range 0x80 to 0xFF.

As long as we agree that the code points used to store the NOT-Unicode data are valid Unicode code points. You noticed yourself that NOT-Unicode should roundtrip through UTF-16. Only valid Unicode code points can be safely passed through UTF-16.

> OBSERVATION - Unicode is a subset of NOT-Unicode

But unfortunately data can pass from NOT-Unicode to Unicode. Some people think that this is terribly bad. One would think that storing NOT-UTF-8 in NOT-UTF-16 would prevent data from crossing the boundary, but that is not so.
> DEFINITION - "NOT-UTF-8" is a bidirectional encoding between a
> NOT-Unicode character stream and an octet stream, defined as follows:
> if a NOT-Unicode character is a Unicode character then its encoding is
> the UTF-8 encoding of that character; else the NOT-Unicode character
> must represent an integer, in which case its encoding is itself. To
> decode, assume the next NOT-Unicode character is a Unicode character
> and attempt to decode from the octet stream using UTF-8; if this fails
> then the NOT-Unicode character is an integer, in which case read one
> single octet from the stream and return it.

More or less. You have not defined how to return the octet. It must be returned as a valid Unicode code point. And if a Unicode character is decoded, one must check whether it is any of the code points used for this purpose and escape it. But only when decoding NOT-UTF-8. Decoding from UTF-8 remains unchanged.

> OBSERVATION - All possible octet sequences are valid NOT-UTF-8.

Yes, that's the sanity check, because this is what we wanted to get.

> OBSERVATION - NOT-Unicode characters which are Unicode characters will
> be encoded identically in UTF-8 and NOT-UTF-8

Unfortunately not so. Because you started with the wrong assumption that NOT-UTF-8 data will not be stored in valid code points. But the fact that this observation is not true is not really a problem.
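Lars's escaping concern is worth pinning down. If the escape code points are chosen from the surrogate range (U+DC80..U+DCFF, as in Jill's NOT-UTF-16 definition), a strict UTF-8 decoder can never hand the NOT-UTF-8 decoder a "real" one from the byte stream, because UTF-8 deliberately refuses to decode surrogate code points. A sketch (the byte values are illustrative):

```python
# 0xED 0xB3 0xA9 would be the (forbidden) UTF-8-style encoding of the
# surrogate U+DCE9. A strict decoder rejects it, so a code point in
# 0xDC80..0xDCFF produced by a NOT-UTF-8 decoder is unambiguously an
# escaped raw byte, never decoded text.
cesu_style = b"\xed\xb3\xa9"
try:
    cesu_style.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
assert not accepted
```

The collision Lars worries about can therefore only arise on the other side of the boundary: when arbitrary Unicode text that already contains such a code point is fed to the NOT-UTF-8 *encoder*.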
RE: Roundtripping in Unicode
I've been following this thread for a while, and I've pretty much got the hang of the issues here. To summarize:

Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 and 0x2F. How they are /displayed/ to any given user depends on that user's locale setting. In this scenario, two users with different locale settings will see different filenames for the same file, but they will still be able to access the file via the filename that they see. These two filenames will be spelt identically in terms of octets, but (apparently) differently when viewed in terms of characters.

At least, that's how it was until the UTF-8 locale came along. If we consider only one-byte-per-character encodings, then any octet sequence is "valid" in any locale. But UTF-8 introduces the possibility that an octet sequence might be "invalid" - a new concept for Unix. So if you change your locale to UTF-8, then suddenly, some files created by other users might appear to you to have invalid filenames (though they would still appear valid when viewed by the file's creator).

A specific example: if a file F is accessed by two different users, A and B, of whom A has set their locale to Latin-1, and B has set their locale to UTF-8, then the filename may appear to be valid to user A, but invalid to user B.

Lars is saying (and he's probably right, because he knows more about Unix than I) that user B does not necessarily have the right to change the actual octet sequence which is the filename of F, just to make it appear valid to user B, because doing so would stop a lot of things working for user A (for instance, A might have created the file, the filename might be hardcoded in a script, etc.). So Lars takes a Unix-like approach, saying "retain the actual octet sequence, but feel free to try to display and manipulate it as if it were some UTF-8-like encoding in which all octet sequences are valid". And all this seems to work fine for him, until he tries to roundtrip to UTF-16 and back.
I'm not sure why anyone's arguing about this though - Phillipe's suggestion seems to be the perfect solution which keeps everyone happy. So... allow me to construct a specific example of what Phillipe suggested only generally:

DEFINITION - "NOT-Unicode" is the character repertoire consisting of the whole of Unicode, and 128 additional characters representing integers in the range 0x80 to 0xFF.

OBSERVATION - Unicode is a subset of NOT-Unicode

DEFINITION - "NOT-UTF-8" is a bidirectional encoding between a NOT-Unicode character stream and an octet stream, defined as follows: if a NOT-Unicode character is a Unicode character then its encoding is the UTF-8 encoding of that character; else the NOT-Unicode character must represent an integer, in which case its encoding is itself. To decode, assume the next NOT-Unicode character is a Unicode character and attempt to decode from the octet stream using UTF-8; if this fails then the NOT-Unicode character is an integer, in which case read one single octet from the stream and return it.

OBSERVATION - All possible octet sequences are valid NOT-UTF-8.

OBSERVATION - NOT-Unicode characters which are Unicode characters will be encoded identically in UTF-8 and NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot be represented in UTF-8

DEFINITION - "NOT-UTF-16" is a bidirectional encoding between a NOT-Unicode character stream and a 16-bit word stream, defined as follows: if a NOT-Unicode character is a Unicode character then its encoding is the UTF-16 encoding of that character; else the NOT-Unicode character must represent an integer, in which case its encoding is 0xDC00 plus the integer. To decode, if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the NOT-Unicode character is the integer whose value is (word16 - 0xDC00), else the NOT-Unicode character is the Unicode character obtained by decoding as if UTF-16.
OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are Unicode characters will be encoded identically in UTF-16 and NOT-UTF-16

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot be represented in UTF-16

DEFINITION - "NOT-UTF-32" is a bidirectional encoding between a NOT-Unicode character stream and a 32-bit word stream, defined as follows: if a NOT-Unicode character is a Unicode character then its encoding is the UTF-32 encoding of that character; else the NOT-Unicode character must represent an integer, in which case its encoding is 0xDC00 plus the integer. To decode, if the next 32-bit word is in the range 0xDC80 to 0xDCFF then the NOT-Unicode character is the integer whose value is (word32 - 0xDC00), else the NOT-Unicode character is the Unicode character obtained by decoding as if UTF-32.

OBSERVATION - Roundtripping is possible in the directions NOT-UTF-8 ->
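Jill's definitions can be sketched end to end in Python, representing the 128 extra NOT-Unicode characters internally as the code points 0xDC00 + byte, exactly the values her NOT-UTF-16 would carry. This is the same idea Python later standardized as the "surrogateescape" error handler (PEP 383); the function names here are mine:

```python
def not_utf8_decode(octets: bytes) -> list[int]:
    """Decode ANY octet sequence into NOT-Unicode code points.
    Bytes that are not part of a valid UTF-8 sequence become
    0xDC00 + byte, i.e. values in 0xDC80..0xDCFF."""
    points, i = [], 0
    while i < len(octets):
        for length in (1, 2, 3, 4):          # try the shortest match first
            try:
                ch = octets[i:i + length].decode("utf-8")
                points.append(ord(ch))
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:                                # no valid UTF-8 sequence here
            points.append(0xDC00 + octets[i])
            i += 1
    return points

def not_utf8_encode(points: list[int]) -> bytes:
    """Inverse of not_utf8_decode: escape code points become their raw
    byte; everything else is encoded as ordinary UTF-8."""
    out = bytearray()
    for cp in points:
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)          # escaped raw byte
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)

# A "Latin-1 filename" as seen by a UTF-8 user: 0xE9 is not valid UTF-8.
raw = b"caf\xe9.txt"
points = not_utf8_decode(raw)
assert 0xDCE9 in points                      # the stray byte is preserved
assert not_utf8_encode(points) == raw        # octet-level roundtrip
```

Because `points` already holds the NOT-UTF-16 word values for the escaped bytes, this demonstrates the NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8 roundtrip from the first OBSERVATION; the reverse direction fails for exactly the reason Marcin gives elsewhere in the thread.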
RE: Nicest UTF
D. Starner wrote:

> > Some won't convert any and will just start using UTF-8 for new ones.
> > And this should be allowed.
>
> Why should it be allowed? You can't mix items with different unlabeled
> encodings willy-nilly. All you're going to get, all you can expect to
> get is a mess.

Easy for you to say. You're not the one that is going to answer the support calls. They WILL do it. You can jump up and down as much as you like, but they will. If I tell users what you are telling me, they will think I am mad and will stop using my application.

Lars