RE: Nicest UTF
D. Starner wrote:
> > Some won't convert any and will just start using UTF-8
> > for new ones. And this should be allowed.
>
> Why should it be allowed? You can't mix items with
> different unlabeled encodings willy-nilly. All you're going
> to get, all you can expect to get is a mess.

Easy for you to say. You're not the one who is going to answer the support calls. They WILL do it. You can jump up and down as much as you like, but they will. If I tell users what you are telling me, they will think I am mad and will stop using my application.

Lars
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> Some won't convert any and will just start using UTF-8 for new ones. And this should be allowed. Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess. When you say "you can't", it's excessive when speaking about filesystems, which DO NOT label their encoding, and allow multiple users to use and create files on shared filesystems with different locales having each a differnt encoding. So it does happen that the same filesystem stores multiple encodings for its filenames. It also happens that systems allow mounting remote filesystems shared on systems using distinct system encodings (so even if a filesystem is consistent, these filenames appear with various encodings, and this goes to more complex situations when they are crosslinked with soft links or URLs. Think about the web: it's a filesystem in itself, which uses names (URLs) include inconsistent encodings. Although there's a recommandation to use UTF-8 in URLs, this is not mandatory, and there are lots of hosts that use URLs created with some ISO-8859 charsets, or even Windows or Macintosh codepages. To resolve some problems, HTML specifications allow additional (but out-of-band) attributes to resolve the encoding used for resource contents, but this has no impact on URLs themselves. The current solution is to use "URL-encoding" and treat them as binary sequences with a restricted set of byte values, but this time it means transforming what was initially plain-text into some binary moniker. Unfortunately, many web search engines do use the URLs to qualify the pertinence of search keywords, instead of treating them only as blind monikers. Lots has been done to internationalize the domain names for use in IRIs, but URLs remain a mess and a mixture of various charsets, and IRIs are still rarely supported on browsers. The problem with URLs is that they must be allowed to contain any valid plain-text, notably for Form-Data, submitted with a GET method, because this plain-text data becomes part of a query-string, itself part of the URL. HTML does allow specifiying in the HTML form which encoding should be used for this form data, because servers won't always expect a single and consistent encoding; the absence of this specification is often interpreted in browsers as meaning that form-data must be encoded with the same charset as the HTML form itself, but not all browsers observe this rule (in addition many web pages are incorrectly labelled, simply because of incorrect or limited HTTP server configurations, and the standards specify that the charset specified in the HTTP headers have priority to the charset specified in encoded documents themselves; this was a poor decision, which is inconsistent with the usage of the same HTML documents on filesystems that do not store the charset used for the file content)... So don't think that this is simple. It is legitimate to be able to refer to some documents which we know are plain-text, but have unknown or ambiguous encodings (and there are many works related to the automated identification of lguage/charset pairs used in documents; none of these method are 100% exempt of false guesses). For clients trying to use these resources with ambiguous or unknown encodings, but that DO know that this is effectly plain-text (such as a filename), the solution to eliminate (ignore, not show, discard...) 
all filenames or documents that look incorrectly encoded may be the worst solution: it gives no information to the user that these documents are missing, and this does not allow these users to even determine (even if characters are incorrectly displayed) which alternate encoding to try. It's legitimate to think about solution allowing at least partial representation of these texts, so that the user can look at how it is effectively encoded and get hints about how to select the appropriate charset. Also, very lossy conversions (with U+FFFD) are not satisfying enough.
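To make the URL-encoding point concrete, here is a minimal Python sketch (the filename is hypothetical): percent-escaping is defined on raw bytes, so it round-trips the bytes faithfully, but the charset question simply reappears the moment the bytes have to be shown as text.

    # Percent-escaping works on bytes, independently of any charset.
    from urllib.parse import quote, unquote_to_bytes

    raw = b"sm\xe6k.txt"                     # bytes as stored by some filesystem
    url_form = quote(raw)                    # 'sm%E6k.txt'
    recovered = unquote_to_bytes(url_form)   # the same bytes come back

    print(recovered.decode("iso-8859-1"))    # smæk.txt  (one plausible reading)
    print(recovered.decode("iso-8859-2"))    # smćk.txt  (another reading, different text)
    # recovered.decode("utf-8") raises UnicodeDecodeError: these bytes are not
    # valid UTF-8, which is exactly the "unknown or ambiguous encoding" case.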
RE: Nicest UTF
> Some won't convert any and will just start using UTF-8
> for new ones. And this should be allowed.

Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess.
Re: Nicest UTF
Lars Kristan scripsit:
> > I'm using ISO-8859-2.
> In fact you're lucky. Many ISO-8859-1 filenames display correctly in
> ISO-8859-2. Not all users are so lucky.

It was a design point of ISO-8859-{1,2,3,4}, but not any other variants, that every character appears either at the same codepoint or not at all.

--
John Cowan  [EMAIL PROTECTED]           At times of peril or dubitation,
http://www.ccil.org/~cowan              Perform swift circular ambulation,
http://www.reutershealth.com            With loud and high-pitched ululation.
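A quick way to check that design point for the 8859-1/8859-2 pair, as a Python sketch (the expectation, not a guarantee stated here, is that no character shared by the two sets sits at a different byte value):

    m1 = {b: bytes([b]).decode("iso-8859-1") for b in range(256)}
    m2 = {b: bytes([b]).decode("iso-8859-2") for b in range(256)}
    pos1 = {c: b for b, c in m1.items()}
    # Characters present in both repertoires but at different byte values:
    violations = [c for b, c in m2.items() if c in pos1 and pos1[c] != b]
    print(violations)   # expected to be empty if the design point holds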
RE: Nicest UTF
D. Starner wrote:
> "Lars Kristan" writes:
> > > A system administrator (because he has access to all files).
> > My my, you are assuming all files are in the same encoding. And what about
> > all the references to the files in scripts? In configuration files? Soft
> > links? If you want to break things, this is definitely the way to do it.
>
> Was it ever really wise to use non-ASCII file names in scripts and
> configuration files?

It goes beyond that. Please see my reply to Marcin 'Qrczak' Kowalczyk.

> It's not very hard to convert soft links at the same time.

Please see my reply to Marcin 'Qrczak' Kowalczyk.

> Even if you can't do a system-wide change, it's easy enough to change the
> system files, and post a message about switching to UTF-8, and offering to
> assist any users with the change.

That's perfectly fine. But I started talking about this because I claimed that you are likely to end up having UTF-8 filenames alongside legacy-encoded filenames. If you do it gradually, that is precisely what is going to happen, at least for a certain period. But this period could be longer than expected. And as it turns out things are not simple; some users may never convert all the filenames. Some won't convert any and will just start using UTF-8 for new ones. And this should be allowed. Assuming that all filenames should be valid UTF-8 is a bad argument against my claims that applications should be able to process filenames with invalid UTF-8 sequences.

Lars
RE: Nicest UTF
Marcin 'Qrczak' Kowalczyk wrote:
> > My my, you are assuming all files are in the same encoding.
>
> Yes. Otherwise nothing shows filenames correctly to the user.

UNIX is a multi-user system. One user can use one locale and might never see files from another user who uses a different locale. And users can even have filenames in wrong locales in their own home directory. Copied from somewhere. Perhaps only a letter here and there does not display correctly, but this doesn't mean the user can't use the file.

> > And what about all the references to the files in scripts?
> > In configuration files?
>
> Such files rarely use non-ASCII characters. Non-ASCII characters are
> primarily used in names of documents created explicitly by the user.

Rarely. So only rare systems will not boot after the conversion. And only rare programs will no longer work. Is that acceptable? Plus, it might not be as rare as you think. It might be far more common in a country where not many people understand English and are not using Latin letters on top of it. Also, a script (a UNIX batch file) may have an ASCII name, but what if it processes some user documents for some purpose, and has a set of filenames hardcoded in it? What about MRU lists? What about documents that link to other documents? Mass renaming is a dangerous thing. It should be done gradually and with utmost care. And during this period, everything should keep working. If not, users won't even start the process.

> > Soft links?
>
> They can be fixed automatically.

Uh, yes, not a good example. Except in case one decides to allow the user to select an option to use U+FFFD instead of failing the conversion. Then you need to be extra careful, rename any files that convert to a single name, and keep track of everything so you can use the right names for the soft links. But yes, it can be done. If, on the other hand, you adopt the 'broken' conversion concept, you can convert all filenames in a single pass, and you don't need to build lists of soft links since you can convert them directly.

> > If you want to break things, this is definitely the way to do it.
>
> Using non-ASCII filenames is risky to begin with. Existing tools don't
> have a good answer to what should happen with these files when the
> default encoding used by the user changes, or when a user using a
> different encoding tries to access them.

Not really. On UNIX, it is all very well defined. A filename is a sequence of bytes which is only interpreted when it is displayed. You can place a filename in a script or a configuration file and the file will be identified and opened regardless of your locale setting. People like you and me avoid non-ASCII filenames. But not all users do.

> Mozilla doesn't show such filenames in a directory listing. You
> may consider it a bug, but this is a fact. Producing non-UTF-8 HTML
> labeled as UTF-8 would be wrong too. There is no good solution to
> the problem of filenames encoded in different encodings.

There is no good solution. True. And I am trying to find one. And yes, I would consider that a bug. They should probably use some escaping technique. And, funny thing, you would probably accept the escaping technique. But if you think about it, it is again representing invalid data with valid Unicode characters. And if un-escaping needs to be done, it introduces all the problems that you are pointing out for my 'broken' conversion. So, think of my 128 codepoints as an escaping technique. One with no overhead. One with little possibility of confusion. One that can be standardized, so that whoever comes across it will know exactly what it is. Which is definitely not true if we let each application devise its own escaping, with no way for them to interoperate.

> > As soon as you realize you cannot convert filenames to UTF-8, you
> > will see that all you can do is start adding new ones in UTF-8.
> > Or forget about Unicode.
>
> I'm not using a UTF-8 locale yet, because too many programs don't
> support it.

Like Mozilla. I am showing you the way programs can be made to work with UTF-8 faster and more easily. And really by fixing them, not by rewriting them. At least some programs, or some portions of programs. Then developers can concentrate on the things that do require extra attention, like strupr or isspace (or their equivalents).

> I'm using ISO-8859-2.

In fact you're lucky. Many ISO-8859-1 filenames display correctly in ISO-8859-2. Not all users are so lucky.

> But almost all filenames are ASCII.

Basically, you are avoiding the problem altogether. A wise decision. But it also means you don't know as much about this problem as I do.

Lars
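For readers trying to picture the escaping idea, here is a minimal Python sketch of it. The code points used (U+EE80..U+EEFF, in the private use area) are only stand-ins chosen for illustration; the proposal being discussed asks for a dedicated, standardized block, which does not exist.

    ESCAPE_BASE = 0xEE80   # hypothetical: one escape code point per byte 0x80..0xFF

    def decode_with_escapes(data: bytes) -> str:
        out, i = [], 0
        while i < len(data):
            # Decode the longest valid UTF-8 run starting at position i.
            for j in range(len(data), i, -1):
                try:
                    out.append(data[i:j].decode("utf-8"))
                    i = j
                    break
                except UnicodeDecodeError:
                    continue
            else:
                # data[i] starts an invalid sequence: escape that single byte.
                out.append(chr(ESCAPE_BASE + (data[i] - 0x80)))
                i += 1
        return "".join(out)

    def encode_with_escapes(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0x7F:
                out.append(0x80 + (cp - ESCAPE_BASE))   # restore the original byte
            else:
                out.extend(ch.encode("utf-8"))
        return bytes(out)

    name = b"valid \xc3\xa9 then invalid \xe9 byte"
    assert encode_with_escapes(decode_with_escapes(name)) == name   # round-trips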
Re: infinite combinations, was Re: Nicest UTF
On 11/12/2004 16:53, Peter R. Mueller-Roemer wrote:

... For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite.

In Hebrew it is actually possible to have up to 9 combining marks with a single base character: shin + sin/shin dot + dagesh + rafe + 2 vowel points + 2 accents + dot above + masora circle. SBL Hebrew and Ezra SIL both make a valiant attempt to display this lot but don't quite get there. But I think 5 is the maximum number which actually occurs with any one base character in the Hebrew Bible.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
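For readers who want to poke at such a sequence themselves, here is a small Python sketch building a base letter with several of the marks mentioned (the selection below is illustrative, not the full nine-mark example):

    import unicodedata

    seq = (
        "\u05E9"   # HEBREW LETTER SHIN (base)
        "\u05C1"   # HEBREW POINT SHIN DOT
        "\u05BC"   # HEBREW POINT DAGESH OR MAPIQ
        "\u05BF"   # HEBREW POINT RAFE
        "\u05B8"   # HEBREW POINT QAMATS (vowel point)
        "\u0591"   # HEBREW ACCENT ETNAHTA
    )
    print([unicodedata.name(c) for c in seq])
    # Six code points, all but the first with a non-zero combining class,
    # so the whole thing forms a single combining character sequence.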
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > It's hard to create a general model that will work for all scripts > encoded in Unicode. There are too many differences. So Unicode just > appears to standardize a higher level of processing with combining > sequences and normalization forms that are better approaching the > linguistic and semantic of the scripts. Consider this level as an > intermediate tool that will help simplify the identification of > processing units. While rendering and user input may use evolving rules with complex specifications and implementations which depend on the environment and user's configuration (actually there is no other choice: this is inherently complicated for some scripts), string processing in a programming language should have a stable base with well-defined and easy to remember semantics which doesn't depend on too many settable preferences and version variations. The more complex rules a protocol demands (case-insensitive programming language identifiers, compared after normalization, after bidi processing, with soft hyphens removed etc.), the more tools will implement it incorrectly. Usually with subtle errors which don't manifest until someone tries to process an unusual name (e.g. documentation generation tool will produce hyperlinks with dangling links, because a WWW server does not perform sufficient transformations of addresses). -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: >> But demanding that each program which searches strings checks for >> combining classes is I'm afraid too much. > > How is it any different from a case-insenstive search? We started from string equality, which somehow changed into searching. Default string equality is case-sensitive. Searching for an arbitrary substring entered by a user should use user-friendly rules which fold various minor differences like decomposition and case and soft hyphens, but it's a rare task and changing rules generally affects convenience rather than correctness. String equality is used for internal and important operations like lookup in a dictionary (not necessarily of strings ever viewed by the user), comparing XML tags, filenames, mail headers, program identifiers, hyperlink addresses etc. They should be unambiguous, simple and fast. Computing approximate equivalence by folding "minor" differenes must be done explicitly when needed, as mandated by relevant protocols and standards, not forced as the default. >> >> Does "\n" followed by a combining code point start a new line? >> > >> > The Standard says no, that's a defective combining sequence. >> >> Is there *any* program which behaves this way? > > I misstated that; it's a new line followed by a defective combining > sequence. What is the definition of combining sequences? >> It doesn't matter that accented backslashes don't occur practice. >> I do care for unambiguous, consistent and simple rules. > > So do I; and the only unambiguous, consistent and simple rule that > won't give users hell is that "ba" never matches "bä". Any programs > for end-users must follow that rule. Please give a precise definition of string equality. What representation of strings it needs - a sequence of code points or something else? Are all strings valid and comparable? Are there operations which give different results for "equal" strings? If string equality folded the difference between precomposed and decomposed characters, then the API should hide that difference in other places as well, otherwise string equality is not the finest distinction between string values but some arbitrary equivalence relation. >> My current implementation doesn't support filenames which can't be >> encoded in the current default encoding. > > The right thing to do, IMO, would be to support filenames as byte > strings, and let the programmer convert them back and forth between > character strings, knowing that it won't roundtrip. Perhaps. Unfortunately it makes filename processing harder, e.g. you can't store them in *text* files processed through a transparent conversion between its encoding and Unicode. In effect we must go back from manipulating context-insensitive character sequences to manipulating byte sequences with context-dependent interpretation. We can't even sort filenames using Unicode algorithms for collation but must use some algorithms which are capable of processing both strings in the locale's encoding and arbitrary byte sequences at the same time. This is much more complicated than using Unicode algorithms alone. What is worse, in Windows filenames the primary representation of filenames is Unicode, so programs which carefully use APIs based on byte sequences for processing filenames will be less general than Unicode-based APIs when the program is ported to Windows. 
The computing world is slowly migrating from processing byte sequences in ambiguous encodings to processing Unicode strings, often represented by byte sequences in explicitly labeled encodings. There are relics when the new paradigm doesn't fit well, like Unix filenames, but sticking to the old paradigm means that programs will continue to support mixing scripts poorly or not at all. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
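A sketch of the two options in Python, which this thread predates; Python 3 later adopted a scheme much like the escaping idea, smuggling undecodable filename bytes through strings as lone surrogates (the "surrogateescape" error handler):

    import os

    # Byte-oriented: lossless, but gives up Unicode string processing
    # (sorting and collation fall back to byte-wise comparison).
    names_as_bytes = os.listdir(b".")       # bytes in, bytes out

    # Character-oriented with an escape hatch: invalid bytes become lone
    # surrogates U+DC80..U+DCFF and round-trip back to the original bytes.
    raw = b"caf\xe9.txt"                    # not valid UTF-8
    as_text = raw.decode("utf-8", "surrogateescape")
    assert as_text.encode("utf-8", "surrogateescape") == raw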
Re: Nicest UTF
Lars Kristan <[EMAIL PROTECTED]> writes:

> My my, you are assuming all files are in the same encoding.

Yes. Otherwise nothing shows filenames correctly to the user.

> And what about all the references to the files in scripts?
> In configuration files?

Such files rarely use non-ASCII characters. Non-ASCII characters are primarily used in names of documents created explicitly by the user.

> Soft links?

They can be fixed automatically.

> If you want to break things, this is definitely the way to do it.

Using non-ASCII filenames is risky to begin with. Existing tools don't have a good answer to what should happen with these files when the default encoding used by the user changes, or when a user using a different encoding tries to access them. As long as everybody uses the same encoding and files use it too, things work. When the assumption is false, something will break.

>> You mean, various programs will break at various points of time,
>> instead of working correctly from the beginning?
>
> So far nothing broke. Because all the programs are in UTF-8.

This doesn't imply that they won't break. You are talking about filenames which are *not* UTF-8, with the locale set to UTF-8. Mozilla doesn't show such filenames in a directory listing. You may consider it a bug, but this is a fact. Producing non-UTF-8 HTML labeled as UTF-8 would be wrong too. There is no good solution to the problem of filenames encoded in different encodings. Handling such filenames is incompatible with using Unicode to process strings. You have to go back to passing arrays of bytes with ambiguous interpretation of non-ASCII characters, and live with inconveniences like displaying garbage for non-ASCII filenames and broken sorting.

>> Mixing any two incompatible filename encodings on the same file system
>> is a bad idea.
>
> As soon as you realize you cannot convert filenames to UTF-8, you
> will see that all you can do is start adding new ones in UTF-8.
> Or forget about Unicode.

I'm not using a UTF-8 locale yet, because too many programs don't support it. I'm using ISO-8859-2. But almost all filenames are ASCII.

--
 __("<         Marcin Kowalczyk
 \__/       [EMAIL PROTECTED]
  ^^     http://qrnik.knm.org.pl/~qrczak/
RE: Nicest UTF
"Lars Kristan" writes: > > A system administrator (because he has access to all files). > My my, you are assuming all files are in the same encoding. And what about > all the references to the files in scripts? In configuration files? Soft > links? If you want to break things, this is definitely the way to do it. Was it ever really wise to use non-ASCII file names in scripts and configuration files? It's not very hard to convert soft links at the same time. Nor, really should it be too hard to figure out the encodings; /home/foo/.bashrc probably tells you, as well as simple logic. Even if you can't do a system-wide change, it's easy enough to change the system files, and post a message about switching to UTF-8, and offering to assist any users with the change. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" writes: > But demanding that each program which searches strings checks for > combining classes is I'm afraid too much. How is it any different from a case-insenstive search? > >> Does "\n" followed by a combining code point start a new line? > > > > The Standard says no, that's a defective combining sequence. > > Is there *any* program which behaves this way? I misstated that; it's a new line followed by a defective combining sequence. > It doesn't matter that accented backslashes don't occur practice. I do > care for unambiguous, consistent and simple rules. So do I; and the only unambiguous, consistent and simple rule that won't give users hell is that "ba" never matches "bä". Any programs for end-users must follow that rule. > My current implementation doesn't support filenames which can't be > encoded in the current default encoding. The right thing to do, IMO, would be to support filenames as byte strings, and let the programmer convert them back and forth between character strings, knowing that it won't roundtrip. > If the > program assumed that an accented slash is not a directory separator, > I expect possible security holes (the program thinks that a string > doesn't include slashes, but from the OS point of view it does). If the program assumes that an accented slash is not a directory separator, then it's wrong. Any way you go is going to require sensitivity. > > The rules you are offering are only simple and unambiguous to the > > programmer; they appear completely random to the end user. > > And yours are the opposite :-) Programmers get to spend a lot of time dealing with the "random" requirements of users, not the other way around. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: infinite combinations, was Re: Nicest UTF
From: "Peter R. Mueller-Roemer" <[EMAIL PROTECTED]> For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertore is still finite. I do think that you are underestimating the repertoire. Also Unicode does NOT define an upper bound for the length of combining sequences, and also not on the length of default grapheme clusters (which can be composed of multiple combining sequences, for example in the Hangul or Tibetan scripts) Your estimations also ignores various layouts found in Asian texts, and the particular structures of historic texts which can use many "diacritics" on top of a single base letter starting a combining sequence. The model of these scripts (for example Hebrew) imply the justaposition of up to 13 or 15 levels of diacritics for the same base letter! In practice, it's impossible to enumerate all existing combinations (and ensure that they will be assigned a unique code within a reasonnably limited code point), and that's why a simpler model based on more basic but combinable code points is used in Unicode: it frees Unicode from having to encode all of them (this is already a difficult task for the Han script which could have been encoded with combining sequences, if the algorithms needed to create the necesssary layout had not needed the use of so many complex rules and so many exceptions...)
RE: Nicest UTF
Missed this one the other day, but cannot let it go...

Marcin 'Qrczak' Kowalczyk wrote:
> > filenames, what is one supposed to do? Convert all
> > filenames to UTF-8?
>
> Yes.
>
> > Who will do that?
>
> A system administrator (because he has access to all files).

My my, you are assuming all files are in the same encoding. And what about all the references to the files in scripts? In configuration files? Soft links? If you want to break things, this is definitely the way to do it.

> > If you keep all processing in UTF-8, then this is a decision you can
> > postpone.
>
> You mean, various programs will break at various points of time,
> instead of working correctly from the beginning?

So far nothing broke. Because all the programs are in UTF-8. If you would try to write it in UTF-16, it would break. So nobody does it. Except those that must.

> > I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames.
> > Do you want to discourage them?
>
> Mixing any two incompatible filename encodings on the same file system
> is a bad idea.

As soon as you realize you cannot convert filenames to UTF-8, you will see that all you can do is start adding new ones in UTF-8. Or forget about Unicode.

Lars
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Philippe Verdy wrote:
> This is a known caveat even for Unix, when you look at the tricky details of
> the support of Windows file sharing through Samba, when the client requests
> a file with a "short" 8.3 name, that a partition used by Windows is supposed
> to support.

Do you know how Samba is configured to present UTF-8 filenames properly to Windows? What happens to Latin 1 filenames? Are the invalid sequences escaped? How?

Lars
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character sequences. I'm afraid that anything other than a mixture of 1 and 3 is too complicated to be widely used. Almost everybody is representing strings either as code points, or as even lower-level units like UTF-16 units. And while 2 is nice from the user's point of view, it's a nightmare from the programmer's point of view: Consider that the normalized forms are trying to approach the choice number 2, to create more predictable combining character sequences which can still be processed with algorithms just streams of code points. Remember that the total number of possible code points is finite; but not the total number of possible combining sequences, meaning that text handling will necessarily have to make decisions based on a limited set of properties. Note however that for most Unicode strings, the "composite" character properties are those of the base character in the sequence. Note also that for some languages/scripts, the linguistically correct unit of work is the grapheme cluster; Unicode just defines "default grapheme clusters", which can span several combining sequences (see for example the Hangul script, written with clusters made of multiple combining sequences, where the base character is a Unicode jamo, itself made somtimes of multiple simpler jamos that Unicode do not allow to decompose as canonically equivalent strings, despite this decomposition is inherent of the script itself in its structure, and not bound to the language which Unicode will not standardize). It's hard to create a general model that will work for all scripts encoded in Unicode. There are too many differences. So Unicode just appears to standardize a higher level of processing with combining sequences and normalization forms that are better approaching the linguistic and semantic of the scripts. Consider this level as an intermediate tool that will help simplify the identification of processing units. The reality is that a written language is actually more complex than what can be approached in a single definition of processing units. For many other similar reasons, the ideal working model will be with "simple" and enumerable abstract characters with a finite number of code points, and with which actual and non-enumerable characters can be composed. But the situation is not ideal for some scripts, notably ideographic ones due to their very complex and often "inconsistent" composition rules or layout and that require allocating many code points, one for each combination. Working with ideographic scripts requires much more character properties than with other scripts (see for example the huge and various properties defined in UniHan, which are still not standardized due to the difficulty to represent them and the slow discovery of errors, omissions, or contradictions found in various sources for this data...)
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.

Technically, I am not asking anything. I am just trying to discuss an approach which I think can be used to solve certain problems. And this approach does not need to be conformant at this point. If someone finds it suitable to make it conformant, even better, but at this point this is irrelevant to the discussion. Unless it is proven that it cannot be made conformant (by changing or amending the standard) because I have missed an important fact. But so far, I have not seen such a proof.

But suppose I am asking, therefore proposing - it would be several separate items:

1 - To assign codepoints for 128 (or 256) new surrogates(*), used for:
  1.1 - Representing unassigned values when converting from an encoding to Unicode (optional).
  1.2 - Representing invalid sequences when interpreting UTF-8 (optional).
The use of these would not be mandatory. Existing handling is still an option and can be preserved wherever it suits the needs, or changed where the new behavior is beneficial. Representation of these codepoints in UTF-8 would be as per the current standard.

2 - An alternative conversion from Unicode to, say, UTF-8E (UTF-8E is _NOT_ Unicode(*)). This conversion would reconstruct the original byte sequence from a Unicode string obtained by 1.2. This conversion pair is intended for use on platform or interface boundaries if/where it is determined that they are suitable. For example, interfacing a UNIX filesystem and a UTF-8 pipe would require UTF-8E<=>UTF-8 conversion. Interfacing a UNIX filesystem and a Windows filesystem would require UTF-8E<=>UTF-16 conversion.

(*) If proposal #2 were not accepted, then the codepoints in proposal #1 would actually not be surrogates, but simply codepoints and nothing else. Even if proposal #2 is accepted, it is still not clear whether those should really be called surrogates, since they would convert among all UTFs just as any other codepoint, and only their representation in UTF-8E would differ. Note that UTF-8E is not Unicode, but would be standardized in Unicode. If the U in UTF is a problem, then any other name can be chosen. Consider it a working name and be aware of what it is and is not.

3 - If the UTC cannot agree that the BMP should be used for proposal #1, I would advise against a decision to assign non-BMP codepoints for the purpose. I believe less damage would be done by postponing the decision than by making a wrong decision. It is not just about how much disk space or bandwidth is used. For example, if both filesystems have a 256-character limit for a filename, limitations are consistent (at least in one direction) if the BMP is used, and not if any other plane is used.

4 - If neither of the proposals is accepted, it would be beneficial if the UTC would manage to preserve at least one suitable block (for example U+A4xx or U+ABxx) of 256 codepoints intact, to facilitate a future decision.

Lars Kristan
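To see the difference between items 1 and 2 in code, here is a Python sketch. It reuses the hypothetical escape block U+EE80..U+EEFF from the earlier sketch as a stand-in for the proposed codepoints, and "UTF-8E" here is only the proposal's working name, not an existing encoding.

    ESCAPE_BASE = 0xEE80                             # hypothetical stand-in block

    escape_cp = chr(ESCAPE_BASE + (0xE9 - 0x80))     # stands for the invalid byte 0xE9

    # Item 1: in plain UTF-8 the escape codepoint is encoded like any other
    # codepoint (three bytes for this one).
    print(escape_cp.encode("utf-8"))                 # b'\xee\xbb\xa9'

    # Item 2: the hypothetical UTF-8E conversion would regenerate the byte.
    def utf8e_encode_char(ch: str) -> bytes:
        cp = ord(ch)
        if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0x7F:
            return bytes([cp - ESCAPE_BASE + 0x80])
        return ch.encode("utf-8")

    print(utf8e_encode_char(escape_cp))              # b'\xe9', the original byte back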
infinite combinations, was Re: Nicest UTF
Philippe Verdy wrote:
> The repertoire of all possible combining characters sequences is already
> infinite in Unicode, as well as the number of "default grapheme clusters"
> they can represent.

For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite.

I am enthused about some nicely distinguishable sequences: e.g. u + macron + diaeresis shows nicely as a long long vowel u-Umlaut, whereas u + diaeresis + macron displays as a long vowel u with trema above, to be spoken as a separate vowel. BRAVO! I do not see a good reason why this does not work for all other base characters, particularly all vowels (e, i, o combine in an undesirable fashion, a only in one newest version of a Unicode font). I can add an acute accent to each sequence, but the accent is smudged into the previous complex character in an ugly default overtype mode.

Another GOOD solution: the single combining Hebrew dagesh point 'finds' the right 'inner' place in all the Hebrew consonants and some Latin base characters, so why should overtype ugliness be allowed in many other cases? There seems to be no difficulty in implementing composition of a complex character from the inside out.

Can't we join forces to request a default graphical representation, so that legible, distinguishable complex symbols must be generated by future Unicode fonts? The technical details are not too complex, and the expressiveness and ease of use of Unicode would be greatly enhanced. The Greek acute and grave accents should by themselves combine centered over any base character; and when combined with a spiritus asper or lenis, the two should be minimally separated horizontally and displayed centered over the base character. Hebrew vowel points and accents also need to be fitted under any single base character. Samaritan complex characters should be composable from short combining sequences.

Peter R. Mueller-Roemer
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
> Lars responded:
> > > ... Whatever the solutions
> > > for representation of corrupt data bytes or uninterpreted data
> > > bytes on conversion to Unicode may be, that is irrelevant to the
> > > concerns on whether an application is using UTF-8 or UTF-16
> > > or UTF-32.
>
> > The important fact is that if you have an 8-bit based program, and you
> > provide a locale to support UTF-8, you can keep things working (unless you
>                                                  ^^^
> You can keep *some* things *sorta* working.

I didn't say that this is all that needs to be done. But the way you say it makes one think that this is not even the right track.

> > prescribe validation). But you cannot achieve the same if you try to base
> > your program on 16 or 32 bit strings.
>
> Of course you can. You just have to rewrite the program to handle
> 16-bit or 32-bit strings correctly. You can't pump them through
> 8-bit pipes or char* API's, but it's just silly to try that, because
> they are different animals to begin with.

Correctly? Strings? There are no strings and no encodings in a UNIX filesystem. Please clarify.

> By the way, I participated as an engineer in a multi-year project
> that shifted an advanced, distributed data analysis system
> from an 8-bit character set to 16-bit Unicode. *All* user-visible string
> processing was converted over -- and that included proprietary
> file servers, comm servers, database gateways, networking code,
> a proprietary 32-bit workstation GUI implementation, and a suite
> of object-oriented application tools, including a spreadsheet,
> plotting tool, query and database reporting tools, and much more.
> It worked cross-platform, too.
>
> It was completed, running, and *delivered* to customers in 1994,
> a decade ago.

OK, was this a fresh development, or was this an upgrade of an existing system? Did the existing system contain user data that needed to be converted? Was this data all in ASCII? Was this data all in a single code page? Latin 1 perhaps? How much of that data was in UTF-8?

> You can't bamboozle me with any of this "it can't be done with
> 16-bit strings" BS.

BS? Bamboozle? One learns all sorts of new words here on this mailing list. Frankly, I find it interesting to read many historical and cultural facts in off-topic discussions, but I have a feeling I am not the only one and that many people prefer to engage in those. And that often the original questions remain unanswered. And interesting ideas unexplored. I know it is hard to follow someone else's ideas, spread over many mails, already sidetracked by those who think they understand what is being discussed and by those who can't distinguish between following a standard and changing or extending it. In the end, statements torn out of context do in fact look as if they're nonsense. Much of your response (in this particular mail, not in general) is just that. One misinterpretation after another. And detailed explanations of things that are not even being discussed. Non-conformances being pointed out, where the consequences of proposed changes should in fact be discussed. I am disappointed by this attitude, even more so because it comes from one of the most respected people on this mailing list. Examples:

> Yes you can.
> No, you need not -- that is non-conformant, besides.
> http://www.unicode.org/Public/UNIDATA/
> Utterly non-conformant.
> Also utterly nonconformant.

I suppose surrogates were also non-conformant at the time they were proposed. Can I interpret your responses as meaning that surrogates should never have been accepted into the Unicode standard?

> I just don't understand these assertions at all.

I have given plenty of examples.

> First of all it isn't "UNIX data" or "Windows data" -- it is
> end user's data, which happens to be processed in software
> systems which in turn are running on a UNIX or Windows OS.

This is resorting to a philosophical answer, picking on words.

> I work for a company that *routinely* runs applications that
> cross the platform barriers in all sorts of ways. It works
> because character sets are handled conformantly, and conversions
> are done carefully at platform boundaries -- not because some
> hack has been added to UTF-8 to preserve data corruptions.

Sybase, yes. A very controlled environment. The fact that validity of data *can* be guaranteed in your particular environment gives you not more, but less right to make judgements about other environments and claim the problems can be solved 'by doing things correctly'.

> > If the purpose of
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan wrote:
> However, although they are *technically* octet sequences, they
> are *functionally* character strings. That's the issue.

Nicely put! But the UTC does not seem to care.

> > The point I'm making is that *whatever* you do, you are still
> > asking for implementers to obey some convention on conversion
> > failures for corrupt, uninterpretable character data.
> > My assessment is that you'd have no better success at making
> > this work universally well with some set of 128 magic bullet
> > corruption pills on Plane 14 than you have with the
> > existing Quoted-Unprintable as a convention.
>
> It doesn't have to work universally; indeed, it becomes a QOI issue.
> Allocating representations of bytes with "bits that are high" makes
> it possible to do something recoverable, at very little expense to the
> Unicode Consortium.

Except that the expense should be slightly higher. The importance of these replacement codepoints is still underestimated. They belong in the BMP. And at least there is no way anyone can blame the UTC for a cultural bias in this case; these codepoints are universal.

> > Further, as it turns out that Lars is actually asking for
> > "standardizing" corrupt UTF-8, a notion that isn't going to
> > fly even two feet, I think the whole idea is going to be
> > a complete non-starter.
>
> I agree that that part won't fly, absolutely.

Then I'll have to restructure it.

Lars
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Arcane Jill responded:
> >> Windows filesystems do know what encoding they use.
> >
> > Err, not really. MS-DOS *need to know* the encoding to use, a bit like a
> > *nix application that displays filenames need to know the encoding to use
> > the correct set of glyphs (but constraints are much more heavy.)
>
> Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's not
> like MS-DOS is still terrifically popular these days.

I don't know what Antoine meant by MS-DOS, but since he mentioned it in the Windows context, I thought it was about Windows console applications (the console is still often referred to as the DOS box, I think).

> The fact that applications can still open files using the legacy fopen()
> call (which requires char*, hence 8-bit-wide, strings) is kind of
> irrelevant. If the user creates a file using fopen() via a code page
> translation, AND GETS IT WRONG, then the file will be created with Unicode
> characters other than those she intended - but those characters will
> still be Unicode and unambiguous, no?

Funny thing. Nobody cares much if a Latin 2 string is misinterpreted and a Latin 1 conversion is used instead. As long as they can create the file. But if a Latin 2 string is misinterpreted and a UTF-8 conversion is used? You won't just get a filename with characters other than those you expected. Either the file won't open at all (depending on where and how the validation is done), or you risk that two files you create one after another will overwrite each other. Note that I am talking about files you create from within this scenario, not files that existed on the disk before.

Second thing: OK, you say fopen is a legacy call. True, you can use _wfopen. So, you can have a console application in Unicode and all problems are solved? No. Standard input and standard output are 8-bit, and a code page is used. And it has to remain so, if you want the old and the new applications to be able to communicate. So, the logical conclusion is that UTF-8 needs to be used instead of a code page. Unfortunately, Windows has problems with that. Try MODE CON: CP SELECT=65001. Much of it works, but batch files don't run.

Now suppose Windows does work correctly with the code page set to UTF-8. You create an application that reads stdin, counts the words longer than 10 codepoints, and passes the input unmodified to stdout. What happens:
* set CP to Latin 1, process Latin 1: correct result
* set CP to Latin 1, process UTF-8: wrong result
* set CP to UTF-8, process UTF-8: correct result
* set CP to UTF-8, process Latin 1: wrong result, corrupted output

Now, I wonder why Windows is not supporting UTF-8 as much as one would want.

Lars
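A sketch of that filter in Python (illustrative only; it does not model Windows console code pages). Counting requires assuming an encoding, but doing the pass-through on raw bytes keeps at least the output from being corrupted when the assumption is wrong:

    import sys

    ASSUMED_ENCODING = "utf-8"        # what we assume the console code page is

    data = sys.stdin.buffer.read()    # raw bytes in
    long_words = 0
    try:
        text = data.decode(ASSUMED_ENCODING)
        long_words = sum(1 for w in text.split() if len(w) > 10)
    except UnicodeDecodeError:
        pass                          # input was in some other code page; no count
    sys.stdout.buffer.write(data)     # pass the bytes through unmodified
    print("words longer than 10 codepoints:", long_words, file=sys.stderr)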
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: >> > This implies that every programmer needs an indepth knowledge of >> > Unicode to handle simple strings. >> >> There is no way to avoid that. > > Then there's no way that we're ever going to get reliable Unicode > support. This is probably true. I wonder whether things could have been done significantly better, or it's an inherent complexity of text. Just curious, it doesn't help with the reality. >> If the runtime automatically performed NFC on input, then a part of a >> program which is supposed to pass a string unmodified would sometimes >> modify it. Similarly with NFD. > > No. By the same logic you used above, I can expect the programmer to > understand their tools, and if they need to pass strings unmodified, > they shouldn't load them using methods that normalize the string. That's my point: if he normalizes, he does this explicitly. If a standard (a programming language, XML, whatever) specifies that identifiers should be normalized before comparison, a program should do this. If it specifies that Cf characters are to be ignored, then a program should comply. A standard doesn't have to specify such things however, so a programming language shouldn't do too much automatically. It's easier to apply a transformation than to undo a transformation applied automatically. > Sometimes things get ambiguous if one day ŝ is matched by s and one > day ŝ isn't? That's absolutely wrong behavior; the program must serve > the user, not the programmer. If I use grep to search for a combining acute, I bet it will currently match cases where it's a separate combining character but will not match precomposed characters. Do you say that this should be changed? Hey, Linux grep matches only a single byte by ".", even in UTF-8 locale. Now, I can agree that this should be changed. But demanding that each program which searches strings checks for combining classes is I'm afraid too much. >> Does "\n" followed by a combining code point start a new line? > > The Standard says no, that's a defective combining sequence. Is there *any* program which behaves this way? How useful is a rule in a standard which nobody obeys to? >> Does a double quote followed by a combining code point start a >> string literal? > > That would depend on your language. I'd prefer no, but it's obvious > many have made other choices. Since my language is young and almost doesn't have users, I can even change decisions made earlier: I'm not constrained by compatibility yet. But if lexical structure of the program worked in terms of combining character sequences, it would have to be somehow supported by generic string processing functions, and it would have to consistely work for all lexical features. For example */ followed by a combining accent would not end a comment, accented backslash would not need escaping in a string literal, and something unambiguous would have to be done with an accented newline. Such rules would be harder to support with most text processing tools. I know no language in which searching for a backslash in a string would not find an accented backslash. It doesn't matter that accented backslashes don't occur practice. I do care for unambiguous, consistent and simple rules. >> Does a slash followed by a combining code point separate >> subdirectory names? > > In Unix, yes; that's because filenames in Unix are byte streams with > the byte 0x2F acting as a path seperator. 
My current implementation doesn't support filenames which can't be encoded in the current default encoding. The encoding can be changed from within a program (perhaps locally during execution of some code). So one can process any Unix filename by temporarily setting the encoding to Latin1. It's unfortunate that the default setting is more restrictive than the OS, but I have found no sensible alternative other than encouraging processing strings in their transportation encoding. Anyway, if a string *is* accepted as a file name, the program's idea about directory separators is the same as the OS (as long as we assume Unix; I don't yet provide any OS-generic pathname handling). If the program assumed that an accented slash is not a directory separator, I expect possible security holes (the program thinks that a string doesn't include slashes, but from the OS point of view it does). > The rules you are offering are only simple and unambiguous to the > programmer; they appear completely random to the end user. And yours are the opposite :-) -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
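The grep point is easy to reproduce at the code point level; a Python sketch (regular expressions here stand in for grep):

    import re, unicodedata

    precomposed = "caf\u00e9"      # é as one code point
    decomposed = "cafe\u0301"      # e + COMBINING ACUTE ACCENT

    accent = "\u0301"              # searching for a lone combining acute
    print(bool(re.search(accent, precomposed)))   # False: no separate accent there
    print(bool(re.search(accent, decomposed)))    # True
    # Folding the difference has to be asked for explicitly:
    print(unicodedata.normalize("NFD", precomposed) ==
          unicodedata.normalize("NFD", decomposed))   # True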
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: [...] > This was later amended in an errata for XML 1.0 which now says that > the list of code points whose use is *discouraged* (but explicitly > *not* forbidden) for the "Char" production is now: [...] Ugh, it's a mess... IMHO Unicode is partially to blame, by introducing various kinds of holes in code point numbering (non-characters, surrogages), by not being clear when the unit of processing should be a code point and when a combining character sequence, and earlier by pushing UTF-16 as the fundamental representation of the text (which led to such horrible descriptions as http://www.xml.com/axml/notes/Surrogates.html). XML is just an example of a standard which must decide: A. What is the unit of text processing? (code point? combining character sequence? something else? hopefully it would not be UTF-16 unit) B. Which (sequences of) characters are valid when present in the raw source, i.e. what UTF-n really means? C. Which (sequences of) characters can be formed by specifying a character number? A programming language must do the same. The language Kogut I'm designing and developing uses Unicode as string representation, but the details can still be changed. I want to have rules which are "correct" as far as Unicode is concerned, and which are simple enough to be practical (e.g. if a standard forced me to make the conversion from code point number to actual character contextual, or if it forced me to unconditionally unify precomposed and decomposed characters, then I quit and won't support a broken standard). Internal text processing in a programming language can be more permissive than an application of such processing like XML parsing: if a particular character is valid in UTF-8 but XML disallows it, everything is fine, it can be rejected at some stage. It must not be more restrictive however, as it would make impossible to implement XML parsing in terms of string processing. Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character sequences. I'm afraid that anything other than a mixture of 1 and 3 is too complicated to be widely used. Almost everybody is representing strings either as code points, or as even lower-level units like UTF-16 units. And while 2 is nice from the user's point of view, it's a nightmare from the programmer's point of view: - Unicode character properties (like general category, character name, digit value) are defined in terms of code points. Choosing 2 would immediately require two-stage processing: a string is a sequence of sequences of code points. - Unicode algorithms (like collation, case mapping, normalization) are specified in terms of code points. - Data exchange formats (UTF-n) are always closer to code points than to combining character sequences. - Code points have a finite domain, so you can make dictionaries indexed by code points; for combining character sequences we would be forced to make functions which *compute* the relevant property basing on the structure of such a sequence. I don't believe 2 is workable at all. The question is how to make 3 convenient enough to be used more often. Unfortunately it's much harder than 1, unless strings used some completely different iteration protocols than other sequences. I don't have an idea how to make 3 convenient. 
Regarding B in the context of a programming language (not XML), chapter 3.9 of the Unicode standard version 4.0 excludes only surrogates: it does not exclude non-characters like U+FFFF. But non-characters must be excluded somewhere, because otherwise U+FFFE at the beginning would be mistaken for a BOM. I'm confused.

Regarding C, I'm confused too. Should a function which returns the character of the given number accept surrogates? I guess no. Should it accept non-characters? I don't know. I only know that it should not accept values above 0x10FFFF.

--
 __("<         Marcin Kowalczyk
 \__/       [EMAIL PROTECTED]
  ^^     http://qrnik.knm.org.pl/~qrczak/
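One way to picture option 3 is a thin grouping layer over a code point string; a rough Python sketch (this only groups by non-zero combining class, which is a simplification of both combining character sequences and UAX #29 grapheme clusters):

    import unicodedata

    def combining_sequences(text):
        group = ""
        for ch in text:
            if group and unicodedata.combining(ch):
                group += ch            # attach a combining mark to the current group
            else:
                if group:
                    yield group
                group = ch             # start a new group at a base character
        if group:
            yield group

    print(list(combining_sequences("cafe\u0301s")))   # ['c', 'a', 'f', 'e\u0301', 's']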
Re: Nicest UTF
Philippe Verdy scripsit:

> And I disagree with you about the fact that U+0000 can't be used in XML
> documents. It can be used in a URI through the URI escaping mechanism, as
> explicitly indicated in the XML specification...

You have a hold of the right stick but at the wrong end. U+0000 can be encoded in a URI as %00, but that does not mean that the IRIs in system ids and namespace names (and potentially other places) can contain explicit U+0000 characters or &#0; escapes either. Both of those are illegal, and documents that contain them are not well-formed. In character content and attribute values, U+0000 is not possible.

> And the fact that the various character productions, which are normally
> normative, have been changed so often, sometimes through errata that
> were forgotten in the text of the next edition of the standard,

Do you have evidence for this claim?

> The only thing about which I can agree is that XML will forbid surrogates
> and U+FFFE and U+FFFF, but I won't say that an XML parser that does not
> reject NULs or other non-characters or "disallowed" C0 controls is really
> so buggy.

You are of course entitled to your uninformed opinion.

> But all this is also proof that XML documents are definitely NOT
> plain-text documents, so you can't use Unicode encoding rules at the
> encoded XML document level, only at the finest plain-text nodes (these
> are the levels that the productions in the XML standard are trying, with
> more or less success, to standardize).

You can't blindly do *normalization* of XML documents as if they were plain text. *Encoding* XML documents according to Unicode is of course possible and desirable.

> As a consequence any process that blindly applies a plain-text
> normalization to a complete XML document is bogus, because it breaks the
> most basic XML conformance, i.e. the core document structure...

In one extraordinarily unlikely case, yes: the appearance of a combining overlay slash following the ">" that closes a tag will damage the document if it is NFC-normalized.

--
You are a child of the universe no less         John Cowan
than the trees and all other acyclic            http://www.reutershealth.com
graphs; you have a right to be here.            http://www.ccil.org/~cowan
  --DeXiderata by Sean McGrath                  [EMAIL PROTECTED]
Re: Nicest UTF
Philippe Verdy scripsit:

> > Okay, I'm confused. Does ≮ open a tag? Does it matter if it's
> > composed or decomposed?
>
> It does not open an XML tag.
> It does matter if it's composed (won't open a tag) or decomposed (will
> open a tag, but with a combining character, invalid as an identifier
> start)

Let's be precise here. If the 7-character character sequence "&#8814;" appears in an XML document, it never opens a tag and it is never changed by normalization. If the 1-character sequence consisting of a single U+226E appears in an XML document, and that document is put through NF(K)D, it will become not well-formed. However, NF(K)D is not recommended for XML documents, which should be in NFC.

--
First known example of political correctness:   John Cowan
"After Nurhachi had united all the other        http://www.reutershealth.com
Jurchen tribes under the leadership of the      http://www.ccil.org/~cowan
Manchus, his successor Abahai (1592-1643)       [EMAIL PROTECTED]
issued an order that the name Jurchen should    --S. Robert Ramsey,
be banned, and from then on, they were all      The Languages of China
to be called Manchus."
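The decomposed case is easy to check; a Python sketch:

    import unicodedata

    not_less_than = "\u226E"                      # U+226E, '≮'
    decomposed = unicodedata.normalize("NFD", not_less_than)
    print(decomposed == "<\u0338")                # True: '<' + COMBINING LONG SOLIDUS OVERLAY
    # NFD therefore injects a raw '<' into character data, which is why
    # NF(K)D can break well-formedness while NFC recomposes the character:
    print(unicodedata.normalize("NFC", "<\u0338") == not_less_than)   # True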
Re: Nicest UTF
Philippe Verdy scripsit:

> If you look at the XML 1.0 Second Edition

The Second Edition has been superseded by the Third.

> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That is normative.

> But the comment following it specifies:

That comment is not normative and not meant to be precise.

> the restrictive
> definition of "Char" above also includes the whole range of C1 controls

By oversight.

> (#x80..#x9F), so I can't understand why the Char definition is so
> restrictive on controls; in addition the definition of Char also
> *includes* many non-characters (it only excludes surrogates, and U+FFFE
> and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and
> U+2FFFF, ..., U+10FFFE and U+10FFFF).

By oversight again.

> Note however that nearly all XML parsers don't seem to honor this
> constraint (like SGML parsers...)!

Please specify the parsers that do and don't honor this. Any which don't honor it are buggy, and any documents which exploit those bugs are not XML.

> What is even worse is that XML 1.1 now reallows NUL for system
> identifiers and URIs, through escaping mechanisms.

Not true. U+0000 is absolutely excluded in both XML 1.0 and XML 1.1.

--
"I could dance with you till the cows           John Cowan
come home.  On second thought, I'd              http://www.ccil.org/~cowan
rather dance with the cows when you             http://www.reutershealth.com
came home."  --Rufus T. Firefly                 [EMAIL PROTECTED]
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> Okay, I'm confused. Does ≮ open a tag? Does it matter if it's composed or decomposed? It does not open an XML tag. It does matter if it's composed (won't open a tag) or decomposed (will open a tag, but with a combining character, invalid as an identifier start). Conclusion 1: blind normalizations of XML documents, as if they were plain-text documents, can break the XML well-formedness of these documents. This is caused by the fact that plain-text documents can be parsed by units of grapheme clusters or combining sequences. But XML parsing stops at the one-codepoint character level, and ignores canonical equivalences. Conclusion 2: XML documents are not plain-text documents.
Re: Nicest UTF
From: "John Cowan" <[EMAIL PROTECTED]> Marcin 'Qrczak' Kowalczyk scripsit: http://www.w3.org/TR/2000/REC-xml-20001006#charsets implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of characters. XML processors are required to process UTF-8 and UTF-16, and may process other character encodings or not. But the internal model is that of characters. Thus surrogate code points are not allowed. I have a different reading, because the "character" in XML is not the same as the "character" in Unicode. For XML, U+10FFFF is a valid character (even if its use is explicitly not recommended, it is perfectly valid), for Unicode it's a non-character... For XML, U+0001 is *sometimes* a valid character, sometimes not. And I disagree with you about the fact that U+0000 can't be used in XML documents. It can be used in URI through URI escaping mechanism, as explicitly indicated in the XML specification... And the fact that the various character productions, that are normally normative, have been changed so often, sometimes through errata that were forgotten in the text of the next edition of the standard, then reintroduced in an errata, shows that these productions are less reliable than the descriptive *definitions* which ARE normative in XML... The only thing about which I can agree is that XML will forbid surrogates and U+FFFE and U+FFFF, but I won't say that an XML parser that does not reject NULs or other non-characters or "disallowed" C0 controls is so much buggy. I do think that these restrictions are a defect of XML... But all this is also proof that XML documents are definitely NOT plain-text documents, so you can't use Unicode encoding rules at the encoded XML document level, only at the finest plain-text nodes (these are the levels that the productions in the XML standard are trying, with more or less success, to standardize). As a consequence any process that blindly applies a plain-text normalization to a complete XML document is bogus, because it breaks the most basic XML conformance, i.e. the core document structure...
Re: Nicest UTF
John Cowan writes: > You are reading the XML Recommendation incorrectly. It is not defined > in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of > characters. XML processors are required to process UTF-8 and UTF-16, > and may process other character encodings or not. But the internal > model is that of characters. Thus surrogate code points are not > allowed. Okay, I'm confused. Does ≮ open a tag? Does it matter if it's composed or decomposed? -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit: > http://www.w3.org/TR/2000/REC-xml-20001006#charsets > implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of characters. XML processors are required to process UTF-8 and UTF-16, and may process other character encodings or not. But the internal model is that of characters. Thus surrogate code points are not allowed. -- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, The day and hour soon are coming / When all the IT folks say "Gosh!" It isn't from a clever lawsuit / That Windowsland will finally fall, But thousands writing open source code / Like mice who nibble through a wall. --The Linux-nationale by Greg Baker
Re: Nicest UTF
From: "Philippe Verdy" <[EMAIL PROTECTED]> From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character sequences? Something else? See the XML character model document... XML ignores combining sequences. But for Unicode and for XML a character is an abstract character with a single code allocated in a *finite* repertoire. The repertoire of all possible combining character sequences is already infinite in Unicode, as well as the number of "default grapheme clusters" they can represent. Note there are some differently relaxed definitions of what constitutes a "character" for XML. If you look at the XML 1.0 Second Edition, it specifies that the document is a "text" (defined only as a sequence of "characters", which may represent markup or character data) that will only contain characters in this set: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] But the comment following it specifies: "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF." which is considerably weaker (because it would include ALL basic controls in the range #x0 to #x1F, and not only TAB, LF, CR); the restrictive definition of "Char" above also includes the whole range of C1 controls (#x80..#x9F), so I can't understand why the Char definition is so restrictive on controls; in addition the definition of Char also *includes* many non-characters (it only excludes surrogates, and U+FFFE and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and U+2FFFF, ..., U+10FFFE and U+10FFFF). So XML does allow Unicode/ISO10646 non-characters... But not all. Apparently many XML parsers seem to ignore the restriction of Char above, notably in CDATA sections. The alternative is then to use numeric character references, as defined by this even weaker production (in 4.1. Character and Entity References): CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';' but with this definition: "A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices." Which is exactly the purpose of encoding something like "&#1;" to encode a SOH character U+0001 (which after all is a valid Unicode/ISO/IEC 10646 "character"), or even a NUL character. The "CharRef" production however is annotated by a Well-Formedness Constraint, "Legal Character": "Characters referred to using character references must match the production for Char." Note however that nearly all XML parsers don't seem to honor this constraint (like SGML parsers...)! This was later amended in an errata for XML 1.0 which now says that the list of code points whose use is *discouraged* (but explicitly *not* forbidden) for the "Char" production is now: [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
This clause is not really normative, but just adds to the confusion... Then comes XML 1.1, which extends the restrictive "Char" production: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] with the same comment "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF." So in XML 1.0, the comment was accurate, not the formal production... In XML 1.1, all C0 and C1 controls (except NUL) are now allowed, but the use of some of them is restricted in some cases: RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F] What is even worse is that XML 1.1 now reallows NUL for system identifiers and URIs, through escaping mechanisms. Clearly, the XML specification is inconsistent there, and this would explain why most XML parsers are more permissive than what is given in the "Char" production of the XML specification, and that they simply refer to the definition of valid codepoints for Unicode and ISO/IEC 10646, excluding only surrogate code points (a valid code point can be a non-character, and can also be a NUL...): the XML parser will accept those code points, but will leave the validity control to the application using the parsed XML data, or will offer some tuning options to enable this "Char" filter (that depends on XML version...). See also the various errata for XML 1.1, related to "RestrictedChar"... Or to the list of characters whose use is discouraged (meaning explicitly not forbidden, so allowed...): [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FF
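For anyone who wants to check the ranges being argued about, here is a small sketch (Python; the helper names are mine) of the XML 1.0 "Char" production quoted above, next to the Unicode noncharacter rule:

    def is_xml10_char(cp: int) -> bool:
        # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    def is_noncharacter(cp: int) -> bool:
        # Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    print(is_xml10_char(0x85), is_noncharacter(0x85))        # True False -- a C1 control passes Char
    print(is_xml10_char(0xFFFF), is_noncharacter(0xFFFF))    # False True -- excluded by the production
    print(is_xml10_char(0x1FFFF), is_noncharacter(0x1FFFF))  # True True  -- one of the "forgotten" noncharacters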
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" writes: > "D. Starner" writes: > > > This implies that every programmer needs an indepth knowledge of > > Unicode to handle simple strings. > > There is no way to avoid that. Then there's no way that we're ever going to get reliable Unicode support. > If the runtime automatically performed NFC on input, then a part of a > program which is supposed to pass a string unmodified would sometimes > modify it. Similarly with NFD. No. By the same logic you used above, I can expect the programmer to understand their tools, and if they need to pass strings unmodified, they shouldn't load them using methods that normalize the string. > You can't expect each and every program which compares strings to > perform normalization (e.g. Linux kernel with filenames). As has been pointed out here, Posix filenames are not character strings; they are byte strings. They quite likely aren't even valid UTF-8 strings. > > So S should _sometimes_ match an accented S? Again, I feel extended misery > > of explaining to people why things aren't working right coming on. > > Well, otherwise things get ambiguous, similarly to these XML issues. Sometimes things get ambiguous if one day ŝ is matched by s and one day ŝ isn't? That's absolutely wrong behavior; the program must serve the user, not the programmer. 's' cannot, should, must not match 'ŝ'; and if it must, then it absolutely always must match 'ŝ' and someway to make a regex that matches s but not ŝ must be designed. It doesn't matter what problems exist in the world of programming; that is the entirely reasonable expectation of the end user. > Does "\n" followed by a combining code point start a new line? The Standard says no, that's a defective combining sequence. > Does > a double quote followed by a combining code point start a string > literal? That would depend on your language. I'd prefer no, but it's obvious many have made other choices. > Does a slash followed by a combining code point separate > subdirectory names? In Unix, yes; that's because filenames in Unix are byte streams with the byte 0x2F acting as a path seperator. > It's hard enough to convince them that a > character is not the same as a byte. That contradicts you above statement, that every programmer needs an indepth knowledge of Unicode. > In case I want to circumvent security or deliberately cause a piece of > software to misbehave. Robustness require unambiguous and simple rules. The rules you are offering are only simple and unambiguous to the programmer; they appear completely random to the end user. To have ≮ sometimes start a tag means that a user can't look at the XML and tell whether something opens a tag or is just text. You might be able to expect all programmers, but you can't expect all end users to. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining character sequences? Something else? > > Neither. Unicode characters. http://www.w3.org/TR/2000/REC-xml-20001006#charsets implies that the appropriate level for parsing XML is code points. In particular XML allows a combining character directly after ">". -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
- Original Message - From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 8:35 PM Subject: Re: Nicest UTF "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character sequences? Something else? See the XML character model document... XML ignores combining sequences. But for Unicode and for XML a character is an abstract character with a single code allocated in a *finite* repertoire. The repertoire of all possible combining character sequences is already infinite in Unicode, as well as the number of "default grapheme clusters" they can represent.
Re: Nicest UTF
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining character sequences? Something else? > > Neither. Unicode characters. What does "Unicode characters" mean? -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit: > > The XML/HTML core syntax is defined with fixed behavior of some > > individual characters like '&', '<', quotation marks, and with special > > behavior for spaces. > > The point is: what "characters" mean in this sentence. Code points? > Combining character sequences? Something else? Neither. Unicode characters. -- "May the hair on your toes never fall out!" John Cowan --Thorin Oakenshield (to Bilbo) [EMAIL PROTECTED]
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The XML/HTML core syntax is defined with fixed behavior of some > individual characters like '&', '<', quotation marks, and with special > behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character sequences? Something else? -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > This implies that every programmer needs an indepth knowledge of > Unicode to handle simple strings. There is no way to avoid that. If the runtime automatically performed NFC on input, then a part of a program which is supposed to pass a string unmodified would sometimes modify it. Similarly with NFD. You can't expect each and every program which compares strings to perform normalization (e.g. Linux kernel with filenames). Perhaps if there was a single normalization format which everybody agreed to, and unnormalized strings were never used for data interchange (if UTF-8 was specified such that to disallow unnormalized data, etc.), things would be different. But Unicode treats both composed and decomposed representations as valid. >> IMHO splitting into graphemes is the job of a rendering engine, not of >> a function which extracts a part of a string which matches a regex. > > So S should _sometimes_ match an accented S? Again, I feel extended misery > of explaining to people why things aren't working right coming on. Well, otherwise things get ambiguous, similarly to these XML issues. Does "\n" followed by a combining code point start a new line? Does a double quote followed by a combining code point start a string literal? Does a slash followed by a combining code point separate subdirectory names? An iterator which delivers whole combining character sequences out of a sequence of code points can be used. You can also manipulate strings as arrays of combining character sequences. But if you insist that this is the primary string representation, you become incompatible with most programs which have different ideas about delimited strings. You can't expect each and every program to check combining classes of processed characters. It's hard enough to convince them that a character is not the same as a byte. >> I expect breakage of XML-based protocols if implementations are >> actually changed to conform to these rules (I bet they don't now). > > Really? In what cases are you storing isolated combining code points > in XML as text? In case I want to circumvent security or deliberately cause a piece of software to misbehave. Robustness require unambiguous and simple rules. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Philippe Verdy wrote: >> Please start adding spaces to your entity references or >> something, because those of us reading this through a web interface >> are getting very confused. > > No confusion possible if using any classic mail reader. > > Blame your ISP (and other ISPs as well like AOL that don't respect the > interoperable standards for plain-text emails) for its poor webmail > interface, that does not properly escape the characters... No harm done in following David's suggestion, though, to help accommodate the mail readers that do this. It's just an e-mail, after all. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
From: "Antoine Leca" <[EMAIL PROTECTED]> Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constrainst are much more heavy.) Also Windows NT Unicode applications know it, because it can't be changed :-). But when it comes to other Windows applications (still the more common) that happen to operate in 'Ansi' mode, they are subject to the hazard of codepage translations. Even if Windows 'knows' the encoding used for the filesystem (as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases it does not even know it, much like with *nix kernels), the only usable set is the _intersection_ of the set used to write and the set used to read; that is, usually, it is restricted to US ASCII, very much like the usable set in *nix cases... True, but this applies to FAT-only filesystems, which happen to store filenames with a "OEM" charset which is not stored explicitly on the volume. This is a known caveat even for Unix, when you look at the tricky details of the support of Windows file sharing through Samba, when the client requests a file with a "short" 8.3 name, that a partition used by Windows is supposed to support. In fact, this nightmare comes from the support in Windows of the compatibility with legacy DOS applications which don't know the details and don't use the Win32 APIs with Unicode support. Note that DOS applications use a "OEM" charset which is part of the user settings, not part of the system settings (see the effects of the command CHCP in a DOS command prompt). FAT32 and NTFS help reconciliate these incompatible charsets because these filesystems also store a "LFN" (Long File Name) for the same files (in that case the short name, encoded in some ambiguous OEM charset, is just an alias, acting exactly like a hard link on Unix created in the same directory that references the same file). "LFN" names are UTF-16 encoded and support mostly the same names as in NTFS volumes. However, on FAT32 volumes, the short names are mandatory, unlike on NTFS volumes where they can be created "on the fly" by the filesystem driver, according to the current user settings for the selected OEM charset, without storing them explicitly on the volume. Windows contains, in CHKDSK, a way to verify that short names of FAT32 filesystems are properly encoded with a coherent OEM charset, using the UTF-16 encoded LFN names as a reference. If needed, corrections for the OEM charset can be applied... This nightmare of incompatible OEM charsets do happen on Windows 98/98SE/ME, when the "autoexec.bat" file that defines the current user profile is not executing as it should the proper "CHCP" command, or when this autoexec.bat file has been modified or erased: in that case, the default OEM charset (codepage 437) is used, and short filenames are incorrectly encoded. Another complexity is that Win32 applications, that use a fixed (not user-settable) "ANSI" charset, and that don't use the Unicode API depend on the conversion from the ANSI charset to the current OEM charset. But if a file is handled through some directory shares via multiple hosts, that have distinct ANSI charsets (i.e. Windows hosts running different localization of Windows, such as a US installation and a French version in the same LAN), the charsets viewed by these hosts will create incompatible encodings on the same shared volume. 
So the only "stable" subset for short names, that is not affected by OS localization or user settings is the intersection of all possible ANSI and OEM charsets that can be set in all versions of Windows! No need to say, this designates only the printable ASCII charset for short 8.3 names. Long filenames are not affected by this problem. Conclusion: to use international characters out of ASCII in filenames used by Windows, make sure that the the name is not in a 8.3 short format, so that a long filename, in UTF-16, will be created on FAT32 filesystems or on SMBFS shares (Samba on Unix/Linux, Windows servers)... Or use NTFS (but then resolve the interoperability problems with Linux/Unix client hosts that can't access reliably, for now, to these filesystems, and that are not completely emulated by Unix filesystems used by Samba, due to the limitation on the LanMan sharing protocol, and limitations of Unix filesystems as well that rarely use UTF-8 as their prefered encoding...)
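A small illustration of the short-name problem in Python (the stored bytes are invented for the example): the same bytes come back as different names under an OEM code page and an "ANSI" code page, which is why only the ASCII intersection is stable.

    raw = b"CAF\x82"             # hypothetical 8.3 name bytes as stored on a FAT volume

    print(raw.decode("cp437"))   # CAFé -- 0x82 is e-acute in OEM code page 437
    print(raw.decode("cp1252"))  # CAF‚ -- 0x82 is a low quotation mark in Windows code page 1252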
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: If it's a broken character reference, then what about Á (769 is the code for combining acute if I'm not mistaken)? Please start adding spaces to your entity references or something, because those of us reading this through a web interface are getting very confused. No confusion possible if using any classic mail reader. Blame your ISP (and other ISPs as well like AOL that don't respect the interoperable standards for plain-text emails) for its poor webmail interface, that does not properly escape the characters used in plain-text emails you receive (and that do NOT contain any HTML entities), but that get inserted blindly within the HTML page they create in their webmail interface. Not only is such a webmail interface bogus, but it is also dangerous, as it allows arbitrary HTML code to run from plain-text emails. Ask for support and press your ISP to correct its server-side scripts so that it will correctly support plain-text emails!
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about < with combining acute, which doesn't have a precomposed form? A broken opening tag or a valid text character? Also a broken opening tag for HTML/XML documents (which are NOT plain text documents, and must be first parsed as HTML/XML, before parsing the many text sections contained in text elements, element names, attribute names, attribute values (etc...) as plain-text under the restrictions specified in the HTML or XML specifications, which contain restrictions, for example, on which characters are allowed in names). The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. This core structure is not plain-text, and cannot be overridden, even by Unicode grapheme clusters. Note that HTML/XML do NOT mandate the use or even the support of Unicode, just the support of a character repertoire that contains some required characters, and the acceptance of at least the ISO/IEC 10646 repertoire under some conditions, however the encoding to code points itself is not required for anything other than numeric character references, which are more symbolic in a way similar to other named character entities in SGML, than absolute as implying the required support of the repertoire with a single code! So you can as well create fully conforming HTML or XML documents using a character set which includes characters not even defined in Unicode/ISO/IEC 10646, or characters defined only symbolically with just a name. Whether this name maps or not to one or more Unicode characters does not change the validity of the document itself. And all the XML/HTML behavior ignores almost all Unicode properties (including normalization properties, because XML and HTML treat different strings, which are still canonically equivalent, as completely distinct; an important feature for cases like XML Signatures, where normalization of documents should not be applied blindly as it would break the data signature). If you want to normalize XML documents, you should not do it with a normalizer working on the whole document as if it was plain-text. Instead you must normalize the individual strings that are in the XML InfoSet, as accessible when browsing the nodes of its DOM tree, and then you can serialize the normalized tree to create a new document (using CDATA sections and/or character references, if needed to escape some syntactic characters reserved by XML that would be present in the string data of DOM tree nodes). Note also that an XML document containing references to Unicode non-characters would still be well-formed, because these characters may be part of a non-Unicode charset. XML document validation is a separate and optional problem from XML parsing which checks well-formedness and builds a DOM tree: validation is only performed when matching the DOM tree according to a schema definition, DTD or XSD, in which additional restrictions on allowed characters may be checked, or in which additional symbolic-only "characters" may be defined and used in the XML document with parsable named entities similar to: "&gt;".
(An example: the schema may contain a definition for a "character" representing a private company logo, mapped to a symbolic name; the XML document can contain such references, but the DTD may also define an encoding for it in a private charset, so that the XML document will directly use that code; the Apple logo in Macintosh charsets is an example, for which an internal mapping to Unicode PUAs is not sufficient to allow correct processing of multiple XML documents, where PUAs used in each XML documents have no equivalence; the conversion of such documents to Unicode with these PUAs is a lossy conversion, not suitable for XML data processing).
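A minimal sketch of that approach with Python's bundled DOM: normalize the character data held in the parsed tree, never the serialized document. This is deliberately incomplete -- attribute values, element and attribute names, and re-escaping on output are left out.

    import unicodedata
    from xml.dom import minidom

    def normalize_text_nodes(node):
        # Apply NFC only to character data; the markup itself is never touched.
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            node.data = unicodedata.normalize("NFC", node.data)
        for child in node.childNodes:
            normalize_text_nodes(child)

    doc = minidom.parseString("<a>e\u0301</a>")
    normalize_text_nodes(doc.documentElement)
    print(doc.documentElement.toxml())    # <a>é</a> -- only the text node was normalized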
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Antoine Leca Sent: 09 December 2004 11:29 To: Unicode Mailing List Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF) Windows filesystems do know what encoding they use. Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constrainst are much more heavy.) Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's not like MS-DOS is still terrifically popular these days. But when it comes to other Windows applications (still the more common) that happen to operate in 'Ansi' mode, they are subject to the hazard of codepage translations. Sure, but this has got nothing to do with the filesystem. The Windows filesystem(s) store filenames in those disk sectors which are reserved for file headers, and in these locations they are stored using sixteen-bit wide code units. (I assume this can only be UTF-16?). Thus, "Windows file systems do know what encodings they use" seems to me to be a correct statement. The fact that applications can still open files using the legacy fopen() call (which requires char*, hence 8-bit-wide, strings) is kind of irrelevant. If the user creates a file using fopen() via a code page translation, AND GETS IT WRONG, then the file will be created with Unicode characters other than those she intended - but those characters will still be Unicode and unambiguous, no? that is, usually, it is restricted to US ASCII, very much like the usable set in *nix cases... [OFF TOPIC] Why do so many people call it "US ASCII" anyway? Since "ASCII" comprises that subset of Unicode from U+0000 to U+007F, it is not clear to me in what way "US-ASCII" is different from ASCII. It's bad enough for us non-Americans that the A in ASCII already stands for "American", but to stick "US" on the front as well is just... Anyway, back to the discussion on US-Unicode...
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
On Monday, December 6th, 2004 20:52Z John Cowan va escriure: > Doug Ewell scripsit: > >>> Now suppose you have a UNIX filesystem, containing filenames in a >>> legacy encoding (possibly even more than one). If one wants to >>> switch to UTF-8 filenames, what is one supposed to do? Convert all >>> filenames to UTF-8? >> >> Well, yes. Doesn't the file system dictate what encoding it uses for >> file names? How would it interpret file names with "unknown" >> characters from a legacy encoding? How would they be handled in a >> directory search? > > Windows filesystems do know what encoding they use. Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constrainst are much more heavy.) Also Windows NT Unicode applications know it, because it can't be changed :-). But when it comes to other Windows applications (still the more common) that happen to operate in 'Ansi' mode, they are subject to the hazard of codepage translations. Even if Windows 'knows' the encoding used for the filesystem (as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases it does not even know it, much like with *nix kernels), the only usable set is the _intersection_ of the set used to write and the set used to read; that is, usually, it is restricted to US ASCII, very much like the usable set in *nix cases... Antoine
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Lars responded: > > ... Whatever the solutions > > for representation of corrupt data bytes or uninterpreted data > > bytes on conversion to Unicode may be, that is irrelevant to the > > concerns on whether an application is using UTF-8 or UTF-16 > > or UTF-32. > The important fact is that if you have an 8-bit based program, and you > provide a locale to support UTF-8, you can keep things working (unless you ^^^ You can keep *some* things *sorta* working. If you don't make the effort to actually upgrade software to use the standard *conformantly*, then it is no real surprise when data corruptions creep in, characters get mislaid, and some things don't work the way they should. > prescribe validation). But you cannot achieve the same if you try to base > your program on 16 or 32 bit strings. Of course you can. You just have to rewrite the program to handle 16-bit or 32-bit strings correctly. You can't pump them through 8-bit pipes or char* API's, but it's just silly to try that, because they are different animals to begin with. By the way, I participated as an engineer in a multi-year project that shifted an advanced, distributed data analysis system from an 8-bit character set to 16-bit Unicode. *All* user-visible string processing was converted over -- and that included proprietary file servers, comm servers, database gateways, networking code, a proprietary 32-bit workstation GUI implementation, and a suite of object-oriented application tools, including a spreadsheet, plotting tool, query and database reporting tools, and much more. It worked cross-platform, too. It was completed, running, and *delivered* to customers in 1994, a decade ago. You can't bamboozle me with any of this "it can't be done with 16-bit strings" BS. > Or, again, you really cannot with 16 > bit (UTF-16), Yes you can. > and you sort of can with 32 bit (UTF-32), but must resort to > values above 21 bits. No, you need not -- that is non-conformant, besides. > Again, nothing standardized there, nothing defined for > how functions like isspace should react and so on. That is wrong, too. The standard information that people seek is in the Unicode Character Database: http://www.unicode.org/Public/UNIDATA/ And there are standard(*) libraries such as ICU that public API's for programs to use to get the kind of behavior they need. (*) Just because a library isn't an International Standard does not mean that it is not a de facto standard that people can and do rely upon for such program behavior. You can't expect to just rely upon the C or C++ standards and POSIX to solve all your application problems, but there are perfectly good solutions working out there, in UTF-8, in UTF-16, and in UTF-32. (Or in combinations of those.) > And it's about the fact that it is far more likely that this > happens to UTF-8 data (or that some legacy data is mistakenly labelled or > assumed to be UTF-8). > UTF-16 data is far cleaner than 8-bit data. Basically because you had to > know the encoding in order to store the data in UTF-16. Actually, I think this should be characterized as software engineers writing software for UTF-16 are likely to do a better job of handling characters, because they have to, whereas a lot of stuff using UTF-8 just slides by, because people think they can ignore character set issues long enough, so that when the problem occurs, it can no longer be traced to mistakes they made or that they are still held responsible for. ;-) > UTF-8 is what solved the problems on UNIX. It allowed UNIX to process > Windows data. 
Alongside its own. > It is Windows that has problems now. And I think roundtripping is the > solution that will allow Windows to process UNIX data. Without dropping data > or raising exceptions. Alongside its own. I just don't understand these assertions at all. First of all it isn't "UNIX data" or "Windows data" -- it is end user's data, which happens to be processed in software systems which in turn are running on a UNIX or Windows OS. I work for a company that *routinely* runs applications that cross the platform barriers in all sorts of ways. It works because character sets are handled conformantly, and conversions are done carefully at platform boundaries -- not because some hack has been added to UTF-8 to preserve data corruptions. > > There's more to it, of course, but this is, I believe, as the > > bottom of the reason why, for 12 years now, people have been > > fundamentally misunderstanding each other about UTF-8. > Is it 12? Thought it was far less. Yes. The precursor of UTF-8 was dreamed up around 1992. > Off topic, when was UTF-8 added to > Unicode standard? In Unicode 1.1, Appendix F, then known as "FSS-UTF", in 1993. > Quite close. Except for the fact that: > * U+EE93 is represented in UTF-32 as 0xEE93 > * U+EE93 is represented in UTF-16 as 0xEE93 > * U+EE93 is represented in UTF
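For comparison only (this is not what Lars proposes, and it postdates this thread): Python eventually standardized a similar round-trip idea as the "surrogateescape" error handler, which smuggles uninterpretable bytes through str values and restores them on encoding.

    raw = b"caf\xe9.txt"                                   # Latin-1 bytes mislabeled as UTF-8
    name = raw.decode("utf-8", errors="surrogateescape")
    print(ascii(name))                                     # 'caf\udce9.txt'
    print(name.encode("utf-8", errors="surrogateescape") == raw)   # True: the original bytes come back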
Re: Nicest UTF
Marcin asked: > The general trouble is that numeric character references can only > encode individual code points By design. > rather than graphemes (is this a correct > term for a non-combining code point with a sequence of combining code > points?). No. The correct term is "combining character sequence". TUS 4.0, p. 70, D17. The correct NCR representation of a combining character sequence is a sequence of NCR's. -- Not too surprisingly. --Ken > So if XML is supposed to be treated as a sequence of > graphemes, weird effects arise in the above boundary cases...
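In other words (a tiny Python sketch; the helper name is mine), the NCR form of a combining character sequence is simply one NCR per code point:

    def to_ncrs(s: str) -> str:
        return "".join(f"&#{ord(c)};" for c in s)

    print(to_ncrs("a\u0301"))   # &#97;&#769;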
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: > If it's a broken character reference, then what about Á (769 is > the code for combining acute if I'm not mistaken)? Please start adding spaces to your entity references or something, because those of us reading this through a web interface are getting very confused. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" writes: > String equality in a programming language should not treat composed > and decomposed forms as equal. Not this level of abstraction. This implies that every programmer needs an indepth knowledge of Unicode to handle simple strings. The concept makes me want to replace Unicode; spending the rest of my life explaining to programmers, and people who use their programs, why a search for "Römishe Elegien" isn't bringing the book is not my idea of happiness. > IMHO splitting into graphemes is the job of a rendering engine, not of > a function which extracts a part of a string which matches a regex. So S should _sometimes_ match an accented S? Again, I feel extended misery of explaining to people why things aren't working right coming on. > They are supposed to be equivalent when they are actual characters. > What if they are numeric character references? Should "≮" > (7 characters) represent a valid plain-text character or be a broken > opening tag? Which 7 characters? My email "client" turned them into the actual characters. But I think it's fairly obvious that XML added entities in part so you could include '<'s and other characters without them getting interpreted as part of the text of the document. Similarly, a combining character entity following an actual < should be the start of a tag. >Note that if it's a valid plain-text character, it's impossible >to represent isolated combining code points in XML, No more then it's impossible to represent '<' in the text. > I expect breakage of XML-based protocols if implementations are > actually changed to conform to these rules (I bet they don't now). Really? In what cases are you storing isolated combining code points in XML as text? I can think of hypothetical cases, but most real-world use isn't going to be affected. If I were designing such an XML protocol, I'd probably store it as a decimal number anyway; XML is designed to be human-readable, and an isolated combining character that randomly combines with other characters that it's not logically associated with when displayed isn't particularly human readable. > Implementing an API which works in terms of graphemes over an API > which works in terms of code points is more sane than the converse, > which suggests that the core API should use code points if both APIs > are sometimes needed at all. Implementing an API which works in terms of lists over an API which works in terms of pointers is more sane than the converse, which suggests that the core API should use pointers if both APIs are sometimes needed at all. > While I'm not obsessed with efficiency, it would be nice if changing > the API would not slow down string processing too much. Who knows how much it would slow down string processing? If I get around to writing the test code, I'll try and see how much it slows stuff down, but right now we don't know. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF
John Cowan <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > Well, that assumes that there's a special "string equality" predicate, > as distinct from just having various predicates that DWIM. No, I meant the default generic equality predicate when applied to two strings. > It's a broken opening tag. Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about < with combining acute, which doesn't have a precomposed form? A broken opening tag or a valid text character? What about &#65;ACUTE where ACUTE stands for combining acute? Is this A with acute, or a broken character reference which ends with an accented semicolon? If it's a broken character reference, then what about &#65;&#769; (769 is the code for combining acute if I'm not mistaken)? If *this* is A with acute, then it's inconsistent: here combining accents are processed after resolving numeric character references, and previously it was in the opposite order. OTOH if this is something else, then it's impossible to represent letters without precomposed forms with numeric character references. The general trouble is that numeric character references can only encode individual code points rather than graphemes (is this a correct term for a non-combining code point with a sequence of combining code points?). So if XML is supposed to be treated as a sequence of graphemes, weird effects arise in the above boundary cases... -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit: > String equality in a programming language should not treat composed > and decomposed forms as equal. Not this level of abstraction. Well, that assumes that there's a special "string equality" predicate, as distinct from just having various predicates that DWIM. In a Unicode Lisp implementation, e.g., equal might be char-by-char equality and equalp might not. > They are supposed to be equivalent when they are actual characters. > What if they are numeric character references? Should "≮" > (7 characters) represent a valid plain-text character or be a broken > opening tag? It's a broken opening tag. > Note that if it's a valid plain-text character, it's impossible > to represent isolated combining code points in XML, It's problematic to represent the *specific* combining code point when it appears immediately after a tag. -- Don't be so humble. You're not that great. John Cowan --Golda Meir[EMAIL PROTECTED]
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: > The semantics there are surprising, but that's true no matter what you > do. An NFC string + an NFC string may not be NFC; the resulting text > doesn't have N+M graphemes. Which implies that automatically NFC-ing strings as they are processed would be a bad idea. They can be NFC-ed at the end of processing if the consumer of this data will demand this. Especially if other consumers would want NFD. String equality in a programming language should not treat composed and decomposed forms as equal. Not this level of abstraction. IMHO splitting into graphemes is the job of a rendering engine, not of a function which extracts a part of a string which matches a regex. > If you do so with an language that includes <, you violate the Unicode > standard, because ≮ (not <) and ≮ are canonically equivalent. I think that Unicode tries to push implications of "equivalence" too far. They are supposed to be equivalent when they are actual characters. What if they are numeric character references? Should "≮" (7 characters) represent a valid plain-text character or be a broken opening tag? Note that if it's a valid plain-text character, it's impossible to represent isolated combining code points in XML, and thus it's impossible to use XML for transportation of data which allows isolated combining code points (except by introducing custom escaping of course, e.g. transmitting decimal numbers instead of characters). I expect breakage of XML-based protocols if implementations are actually changed to conform to these rules (I bet they don't now). OTOH if it's not a valid plain-text character, then conversion between numeric character references and actual characters is getting more hairy. > I'll see if I have time after finals to pound out a basic API that > implements this, in Ada or Lisp or something. My language is quite similar to Lisp semantically. Implementing an API which works in terms of graphemes over an API which works in terms of code points is more sane than the converse, which suggests that the core API should use code points if both APIs are sometimes needed at all. While I'm not obsessed with efficiency, it would be nice if changing the API would not slow down string processing too much. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler scripsit: > A Sybase ASE database has the same behavior running on Windows as > running on Sun Solaris or Linux, for that matter. Fair enough. > UNIX filenames are just one instance of this. However, although they are *technically* octet sequences, they are *functionally* character strings. That's the issue. > Failing that, then BINARY fields *are* the appropriate > way to deal with arbitrary arrays of bytes that cannot > be interpreted as characters. This is purism. All the filenames on my Unix system, for example, can be interpreted as character strings; the potential to create filenames that can't be is unutilized, and sensibly so. For that matter, the potential to create files containing C0 controls is also unutilized. > > in the same way that it would > > be overkill to encode all 8-bit strings in XML using Base-64 > > just because some of them may contain control characters that are > > illegal in well-formed XML. > > Dunno about the XML issue here -- you're the expert on what > the expected level of illegality in usage is there. XML's policy is zero tolerance, both for illegal encodings and for illegal characters such as U+0001. So in order to be *100% sure* that a character string (ASCII, Latin-1, or UTF-*, it matters not) can be put into an XML document, one must treat it as binary and encode it as such, using QP or Base64 or what have you. But nobody does. XML 1.1 allows the representation of every Unicode character except U+0000, which materially reduces the problem, but there is little support for XML 1.1 as yet. In any case, this case is only an analogy, not an exact equivalent: the problem of representing illegal *characters* in an XML document is closely analogous to the problem of representing illegal *bytes* in a character string. > The point I'm making is that *whatever* you do, you are still > asking for implementers to obey some convention on conversion > failures for corrupt, uninterpretable character data. > My assessment is that you'd have no better success at making > this work universally well with some set of 128 magic bullet > corruption pills on Plane 14 than you have with the > existing Quoted-Unprintable as a convention. It doesn't have to work universally; indeed, it becomes a QOI issue. Allocating representations of bytes with "bits that are high" makes it possible to do something recoverable, at very little expense to the Unicode Consortium. > Further, as it turns out that Lars is actually asking for > "standardizing" corrupt UTF-8, a notion that isn't going to > fly even two feet, I think the whole idea is going to be > a complete non-starter. I agree that that part won't fly, absolutely. -- In politics, obedience and support John Cowan <[EMAIL PROTECTED]> are the same thing. --Hannah Arendt http://www.ccil.org/~cowan
Re: Nicest UTF
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: > "D. Starner" <[EMAIL PROTECTED]> writes: > > > You could hide combining characters, which would be extremely useful if we > > were just using Latin > > and Cyrillic scripts. > > It would need a separate API for examining the contents of a combining > character. You can't avoid the sequence of code points completely. Not a seperate API; a function that takes a character and returns an array of integers. > It would yield to surprising semantics: for example if you concatenate > a string with N+1 possible positions of an iterator with a string with > M+1 positions, you don't necessarily get a string with N+M+1 positions > because there can be combining characters at the border. The semantics there are surprising, but that's true no matter what you do. An NFC string + an NFC string may not be NFC; the resulting text doesn't have N+M graphemes. Unless you're explicitly adding a combining character, a combining character should never start a string. This could be fixed several ways, including by inserting a dummy character to hold the combining character, and "normalizing" the string by removing the dummy characters. That would, for the most part, only hurt pathological cases. > It would impose complexity in cases where it's not needed. Most of the > time you don't care which code points are combining and which are not, > for example when you compose a text file from many pieces (constants > and parts filled by users) or when parsing (if a string is specified > as ending with a double quote, then programs will in general treat a > double quote followed by a combining character as an end marker). If you do so with an language that includes <, you violate the Unicode standard, because ≮ (not <) and ≮ are canonically equivalent. You've either got to decompose first or look at the individual characters as a whole instead of looking at code points. Has anyone considered this while defining a language? How about the official standards bodies? Searching for XML in the archives is a bit unhelpful, and UTR #20 doesn't mention the issue. Your solution is just fine if you're considering the issue on the bit level, but it strikes me as the wrong answer, and I would think that it would surprising to a user that didn't understand Unicode, especially in the ≮ case. A warning either way would be nice. I'll see if I have time after finals to pound out a basic API that implements this, in Ada or Lisp or something. It's not going to be the most efficient thing, but I doubt it's going to be a big difference for most programs, and if you want C, you know where to find it. -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan responded: > > Storage of UNIX filenames on Windows databases, for example, ^^ O.k., I just quoted this back from the original email, but it really is a complete misconception of the issue for databases. "Windows databases" is a misnomer to start with. There are some databases, like Access, that are Windows-only applications, but most serious SQL databases in production (DB2, Oracle, Sybase ASE and ASA, and so on) are crossplatform from the get go, and have their *own* rules for what can and cannot legitimately be stored in data fields, independent of what platform you are running them on. A Sybase ASE database has the same behavior running on Windows as running on Sun Solaris or Linux, for that matter. > > can be done with BINARY fields, which correctly capture the > > identity of them as what they are: an unconvertible array of > > byte values, not a convertible string in some particular > > code page. > > This solution, however, is overkill, Actually, I don't think it is. One of the serious classes of fundamental errors that database administrators and database programmers run into when creating global applications is ignoring or misconstruing character set issues. In a database, if I define the database (or table or field) as containing UTF-8 data, it damn well better have UTF-8 data in it, or I'm just asking for index corruptions, data corruptions or worse -- and calls from unhappy customers. When database programmers "lie" to the database about character sets, by setting a character set to Latin-1, say, and then pumping in data which is actually UTF-8, for instance, expecting it to come back out unchanged with no problems, they are skating on very thin ice ... which usually tends to break right in the middle of some critical application during a holiday while your customer service desk is also down. ;-) Such "lying to the database" is generally the tactic of first resort for "fixing" global applications when they start having to deal with mixed Japanese/European/UTF-8 data on networks, but it is clearly a hack for not understanding and dealing with the character set architecture and interoperability problems of putting such applications together. UNIX filenames are just one instance of this. The first mistake is to network things together in ways that create a technical mismatch between what the users of the localized systems think the filenames mean and what somebody on the other end of such a system may end up interpreted the bag o' bytes to mean. The application should be constructed in such a way that the locale/charset state can be preserved on connection, with the "filename" interpreted in terms of characters in the realm that needs to deal with it that way, and restored to its bag o' bytes at the point that needs it that way. If you can't do that reliably with a "raw" UNIX set of applications, c'est la vie -- you should be building more sophisticated multi-tiered applications on top of your UNIX layer, applications which *can* track and properly handle locale and character set identities. Failing that, then BINARY fields *are* the appropriate way to deal with arbitrary arrays of bytes that cannot be interpreted as characters. Trying to pump them into UTF-8 text data fields and processing them as such when they *aren't* UTF-8 text data is lying to the database and basically forfeiting your warranty that the database will do reasonable things with that data. 
It's as stupid as trying to store date or numeric types in text data fields without first converting them to formatted strings of text data. > in the same way that it would > be overkill to encode all 8-bit strings in XML using Base-64 > just because some of them may contain control characters that are > illegal in well-formed XML. Dunno about the XML issue here -- you're the expert on what the expected level of illegality in usage is there. But for real database applications, there are usually mountains and mountains of stuff going on, most of it completely orthogonal to something as conceptually straightforward as maintaining the correct interpretation of a UNIX filename. It isn't really overkill, in my opinion, to design the appropriate tables and metadata needed for ensuring that your filename handling doesn't blow up somewhere because you've tried to do an UPDATE on a UTF-8 data field with some random bag o' bytes that won't validate as UTF-8 data. > > > In my opinion, trying to do that with a set of encoded characters > > (these 128 or something else) is *less* likely to solve the > > problem than using some visible markup convention instead. > > The trouble with the visible markup, or even the PUA, is that > "well-formed filenames", those which are interpretable as > UTF-8 text, must also be encoded so as to be sure any > markup or PUA that naturally appears in the filename is > escaped properly. This is essentially the Quoted-Printable > encoding, whic
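The "don't lie to the database" rule comes down to a very small gate at the boundary; a sketch in Python (the column labels are invented):

    def classify_filename(raw: bytes):
        # Decide whether a UNIX filename can go into a UTF-8 text column or must stay binary.
        try:
            return ("utf8_text_column", raw.decode("utf-8"))
        except UnicodeDecodeError:
            return ("binary_column", raw)

    print(classify_filename(b"caf\xc3\xa9"))   # ('utf8_text_column', 'café')
    print(classify_filename(b"caf\xe9"))       # ('binary_column', b'caf\xe9')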
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Kenneth Whistler wrote: > I'm going to step in here, because this argument seems to > be generating more heat than light. I agree, and I thank you for that. > First, I'm going to summarize what I think Lars Kristan is > suggesting, to test whether my understanding of the proposal > is correct or not. > > I do not think this is a proposal to amend UTF-8 to allow > invalid sequences. So we should get that off the table. At least until we all understand everything else about this issue. > > What I think this suggestion is is for adding 128 characters > to represent byte values in conversion to Unicode when the > byte values are uninterpretable as characters. Why 128 instead > of 256 I find a little mysterious, but presumably the intent > is to represent 0x80..0xFF as raw, uninterpreted byte values, > unconvertible to Unicode characters otherwise. Indeed, the full 256 codepoints could and perhaps even should be assigned for this purpose. The low 128 may in fact have a different purpose, and different handling. But I would delay this discussion also. > > This is suggested by Lars' use case of: > > > Storing UNIX filenames in a Windows database. > > ... since UNIX filenames are simply arrays of bytes, and cannot, > on interconnected systems, necessarily be interpreted in terms > of well-defined characters. > > Apparently Lars is currently using PUA U+E080..U+E0FF > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping > of byte values uninterpretable as characters to be converted, and > is asking for standard Unicode values for this purpose, instead. Yes. And, yes, it's U+EE80..U+EEFF. > > The other use case that Lars seems to be talking about are > existing documents containing data corruptions in them, which > can often happen when Latin-1 data gets dropped into UTF-8 data > or vice versa due to mislabeled email or whatever. Yes. One could argue that the need for the first use will gradually go away, that's why I also use this second example. Although, I think the first problem is underestimated. And is not limited to my example. And can have much more serious consequences. And might not go away anytime soon. > And I am assuming this is referring primarily to the second case, > where the extreme scenario Lars is envisioning would be, for > example, where each point in a system was hyper-alert to > invalid sequences and simply tossed or otherwise sequestered > entire documents if they got these kinds of data corruptions > in them. And in such a case, I can understand the concern about > angry users. How many people on this list would be cursing if > every bit of email that had a character set conversion error in > it resulting in some bit hash or other, simply got tossed in the > bit bucket instead of being delivered with the glorious hash > intact, at least giving you the chance to see if you could > figure out what was intended? The two aspects of the problem are not always clearly distinct. But yes, let's say it's the second one. I had the need to solve the first problem, not the second one. So some of what I say about this second one is somewhat theoretical. But also realistic, I hope. Or fear. > > This is, I think the basic point at which people are talking past each > other. > > Notionally, Doug is correct that UTF-8 and UTF-16 are equivalent > encoding forms, and anything represented (correctly) in one can > be represented (correctly) in the other. 
In that sense, there is > no difference between representation of text in UTF-8 or UTF-16, > and no reason to postulate that a "UTF-8 based program" will have > any advantages or disadvantages over a "UTF-16 based program" when > it comes to dealing with corrupted data. > > What Lars is talking about is a broad class of UNIX-based software > which is written to handle strings essentially as > opaque bags of bytes, not caring what they contain for many > purposes. Such software generally keeps working just fine if you > pump UTF-8 at it, which is by design for UTF-8 -- precisely because > UTF-8 leaves untouched all the 0x00..0x7F byte values that may > have particular significance for those processes. Most of that > software treats 0x80..0xFF just as bit hash from the get-go, and > neither cares nor has any way of knowing if the particular > sequence of bit hash is valid UTF-8 or Shift-JIS or Latin-1 or > EUC-JIS or some mix or whatever. Yes. With a couple of additions. It is not true that most of that software doesn't care about the encoding. Copy or cat really don't need to, but more does, to count the lines properly (needs to know the number of outputted glyphs or whatever they are, in
Re: Nicest UTF
"D. Starner" <[EMAIL PROTECTED]> writes: > You could hide combining characters, which would be extremely useful if > we were just using Latin and Cyrillic scripts. It would need a separate API for examining the contents of a combining character. You can't avoid the sequence of code points completely. It would yield to surprising semantics: for example if you concatenate a string with N+1 possible positions of an iterator with a string with M+1 positions, you don't necessarily get a string with N+M+1 positions because there can be combining characters at the border. It's simpler to overlay various grouping styles on top of a sequence of code points than to start with automatically combined combining characters and process inwards and outwards from there (sometimes looking inside characters, sometimes grouping them even more). It would impose complexity in cases where it's not needed. Most of the time you don't care which code points are combining and which are not, for example when you compose a text file from many pieces (constants and parts filled by users) or when parsing (if a string is specified as ending with a double quote, then programs will in general treat a double quote followed by a combining character as an end marker). I believe code points are the appropriate general-purpose unit of string processing. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) > Needless to say, these systems were badly designed at their > origin, and > newer filesystems (and OS APIs) offer much better > alternatives, by either > storing explicitly on volumes which encoding it uses, or by > forcing all > user-selected encodings to a common kernel encoding such as > Unicode encoding > schemes (this is what FAT32 and NTFS do on filenames created > under Windows, > since Windows 98 or NT). > The UNIX (I also call it variant) principle has a problem of not knowing the encoding. The Windows (I also call it invariant) principle has a problem that it HAS to know the encoding. The Windows principle has another problem: it can store data from any encoding, and it also does a good job of trying to represent the data in any encoding, but it cannot guarantee identification in just any encoding. An invariant store can be implemented as UTF-8 or UTF-16. Windows uses UTF-16, and guaranteed identification used to be only possible in UTF-16. Due to UTF-8, now it can also be done in 8-bit (console, telnet). But for some reason, support for UTF-8 is still limited in some areas. And the missing roundtrip capability may have something to do with it. I basically agree that the variant approach is not a good one. But the invariant one is not an easy path. It was easier for Windows to take it, because at the time the transition was made, those systems were still single user. Hence, typically all data was in a single encoding. Lars
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Doug Ewell wrote: > How do file names work when the user changes from one SBCS to another > (let's ignore UTF-8 for now) where the interpretation is > different? For > example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, > but U+0102, > A with breve (Ă) in ISO 8859-2. If a file name contains byte > C3, is its > name different depending on the current locale? It displays differently, but compares the same. Whether or not it is the same name is a philosophical question. > Is it > accessible in all > locales? Typically, yes for all SBCS, but not really guaranteed for all MBCS. Depends on whether you validate the string or not. The way UNIX is being developed, those files are typically still accessible since the programs are still working with 8-bit strings. And that is what I am saying. A UTF-8 program (a hypothetical 'UNIX Commander 8') would have no problems accessing the files. A UTF-16 program (a hypothetical 'UNIX Commander 16') on the other hand would have problems. > (Not every SBCS defines a character at every code point. > There's no C3 in ISO 8859-3, for example.) It works just like unassigned codepoints in Unicode work. How they are displayed is not defined, but they can be passed around and compared for equality. Collation is again not defined, but simple sorting does give useful results. > > Does this work with MBCS other than UTF-8? I know you said > other MBCS, > like Shift-JIS, are not often used alongside other encodings except > ASCII, but we can't guarantee that since we're not in a perfect world. > :-) What if they were? I don't know if and how much they were. But I am assuming UTF-8 would be used alongside other encodings on a much larger scale. At least that's what we are hoping for, aren't we? Of course it would be even better if we were only using UTF-8 (or any other Unicode format), but the transition has to come first. > I fear Ken is not correct when he says you are not arguing for the > legalization of invalid UTF-8 sequences. I am arguing for a mechanism that allows processing invalid UTF-8 sequences. For those who need to do so. You can still think of them as invalid. Exactly what they will be called and to what extent they will be discouraged still needs to be investigated and defined. > This isn't about UTF-8 versus other encoding forms. UTF-8-based > programs will reject these invalid sequences because they don't map to > code points, and because they are supposed to reject them. The problem is, until now a text editor typically preserved all data if a file was opened and saved immediately. Even binary data. And the data could be interpreted as Latin 1, Latin 2, ... But you cannot interpret the data as UTF-8 and preserve all the data at the same time. Well, actually it is possible, which is exactly what I am saying is the advantage of UTF-8. But if you insist on validation, you break it. Fine, you get your Unicode world, and UTF-16 is then just as good as UTF-8. But you are now losing data where previously it wasn't lost. Well, you'd better remember to put a disclaimer in your license agreement... > > Besides, surrogates are not completely interchangeable. > Frankly, they > are, but do not need to be, right? > > They are not completely. In UTF-8 and UTF-32, they are not allowed at > all. In UTF-16, they may only occur in the proper context: a high > surrogate may only occur before a low surrogate, and a low > surrogate may > only appear after a high surrogate.
No other usage of surrogates is > permitted, because if unpaired surrogates could be interpreted, the > interpretation would be ambiguous. Well, yes, that's the theory. But as usual, I look at how things that are not defined yet work. From the algorithms, unpaired surrogates convert pretty well. Unless they start to pair up, of course. But there are cases where one knows they cannot (no concatenation is done). Let me bring up one issue again. I want to standardize a mechanism that allows a roundtrip for 8-bit data. And I already stated that by doing that, you lose the roundtrip for 16-bit data. Now I ask myself again, is that true? Yes and no. For the case I mentioned above (no concatenation), roundtrip is currently really possible. But generally speaking, it is not always possible. And last but not least, you don't even care for it, right? Good, because that means my proposal doesn't make anything worse. > I admit my error with regard to the handling of file names by > Unix-style > file systems, and I appreciate being set straight. Sorry for rubbing it in, but ... could it be that a lot of conclusions you have about what Unicode should or should not be are also wrong if they were based on such incorrect assumptions?
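Doug's 0xC3 question above is easy to check concretely (Python codecs used purely as a convenient illustration; nothing here is specific to any particular filesystem): the stored byte is the name, only its display changes with the locale, and it is a strict UTF-8 decoder that first refuses it.

    name = b"\xc3"
    print(name.decode("iso-8859-1"))    # 'Ã'  (U+00C3, A with tilde)
    print(name.decode("iso-8859-2"))    # 'Ă'  (U+0102, A with breve)
    print(name == b"\xc3")              # True: comparison happens on the bytes,
                                        # so the file is found under either locale
    try:
        name.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not valid UTF-8:", e)    # a lone 0xC3 is an incomplete sequence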
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Lars Kristan wrote: > I never said it doesn't violate any existing rules. Stating that it > does, doesn't help a bit. Rules can be changed. Assuming we understand > the consequences. And that is what we should be discussing. By stating > what should be allowed and what should be prohibited you are again > defending those rules. I agree, rules should be defended, but only up > to a certain point. Simply finding a rule that is offended is not > enough to prove something is bad or useless. In my opinion, these are rules that should not be broken or changed, NOT because changing the rules is inherently bad but because these particular changes would cause more problems than they would solve. In my opinion. > Defining Unicode as the world of codepoints is a complex task on its > own. It seems that you are afraid of stepping out of this world, since > you do not know what awaits you there. So, it is easier to find an > excuse within existing rules, especially if a proposed change > threatens to shake everything right down to the foundation. If I would > be dealing with Unicode (as we know it), I would probably be doing the > same thing. I ask you to step back and try to see the big picture. My objection to this has nothing to do with being some kind of conservative fuddy-duddy who is afraid to think outside the box. >> Do you have a use case for this? > > Yes, I definitely have. I am the one accusing you of living in a > perfect world, remember? Yes, I remember. Thank you. > Do you think I would do that if I wasn't dealing with this problem in > real life? The problem seems to be that you have file names in a Unix or Unix-like file system, where names are stored as uninterpreted bytes (thanks to everyone who pointed this out; I have learned something), and these bytes need to remain valid if the locale specifies UTF-8 and the bytes don't make a valid UTF-8 sequence. Right? How do file names work when the user changes from one SBCS to another (let's ignore UTF-8 for now) where the interpretation is different? For example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, but U+0102, A with breve (Ă) in ISO 8859-2. If a file name contains byte C3, is its name different depending on the current locale? Is it accessible in all locales? (Not every SBCS defines a character at every code point. There's no C3 in ISO 8859-3, for example.) Does this work with MBCS other than UTF-8? I know you said other MBCS, like Shift-JIS, are not often used alongside other encodings except ASCII, but we can't guarantee that since we're not in a perfect world. :-) What if they were? If you have a UTF-8 locale, and file names that contain invalid UTF-8 sequences, how would you address those files in a locale-aware way? This is similar to the question about the file with byte C3, which is Ã in one locale, Ă in another, and an unassigned code point in a third. > It is the current design that is unfair. A UTF-16 based program will > only be able to process valid UTF-8 data. A UTF-8 based program will > in many cases preserve invalid sequences even without any effort. I fear Ken is not correct when he says you are not arguing for the legalization of invalid UTF-8 sequences. > Let me guess, you will say it is a flaw in the UTF-8 based program. Good guess. Unicode and ISO/IEC 10646 say it is, and I say it is. > If validation is desired, yes. But then I think you would want all > UTF-8 based programs to do that. That will not happen.
What will > happen is that UTF-8 based programs will be better text editors > (because they will not lose data or constantly complain), while UTF-16 > based programs will produce cleaner data. You will opt for the latter. > And I for the former. But will users know exactly what they've got? > Will designers know exactly what they're gonna get? This is where all > this started. I stated that there is an important difference between > deciding for UTF-8 or for UTF-16 (or UTF-32). This isn't about UTF-8 versus other encoding forms. UTF-8-based programs will reject these invalid sequences because they don't map to code points, and because they are supposed to reject them. > BTW, you have mixed up source and target. Or I don't understand what > you're trying to say. You are right. I spoke of translating German to French, when the example was about going the other way. I made a mistake. > Besides, surrogates are not completely interchangeable. Frankly, they > are, but do not need to be, right? They are not completely. In UTF-8 and UTF-32, they are not allowed at all. In UTF-16, they may only occur in the proper context: a high surrogate may only occur before a low surrogate, and a low surrogate may only appear after a high surrogate. No other usage of s
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote: > I do not think this is a proposal to amend UTF-8 to allow > invalid sequences. So we should get that off the table. I hope you are right. > Apparently Lars is currently using PUA U+E080..U+E0FF > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping > of byte values uninterpretable as characters to be converted, and > is asking for standard Unicode values for this purpose, instead. If I understand correctly, he is using these PUA values when the data is in UTF-16, and using bare high-bit bytes (i.e. invalid UTF-8 sequences) when the data is in UTF-8, and expecting to convert between the two. That has at least two bad implications: (1) the PUA characters would not round-trip from UTF-8 to UTF-16 to UTF-8, but would be converted to the bare high-bit bytes, and (2) the bare high-bit bytes might or might not accidentally form valid UTF-8 sequences, which means they might not round-trip either. > Say a process gets handed a "UTF-8" string that contains the > byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>. > ^^ ^^ > > The 93 and 94 are just corrupt data -- it cannot be interpreted > as UTF-8, and may have been introduced by some process that > screwed up smart quotes from Code Page 1252 and UTF-8, for > example. Interpreting the string, we have: > > <U+0061, U+0062, U+0063, [0x93], U+004D, U+0430, U+4E8C, U+10302, [0x94]> > > Now *if* I am interpreting Lars correctly, he is using 128 > PUA code points to *validly* contain any such byte, so that > it can be retained. If the range he is using is U+EE80..U+EEFF, > then the string would be reinterpreted as: > > <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302, U+EE94> > > which in UTF-8 would be the byte sequence: > > <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94> > > > This is now well-formed UTF-8, which anybody could deal with. And if you interpret U+EE93 as meaning "a placeholder for the uninterpreted or corrupt byte 0x93 in the original source", and so on, you could use this representation to exactly preserve the original information, including corruptions, which you could feed back out, byte-for-byte, if you reversed the conversion. Oh, how I hope that is all he is asking for. > Now moving from interpretation to critique, I think it unlikely > that the UTC would actually want to encode 128 such characters > to represent byte values -- and the reasons would be similar to > those adduced for rejecting the earlier proposal. Effectively, > in either case, these are proposals for enabling representation > of arbitrary, embedded binary data (byte streams) in plain text. > And that concept is pretty fundamentally antithetical to the > Unicode concept of plain text. Isn't this an excellent use for the PUA? These characters are private anyway; they are defined by some standard other than Unicode, which is not evident in the Unicode data. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
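Ken's byte sequence makes a convenient test case. Below is a minimal sketch of the escape/unescape he outlines, assuming the U+EE80..U+EEFF range Lars reports using; the function names and the use of a Python error handler are illustrative choices, not anyone's actual implementation.

    import codecs

    def escape_invalid(err):
        # Each byte of the ill-formed sequence (always 0x80..0xFF for the
        # UTF-8 codec) becomes one placeholder in U+EE80..U+EEFF.
        bad = err.object[err.start:err.end]
        return "".join(chr(0xEE00 + b) for b in bad), err.end

    codecs.register_error("pua-escape", escape_invalid)

    def bytes_to_text(raw: bytes) -> str:
        return raw.decode("utf-8", "pua-escape")

    def text_to_bytes(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if 0xEE80 <= cp <= 0xEEFF:
                out.append(cp - 0xEE00)        # restore the original raw byte
            else:
                out += ch.encode("utf-8")
        return bytes(out)

    raw = bytes.fromhex("61 62 63 93 4d d0 b0 e4 ba 8c f0 90 8c 82 94")
    text = bytes_to_text(raw)
    print(" ".join(f"U+{ord(c):04X}" for c in text))
    # U+0061 U+0062 U+0063 U+EE93 U+004D U+0430 U+4E8C U+10302 U+EE94
    assert text_to_bytes(text) == raw          # byte-for-byte round trip

Note that a well-formed name which already contains U+EE80..U+EEFF would be silently rewritten by the reverse mapping, which is exactly the escaping problem raised elsewhere in the thread.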
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Philippe Verdy wrote: > An alternative can then be a mixed encoding selection: > - choose a legacy encoding that will most often be able to represent > valid filenames without loss of information (for example ISO-8859-1, > or Cp1252). > - encode the filename with it. > - try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8 > encoded. > - if there's no failure, then you must reencode the filename with > UTF-8 instead, even if the result is longer. > - if the strict UTF-8 decoding fails, you can keep the filename in the > first 8-bit encoding... > When parsing files: > - try decoding filenames with *strict* UTF-8 rules. If this does not > fail, then the filename was effectively encoded with UTF-8. > - if the decoding failed, decode the filename with the legacy 8-bit > encoding. > > But even with this scheme, you will find interoperability problems > because some applications will only expect the legacy encoding, or > only the UTF-8 encoding, without deciding... This technique was described as "adaptive UTF-8" by Dan Oscarsson in August 1998: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML012/0738.html although he did not go as far as Philippe did, in actually checking the "adaptively" encoded string to make sure it would be decoded correctly. All the same, it was decided not to go this route, partly because the auto-detection capability of UTF-8 would be lost, partly because having multiple context-dependent encodings of the same code points would have been a Bad Thing (<99 C9> could be encoded adaptively but <C9 99> could not), and partly for the reason Philippe mentions -- most existing decoders would expect either Latin-1 or UTF-8, and would choke if handed a mixture of the two. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler scripsit: > Storage of UNIX filenames on Windows databases, for example, > can be done with BINARY fields, which correctly capture the > identity of them as what they are: an unconvertible array of > byte values, not a convertible string in some particular > code page. This solution, however, is overkill, in the same way that it would be overkill to encode all 8-bit strings in XML using Base-64 just because some of them may contain control characters that are illegal in well-formed XML. > In my opinion, trying to do that with a set of encoded characters > (these 128 or something else) is *less* likely to solve the > problem than using some visible markup convention instead. The trouble with the visible markup, or even the PUA, is that "well-formed filenames", those which are interpretable as UTF-8 text, must also be encoded so as to be sure any markup or PUA that naturally appears in the filename is escaped properly. This is essentially the Quoted-Printable encoding, which is quite rightly known to those stuck with it as "Quoted-Unprintable". > Simply > encoding 128 characters in the Unicode Standard ostensibly to > serve this purpose is no guarantee whatsoever that anyone would > actually implement and support them in the universal way you > envision, any more than they might a "=93", "=94" convention. Why not, when it's so easy to do so? And they'd be *there*, reserved, unassignable for actual character encoding. Plane E would be a plausible location. -- John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan han mathon ne chae, a han noston ne 'wilith. --Galadriel, LOTR:FOTR
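For comparison, here is the visible-markup route in its crudest form (a sketch only: it escapes every non-ASCII byte rather than just ill-formed sequences, and the '=XX' convention is merely modeled on Quoted-Printable, not any standard). The point is John's: even a perfectly ordinary name containing '=' has to be rewritten so that the escape character itself stays unambiguous.

    def qp_escape(raw: bytes) -> str:
        out = []
        for b in raw:
            if 0x20 <= b < 0x7F and b != 0x3D:   # printable ASCII except '='
                out.append(chr(b))
            else:
                out.append("=%02X" % b)
        return "".join(out)

    print(qp_escape(b"plain.txt"))           # plain.txt       (unchanged)
    print(qp_escape(b"a=b.txt"))             # a=3Db.txt       (harmless name, still rewritten)
    print(qp_escape(b"smart\x93quote"))      # smart=93quote   (the corrupt byte survives visibly)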
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Lars, I'm going to step in here, because this argument seems to be generating more heat than light. > I never said it doesn't violate any existing rules. Stating that it does, > doesn't help a bit. Rules can be changed. > I ask you to step back and try to see the big picture. First, I'm going to summarize what I think Lars Kristan is suggesting, to test whether my understanding of the proposal is correct or not. I do not think this is a proposal to amend UTF-8 to allow invalid sequences. So we should get that off the table. What I think this suggestion is is for adding 128 characters to represent byte values in conversion to Unicode when the byte values are uninterpretable as characters. Why 128 instead of 256 I find a little mysterious, but presumably the intent is to represent 0x80..0xFF as raw, uninterpreted byte values, unconvertible to Unicode characters otherwise. This is suggested by Lars' use case of: > Storing UNIX filenames in a Windows database. ... since UNIX filenames are simply arrays of bytes, and cannot, on interconnected systems, necessarily be interpreted in terms of well-defined characters. Apparently Lars is currently using PUA U+E080..U+E0FF (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping of byte values uninterpretable as characters to be converted, and is asking for standard Unicode values for this purpose, instead. The other use case that Lars seems to be talking about are existing documents containing data corruptions in them, which can often happen when Latin-1 data gets dropped into UTF-8 data or vice versa due to mislabeled email or whatever. > So you would drop the data. There are only two options with current designs. > Dropping invalid sequences, or storing it separately (which probably means > the whole document is dead until manually decoded). Dropping invalid > sequences is actually a better choice. And would even be justifiable (but > still sometimes inconvenient) if we were living in world where everything is > in UTF-8. In a world, trying to transition from legacy encodings to Unicode, > there could be a lot of data lost and a lot of angry users. And I am assuming this is referring primarily to the second case, where the extreme scenario Lars is envisioning would be, for example, where each point in a system was hyper-alert to invalid sequences and simply tossed or otherwise sequestered entire documents if they got these kinds of data corruptions in them. And in such a case, I can understand the concern about angry users. How many people on this list would be cursing if every bit of email that had a character set conversion error in it resulting in some bit hash or other, simply got tossed in the bit bucket instead of being delivered with the glorious hash intact, at least giving you the chance to see if you could figure out what was intended? > A UTF-16 based program will only be able to process valid UTF-8 > data. A UTF-8 based program will in many cases preserve invalid sequences > even without any effort. Let me guess, you will say it is a flaw in the > UTF-8 based program. If validation is desired, yes. But then I think you > would want all UTF-8 based programs to do that. That will not happen. What > will happen is that UTF-8 based programs will be better text editors > (because they will not lose data or constantly complain), while UTF-16 based > programs will produce cleaner data. You will opt for the latter. This is, I think the basic point at which people are talking past each other. 
Notionally, Doug is correct that UTF-8 and UTF-16 are equivalent encoding forms, and anything represented (correctly) in one can be represented (correctly) in the other. In that sense, there is no difference between representation of text in UTF-8 or UTF-16, and no reason to postulate that a "UTF-8 based program" will have any advantages or disadvantages over a "UTF-16 based program" when it comes to dealing with corrupted data. What Lars is talking about is a broad class of UNIX-based software which is written to handle strings essentially as opaque bags of bytes, not caring what they contain for many purposes. Such software generally keeps working just fine if you pump UTF-8 at it, which is by design for UTF-8 -- precisely because UTF-8 leaves untouched all the 0x00..0x7F byte values that may have particular significance for those processes. Most of that software treats 0x80..0xFF just as bit hash from the get-go, and neither cares nor has any way of knowing if the particular sequence of bit hash is valid UTF-8 or Shift-JIS or Latin-1 or EUC-JIS or some mix or whatever. > And I for > the former. But will users know exactly what they've got? Will designers > know exactly what they're gonna get? This is where all this started. I > stated that there is an important difference between deciding for UTF-8 or > for UTF-16 (or UTF-32). This is where this is all getting derailed. Whatever the solutions for representation of corrupt data bytes or un
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
Philippe continued: > As if Unicode had to be bound on > architectural constraints such as the requirement of representing code units > (which are architectural for a system) only as 16-bit or 32-bit units, Yes, it does. By definition. In the standard. > ignoring the fact that technologies do evolve and will not necessarily keep > this constraint. 64-bit systems already exist today, and even if they have, > for now, the architectural capability of handling efficiently 16-bit and > 32-bit code units so that they can be addressed individually, this will > possibly not be the case in the future. This is just as irrelevant as worrying about the fact that 8-bit character encodings may not be handled efficiently by some 32-bit processors. > When I look at the encoding forms such as UTF-16 and UTF-32, they just > define the value ranges in which code units will be valid, but not > necessarily their size. Philippe, you are wrong. Go reread the standard. Each of the encoding forms is *explicitly* defined in terms of code unit size in bits. "The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form." If there is something ambiguous or unclear in wording such as that, I think the UTC would like to know about it. > You are mixing this with encoding schemes, which is > what is needed for interoperability, and where other factors such as bit or > byte ordering is also important in addition to the value range. I am not mixing it up -- you are, unfortunately. And it is most unhelpful on this list to have people waxing on, with apparently authoritative statements about the architecture of the Unicode Standard, which on examination turn out to be flat wrong. > I won't see anything wrong if a system is set so that UTF-32 code units will > be stored in 24-bit or even 64-bit memory cells, as long as they respect and > fully represent the value range defined in encoding forms, Correct. And I said as much. There is nothing wrong with implementing UTF-32 on a 64-bit processor. Putting a UTF-32 code point into a 64-bit register is fine. What you have to watch out for is handing me a 64-bit array of ints and claiming that it is a UTF-32 sequence of code points -- it isn't. > and if the system > also provides an interface to convert them with encoding schemes to > interoperable streams of 8-bit bytes. No, you have to have an interface which hands me the correct data type when I declare it uint_32, and which gives me correct offsets in memory if I walk an index pointer down an array. That applies to the encoding *form*, and is completely separate from provision of any streaming interface that wants to feed data back and forth in terms of byte streams. > Are you saying that UTF-32 code units need to be able to represent any > 32-bit value, even if the valid range is limited, for now to the 17 first > planes? Yes. > An API on a 64-bit system that would say that it requires strings being > stored with UTF-32 would also define how UTF-32 code units are represented. > As long as the valid range 0 to 0x10FFFF can be represented, this interface > will be fine. No, it will not. Read the standard. An API on a 64-bit system that uses an unsigned 32-bit datatype for UTF-32 is fine. It isn't fine if it uses an unsigned 64-bit datatype for UTF-32. > If this system is designed so that two or three code units > will be stored in a single 64-bit memory cell, no violation will occur in > the valid range.
You can do whatever the heck crazy thing you want to do internal to your data manipulation, but you cannot surface a datatype packed that way and conformantly claim that it is UTF-32. > More interestingly, there already exist systems where memory is addressable > by units of 1 bit, and on these systems, ... [excised some vamping on the future of computers] > Nothing there is impossible for the future (when it will become more and > more difficult to increase the density of transistors, or to reduce further > the voltage, or to increase the working frequency, or to avoid the > inevitable and random presence of natural defects in substrates; escaping > from the historic binary-only systems may offer interesting opportunities > for further performance increase). Look, I don't care if the processors are dealing in qubits on molecular arrays under the covers. It is the job of the hardware folks to surface appropriate machine instructions that compiler makers can use to surface appropriate formal language constructs to programmers to enable hooking the defined datatypes of the character encoding standards into programming language datatypes. It is the job of the Unicode Consortium to define the encoding forms for representing Unicode code points, so that people manipulating Unicode digital text representation can do so reliably using general purpose programming languages with wel
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded and, depending on local user preferences when that user created that file, filenames on such systems seem to have unpredictable encodings. However the problem comes, most often, when interchanging data from one system to another, through removable volumes or shared volumes. Needless to say, these systems were badly designed at their origin, and newer filesystems (and OS APIs) offer much better alternatives, by either storing explicitly on volumes which encoding it uses, or by forcing all user-selected encodings to a common kernel encoding such as Unicode encoding schemes (this is what FAT32 and NTFS do on filenames created under Windows, since Windows 98 or NT). I understand that there may exist situations, such as Linux/Unix UFS-like filesystems, where it will be hard to decide which encoding was used for filenames (or simply for the content of plain-text files). For plain-text files, which have long-enough data in them, automatic identification of the encoding is possible, and used with success in many applications (notably in web browsers). But for filenames, which are generally short, automatic identification is often difficult. However, UTF-16 remains easy to identify, most often, due to the very unusual frequency of low values in byte sequences on every even or odd position. UTF-8 is also easy to identify due to its strict rules (without these strict rules, which forbid some sequences, automatic identification of the encoding becomes very risky). If the encoding cannot be identified precisely and explicitly, I think that UTF-16 is much better than UTF-8 (and it also offers a better compromise for total size for names in any modern language). However, it's true that UTF-16 cannot be used on Linux/Unix due to the presence of null bytes. The alternative is then UTF-8, but it is often larger than legacy encodings. An alternative can then be a mixed encoding selection: - choose a legacy encoding that will most often be able to represent valid filenames without loss of information (for example ISO-8859-1, or Cp1252). - encode the filename with it. - try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8 encoded. - if there's no failure, then you must reencode the filename with UTF-8 instead, even if the result is longer. - if the strict UTF-8 decoding fails, you can keep the filename in the first 8-bit encoding... When parsing files: - try decoding filenames with *strict* UTF-8 rules. If this does not fail, then the filename was effectively encoded with UTF-8. - if the decoding failed, decode the filename with the legacy 8-bit encoding. But even with this scheme, you will find interoperability problems because some applications will only expect the legacy encoding, or only the UTF-8 encoding, without deciding...
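A sketch of the mixed selection described above (illustration only; ISO-8859-1 stands in for "the legacy encoding" and the function names are invented):

    LEGACY = "iso-8859-1"

    def store_name(name: str) -> bytes:
        try:
            raw = name.encode(LEGACY)
        except UnicodeEncodeError:
            return name.encode("utf-8")    # the legacy charset cannot hold it at all
        try:
            raw.decode("utf-8")            # would a strict UTF-8 reader accept these bytes?
        except UnicodeDecodeError:
            return raw                     # no: the short legacy form is unambiguous
        return name.encode("utf-8")        # yes: must re-encode as UTF-8 to stay unambiguous

    def read_name(raw: bytes) -> str:
        try:
            return raw.decode("utf-8")     # strict success means it really was UTF-8
        except UnicodeDecodeError:
            return raw.decode(LEGACY)      # otherwise fall back to the legacy charset

    for name in ("café", "Đurđević", "Ã©"):    # the last one forces the re-encoding rule
        raw = store_name(name)
        assert read_name(raw) == name
        print(name, "->", raw)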
If only MS Word was coded this well (was Re: Nicest UTF)
From: "D. Starner" <[EMAIL PROTECTED]> (Sorry for sending this twice, Marcin.) "Marcin 'Qrczak' Kowalczyk" writes: UTF-8 is poorly suitable for internal processing of strings in a modern programming language (i.e. one which doesn't already have a pile of legacy functions working of bytes, but which can be designed to make Unicode convenient at all). It's because code points have variable lengths in bytes, so extracting individual characters is almost meaningless Same with UTF-16 and UTF-32. A character is multiple code-points, remember? (decomposed chars?) (unless you care only about the ASCII subset, and sequences of all other characters are treated as non-interpreted bags of bytes). Nope. I've done tons of UTF-8 string processing. I've even done a case insensitive word-frequency measuring algorithm on UTF-8. It runs blastingly fast, because I can do the processing with bytes. It just requires you to understand the actual logic of UTF-8 well enough to know that you can treat it as bytes, most of the time. And the times you can't treat it as bytes, usually you can't even treat UTF-32 as bytes! If you are talking about creating an editfield or text control or something, that is true that UTF-32 is better. However, UTF-16 is the worst of all cases, you'd be better off using UTF-8 as the native encoding of an editfield. The thing is, very very very few people write editfields. I've seen tons of XML parsers in my lifetime (at least 3 I wrote myself), but only a few editfield libraries. Its a shame that very few people understand the different UTFs properly. As for isspace... sure there is a UTF-8 non-byte space. My case insensitive utf-8 word frequency counter (which runs blastingly fast) however didn't find this to be any problem. It dealt with non-single byte all sorts of word breaks :o) It appears to run at about 3MB/second on my laptop, which involves for every word, doing a word check on the entire previous collection of words. Thats like having MS Word spell-check 3MB of pure Unicode text (no style junk bloating up the file-size) in one second, for you. (The words would all be spelt correctly though, so as to not require expensive RAM copying when doing the replacements.) Yes, I do know how to code ;o) Too bad so few others do. -- Theodore H. Smith - Software Developer - www.elfdata.com/plugin/ Industrial strength string processing code, made easy. (If you believe that's an oxymoron, see for yourself.)
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
From: "Kenneth Whistler" <[EMAIL PROTECTED]> Yes, and pigs could fly, if they had big enough wings. Once again, this is a creative comment. As if Unicode had to be bound on architectural constraints such as the requirement of representing code units (which are architectural for a system) only as 16-bit or 32-bit units, ignoring the fact that technologies do evolve and will not necessarily keep this constraint. 64-bit systems already exist today, and even if they have, for now, the architectural capability of handling efficiently 16-bit and 32-bit code units so that they can be addressed individually, this will possibly not be the case in the future. When I look at the encoding forms such as UTF-16 and UTF-32, they just define the value ranges in which code units will be be valid, but not necessarily their size. You are mixing this with encoding schemes, which is what is needed for interoperability, and where other factors such as bit or byte ordering is also important in addition to the value range. I won't see anything wrong if a system is set so that UTF-32 code units will be stored in 24-bit or even 64-bit memory cells, as long as they respect and fully represent the value range defined in encoding forms, and if the system also provides an interface to convert them with encoding schemes to interoperable streams of 8-bit bytes. Are you saying that UTF-32 code units need to be able to represent any 32-bit value, even if the valid range is limited, for now to the 17 first planes? An API on a 64-bit system that would say that it requires strings being stored with UTF-32 would also define how UTF-32 code units are represented. As long as the valid range 0 to 0x10 can be represented, this interface will be fine. If this system is designed so that two or three code units will be stored in a single 64-bit memory cell, no violation will occur in the valid range. More interestingly, there already exists systems where memory is adressable by units of 1 bit, and on these systems, an UTF-32 code unit will work perfectly if code units are stored by steps of 21 bits of memory. On 64-bit systems, the possibility of addressing any groups individual bits will become an interesting option, notably when handling complex data structures such as bitfields, data compressors, bitmaps, ... No more need to use costly shifts and masking. Nothing would prevent such system to offer interoperability with 8-bit byte based systems (note also that recent memory technologies use fast serial interfaces instead of parallel buses, so that the memory granularity is less important). The only cost for bit-addressing is that it just requires 3 bits of address, but in a 64-bit address, this cost seems very low becaue the global addressable space will still be... more than 2.3*10^18 bytes, much more than any computer will manage in a single process for the next century (according to the Moore's law which doubles the computing capabilities every 3 years). Even such scheme would not limit the performance given that memory caches are paged, and these caches are always increasing, eliminating most of the costs and problems related to data alignment experimented today on bus-based systems. 
Other territories are also still unexplored in microprocessors, notably the possibility of using non-binary numeric systems (think about optical or magnetic systems which could outperform the current electric systems due to reduced power and heat caused by currents of electrons through molecular substrates, replacing them by shifts of atomic states caused by light rays, and the computing possibilities offered by light diffraction through crystals). The lowest granularity of information in some future may be larger than a dual-state bit, meaning that today's 8-bit systems would need to be emulated using other numerical systems... (Note for example that to store the range 0..0x10FFFF, you would need 13 digits on a ternary system, and to store the range of 32-bit integers, you would need 21 ternary digits; memory technologies for such systems may use byte units made of 6 ternary digits, so programmers would have the choice between 3 "ternary bytes", i.e. 18 ternary digits, to store our 21-bit code units, or 4 "ternary bytes", i.e. 24 ternary digits or more than 34 binary bits, to be able to store the whole 32-bit range.) Nothing there is impossible for the future (when it will become more and more difficult to increase the density of transistors, or to reduce further the voltage, or to increase the working frequency, or to avoid the inevitable and random presence of natural defects in substrates; escaping from the historic binary-only systems may offer interesting opportunities for further performance increase).
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
> Yes, and pigs could fly, if they had big enough wings. An 8-foot wingspan should do it. For picture of said flying pig see: http://www.cincinnati.com/bigpiggig/profile_091700.html http://www.cincinnati.com/bigpiggig/images/pig091700.jpg Rick
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
Philippe stated, and I need to correct: > UTF-24 already exists as an encoding form (it is identical to UTF-32), if > you just consider that encoding forms just need to be able to represent a > valid code range within a single code unit. This is false. Unicode encoding forms exist by virtue of the establishment of them as standard, by actions of the standardizing organization, the Unicode Consortium. > UTF-32 is not meant to be restricted on 32-bit representations. This is false. The definition of UTF-32 is: "The Unicode encoding form which assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value." It is true that UTF-32 could be (and is) implemented on computers which hold 32-bit numeric types transiently in 64-bit registers (or even other size registers), but if an array of 64-bit integers (or 24-bit integers) were handed to some API claiming to be UTF-32, it would simply be nonconformant to the standard. UTF-24 does not "already exist as an encoding form" -- it already exists as one of a large number of more or less idle speculations by character numerologists regarding other cutesy ways to handle Unicode characters on computers. Many of those cutesy ways are mere thought experiments or even simply jokes. > However it's true that UTF-24BE and UTF-24LE could be useful as encoding > schemes for serializations to byte-oriented streams, suppressing one > unnecessary byte per code point. "Could be", perhaps, but is not. Implementers using UTF-32 for processing efficiency, but who have bandwidth constraints in some streaming context should simply use one of the CES's with better size characteristics or use a compression on their data. > Note that 64-bit systems could do the same: 3 code points per 64-bit unit, > requires only 63 bits, that are stored in a single positive 64-bit integer > (the remaining bit would be the sign bit, always set to 0, avoiding problems > related to sign extensions). And even today's system could use such > representation as well, given that most 32-bit processors of today also have > the internal capabilities to manage 64-bit integers natively. This is just an incredibly bad idea. Packing instructions in large-word microprocessors is one thing. You have built-in microcode which handles that, hidden away from application-level programming, and carefully architected for maximal processor efficiency. But attempting to pack character data into microprocessor words, just because you have bits available, would just detract from the efficiency of handling that data. Storage is not the issue -- you want to get the characters in and out of the registers as efficiently as possible. UTF-32 works fine for that. UTF-16 works almost as well, in aggregate, for that. And I could care less that when U+0061 goes in a 64-bit register for manipulation, the high 57 bits are all set to zero. > Strings could be encoded as well using only 64-bit code units that would > each store 1 to 3 code points, Yes, and pigs could fly, if they had big enough wings. > the unused positions being filled with > invalid codepoints outside the Unicode space (for example by setting all 21 bits > to 1, producing the out-of-range code point 0x1FFFFF, used as a filler for > missing code points, notably when the string to encode is not an exact > multiple of 3 code points). Then, these 64-bit code units could be > serialized on byte streams as well, multiplying the number of possibilities: > UTF-64BE and UTF-64LE?
One interest of such scheme is that it would be more > compact than UTF-32, because this UTF-64 encoding scheme would waste only 1 > bit for 3 codepoints, instead 1 byte and 3 bits for each codepoint with > UTF-32! Wow! > You can imagine many other encoding schemes, depending on your architecture > choices and constraints... Yes, one can imagine all sorts of strange things. I myself imagined UTF-17 once. But there is a difference between having fun imagining strange things and filling the list with confusing misinterpretations of the status and use of UTF-8, UTF-16, and UTF-32. --Ken
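To make the arithmetic of that hypothetical packing concrete (this is nobody's standard; the filler value and the three-per-unit layout come straight from the quoted description, everything else is invented):

    FILLER = 0x1FFFFF                        # all 21 bits set: outside the Unicode space

    def pack_utf64(codepoints):
        units = []
        for i in range(0, len(codepoints), 3):
            group = list(codepoints[i:i+3])
            group += [FILLER] * (3 - len(group))
            units.append((group[0] << 42) | (group[1] << 21) | group[2])   # 63 bits, sign bit 0
        return units

    def unpack_utf64(units):
        out = []
        for unit in units:
            for shift in (42, 21, 0):
                cp = (unit >> shift) & 0x1FFFFF
                if cp != FILLER:
                    out.append(cp)
        return out

    cps = [ord("a"), ord("b"), 0x10302]      # two ASCII characters and one supplementary
    units = pack_utf64(cps)
    assert unpack_utf64(units) == cps
    print([hex(u) for u in units])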
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Doug Ewell replied: > Actually the Unicode Technical Committee. But you are > correct: it is up > to the UTC to decide whether they want to redefine UTF-8 to permit > invalid sequences, which are to be interpreted as unknown characters > from an unknown legacy coding standard, and to prohibit > conversion from > this redefined UTF-8 to other encoding schemes, or directly to Unicode > code points. We will have to wait and see what UTC members think of > this. I never said it doesn't violate any existing rules. Stating that it does, doesn't help a bit. Rules can be changed. Assuming we understand the consequences. And that is what we should be discussing. By stating what should be allowed and what should be prohibited you are again defending those rules. I agree, rules should be defended, but only up to a certain point. Simply finding a rule that is offended is not enough to prove something is bad or useless. > > > But this decision should not be based solely on theory and ideal > > worlds. > > Right. Uh-huh. Defining Unicode as the world of codepoints is a complex task on its own. It seems that you are afraid of stepping out of this world, since you do not know what awaits you there. So, it is easier to find an excuse within existing rules, especially if a proposed change threatens to shake everything right down to the foundation. If I would be dealing with Unicode (as we know it), I would probably be doing the same thing. I ask you to step back and try to see the big picture. > > Of course not. That is not at all the same as INTENTIONALLY storing > invalid sequences in UTF-8 and expecting the decoding mechanism to > preserve the invalid bytes for posterity. So you would drop the data. There are only two options with current designs. Dropping invalid sequences, or storing it separately (which probably means the whole document is dead until manually decoded). Dropping invalid sequences is actually a better choice. And would even be justifiable (but still sometimes inconvenient) if we were living in world where everything is in UTF-8. In a world, trying to transition from legacy encodings to Unicode, there could be a lot of data lost and a lot of angry users. > > > And do what with it, Lars? Keep it on a shelf indefinitely > in case some > archaeologist unearths a new legacy encoding that might unlock the > mystery data? > > Is this really worth the effort of redefining UTF-8 and > disallowing free > conversion between UTF-8 and Unicode code points? > > Do you have a use case for this? Yes, I definitely have. I am the one accusing you of living in a perfect world, remember?. Do you think I would do that if I wasn't dealing with this problem in real life? > > > So with your plan, you have invalid sequence #1, invalid sequence #2, > and so forth. Now, what do the sequences mean? Is there any way to > interpret them? No, there isn't, because by definition these > sequences > represent characters from an unknown coding standard. Either > (a) nobody > has gone to the trouble to find out what characters they truly > represent, (b) the original standard is lost and we will *never* know, > or (c) we are waiting for the archaeologist to save the day. > > In the meantime, the UTF-8 data with invalid sequences must be kept > isolated from all processes that would interpret the sequences as code > points, and raise an exception on invalid sequences-- in other words, > all existing processes that handle UTF-8. On the contrary. 
If those invalid sequences can (well, may) be translated into codepoints, then you can stop worrying about them. Or at least all the worrying is done within the conversion. It is the current design that is unfair. A UTF-16 based program will only be able to process valid UTF-8 data. A UTF-8 based program will in many cases preserve invalid sequences even without any effort. Let me guess, you will say it is a flaw in the UTF-8 based program. If validation is desired, yes. But then I think you would want all UTF-8 based programs to do that. That will not happen. What will happen is that UTF-8 based programs will be better text editors (because they will not lose data or constantly complain), while UTF-16 based programs will produce cleaner data. You will opt for the latter. And I for the former. But will users know exactly what they've got? Will designers know exactly what they're gonna get? This is where all this started. I stated that there is an important difference between deciding for UTF-8 or for UTF-16 (or UTF-32). > > > Let's compare UTF-8 to UTF-16 conversion to an automated translation > > from German to French. What Unicode standard says can be interpreted > > as follows: > > > > * All in
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) Doug Ewell wrote: > John Cowan wrote: > > > Windows filesystems do know what encoding they use. But a > filename on > > a Unix(oid) file system is a mere sequence of octets, of > which only 00 > > and 2F are interpreted. (Filenames containing 20, and > especially 0A, > > are annoying to handle with standard tools, but not illegal.) > > > > How these octet sequences are translated to characters, if at all, > > is no concern of the file system's. Some higher-level > tools, such as > > directory listers and shells, have hardwired assumptions, > others have > > changeable assumptions, but all are assumptions. > > OK, fair enough. Under a Unixoid file system, a file name > consists of a > more or less arbitrary sequence of bytes, essentially > unregulated by the > OS. > > If interpreted as UTF-8, some of these sequences may be > invalid, and the > files may be inaccessible. > > This is *exactly* the same scenario as with GB 2312, or > Shift-JIS, or KS > C 5601, or ISO 6937, or any other multibyte character encoding ever > devised. > > This is not a problem that needs to be solved within Unicode, any more > than it needed to be solved within those other encodings. > Shift-JIS was typically not mixed with other encodings, except for pure 7-bit ASCII. UTF-8 will be. And Shift-JIS had other serious problems, like the trailing backslash byte. UTF-8 has learned a lot from Shift-JIS. If there is anything still to learn, then let's welcome that. Also, Shift-JIS (and other MBCS encodings) were a must for those cultures. UTF-8 is not a must. If there will be problems, there will be complaints. And resistance. Lars
Re: Nicest UTF
From: "D. Starner" <[EMAIL PROTECTED]> If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary indexing, but arbitrary indexing is rarely important. I fully concur to this point of view. Almost all (if not all) string processing can be performed in terms of sequential enumerators, instead of through random indexing (which has also the big disavantage of not allowing with rich context dependant processing behaviors, something you can't ignore when handling international texts). So internal storage of string does not matter for the programming interface of parsable string objects. In terms of efficiency and global application performance, using compressed encoding schemes is highly recommanded for large databases of text, because the negative impact of the decompressing overhead is extremely small face to the huge benefits you get when reducing the load on system resources, on data locality and on memory caches, on the system memory allocator, on the memory fragmentation level, on reduced VM swaps and on file or database I/O (which will be the only effective limitation for large databases).
Re: Nicest UTF
(Sorry for sending this twice, Marcin.) "Marcin 'Qrczak' Kowalczyk" writes: > UTF-8 is poorly suitable for internal processing of strings in a > modern programming language (i.e. one which doesn't already have a > pile of legacy functions working of bytes, but which can be designed > to make Unicode convenient at all). It's because code points have > variable lengths in bytes, so extracting individual characters is > almost meaningless (unless you care only about the ASCII subset, and > sequences of all other characters are treated as non-interpreted bags > of bytes). You can't even have a correct equivalent of C isspace(). That's assuming that the programming language is similar to C and Ada. If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary indexing, but arbitrary indexing is rarely important. You could hide combining characters, which would be extremely useful if we were just using Latin and Cyrillic scripts. You'd have to be flexible, since it would be natural to step through a Hebrew or Arabic string as if the vowels were written inline, and people might want to look at the combining characters (which would be incredibly rare if your language already provided most standard Unicode functions.) -- ___ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
- Original Message - From: "Arcane Jill" <[EMAIL PROTECTED]> Probably a dumb question, but how come nobody's invented "UTF-24" yet? I just made that up, it's not an official standard, but one could easily define UTF-24 as UTF-32 with the most-significant byte (which is always zero) removed, hence all characters are stored in exactly three bytes and all are treated equally. You could have UTF-24LE and UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly brilliant idea, but I just wonder why no-one's suggested it before. UTF-24 already exists as an encoding form (it is identical to UTF-32), if you just consider that encoding forms just need to be able to represent a valid code range within a single code unit. UTF-32 is not meant to be restricted on 32-bit representations. However it's true that UTF-24BE and UTF-24LE could be useful as encoding schemes for serializations to byte-oriented streams, suppressing one unnecessary byte per code point. (And then of course, there's UTF-21, in which blocks of 21 bits are concatenated, so that eight Unicode characters will be stored in every 21 bytes - and not to mention UTF-20.087462841250343, in which a plain text document is simply regarded as one very large integer expressed in radix 1114112, and whose UTF-20.087462841250343 representation is simply that number expressed in binary. But now I'm getting /very/ silly - please don't take any of this seriously.) :-) I don't think that UTF-21 would be useful as an encoding form, but possibly as an encoding scheme where 3 always-zero bits would be stripped, providing a tiny compression level, which would only be justified for transmission over serial or network links. However I do think that such "optimization" would have the effect of removing byte alignments, on which more powerful compressors are working. If you really need a more effective compression, use SCSU or apply some deflate or bzip2 compression to UTF-8, UTF-16, or UTF-24/32... (there's not much difference between compressing UTF-24 or UTF-32 with generic compression algorithms like deflate or bzip2). The "UTF-24" thing seems a reasonably sensible question though. Is it just that we don't like it because some processors have alignment restrictions or something? There do exist, even still today, 4-bit processors and 1-bit processors, where the smallest addressable memory unit is smaller than 8 bits. They are used for low-cost micro-devices, notably to build automated robots for industry, or even for many home/kitchen devices. I don't know whether they need Unicode to represent international text, given that they often have a very limited user interface, incapable of inputting or outputting text, but who knows? Maybe they are used in some mobile phones, or within "smart" keyboards or tablets or other input devices connected to PCs... There also exist systems where the smallest addressable memory cell is a 9-bit byte. This is more of an issue here, because the Unicode standard does not specify whether encoding schemes (that serialize code points to bytes) should set the 9th bit of each byte to 0, or should fill every 8 bits of memory, even if this means that 8-bit bytes of UTF-8 will not be synchronized with memory 9-bit bytes. Somebody already introduced UTF-9 in the past for 9-bit systems.
A 36-bit processor could just as well address memory in cells of 36 bits, where the 4 highest bits would be used either for CRC control bits (generated and checked automatically by the processor or a memory bus interface within memory regions where this behavior would be allowed), or to store supplementary bits of actual data (in unchecked regions that fit in reliable and fast memory, such as the internal memory cache of the CPU, or static CPU registers). For such things, the impact of the transformation of addressable memory widths through interfaces is for now not discussed in Unicode, which supposes that internal memory is necessarily addressed in a power of 2 and a multiple of 8 bits, and then interchanged or stored using this byte unit. Today we are witnessing a constant expansion of bus widths to allow parallel processing instead of raising the working frequency (and the energy consumption and heat, which create other environmental problems), so why should the 8-bit byte remain the most efficient universal unit? If you look at IEEE floating-point formats, they are often implemented in FPUs working on 80-bit units, and an 80-bit memory cell could well become a standard tomorrow (compatible with the increasingly used 64-bit architectures of today) which would no longer be a power of 2 (even if it stays a multiple of 8 bits). On an 80-bit system, the easiest solution for handling UTF-32 without using too much space would be a 40-bit unit (i.e. two code points per 80-bit memory cell). But if you consider that 21 bits only are used
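For what it's worth, the byte-level serialization that a hypothetical UTF-24BE would need is tiny: each code point is just the low three bytes of its UTF-32 value in big-endian order. The sketch below is illustrative only; no such encoding scheme is defined by Unicode, and the function names are made up.

    #include <stdint.h>

    /* Serialize one scalar value (<= 0x10FFFF) as three big-endian bytes:
       UTF-32BE with the always-zero top byte dropped. */
    static void utf24be_put(uint32_t cp, unsigned char out[3])
    {
        out[0] = (cp >> 16) & 0xFF;
        out[1] = (cp >> 8)  & 0xFF;
        out[2] = cp & 0xFF;
    }

    /* Read one code point back from three big-endian bytes. */
    static uint32_t utf24be_get(const unsigned char in[3])
    {
        return ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
    }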
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan wrote: > Windows filesystems do know what encoding they use. But a filename on > a Unix(oid) file system is a mere sequence of octets, of which only 00 > and 2F are interpreted. (Filenames containing 20, and especially 0A, > are annoying to handle with standard tools, but not illegal.) > > How these octet sequences are translated to characters, if at all, > is no concern of the file system's. Some higher-level tools, such as > directory listers and shells, have hardwired assumptions, others have > changeable assumptions, but all are assumptions. OK, fair enough. Under a Unixoid file system, a file name consists of a more or less arbitrary sequence of bytes, essentially unregulated by the OS. If interpreted as UTF-8, some of these sequences may be invalid, and the files may be inaccessible. This is *exactly* the same scenario as with GB 2312, or Shift-JIS, or KS C 5601, or ISO 6937, or any other multibyte character encoding ever devised. This is not a problem that needs to be solved within Unicode, any more than it needed to be solved within those other encodings. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
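The check implied above -- deciding whether a filename, taken as raw bytes, happens to be well-formed UTF-8 -- can be sketched as follows. This version enforces shortest-form sequences and rejects surrogates and values above U+10FFFF; it is an illustration under those assumptions, not a vetted validator.

    #include <stddef.h>

    /* Return 1 if the byte string is well-formed UTF-8, 0 otherwise. */
    static int is_valid_utf8(const unsigned char *s, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char b = s[i];
            if (b < 0x80) { i += 1; continue; }

            int extra;
            unsigned long cp, min;
            if      (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; min = 0x80; }
            else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; min = 0x800; }
            else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; min = 0x10000; }
            else return 0;                     /* 0x80..0xC1, 0xF5..0xFF: never valid leads */

            if (i + extra >= len) return 0;    /* truncated sequence */
            for (int k = 1; k <= extra; k++) {
                unsigned char c = s[i + k];
                if ((c & 0xC0) != 0x80) return 0;   /* not a continuation byte */
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < min) return 0;                        /* overlong form */
            if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    /* surrogate */
            if (cp > 0x10FFFF) return 0;
            i += 1 + extra;
        }
        return 1;
    }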
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Doug Ewell scripsit: > > Now suppose you have a UNIX filesystem, containing filenames in a > > legacy encoding (possibly even more than one). If one wants to switch > > to UTF-8 filenames, what is one supposed to do? Convert all filenames > > to UTF-8? > > Well, yes. Doesn't the file system dictate what encoding it uses for > file names? How would it interpret file names with "unknown" characters > from a legacy encoding? How would they be handled in a directory > search? Windows filesystems do know what encoding they use. But a filename on a Unix(oid) file system is a mere sequence of octets, of which only 00 and 2F are interpreted. (Filenames containing 20, and especially 0A, are annoying to handle with standard tools, but not illegal.) How these octet sequences are translated to characters, if at all, is no concern of the file system's. Some higher-level tools, such as directory listers and shells, have hardwired assumptions, others have changeable assumptions, but all are assumptions. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan No man is an island, entire of itself; every man is a piece of the continent, a part of the main. If a clod be washed away by the sea, Europe is the less, as well as if a promontory were, as well as if a manor of thy friends or of thine own were: any man's death diminishes me, because I am involved in mankind, and therefore never send to know for whom the bell tolls; it tolls for thee. --John Donne
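A small illustration of the point: the only bytes the path layer itself interprets are 0x00 (the terminator) and 0x2F ('/'), so a component scan works on raw bytes with no notion of the encoding used for the names in between. The function name is illustrative.

    #include <stdio.h>
    #include <string.h>

    /* Print the components of a path, treating the name purely as bytes.
       Only '/' and the terminating NUL are interpreted; the bytes in
       between may be UTF-8, Latin-1, Shift-JIS, or anything else. */
    static void print_components(const char *path)
    {
        const char *p = path;
        for (;;) {
            const char *slash = strchr(p, '/');
            size_t n = slash ? (size_t)(slash - p) : strlen(p);
            if (n > 0)
                printf("component: %.*s\n", (int)n, p);
            if (!slash)
                break;
            p = slash + 1;
        }
    }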
Invalid UTF-8 sequences (was: Re: Nicest UTF)
RE: Nicest UTFLars Kristan wrote: >> I could not disagree more with the basic premise of Lars' post. It >> is a fundamental and critical mistake to try to "extend" Unicode with >> non-standard code unit sequences to handle data that cannot be, or >> has not been, converted to Unicode from a legacy standard. This is >> not what any character encoding standard is for. > > What a standard is or is not for is a decision. And Unicode consortium > is definitely the body that makes the decision in this case. Actually the Unicode Technical Committee. But you are correct: it is up to the UTC to decide whether they want to redefine UTF-8 to permit invalid sequences, which are to be interpreted as unknown characters from an unknown legacy coding standard, and to prohibit conversion from this redefined UTF-8 to other encoding schemes, or directly to Unicode code points. We will have to wait and see what UTC members think of this. > But this decision should not be based solely on theory and ideal > worlds. Right. Uh-huh. >> This is simply what you have to do. You cannot convert the data into >> Unicode in a way that says "I don't know how to convert this data >> into Unicode." You must either convert it properly, or leave the >> data in its original encoding (properly marked, preferably). > > Here lies the problem. Suppose you have a document in UTF-8, which > somehow got corrupted and now contains a single invalid sequence. Are > you proposing that this document needs to be stored separately? Of course not. That is not at all the same as INTENTIONALLY storing invalid sequences in UTF-8 and expecting the decoding mechanism to preserve the invalid bytes for posterity. > Everything else in the database would be stored in UTF-16, but now one > must add the capability to store this document separately. And > probably not index it. Regardless of any useful data in it. But if you > use UTF-8 storage instead, you can put it in with the rest (if you can > mark it, even better, but you only need to do it if that is a > requirement). And do what with it, Lars? Keep it on a shelf indefinitely in case some archaeologist unearths a new legacy encoding that might unlock the mystery data? Is this really worth the effort of redefining UTF-8 and disallowing free conversion between UTF-8 and Unicode code points? Do you have a use case for this? > I can reinterprete your example. Using the French word is exactly the > solution I am proposing, and I see your solution is to replace the > word with a placeholder which says "a word that does not exist in > German". Even worse, you want to use the same placeholder for all the > unknown words. Numbering them would be better, but awkward, since you > don't know how to assign numbers. Fortunetely, with bytes in invalid > sequences, the numbering is trivial and has a meaning. So with your plan, you have invalid sequence #1, invalid sequence #2, and so forth. Now, what do the sequences mean? Is there any way to interpret them? No, there isn't, because by definition these sequences represent characters from an unknown coding standard. Either (a) nobody has gone to the trouble to find out what characters they truly represent, (b) the original standard is lost and we will *never* know, or (c) we are waiting for the archaeologist to save the day. 
In the meantime, the UTF-8 data with invalid sequences must be kept isolated from all processes that would interpret the sequences as code points, and raise an exception on invalid sequences-- in other words, all existing processes that handle UTF-8. > Let's compare UTF-8 to UTF-16 conversion to an automated translation > from German to French. What Unicode standard says can be interpreted > as follows: > > * All input text must be valid German language. > * All output text must be valid French language. > * Any unknown words shall be replaced by a (single) 'unknown word' > placeholder. If you have French words that cannot be translated into German at all, and nobody in the target audience is capable of understanding French, then what you have is an inscrutable collection of mystery data, perhaps suitable for research and examination by linguists, but not something that the audience can make any sense of. In that case, converting all the mystery data to a single "unknown word" placeholder is no worse than any other solution, and in particular, no worse than a solution that converts 100 different mystery words into 100 different placeholders, *none* of which the audience can decipher. > And that last statement goes for German words missing in your > dictionary, misspelled words, Spanish words, proper nouns... The underlying assumption is that somebody, somewhere, will be able to recognize these "foreign" or "unrecognized" words and make some sense of them. But in your character encoding example, the premise is that we DON'T know what the original encoding was, and it's too difficult or impossible to find out, so we just shoeho
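The conventional behavior being defended here can be sketched as a decoder that substitutes U+FFFD REPLACEMENT CHARACTER whenever the bytes are not valid UTF-8; the original byte values are unrecoverable after that point. The resynchronization policy below (skip one byte per error) is a simplification, and real decoders vary on that detail.

    #include <stdint.h>
    #include <stddef.h>

    #define REPLACEMENT 0xFFFDu

    /* Decode one code point starting at s[*i]; on malformed input emit
       U+FFFD and skip a single byte. */
    static uint32_t decode_or_replace(const unsigned char *s, size_t len, size_t *i)
    {
        unsigned char b = s[*i];
        int extra;
        uint32_t cp, min;

        if (b < 0x80) { (*i)++; return b; }
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else { (*i)++; return REPLACEMENT; }          /* invalid lead byte */

        if (*i + extra >= len) { (*i)++; return REPLACEMENT; }   /* truncated */
        for (int k = 1; k <= extra; k++) {
            if ((s[*i + k] & 0xC0) != 0x80) { (*i)++; return REPLACEMENT; }
            cp = (cp << 6) | (s[*i + k] & 0x3F);
        }
        if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            (*i)++; return REPLACEMENT;               /* overlong or out of range */
        }
        *i += 1 + extra;
        return cp;
    }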
Re: Nicest UTF
Lars Kristan <[EMAIL PROTECTED]> writes: >> This is simply what you have to do. You cannot convert the data >> into Unicode in a way that says "I don't know how to convert this >> data into Unicode." You must either convert it properly, or leave >> the data in its original encoding (properly marked, preferably). > > Here lies the problem. Suppose you have a document in UTF-8, which > somehow got corrupted and now contains a single invalid sequence. > Are you proposing that this document needs to be stored separately? He is not proposing that. > Everything else in the database would be stored in UTF-16, but now > one must add the capability to store this document separately. No, it can be be stored in UTF-16 or whatever else is used. Except the corrupted part of course, but it's corrupted, and thus useless, so it doesn't matter what happens with it. > Now suppose you have a UNIX filesystem, containing filenames in a legacy > encoding (possibly even more than one). If one wants to switch to UTF-8 > filenames, what is one supposed to do? Convert all filenames to UTF-8? Yes. > Who will do that? A system administrator (because he has access to all files). > And when? When the owners of the computer system decide to switch to UTF-8. > Will all users agree? It depends on who decides about such things. Either they don't have a voice, or they agree and the change is made, or they don't agree and the change is not made. What's the point? > Should all filenames that do not conform to UTF-8 be declared invalid? What do you mean by "invalid"? They are valid from the point of view of the OS, but they will not work with reasonable applications which use Unicode internally. > If you keep all processing in UTF-8, then this is a decision you can > postpone. You mean, various programs will break at various points of time, instead of working correctly from the beginning? If it's broken, fix it, instead of applying patches which will sometimes hide the fact that it's broken, or sometimes not. > I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames. > Do you want to discourage them? Mixing any two incompatible filename encodings on the same file system is a bad idea. > IMHO, preserving data is more important, but so far it seems it is > not a goal at all. With a simple argument - that Unicode only > defines how to process Unicode data. Understandably so, but this > doesn't mean it needs to remain so. If you don't know the encoding and want to preserve the values of bytes, then don't convert it to Unicode. > Well, you may have a wrong assumption here. You probably think that > I convert invalid sequences into PUA characters and keep them as > such in UTF-8. That is not the case. Any invalid sequences in UTF-8 > are left as they are. If they need to be converted to UTF-16, then > PUA is used. If they are then converted to UTF-8, they are converted > back to their original bytes, hence the incorrect sequences are > re-created. This does not make sense. If you want to preserve the bytes instead of working in terms of characters, don't convert it at all - keep the original byte stream. > One more example of data loss that arises from your approach: If a > single bit is changed in UTF-16 or UTF-32, that is all that will > happen (in more than 99% of the cases). If a single bit changes in > UTF-8, you risk that the entire character will be dropped or > replaced with the U+FFFD. But funny, only if it ever gets converted > to the UTF-16 or UTF-32. 
Not that this is a major problem on its > own, but it indicates that there is something fishy in there. If you change one bit in a file compressed by gzip, you might not be able to recover any part of it. What's the point? UTF-x were not designed to minimize the impact of corruption of encoded bytes. If you want to preserve the text despite occasional corruption, use a higher level protocol for this (if I remember correctly, RAR can add additional information to an archive which allows to recover the data even if parts of the archive, entire blocks, have been lost). > There was a discussion on nul characters not so long ago. Many text > editors do not properly preserve nul characters in text files. > But it is definitely a nice thing if they do. While preserving nul > characters only has a limited value, preserving invalid sequences > in text files could be crucial. An editor should alert the user that the file is not encoded in a particular encoding or that it's corrupted, instead of trying to guess which characters were supposed to be there. If it's supposed to edit binary files too, it should work on the bytes instead of decoded characters. > A UTF-8 based editor can easily do this. A UTF-16 based editor > cannot do it at all. If you say that UTF-16 is not intended for such > a purpose, then so be it. But this also means that UTF-8 is superior. It's much easier with CP-1252, which shows that it's superior to UTF-8 :-) > Yes, it is not related
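For concreteness, one way to read the scheme Lars describes (and which is being objected to here) is this: the 128 byte values 0x80-0xFF, when they occur in invalid positions, are mapped onto 128 fixed code points on conversion to UTF-16, and mapped back to the original bytes on conversion to UTF-8. Lars does not say which code points he uses; the PUA base below is purely an assumption for illustration. Note that the round trip is lossless only if every converter in the chain knows the convention, which is exactly the interoperability problem raised in this thread.

    #include <stdint.h>

    /* Assumed, purely illustrative PUA base for the 128 "escaped byte"
       code points; the actual choice is not specified in the thread. */
    #define ESCAPE_BASE 0xE000u

    /* Byte from an invalid UTF-8 sequence -> escape code point (for UTF-16). */
    static uint32_t escape_byte(unsigned char b)   /* b is 0x80..0xFF */
    {
        return ESCAPE_BASE + (b - 0x80u);
    }

    /* Escape code point -> original byte, when converting back to UTF-8. */
    static int unescape_cp(uint32_t cp, unsigned char *out)
    {
        if (cp >= ESCAPE_BASE && cp < ESCAPE_BASE + 0x80u) {
            *out = (unsigned char)(0x80u + (cp - ESCAPE_BASE));
            return 1;       /* emit this raw byte instead of encoding cp */
        }
        return 0;           /* not an escape; encode cp as normal UTF-8 */
    }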
Re: Nicest UTF
Asmus Freytag wrote: A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider 1) 1 extra test per character (to see whether it's a surrogate) In my experience with tuning a fair amount of utf-16 software, this test takes pretty close to zero time. All modern processors have branch and pipeline trickery that fairly effectively disappears the cost of a predictable branch within a tight loop. Occurrences of supplementary characters should generally be rare enough that the extra time to process them when they are encountered is not statistically significant. 2) special handling every 100 to 1000 characters (say 10 instructions) 3) additional cost of accessing 16-bit registers (per character) 4) reduction in cache misses (each the equivalent of many instructions) This is a big deal. The costs in plowing through lots of text data with relatively simple processing appear to be heavily related to the required memory bandwidth. Assuming reasonably carefully written code, that is. 5) reduction in disk access (each the equivaletn of many many instructions) For many operations, e.g. string length, both 1, and 2 are no-ops, so you need to apply a reduction factor based on the mix of operations you do perform, say 50%-75%. For many processors, item 3 is not an issue. For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each occurrence depending on the architecture. Their relative weight depends not only on cache sizes, but also on how many other instructions per character are performed. For text scanning operations, their cost does predominate with large data sets. -- Andy Heninger [EMAIL PROTECTED]
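The "1 extra test per character" mentioned above is simply the lead-surrogate check in a loop like the following sketch; because supplementary characters are rare, the branch is highly predictable.

    #include <stdint.h>
    #include <stddef.h>

    /* Count code points in a UTF-16 code unit buffer.  The lead-surrogate
       check is the extra per-character test being discussed. */
    static size_t count_code_points(const uint16_t *u, size_t n_units)
    {
        size_t i = 0, count = 0;
        while (i < n_units) {
            uint16_t w = u[i];
            if (w >= 0xD800 && w <= 0xDBFF &&        /* lead surrogate */
                i + 1 < n_units &&
                u[i + 1] >= 0xDC00 && u[i + 1] <= 0xDFFF)
                i += 2;                              /* surrogate pair */
            else
                i += 1;
            count++;
        }
        return count;
    }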
Re: Nicest UTF
Asmus Freytag wrote: > A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider > 3) additional cost of accessing 16-bit registers (per character) > For many processors, item 3 is not an issue. I do not know, I only know of a few of them; for example, I do not know how Alpha or Sparc or PowerPC handle 16-bit data (I have heard differing reports). I agree this was not an issue for the 80386-80486 or Pentium. However, for the more recent processors, P6, Pentium 4, or AMD K7 or K8, I am unsure, and I would appreciate insights. I remember reading that in the case of the AMD K7, for instance, 16-bit instructions (all? a few of them? only ALU-related, i.e. excluding load and store, which is the point here? I do not know) are handled in a different way from the 32-bit ones, e.g. with a reduced number of decoders. The impact could be quite significant. I also remember that when the P6 was launched (1995, known as the PentiumPro), there was a good deal of criticism of Intel because the performance of 16-bit code was actually worse than on an equivalent Pentium (though there was an advantage for 32-bit code); of course this should be considered in context, where 16-bit (DOS/Windows 3.x) code was important, something that has since faded. But I believe the reasoning behind the arguments should still hold. Finally, there is certainly an issue with the need to add an instruction prefix on x86 processors. The issue is reduced for the Pentium 4 (because the prefix does not consume space in the L1 cache), but it still holds for the L2 cache. And the impact is noticeable; I do not have figures for access to UTF-16 data, but I know that when using 64-bit mode (with the AMD K8), the need for a prefix to access 64-bit data, which consumes code cache space, was cited as the cause of a 1-3% penalty in execution time. Of course, such a tiny penalty is easily hidden by other factors, such as the others Dr. Freytag mentioned. > Given this little model and some additional assumptions about your > own project(s), you should be able to determine the 'nicest' UTF for > your own performance-critical case. My point was that the variability of these factors leads to keeping all three UTFs as possible candidates when one considers writing a "perfect-world" library. Can we say we are in agreement? By the way, this also means that the optimisations to be considered inside the library could be very different, since the optimal uses can be significantly different. For example, use of UTF-32 might signal a user bias toward easy management of code points, disregarding memory use, so the code used in the library should favour time over space (so unrolling loops and similar things could be considered). UTF-8 /might/ be the reverse. Antoine
Re: Nicest UTF
Arcane Jill wrote: > Probably a dumb question, but how come nobody's invented "UTF-24" yet? > I just made that up, it's not an official standard, but one could > easily define UTF-24 as UTF-32 with the most-significant byte (which > is always zero) removed, hence all characters are stored in exactly > three bytes and all are treated equally. You could have UTF-24LE and > UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting > this is a particularly brilliant idea, but I just wonder why no-one's > suggested it before. It has been suggested before, by Pim Blokland on April 3, 2003, in a message titled "UTF-24." If you get the digest, it's in Digest V3 #79. > The "UTF-24" thing seems a reasonably sensible question though. Is it > just that we don't like it because some processors have alignment > restrictions or something? Almost all do. In addition, no programming language I know of has a 3-byte-wide integer data type (maybe INTERCAL does), so the efficiency of UTF-24 would be wasted in software as well as in hardware. Besides that, there were the usual protests that supplementary characters would be vanishingly rare in the context of "normal" text, and that one should use compression (SCSU/BOCU or GP tools) if size is an issue. None of this stopped me from experimentally implementing it, of course, but I haven't touched it since finishing the implementation. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Nicest UTF
Title: RE: Nicest UTF Doug Ewell wrote: > RE: Nicest UTFLars Kristan wrote: > > >> I think UTF8 would be the nicest UTF. > > > > I agree. But not for reasons you mentioned. There is one other > > important advantage: UTF-8 is stored in a way that permits storing > > invalid sequences. I will need to elaborate that, of course. > > I could not disagree more with the basic premise of Lars' > post. It is a > fundamental and critical mistake to try to "extend" Unicode with > non-standard code unit sequences to handle data that cannot be, or has > not been, converted to Unicode from a legacy standard. This > is not what > any character encoding standard is for. What a standard is or is not for is a decision. And Unicode consortium is definitely the body that makes the decision in this case. But this decision should not be based solely on theory and ideal worlds. > > > 1.2 - Any data for which encoding is not known can only be > stored in a > > UTF-16 database if it is converted. One needs to choose a conversion > > (say Latin-1, since it is trivial). When a user finds out that the > > result is not appealing, the data needs to be converted back to the > > original 8-bit sequence and then the user (or an algorithm) can try > > various encodings until the result is appealing. > > This is simply what you have to do. You cannot convert the data into > Unicode in a way that says "I don't know how to convert this data into > Unicode." You must either convert it properly, or leave the > data in its > original encoding (properly marked, preferably). Here lies the problem. Suppose you have a document in UTF-8, which somehow got corrupted and now contains a single invalid sequence. Are you proposing that this document needs to be stored separately? Everything else in the database would be stored in UTF-16, but now one must add the capability to store this document separately. And probably not index it. Regardless of any useful data in it. But if you use UTF-8 storage instead, you can put it in with the rest (if you can mark it, even better, but you only need to do it if that is a requirement). > > It is just as if a German speaker wanted to communicate a > word or phrase > in French that she did not understand. She could find the correct > German translation and use that, or she could use the French word or > phrase directly (moving the translation burden onto the > listener). What > she cannot do is "extend" German by creating special words that are > placeholders for French words whose meaning she does not know. I can reinterprete your example. Using the French word is exactly the solution I am proposing, and I see your solution is to replace the word with a placeholder which says "a word that does not exist in German". Even worse, you want to use the same placeholder for all the unknown words. Numbering them would be better, but awkward, since you don't know how to assign numbers. Fortunetely, with bytes in invalid sequences, the numbering is trivial and has a meaning. Let's compare UTF-8 to UTF-16 conversion to an automated translation from German to French. What Unicode standard says can be interpreted as follows: * All input text must be valid German language. * All output text must be valid French language. * Any unknown words shall be replaced by a (single) 'unknown word' placeholder. And that last statement goes for German words missing in your dictionary, misspelled words, Spanish words, proper nouns... > > > 2.2 - Any data for which encoding is not known can simply be stored > > as-is. > > NO. 
Do not do this, and do not encourage others to do this. > It is not valid UTF-8. I never said it is valid UTF-8. The fact remains that I can store legacy data in the same store as UTF-8 data. But I cannot do that if storage is UTF-16 based. Now suppose you have a UNIX filesystem, containing filenames in a legacy encoding (possibly even more than one). If one wants to switch to UTF-8 filenames, what is one supposed to do? Convert all filenames to UTF-8? Who will do that? And when? Will all users agree? Should all filenames that do not conform to UTF-8 be declared invalid? And those files inaccessible? If you keep all processing in UTF-8, then this is a decision you can postpone. But if you start using UTF-32 applications for processing filenames, invalid sequences will be dropped and those files can in fact become inaccessible. And then you'll be wondering why users don't want to start using Unicode. I didn't encourage users to mix UTF-8 filenames and Latin-1 filenames. Do you want to discourage them? > > Among other things, you run the risk that the mystery data happens to form a valid UTF-8 sequence, by sheer coi
Re: Nicest UTF
Probably a dumb question, but how come nobody's invented "UTF-24" yet? I just made that up, it's not an official standard, but one could easily define UTF-24 as UTF-32 with the most-significant byte (which is always zero) removed, hence all characters are stored in exactly three bytes and all are treated equally. You could have UTF-24LE and UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly brilliant idea, but I just wonder why no-one's suggested it before. (And then of course, there's UTF-21, in which blocks of 21 bits are concatenated, so that eight Unicode characters will be stored in every 21 bytes - and not to mention UTF-20.087462841250343, in which a plain text document is simply regarded as one very large integer expressed in radix 1114112, and whose UTF-20.087462841250343 representation is simply that number expressed in binary. But now I'm getting /very/ silly - please don't take any of this seriously.) :-) The "UTF-24" thing seems a reasonably sensible question though. Is it just that we don't like it because some processors have alignment restrictions or something? Arcane Jill -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Marcin 'Qrczak' Kowalczyk Sent: 02 December 2004 16:59 To: [EMAIL PROTECTED] Subject: Re: Nicest UTF "Arcane Jill" <[EMAIL PROTECTED]> writes: Oh for a chip with 21-bit wide registers! Not 21-bit but 20.087462841250343-bit :-) -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Philippe Verdy wrote: > Only the encoder may be a bit complex to write (if one wants to > generate the optimal smallest result size), but even a moderate > programmer could find a simple and working scheme with a still > excellent compression rate (around 1 to 1.2 bytes per character on > average for any Latin text, and around 1.2 to 1.5 bytes per character > for Asian texts which would still be a good application of SCSU face > to UTF-32 or even UTF-8). If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul, or Yi syllables, you have no chance of doing better than 2 bytes per character. This is because it is not possible in SCSU to set a dynamic window to any range between U+3400 and U+DFFF, where these characters reside. Such a window would be of little use anyway, because real-world texts using these characters would draw from so many windows that single-byte mode would be less efficient than Unicode mode, where 2 bytes per character is the norm. Of course, this is still better than UTF-32 or UTF-8 for these characters. For Katakana and Hiragana, you can get the same efficiency with SCSU as for other small scripts, but very few texts are written in pure kana except for young children. Sorry for missing this point in my earlier post. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/ (*) No, I'm not interested in arguing over this word.
Re: Nicest UTF
Philippe Verdy wrote: >> Here is a string, expressed as a sequence of bytes in SCSU: >> >> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E >> M o s s o v SP i s SP . > > Without looking at it, it's easy to see that this stream is separated > in three sections, initiated by 05 1C, then 05 1D, then 12. I can't > remember without looking at the UTN what they perform (i.e. which > Unicode code points range they select), but the other bytes are simple > offsets relative to the start of the selected ranges. Also the third > section is ended by a regular dot (2E) in the ASCII range selected for > the low half-page, and the other bytes are offsets for the script > block initiated by 12. 05 is a static-quote tag which modifies only the next byte. It doesn't really initiate a new section; it's intended for isolated characters where initiating a new section would be wasteful. The sequences <05 1C> and <05 1D> encode the matching double-quote characters U+201C and U+201D respectively. 12 switches to a new dynamic window -- in this case, window 2, which is predefined to point to the Cyrillic block -- so it does select a range as you said. Also, the ASCII bytes do represent Basic Latin characters. > Immediately I can identify this string, without looking at any table: > > "Mossov?" is ??. > > where each ? replaces a character that I can't decipher only through > my defective memory. (I don't need to remember the details of the > standard table of ranges, because I know that this table is complete > in a small and easily available document). Actually "Moscow," not "Mossov" -- but as you said, this is not important because a computer would have gotten this arithmetic right. The actual string is: “Moscow” is Москва. > The decoder part of SCSU still remains extremely trivial to implement, given the small but complete list of codes that can alter the state of > the decoder, because there's no choice in its interpretation and > because the set of variables to store the decoder state is very > limited, as well as the number of decision tests at each step. This is > a "finite state automata". I think "extremely trivial" is overstating the case a bit. It is straightforward and not very difficult, but still somewhat more complex than a UTF. (There had better not be any choice in interpretation, if we want lossless decompression!) BTW, the singular is "automaton." > Only the encoder may be a bit complex to write (if one wants to > generate the optimal smallest result size), but even a moderate > programmer could find a simple and working scheme with a still > excellent compression rate (around 1 to 1.2 bytes per character on > average for any Latin text, and around 1.2 to 1.5 bytes per character > for Asian texts which would still be a good application of SCSU face > to UTF-32 or even UTF-8). UTN #14 contains pseudocode for an encoder that beats the Japanese example in UTS #6 (by one byte, big deal) and can be easily translated into working code. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
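For readers who want to follow the decoding, here is a sketch of a decoder restricted to the features this example actually uses: single-byte mode, the default dynamic windows, SQn quoting (tags 0x01-0x08) and SCn window selection (tags 0x10-0x17). A real SCSU decoder also has to handle window definition, Unicode mode, supplementary characters and the remaining control bytes, so treat this as an illustration of the example, not an implementation of UTS #6.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Static and default dynamic window offsets from UTS #6. */
    static const uint32_t static_win[8] =
        { 0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000 };
    static const uint32_t dynamic_win[8] =
        { 0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00 };

    static void decode_scsu_subset(const unsigned char *s, size_t len)
    {
        int win = 0;                        /* active dynamic window, initially 0 */
        for (size_t i = 0; i < len; i++) {
            unsigned char b = s[i];
            if (b >= 0x01 && b <= 0x08 && i + 1 < len) {   /* SQn: quote one char */
                unsigned char c = s[++i];
                uint32_t cp = (c < 0x80) ? static_win[b - 1] + c
                                         : dynamic_win[b - 1] + (c - 0x80);
                printf("U+%04X\n", cp);
            } else if (b >= 0x10 && b <= 0x17) {           /* SCn: select window */
                win = b - 0x10;
            } else if (b < 0x80) {                         /* ASCII passes through */
                printf("U+%04X\n", (uint32_t)b);
            } else {                                       /* high byte: use window */
                printf("U+%04X\n", dynamic_win[win] + (b - 0x80));
            }
        }
    }

    /* Doug's byte sequence decodes to: U+201C, M, o, s, c, o, w, U+201D,
       " is ", then the Cyrillic letters of "Москва", then "." */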
SCSU as internal encoding (was: Re: Nicest UTF)
Philippe Verdy wrote: >> The point is that indexing should better be O(1). > > SCSU is also O(1) in terms of indexing complexity... simply because it > keeps the exact equivalence with codepoints, and requires a *fixed* > (and small) number of steps to decode it to code points, but also > because the decoder states uses a *fixed* (and small) number of > variables for the internal context (unlike more powerful compression > algorithms like dictionnary-based, Lempel-Ziv-Welsh-like, algorithms > such as deflate). As Marcin said, SCSU is O(n) in terms of indexing complexity, because you have to decode the first (n - 1) characters before you can decode the n'th. Even when you have a run of "ASCII" bytes between 0x20 and 0x7E, there is no guarantee that the characters are Basic Latin. There might have been a previous SCU tag that switched into Unicode mode. >> No, individual characters are immutable in almost every language. > > But individual characters do not always have any semantic. For > languages, the relevant unit is almost always the grapheme cluster, > not the character (so not its code point...). As grapheme clusters > need to be represented on variable lengths, an algorithm that could > only work with fixed-width units would not work internationaly or > would cause serious problems for correct analysis or transformation of > true languages. This is beside the point, as I said at the outset. In programming, you have to deal with individual characters in a string on a regular basis, even if some characters depend on others from a linguistic standpoint. > Code points are probably the easiest thing to describe what an text > algorithm is supposed to do, but this is not a requirement for > applications (in fact many libraries have been written that correctly > implement the Unicode algorithms, without even dealing with code > points, but only with in-memory code units of UTF-16 or even in UTF-8 > or GB18030, or directly with serialization bytes of UTF-16LE or UTF-8 > or SCSU or ether encoding schemes). Algorithms that operate on CES-specific code units are what lead to such "wonderful" innovations as CESU-8. All text operations, except for encoding and decoding, should work with code points. Marcin responded: > UTF-8 is much better for interoperability than SCSU, because it's > already widely supported and SCSU is not. True, but not really Philippe's point. Philippe again: > The question is why you would need to extract the nth codepoint so > blindly. If you have such reasons, because you know the context in > which this index is valid and usable, then you can as well extract a > sequence using an index in the SCSU encoding itself using the same > knowledge. > > Linguistically, extracting a substring or characters at any random > index in a sequence of code points will only cause you problems. In > general, you will more likely use index as a way to mark a known > position that you have already parsed sequentially in the past. You have to do this ALL THE TIME in programming. Example: searching and replacing text. To search a string for a substring, you would normally write a function that would not only give a yes/no answer (i.e. "this string does/does not contain the substring"), but would also indicate *where* the substring was found within the string. That's because the world needs not only search tools, but also search-and-replace tools, and you need to know where the substring is in order to replace it with another. "Linguistically" has nothing to do with it. 
Nothing prevents the user of a search-and-replace tool from doing something linguistically unsound, nor should it. If you do this in SCSU, you have to keep track of the state of the decoder within the string (single-byte vs. Unicode mode, current dynamic window, and position of all dynamic windows). If you lose track of the decoder state, you run the risk of corrupting the data. (Philippe acknowledged this in his next paragraph.) You really need to convert internally to code points in order to do this. I'm a believer in SCSU as an efficient storage and transfer encoding, but not as an internal process code. > All those are not demonstration: decoding IRC commands or similar > things does not constitute the need to encode large sets of texts. In > your examples, you show applications that need to handle locally some > strings made for computer languages. One of the main stated goals of SCSU was to provide good compression for small strings. > Texts of human languages, or even a collection of person names, or > places are not like this, and have a much wider variety, but with huge > possibilities for data compression (inherent to the phonology of human > languages and their overall structure, but also due to repetitive > conventions spread throughout the text to allow easier reading and > understanding). This is where general-purpose compression schemes excel, and should be considered. (You might want to read UTN #1
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Now consider scanning forwards. We want to strip a beginning of a string. For example the string is an irc message prefixed with a command and we want to take the message only for further processing. We have found the end of the prefix and we want to produce a string from this position to the end (a copy, since strings are immutable). All those are not demonstration: decoding IRC commands or similar things does not constitute the need to encode large sets of texts. In your examples, you show applications that need to handle locally some strings made for computer languages. Texts of human languages, or even a collection of person names, or places are not like this, and have a much wider variety, but with huge possibilities for data compression (inherent to the phonology of human languages and their overall structure, but also due to repetitive conventions spread throughout the text to allow easier reading and understanding). Scanning backward a person name or human text is possibly needed locally, but such text has a strong forward directionality without which it does not make sense. Same thing if you scan such text starting at random positions: you could make many false interpretations of this text by extracting random fragments like this. Anyway, if you have a large database of texts to process or even to index, you will, in fine, need to scan this text linearily first from the beginning to the end, should it be only to create an index for accessing it later randomly. You will still need to store the indexed text somewhere, and in order to maximize the performance, or responsiveness of your application, you'll need to minimize its storage: that's where compression takes place. This does not change the semantic of the text, does not remove its semantics, but this is still an optimization, which does not prevent a further access with more easily parsable representation as stateless streams of characters, through surjective (sometimes bijective) converters between the compressed and uncompressed forms. My conclusion: there's no "best" representation to fit all needs. Each representation has its merits in its domain. The Unicode UTFs are excellent only for local processing of limited texts, but they are not necessarily the best for long term storage or for large text sets. And even for texts that will be accessed frequently, compressed schemes can still constitute optimizations, even if these texts need to be decompressed repeatedly each time they are needed. I am clearly against the arguments with "one scheme fits all needs", even if you think that UTF-32 is the only viable long-term solution.
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The question is why you would need to extract the nth codepoint so > blindly. For example I'm scanning a string backwards (to remove '\n' at the end, to find and display the last N lines of a buffer, to find the last '/' or last '.' in a file name). SCSU in general supports traversal only forwards. > But remember the context in which this discussion was introduced: > which UTF would be the best to represent (and store) large sets of > immutable strings. The discussion about indexes in substrings is not > relevevant in that context. It is relevant. A general purpose string representation should support at least a bidirectional iterator, or preferably efficient random access. Neither is possible with SCSU. * * * Now consider scanning forwards. We want to strip a beginning of a string. For example the string is an irc message prefixed with a command and we want to take the message only for further processing. We have found the end of the prefix and we want to produce a string from this position to the end (a copy, since strings are immutable). With any stateless encoding a suitable library function will compute the length of the result, allocate memory, and do an equivalent of memcpy. With SCSU it's not possible to copy the string without analysing it because the prefix might have changed the state, so the suffix is not correct when treated as a standalone string. If the stripped part is short and the remaining part is long, it might pay off to scan the part we want to strip and perform a shortcut of memcpy if the prefix did not change the state (which is probably a common case). But in general we must recompress the whole copied part! We can't even precalculate its physical size. Decompressing into temporary memory will negate benefits of a compressed encoding, so we should better decompress and compress in parallel into a dynamically resizing buffer. This is ridiculously complex compared to a memcpy. The *only* advantage of SCSU is that it takes little space. Although in most programs most strings are ASCII, and SCSU never beats ISO-8859-1 which is what the implementation of my language is using for strings which no characters above U+00FF, so it usually does not have even this advantage. Disadvantages are everywhere else: every operation which looks at the contents of a string or produces contents of a string is more complex. Some operations can't be supported at all with the same asymptotic complexity, so the API would have to be changed as well to use opaque iterators instead of indices. It's more complicated both for internal processing and for interoperability (unless the other end understands SCSU too, which is unlikely). Plain immutable character arrays are not completely universal either (e.g. they are not sufficient for a buffer of a text editor), but they are appropriate as the default representation for common cases; for representing filenames, URLs, email addresses, computer language identifiers, command line option names, lines of a text file, messages in a dialog in a GUI, names of columns of a database table etc. Most strings are short and thus performing a physical copy when extracting a substring is not disastrous. But the complexity of SCSU is too bad. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. The question is why you would need to extract the nth codepoint so blindly. If you have such reasons, because you know the context in which this index is valid and usable, then you can as well extract a sequence using an index in the SCSU encoding itself using the same knowledge. Linguistically, extracting a substring or characters at any random index in a sequence of code points will only cause you problems. In general, you will more likely use index as a way to mark a known position that you have already parsed sequentially in the past. However it is true that if you have determined a good index position to allow future extraction of substrings, SCSU will be more complex because you not only need to remember the index, but also the current state of the SCSU decoder, to allow decoding characters encoded starting at that index. This is not needed for UTF's and most legacy character encodings, or national standards, or GB18030 which looks like a valid UTF, even though it is not part of the Unicode standard itself. But remember the context in which this discussion was introduced: which UTF would be the best to represent (and store) large sets of immutable strings. The discussion about indexes in substrings is not relevevant in that context.
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: >> The point is that indexing should better be O(1). > > SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. > But individual characters do not always have any semantic. For > languages, the relevant unit is almost always the grapheme cluster, > not the character (so not its code point...). How do you determine the semantics of a grapheme cluster? Answer: by splitting it into code points. A code point is atomic, it's not split any more, because there is a finite number of them. When a string is exchanged with another application or network computer or the OS, it always uses some encoding which is closer to code points than to grapheme clusters, no matter if it's UTF-8 or UTF-16 or ISO-8859-something. If the string was originally stored as an array of grapheme clusters, it would have to be translated to code points before further conversion. > Which represent will be the best is left to implementers, but I really > think that compressed schemes are often introduced to increase the > application performances and reduce the needed resources both in > memory and for I/O, but also in networking where interoperability > across systems and bandwidth optimization are also important design > goals... UTF-8 is much better for interoperability than SCSU, because it's already widely supported and SCSU is not. It's also easier to add support for UTF-8 than for SCSU. UTF-8 is stateless, SCSU is stateful - this is very important. UTF-8 is easier to encode and decode. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
- Original Message - From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Sunday, December 05, 2004 1:37 AM Subject: Re: Nicest UTF "Philippe Verdy" <[EMAIL PROTECTED]> writes: There's nothing that requires the string storage to use the same "exposed" array, The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... simply because it keeps the exact equivalence with codepoints, and requires a *fixed* (and small) number of steps to decode it to code points, but also because the decoder states uses a *fixed* (and small) number of variables for the internal context (unlike more powerful compression algorithms like dictionnary-based, Lempel-Ziv-Welsh-like, algorithms such as deflate). Not having a constant side per code point requires one of three things: 1. Using opaque iterators instead of integer indices. 2. Exposing a different unit in the API. 3. Living with the fact that indexing is not O(1) in general; perhaps with clever caching it's good enough in common cases. Altough all three choices can work, I would prefer to avoid them. If I had to, I would probably choose 1. But for now I've chosen a representation based on code points. Anyway, each time you use an index to access to some components of a String, the returned value is not an immutable String, but a mutable character or code unit or code point, from which you can build *other* immatable Strings No, individual characters are immutable in almost every language. But individual characters do not always have any semantic. For languages, the relevant unit is almost always the grapheme cluster, not the character (so not its code point...). As grapheme clusters need to be represented on variable lengths, an algorithm that could only work with fixed-width units would not work internationaly or would cause serious problems for correct analysis or transformation of true languages. Assignment to a character variable can be thought as changing the reference to point to a different character object, even if it's physically implemented by overwriting raw character code. When you do that, the returned character or code unit or code point does not guarantee that you'll build valid Unicode strings. In fact, such character-level interface is not enough to work with and transform Strings (for example it does not work to perform correct transformation of lettercase, or to manage grapheme clusters). This is a different issue. Indeed transformations like case mapping work in terms of strings, but in order to implement them you must split a string into some units of bounded size (code points, bytes, etc.). Yes, but why do you want that this intermediate unit be the code point? Such algorithm can be developped with any UTF, or even with compressed encoding schemes through accessor or enumerator methods... All non-trivial string algorithms boil down to working on individual units, because conditionals and dispatch tables must be driven by finite sets. Any unit of a bounded size is technically workable, but they are not equally convenient. Most algorithms are specified in terms of code points, so I chose code points for the basic unit in the API. "Most" is the right term here: this is not a requirement, and it's not because it is the simplest way to implement such algorithm that it will be the most efficient in terms of performance or resource allocations. Most experiences prove that the most efficient algorithms are also complex to implement. 
Code points are probably the easiest thing to describe what an text algorithm is supposed to do, but this is not a requirement for applications (in fact many libraries have been written that correctly implement the Unicode algorithms, without even dealing with code points, but only with in-memory code units of UTF-16 or even in UTF-8 or GB18030, or directly with serialization bytes of UTF-16LE or UTF-8 or SCSU or ether encoding schemes). Which represent will be the best is left to implementers, but I really think that compressed schemes are often introduced to increase the application performances and reduce the needed resources both in memory and for I/O, but also in networking where interoperability across systems and bandwidth optimization are also important design goals...
Re: Nicest UTF
Philippe Verdy wrote: >> I appreciate Philippe's support of SCSU, but I don't think *even I* >> would recommend it as an internal storage format. The effort to >> encode and decode it, while by no means Herculean as often perceived, >> is not trivial once you step outside Latin-1. > > I said: "for immutable strings", which means that these Strings are > instanciated for long term, and multiple reuses. In that sense, what > is really significant is its decoding, not the effort to encode it > (which is minimal for ISO-8859-1 encoded source texts, or Unicode > UTF-encoded texts that only use characters from the first page). > > Decoding SCSU is very straightforward, even if this is stateful (at > the internal character level). But for immutable strings, there's no > need to handle various initial states, and the states associated with > each conponent character of the string has no importance (strings > being immutable, only the decoding of the string as a whole makes > sense). Here is a string, expressed as a sequence of bytes in SCSU: 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E See how long it takes you to decode this to Unicode code points. (Do not refer to UTN #14; that would be cheating. :-) It may not be rocket science, but it is not trivial. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Nicest UTF
Lars Kristan wrote: >> I think UTF8 would be the nicest UTF. > > I agree. But not for reasons you mentioned. There is one other > important advantage: UTF-8 is stored in a way that permits storing > invalid sequences. I will need to elaborate that, of course. I could not disagree more with the basic premise of Lars' post. It is a fundamental and critical mistake to try to "extend" Unicode with non-standard code unit sequences to handle data that cannot be, or has not been, converted to Unicode from a legacy standard. This is not what any character encoding standard is for. > 1.2 - Any data for which encoding is not known can only be stored in a > UTF-16 database if it is converted. One needs to choose a conversion > (say Latin-1, since it is trivial). When a user finds out that the > result is not appealing, the data needs to be converted back to the > original 8-bit sequence and then the user (or an algorithm) can try > various encodings until the result is appealing. This is simply what you have to do. You cannot convert the data into Unicode in a way that says "I don't know how to convert this data into Unicode." You must either convert it properly, or leave the data in its original encoding (properly marked, preferably). It is just as if a German speaker wanted to communicate a word or phrase in French that she did not understand. She could find the correct German translation and use that, or she could use the French word or phrase directly (moving the translation burden onto the listener). What she cannot do is "extend" German by creating special words that are placeholders for French words whose meaning she does not know. > 2.2 - Any data for which encoding is not known can simply be stored > as-is. NO. Do not do this, and do not encourage others to do this. It is not valid UTF-8. Among other things, you run the risk that the mystery data happens to form a valid UTF-8 sequence, by sheer coincidence. The example of "NESTLÉ™" in Windows CP1252 is applicable here. The last two bytes are C9 99, a valid UTF-8 sequence for U+0259. By applying the concept of "adaptive UTF-8" (as Dan Oscarsson called it in 1998), this sequence would be interpreted as valid UTF-8, and data loss would occur. > 2.4 - Any data that was stored as-is may contain invalid sequences, > but these are stored as such, in their original form. Therefore, it is > possible to raise an exception (alert) when the data is retrieved. > This warns the user that additional caution is needed. That was not > possible in 1.4. This is where the fatal mistake is made. No matter what Unicode encoding form is used, its entire purpose is to encode *Unicode code points*, not to implement a two-level scheme that supports both Unicode and non-Unicode data. What sort of "exception" is to be raised? What sort of "additional caution" should the user take? What if this process is not interactive, and contains no user intervention? > 3.1 - Unfortunately we don't live in either of the two perfect worlds, > which makes it even worse. A database on UNIX will typically be (or > can be made to be) 8-bit. Therefore perfectly able to handle UTF-8 > data. On Windows however, there is a lot of support for UTF-16, but > trying to work in UTF-8 could prove to be a handicap, if not close to > impossible. UTF-8 and UTF-16, used correctly, are perfectly interchangeable. It is not in any way a fault of UTF-16 that it cannot be used to store arbitrary binary data. > 3.3 - For the record: other UTF formats CAN be made equally useful to > UTF-8. 
It requires 128 codepoints. Back in 2002, I have tried to > convince people on the Unicode mailing list that this should be done, > but have failed. Because it is an incredibly bad idea. > I am now using the PUA for this purpose. And I am even tempted to hope > nobody will never realize the need for these 128 codepoints, because > then all my data will be non-standard. You *should* use the PUA for this purpose. It is an excellent application of the PUA. But do not be surprised if someone else, somewhere, decides to use the same 128 PUA code points for some other purpose. That does not make your data "non-standard," because all PUA data, by definition, is "non-standard." What you are doing with the PUA is far more standard, and far more interoperable, than writing invalid UTF-8 sequences and expecting parsers to interpret them as "undeciphered 8-bit legacy text of some sort." > 4.1 - UTF-32 is probably very useful for certain string operations. > Changing case for example. You can do it in-place, like you could > with ASCII. Perhaps it can even be done in UTF-8, I am not sure. But > even if it is possible today, it is definitely not guaranteed that it > will always remain so, so one shouldn't rely on it. Not only is this not 100% true, as others have pointed out, but it is completely irrelevant to your other points. > 4.2 - But UTF-8 is superior. You can make UTF-8 functions ignore > inv
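The byte arithmetic behind the C9 99 example above is easy to verify: 0xC9 is a two-byte UTF-8 lead byte and 0x99 a continuation byte, and combining their payload bits yields U+0259.

    #include <stdio.h>

    int main(void)
    {
        unsigned lead = 0xC9, cont = 0x99;
        /* 0xC9 = 110 01001 (lead of a 2-byte sequence), 0x99 = 10 011001 */
        unsigned cp = ((lead & 0x1F) << 6) | (cont & 0x3F);
        printf("U+%04X\n", cp);   /* prints U+0259 (LATIN SMALL LETTER SCHWA) */
        return 0;
    }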
Re: Nicest UTF
Asmus Freytag wrote: > Given this little model and some additional assumptions about your > own project(s), you should be able to determine the 'nicest' UTF for > your own performance-critical case. This is absolutely correct. Each situation may have different needs and constraints, and these should govern which UTF is best suited for the task. No one UTF is better than the others in all cases. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > There's nothing that requires the string storage to use the same > "exposed" array, The point is that indexing should better be O(1). Not having a constant side per code point requires one of three things: 1. Using opaque iterators instead of integer indices. 2. Exposing a different unit in the API. 3. Living with the fact that indexing is not O(1) in general; perhaps with clever caching it's good enough in common cases. Altough all three choices can work, I would prefer to avoid them. If I had to, I would probably choose 1. But for now I've chosen a representation based on code points. > Anyway, each time you use an index to access to some components of a > String, the returned value is not an immutable String, but a mutable > character or code unit or code point, from which you can build > *other* immatable Strings No, individual characters are immutable in almost every language. Assignment to a character variable can be thought as changing the reference to point to a different character object, even if it's physically implemented by overwriting raw character code. > When you do that, the returned character or code unit or code point > does not guarantee that you'll build valid Unicode strings. In fact, > such character-level interface is not enough to work with and > transform Strings (for example it does not work to perform correct > transformation of lettercase, or to manage grapheme clusters). This is a different issue. Indeed transformations like case mapping work in terms of strings, but in order to implement them you must split a string into some units of bounded size (code points, bytes, etc.). All non-trivial string algorithms boil down to working on individual units, because conditionals and dispatch tables must be driven by finite sets. Any unit of a bounded size is technically workable, but they are not equally convenient. Most algorithms are specified in terms of code points, so I chose code points for the basic unit in the API. In fact in my language there is no separate character type: a code point extracted from a string is represented by a string of length 1. It doesn't change the fact that indexing a string by code point index should run in constant time, and thus using UTF-8 internally would be a bad idea unless we implement one of the three points above. > Once you realize that, which UTF you use to handle immutable String > objects is not important, because it becomes part of the "blackbox" > implementation of String instances. The black box must provide enough tools to implement any algorithm specified in terms of characters, an algorithm which was not already provided as a primitive by the language. Algorithms generally scan strings sequentially, but in order to store positions to come back to them later you must use indices or some iterators. Indices are simpler (and in my case more efficient). > Using SCSU for such String blackbox can be a good option if this > effectively helps in store many strings in a compact (for global > performance) but still very fast (for transformations) representation. I disagree. SCSU can be a separate type to be used explicitly, but it's a bad idea for the default string representation. Most strings are short, and thus constant factors and simplicity matter more than the amount of storage. And you wouldn't save much storage anyway: as I said, in my representation strings which contain only characters U+..U+00FF are stored one byte per character. 
The majority of strings in average programs are ASCII. In general, what I don't like about SCSU is that there is no obvious compression algorithm which makes good use of its various features. Each compression algorithm is either not as powerful as it could be, or is extremely slow (trying various choices), or is extremely complicated (trying only sensible paths).

> Unfortunately, the immutable String implementations in Java or C#
> or Python do not allow the application designer to decide which
> representation will be the best (they are implemented as concrete
> classes instead of virtual interfaces with possible multiple
> implementations, as they should; the alternative to interfaces would
> have been class-level methods allowing the application to negotiate
> the tuning parameters with the blackbox class implementation).

Some functions accept any sequence of characters. Other functions accept only standard strings. The question is how often to use each style. Choosing the first option increases flexibility but adds an overhead in the common case. For example, case mapping of a string would have to either perform a dispatching call at each step, or be implemented twice. Currently it's implemented for strings only, in C, and thus avoids calling a generic indexing function and other overheads. At some point I will probably implement it again, to work for arbitrary sequences of characters, but it's more work for effects that I don't currently need.
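The two styles can be pictured in Java roughly as follows (a hedged sketch with hypothetical helper names; the poster's actual implementation is in C and in his own language): the generic routine accepts any CharSequence and pays a virtual charAt() dispatch on every step, while the String-only routine can hand the whole job to one specialized call. The per-character mapping shown here is deliberately simplified and is not a full Unicode case mapping.

    // Generic style: flexible, but dispatches through the CharSequence interface per unit.
    static String upperGeneric(CharSequence s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            out.append(Character.toUpperCase(s.charAt(i))); // simplified per-unit mapping
        }
        return out.toString();
    }

    // String-only style: one specialized call, no per-unit dispatch in user code.
    static String upperStringOnly(String s) {
        return s.toUpperCase(java.util.Locale.ROOT);
    }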
Re: Nicest UTF
On Dec 3, 2004, at 2:54 AM, Andrew C. West wrote:

> I strongly agree that all Unicode implementations should cover all of
> Unicode, and not just the BMP, and it really annoys me when they
> don't; but suggesting that you need to implement supra-BMP characters
> because they are going to start popping up all over the place is
> wrong in my opinion (not that Doug suggested that, but that's my
> extrapolation of his point). Software developers need to implement
> supra-BMP characters because some users (probably very few) will from
> time to time want to use them, and software should allow people to do
> what they want.

Actually, about 10% of the glyphs in the Japanese fonts that ship with Mac OS X are represented by characters in plane 2. The main reason they are there is that they are used in names (people, places, and companies). So there are real customers who want to use characters outside the BMP. I would not characterize it as "very few". That's true of the vast majority of SMP characters, but not all of them.

Deborah Goldsmith
Internationalization, Unicode Liaison
Apple Computer, Inc.
[EMAIL PROTECTED]
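A quick check with standard Java APIs shows why supra-BMP support cannot be faked with char-based code: a single plane-2 ideograph such as U+20B9F (a CJK Unified Ideographs Extension B character used in Japanese) occupies two UTF-16 code units, so counting chars gives the wrong answer.

    // Standard Java APIs: a plane-2 character is one code point but two UTF-16 code units.
    public class SupplementaryDemo {
        public static void main(String[] args) {
            String name = new String(Character.toChars(0x20B9F)); // U+20B9F, CJK Extension B
            System.out.println(name.length());                          // prints 2 (code units)
            System.out.println(name.codePointCount(0, name.length()));  // prints 1 (code points)
        }
    }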
Re: Nicest UTF
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: Random access by code point index means that you don't use strings as immutable objects, No. Look at Python, Java and C#: their strings are immutable (don't change in-place) and are indexed by integers (not necessarily by code points, but it doesn't change the point). Those strings are not indexed. They are just accessible through methods or accessors, that act *as if* they were arrays. There's nothing that requires the string storage to use the same "exposed" array, and in fact you can as well work on immutable strings, as if they were vectors of code points, or vectors of code units, and sometimes vectors of bytes. Note for example the difference between the .length property of Java arrays, and the .length() method of java String instances... Note also the fact that the "conversion" of an array of bytes or code units or code points to a String requires distinct constructors, and that the storage is copied rather than simply referenced (the main reason being that indexed vectors or arrays are mutable in their indexed content, but not String instances which become sharable). Anyway, each time you use an index to access to some components of a String, the returned value is not an immutable String, but a mutable character or code unit or code point, from which you can build *other* immatable Strings (using for example mutable StringBuffers or StringBuilder or similar objects in other languages). When you do that, the returned character or code unit or code point does not guarantee that you'll build valid Unicode strings. In fact, such character-level interface is not enough to work with and transform Strings (for example it does not work to perform correct transformation of lettercase, or to manage grapheme clusters). The most powerful (and universal) transformations are those that don't use these interfaces directly, but that work on complete Strings and return complete Strings. The character-level APIs are convenience for very basic legacy transformations, but they do not solve alone most internationalization problems; or they are used as a "protected" interface that allow building more powerful String to String transformations. Once you realize that, which UTF you use to handle immutable String objects is not important, because it becomes part of the "blackbox" implementation of String instances. If you consider then the UTF as a blackbox, then the real arguments for an UTF or another depends on the set of String-to-String transformations you want to use (because it conditions the implmentation of these transformations), but more importantly it affects the efficiency of the String storage allocation. For this reason, the blackbox can determine itself which UTF or internal encoding is the best to perform those transformations: the total volume of immutable string instances to handle in memory and the frequency of their instanciation determines which representation to use (because large String volumes will sollicitate the memory manager, and will seriously impact the overall application performance). Using SCSU for such String blackbox can be a good option if this effectively helps in store many strings in a compact (for global performance) but still very fast (for transformations) representation. 
Unfortunately, the immutable String implementations in Java or C# or Python do not allow the application designer to decide which representation will be the best (they are implemented as concrete classes instead of virtual interfaces with possible multiple implementations, as they should; the alternative to interfaces would have been class-level methods allowing the application to negotiate the tuning parameters with the blackbox class implementation). There are other classes or libraries within which such multiple representations are possible, and easily and transparently convertible from one to the other.

(Note that this discussion is related to the UTF used to represent code points, but today there are also needs to work on strings within grapheme cluster boundaries, including the various normalization forms, and a few libraries do exist for which the various normalizations can be changed without changing the "immutable" aspect of Strings, the complexity being that Strings do not always represent plain text...)
Re: Nicest UTF
From: "Theo" <[EMAIL PROTECTED]> From: Asmus Freytag <[EMAIL PROTECTED]> So, despite it being UTF-8 case insensitive, it was totally blastingly fast. (One person reported counting words at 1MB/second of pure text, from within a mixed Basic / C environment). You'll need to keep in mind, that the counter must look up through thousands of words (Every single word its come across in the text), on every single word lookup. Anyhow, from my experience, UTF-8 is great for speed and RAM. Probably true for English or most Western European Latin-based languages (plus Greek and Coptic). But for other languages that still use lots of characters in the range U+ to U+03FF (C0 and C1 controls, Basic Latin, Latin-1 suplement, Latin Extended-A and -B, IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek and Coptic) UTF-8 and UTF-16 may be nearly as efficient. For all others, that need lots of characters out of the range U+ to U+03FF (Cyrillic, Armenian, Hebrew, Arabic, and all Asian or Native-American or African scripts, or even PUAs), UTF-16 is better (more compact in memory, so faster). UTF-32 will be better only for historic texts written nearly completely with characters out of the BMP (for now, only Old Ialic, Gothic, Ugaritic, Deseret, Shavian, Osmanya, Cypriot Syllabary), if C0 controls (such as TAB, CR and LF), or ASCII SPACE, or NBSP are a minority.
Re: Nicest UTF
From: "Asmus Freytag" <[EMAIL PROTECTED]> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider 1) 1 extra test per character (to see whether it's a surrogate) 2) special handling every 100 to 1000 characters (say 10 instructions) 3) additional cost of accessing 16-bit registers (per character) 4) reduction in cache misses (each the equivalent of many instructions) 5) reduction in disk access (each the equivaletn of many many instructions) (...) For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each occurrence depending on the architecture. Their relative weight depends not only on cache sizes, but also on how many other instructions per character are performed. For text scanning operations, their cost does predominate with large data sets. I tend to disagree with you on points 4 and 5: cache misses, and disk accesses (more commonly refered to as "data locality" in computing performances) really favors UTF-16 face to UTF-32, simply because UTF-16 will be more compact for almost every text you need to process, unless you are working on texts that only contain characters from a script *not present at all* in the BMP (this sentence excludes Han, even if there are tons of ideographs out of the BMP, because these ideographs are almost never used alone, but used seldomly within tons of other conventional Han characters in the BMP). Given that these scripts are all historic ones, or were encoded for technical purpose with very specific usage, a very large majority of texts will not use significant numbers of characters out of the BMP, so the use of surrogates in UTF-16 will remain a minority. In all cases, even for texts made only of characters out of the BMP, UTF-16 can't be larger than UTF-32. The only case where it would be worse than UTF-32 is for the internal representation of strings in memory, where 16-bit code units can't be represented with 16-bit only, for example if memory cells are not individually addressable below units of at least 32 bits, and the CPU architecture is very inefficient when working with 16-bit bitfields within 32-bit memory units or registers, due to extra shifts and masking operations needed to pack and unpack 16-bit bitfields into a single 32-bit memory cell. I doubt that such architecture would be very successful, given that too many standard protocols depend on being able to work with datastreams made of 8-bit bytes: with such architecture, all data I/O would need to store 8-bit bytes in separate but addressable 32-bit memory cells, which would really be a poor usage of available central memory (such architecture would require much more RAM to work with equivalent performances for data I/O, and even the very costly fast RAM caches would need to be increased a lot, meaning higher hardware construction costs). So even on such 32-bit only (or 64-bit only...) architectures (where for example the C datatype "char" would be 32-bit or 64-bit), there would be efficient instructions in the CPU to allow packing/unpacking bytes in 32-bit (or 64-bit) memory cells (or at least at the register level, with instructions allowing to work efficiently with such bitfields).
Re: Nicest UTF
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > Decoding SCSU is very straightforward, But not for random access by code point index, which is needed by many string APIs. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/