RE: CESU-8 vs UTF-8
MichKa,

Many people believe that any rule or law that makes no sense or cannot be enforced weakens all other laws. I believe that publishing an inconsistent document would allow any reasonably intelligent reader to come to the same conclusions as you did, and the standard itself would be weakened thereby. I am confused as to how Peoplesoft can justify the need to have a private protocol published by a standards committee unless their intent is to have a real public standard. Until I read this I was of the opinion that Peoplesoft had convinced Oracle to provide this interface but that they wanted a standard to arm-twist Microsoft, IBM and maybe others into providing this interface as well. Now that I hear of the reference to the IANA character set portion, I am afraid that they are trying to force systems that don't even use UTF-16 to buy into this madness. What this says is that Peoplesoft is trying to make the world change because they do not want to change their software to do the right thing.

Now that you have closed up the UTF-8 security holes, CESU-8 would open them back up. It would allow people to impersonate UTF-8, because it would look enough like UTF-8 to be detected as UTF-8. However, if you do not kick out surrogate encodings as bad UTF-8 in order to allow CESU-8 through, then you must allow data that contains non-distinct characters through your firewall. Because it will detect as UTF-8, you also have the dual-representation problems of the non-shortest-form encodings. Having dealt with security issues extensively in the past, I know that the biggest security issues are mistakes and bugs. This is so close to UTF-8 that it may share common support code that will introduce subtle bugs. This is the worst kind.

If it can be demonstrated that there is a real need for an encoding like CESU-8, then it should be very different from UTF-8. How does SCSU, for example, sort? If CESU-8 becomes an IANA standard then other systems can be compelled to support it.
Now these systems are faced with dealing with Unicode in two sort sequences. If endorsed by the Unicode committee, it will be a standard that could be used between systems as well. Unicode describes the encoding, not the use. By endorsing it they are endorsing it for any use. Saying "It is not intended nor recommended as an encoding used for open information exchange" alongside "The Unicode Consortium does not encourage the use of CESU-8, but does recognize the existence of data in this encoding" says that it is an acknowledged and supported Unicode encoding standard even though its use is not encouraged. This says that you can use it as a publicly endorsed Unicode standard.

You wrote, "I, however, work on the assumption that IANA is not populated by morons and that they would be at least willing to hear from the UTC on the inadvisability of supporting any such encoding, no matter who presents it." I hope that if the Unicode committee assumes that the IANA are not morons and would not support such an encoding, they could also credit themselves with the brains to reject it as well. The problem is that if Unicode blesses this encoding, then IANA is hard pressed to deny an endorsed Unicode encoding. It is much like the fact that UTF-8 is recommended for intersystem communications because, unlike UTF-16 and UTF-32, you don't have endian problems. Likewise, it is permissible to send little-endian UTF-16 between systems without a BOM. If passed, it will say to the world that if your business partner wants to use CESU-8 because they have a business need to do so, then they have the blessing of the Unicode consortium. By not endorsing CESU-8 you are telling the world that if you use this standard you do so on your own. That is the proper way to say "It is not intended nor recommended." OTOH, if they want to approve this standard because they don't feel that anyone will take it seriously, then they should approve it for use with Unicode 1.x and 2.x data only.
The bottom line: this UTR tells the world that if a large company has too much software that was written to support UCS-2, and it does not want to add UTF-16 support, it can use this standard to force the smaller partner into jumping through hoops because the smaller partner has less to convert. In all likelihood there are probably not too many places in their code where it is critical that compares exactly match the database sort order. For those I will supply the code wcscmpDB, which will invoke either wcscmp for databases in UTF-16 order or wcscmpCP for UTF-8/UTF-32. I will even throw in wcsncmpDB and wcsncmpCP. This will do until code point ordering is available on all databases.

Carl
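The wcscmpCP routine Carl offers is hypothetical, but the fix-up he alludes to (the "two instructions" mentioned later in the thread) is easy to sketch. Assuming the usual rotation trick — nothing in the thread gives the actual code, and Python is used here purely for illustration — UTF-16 code units 0xE000–0xFFFF are shifted below the surrogate range and surrogates shifted above it, so that comparing the adjusted units yields code point order:

```python
def _fixup(u: int) -> int:
    # Rotate the 0xD800-0xFFFF region so adjusted code units compare in
    # code point order: E000-FFFF (BMP characters) drop below the
    # surrogates, which represent U+10000 and up and must sort last.
    if u >= 0xE000:
        return u - 0x800
    if u >= 0xD800:
        return u + 0x2000
    return u

def wcscmpCP(a, b):
    """Compare two UTF-16 code-unit sequences in code point order."""
    ka = [_fixup(u) for u in a]
    kb = [_fixup(u) for u in b]
    return (ka > kb) - (ka < kb)

# U+FFFF vs U+10000 (encoded as the surrogate pair D800 DC00):
print(wcscmpCP([0xFFFF], [0xD800, 0xDC00]))  # -1: U+FFFF sorts first,
                                             # unlike a raw code-unit compare
```

The adjustment touches only units at or above 0xD800, which is why the overhead over a plain code-unit compare is negligible.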
Re: CESU-8 vs UTF-8
Sun, 16 Sep 2001 01:14:06 -0700, Carl W. Brown [EMAIL PROTECTED] writes:

"If it can be demonstrated that there is a real need for an encoding like CESU-8 then it should be very different from UTF-8. How does SCSU for example sort?"

SCSU encoding is non-deterministic and its representations can't be compared lexicographically at all (logically equal strings might compare unequal). Ehh, we wouldn't have the problem with CESU-8 now if Unicode hadn't been described as a 16-bit encoding in the past. I still think that UTF-16 was a big mistake. Too bad that it still affects people who avoid it. We can't change the past, but I hope that at least UTF-8 processing can be done without treating surrogates in any special way. Surrogates are relevant only for UTF-16; by not using UTF-16 you should be free of surrogate issues, except for having a silly unused area in character numbers and a silly highest character number. Please don't spread UTF-16 madness where it doesn't belong. -- __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTĘPCZA QRCZAK
RE: CESU-8 vs UTF-8
Marcin,

"We can't change the past, but I hope that at least UTF-8 processing can be done without treating surrogates in any special way. Surrogates are relevant only for UTF-16; by not using UTF-16 you should be free of surrogate issues, except by having a silly unused area in character numbers and a silly highest character number. Please don't spread UTF-16 madness where it doesn't belong."

I think that it took UCS-2 to get Unicode started, but I suspect that UTF-16 usage will eventually fade out. Unlike UCS-2, UTF-16 is another MBCS character set and has lost the advantage of a fixed-width character that UTF-32 retains. I think that some applications will find it easier to migrate to UTF-32 rather than convert to UTF-16. With xIUA I demonstrate that it really does not matter much which format of Unicode you use, and that it is even trivial to process it in a mix of formats in the same transaction. The Unicode processing is somewhat independent of its format. To do so you must compare UTF-16 in code point order, which is also a trivial thing to do.

CESU-8 breaks that model because it is a form of Unicode whose sole purpose is supporting a sort order that is not Unicode code point order. Yes, I could devise a way to sort UTF-32 and UTF-8 in UTF-16 binary sort order, but that is only a matter of some messy code. The real issue is that I must now handle Unicode that has as part of its essential properties that it must survive transforms with two distinctly different sort orders. With this standard approved, my applications could be compelled to use CESU-8 in place of UTF-8 if I were to talk to Peoplesoft or other packages that insist on this sort order of Unicode data. If I use UTF-8 as well, then I will need two completely different sets of support routines. Fundamental to all MBCS string handling routines is character length determination.
To do that for CESU-8 I will have to not only check the first byte but, in the case of three-byte sequences, also determine whether the value corresponds to a surrogate. If I don't do this then it is like processing MBCS data with SBCS routines. For example, if I use a UTF-8 strtok on CESU-8 data it will break the strings whenever an initial or trailing token matches. So you need a special CESU-8 routine. The problem will be that CESU-8 may be detected as UTF-8. Suppose I open a socket and get a buffer of data that looks like UTF-8, so I decide to use the UTF-8 support routines. The second buffer comes in with surrogates and I continue to process it as UTF-8. This introduces errors of the worst kind - the subtle errors. The program runs but the data is slightly bad. Oops, I just put the amount in the credit field, not the asset field.

If my application accepts both UTF-8 and CESU-8, then what sorting do I use for my database? My problem is that the correct approach is for people like Peoplesoft to fix their code before accepting non-BMP characters. They should upgrade the UCS-2 code to truly support UTF-16 properly. CESU-8 does more than propagate the errors; it extends the problem by implementing a bad solution. What started out as a comparatively minor problem for a few people ends up as a major problem for everyone. I think that the coexistence of both UTF-8 and CESU-8 is a nightmare, and the Unicode committee has to decide on one or the other, or restrict CESU-8 to BMP character use only, which of course makes it a limited UTF-8. If people really need matching UTF-16 sequences between systems, they can always transform to UTF-8 and convert back into UTF-16 on the other end. Also, they can compare in any order they want. If they like to compare UTF-16 in little-endian byte order, more power to them; just don't ask me to do the same.

Carl
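The length-determination and detection problems above can be made concrete with a sketch of a CESU-8 encoder (Python, written for illustration; the function names are mine, not from the thread or the TR). A supplementary character becomes two three-byte sequences instead of UTF-8's single four-byte sequence, which is why a UTF-8 length or tokenizing routine silently mis-counts CESU-8 data, and why a strict UTF-8 validator rejects it outright:

```python
def _enc3(u: int) -> bytes:
    # Three-byte UTF-8-style encoding of a 16-bit value.
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])

def cesu8_encode(s: str) -> bytes:
    out = bytearray()
    for cp in map(ord, s):
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += _enc3(cp)          # identical to UTF-8 for BMP characters
        else:
            # The CESU-8 difference: split into a UTF-16 surrogate pair and
            # encode each surrogate separately (6 bytes where UTF-8 uses 4).
            cp -= 0x10000
            out += _enc3(0xD800 | (cp >> 10))
            out += _enc3(0xDC00 | (cp & 0x3FF))
    return bytes(out)

smp = cesu8_encode('\U00010400')      # U+10400, a Deseret letter
print(smp.hex())                      # 'eda081edb080'; UTF-8 is 'f0909080'

# A strict UTF-8 decoder rejects the encoded surrogates -- so any
# "firewall" that passes CESU-8 has had its UTF-8 check loosened:
try:
    smp.decode('utf-8')
except UnicodeDecodeError:
    print('not valid UTF-8')
```

Note that for BMP-only text the output is byte-identical to UTF-8, which is exactly why CESU-8 "will almost work" with UTF-8 libraries and fools detection routines.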
Re: CESU-8 vs UTF-8
In a message dated 2001-09-16 13:13:38 Pacific Daylight Time, [EMAIL PROTECTED] writes:

"I think that some applications will find it easier to migrate to UTF-32 rather than convert to UTF-16."

I know I have. Handle everything internally as UTF-32, then read and write UTF-8 or UTF-16 as appropriate.

"CESU-8 breaks that model because it is a form of Unicode whose sole purpose is supporting a sort order that is not Unicode code point order. Yes I could devise a way to sort UTF-32 and UTF-8 in UTF-16 binary sort order but that is only a matter of some messy code. The real issue is that I must now handle Unicode that has as part of its essential properties that it must survive transforms with two distinctly different sort orders."

I was glad when Unicode began moving away from the doctrine of "treat all characters as 16-bit code units" and toward "treat them as abstract code points in the range 0..0x10FFFF." Make no mistake, UTF-16 can be a useful 16-bit transformation format; but it should not be considered the essence of Unicode, especially not to the point where additional machinery needs to be built on top of the Unicode standard solely to support UTF-16.

"With this standard approved my applications could be compelled to use CESU-8 in place of UTF-8 if I were to talk to Peoplesoft or other packages that insist on this sort order of Unicode data. If I use UTF-8 as well, then I will need two completely different sets of support routines."

Actually, what you will need is *one* routine that works with both UTF-8 and CESU-8, but breaks the definition of both in doing so, by permitting either method of handling supplementary characters and auto-detecting the data as UTF-8 or CESU-8 based on the method encountered.

"My problem is that the correct approach is for people like Peoplesoft to fix their code before accepting non-BMP characters."
Still unanswered, in this proposal to sanctify hitherto non-standard representations of non-BMP characters in commercial databases, is the question of how much non-BMP data even exists in commercial databases in the first place. I know I personally have some (and will soon have more, now that SC UniPad supports Deseret), but what about users of Oracle and Peoplesoft databases? Other than the private-use planes, it was not even allowable to use non-BMP characters until the release of Unicode 3.1 earlier this year. Where is the great need for a compatibility encoding? -Doug Ewell Fullerton, California
RE: CESU-8 vs UTF-8 (Was: PDUTR #26 posted)
Sorry, but I left out three points.

1) Why ask for an IANA character set designation for internal use within systems processing Unicode? This is a definite indication that the real intent goes well beyond even the multi-vendor application-to-database interfaces. It is apparent that the real intent is to use the force of standards not only to compel the major database developers to offer support for CESU-8 but to make it a public internet standard as well.

2) The time is now to add the specification of code point order compare support for systems, databases and libraries offering UTF-16 support, before Unicode systems are split into two different migration paths for future multi-plane character support and while vendors are upgrading from UCS-2 to UTF-16 support.

3) We don't want to have to deal with CESU-8 in systems that do not use UTF-16. It will be almost impossible to develop code to support both CESU-8 and UTF-8 well. It will propagate the sort problem from the special case to all systems that use databases or communicate with other systems, by virtue of their having to simultaneously support a mix of CESU-8 and UTF-8, which by definition are required to have distinctly different sort orders.

Let's fix the problem the right way. Thank you, (Now stepping off the soap box) Carl

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Carl W. Brown Sent: Friday, September 14, 2001 9:40 PM To: [EMAIL PROTECTED] Subject: CESU-8 vs UTF-8 (Was: PDUTR #26 posted)

Julie,

Proposed Draft Unicode Technical Report #26: Compatibility Encoding. Thank you for posting this.

"This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended as an alternate encoding to UTF-8 for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange."
"The Unicode Consortium does not encourage the use of CESU-8, but does recognize the existence of data in this encoding and supplies this technical report to clearly define the format and to distinguish it from UTF-8. This encoding does not replace or amend the definition of UTF-8."

This is not a true statement. "It is not intended nor recommended as an encoding used for open information exchange" is false. Its intent is to lay out a format encoding between Oracle and Peoplesoft code in the hopes that they can get other database vendors to support it. They are really asking for a public standard, not a private implementation. If it were only an internal protocol used internally by a single vendor, they would not be submitting a UTR. The decision becomes: should the Unicode committee approve this as a public encoding? To determine that you have to ask three questions. Is there a problem? Are there any negative impacts? Is there an alternative?

Is there a problem? I think that the answer is yes. There is a problem, once you implement characters outside the BMP, that binary sorts of UTF-32 and UTF-8 produce a different sort order from UTF-16. If your application's compares must match a database's key sort, then you have problems if you transform the Unicode from the native database encoding. They want Oracle data stored in UTF-8 to match data encoded by other databases in UTF-16.

Are there negative impacts? Yes. It will almost work with most UTF-8 support libraries. This causes the worst type of errors. You need to have code that either works right or really breaks, and does not introduce subtle errors. It will fool most UTF-8 detection routines. It can create security problems just like non-shortest-form encoding in UTF-8, because the character is not a character but a surrogate.

Is there an alternative? Yes. You must use special code to compare UTF-16. If you use the old UCS-2 code, it will give you the unique UTF-16 compare problem.
However, by adding two instructions to the compare, which add very little overhead, you can provide a Unicode code point compare routine that sorts in exactly the same order as UTF-32 and UTF-8. I propose that, since all UCS-2 vendors will have to upgrade their code to provide UTF-16 support, part of UTF-16 compliance should be that all UTF-16 compares default to a code point order compare. You might want to allow an optional binary compare, but the standard compare should be in code point order. This provides an optimal solution to the problem for everybody. This small extra overhead is just like the extra overhead of checking for and handling surrogates. If this is a problem then UTF-32 is an alternate solution. Carl
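The sort-order discrepancy at the heart of this argument is easy to exhibit (Python, for illustration only): for a BMP character in the range U+E000..U+FFFF and any supplementary character, UTF-8 and UTF-32 binary order agree with code point order, while UTF-16 code-unit order reverses them:

```python
a, b = '\uffff', '\U00010000'

# Code point order: U+FFFF < U+10000, and UTF-8/UTF-32 binary order agrees.
print(a.encode('utf-8') < b.encode('utf-8'))          # True (EF BF BF < F0 90 80 80)
print(a.encode('utf-32-be') < b.encode('utf-32-be'))  # True

# UTF-16 binary order disagrees: U+10000 is the pair D800 DC00, below FFFF.
print(a.encode('utf-16-be') < b.encode('utf-16-be'))  # False
```

This is the entire case for CESU-8 — and equally the entire case for the code-point-order compare proposed above, which removes the discrepancy on the UTF-16 side instead of exporting it to every other encoding form.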
Re: CESU-8 vs UTF-8
Carl W. Brown [EMAIL PROTECTED] writes:

"This is not a true statement. 'It is not intended nor recommended as an encoding used for open information exchange' is false. Its intent is to lay out a format encoding between Oracle and Peoplesoft code in the hopes that they can get other database vendors to support it. They are really asking for a public standard, not a private implementation. If it were only an internal protocol used internally by a single vendor, they would not be submitting a UTR."

Exactly. If CESU-8 were intended only as an internal representation, it would not matter whether it had any official recognition or blessing from Unicode. I can store Unicode data internally any way I want, using UTF-17 [1] if I choose, and there is nothing non-conformant about this as long as I treat the data as scalar values and can convert to the real UTFs for data exchange purposes. To propose CESU-8 in a Technical Report is, as Carl said, an attempt to make it an official, public standard.

"1) Why ask for an IANA character set designation for internal use within systems processing Unicode? This is a definite indication that the real intent goes well beyond even the multi-vendor application-to-database interfaces. It is apparent that the real intent is to use the force of standards not only to compel the major database developers to offer support for CESU-8 but to make it a public internet standard as well."

This section of the TR amazed me. In the Summary and elsewhere, CESU-8 is "not intended nor recommended as an encoding used for open information exchange," but by the end of the document we learn that it will be registered with the Internet Assigned Numbers Authority. I have spelled out IANA for a reason, to highlight that it is a body dealing with open information exchange over the Internet. This completely refutes all of the "internal use only" claims made in the rest of the document.

"Is there an alternative? Yes. You must use special code to compare UTF-16."
"If you use the old UCS-2 code, it will give you the unique UTF-16 compare problem. However, by adding two instructions to the compare, which add very little overhead, you can provide a Unicode code point compare routine that sorts in exactly the same order as UTF-32 and UTF-8."

This was my solution long ago: fix the code that sorts in UCS-2 order so that supplementary characters are sorted correctly. In case there is any disagreement about this, sorting by UCS-2 order has been WRONG ever since surrogates and UTF-16 were invented. However, the database vendors' position is that there is now data sorted in this way, and it cannot be changed or database integrity will be compromised. Fine, there is another alternative: sort all data in UCS-2 order, regardless of the encoding scheme. This takes, as Carl said, about two lines of code. You don't lose any significant processing time, and you DON'T need to invent a new encoding scheme.

"2) The time is now to add the specification of code point order compare support for systems, databases and libraries offering UTF-16 support, before Unicode systems are split into two different migration paths for future multi-plane character support and while vendors are upgrading from UCS-2 to UTF-16 support."

Unicode has, understandably, avoided recommending binary code point order, referring people instead to the Collation Algorithm for culturally correct sorting. This is good because it alerts designers of most applications to the real issues surrounding collation. For database applications, however, there is a need for binary code point order that has more to do with consistency than cultural correctness. I accept this, but still contend that you can sort UTF-8 data in UCS-2 code point order quickly and easily, without the need for CESU-8 at all, let alone the need to enshrine it in a TR. There was a lot that I liked in this PDUTR.
The misleading name UTF-8S has been replaced, and there are all those caveats that CESU-8 is not, not, NOT to be used in open data exchange. None of these caveats, however, can be taken seriously as long as Section 4, IANA Registration, is present. I suggest, as part of the Proposed Draft stage for this document, that Section 4 be deleted and that IANA be informed that CESU-8 is intended as an internal encoding only and that they are explicitly requested NOT to register it. -Doug Ewell Fullerton, California [1] UTF-17 was a *humorous* description of an exceedingly inefficient Unicode character encoding scheme. It was not proposed seriously and does not contribute to the proliferation of UTFs.
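Doug's alternative — sort every encoding form in UTF-16 binary order rather than inventing CESU-8 — can likewise be sketched in a couple of lines (Python, for illustration; the helper name is an invention for this sketch). Because big-endian bytes compare exactly as their 16-bit code units do, UTF-16BE bytes serve directly as the sort key for strings held in any form:

```python
def utf16_binary_key(s: str) -> bytes:
    # Big-endian UTF-16 bytes compare bytewise exactly as the underlying
    # 16-bit code units compare, so this key yields UTF-16 binary order
    # no matter how the string is stored (UTF-8, UTF-32, ...).
    return s.encode('utf-16-be')

strings = ['\U00010000', '\uffff', 'A']
print(sorted(strings, key=utf16_binary_key))
# ['A', '\U00010000', '\uffff'] -- the supplementary character sorts
# before U+FFFF, matching a UCS-2-era binary index
```

This is the mirror image of the code-point-order fix-up: one adjusts UTF-16 compares toward code point order, the other adjusts everything else toward UTF-16 order; either way, no new encoding scheme is needed.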
RE: CESU-8 vs UTF-8
Doug,

"This was my solution long ago: fix the code that sorts in UCS-2 order so that supplementary characters are sorted correctly. In case there is any disagreement about this, sorting by UCS-2 order has been WRONG ever since surrogates and UTF-16 were invented. However, the database vendors' position is that there is now data sorted in this way, and it cannot be changed or database integrity will be compromised."

It will not be compromised unless they already have data with characters in the database indexes beyond U+FFFF. This is why I think that the Unicode standards committee should take quick action, so that the Unicode world does not get split between two alternate basic sorting sequences. It needs to be done before there is a lot of legacy data to contend with. Now is the time, because developers are just starting to convert to provide real surrogate support.

Collation does not work for three reasons. 1) It is too slow. 2) More importantly, we need to have a locale-neutral sorting sequence. 3) Code point order sequencing supports all existing data stored with UCS-2 binary indexes, but collation does not.

"I suggest, as part of the Proposed Draft stage for this document, that Section 4 be deleted and that IANA be informed that CESU-8 is intended as an internal encoding only and that they are explicitly requested NOT to register it."

In actuality, Section 4 neither adds to nor takes away from PDUTR #26. They can apply to IANA whether or not Section 4 is included. It is merely a notification that there is no intent to make CESU-8 a private protocol. PDUTR #26 should be rejected in its entirety. If it is truly a private protocol, as they claim, it does not belong in any form in the Unicode standard.

You may have heard about hijacking legislative bills. It is taking an existing bill and amending it to change the entire text of the bill. I think that we should hijack PDUTR #26 and replace it with UTF-17.
In actuality we should hijack PDUTR #26 to modify TR27 to specify that at a minimum, systems that support UTF-16 must provide code point order support services. We should delete all references to CESU-8 and reject the idea of adding CESU-8 to the standard. Carl
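Carl's third point against collation — that code point order is compatible with existing UCS-2 binary indexes — holds because the two orders only diverge once supplementary characters appear, and UCS-2-era indexes cannot contain any. A quick check (Python, for illustration only):

```python
# For BMP-only strings (all that UCS-2-era databases can actually hold),
# UTF-16 code-unit order and code point order coincide, so mandating
# code point compares does not disturb existing binary indexes.
bmp = ['A', '\u00e9', '\u4e2d', '\ud7ff', '\ue000', '\uffff']
by_code_unit  = sorted(bmp, key=lambda s: s.encode('utf-16-be'))
by_code_point = sorted(bmp, key=lambda s: s.encode('utf-32-be'))
print(by_code_unit == by_code_point)  # True -- the orders only diverge
                                      # once non-BMP characters appear
```

This is the sense in which "database integrity will be compromised" only applies to indexes that already contain characters beyond U+FFFF.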
Re: CESU-8 vs UTF-8
Carl, Doug,

The issues you and Doug brought up were vigorously discussed. As for the decision, all I can say is that not everyone voted for it (which will be a matter of public record once the preliminary minutes are posted).

D This section of the TR amazed me. In the Summary and
D elsewhere, CESU-8 is "not intended nor recommended as
D an encoding used for open information exchange," but by
D the end of the document we learn that it will be registered
D with the Internet Assigned Numbers Authority. I have
D spelled out IANA for a reason, to highlight that it is a body
D dealing with open information exchange over the Internet.
...
D This completely refutes all of the "internal use only" claims
D made in the rest of the document.

Yes, there are many such issues. This is, however, more of a side effect of how much the document *changed* from the original, based on feedback. "Many people believe that any rule or law that makes no sense or cannot be enforced weakens all other laws. I believe that publishing an inconsistent document would allow any reasonably intelligent reader to come to the same conclusions as you did, and the standard itself would be weakened thereby." ...

D I suggest, as part of the Proposed Draft stage for this document,
D that Section 4 be deleted and that IANA be informed that CESU-8
D is intended as an internal encoding only and that they are explicitly
D requested NOT to register it.

C In actuality Section 4 neither adds to nor takes away from PDUTR #26.
C They can apply to IANA whether or not Section 4 is included.
C It is merely a notification that there is no intent to make CESU-8 a
C private protocol.

The argument was put forward [unconvincingly, in my eyes] that the only way to protect the situation from having some other vendor register it with IANA would be to do so in a pre-emptive manner.
I, however, work on the assumption that IANA is not populated by morons and that they would be at least willing to hear from the UTC on the inadvisability of supporting any such encoding, no matter who presents it. No guarantees of course (there never are any), but I am sure they would be willing to consider the desire of the UTC not to further litter the playing field?

C PDUTR #26 should be rejected in its entirety. If it is truly a private
C protocol as they claim, it does not belong in any form in the Unicode
C standard.

I concur. The argument was made that it should be tied to the [orthogonal, in my eyes] argument of tightening up Unicode 3.2's UTF-8 definition to disallow the 6-byte form. In my eyes, however, it is perfectly acceptable to claim that, in order to be compliant with the Unicode 3.2 definition of UTF-8, one must not use the 6-byte form, but that prior versions would allow one to accept it (if they so desired). Thus you can make one change without *requiring* the other. Since the only clients who would emit CESU-8 (PeopleSoft, et al.) are doing so privately, no UTR is needed for them to do so. And there is a [prior] version of the standard that can accommodate them.

C You may have heard about hijacking legislative bills. It is taking an
C existing bill and amending it to change the entire text of the bill. I
C think that we should hijack PDUTR #26 and replace it with UTF-17.
C
C In actuality we should hijack PDUTR #26 to modify TR27 to specify
C that at a minimum, systems that support UTF-16 must provide code
C point order support services. We should delete all references to
C CESU-8 and reject the idea of adding CESU-8 to the standard.

I do not know that the former is required; but either way I agree that CESU-8 (née UTF8-S) should not be included even as a UTR. However, it is not possible to hijack the current proposal, as the author does not wish this to happen... though I suppose you are welcome to try to convince him?
:-) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/