RE: CESU-8 vs UTF-8

2001-09-16 Thread Carl W. Brown

MichKa,


 Many people believe that any rule or law that makes no sense or cannot be
 enforced weakens all other laws. I believe that publishing an inconsistent
 document would allow any reasonably intelligent reader to come to the same
 conclusions as you did, and the standard itself would be weakened thereby.


I am confused as to how Peoplesoft can justify the need to have a private
protocol published by a standards committee, unless their intent is to have
a real public standard.

Until I read this, I was of the opinion that Peoplesoft had convinced Oracle
to provide this interface, but that they wanted a standard in order to
arm-twist Microsoft, IBM and maybe others into providing this interface as
well.

Now that I hear of the reference to the IANA character set registration, I
am afraid that they are trying to force systems that don't even use UTF-16
to buy into this madness.

What this says is that Peoplesoft is trying to make the world change because
they do not want to change their software to do the right thing.

Now that you have closed up the UTF-8 security holes, CESU-8 would open them
back up.  It would allow people to impersonate UTF-8 because it looks enough
like UTF-8 to be detected as UTF-8.  However, if you do not kick out
surrogate encodings as bad UTF-8 in order to let CESU-8 through, then you
must allow data that contains non-distinct characters through your firewall.
Because it will be detected as UTF-8, you also have the dual-representation
problems of the non-shortest-form encodings.
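
To make the detection point concrete, here is a rough sketch of my own (not
code from any shipping library): a three-byte sequence beginning ED A0..BF
encodes a surrogate code point, which strict UTF-8 must reject and which
CESU-8 requires.  Any detector or firewall that lets these sequences through
in order to accommodate CESU-8 has, by that act, loosened its UTF-8 checking.

#include <stddef.h>

/*
 * Sketch only: scan a buffer that otherwise looks like UTF-8 for CESU-8
 * style surrogate encodings.  A surrogate U+D800..U+DFFF encoded as three
 * bytes always begins ED A0..BF, so a single pass finds them.
 */
static int contains_cesu8_surrogates(const unsigned char *s, size_t len)
{
    size_t i;
    for (i = 0; i + 2 < len; i++) {
        if (s[i] == 0xED && s[i + 1] >= 0xA0 && s[i + 1] <= 0xBF &&
            (s[i + 2] & 0xC0) == 0x80)
            return 1;   /* strict UTF-8 must reject this sequence */
    }
    return 0;
}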

Having dealt with security issues extensively in the past, I know that the
biggest security issues are mistakes and bugs.  CESU-8 is so close to UTF-8
that it may share common support code, which will introduce subtle bugs.
These are the worst kind.  If it can be demonstrated that there is a real
need for an encoding like CESU-8, then it should be very different from
UTF-8.  How does SCSU, for example, sort?

If CESU-8 becomes an IANA standard, then other systems can be compelled to
support it.  These systems are then faced with dealing with Unicode in two
sort sequences.  If endorsed by the Unicode committee, it will be a standard
that could be used between systems as well.  Unicode describes the encoding,
not the use.  By endorsing it they are endorsing it for any use.

"It is not intended nor recommended as an encoding used for open information
exchange. The Unicode Consortium does not encourage the use of CESU-8, but
does recognize the existence of data in this encoding" says that it is an
acknowledged and supported Unicode encoding even though its use is not
encouraged.  This says that you can use it as a publicly endorsed Unicode
standard.

 I, however, work on the
 assumption that IANA is not populated by morons and that they would be at
 least willing to hear from the UTC on the inadvisability of supporting any
 such encoding, no matter who presents it.


I hope that if the Unicode committee assumes that IANA is not populated by
morons and would not support such an encoding, it will also credit itself
with the brains to reject the encoding as well.

The problem is that if Unicode blesses this encoding, then IANA is hard
pressed to deny an endorsed Unicode encoding.

It is much like the fact that UTF-8 is recommended for intersystem
communication because, unlike UTF-16 and UTF-32, you don't have endian
problems.  Likewise, it is still permissible to send little-endian UTF-16
between systems without a BOM.

If passed, it will say to the world that if your business partner wants to
use CESU-8 because they have a business need to do so, then they have the
blessing of the Unicode Consortium.

By not endorsing CESU-8 you are telling the world that if you use this
encoding you do so on your own.  It is the proper way to say "it is not
intended nor recommended."

OTOH, if they want to approve this standard because they don't feel that
anyone will take it seriously, then they should approve it for use with
Unicode 1.x and Unicode 2.x data only.


___

The bottom line: this UTR tells the world that if a large company has too
much software written to support UCS-2 and does not want to add UTF-16
support, it can use this standard to force the smaller partner into jumping
through hoops, because the partner has less to convert.

In all likelihood there are not too many places in their code where it is
critical that compares exactly match the database sort order.  For those I
will supply the code for wcscmpDB, which will invoke either wcscmp for
databases in UTF-16 order or wcscmpCP for UTF-8/UTF-32 order.  I will even
throw in wcsncmpDB and wcsncmpCP.  This will do until code point ordering is
available on all databases.
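
To give an idea of what I mean, here is a sketch of my own of what wcscmpCP
could look like (wcscmpDB and wcscmpCP are just my proposed names, and this
assumes a 16-bit wchar_t holding UTF-16).  The whole trick is to nudge the
surrogate range above the rest of the BMP before comparing code units:

#include <wchar.h>

/*
 * Map a UTF-16 code unit so that unit-by-unit comparison of the mapped
 * values yields Unicode code point order, i.e. the same order as a binary
 * compare of the equivalent UTF-8 or UTF-32.
 */
static unsigned int cp_key(wchar_t c)
{
    unsigned int u = (unsigned int)c & 0xFFFFu;
    if (u >= 0xE000u)
        return u - 0x800u;     /* E000..FFFF -> D800..F7FF        */
    if (u >= 0xD800u)
        return u + 0x2000u;    /* surrogates -> F800..FFFF (last) */
    return u;                  /* 0000..D7FF unchanged            */
}

int wcscmpCP(const wchar_t *a, const wchar_t *b)
{
    for (;; a++, b++) {
        unsigned int ka = cp_key(*a);
        unsigned int kb = cp_key(*b);
        if (ka != kb)
            return (ka < kb) ? -1 : 1;
        if (*a == 0)
            return 0;          /* both strings ended together */
    }
}

wcscmpDB would then simply call wcscmp or wcscmpCP depending on which order
the database in question uses.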

Carl







Re: CESU-8 vs UTF-8

2001-09-16 Thread Marcin 'Qrczak' Kowalczyk

Sun, 16 Sep 2001 01:14:06 -0700, Carl W. Brown [EMAIL PROTECTED] writes:

 If it can be demonstrated that there is a real need for an encoding
 like CESU-8, then it should be very different from UTF-8.  How does
 SCSU, for example, sort?

SCSU encoding is non-deterministic and its representations can't
be compared lexicographically at all (logically equal strings might
compare unequal).

Ehh, we wouldn't have the problem with CESU-8 now if Unicode hadn't
been described as a 16-bit encoding in the past. I still think that
UTF-16 was a big mistake. Too bad that it still affects people who
avoid it.

We can't change the past, but I hope that at least UTF-8 processing can
be done without treating surrogates in any special way. Surrogates are
relevant only for UTF-16; by not using UTF-16 you should be free of
surrogate issues, except by having a silly unused area in character
numbers and a silly highest character number. Please don't spread
UTF-16 madness where it doesn't belong.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





RE: CESU-8 vs UTF-8

2001-09-16 Thread Carl W. Brown

Marcin,


 We can't change the past, but I hope that at least UTF-8 processing can
 be done without treating surrogates in any special way. Surrogates are
 relevant only for UTF-16; by not using UTF-16 you should be free of
 surrogate issues, except by having a silly unused area in character
 numbers and a silly highest character number. Please don't spread
 UTF-16 madness where it doesn't belong.


I think that it took UCS-2 to get Unicode started, but I suspect that UTF-16
usage will eventually fade out.  Unlike UCS-2, UTF-16 is another MBCS
character set and has lost the advantage of a fixed-width character that
UTF-32 offers.  I think that some applications will find it easier to
migrate to UTF-32 rather than convert to UTF-16.

With xIUA I demonstrate that it really does not matter much what format of
Unicode you use and that it is even trivial to process it in a mix of
formats in the same transaction.  The Unicode processing is somewhat
independent of its format.  To do so you must compare UTF-16 in code point
order, which is also a trivial thing to do.

CESU-8 breaks that model because it is a form of Unicode whose sole purpose
is to support a sort order other than Unicode code point order.  Yes, I
could devise a way to sort UTF-32 and UTF-8 in UTF-16 binary sort order, but
that is only a matter of some messy code.  The real issue is that I must now
handle Unicode that has, as part of its essential properties, the
requirement that it survive transforms with two distinctly different sort
orders.

With this standard approved, my applications can be compelled to use CESU-8
in place of UTF-8 if I am to talk to Peoplesoft or other packages that
insist on this sort order of Unicode data.  If I use UTF-8 as well, then I
will need two completely different sets of support routines.

Fundamental to all MBCS string handling routines is character length
determination.  To do that for CESU-8 I will have to not only check the
first byte but, in the case of three-byte sequences, also determine whether
the value corresponds to a surrogate.  If I don't do this then it is like
processing MBCS data with SBCS routines.  For example, if I use a UTF-8
strtok on CESU-8 data it will break the strings whenever either an initial
or a trailing token matches.  So you need a special CESU-8 routine.  The
problem is that CESU-8 may be detected as UTF-8.  Suppose I open a socket
and get a buffer of data that looks like UTF-8, so I decide to use the UTF-8
support routines.  The second buffer comes in with surrogates and I continue
to process it as UTF-8.  This introduces errors of the worst kind - the
subtle errors.  The program runs but the data is slightly bad.  Oops, I just
put the amount in the credit field, not the asset field.
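
Here is the kind of length check I am talking about, as a sketch of my own
(hypothetical helpers, assuming well-formed input): a UTF-8 routine can
decide from the lead byte alone, while a CESU-8 routine must also look at
the second byte of a three-byte sequence to see whether it is a lead
surrogate and therefore the first half of a six-byte pair.

#include <stddef.h>

/* Byte length of the character starting at s, for well-formed UTF-8. */
static size_t char_len_utf8(const unsigned char *s)
{
    if (s[0] < 0x80) return 1;
    if (s[0] < 0xE0) return 2;
    if (s[0] < 0xF0) return 3;
    return 4;
}

/* The same question for CESU-8 needs an extra look at the second byte. */
static size_t char_len_cesu8(const unsigned char *s)
{
    if (s[0] < 0x80) return 1;
    if (s[0] < 0xE0) return 2;
    if (s[0] < 0xF0) {
        /* a lead surrogate (ED A0..AF xx) means the character is really
         * a six-byte surrogate pair */
        if (s[0] == 0xED && (s[1] & 0xF0) == 0xA0)
            return 6;
        return 3;
    }
    return 4;   /* four-byte forms do not occur in well-formed CESU-8 */
}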

If my application accepts both UTF-8 and CESU-8 then what sorting do I use
for my database?

My problem is that the correct approach is for people like Peoplesoft to fix
their code before accepting non-BMP characters.  They should upgrade their
UCS-2 code to truly support UTF-16 properly.  CESU-8 does more than
propagate the errors; it extends the problem by implementing a bad solution.
What started out as a comparatively minor problem for a few people ends up
as a major problem for everyone.

I think that the coexistence of both UTF-8 and CESU-8 is a nightmare, and
the Unicode committee has to decide on one or the other, or restrict CESU-8
to BMP character use only, which of course makes it a limited UTF-8.  If
people really need matching UTF-16 sequences between systems, they can
always transform to UTF-8 and convert back on the other end into UTF-16
again.  They can also compare in any order they want.  If they like to
compare UTF-16 in little-endian byte order, more power to them; just don't
ask me to do the same.
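
The round trip really is lossless, by the way.  A surrogate pair folds into
one code point, UTF-8 carries it as a single four-byte sequence, and the
receiver splits it back into the identical pair.  A sketch of my own,
covering supplementary characters only:

/* Combine a UTF-16 surrogate pair into a code point, emit it as four
 * bytes of UTF-8, and split it back again -- nothing is lost. */
static unsigned long pair_to_cp(unsigned int lead, unsigned int trail)
{
    return 0x10000UL + ((unsigned long)(lead - 0xD800u) << 10)
                     + (trail - 0xDC00u);
}

static void cp_to_pair(unsigned long cp, unsigned int *lead,
                       unsigned int *trail)
{
    cp -= 0x10000UL;
    *lead  = 0xD800u + (unsigned int)(cp >> 10);
    *trail = 0xDC00u + (unsigned int)(cp & 0x3FFu);
}

static void cp_to_utf8(unsigned long cp, unsigned char out[4])
{
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
}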

Carl






Re: CESU-8 vs UTF-8

2001-09-16 Thread DougEwell2

In a message dated 2001-09-16 13:13:38 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  I think that some applications will find it easier to migrate to
  UTF-32 rather than convert to UTF-16.

I know I have.  Handle everything internally as UTF-32, then read and write 
UTF-8 or UTF-16 as appropriate.

  CESU-8 breaks that model because it is a form of Unicode whose sole
  purpose is to support a sort order other than Unicode code point order.
  Yes, I could devise a way to sort UTF-32 and UTF-8 in UTF-16 binary sort
  order, but that is only a matter of some messy code.  The real issue is
  that I must now handle Unicode that has, as part of its essential
  properties, the requirement that it survive transforms with two distinctly
  different sort orders.

I was glad when Unicode began moving away from the doctrine of "treat all
characters as 16-bit code units" and toward "treat them as abstract code
points in the range 0..0x10FFFF."  Make no mistake, UTF-16 can be a useful
16-bit transformation format; but it should not be considered the essence of
Unicode, especially not to the point where additional machinery needs to be
built on top of the Unicode standard solely to support UTF-16.

  With this standard approved, my applications can be compelled to use
  CESU-8 in place of UTF-8 if I am to talk to Peoplesoft or other packages
  that insist on this sort order of Unicode data.  If I use UTF-8 as well,
  then I will need two completely different sets of support routines.

Actually, what you will need is *one* routine that works with both UTF-8 and 
CESU-8, but breaks the definition of both in doing so, by permitting either 
method of handling supplementary characters, and auto-detecting the data as 
UTF-8 or CESU-8 based on the method encountered.

  My problem is that the correct approach is for people like Peoplesoft to
  fix their code before accepting non-BMP characters.

Still unanswered, in this proposal to sanctify hitherto non-standard 
representations of non-BMP characters in commercial databases, is the 
question of how much non-BMP data even exists in commercial databases in the 
first place.  I know I personally have some (and will soon have more, now 
that SC UniPad supports Deseret), but what about users of Oracle and 
Peoplesoft databases?  Other than the private-use planes, it was not even 
allowable to use non-BMP characters until the release of Unicode 3.1 earlier 
this year.  Where is the great need for a compatibility encoding?

-Doug Ewell
 Fullerton, California




RE: CESU-8 vs UTF-8 (Was: PDUTR #26 posted)

2001-09-15 Thread Carl W. Brown

Sorry but I left out three points.

1) Why ask for an IANA character set designation for internal use within
systems processing Unicode?  This is a definite indication that the real
intent goes well beyond even the multi-vendor application-to-database
interfaces.  It is apparent that the real intent is to use the force of
standards not only to compel the major database developers to offer support
for CESU-8 but to make it a public internet standard as well.

2) Now is the time to add the specification of code point order compare
support for systems, databases and libraries offering UTF-16 support, before
Unicode systems are split into two different migration paths for future
multi-plane character support and while vendors are upgrading from UCS-2 to
UTF-16 support.

3) We don't want to have to deal with CESU-8 in systems that do not use
UTF-16.

It will be almost impossible to develop code to support both CESU-8 and
UTF-8 well.  It will propagate the sort problem from the special case to all
systems that use databases or communicate with other systems, by virtue of
having to simultaneously support a mix of CESU-8 and UTF-8, which by
definition are required to have distinctly different sort orders.

Let's fix the problem the right way.

Thank you,  (Now stepping off the soap box)

Carl

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Carl W. Brown
 Sent: Friday, September 14, 2001 9:40 PM
 To: [EMAIL PROTECTED]
 Subject: CESU-8 vs UTF-8 (Was: PDUTR #26 posted)


 Julie,

  Proposed Draft Unicode Technical Report #26: Compatibility Encoding

 Thank you for posting this.

 This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16
 (CESU) that is intended as an alternate encoding to UTF-8 for internal use
 within systems processing Unicode in order to provide an ASCII-compatible
 8-bit encoding that preserves UTF-16 binary collation. It is not intended
 nor recommended as an encoding used for open information exchange. The
 Unicode Consortium does not encourage the use of CESU-8, but does recognize
 the existence of data in this encoding and supplies this technical report
 to clearly define the format and to distinguish it from UTF-8. This
 encoding does not replace or amend the definition of UTF-8.

 This is not a true statement.  "It is not intended nor recommended as an
 encoding used for open information exchange" is false.  Its intent is to
 lay out a format for encoding between Oracle and Peoplesoft code in the
 hope that they can get other database vendors to support it.  They are
 really asking for a public standard, not a private implementation.

 If it were only a protocol used internally by a single vendor, they would
 not be submitting a UTR.

 The decision becomes: should the Unicode committee approve this as a public
 encoding?  To determine that you have to ask three questions.  Is there a
 problem?  Are there any negative impacts?  Is there an alternative?

 Is there a problem?  I think that the answer is yes.  Once you implement
 characters outside the BMP, binary sorts of UTF-32 and UTF-8 sort in a
 different order from UTF-16.  If your application's compares must match a
 database's key sort order, then you have problems if you transform the
 Unicode from the native database encoding.  They want Oracle data stored
 in UTF-8 to match data encoded by other databases in UTF-16.

 Are there negative impacts?  Yes.  It will almost work with most UTF-8
 support libraries.  This causes the worst type of errors.  You need to have
 code that either works right or really breaks, not code that introduces
 subtle errors.  It will fool most UTF-8 detection routines.  It can create
 security problems just like non-shortest-form encoding in UTF-8, because
 the character is not a character but a surrogate.

 Is there an alternative?  Yes.  You must use special code to compare
 UTF-16.  If you use the old UCS-2 code it will give you the unique UTF-16
 compare problem.  However, by adding two instructions to the compare that
 add very little overhead, you can provide a Unicode code point compare
 routine that sorts in exactly the same order as UTF-32 and UTF-8.

 I propose that, since all UCS-2 vendors will have to upgrade their code to
 provide UTF-16 support, part of UTF-16 compliance should be that all UTF-16
 compares default to a code point order compare.  You might want to allow an
 optional binary compare, but the standard compare should be in code point
 order.

 This provides an optimal solution to the problem for everybody.  This small
 extra overhead is just like the extra overhead in checking for and handling
 surrogates.  If this is a problem then UTF-32 is an alternate solution.

 Carl








Re: CESU-8 vs UTF-8

2001-09-15 Thread DougEwell2

Carl W. Brown [EMAIL PROTECTED] writes:

 This is not a true statement.  "It is not intended nor recommended as an
 encoding used for open information exchange" is false.  Its intent is to
 lay out a format for encoding between Oracle and Peoplesoft code in the
 hope that they can get other database vendors to support it.  They are
 really asking for a public standard, not a private implementation.

 If it were only a protocol used internally by a single vendor, they would
 not be submitting a UTR.

Exactly.  If CESU-8 were intended only as an internal representation, it 
would not matter whether it had any official recognition or blessing from 
Unicode.  I can store Unicode data internally any way I want, using UTF-17 
[1] if I choose, and there is nothing non-conformant about this as long as I 
treat the data as scalar values and can convert to the real UTFs for data 
exchange purposes.  To propose CESU-8 in a Technical Report is, as Carl said, 
an attempt to make it an official, public standard.

 1) Why ask for an IANA character set designation for internal use within
 systems processing Unicode?  This is a definite indication that the real
 intent goes well beyond even the multi-vendor application-to-database
 interfaces.  It is apparent that the real intent is to use the force of
 standards not only to compel the major database developers to offer support
 for CESU-8 but to make it a public internet standard as well.

This section of the TR amazed me.  In the Summary and elsewhere, CESU-8 is
"not intended nor recommended as an encoding used for open information
exchange," but by the end of the document we learn that it will be registered
with the Internet Assigned Numbers Authority.  I have spelled out IANA for a
reason, to highlight that it is a body dealing with open information exchange
over the Internet.  This completely refutes all of the "internal use only"
claims made in the rest of the document.

 Is there an alternative?  Yes.  You must use special code to compare
 UTF-16.  If you use the old UCS-2 code it will give you the unique UTF-16
 compare problem.  However, by adding two instructions to the compare that
 add very little overhead, you can provide a Unicode code point compare
 routine that sorts in exactly the same order as UTF-32 and UTF-8.

This was my solution long ago: fix the code that sorts in UCS-2 order so that 
supplementary characters are sorted correctly.  In case there is any 
disagreement about this, sorting by UCS-2 order has been WRONG ever since 
surrogates and UTF-16 were invented.

However, the database vendors' position is that there is now data sorted in
this way, and it cannot be changed or database integrity will be compromised.
Fine, there is another alternative: sort all data in UCS-2 order, regardless
of the encoding scheme.  This takes, as Carl said, about two lines of code.
You don't lose any significant processing time, and you DON'T need to invent
a new encoding scheme.
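
The adjustment Carl and I are describing really is about that small.  As a
sketch of my own (not from any particular product): when comparing scalar
values, lift the top of the BMP above the supplementary planes so that
everything falls into UTF-16 (that is, UCS-2) binary order.

/* Sort key so that Unicode scalar values compare in UTF-16 binary order:
 * supplementary characters must slot in where their surrogates would,
 * between U+D7FF and U+E000, so lift U+E000..U+FFFF above the planes. */
static unsigned long utf16_order_key(unsigned long cp)
{
    if (cp >= 0xE000UL && cp <= 0xFFFFUL)
        return cp + 0x200000UL;   /* any constant that clears U+10FFFF */
    return cp;
}

Apply that key to the code points of UTF-8 or UTF-32 data before comparing
and you get exactly the order the existing UCS-2-sorted indexes use; the
inverse adjustment gives code point order to a UTF-16 compare instead.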

 2) Now is the time to add the specification of code point order compare
 support for systems, databases and libraries offering UTF-16 support,
 before Unicode systems are split into two different migration paths for
 future multi-plane character support and while vendors are upgrading from
 UCS-2 to UTF-16 support.

Unicode has, understandably, avoided recommending binary code point order, 
referring people instead to the Collation Algorithm for culturally correct 
sorting.  This is good because it alerts designers of most applications to 
the real issues surrounding collation.  For database applications, however, 
there is a need for binary code point order that has more to do with 
consistency than cultural correctness.  I accept this, but still contend that 
you can sort UTF-8 data in UCS-2 code point order quickly and easily, without 
the need for CESU-8 at all, let alone the need to enshrine it in a TR.

There was a lot that I liked in this PDUTR.  The misleading name UTF-8S has 
been replaced, and there are all those caveats that CESU-8 is not, not, NOT 
to be used in open data exchange.  None of these caveats, however, can be 
taken seriously as long as Section 4, IANA Registration, is present.

I suggest, as part of the Proposed Draft stage for this document, that 
Section 4 be deleted and that IANA be informed that CESU-8 is intended as an 
internal encoding only and that they are explicitly requested NOT to register 
it.

-Doug Ewell
 Fullerton, California

[1]  UTF-17 was a *humorous* description of an exceedingly inefficient 
Unicode character encoding scheme.  It was not proposed seriously and does 
not contribute to the proliferation of UTFs.




RE: CESU-8 vs UTF-8

2001-09-15 Thread Carl W. Brown

Doug,


 This was my solution long ago: fix the code that sorts in UCS-2
 order so that
 supplementary characters are sorted correctly.  In case there is any
 disagreement about this, sorting by UCS-2 order has been WRONG ever since
 surrogates and UTF-16 were invented.

 However, the database vendors' position is that there is now data
 sorted in
 this way, and it cannot be changed or database integrity will be
 compromised.

It will not be compromised unless they already have data with characters in
the database indexes beyond U+FFFF.  This is why I think that the Unicode
standards committee should take quick action so that the Unicode world does
not get split between two alternate basic sorting sequences.  It needs to be
done before there is a lot of legacy data to contend with.  Now is the time,
because developers are just starting to convert to provide real surrogate
support.  Collation does not work for three reasons.  1) It is too slow.
2) More importantly, we need to have a locale-neutral sorting sequence.
3) Code point order sequencing supports all existing data stored with UCS-2
binary indexes, but collation does not.


 I suggest, as part of the Proposed Draft stage for this document, that
 Section 4 be deleted and that IANA be informed that CESU-8 is intended as
 an internal encoding only and that they are explicitly requested NOT to
 register it.

In actuality Section 4 neither adds to nor takes away from PDUTR #26.  They
can apply to IANA whether or not Section 4 is included.  It is merely a
notification that there is no intent to make CESU-8 a private protocol.

PDUTR #26 should be rejected in its entirety.  If it is truly a private
protocol as they claim, it does not belong in any form in the Unicode
standard.

You may have heard about hijacking legislative bills.  It is taking an
existing bill and amending it to change the entire text of the bill.  I
think that we should hijack PDUTR #26 and replace it with UTF-17.

In actuality we should hijack PDUTR #26 to modify TR27 to specify that at a
minimum, systems that support UTF-16 must provide code point order support
services.  We should delete all references to CESU-8 and reject the idea of
adding CESU-8 to the standard.

Carl








Re: CESU-8 vs UTF-8

2001-09-15 Thread Michael \(michka\) Kaplan

Carl, Doug,

The issues you and Doug brought up were vigorously discussed. As for the
decision, all I can say is that not everyone voted for it (which will be a
matter of public record once the preliminary minutes are posted).

D This section of the TR amazed me.  In the Summary and
D elsewhere, CESU-8 is "not intended nor recommended as
D an encoding used for open information exchange," but by
D the end of the document we learn that it will be registered
D with the Internet Assigned Numbers Authority.  I have
D spelled out IANA for a reason, to highlight that it is a body
D dealing with open information exchange over the Internet.

...

D This completely refutes all of the "internal use only" claims
D made in the rest of the document.

Yes, there are many such issues. This is, however, more of a side effect of
how much the document *changed* from the original document, based on
feedback.

Many people believe that any rule or law that makes no sense or cannot be
enforced weakens all other laws. I believe that publishing an inconsistent
document would allow any reasonably intelligent reader to come to the same
conclusions as you did, and the standard itself would be weakened thereby.

...


D I suggest, as part of the Proposed Draft stage for this document,
D that Section 4 be deleted and that IANA be informed that CESU-8
D is intended as an internal encoding only and that they are explicitly
D requested NOT to register it.

C In actuality Section 4 neither adds to nor takes away from PDUTR #26.
C They can apply to IANA whether or not Section 4 is included.  It is
C merely a notification that there is no intent to make CESU-8 a
C private protocol.

The argument was put forward [unconvincingly, in my eyes] that the only way
to protect the situation from having some other vendor register it with IANA
would be to do so in a pre-emptive manner. I, however, work on the
assumption that IANA is not populated by morons and that they would be at
least willing to hear from the UTC on the inadvisability of supporting any
such encoding, no matter who presents it.

No guarantees of course (there never are any) but I am sure they would be
willing to consider the desire of the UTC to not further litter the playing
field?

C PDUTR #26 should be rejected in its entirety.  If it is truly a private
C protocol as they claim, it does not belong in any form in the Unicode
C standard.

I concur.

The argument was made that it should be tied to the [orthogonal, in my eyes]
argument of tightening up Unicode 3.2's UTF-8 definition to disallow the
6-byte form. In my eyes, however, it is perfectly acceptable to claim that,
in order to be compliant with the Unicode 3.2 definition of UTF-8, one must
not use the 6-byte form, but that prior versions would allow one to accept it
(if they so desired).

Thus you can make one change without *requiring* the other.

Since the only clients who would emit CESU-8 (PeopleSoft, et al.) are doing
so privately, no UTR is needed for them to do so. And there is a [prior]
version of the standard that can accommodate them.

C You may have heard about hijacking legislative bills.  It is taking an
C existing bill and amending it to change the entire text of the bill.  I
C think that we should hijack PDUTR #26 and replace it with UTF-17.
C
C In actuality we should hijack PDUTR #26 to modify TR27 to specify
C that at a minimum, systems that support UTF-16 must provide code
C point order support services.  We should delete all references to
C CESU-8 and reject the idea of adding CESU-8 to the standard.

I do not know that the former is required; but either way I agree that
CESU-8 (née UTF8-S) should not be included even as a UTR.

However, it is not possible to hijack the current proposal as the author
does not wish this to happen... though I suppose you are welcome to try and
convince him? :-)


MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/