Re: compatibility between unicode 2.0 and 3.0

2003-02-06 Thread Markus Scherer
Doug Ewell wrote:

> That said, there are certain conventions for certain ranges of code
> points.  For example, the range from U+0590 through U+08FF is marked in
> the Roadmap as being reserved for right-to-left scripts, and IIRC there
> are ranges reserved for invisible formatting and control characters
> (U+206x and U+FFFx). ...


Note that Unicode provides property values like the bidi class and Default_Ignorable_Code_Point even 
for unassigned code points.

The only "magic" for implementers of Unicode property APIs is that some of the UCD files do not 
explicitly list properties for unassigned code points, or do not state the default value for 
code points that are not mentioned at all. Sometimes one has to check the header of the file or the 
chapter in the Unicode book, etc. I believe this is being fixed for 4.0.

*Users* of such low-level property APIs need not care about where the implementers get this data, of 
course.
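
For illustration, here is a minimal sketch of how a property API might layer
explicit data over defaults so that every code point, assigned or not, gets an
answer. The names (BidiClass, Range, findRange, bidiClass) are hypothetical,
and the tiny tables are hand-written for this example; real tables would be
generated from the UCD files and their headers.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical, heavily reduced set of bidi classes.
enum class BidiClass { L, R };

// One contiguous range of code points sharing a property value.
struct Range {
    char32_t first;
    char32_t last;
    BidiClass value;
};

// Illustrative subset of explicitly listed (assigned) characters,
// sorted by `first`. Not a copy of any UCD file.
constexpr Range kAssigned[] = {
    {0x0041, 0x005A, BidiClass::L},  // A..Z
    {0x05D0, 0x05EA, BidiClass::R},  // Hebrew letters
};

// Illustrative default for code points a data file does not mention:
// here, everything else in the Hebrew block is treated as R.
constexpr Range kDefaults[] = {
    {0x0590, 0x05FF, BidiClass::R},
};

// Returns the range containing cp, or nullptr if no range does.
// The table must be sorted by `first` and must not overlap.
template <std::size_t N>
const Range* findRange(const Range (&table)[N], char32_t cp) {
    auto it = std::upper_bound(
        std::begin(table), std::end(table), cp,
        [](char32_t c, const Range& r) { return c < r.first; });
    if (it == std::begin(table)) return nullptr;
    --it;  // last range that starts at or before cp
    return cp <= it->last ? it : nullptr;
}

// Every code point gets a value: explicit data first, then the
// per-range default, then a global default (assumed to be L here).
BidiClass bidiClass(char32_t cp) {
    if (const Range* r = findRange(kAssigned, cp)) return r->value;
    if (const Range* r = findRange(kDefaults, cp)) return r->value;
    return BidiClass::L;
}
```

With these tables, bidiClass(0x05EB) falls through to the per-range default,
while a caller of bidiClass() never needs to know which tier answered.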

markus




Re: compatibility between unicode 2.0 and 3.0

2003-02-04 Thread Asmus Freytag
At 10:49 PM 2/3/03 -0800, Doug Ewell wrote:

> Can you please explain what is the best practice to handle unassigned
> code points so that applications can easily become forward compatible?
> If we just ignore unassigned code points, will that make it easier for
> the application to migrate to a later version of Unicode?


In many circumstances, the best approach for unassigned character
codes is to treat them like the characters around them.

An implementation might choose to interpolate the property values
of assigned characters bordering a range of unassigned characters,
using the following rules:

* Look at the nearest assigned characters in both directions.
If they are in the same block, and have the same property value,
then use that value.
* From any block boundary, extending to the nearest assigned
character inside the block, use the property value of that character.
* For all code points entirely in empty or unassigned blocks use the
default property value for that property as given in the Unicode Character
Database.

There are two important benefits of using that approach in implementations.
Property values become much more contiguous, allowing better compaction of
property tables. Furthermore, because similar characters are often
encoded in proximity, chances are good that the interpolated values
will match the actual property values when characters are assigned
to a given code point later.

Of course, many important properties may well not be predictable, but on
the whole, the approach has proven successful.
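
The three interpolation rules above can be sketched roughly as follows. This
is only one possible reading, not a reference implementation: the names
(Block, Assigned, interpolate) are hypothetical, property values are plain
ints, the caller is assumed to pass the block that contains the code point,
and falling back to the default when the two neighbours disagree is an
assumption, since the rules do not cover that case.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Inclusive code point range of one block (hypothetical type).
struct Block {
    char32_t first;
    char32_t last;
};

// Property values of the assigned characters, keyed by code point.
using Assigned = std::map<char32_t, int>;

// Returns a property value for cp, interpolating for unassigned code points.
// `block` must be the block that contains cp.
int interpolate(const Assigned& assigned, const Block& block,
                char32_t cp, int defaultValue) {
    assert(block.first <= cp && cp <= block.last);

    // Assigned characters keep their real value.
    auto hit = assigned.find(cp);
    if (hit != assigned.end()) return hit->second;

    // Nearest assigned neighbours, but only if they lie in the same block.
    const int* below = nullptr;
    const int* above = nullptr;
    auto next = assigned.upper_bound(cp);
    if (next != assigned.end() && next->first <= block.last) {
        above = &next->second;
    }
    if (next != assigned.begin()) {
        auto prev = std::prev(next);
        if (prev->first >= block.first) below = &prev->second;
    }

    // Rule 1: both neighbours in the block and in agreement.
    // (Assumption: on disagreement, use the default.)
    if (below && above) return *below == *above ? *below : defaultValue;
    // Rule 2: the gap runs from a block boundary to one assigned character.
    if (below) return *below;
    if (above) return *above;
    // Rule 3: nothing assigned in this block at all.
    return defaultValue;
}
```

A table builder would presumably run this once per unassigned code point when
generating its compacted property tables, rather than at lookup time.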

A./




Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Doug Ewell
Keyur Shroff  wrote:

> Can you please explain what is the best practice to handle unassigned
> code points so that applications can easily become forward compatible?
> If we just ignore unassigned code points, will that make it easier for
> the application to migrate to a later version of Unicode?

I should probably wait for someone like Ken to come by and provide an
authoritative answer, but until then:

The basic rule is that unassigned code points cannot be interpreted or
modified in any way.  In particular, they cannot simply be thrown away,
or converted to an assigned code point such as U+003F or U+FFFD.
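
As a rough sketch of what "pass through untouched" can look like in code, the
hypothetical helper below applies a per-character mapping (for example a
simple case conversion) and copies every code point it has no entry for.
Mapping and applyMapping are made-up names, and the text is assumed to be
held as UTF-32.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical mapping table built from whichever Unicode version
// the application was compiled against.
using Mapping = std::unordered_map<char32_t, char32_t>;

// Applies the mapping to every code point that has an entry.
std::u32string applyMapping(const std::u32string& in, const Mapping& table) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        auto it = table.find(cp);
        // Code points without an entry, including unassigned ones, are
        // copied as-is: never dropped, never replaced by U+003F or U+FFFD.
        out.push_back(it != table.end() ? it->second : cp);
    }
    return out;
}
```

Because the output always has one code point per input code point, data that
uses characters from a later Unicode version survives a round trip through
older code.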

That said, there are certain conventions for certain ranges of code
points.  For example, the range from U+0590 through U+08FF is marked in
the Roadmap as being reserved for right-to-left scripts, and IIRC there
are ranges reserved for invisible formatting and control characters
(U+206x and U+FFFx).  But I really don't know how advisable it is to,
say, render a string of unassigned code points like ࠁࠂࠃ as RTL just
because it falls within the "RTL block."

Better wait for the experts.

-Doug Ewell
 Fullerton, California





Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Keyur Shroff

--- Kenneth Whistler wrote:
> 
> This depends greatly on what implementation you did for
> sorting and searching, and how it handles unassigned code points
> in your Unicode 2.0 code. If the code was designed to be
> forward compatible, it should do reasonable things with
> unassigned code points, and getting Unicode 3.0 data which
> is actually using those code points should not disturb your
> existing code. But, on the other hand, if you have built
> in a bunch of range checks or have used tables which cannot
> gracefully handle the appearance of unassigned code points
> in your data, then it could well blow up.

Can you please explain what is the best practice to handle unassigned code
points so that applications can easily become forward compatible? If we
just ignore unassigned code points, will that make it easier for the
application to migrate to a later version of Unicode?

- Keyur






Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Kenneth Whistler
Erik Ostermueller asked:

> We have a large amount of C++ that currently has Unicode 2.0 support.
> 
> Could you all help me figure out what types of operations will fail
> if we attempt to pass Unicode 3.0 thru this code?
> 
> I can start the list off with 
> 
> -sorting 
> -searching for text 

This depends greatly on what implementation you did for
sorting and searching, and how it handles unassigned code points
in your Unicode 2.0 code. If the code was designed to be
forward compatible, it should do reasonable things with
unassigned code points, and getting Unicode 3.0 data which
is actually using those code points should not disturb your
existing code. But, on the other hand, if you have built
in a bunch of range checks or have used tables which cannot
gracefully handle the appearance of unassigned code points
in your data, then it could well blow up.
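
To make "reasonable things with unassigned code points" concrete, here is a
small sketch of a sort-key builder that never rejects a code point it does not
know. It is not the Unicode Collation Algorithm; WeightTable, kUnknownBase,
weightOf and sortKey are hypothetical names, and the choice to sort unknown
code points after all known ones, in code point order, is just one plausible
fallback.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical weight table for the characters this build knows about.
using WeightTable = std::unordered_map<char32_t, std::uint32_t>;

// Assumption: every weight stored in the table is below this value.
constexpr std::uint32_t kUnknownBase = 0x80000000u;

// Known characters take their table weight. Anything else gets a
// fallback weight that sorts after all known characters, ordered by
// code point, instead of failing a range check.
std::uint32_t weightOf(char32_t cp, const WeightTable& table) {
    auto it = table.find(cp);
    if (it != table.end()) return it->second;
    return kUnknownBase + static_cast<std::uint32_t>(cp);
}

// Builds a sort key; keys compare lexicographically via operator<.
std::vector<std::uint32_t> sortKey(const std::u32string& s,
                                   const WeightTable& table) {
    std::vector<std::uint32_t> key;
    key.reserve(s.size());
    for (char32_t cp : s) key.push_back(weightOf(cp, table));
    return key;
}
```

Strings containing newer characters then still sort deterministically; they
only start sorting "correctly" once the table is regenerated from newer data.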

The Unicode Collation Algorithm was not defined until after
Unicode 2.0, and was first synched with Unicode 2.1. It has
also been considerably updated since then -- the current version
is aimed at Unicode 3.1. You should take a look at the
current version to check for gotchas you may have in your
current code.

> -text comparison

I assume here you are not talking about language-specific
collation comparisons, but just Unicode analogs of strcmp()
and the like. If so, those should behave well -- they aren't
usually programmed in ways which make them sensitive to
particular code point assignments.
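
For comparison, a strcmp()-style routine over UTF-32 text might look like the
sketch below (compareBinary is a hypothetical name). It consults no property
table at all, which is why newly assigned code points cannot disturb it.

```cpp
#include <cstddef>
#include <string>

// Returns a negative value, zero, or a positive value, like strcmp(),
// comparing the two strings code point by code point.
int compareBinary(const std::u32string& a, const std::u32string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    // One string is a prefix of the other: the shorter one sorts first.
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```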

> -other character classification (isSpace, isDigit, etc...).

Again, these depend on what kinds of forward compatibility
assumptions your original code made. If it provides
meaningful results for unassigned code points in Unicode 2.0,
then tossing Unicode 3.0 data at such APIs shouldn't cause
any problem to existing code, other than not getting the
right results for Unicode 3.0 additions until you have
modified and updated your property tables.

> 
> I understand that these operations probably won't work in ALL cases.
> But how about basic plumbing code -- creating and copying strings?

Constructors and copy constructors ought to work fine, unless
you've done something odd.

What you should be more concerned about, however, is
how your code is going to get from Unicode 3.0 to
Unicode 3.1 (or higher), because then you will have to
deal with supplementary characters. Any assumptions that
characters don't lie outside the range U+0000..U+FFFF
will be broken. Whether this will be a small problem
or a big problem for your code depends on whether you
are effectively processing Unicode in UTF-8, UTF-16,
or UTF-32 (or combinations of those). The biggest hit,
when moving from Unicode 3.0 to Unicode 3.1 (or higher)
is for UTF-16 APIs. See Unicode Technical Note #7,
Migrating Software to Supplementary Characters, for some
ideas:
http://www.unicode.org/notes/tn7/
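
As a small illustration of the UTF-16 case, the sketch below walks a UTF-16
string by code point rather than by 16-bit code unit. The function names
(nextCodePoint, countCodePoints) are hypothetical, and returning an unpaired
surrogate unchanged is an assumption made here to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reads the code point that starts at index i and advances i past it.
// A lead surrogate followed by a trail surrogate is combined into one
// supplementary code point; an unpaired surrogate is returned as-is.
char32_t nextCodePoint(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    const char32_t unit = s[i++];
    const bool isLead = unit >= 0xD800 && unit <= 0xDBFF;
    if (isLead && i < s.size()) {
        const char32_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            return 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return unit;
}

// Counts code points, which can be fewer than s.size() code units.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n) {
        nextCodePoint(s, i);
    }
    return n;
}
```

Code that indexes, truncates or counts by code unit and assumes one unit per
character is the kind of place where this difference tends to show up.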

--Ken






compatibility between unicode 2.0 and 3.0

2003-01-31 Thread Erik.Ostermueller
We have a large amount of C++ that currently has Unicode 2.0 support.

Could you all help me figure out what types of operations will fail
if we attempt to pass Unicode 3.0 thru this code?

I can start the list off with 

-sorting 
-searching for text 
-text comparison
-other character classification (isSpace, isDigit, etc...).

I understand that these operations probably won't work in ALL cases.
But how about basic plumbing code -- creating and copying strings?

As I mentioned in my last post, I've enjoyed
listening in on this forum -- I've learned a whole lot.

Thanks,

--Erik Ostermueller