Re: compatibility between unicode 2.0 and 3.0
Doug Ewell wrote:

> That said, there are certain conventions for certain ranges of code
> points. For example, the range from U+0590 through U+08FF is marked in
> the Roadmap as being reserved for right-to-left scripts, and IIRC there
> are ranges reserved for invisible formatting and control characters
> (U+206x and U+FFFx). ...

Note that Unicode provides property values like the bidi class and Default_Ignorable_Code_Point even for unassigned code points.

The only "magic" for implementers of Unicode property APIs is that some of the UCD files do not explicitly list properties for unassigned code points, or do not say which default value applies to code points that are not mentioned at all. Sometimes one has to check the header of the file or the chapter in the Unicode book, etc. I believe this is being fixed for 4.0.

*Users* of such low-level property APIs need not care about where the implementers get this data, of course.

markus
Re: compatibility between unicode 2.0 and 3.0
At 10:49 PM 2/3/03 -0800, Doug Ewell wrote:

> Can you please explain what is the best practice to handle unassigned
> code points so that applications can easily become forward compatible?
> If we just ignore unassigned code points, then will it make it easier
> for the application to migrate to a later version of Unicode?

In many circumstances, the best approach for unassigned character codes is to treat them like the characters around them.

An implementation might choose to interpolate the property values of assigned characters bordering a range of unassigned characters, using the following rules:

* Look at the nearest assigned characters in both directions. If they are in the same block, and have the same property value, then use that value.
* From any block boundary, extending to the nearest assigned character inside the block, use the property value of that character.
* For all code points entirely in empty or unassigned blocks use the default property value for that property as given in the Unicode Character Database.

There are two important benefits of using that approach in implementations. Property values become much more contiguous, allowing better compaction of property tables. Furthermore, because similar characters are often encoded in proximity, chances are good that the interpolated values will match the actual property values when characters are assigned to a given code point later.

Of course, many important properties may well not be predictable, but on the whole, the approach has proven successful.

A./
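The three interpolation rules in the message above can be sketched in C++ roughly as follows. This is only an illustration under assumptions: the names `Block`, `Assigned` and `interpolate` are hypothetical, the caller is assumed to know which block a code point falls in, and the case the message leaves open (two neighbors in the same block with different values) falls back to the default here, which is my choice and not something the message specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical types for illustration only; not from any real library.
using CodePoint = std::uint32_t;
using Property = int;

struct Block {
    CodePoint first;  // first code point of the block (inclusive)
    CodePoint last;   // last code point of the block (inclusive)
};

// Assigned characters and their property values, keyed by code point.
using Assigned = std::map<CodePoint, Property>;

// Returns the real value for an assigned code point, otherwise a value
// interpolated from the nearest assigned neighbors inside the same block.
Property interpolate(CodePoint cp, const Block& block,
                     const Assigned& assigned, Property defaultValue) {
    assert(block.first <= cp && cp <= block.last);

    auto it = assigned.lower_bound(cp);  // first assigned code point >= cp
    if (it != assigned.end() && it->first == cp) {
        return it->second;  // cp is assigned: no interpolation needed
    }

    // Nearest assigned neighbor above cp, if it lies in the same block.
    const Property* above = nullptr;
    if (it != assigned.end() && it->first <= block.last) {
        above = &it->second;
    }

    // Nearest assigned neighbor below cp, if it lies in the same block.
    const Property* below = nullptr;
    if (it != assigned.begin()) {
        auto prev = std::prev(it);
        if (prev->first >= block.first) {
            below = &prev->second;
        }
    }

    if (below && above) {
        // Rule 1: both neighbors agree. If they disagree, use the default
        // (assumption: the message does not cover this case).
        return (*below == *above) ? *below : defaultValue;
    }
    if (below) return *below;  // Rule 2: gap runs up to the block boundary
    if (above) return *above;  // Rule 2: gap starts at the block boundary
    return defaultValue;       // Rule 3: nothing assigned in this block
}
```

In a real implementation one would presumably run this once when building the property tables, so that the interpolated runs can be stored as contiguous ranges, instead of calling it per lookup.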
Re: compatibility between unicode 2.0 and 3.0
Keyur Shroff wrote:

> Can you please explain what is the best practice to handle unassigned
> code points so that applications can easily become forward compatible?
> If we just ignore unassigned code points, then will it make it easier
> for the application to migrate to a later version of Unicode?

I should probably wait for someone like Ken to come by and provide an authoritative answer, but until then:

The basic rule is that unassigned code points cannot be interpreted or modified in any way. In particular, they cannot simply be thrown away, or converted to an assigned code point such as U+003F or U+FFFD.

That said, there are certain conventions for certain ranges of code points. For example, the range from U+0590 through U+08FF is marked in the Roadmap as being reserved for right-to-left scripts, and IIRC there are ranges reserved for invisible formatting and control characters (U+206x and U+FFFx).

But I really don't know how advisable it is to, say, render a string of unassigned code points like ࠁࠂࠃ as RTL just because it falls within the "RTL block." Better wait for the experts.

-Doug Ewell
Fullerton, California
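As a small illustration of the basic rule in the message above, here is a hedged C++ sketch of a character-mapping pass that leaves anything it does not know about untouched. The names `Mapping` and `mapKnown` are hypothetical, and the table stands in for whatever mapping data (case mapping, for example) an application built from its version of Unicode.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical mapping table built from the Unicode version the
// application was written against.
using Mapping = std::unordered_map<char32_t, char32_t>;

// Applies the table to a UTF-32 string. Code points that are not in the
// table, including ones unassigned in the application's Unicode version,
// are copied through unchanged: never dropped, never replaced with
// U+003F or U+FFFD.
std::u32string mapKnown(const std::u32string& in, const Mapping& table) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        auto it = table.find(cp);
        out.push_back(it != table.end() ? it->second : cp);
    }
    assert(out.size() == in.size());  // nothing was thrown away
    return out;
}
```

The point of the design is that text containing characters from a later Unicode version survives a round trip through older code, even though the older code cannot do anything useful with those characters.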
Re: compatibility between unicode 2.0 and 3.0
--- Kenneth Whistler <[EMAIL PROTECTED]> wrote:

>
> This depends greatly on what implementation you did for
> sorting and searching, and how it handles unassigned code points
> in your Unicode 2.0 code. If the code was designed to be
> forward compatible, it should do reasonable things with
> unassigned code points, and getting Unicode 3.0 data which
> is actually using those code points should not disturb your
> existing code. But, on the other hand, if you have built
> in a bunch of range checks or have used tables which cannot
> gracefully handle the appearance of unassigned code points
> in your data, then it could well blow up.

Can you please explain what is the best practice to handle unassigned code points so that applications can easily become forward compatible? If we just ignore unassigned code points, then will it make it easier for the application to migrate to a later version of Unicode?

- Keyur
Re: compatibility between unicode 2.0 and 3.0
Erik Ostermueller asked:

> We have a large amount of C++ that currently has Unicode 2.0 support.
>
> Could you all help me figure out what types of operations will fail
> if we attempt to pass Unicode 3.0 thru this code?
>
> I can start the list off with
>
> -sorting
> -searching for text

This depends greatly on what implementation you did for sorting and searching, and how it handles unassigned code points in your Unicode 2.0 code. If the code was designed to be forward compatible, it should do reasonable things with unassigned code points, and getting Unicode 3.0 data which is actually using those code points should not disturb your existing code. But, on the other hand, if you have built in a bunch of range checks or have used tables which cannot gracefully handle the appearance of unassigned code points in your data, then it could well blow up.

The Unicode Collation Algorithm was not defined until after Unicode 2.0, and was first synched with Unicode 2.1. It has also been considerably updated since then -- the current version is aimed at Unicode 3.1. You should take a look at the current version to check for gotchas you may have in your current code.

> -text comparison

I assume here you are not talking about language-specific collation comparisons, but just Unicode analogs of strcmp() and the like. If so, those should behave well -- they aren't usually programmed in ways which make them sensitive to particular code point assignments.

> -other character classification (isSpace, isDigit, etc...).

Again, these depend on what kinds of forward compatibility assumptions your original code made. If it provides meaningful results for unassigned code points in Unicode 2.0, then tossing Unicode 3.0 data at such APIs shouldn't cause any problem to existing code, other than not getting the right results for Unicode 3.0 additions until you have modified and updated your property tables.

>
> I understand that these operations probably won't work in ALL cases.
> But how about basic plumbing code -- creating and copying strings?

Constructors and copy constructors ought to work fine, unless you've done something odd.

What you should be more concerned about, however, is how your code is going to get from Unicode 3.0 to Unicode 3.1 (or higher), because then you will have to deal with supplementary characters. Any assumptions that characters don't lie outside the range U+..U+ will be broken. Whether this will be a small problem or a big problem for your code depends on whether you are effectively processing Unicode in UTF-8, UTF-16, or UTF-32 (or combinations of those). The biggest hit, when moving from Unicode 3.0 to Unicode 3.1 (or higher), is for UTF-16 APIs.

See Unicode Technical Note #7, Migrating Software to Supplementary Characters, for some ideas:

http://www.unicode.org/notes/tn7/

--Ken

>
> As I mentioned in my last post, I've enjoyed
> listening in on this forum -- I've learned a whole lot.
>
> Thanks,
>
> --Erik Ostermueller
>
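To make the UTF-16 concern in Ken's reply a bit more concrete, here is a minimal C++ sketch of code that no longer assumes one 16-bit unit per character. The function names `codePointAt` and `countCodePoints` are hypothetical; the surrogate ranges (lead units 0xD800..0xDBFF, trail units 0xDC00..0xDFFF) are the standard UTF-16 ones. Passing an unpaired surrogate through as-is is an assumption of this sketch, not a recommendation from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace {
bool isLead(char16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
bool isTrail(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }
}  // namespace

// Returns the code point starting at index i of a UTF-16 string.
// A well-formed surrogate pair is combined into one supplementary
// code point; an unpaired surrogate is returned unchanged (assumption).
char32_t codePointAt(const std::u16string& s, std::size_t i) {
    assert(i < s.size());
    char16_t u = s[i];
    if (isLead(u) && i + 1 < s.size() && isTrail(s[i + 1])) {
        return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                       + (char32_t(s[i + 1]) - 0xDC00);
    }
    return u;
}

// Counts code points rather than 16-bit code units, so a supplementary
// character counts once even though it occupies two units.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isLead(s[i]) && i + 1 < s.size() && isTrail(s[i + 1])) {
            ++i;  // skip the trail unit of a well-formed pair
        }
        ++n;
    }
    return n;
}
```

Code that indexes, truncates or classifies strings one 16-bit unit at a time is the kind of place where this difference tends to show up.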
compatibility between unicode 2.0 and 3.0
We have a large amount of C++ that currently has Unicode 2.0 support.

Could you all help me figure out what types of operations will fail if we attempt to pass Unicode 3.0 thru this code?

I can start the list off with

-sorting
-searching for text
-text comparison
-other character classification (isSpace, isDigit, etc...).

I understand that these operations probably won't work in ALL cases. But how about basic plumbing code -- creating and copying strings?

As I mentioned in my last post, I've enjoyed listening in on this forum -- I've learned a whole lot.

Thanks,

--Erik Ostermueller