RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)
Michael (michka) Kaplan wrote: "... then the conversion will simply strip the errant characters. Note that either solution meets the needs of refusal to interpret the errant sequences." Simply stripping the errant byte sequences means that each of them is interpreted as the empty string of characters. To me, that contradicts C12a: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition, and shall not interpret such sequences as characters." On the other hand, I think C12a is too harsh. It essentially requires either an error stop, or at least division of the input into a sequence of runs of text with possible error byte (for UTF-8) sequences at the borders between the runs. I think it would be OK to replace each errant byte sequence with characters that indicate that there may have been an error (which excludes the empty string). SUBSTITUTE ("SUB is used in the place of a character [sic] that has been found to be invalid or in error, SUB is intended to be introduced by automatic means") seems to fit that. (Ken's Titan discussion earlier is at a much lower protocol level: byte string, or even bit string level.) /kent k
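The three behaviours discussed in this message (error stop, silent stripping, and substitution of an error indicator) can be compared with a minimal Python sketch. It relies on the standard `bytes.decode` error policies and on `codecs.register_error`; the handler name "sub" and the function names `sub_handler` and `decode_with_policies` are assumptions made for this illustration, not part of any API mentioned in the thread.

```python
import codecs


def sub_handler(exc):
    """Replace each ill-formed byte sequence with U+001A SUBSTITUTE."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position at which to resume.
        return ("\x1a", exc.end)
    raise exc


# "sub" is a handler name chosen for this sketch, not a built-in policy.
codecs.register_error("sub", sub_handler)


def decode_with_policies(raw: bytes) -> dict:
    """Decode the same bytes under four different error policies."""
    results = {}
    try:
        results["strict"] = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # Error stop: report where the ill-formed sequence begins.
        results["strict"] = f"error at byte offset {exc.start}"
    # Stripping: the errant bytes vanish without a trace.
    results["ignore"] = raw.decode("utf-8", errors="ignore")
    # Substitution with U+FFFD REPLACEMENT CHARACTER.
    results["replace"] = raw.decode("utf-8", errors="replace")
    # Substitution with U+001A SUBSTITUTE, as suggested above.
    results["sub"] = raw.decode("utf-8", errors="sub")
    return results


# 0xFF can never appear in well-formed UTF-8.
sample = b"abc\xffdef"
for policy, outcome in decode_with_policies(sample).items():
    print(f"{policy:8} {outcome!r}")
```

Only the "ignore" policy loses the information that the input was damaged, which is the robustness concern raised here.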
Impossible combinations?
I'm working on a Latin-based font that's got a large number of kerning pairs already defined and I'm trying to pare this list of pairs down to the bare minimum. There seem to be many pairs which are unlikely ever to be used. These pairs all involve a lowercase on the left with an uppercase on the right. My intuition is to delete all such pairs but since I am not a linguist I thought I'd better check first. Does anyone know of a Latin-based language in which it is possible to have a lowercase immediately followed by an uppercase in the SAME word? Thanks, Kevin
Some of Andy's assertions
1. The sequence 'Vowel+Virama+Ya...' is illogical to scholars of Bengali and indeed Indic languages in general. I refuted this yesterday by indicating that this usage is an innovation.
2. Such sequences are not semantically equivalent to the intended ... sentence fragment.
3. There are no other cases of a Vowel+Virama combination in the Unicode encoding model. Yes, there are. Khmer.
4. Yaphalaa is not equivalent to 'Virama+Ya'. Yes, it is, as I showed yesterday.
5. ISCII implementations encode these letters as separate characters corresponding to the Devanagari Candra A E. Unicode should follow the example of these implementations. No, it shouldn't. Unicode has a method for writing these sequences already and a second method for doing so should not be introduced. Use mapping tables to exchange ISCII and Unicode data.
-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Please see my latest proposal
Andy, Your BENGALI LETTER OPEN O can be encoded already with the sequence U+0985 U+09CD U+09AF. Your BENGALI LETTER CENTRAL E can be encoded already with the sequence U+098F U+09CD U+09AF. There is no need to bring the Bengali code block in line with the Devanagari block. -- Michael Everson * * Everson Typography * * http://www.evertype.com
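The two sequences named above can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch: the constant names `OPEN_O` and `CENTRAL_E` and the helper `describe` are assumptions introduced here, and the code shows the existing code points involved without making any claim about how a given font renders them.

```python
import unicodedata

# Sequences quoted in the message above.
OPEN_O = "\u0985\u09CD\u09AF"     # U+0985 U+09CD U+09AF
CENTRAL_E = "\u098F\u09CD\u09AF"  # U+098F U+09CD U+09AF


def describe(sequence: str) -> list:
    """List each code point in the sequence with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]


for label, seq in (("open o", OPEN_O), ("central e", CENTRAL_E)):
    print(label)
    for line in describe(seq):
        print("  " + line)
```

Both sequences consist of a vowel letter followed by VIRAMA and YA, which is the 'Vowel+Virama+Ya' pattern under discussion.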
Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)
I agree with Kent that it is somewhat less robust to simply remove ill-formed sequences, since it removes any indication that the data was corrupted. It is better either to signal an error or to insert some other indication, like a REPLACEMENT CHARACTER or SUB, at that point. (And in my reading, C12a does allow that; you are not interpreting the sequence as a character, you are replacing a host of possible errant sequences by an error indicator.) But the final decision should be made by the user of the API, since the desired behavior may vary depending on the environment. Mark [EMAIL PROTECTED] IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193 (408) 256-3148 fax: (408) 256-0799 - Original Message - From: Kent Karlsson [EMAIL PROTECTED] To: 'Michael (michka) Kaplan' [EMAIL PROTECTED] Cc: 'Yung-Fong Tang' [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Sunday, March 02, 2003 02:00 Subject: RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)
Re: Impossible combinations?
On Sun, 2 Mar 2003, Kevin Brown wrote: "Does anyone know of a Latin-based language in which it is possible to have a lowercase immediately followed by an uppercase in the SAME word?" That happens in many common names, like McGowan. It is also used in tech terms that need to avoid spaces for some reason or other (domain name technical restrictions, for example), like FarsiWeb, the name of a project on Persian standardization issues, or SearchEngine, the term Arthur C Clarke used for some global search engine and AI in his late book The Light of Other Days, co-authored by Stephen Baxter. roozbeh
Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)
From: Mark Davis [EMAIL PROTECTED] "I agree with Kent that it is somewhat less robust to simply remove ill-formed sequences, since it removes any indication that the data was corrupted." Nice that the API gives one the option to choose, huh? ;-) The notion of continuing (even if one is limping along, removing invalid sequences) is to help some of the backcompat story, where there were no errors previously -- without adding security errors due to non-shortest form strings. "But the final decision should be made by the user of the API, since the desired behavior may vary depending on the environment." Also agreed. MichKa
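The security concern about non-shortest form strings can be illustrated with a short Python sketch. The function `naive_decode_two_bytes` is a hypothetical lenient decoder written only for this example (it is not the API discussed in the thread); `strict_decode` wraps Python's standard UTF-8 decoder, which rejects such sequences.

```python
from typing import Optional


def naive_decode_two_bytes(pair: bytes) -> str:
    """Hypothetical lenient decoder: merges payload bits with no range check."""
    lead, trail = pair
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


def strict_decode(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if the input is ill-formed UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Non-shortest (overlong) two-byte form of U+002F SOLIDUS.
overlong_slash = b"\xc0\xaf"

# A lenient decoder yields "/", which a byte-level filter for 0x2F never saw.
print(repr(naive_decode_two_bytes(overlong_slash)))

# A conformant decoder refuses the sequence.
print(strict_decode(overlong_slash))

# Substitution keeps going, but no "/" is produced from the bad bytes.
print(repr(overlong_slash.decode("utf-8", errors="replace")))
```

This is why continuing past an error is only safe if the errant bytes are dropped or replaced rather than interpreted.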
Re: Impossible combinations?
At 21:01 +0330 2003-03-02, Roozbeh Pournader wrote: That happens in many common names, like McGowan. Noble names, Roozbeh. ;-) -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Impossible combinations?
On Sun, 2 Mar 2003, Kevin Brown wrote: Does anyone know of a Latin-based language in which it is possible to have a lowercase immediately followed by an uppercase in the SAME word? In addition to the examples pointed out by Roozbeh and Michael, this pattern is growing increasingly common in commercial English, where such forms as eBusiness and eSecurity are enjoying increasing vogue. And CamelCasing is apparent not only in technical terminology, but has spread to company names and the like, as well. Consider, e.g., PayPal. --Ken
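One way to decide which lowercase-uppercase kerning pairs are worth keeping is to count the pairs that actually occur in sample text. The sketch below is a minimal illustration using only the Python standard library; the function name `lower_upper_pairs` and the sample sentence are assumptions, and a real survey would need a representative corpus.

```python
from collections import Counter


def lower_upper_pairs(text: str) -> Counter:
    """Count adjacent (lowercase, uppercase) letter pairs in the text."""
    pairs = Counter()
    for left, right in zip(text, text[1:]):
        # str.islower/isupper also cover non-ASCII Latin letters.
        if left.islower() and right.isupper():
            pairs[left + right] += 1
    return pairs


# Sample built from the examples given in this thread.
sample = "McGowan uses PayPal for eBusiness at FarsiWeb"
for pair, count in sorted(lower_upper_pairs(sample).items()):
    print(pair, count)
```

Pairs that never appear in such a count are candidates for removal, while the ones found in names and camel-cased terms are the ones to review before deleting.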
Re: Impossible combinations?
At 04:11 AM 3/2/2003, Kevin Brown wrote: "I'm working on a Latin-based font that's got a large number of kerning pairs already defined and I'm trying to pare this list of pairs down to the bare minimum. There seem to be many pairs which are unlikely ever to be used. These pairs all involve a lowercase on the left with an uppercase on the right. My intuition is to delete all such pairs but since I am not a linguist I thought I'd better check first. Does anyone know of a Latin-based language in which it is possible to have a lowercase immediately followed by an uppercase in the SAME word?" This is not uncommon in some of the Bantu languages; I can't remember which ones, but at least one major regional language in southern Africa. You should be aware that there are lots of applications that gag on large numbers of kerning pairs. Thomas Phinney in the type group at Adobe advised us that 3,000 standard kern pairs is about the maximum one can expect to work in all apps. Some applications will fail to support the full range of kerning pairs if there are too many; some applications will not support any kerning if there are too many pairs; and some older applications may even crash. In OpenType fonts, using GPOS instead of kern table kerning, you can employ class-based kerning, which can be very handy for large fonts. Some systems will decompile GPOS kerning to standard kerning on the fly, which may result in subsetting of kerning (Adobe Type Manager and the CFF rasteriser in Windows do this for PS-flavour OT fonts, subsetting to Windows CP 1252 support). The subsetting is necessary because the fully decompiled class-based kerning for a font can easily overload many applications (the class-based kerning in Adobe's Minion Pro decompiles to approx. 70,000 pairs). Adobe's latest applications, e.g. InDesign, make direct use of GPOS kerning, so can access all the kerning in a font. Hopefully more applications and systems will soon follow suit.
Windows supports GPOS kerning for complex scripts via Uniscribe, but not yet for Latin or other 'simple' scripts. Finally, bear in mind that an excessive number of kerning pairs may indicate that your font has fundamental spacing problems. It is often possible to reduce the number of kerning pairs by revising the sidebearings to produce a better pre-kern fit. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] It is necessary that by all means and cunning, the cursed owners of books should be persuaded to make them available to us, either by argument or by force. - Michael Apostolis, 1467
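The way class-based kerning decompiles into a much larger list of flat pairs can be sketched in a few lines of Python. The data layout here (plain dictionaries of glyph classes and class-pair values) and the name `expand_class_kerning` are assumptions for illustration only; this is not the GPOS binary format, and the glyph classes and values are invented.

```python
from itertools import product


def expand_class_kerning(left_classes, right_classes, class_values):
    """Flatten class-pair kerning into individual glyph-pair kerning."""
    flat = {}
    for (left_name, right_name), value in class_values.items():
        # Every glyph in the left class pairs with every glyph in the right.
        for left, right in product(left_classes[left_name],
                                   right_classes[right_name]):
            flat[(left, right)] = value
    return flat


# Invented example classes and values.
left_classes = {"A_like": ["A", "À", "Á", "Â"], "T_like": ["T", "Ť"]}
right_classes = {"o_like": ["o", "ò", "ó", "ô", "õ"], "V_like": ["V", "W"]}
class_values = {("A_like", "V_like"): -80, ("T_like", "o_like"): -60}

flat_pairs = expand_class_kerning(left_classes, right_classes, class_values)
print(len(class_values), "class pairs expand to", len(flat_pairs), "flat pairs")
```

Each class pair contributes the product of its class sizes, so a modest number of class pairs over large classes quickly reaches tens of thousands of flat pairs.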
Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)
At 07:21 AM 3/2/03 -0800, Mark Davis wrote: "C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition, and shall not interpret such sequences as characters." Can we agree or disagree on whether an API that returns an error code, but also an output buffer that contains a simplistic conversion of the erroneous sequence, is or is not conformant? To me it seems that by setting an error flag in the return code, the API has signalled that the user should not treat the output as containing correct Unicode. Such an API design (on a low enough level) might strike the right balance between usability in many different environments and satisfying the formal requirement. The ideal case is one where the converter stops in a restartable configuration, allowing the client to implement (or ask for) a variety of error-recovery options. However, such an interface requires a lot of thought and may be difficult to implement for some language/platform/library environments. Further, it may be unnecessarily difficult to use for at least some conceivable clients. A./
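A converter that stops in a restartable configuration might look roughly like the following Python sketch. All names (`ConvertResult`, `convert_until_error`, `convert_all`) are hypothetical and chosen for this illustration; the sketch builds on the `start` and `end` attributes of Python's `UnicodeDecodeError` and is one possible shape for such an interface, not a description of any existing API.

```python
from typing import Callable, NamedTuple


class ConvertResult(NamedTuple):
    text: str      # text converted before the error (or all of it)
    consumed: int  # offset in the input where conversion stopped
    bad: bytes     # the ill-formed bytes found there, empty if none


def convert_until_error(raw: bytes, offset: int = 0) -> ConvertResult:
    """Convert from offset up to the first ill-formed sequence, then stop."""
    try:
        return ConvertResult(raw[offset:].decode("utf-8"), len(raw), b"")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end are relative to the slice that was decoded.
        stop = offset + exc.start
        good = raw[offset:stop].decode("utf-8")
        return ConvertResult(good, stop, raw[stop:offset + exc.end])


def convert_all(raw: bytes, on_error: Callable[[bytes], str]) -> str:
    """Client-side loop: restart after each error using a chosen policy."""
    out = []
    pos = 0
    while pos < len(raw):
        result = convert_until_error(raw, pos)
        out.append(result.text)
        pos = result.consumed
        if result.bad:
            out.append(on_error(result.bad))  # the client decides what to emit
            pos += len(result.bad)
    return "".join(out)


damaged = b"ab\xffcd\xc0\xafef"
print(convert_until_error(damaged))
print(repr(convert_all(damaged, lambda bad: "\ufffd")))  # substitute
print(repr(convert_all(damaged, lambda bad: "")))        # strip
```

The low-level function reports the error and never emits text for the bad bytes, while the recovery policy (substitute, strip, or abort by raising inside `on_error`) is left to the caller.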