RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Kent Karlsson


Michael (michka) Kaplan:
...
 then the conversion will simply strip the errant characters. Note that
 either solution meets the needs of refusal to interpret the errant
 sequences.

Simply stripping the errant byte sequences means that they are
each interpreted as the empty string of characters.  To me, that
contradicts:

   C12a When a process interprets a code unit sequence which
   purports to be in a Unicode character encoding form, it
   shall treat ill-formed code unit sequences as an error
   condition, and shall not interpret such sequences as
   characters.

On the other hand I think C12a is too harsh.  It essentially
requires either an error stop, or at least division of the
input into a sequence of runs of text with possible error
byte (for UTF-8) sequences at the borders between the runs.
I think it would be ok to replace an errant byte sequence with
characters that indicate that there may have been an error
(which excludes the empty string).  SUBSTITUTE (SUB is used
in the place of a character [sic] that has been found to be
invalid or in error; SUB is intended to be introduced by
automatic means) seems to fit that.
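
As a rough illustration of the SUB idea, here is a minimal Python sketch.
It assumes Python's codec error-handler hook (codecs.register_error); the
handler name "sub_on_error" is made up for this example and is not part of
any standard or of the API discussed in this thread.

```python
import codecs

SUB = "\x1a"  # U+001A SUBSTITUTE

def sub_handler(err):
    """Replace each ill-formed byte sequence with SUB and resume after it."""
    if isinstance(err, UnicodeDecodeError):
        # err.end is the index just past the offending bytes.
        return (SUB, err.end)
    raise err

# "sub_on_error" is a hypothetical handler name chosen for this sketch.
codecs.register_error("sub_on_error", sub_handler)

# 0xFF can never occur in well-formed UTF-8.
text = b"ab\xffcd".decode("utf-8", errors="sub_on_error")
```

The result keeps a visible marker where the corruption was, rather than
silently closing the gap.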

(Ken's Titan discussion earlier is at a much lower protocol
level; byte string, or even bit string level).

/kent k




Impossible combinations?

2003-03-02 Thread Kevin Brown
I'm working on a Latin-based font that's got a large number of kerning 
pairs already defined and I'm trying to pare this list of pairs down to 
the bare minimum. There seem to be many pairs which are unlikely ever to 
be used. These pairs all involve a lowercase on the left with an 
uppercase on the right.

My intuition is to delete all such pairs but since I am not a linguist I 
thought I'd better check first. Does anyone know of a Latin-based 
language in which it is possible to have a lowercase immediately followed 
by an uppercase in the SAME word?

Thanks, Kevin



Some of Andy's assertions

2003-03-02 Thread Michael Everson
1. The sequence 'Vowel+Virama+Ya...' is illogical to scholars of 
Bengali and indeed Indic languages in general.
I refuted this yesterday by indicating that this usage is an innovation.

2. Such sequences are not semantically equivalent to the intended
... sentence fragment.

3. There are no other cases of a Vowel+Virama combination in the 
Unicode encoding model.
Yes, there are. Khmer.

4. Yaphalaa is not equivalent to 'Virama+Ya'
Yes, it is, as I showed yesterday.

5. ISCII implementations encode these letters as separate characters 
corresponding to the Devanagari Candra A and E. Unicode should follow 
the example of these implementations.
No, it shouldn't. Unicode has a method for writing these sequences 
already and a second method for doing so should not be introduced. 
Use mapping tables to exchange ISCII and Unicode data.
--
Michael Everson * Everson Typography * http://www.evertype.com



Re: Please see my latest proposal

2003-03-02 Thread Michael Everson
Andy,

Your BENGALI LETTER OPEN O can be encoded already with the sequence 
U+0985 U+09CD U+09AF.

Your BENGALI LETTER CENTRAL E can be encoded already with the 
sequence U+098F U+09CD U+09AF.

There is no need to bring the Bengali code block in line with the 
Devanagari block.
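
For readers who want to inspect the two sequences, a small Python sketch
follows. The variable names OPEN_O and CENTRAL_E are just labels taken from
the proposal's wording; the character names come from Python's unicodedata
module.

```python
import unicodedata

# The two sequences given above, written as code point escapes.
OPEN_O = "\u0985\u09CD\u09AF"     # A + VIRAMA + YA
CENTRAL_E = "\u098F\u09CD\u09AF"  # E + VIRAMA + YA

def describe(sequence):
    """Return the Unicode character name of each code point in the sequence."""
    return [unicodedata.name(ch) for ch in sequence]

open_o_names = describe(OPEN_O)
central_e_names = describe(CENTRAL_E)
```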
--
Michael Everson * Everson Typography * http://www.evertype.com



Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Mark Davis
I agree with Kent that it is somewhat less robust to simply remove
ill-formed sequences, since doing so removes any indication that the data
was corrupted. It is better either to signal an error or to insert some
other indication, like a REPLACEMENT CHARACTER or SUB, at that point. (And
in my reading, C12a does allow that; you are not interpreting the sequence
as a character, you are replacing a host of possible errant sequences by an
error indicator.) But the final decision should be made by the user of the
API, since the desired behavior may vary depending on the environment.
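
The three behaviors can be compared side by side with a short Python
sketch. It uses Python's built-in decoder error policies ("strict",
"ignore", "replace") as a stand-in; the wrapper decode_utf8 is a
hypothetical name, and this is not the API being discussed in the thread.

```python
def decode_utf8(raw: bytes, policy: str = "strict") -> str:
    """Decode UTF-8, letting the caller pick the error policy.

    policy: "strict"  -> raise on ill-formed input (signal an error)
            "ignore"  -> drop ill-formed bytes (the stripping behavior)
            "replace" -> insert U+FFFD REPLACEMENT CHARACTER
    """
    return raw.decode("utf-8", errors=policy)

data = b"abc\xffdef"  # 0xFF can never occur in well-formed UTF-8

stripped = decode_utf8(data, "ignore")
replaced = decode_utf8(data, "replace")

try:
    decode_utf8(data, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Only the stripped result is indistinguishable from a clean decode of
b"abcdef", which is the robustness concern raised above.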

Mark

IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799







Re: Impossible combinations?

2003-03-02 Thread Roozbeh Pournader
On Sun, 2 Mar 2003, Kevin Brown wrote:

 Does anyone know of a Latin-based language in which it is possible to
 have a lowercase immediately followed by an uppercase in the SAME word?

That happens in many common names, like McGowan. It is also used in
tech terms that need to avoid spaces for some reason or other (domain name
technical restrictions, for example), like FarsiWeb, the name of a
project on Persian standardization issues, or SearchEngine, the term
Arthur C. Clarke used for a global search engine and AI in his recent book
The Light of Other Days, co-authored with Stephen Baxter.

roozbeh




Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Michael (michka) Kaplan
From: Mark Davis

 I agree with Kent that it is somewhat less robust to simply remove
 ill-formed sequences, since it removes any indication that the data
 was corrupted.

Nice that the API gives one the option to choose, huh? ;-)

The notion of continuing (even if one is limping along, removing
invalid sequences) is to help some of the backcompat story, where
there were no errors previously -- without adding security errors due
to non-shortest form strings.
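
To make the non-shortest-form concern concrete, here is a small Python
sketch. The function naive_decode_2byte is a deliberately lax decoder
written only for this illustration; the byte pair C0 AF is the well-known
overlong encoding of "/".

```python
OVERLONG_SLASH = b"\xc0\xaf"  # ill-formed two-byte encoding of U+002F "/"

def naive_decode_2byte(raw: bytes) -> str:
    """Deliberately lax 2-byte decoder: it skips the shortest-form check."""
    code_point = ((raw[0] & 0x1F) << 6) | (raw[1] & 0x3F)
    return chr(code_point)

# A lax decoder turns the ill-formed bytes into a real "/" ...
lax_result = naive_decode_2byte(OVERLONG_SLASH)

# ... while a conformant decoder refuses to interpret them.
try:
    OVERLONG_SLASH.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

# Stripping also avoids producing the "/", without stopping.
stripped = OVERLONG_SLASH.decode("utf-8", errors="ignore")
```

Any check for "/" that runs on the bytes before a lax decode would miss
this sequence; both rejecting and stripping avoid that.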

 But the final decision should be made by the user of the API, since
 the desired behavior may vary depending on the environment.

Also agreed.

MichKa




Re: Impossible combinations?

2003-03-02 Thread Michael Everson
At 21:01 +0330 2003-03-02, Roozbeh Pournader wrote:

That happens in many common names, like McGowan.
Noble names, Roozbeh. ;-)
--
Michael Everson * Everson Typography * http://www.evertype.com


Re: Impossible combinations?

2003-03-02 Thread Kenneth Whistler

 On Sun, 2 Mar 2003, Kevin Brown wrote:
 
  Does anyone know of a Latin-based language in which it is possible to
  have a lowercase immediately followed by an uppercase in the SAME word?

In addition to the examples pointed out by Roozbeh and Michael,
this pattern is growing increasingly common in commercial English,
where such forms as eBusiness and eSecurity are enjoying
increasing vogue. And CamelCasing is apparent not only in
technical terminology, but has spread to company names and the
like, as well. Consider, e.g., PayPal.
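
One way to check which lowercase+uppercase kerning pairs actually occur is
to scan a word list. The Python sketch below does that for the examples
mentioned in this thread; the function name lower_upper_pairs is invented
here, and a real survey would of course need a much larger corpus.

```python
def lower_upper_pairs(words):
    """Collect every (lowercase, uppercase) letter pair found inside a word."""
    pairs = set()
    for word in words:
        # Walk over adjacent character pairs within the same word.
        for left, right in zip(word, word[1:]):
            if left.islower() and right.isupper():
                pairs.add((left, right))
    return pairs

# Examples taken from the replies in this thread.
examples = ["McGowan", "FarsiWeb", "SearchEngine",
            "eBusiness", "eSecurity", "PayPal"]
found = lower_upper_pairs(examples)
```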

--Ken




Re: Impossible combinations?

2003-03-02 Thread John Hudson
At 04:11 AM 3/2/2003, Kevin Brown wrote:

 I'm working on a Latin-based font that's got a large number of kerning
 pairs already defined and I'm trying to pare this list of pairs down to
 the bare minimum. There seem to be many pairs which are unlikely ever to
 be used. These pairs all involve a lowercase on the left with an
 uppercase on the right.

 My intuition is to delete all such pairs but since I am not a linguist I
 thought I'd better check first. Does anyone know of a Latin-based
 language in which it is possible to have a lowercase immediately followed
 by an uppercase in the SAME word?

This is not uncommon in some of the Bantu languages; I can't remember which 
ones, but they include at least one major regional language in southern 
Africa.

You should be aware that there are lots of applications that gag on large 
numbers of kerning pairs. Thomas Phinney in the type group at Adobe advised 
us that 3,000 standard kern pairs is about the maximum one can expect to 
work in all apps. Some applications will fail to support the full range of 
kerning pairs if there are too many; some applications will not support any 
kerning if there are too many pairs; and some older applications may even 
crash.

In OpenType fonts, using GPOS instead of kern table kerning, you can employ 
class-based kerning, which can be very handy for large fonts. Some systems 
will decompile GPOS kerning to standard kerning on the fly, which may 
result in subsetting of kerning (Adobe Type Manager and the CFF rasteriser 
in Windows do this for PS-flavour OT fonts, subsetting to Windows CP 1252 
support). The subsetting is necessary because the fully decompiled 
class-based kerning for a font can easily overload many applications (the 
class-based kerning in Adobe's Minion Pro decompiles to approx. 70,000 
pairs). Adobe's latest applications, e.g. InDesign, make direct use of GPOS 
kerning, so can access all the kerning in a font. Hopefully more 
applications and systems will soon follow suit. Windows supports GPOS 
kerning for complex scripts via Uniscribe, but not yet for Latin or other 
'simple' scripts.
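
The growth from class-based kerning to flat pairs is easy to see in a toy
Python sketch. Everything below (class names, glyph lists, kerning values)
is invented for illustration and does not reflect the GPOS table format or
any real font.

```python
from itertools import product

# Hypothetical glyph classes and kerning values, for illustration only.
LEFT_CLASSES = {
    "T_like": ["T", "Ť", "Ţ"],
    "V_like": ["V", "W"],
}
RIGHT_CLASSES = {
    "a_like": ["a", "à", "á", "â", "ä"],
    "o_like": ["o", "ò", "ó", "ô", "ö"],
}
CLASS_KERNING = {
    ("T_like", "a_like"): -80,
    ("T_like", "o_like"): -80,
    ("V_like", "a_like"): -40,
}

def flatten(class_kerning, left_classes, right_classes):
    """Expand class-based kerning into a flat {(left, right): value} table."""
    flat = {}
    for (left_name, right_name), value in class_kerning.items():
        members = product(left_classes[left_name], right_classes[right_name])
        for left, right in members:
            flat[(left, right)] = value
    return flat

flat_pairs = flatten(CLASS_KERNING, LEFT_CLASSES, RIGHT_CLASSES)
```

Three class rules already expand to 40 flat pairs here, which suggests how
a full font can reach tens of thousands.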

Finally, bear in mind that an excessive number of kerning pairs may 
indicate that your font has fundamental spacing problems. It is often 
possible to reduce the number of kerning pairs by revising the sidebearings 
to produce a better pre-kern fit.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
It is necessary that by all means and cunning,
the cursed owners of books should be persuaded
to make them available to us, either by argument
or by force.  - Michael Apostolis, 1467



Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Asmus Freytag
At 07:21 AM 3/2/03 -0800, Mark Davis wrote:
C12a When a process interprets a code unit sequence which
 purports to be in a Unicode character encoding form, it
 shall treat ill-formed code unit sequences as an error
 condition, and shall not interpret such sequences as
 characters.
Can we agree or disagree on whether an API that returns an error code, but 
also an output buffer that contains a simplistic conversion of the 
erroneous sequence, is or is not conformant?

To me it seems that by setting an error flag in the return code, the API 
has signalled that the user should not treat the output as containing 
correct Unicode.

Such an API design (on a low enough level) might strike the right balance 
between usability in many different environments and satisfying the 
formal requirement.
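
A minimal Python sketch of such a design follows, assuming U+FFFD
replacement as the "simplistic conversion". The name convert and the
(text, ok) return shape are hypothetical, not an existing API.

```python
from typing import Tuple

def convert(raw: bytes) -> Tuple[str, bool]:
    """Return (text, ok): a best-effort conversion plus an error flag.

    When ok is False, the caller is told not to treat text as a faithful
    interpretation of the input.
    """
    # Best-effort output: ill-formed sequences become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Detect errors with a strict pass; checking for U+FFFD in the output
    # would be wrong, since well-formed input may legitimately contain it.
    try:
        raw.decode("utf-8")
        ok = True
    except UnicodeDecodeError:
        ok = False
    return text, ok

good_text, good_ok = convert("caf\u00e9".encode("utf-8"))
bad_text, bad_ok = convert(b"caf\xe9")  # Latin-1 bytes, ill-formed as UTF-8
```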

The ideal case is one where the converter stops in a restartable 
configuration, allowing the client to implement (or ask for) a variety of 
error-recovery options. However, such an interface requires a lot of 
thought and may be difficult to implement for some 
language/platform/library environments. Further, it may be unnecessarily 
difficult to use for at least some conceivable clients.
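
One possible shape for a restartable converter, sketched in Python under
the assumption that the decoder reports where the ill-formed bytes start
and end (Python's UnicodeDecodeError does). The generator decode_runs is a
hypothetical name; it hands the client alternating runs of text and raw
error bytes, so the client chooses the recovery.

```python
def decode_runs(raw: bytes):
    """Yield ("text", str) and ("error", bytes) items in input order."""
    pos = 0
    while pos < len(raw):
        try:
            # Re-slicing on every restart is simple but quadratic in the
            # worst case; acceptable for a sketch.
            yield ("text", raw[pos:].decode("utf-8"))
            return
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice just decoded.
            start, end = pos + err.start, pos + err.end
            if start > pos:
                yield ("text", raw[pos:start].decode("utf-8"))
            yield ("error", raw[start:end])
            pos = end  # restart right after the ill-formed sequence

runs = list(decode_runs(b"ab\xffcd"))

# The client decides what an error run becomes: nothing, SUB, U+FFFD, ...
recovered = "".join(
    item if kind == "text" else "\ufffd" for kind, item in runs
)
```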

A./