Re: Reasonable to propose stability policy on numeric type = decimal

Philippe Verdy Sun, 25 Jul 2010 06:30:40 -0700

"Kent Karlsson" <[email protected]> wrote:
> Den 2010-07-25 03.09, skrev "Michael Everson" <[email protected]>:
> > On 25 Jul 2010, at 02:02, Bill Poser wrote:
> >> As I said, it isn't a huge issue, but scattering the digits makes the
> >> programming a bit more complex and error-prone and the programs a little
> >> less efficient.
> >
> > But it would still *work*. So my hyperbole was not outrageous. And nobody
> > has actually scattered them. Though there are various types of "runs" in
> > existing encoded digits and numbers.
>
> While not formally of general category Nd (they are "No"), the superscript
> digits are a bit scattered:
>
> 00B2;SUPERSCRIPT TWO
> 00B3;SUPERSCRIPT THREE
> 00B9;SUPERSCRIPT ONE
> 2070;SUPERSCRIPT ZERO
> 2074;SUPERSCRIPT FOUR
> ...
> 2079;SUPERSCRIPT NINE
>
> And there are situations where one wants to interpret them as in a
> decimal-position system.


Scattering affects not only the decimal digits, but also the other
characters needed to represent numbers:

- the numeric sign (« - » or « + »), with at least two variants of the
minus sign in the same system (either the ambiguous hyphen-minus
U+002D, the only one supported in many text-to-number conversions, or
the true mathematical minus sign U+2212 « − », which has the same width
as the plus sign), plus some « alternating signs » that exist in two
opposite versions (« ± », « ∓ »);

- the characters that represent the decimal separator (« . » or « , »)
which is almost always needed but locale-specific (this is not just a
property of the script);

- the optional character that marks exponential notation and is
recognized in text-to-number conversion (usually « e » or « E »);

- the optional characters used in conventional formatting for grouping
digits: NNBSP (the « fine » no-break space, with possible automatic
fallback to THINSP in font renderers and in rich-text documents that
control the breaking property with a separate style, or fallback to
NBSP in plain-text documents, or to the standard SPACE in preformatted
plain-text documents), « , », or « ' », and possibly other punctuation
in its « wide » form for ideographic scripts.
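
To make the list above concrete, here is a minimal Python sketch of a
text-to-number conversion that tolerates both minus signs and the
grouping spaces mentioned. The function name parse_number and its
parameters are assumptions made for this illustration, not an existing
API; the decimal separator is passed explicitly because it depends on
the locale, not on the script.

```python
from decimal import Decimal

# Hyphen-minus U+002D and the mathematical minus sign U+2212.
MINUS_SIGNS = {"-", "\u2212"}
# NNBSP, THINSP, NBSP and the standard SPACE used for grouping digits.
GROUP_SPACES = frozenset({"\u202f", "\u2009", "\u00a0", " "})


def parse_number(text, decimal_sep=".", group_seps=GROUP_SPACES):
    """Hypothetical helper: normalize signs and separators, then parse."""
    if decimal_sep in group_seps:
        raise ValueError("decimal separator cannot also group digits")
    normalized = []
    for ch in text.strip():
        if ch in MINUS_SIGNS:
            normalized.append("-")
        elif ch == decimal_sep:
            normalized.append(".")
        elif ch in group_seps:
            continue  # grouping separators carry no numeric meaning
        elif ch == ".":
            # A full stop that is neither separator is not silently accepted.
            raise ValueError("unexpected '.' for this convention")
        else:
            normalized.append(ch)
    # Decimal also understands the optional "e" / "E" exponent marker.
    return Decimal("".join(normalized))


print(parse_number("\u22121\u202f234,5", decimal_sep=","))
print(parse_number("1,234.5", group_seps={","}))
```

Malformed input that survives normalization (for example a sign in the
middle of the digits) is rejected by Decimal itself, which raises
decimal.InvalidOperation.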

Some of them exist in exponent/superscript or index/subscript
versions (notably digits and decimal separators), but not all of them:
not all separators for grouping digits do, and using NNBSP may not be
appropriate, as its width is not adjusted and it does not have the
semantics of a superscript or subscript.

For generality, it seems better to assume that digits and other
characters needed to write numbers in the positional decimal system may
be scattered. Libraries may still avoid the small overhead of
performing table lookups by just inspecting a property of the
character '0' or of the convention in use, which will say either that
it starts a contiguous range, or that the complete sequence is stored
in a lookup array for the 10 digits.
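
A minimal Python sketch of that idea, under the assumption that each
digit set is described by a 10-character string listing its digits 0 to
9 (the name make_digit_decoder is hypothetical): the fast path is plain
subtraction from the code point of the zero, and the lookup table is
used only when the set is scattered.

```python
# Superscript digits 0..9, listed in value order; they are scattered
# across U+00B2, U+00B3, U+00B9 and U+2070..U+2079.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"


def make_digit_decoder(digits):
    """Return a function mapping one digit character to its value 0..9."""
    if len(digits) != 10:
        raise ValueError("expected exactly 10 digits")
    zero = ord(digits[0])
    contiguous = all(ord(d) == zero + i for i, d in enumerate(digits))

    if contiguous:
        def decode(ch):
            value = ord(ch) - zero  # no table needed
            if 0 <= value <= 9:
                return value
            raise ValueError(f"{ch!r} is not a digit of this set")
    else:
        table = {d: i for i, d in enumerate(digits)}

        def decode(ch):
            try:
                return table[ch]
            except KeyError:
                raise ValueError(f"{ch!r} is not a digit of this set") from None

    decode.contiguous = contiguous  # exposed only for inspection
    return decode


decode_ascii = make_digit_decoder("0123456789")
decode_super = make_digit_decoder(SUPERSCRIPT_DIGITS)
print(decode_ascii("7"), decode_ascii.contiguous)
print(decode_super("\u00b2"), decode_super.contiguous)
```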

The general category "Nd" may not always be sufficient to find all
digits usable in decimal notations of integers, because the sequence
may have been incomplete when it was first encoded, and completed
later in scattered positions.

In this case, the digits will often have a general category of "No"
(or even "Nl") that will remain stable. What should also be stable is
their numeric value property (but I'm not sure that this is the case
for "Nl" digits, notably for writing systems that use letters as
digits in a way similar to Greek or Hebrew, even though Greek and
Hebrew digits are not encoded separately from the letters that these
number notations borrow).
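
For illustration, the distinction can be inspected with Python's
standard unicodedata module. The helper name describe is an assumption
of this sketch; the three lookups it wraps (decimal, digit, numeric)
are the module's real functions, each returning the supplied default
when the character lacks that property.

```python
import unicodedata


def describe(ch):
    """Return (general category, decimal value, digit value, numeric value)."""
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),  # set only for "Nd" decimal digits
        unicodedata.digit(ch, None),    # also set for e.g. superscript digits
        unicodedata.numeric(ch, None),  # set for any character with a value
    )


# DIGIT SEVEN, SUPERSCRIPT TWO, ROMAN NUMERAL TWELVE
for ch in ("7", "\u00b2", "\u216b"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

SUPERSCRIPT TWO is "No" with a digit value but no decimal value, and
ROMAN NUMERAL TWELVE is "Nl" with only a numeric value, so a parser
that filters on "Nd" alone sees neither of them.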

Also I'm not sure that scripts that define "half-digits", or digits
with numeric values higher than 9, permit the use of their digits with
a numeric value between 0 and 9 in a positional decimal system. The
Roman numeral system is such a system (borrowing some scattered Latin
letters and adding a few other specific digits), where this would be
completely wrong.

Or another base than 10 could be assumed by their positional system,
even if their digits are encoded in a contiguous range of characters
for the subset of values 0 to 9. This is probably no longer the case
with scripts that have modern use, but in historical scripts or in
historical texts using a modern script, the implied base may be
different and would have used more or fewer distinct digits. So instead
of guessing automatically from the encoded text, it may be preferable
to annotate the text (easy to insert if the conversion of the
historical text uses some rich-text format) to specify how to
interpret the numeric value of the original number.

And sometimes, the conversion to superscript/subscript compatibility
characters will not be possible, even if some of them may be converted
safely to their numeric value, after detecting that they are in
superscript/subscript and that they don't behave the same as normal
digits (16²⁰ must NOT be interpreted as the numeric value 1620, but
must be parsed as two successive numbers, 16 and 20, where the second
one has the semantics of an exponent, as if there were an
exponentiation operator between the two numbers).
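
The 16²⁰ case can be sketched in Python as follows. This is only an
illustration under simplifying assumptions: the function names are
hypothetical, the input is a bare run of digits with no sign or
separators, and a superscript digit is recognized by its "<super>"
compatibility decomposition together with a digit value.

```python
import unicodedata


def is_superscript_digit(ch):
    """True for characters such as U+00B2 that decompose to <super> digit."""
    return (unicodedata.decomposition(ch).startswith("<super>")
            and unicodedata.digit(ch, None) is not None)


def split_base_exponent(text):
    """Parse e.g. '16' + superscript '20' into (16, 20), never into 1620."""
    base = exponent = None
    for ch in text:
        plain = unicodedata.decimal(ch, None)  # ordinary "Nd" digit?
        if plain is not None:
            if exponent is not None:
                raise ValueError("ordinary digit after a superscript digit")
            base = (base or 0) * 10 + plain
        elif is_superscript_digit(ch):
            if base is None:
                raise ValueError("exponent without a base")
            exponent = (exponent or 0) * 10 + unicodedata.digit(ch)
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if base is None:
        raise ValueError("empty input")
    return base, exponent  # exponent is None when there is no superscript


base, exponent = split_base_exponent("16\u00b2\u2070")
print(base, exponent, base ** exponent)
```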

It is also very frequent that only a few superscript digits are
supported in one font, and the other digits may be borrowed from
another font using a completely distinct style with distinct metrics,
or may not be displayed at all (missing glyph). The result is then
horrible if you can't predict which font will be used and whether it
supports all 10 digits of the contiguous range of values (even if they
are scattered in the code space).

When converting numbers to text with exponential notations,
superscripts should be used only with care, knowing that this won't be
possible in all scripts, and that only integers without grouping
separators can be used.

Some writing systems (unified as « scripts » in Unicode) will still require you to:

- either use rich-text styling for superscripts used in the
conventional notation of exponents,

- or use an explicit exponentiation operator, such as the ASCII symbol
U+005E "^" (which is not the same as the modifier letter circumflex
U+02C6 "ˆ", and which many fonts render with a glyph size and position
different from those of the combining diacritic and those implied by
the modifier letter), or a mathematical operator or modifier letter
(like the upward arrowhead U+02C4 "˄", which some fonts render like the
mathematical wedge operator on the baseline U+2227 "∧", or the less
ambiguous upward arrow U+2191 "↑").
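
As a rough Python sketch of the plain-text side of this choice (the
name format_power and its use_superscripts flag are assumptions for
illustration): the exponent must be an integer, it is written without
grouping separators, and the caller falls back to U+005E "^" when
superscript digits cannot be relied on.

```python
# Maps ASCII digits and hyphen-minus to the superscript digits
# (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079) and to
# SUPERSCRIPT MINUS U+207B.
SUPERSCRIPTS = str.maketrans(
    "0123456789-",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207b",
)


def format_power(base, exponent, use_superscripts=True):
    """Format base and integer exponent as plain text."""
    if not isinstance(exponent, int):
        raise TypeError("only integer exponents can be written this way")
    if use_superscripts:
        return f"{base}{str(exponent).translate(SUPERSCRIPTS)}"
    return f"{base}^{exponent}"  # explicit operator fallback


print(format_power(16, 20))
print(format_power(16, 20, use_superscripts=False))
```

Whether the superscript form renders acceptably still depends on the
font, so the flag has to be set by whoever knows the target
environment.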

Philippe.
