Philippe Verdy wrote:
"Kent Karlsson" <[email protected]> wrote:
Den 2010-07-25 03.09, skrev "Michael Everson" <[email protected]>:
On 25 Jul 2010, at 02:02, Bill Poser wrote:
As I said, it isn't a huge issue, but scattering the digits makes the
programming a bit more complex and error-prone and the programs a little less
efficient.
But it would still *work*. So my hyperbole was not outrageous. And nobody has
actually scattered them. Though there are various types of "runs" in existing
encoded digits and numbers.
While not formally of general category Nd (they are "No"), the superscript
digits are a bit scattered:

00B2;SUPERSCRIPT TWO
00B3;SUPERSCRIPT THREE
00B9;SUPERSCRIPT ONE
2070;SUPERSCRIPT ZERO
2074;SUPERSCRIPT FOUR
...
2079;SUPERSCRIPT NINE

And there are situations where one wants to interpret them as in a
decimal-position system.
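
A minimal sketch of such an interpretation, assuming Python and its
standard unicodedata module (the function name superscript_to_int is
hypothetical, chosen only for this illustration). It relies on the
Numeric_Type=Digit value that the superscripts carry even though their
general category is "No":

```python
import unicodedata

def superscript_to_int(text: str) -> int:
    """Read a run of superscript digits as a base-10 integer."""
    value = 0
    for ch in text:
        # Accept only characters named SUPERSCRIPT ...; this is an
        # illustrative filter, not an official Unicode property.
        if not unicodedata.name(ch, "").startswith("SUPERSCRIPT "):
            raise ValueError(f"not a superscript digit: {ch!r}")
        # digit() raises ValueError for non-digits such as U+2071.
        value = value * 10 + unicodedata.digit(ch)
    return value

# The code points are scattered, but the digit values are still usable.
print([f"U+{ord(c):04X}" for c in "⁰¹²³⁴"])
print(superscript_to_int("²⁰"))  # 20
```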

Scattering does not only affect decimal digits, but also mathematical
operators needed to represent:

- the numeric sign (« - » or « + »), with at least two variants for
the same system to represent the minus sign (either the ambiguous
hyphen-minus, the only one supported in many text-to-number
conversions, or the true mathematical minus sign U+2212 « − » that has
the same width as the plus sign), including some « alternating signs »
that exist in two opposite versions (« ± », « ∓ »);

- the characters that represent the decimal separator (« . » or « , »)
which is almost always needed but locale-specific (this is not just a
property of the script);

- the optional character used to note exponential notations and used
in text-to-number conversion (usually « e » or  « E »);

- the optional characters used in the conventional formatting for
grouping digits (NNBSP alias « fine », with possible automatic
fallback to THINSP in font renderers and in rich-text documents
controlling the breaking property with separate style, or fallback to
NBSP in plain-text documents, or fallback to standard SPACE in
preformatted plain-text documents, « , », or « ' », and possibly other
punctuations in their « wide » form, for ideographic scripts).

Some of them exist in superscript (exponent) or subscript (index)
versions (notably digits and decimal separators), but not all of them
(not all separators for grouping digits; using NNBSP may not be
appropriate, as its width is not adjusted and it does not have the
semantics of a superscript or subscript).

For generality, it seems better to assume that digits and other
characters needed to note numbers in the positional decimal system may
be scattered. Libraries may still avoid the small overhead of
performing table lookups by just inspecting a property of the
character '0' or of the convention in use, which will say either that
it starts a contiguous range, or that the complete sequence is stored
in a lookup array for the 10 digits.
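
The two strategies can be sketched as follows, assuming Python. The
names value_in_contiguous_run, value_in_table and SUPERSCRIPT_TABLE are
hypothetical and only illustrate the idea; they are not taken from any
existing library:

```python
from typing import Dict, Optional

# Explicit table for a scattered digit set (the superscripts).
SUPERSCRIPT_TABLE: Dict[str, int] = {
    ch: value for value, ch in enumerate("⁰¹²³⁴⁵⁶⁷⁸⁹")
}

def value_in_contiguous_run(ch: str, zero: str) -> Optional[int]:
    """Digit value by subtraction, valid only if zero starts a run of 10."""
    offset = ord(ch) - ord(zero)
    return offset if 0 <= offset <= 9 else None

def value_in_table(ch: str, table: Dict[str, int]) -> Optional[int]:
    """Digit value by lookup, which also works for scattered code points."""
    return table.get(ch)

print(value_in_contiguous_run("7", "0"))       # 7
print(value_in_table("⁹", SUPERSCRIPT_TABLE))  # 9
```

The subtraction path needs no table at all, which is the small saving
being discussed; the table path costs one lookup but makes no
assumption about where the digits were encoded.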

The general category "Nd" may not always be accurate to find all
digits usable in decimal notations of integers, because the sequence
may have been incomplete when it was first encoded, and completed
later in scattered positions.

In this case, the digits will often have a general property of "No"
(or even "Nl") that will remain stable. What should also be stable is
their numeric value property (but I'm not sure that this is the case
of "Nl" digits, notably for script systems using letters in a way
similar to Greek or Hebrew letters as digits, even if Greek and Hebrew
digits are not encoded separately from the letters that these number
notations are borrowing).

Also I'm not sure that scripts that define "half-digits", or digits
with higher numeric values than 9, are permitting the use of their
digits with a numeric value between 0 and 9, in a positional decimal
system. The Roman numeric system is such a numeric system (borrowing
some scattered Latin letters and adding a few other specific digits)
where this will be completely wrong.

Or another base than 10 could be assumed by their positional system,
even if their digits are encoded in a contiguous range of characters
for the subset of values 0 to 9. This is probably no longer the case
with scripts that have modern use, but in historical scripts or in
historical texts using a modern script, the implied base may be
different and would have used more or less distinct digits. So instead
of guessing automatically from the encoded text, it may be preferable
to annotate the text (easy to insert if the conversion of the
historical text uses some rich-text format) to specify how to
interpret the numeric value of the original number.

And sometimes, the conversion to superscripts/subscripts compatibility
characters will not be possible even if some of them may be converted
safely to their numeric value, after detecting that they are in
superscript/subscript and that they don't behave the same as normal
digits (16²⁰ must NOT be interpreted as the numeric value 1620, but
must be parsed as two successive numbers 16 and 20, where the second
one has the semantic of an exponent, as if there was an exponentiation
operator between the two numbers).
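
As a rough illustration of that parsing rule, here is a sketch in
Python. The function name split_base_exponent is hypothetical, and the
input is assumed to be ASCII digits optionally followed by superscript
digits, with no signs or separators:

```python
from typing import Optional, Tuple

SUPERSCRIPT_DIGITS = {ch: value for value, ch in enumerate("⁰¹²³⁴⁵⁶⁷⁸⁹")}

def split_base_exponent(text: str) -> Tuple[int, Optional[int]]:
    """Split e.g. '16²⁰' into (16, 20) instead of reading it as 1620."""
    # Everything left after stripping trailing superscripts is the base.
    base_part = text.rstrip("".join(SUPERSCRIPT_DIGITS))
    exponent_part = text[len(base_part):]
    # str.isdigit() alone would also accept superscripts, so require ASCII.
    if not (base_part.isascii() and base_part.isdigit()):
        raise ValueError(f"unsupported base: {base_part!r}")
    exponent = None
    if exponent_part:
        exponent = 0
        for ch in exponent_part:
            exponent = exponent * 10 + SUPERSCRIPT_DIGITS[ch]
    return int(base_part), exponent

print(split_base_exponent("16²⁰"))  # (16, 20)
```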

It is also very frequent that only a few superscript digits will be
supported in one font, and other digits may be borrowed from another
font using a completely distinct style with distinct metrics or may
not be displayed at all (missing glyph). The result is then horrible
if you can't predict which font will be used that support the 10
digits in a contiguous range of values (even if they are scattered in
the code space).

When converting numbers to text with exponential notations,
superscripts should be used only with care, knowing that this won't be
possible in all scripts, and that only integers without grouping
separators can be used.

Some writing systems (unified as « scripts » in Unicode) will still require one to:

- either use rich-text styling for superscripts used in the
conventional notation of exponents,

- or use an explicit exponentiation operator, such as the ASCII symbol
U+005E "^" (which is not the same as the modifier letter circumflex
U+02C6 "ˆ", and which many fonts render with a glyph size and position
different from those of the combining diacritic implied by the modifier
letter), or a mathematical operator or modifier letter (like the
upward arrow head U+02C4 "˄" that some fonts render as the
mathematical wedge operator on the baseline U+2227 "∧", or the less
ambiguous upward arrow U+2191 "↑").

Philippe.



That may all be true, but it is really beside the point.

I'm considering extending an existing computer programming language, which currently only understands numbers composed solely of ASCII digits, to also understand those from other scripts. I'm not going to do it unless it is easy within the existing implementation (not some theoretical better implementation), efficient, and not a security threat.

The symbols for operators like exponentiation are already set in stone, and their being scattered isn't relevant. Likewise, non-decimal-digit numbers, like subscripts, are also not relevant.

I found a way to do the implementation that meets all my criteria, but is based on the existing pattern of Gc=Nd (or Nt=De) code point assignments. The assignments have so far been prudent, to use Asmus' term. I was merely trying to see if this prudence could be codified so that my implementation wouldn't get obsoleted on a whim in some future Unicode release.
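
To make that dependence concrete, here is a sketch in Python of one way an implementation could lean on the pattern. It is not the implementation referred to above; the names ND, ZEROS, digit_by_offset and assumption_holds are hypothetical. It computes a digit's value by subtracting the nearest preceding zero, then checks that the Unicode data shipped with the running Python really does keep every Gc=Nd character in an ascending run of ten:

```python
import bisect
import sys
import unicodedata
from typing import Optional

# Every code point with general category Nd in this Python's Unicode data.
ND = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Nd"]
# The zeros among them; each is assumed to start a run 0..9.
ZEROS = [cp for cp in ND if unicodedata.decimal(chr(cp)) == 0]

def digit_by_offset(ch: str) -> Optional[int]:
    """Value of a decimal digit as its distance from the preceding zero."""
    cp = ord(ch)
    i = bisect.bisect_right(ZEROS, cp) - 1
    if i < 0:
        return None
    offset = cp - ZEROS[i]
    return offset if offset <= 9 else None

def assumption_holds() -> bool:
    """True if every Nd character sits in an ascending run of ten."""
    for zero in ZEROS:
        for k in range(10):
            ch = chr(zero + k)
            if (unicodedata.category(ch) != "Nd"
                    or unicodedata.decimal(ch) != k):
                return False
    # No Nd character may lie outside those runs.
    return len(ND) == 10 * len(ZEROS)

print(assumption_holds())
print(digit_by_offset("\u0663"))  # ARABIC-INDIC DIGIT THREE -> 3
```

If a future release ever scattered a script's decimal digits, assumption_holds() would return False and the subtraction shortcut would have to give way to a table.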

I hadn't thought of the case where a zero is later found or its usage develops in a script, and suddenly all the digits in that script change from Nt=Di to Nt=De, which because of an existing stability policy would necessarily require their general category changing to Nd.

Prudence would dictate, then, that when code points are assigned to the numbers in a script, a contiguous block of 12-13 be reserved for them, with the first one in the block set aside for ZERO, the next for ONE, etc.

My original question, then, comes down to this: would it be reasonable to codify this prudence? People have said it will never happen. But no one has said why that is.

Obviously, things can happen that will mess this up--the Phaistos disk could turn out to be a base-46 numbering system, as an extremely unlikely example. But by dictating prudence now, most such eventualities wouldn't happen.

I have since looked at the Nt=Di characters. The ones that aren't in contiguous runs are the superscripts and ones that would never be considered to be decimal digits, such as a circled ZERO. The only run in the BMP which doesn't have a zero is Ethiopic. It seems extremely unlikely to me that a zero will be discovered or come into use with that script. I'm guessing that they have adopted European numbers in order to have commerce with the rest of the world.
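
A survey like this can be approximated with a short Python sketch. It assumes that unicodedata.digit() succeeds for both Nt=De and Nt=Di characters while unicodedata.decimal() succeeds only for Nt=De, so "digit but not decimal" stands in for Nt=Di; the names is_nt_di and NT_DI are hypothetical:

```python
import sys
import unicodedata

def is_nt_di(ch: str) -> bool:
    """Approximate Numeric_Type=Digit: has a digit value, no decimal value."""
    return (unicodedata.digit(ch, None) is not None
            and unicodedata.decimal(ch, None) is None)

NT_DI = [chr(cp) for cp in range(sys.maxunicode + 1) if is_nt_di(chr(cp))]

# Print each character with its code point, digit value, and name so that
# runs and isolated characters can be inspected by eye.
for ch in NT_DI:
    print(f"U+{ord(ch):04X}", unicodedata.digit(ch),
          unicodedata.name(ch, "?"))
```

The exact list depends on the Unicode version bundled with the Python in use, so it may differ from the 2010 data discussed here.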

There are several runs in the SMP, but the code point where a zero would go isn't assigned.

I don't know for sure, but it appears to me that we are running out of non-dead scripts to encode. I see that draft 6.0 has only 544 BMP code points not in any block and not much in the pipeline. I would think that most any script yet to be encoded would have borrowed numbering systems from their neighbors.

And there is still plenty of space in the SMP, so this proposal to require prudence should not use up too many precious unassigned code points.
