For security reasons, it might be necessary to disallow using variation selectors to "salt" strings for parsing. If the string cannot be rejected outright, the proper thing might be to parse it as if the variation selectors were not present, on the basis that - by design - they do not affect semantics (setting aside Han for the moment, where that story isn't totally clear).

Similar considerations would apply to other invisible characters, like redundant directional marks, as well as joiners and non-joiners. Again, if their presence can't be used to reject a string, parsing needs to handle them properly, so that what the user "sees" is what actually gets parsed.
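
As a rough sketch (plain Python, not ICU or CLDR code; the character set and helper name are purely illustrative), such a pre-parse step might look like this:

# Illustrative only: a pre-parse filter for the invisible characters
# discussed above (variation selectors, ZWJ/ZWNJ, directional marks).
INVISIBLES = set(
    list(range(0xFE00, 0xFE10))        # variation selectors VS1-VS16
    + list(range(0xE0100, 0xE01F0))    # variation selectors VS17-VS256
    + [0x200C, 0x200D]                 # ZWNJ, ZWJ
    + [0x200E, 0x200F, 0x061C]         # LRM, RLM, ALM directional marks
)

def preparse(text, reject=False):
    """Reject a string containing the listed invisibles, or strip them
    so that what gets parsed is what the user sees."""
    if reject and any(ord(c) in INVISIBLES for c in text):
        raise ValueError("string contains disallowed invisible characters")
    return "".join(c for c in text if ord(c) not in INVISIBLES)

# <U+0031, U+FE0E, U+0032, U+FE0E> parses as 12 once the selectors are gone.
print(int(preparse("1\uFE0E2\uFE0E")))   # -> 12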

A./


On 3/19/2013 1:45 PM, Richard Wordingham wrote:
On Mon, 18 Mar 2013 17:28:30 -0700
"Steven R. Loomis" <s...@icu-project.org> wrote:

> On Monday, March 18, 2013, Richard Wordingham wrote:
>> The issue is rather with emphatically plain text <U+0031, U+FE0E,
>> U+0032, U+FE0E>.
>
> It's the same situation as something like an implementation of LDML
> number parsing. U+FE0E is not part of a number.

I agree that the same arguments are applicable to both parsing and
collating, though not necessarily with equal force.

Formally, <U+0031, U+FE0E, U+0032, U+FE0E> seems to be just as much a
number as <U+FF11 FULLWIDTH DIGIT ONE, U+FF12 FULLWIDTH DIGIT TWO>,
which the current LDML semantics do treat on an even footing with
"12".  If the emoji digits had been encoded as new characters, ICU
would support them without batting an eyelid.  Because the difference
does not merit full characterhood, they are encoded by a sequence
rather than a single character.  Remember, all that U+FE0E does is to
request a particular glyph.  In a sense, we have 20 new decimal digits,
<U+0030, U+FE0E> to <U+0039, U+FE0E> and <U+0030, U+FE0F> to <U+0039,
U+FE0F>.

So, why do you consider <U+0031, U+FE0E, U+0032, U+FE0E> not to be
a valid decimal number?
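
As a rough illustration (generic Python, not LDML or ICU semantics), a parser driven purely by the Unicode decimal-digit property behaves like this:

# Illustration only: a property-driven parser (here Python's int)
# already reads the fullwidth pair as 12 ...
print(int("\uFF11\uFF12"))                         # U+FF11, U+FF12 -> 12

# ... but rejects <U+0031, U+FE0E, U+0032, U+FE0E> as written; it only
# parses once U+FE0E is treated as semantically empty and dropped.
try:
    int("1\uFE0E2\uFE0E")
except ValueError:
    print("rejected as written")
print(int("1\uFE0E2\uFE0E".replace("\uFE0E", "")))   # -> 12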

>> 10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text
>> likely to be rendered by a cursive Latin font.
>
> Identifying such an edge case does not prove that numeric tailoring is
> broken.

An 'edge case' is often just a case that shows that an algorithm that
often works has not been thought through thoroughly.  Now, as CLDR
seems to value speed above perfect correctness, perhaps handling
variation sequences will be rejected on that basis.  All I was trying
to find out on this list was whether <U+0031, U+FE0E, U+0032, U+FE0E>
should be regarded as a proper number.

Special characters intended for just one aspect of text processing
should not affect other aspects. Unfortunately, a parametric tailoring
to ignore irrelevant characters while complying with the UCA is not
quite as simple as just ignoring them.  The issues arise with the
blocking of discontiguous contractions and the possibility that, for
example, one might wish to collate character variants differently.  On
the other hand, ignoring variation selectors by default might be
excusable, for they should not occur where they might block canonical
reordering (antepenultimate paragraph of TUS 6.2.0 Section 16.4).
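
As a toy sketch of the "ignore them by default" option (plain Python, not a UCA implementation, and it sidesteps the contraction-blocking issue just mentioned):

# Toy sketch only, not the UCA: treat variation selectors as ignorable by
# dropping them before building a (deliberately trivial) sort key.  A real
# parametric tailoring is subtler, because removing characters can interact
# with discontiguous contractions.
VARIATION_SELECTORS = set(range(0xFE00, 0xFE10)) | set(range(0xE0100, 0xE01F0))

def toy_sort_key(text):
    return [ord(c) for c in text if ord(c) not in VARIATION_SELECTORS]

print(toy_sort_key("1\uFE0E2\uFE0E") == toy_sort_key("12"))   # -> True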

Richard.



