For security reasons, it might be necessary to disallow using variation selectors to "salt" strings for parsing. If the string cannot be rejected outright, the proper thing might be to parse it as if the variation selectors were not present, on the basis that - by design - they do not affect semantics (setting aside Han for the moment, where that story isn't totally clear).

Similar considerations would apply to other invisible characters, like redundant directional marks, as well as joiners and non-joiners. Again, if their presence can't be used to reject a string, parsing needs to handle them properly, so that what the user "sees" is what actually gets parsed.
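
As a rough sketch (plain Python, not ICU or CLDR code; the character set and helper name are purely illustrative), such a pre-parse step might look like this:

# Illustrative only: a pre-parse filter for the invisible characters
# discussed above (variation selectors, ZWJ/ZWNJ, directional marks).
INVISIBLES = set(
    list(range(0xFE00, 0xFE10))        # variation selectors VS1-VS16
    + list(range(0xE0100, 0xE01F0))    # variation selectors VS17-VS256
    + [0x200C, 0x200D]                 # ZWNJ, ZWJ
    + [0x200E, 0x200F, 0x061C]         # LRM, RLM, ALM directional marks
)

def preparse(text, reject=False):
    """Reject a string containing the listed invisibles, or strip them
    so that what gets parsed is what the user sees."""
    if reject and any(ord(c) in INVISIBLES for c in text):
        raise ValueError("string contains disallowed invisible characters")
    return "".join(c for c in text if ord(c) not in INVISIBLES)

# <U+0031, U+FE0E, U+0032, U+FE0E> parses as 12 once the selectors are gone.
print(int(preparse("1\uFE0E2\uFE0E")))   # -> 12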

A./


On 3/19/2013 1:45 PM, Richard Wordingham wrote:
On Mon, 18 Mar 2013 17:28:30 -0700
"Steven R. Loomis" <s...@icu-project.org> wrote:

> On Monday, March 18, 2013, Richard Wordingham wrote:
>> The issue is rather with emphatically plain text <U+0031, U+FE0E,
>> U+0032, U+FE0E>.
>
> It's the same situation as something like an implementation of LDML
> number parsing. U+FE0E is not part of a number.

I agree that the same arguments are applicable to both parsing and
collating, though not necessarily with equal force.

Formally, <U+0031, U+FE0E, U+0032, U+FE0E> seems to be just as much a
number as <U+FF11 FULLWIDTH DIGIT ONE, U+FF12 FULLWIDTH DIGIT TWO>,
which the current LDML semantics do treat on an even footing with
"12".  If the emoji digits had been encoded as new characters, ICU
would support them without batting an eyelid.  Because the difference
does not merit full characterhood, they are encoded by a sequence
rather than a single character.  Remember, all that U+FE0E does is to
request a particular glyph.  In a sense, we have 20 new decimal digits,
<U+0030, U+FE0E> to <U+0039, U+FE0E> and <U+0030, U+FE0F> to <U+0039,
U+FE0F>.

So, why do you consider <U+0031, U+FE0E, U+0032, U+FE0E> not to be
a valid decimal number?
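
As a rough illustration (generic Python, not LDML or ICU semantics), a parser driven purely by the Unicode decimal-digit property behaves like this:

# Illustration only: a property-driven parser (here Python's int)
# already reads the fullwidth pair as 12 ...
print(int("\uFF11\uFF12"))                         # U+FF11, U+FF12 -> 12

# ... but rejects <U+0031, U+FE0E, U+0032, U+FE0E> as written; it only
# parses once U+FE0E is treated as semantically empty and dropped.
try:
    int("1\uFE0E2\uFE0E")
except ValueError:
    print("rejected as written")
print(int("1\uFE0E2\uFE0E".replace("\uFE0E", "")))   # -> 12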

>> 10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text
>> likely to be rendered by a cursive Latin font.
>
> Identifying such an edge case does not prove that numeric tailoring is
> broken.

An 'edge case' is often just a case that shows that an algorithm that
often works has not been thought through thoroughly.  Now, as CLDR
seems to value speed above perfect correctness, perhaps handling
variation sequences will be rejected on that basis.  All I was trying
to find out on this list was whether <U+0031, U+FE0E, U+0032, U+FE0E>
should be regarded as a proper number.

Special characters intended for just one aspect of text processing
should not affect other aspects. Unfortunately, a parametric tailoring
to ignore irrelevant characters while complying with the UCA is not
quite as simple as just ignoring them.  The issues arise with the
blocking of discontiguous contractions and the possibility that, for
example, one might wish to collate character variants differently.  On
the other hand, ignoring variation selectors by default might be
excusable, for they should not occur where they might block canonical
reordering (antepenultimate paragraph of TUS 6.2.0 Section 16.4).
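
As a toy sketch of the "ignore them by default" option (plain Python, not a UCA implementation, and it sidesteps the contraction-blocking issue just mentioned):

# Toy sketch only, not the UCA: treat variation selectors as ignorable by
# dropping them before building a (deliberately trivial) sort key.  A real
# parametric tailoring is subtler, because removing characters can interact
# with discontiguous contractions.
VARIATION_SELECTORS = set(range(0xFE00, 0xFE10)) | set(range(0xE0100, 0xE01F0))

def toy_sort_key(text):
    return [ord(c) for c in text if ord(c) not in VARIATION_SELECTORS]

print(toy_sort_key("1\uFE0E2\uFE0E") == toy_sort_key("12"))   # -> True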

Richard.



