On Mon, 7 Sep 2015 16:54:16 +0200 Mark Davis ☕️ <m...@macchiato.com> wrote:
> On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <
> richard.wording...@ntlworld.com> wrote:

>> By my reading, adding string ranges will initially make regular
>> expression engines that don't use ICU non-compliant with Level 1 of
>> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction
>> and

> I don't see where you are getting that. UTS 35 isn't referenced by
> UTS 18 except for some examples of possible extensions in 1.2.3 Other
> Properties, and locale id syntax in level 3. I may be missing
> something, however. Can you tell me where #18 is referencing
> UnicodeSet?

In http://unicode.org/mail-arch/unicode-ml/y2014-m05/0052.html , you
stated that the Unicode sets referred to in UTS#18 RL1.3 are the
Unicode sets defined in UTS #35. We are now waiting for you to add the
reference under Action 141-A76 - 'Make changes in UTS #18 based on
general feedback in L2/14-277'
(http://www.unicode.org/L2/L2014/14277-pubrev-ovrflw.html). I presume
no change has been made yet because there are no *urgent* changes for
UTS #18.

> String ranges need not be implemented internally (and I don't think
> the CLDR committee would expect them to be, in general). They are
> simply a way of expressing the *string format* of a UnicodeSet in a
> more compact fashion. (And UnicodeSets themselves can have a variety
> of different implementations, in any event).

[\x{0000 0000 0000 0000} - \x{DFFFF DFFFF DFFFF DFFFF}] is a very
compact way of expressing a lot of strings. You wouldn't decompose
that into a list of strings.

>> String
>> ranges seem particularly vulnerable to the ill-effects of
>> unpredictable

> UnicodeSets are low level constructs, as are their string
> representations. Like all strings, the string format of a UnicodeSet
> may change if it is normalized. That is nothing new.

> - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A
> through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390
> code points.
> - Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN
> SMALL LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and
> contain 841 code points.

At least this gives the same range whether normalised to NFC or to NFD.

Using NFD, the preferred normalisation for regular expressions
semi-respecting canonical equivalence, [{x̀}-{ẍ}] would not include the
2-character string "xa", as both bounds would decompose to two
characters. Using NFC, the preferred normalisation for LDML (and for
XML, I think), this would be a contraction for [{x̀}-{xẍ}], and would
include the 2-character string "xa". If the two strings had to have
the same length, [{x̀}-{ẍ}] would be flagged as erroneous if
interpreted in NFC, and with any luck, similar errors that were not
detected would then also be corrected. It's not perfect, but il meglio
è l’inimico del bene.

> You really don't want to normalize the string format of UnicodeSets.
> Or if you suspect that those string formats might be normalized, then
> just use escaped format \x{...} for anything that might change under
> normalization.

It would probably be sensible to issue a warning if the specification
of a string bound had more than one canonical equivalent. I'm thinking
of accidents. While an XML processor must not be Unicode compliant, I
thought most regular expression engine environments were allowed to be
Unicode compliant. TUS 8.0 Chapter 3 C6: "A process shall not assume
that the interpretations of two canonical-equivalent character
sequences are distinct."

> Note that while it is fine to bring up topics for discussion here (or,
> better yet, on the "cldr-us...@unicode.org" <cldr-us...@unicode.org>
> list),

As this impacts regular expressions in general, I think this is the
better list for the impact on Unicode sets outside CLDR.

> anything that requires a change will have to be filed as a
> CLDR ticket.
> Richard, I'm sure you know this, and also raised this
> topic here because of the relation to UTS18, so this is a reminder
> for others.

Exactly.

Richard.
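
The normalisation effects discussed in this message can be checked
with a short Python sketch using the standard unicodedata module. It
is only an illustration, not part of any UnicodeSet implementation:
the helper names range_size and bound_lengths are hypothetical, and
the code inspects only the range bounds, not full string-range
semantics.

```python
import unicodedata


def range_size(first: str, last: str) -> int:
    """Count the code points in the inclusive range first..last."""
    return ord(last) - ord(first) + 1


def bound_lengths(first: str, last: str, form: str) -> tuple[int, int]:
    """Length in code points of each bound after normalising to `form`."""
    return (
        len(unicodedata.normalize(form, first)),
        len(unicodedata.normalize(form, last)),
    )


OHM_SIGN = "\u2126"  # canonically equivalent to U+03A9 GREEK CAPITAL LETTER OMEGA

# [a-U+2126] as written, before any normalisation of the pattern text.
raw_size = range_size("a", OHM_SIGN)

# The same pattern after its text has been put through NFC.
nfc_last = unicodedata.normalize("NFC", OHM_SIGN)
nfc_size = range_size("a", nfc_last)

print(raw_size)  # 8390
print(nfc_size)  # 841

# String bounds: x + COMBINING GRAVE has no precomposed form, whereas
# U+1E8D (x with diaeresis) is precomposed and decomposes to x + U+0308.
X_GRAVE = "x\u0300"
X_DIAERESIS = "\u1e8d"

print(bound_lengths(X_GRAVE, X_DIAERESIS, "NFD"))  # (2, 2)
print(bound_lengths(X_GRAVE, X_DIAERESIS, "NFC"))  # (2, 1)
```

The bounds have equal length under NFD but unequal length under NFC,
which is the case a same-length requirement on string range bounds
would flag.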