On Thu, 17 Oct 2019 10:42:19 +0300 Eli Zaretskii via Unicode <unicode@unicode.org> wrote:
> > Date: Thu, 17 Oct 2019 02:26:35 +0100
> > From: Richard Wordingham <richard.wording...@ntlworld.com>
> > Cc: Eli Zaretskii <e...@gnu.org>
> >
> > (c) A search for 'n' finding 'ñ'.
> >
> > When it comes to canonical equivalence, one answer to (c) is that as
> > soon as one adds the next letter, e.g. 'na', the search will
> > no longer match 'ñ'.
>
> Sounds arbitrary to me.  How do we know that all the users will want
> that?

If the change from codepoint-by-codepoint matching is just canonical
equivalence, then there is no way that the ‘n’ of ‘na’ will be matched
by the ‘n’ within ‘ñ’.

> > (This doesn't apply to diacritic-ignoring folding.)
>
> But the issue _was_ diacritic-ignoring folding.

Then we don't seem to have any evidence of user discontent arising
from supporting canonical equivalence.

> > That argument doesn't work with the Polish letter 'ń' though, as it
> > can be word-final.
>
> It actually doesn't work in general, and one factor is indeed
> different languages.  The problem with ñ was raised by
> Spanish-speaking users, and only they were very much against folding
> in this case.

I'm not talking about folding.  I'm talking about canonical
equivalence, which largely but not solely consists of treating
precomposed characters as the same as their *canonical* decompositions.

> > In many cases, the answer might be a search by collation graphemes,
> > but that has other issues besides language sensitivity.
>
> It is also unworkable, because search has to work in contexts where
> the text is not displayed at all, and graphemes only exist at display
> time.

The definition of a grapheme cluster is given in Section 9.9 of UTS#10,
which is currently at Version 12.1.0.  It is only connected to display
at a deep level, so display time is irrelevant.  Formally, it depends
on a collation, though the sorting aspect is irrelevant and is removed
for many 'search' collations in the CLDR.
So, if one were using a Spanish collation, on typing 'n' into the
incremental search string (and having it committed), the search
wouldn't consider a match with 'ñ'.  Then, on further typing the
combining tilde, it would reject the matches it had found and choose
those matches with 'ñ', whether one codepoint or two.  Would that
behaviour cause serious grief for incremental search?

As I use an XSAMPA-based input implemented in quail that attempts to
generate text in form NFC, I would type 'n~' to get the Spanish
character, and so would never get an intermediate state where the
incremental search was searching for 'n'.  (At least, not in Emacs
25.3.1.)

Richard.
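
P.S. For concreteness, here is a rough Python sketch of the incremental
search behaviour described above.  It is only an approximation under
stated assumptions: it does not use a real Spanish collation or the
UTS#10 definition, but stands in for them by treating a base character
plus any following characters of non-zero canonical combining class as
one unit, and comparing units in NFC.  The names clusters and
find_matches are made up for the illustration.

```python
import unicodedata


def clusters(text):
    """Split text into units: a base character plus any following
    characters with a non-zero canonical combining class, each unit
    normalised to NFC.  A crude stand-in for collation graphemes
    (assumption: no collation tailoring is consulted)."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch          # attach the mark to the preceding base
        else:
            out.append(ch)         # start a new unit
    return [unicodedata.normalize("NFC", c) for c in out]


def find_matches(query, text):
    """Return the unit indices in text where query matches unit by unit."""
    q, t = clusters(query), clusters(text)
    if not q:
        return []
    return [i for i in range(len(t) - len(q) + 1) if t[i:i + len(q)] == q]


# 'ano', then 'año' with precomposed ñ, then 'año' with n + combining tilde
text = "ano a\u00f1o an\u0303o"

print(find_matches("n", text))         # [1]     only the plain 'n'
print(find_matches("n\u0303", text))   # [5, 9]  'ñ', one codepoint or two
print(find_matches("na", "\u00f1a"))   # []      'na' never matches inside 'ña'
```

With this sketch, the match at unit 1 found after typing 'n' is dropped
once the combining tilde is added, and both spellings of 'ñ' are found
instead.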