[Philippe tells me that his message that I'm quoting could have been rejected by the mailing list as spam; my answer is below.]
On Fri, Dec 21, 2012 at 5:13 AM, Philippe Verdy <verd...@wanadoo.fr> wrote:
> This is an interesting case. A solution would be to be able to define a
> distinct collation element for "^ë", where "^" means "beginning of a word"
> (even if there's no character encoded there). That element would be such
> that:
>
> e << ë < ^ë
>
> But this requires a prior definition of word boundaries to recognize the "^"
> as an additional collation element by itself (usable distinctly only in
> context, and ignored when it occurs anywhere else, meaning that all weights
> assigned to "^" alone would be null).
>
> So "^ë" would become valid as a collation element, but "т^ё" makes no sense
> if there's no possible word boundary between "т" and "ё".
>
> This would work with the UCA algorithm, which does not really mandate what
> is a "collation element" (not only in terms of encoding as characters), or
> any syntax to support it.
>
> This mechanism of incorporating word boundaries in UCA would be an
> interesting extension for section 6.9 (Handling Collation Graphemes) of
> UTS#10 (but for now there's no support for it in LDML with a defined syntax
> allowing the insertion of boundaries or other contextual conditions).

Would it also mean that using a CGJ at the beginning of a word would cause a ё
at the beginning of that word to be treated as a mid-word one?
Is "space, CGJ" a well-formed character sequence?

Leo