I just found a mapping table proposed by John Cowan [1]. It dates from 2003, so it doesn't cover newer Unicode versions, but it's surely better than our current solution.
[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html

On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün <christian.gr...@gmail.com> wrote:
> Hi Graydon,
>
> I just had a look. In BaseX, "without diacritics" can be explained by
> a single, glorious mapping table [1].
>
> It's quite obvious that there are just too many cases which are not
> covered by this mapping. We introduced this solution in the very
> beginnings of our full-text implementation, and I am just surprised
> that it survived for such a long time, probably because it was
> sufficient for most use cases our users came across so far.
>
> However, I would like to extend the current solution with something
> more general and, still, more efficient than full Unicode
> normalizations (performance-wise, the current mapping is probably
> difficult to beat). As you already indicated, the XQFT spec left it to
> the implementers to decide what diacritics are.
>
>> I'd like to advocate for an equivalent to the "decomposed normal form,
>> strip the non-spacing modifier characters, recompose to composed
>> normal form" equivalence because at least that one is plausibly well
>> understood.
>
> Shame on me; could you give me some quick tutoring what this would
> mean?… Would accents and dots from German umlauts, and other
> characters in the range of \C380-\C3BF, be stripped as well by that
> recomposition? And just in case you know more about it: What happens
> with characters like the German "ß" that is typically rewritten to two
> characters ("ss")?
>
> Thanks,
> Christian
>
> [1]
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
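
For reference, the "decompose, strip the non-spacing characters, recompose" equivalence quoted above could be sketched roughly as follows. This is only a minimal illustration, assuming the standard java.text.Normalizer class and the regex category \p{Mn} (non-spacing marks); the class and method names are hypothetical and this is not the BaseX implementation.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not taken from BaseX.
final class StripDiacritics {

    static String stripDiacritics(String input) {
        // 1. Decompose (NFD): "\u00FC" (u with diaeresis) becomes "u" + U+0308.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Strip all non-spacing marks (Unicode general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}", "");
        // 3. Recompose (NFC) whatever is left.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "Gr\u00FCn" is "Grün"; the umlaut dots are removed.
        System.out.println(stripDiacritics("Gr\u00FCn"));   // Grun
        // "Stra\u00DFe" is "Straße"; U+00DF has no canonical decomposition.
        System.out.println(stripDiacritics("Stra\u00DFe")); // Straße
    }
}
```

With this approach the dots of the German umlauts are stripped, because "ü" decomposes into "u" plus a combining diaeresis. "ß" has no canonical decomposition, so it passes through unchanged; rewriting it to "ss" would need a separate step, which this sketch does not attempt.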