Hi Graydon,

I just had a look. In BaseX, "without diacritics" can be explained by a single, glorious mapping table [1].
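To make the idea concrete, a table-based approach boils down to something like the minimal sketch below. This is not the code from Token.java: the class name `DiacriticsTable`, the `strip` method and the handful of entries are made up for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a table-based diacritics mapping.
// The entries below are an arbitrary, deliberately tiny selection.
final class DiacriticsTable {
  private static final Map<Character, Character> TABLE = new HashMap<>();

  static {
    TABLE.put('\u00E4', 'a'); // a with diaeresis
    TABLE.put('\u00F6', 'o'); // o with diaeresis
    TABLE.put('\u00FC', 'u'); // u with diaeresis
    TABLE.put('\u00E9', 'e'); // e with acute
    TABLE.put('\u00E8', 'e'); // e with grave
  }

  // Replaces every character that has a table entry;
  // all other characters are passed through unchanged.
  static String strip(final String input) {
    final StringBuilder sb = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++) {
      final char c = input.charAt(i);
      final char mapped = TABLE.getOrDefault(c, c);
      sb.append(mapped);
    }
    return sb.toString();
  }
}
```

A lookup like this is cheap, but any character without an entry (in this toy table, for example, U+0151, o with double acute) is returned as is.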
It's quite obvious that there are too many cases that are not covered by this mapping. We introduced this solution in the very early days of our full-text implementation, and I am surprised that it has survived for such a long time, probably because it has been sufficient for most use cases our users have come across so far.

However, I would like to extend the current solution with something more general and, still, more efficient than full Unicode normalization (performance-wise, the current mapping is probably difficult to beat).

As you already indicated, the XQFT spec left it to the implementers to decide what diacritics are.

> I'd like to advocate for an equivalent to the "decomposed normal form,
> strip the non-spacing modifier characters, recompose to composed
> normal form" equivalence because at least that one is plausibly well
> understood.

Shame on me; could you give me some quick tutoring on what this would mean?… Would accents and dots from German umlauts, and other characters in the range of \C380-\C3BF, be stripped as well by that recomposition?

And just in case you know more about it: What happens with characters like the German "ß", which is typically rewritten as two characters ("ss")?

Thanks,
Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420