Re: [basex-talk] More Diacritic Questions

Christian Grün Sun, 23 Nov 2014 14:20:06 -0800

Hi Graydon,

I just had a look. In BaseX, "without diacritics" can be explained by
a single, glorious mapping table [1].


It's quite obvious that there are just too many cases which are not
covered by this mapping. We introduced this solution in the very
beginnings of our full-text implementation, and I am just surprised
that it survived for such a long time, probably because it was
sufficient for most use cases our users came across so far.

However, I would like to extend the current solution with something
more general and, still, more efficient than full Unicode
normalizations (performance-wise, the current mapping is probably
difficult to beat). As you already indicated, the XQFT spec left it to
the implementers to decide what diacritics are.

> I'd like to advocate for an equivalent to the "decomposed normal form,
> strip the non-spacing modifier characters, recompose to composed
> normal form" equivalence because at least that one is plausibly well
> understood.

Shame on me; could you give me some quick tutoring on what this would
mean?… Would accents and dots from German umlauts, and other
characters in the range of \C380-\C3BF, be stripped as well by that
recomposition? And just in case you know more about it: What happens
with characters like the German "ß", which is typically rewritten to two
characters ("ss")?
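
For reference, the procedure quoted above could be sketched in Java
roughly as follows. This is only a minimal sketch, assuming the standard
java.text.Normalizer class and treating "non-spacing modifier characters"
as the Unicode general category Mn; it is not BaseX code, and the class
and method names are hypothetical.

```java
import java.text.Normalizer;

public class StripDiacritics {
  // Hypothetical helper (not part of BaseX): decompose, drop marks, recompose.
  static String strip(String input) {
    // 1. Canonical decomposition: "u with diaeresis" becomes "u" + U+0308.
    String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
    // 2. Remove all non-spacing marks (general category Mn).
    String stripped = decomposed.replaceAll("\\p{Mn}", "");
    // 3. Recompose whatever is left into composed normal form.
    return Normalizer.normalize(stripped, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    // \u00FC is u with diaeresis, \u00DF is the German sharp s.
    System.out.println(strip("Gr\u00FCn"));
    System.out.println(strip("Stra\u00DFe"));
  }
}
```

Under this sketch, an umlaut such as U+00FC loses its dots, because it
has a canonical decomposition into a base letter plus a combining mark.
"ß" (U+00DF) has no Unicode decomposition, so it passes through
unchanged; rewriting it to "ss" would need a separate step such as case
folding.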

Thanks,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
