I just found a mapping table proposed by John Cowan [1]. It's already
pretty old, so it doesn't cover newer Unicode versions, but it's
surely better than our current solution.

[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html
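
As a rough sketch of the "decompose, strip the non-spacing marks,
recompose" approach quoted below, the following uses Java's standard
java.text.Normalizer. The class and method names (StripDiacritics,
strip) are hypothetical and not part of BaseX, and treating the
Unicode category Mn as "the diacritics" is an assumption, since the
XQFT spec leaves that definition to the implementation.

```java
import java.text.Normalizer;

public class StripDiacritics {
  // Hypothetical helper (not BaseX code): removes non-spacing marks.
  static String strip(String input) {
    // 1. Decompose: e.g. U+00FC becomes 'u' followed by U+0308.
    String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
    // 2. Drop all characters of general category Mn (non-spacing marks).
    String stripped = decomposed.replaceAll("\\p{Mn}+", "");
    // 3. Recompose whatever is left into composed normal form.
    return Normalizer.normalize(stripped, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    System.out.println(strip("Gr\u00fcn"));    // prints "Grun"
    System.out.println(strip("Stra\u00dfe"));  // "ß" is left unchanged
  }
}
```

With this sketch the dots of German umlauts are stripped, because the
umlauts decompose into a base letter plus a combining mark. "ß" has no
canonical decomposition, so it passes through untouched; rewriting it
to "ss" would need a separate step such as case folding or an explicit
mapping.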


On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün
<christian.gr...@gmail.com> wrote:
> Hi Graydon,
>
> I just had a look. In BaseX, "without diacritics" can be explained by
> a single, glorious mapping table [1].
>
> It's quite obvious that there are just too many cases which are not
> covered by this mapping. We introduced this solution at the very
> beginning of our full-text implementation, and I am just surprised
> that it has survived for such a long time, probably because it was
> sufficient for most use cases our users have come across so far.
>
> However, I would like to extend the current solution with something
> more general and, still, more efficient than full Unicode
> normalizations (performance-wise, the current mapping is probably
> difficult to beat). As you already indicated, the XQFT spec left it to
> the implementers to decide what diacritics are.
>
>> I'd like to advocate for an equivalent to the "decomposed normal form,
>> strip the non-spacing modifier characters, recompose to composed
>> normal form" equivalence because at least that one is plausibly well
>> understood.
>
> Shame on me; could you give me some quick tutoring on what this would
> mean?… Would accents and dots from German umlauts, and other
> characters in the range of \C380-\C3BF, be stripped as well by that
> recomposition? And just in case you know more about it: What happens
> with characters like the German "ß" that is typically rewritten to two
> characters ("ss")?
>
> Thanks,
> Christian
>
> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
