Hi Christian -- After various adventures re-learning Perl's encoding management quirks, I generated a simple XML file of all the codepoints between 0x20 and 0xD7FF; this isn't complete for XML but I thought it would be enough to be interesting.
If I load that file into the current BaseX dev version (BaseX80-20141128.214728.zip) using the GUI, and *do* turn on full-text indexing and *do not* turn on diacritics, //text()[contains(.,'<')] gives me three hits (spread across these two codepoint elements):

<codepoint>
  <value>U+003C</value>
  <glyph><</glyph>
  <nodiacritic><</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>
<codepoint>
  <value>U+226E</value>
  <glyph>≮</glyph>
  <nodiacritic><</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>

I think there "should" be four against the relevant bit of XML with full-text search, since with diacritics stripped, U+226E's glyph should also match. (U+226E's ability to decompose into a less-than sign is one of my very favourite surprises involved in stripping diacritics. What do you mean the document stopped being well-formed...?)

How to get the full-text search to confirm this is not obvious;

for $x in //text()
where $x contains text { "A" }
return $x

happily gives me 101 results, case- and diacritic-insensitive;

for $x in //text()
where $x contains text { "<" }
return $x

gives me nothing, presumably on the grounds that < isn't a letter.

I suspect ICU is the way to go; keeping an all-Unicode table up to date involves more suffering than anyone should willingly undertake. (I can probably still generate that table for you if you like.)

-- Graydon

On Sun, Nov 23, 2014 at 8:42 PM, Christian Grün <christian.gr...@gmail.com> wrote:
> Hi Graydon,
>
> Thanks for your detailed reply, much appreciated.
>
> For today, I decided to choose a pragmatic solution that covers
> many more cases than before. I have added some more (glorious)
> mappings motivated by John Cowan's mail, which can now be found
> in a new class [1].
>
> However, to push things a bit further, I have rewritten the code for
> removing diacritics. Normalized tokens may now have a different byte
> length than the original token, as I'm removing combining marks as
> well (starting from 0300, and others).
>
> As a result, the following query now yields the expected result (true):
>
> (: U+00E9 vs. U+0065 U+0301 :)
> let $e1 := codepoints-to-string(233)
> let $e2 := codepoints-to-string((101, 769))
> return $e1 contains text { $e2 }
>
> I will give some more thought to embracing full Unicode
> normalization. I fully agree that it makes sense to use standards
> whenever appropriate. However, one disadvantage for us is that it
> usually operates on String data, whereas most textual data in BaseX
> is internally represented as byte arrays. A further challenge is that
> Java's Unicode support is no longer up to date. For example, I am
> checking diacritical combining marks from Unicode 7.0 (1AB0–1AFF)
> that are not detected as such by current versions of Java.
>
> To support the new requirements of XQuery 3.1 (see e.g. [2]), we are
> already working with ICU [3]; it will be loaded dynamically if it is
> found in the classpath. In future, we could use it for all of our
> full-text operations as well, but the optional embedding comes at a
> price in terms of performance.
>
> Looking forward to your feedback on the new snapshot,
> Christian
>
> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java
> [2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
> [3] http://site.icu-project.org/
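P.S. For anyone following the thread who wants to check the decomposition behaviour we're both describing, plain java.text.Normalizer already shows it. This is a sketch of the general technique only, not BaseX's actual code path (FTToken works on byte arrays and its own mapping tables, not on java.lang.String):

```java
import java.text.Normalizer;

public class StripMarks {
    // Decompose to NFD, then drop combining marks (Unicode category M).
    static String strip(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // U+226E NOT LESS-THAN decomposes canonically to U+003C U+0338,
        // so stripping the combining long solidus leaves a bare '<'.
        System.out.println(strip("\u226E").equals("<"));              // true
        // U+00E9 and U+0065 U+0301 reduce to the same base letter 'e'.
        System.out.println(strip("\u00E9").equals(strip("e\u0301"))); // true
    }
}
```

Which is exactly why a diacritic-stripped document can stop being well-formed: the '<' that falls out of ≮ is a real less-than sign.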