Hi Christian --

On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün <christian.gr...@gmail.com> wrote:
> Hi Graydon,
>
>> //text()[contains(.,'<')]
>>
>> gives me three hits.
>>
>> I think there "should" be four against the relevant bit of XML
>> with full-text search, since with no diacritics, U+226E should match.
>
> So you would expect this node to be returned as well?
>
> <glyph>≮</glyph>
>
> For this, you'll probably have to call normalize-unicode first:
>
> //text()[contains(normalize-unicode(., 'NFD'),'<')]
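
As a side illustration of why the normalize-unicode suggestion works, here is a minimal sketch in Python rather than XQuery (the variable names are mine, not anything from BaseX). U+226E has a canonical decomposition into U+003C followed by the combining character U+0338, so after NFD normalization a plain substring test for '<' succeeds:

```python
import unicodedata

# U+226E NOT LESS-THAN, i.e. the content of <glyph>≮</glyph>
glyph = "\u226e"

# The precomposed character is a single code point, so a plain
# substring test for "<" does not match it.
found_before = "<" in glyph

# NFD splits it into "<" (U+003C) followed by
# COMBINING LONG SOLIDUS OVERLAY (U+0338).
decomposed = unicodedata.normalize("NFD", glyph)
code_points = [f"U+{ord(c):04X}" for c in decomposed]

# After decomposition the substring test matches.
found_after = "<" in decomposed

print(found_before, code_points, found_after)
```

This only shows the code-point-level effect of NFD; it says nothing about how a full-text tokenizer would treat the resulting characters.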

With that query, absolutely I should only get three hits.

My expectation for "full text search" is that it searches the contents
of text nodes. (I'm not sure there's a coherent way to describe
"text" in XML that isn't "contents of text nodes".) So I would expect
that, with a full text search that ignores diacritics, I'd get four hits.

>> for $x in //text()
>> where $x contains text { "<" }
>> return $x
>>
>> gives me nothing, presumably on the grounds that < isn't a letter.
>
> Exactly. With "contains text", only letters can be found. It would
> generally be possible to write a tokenizer that also returns other
> characters as tokens, but there has been no use for that until now
> (and it would generate many new questions in regards to normalization,
> with and without ICU).

Entirely understood that the tokenizer only recognizes letters. I
don't think it's clear that "text" in "full text" means "groups of
letters". Anything that isn't letters is sort of inherently an edge
case, but it's not too hard to imagine text with equations, and strange
effects from operators that have a decomposable Unicode representation.

[snip]

> If you want to play around with our current ICU support, feel free to
> download the latest snapshot, add ICU to the classpath, and use the
> new XQuery 3.1 UCA collation. The new fn:collation-key() function is
> still work in progress, but all other collation features should
> already be available when using the XQuery default string functions.

That's very interesting; thank you! I shall see about taking a poke
at that, and maybe trying to produce some performance numbers.

Thanks!
Graydon