Hi Christian --

On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün
<christian.gr...@gmail.com> wrote:
> Hi Graydon,
>
>> //text()[contains(.,'&lt;')]
>>
>> gives me three hits.
>>
>> I think there "should" be four against the relevant bit of XML
>> with full-text search, since with no diacritics, U+226E should match.
>
> So you would expect this node to be returned as well?
>
>    <glyph>≮</glyph>
>
> For this, you'll probably have to call normalize-unicode first:
>
>   //text()[contains(normalize-unicode(., 'NFD'),'&lt;')]

With that query, absolutely I should only get three hits.

My expectation for "full text search" is that it searches the contents
of text nodes.  (Since I'm not sure there's a coherent way to describe
"text" in XML that isn't "contents of text nodes".)

So I would expect that, with a full text search that ignores
diacritics, I'd get four hits.
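
To make that expectation concrete, here is a minimal sketch in Python
(not XQuery, and not a claim about how BaseX does anything internally).
The three ASCII strings are made-up stand-ins for the three nodes that
already match; only the U+226E node comes from this thread. The helper
name strip_diacritics is my own.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (general category M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# Hypothetical text-node contents; the last one is <glyph>U+226E</glyph>.
nodes = ["a < b", "x<y", "if 1 < 2 then", "\u226e"]

# Plain substring test, comparable to contains(., '<') on the raw text.
raw_hits = [n for n in nodes if "<" in n]

# Diacritic-insensitive test: U+226E decomposes to U+003C U+0338, and
# U+0338 is a combining mark, so only "<" is left after stripping.
folded_hits = [n for n in nodes if "<" in strip_diacritics(n)]

print(len(raw_hits), len(folded_hits))
```

The raw test misses the glyph node and the folded test finds it, which
is the difference between three hits and four.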

>> for $x in //text()
>> where $x contains text { "<" }
>> return $x
>>
>> gives me nothing, presumably on the grounds that < isn't a letter.
>
> Exactly. With "contains text", only letters can be found. It would
> generally be possible to write a tokenizer that also returns other
> characters as tokens, but there has been no use for that until now
> (and it would generate many new questions in regards to normalization,
> with and without ICU).

Entirely understood that the tokenizer only recognizes letters.

I don't think it's clear that "text" in "full text" means "groups of
letters".  Anything that isn't a letter is inherently something of an
edge case, but it's not hard to imagine text containing equations, and
strange results from operators that have a decomposable Unicode
representation.

[snip]
> If you want to play around with our current ICU support, feel free to
> download the latest snapshot, add ICU to the classpath, and use the
> new XQuery 3.1 UCA collation. The new fn:collation-key() function is
> still work in progress, but all other collation features should
> already be available when using the XQuery default string functions.

That's very interesting; thank you!

I shall see about taking a poke at that, and maybe trying to produce
some performance numbers.

Thanks!
Graydon
