Hi Christian --

After various adventures re-learning Perl's encoding management
quirks, I generated a simple XML file of all the codepoints between
0x20 and 0xD7FF; this isn't complete for XML but I thought it would be
enough to be interesting.
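In case it's useful, here's a rough Python sketch of the generation step (what I actually ran was Perl; the element names follow the sample further down, and the root element name and helper names here are made up):

```python
import unicodedata
from xml.sax.saxutils import escape

def strip_marks(s):
    # NFD-decompose, then drop combining marks (category Mn).
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def codepoint_element(cp):
    ch = chr(cp)
    base = strip_marks(ch)
    base_cps = " ".join(f"U+{ord(c):04X}" for c in base)
    return ("<codepoint>\n"
            f"  <value>U+{cp:04X}</value>\n"
            f"  <glyph>{escape(ch)}</glyph>\n"
            f"  <nodiacritic>{escape(base)}</nodiacritic>\n"
            f"  <basevalue>{base_cps}</basevalue>\n"
            "</codepoint>")

def build_document(start=0x20, end=0xD7FF):
    # One element per codepoint; markup characters are escaped so the
    # output stays well-formed.
    body = "\n".join(codepoint_element(cp) for cp in range(start, end + 1))
    return f"<codepoints>\n{body}\n</codepoints>"
```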

If I load that file into the current BaseX dev version
(BaseX80-20141128.214728.zip) using the GUI, and *do* turn on Full-Text
indexing and *do not* turn on diacritics,

//text()[contains(.,'<')]

gives me three hits.

<codepoint>
  <value>U+003C</value>
  <glyph>&lt;</glyph>
  <nodiacritic>&lt;</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>
<codepoint>
  <value>U+226E</value>
  <glyph>≮</glyph>
  <nodiacritic>&lt;</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>

I think there should be four hits against the relevant bit of XML
with full-text search, since with diacritics stripped, U+226E should
match.  (U+226E's canonical decomposition into a less-than sign plus a
combining mark is one of my very favourite surprises involved in
stripping diacritics.  What do you mean the document stopped being
well-formed...?)
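That surprise is easy to reproduce with Python's stdlib unicodedata (just a sketch of the decomposition, not BaseX's own normalization code):

```python
import unicodedata

# U+226E NOT LESS-THAN canonically decomposes to U+003C U+0338
# (LESS-THAN SIGN + COMBINING LONG SOLIDUS OVERLAY).
decomposed = unicodedata.normalize("NFD", "\u226E")

# Dropping the combining marks (category Mn) leaves a bare '<' --
# which is markup if it lands back inside an XML document unescaped.
stripped = "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
```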

How to get the full-text search to confirm this is not obvious:

for $x in //text()
where $x contains text { "A" }
return $x

happily gives me 101 results, case- and diacritic-insensitive;

for $x in //text()
where $x contains text { "<" }
return $x

gives me nothing, presumably on the grounds that < isn't a letter.
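My mental model of what's happening, as a toy Python sketch (purely illustrative; this is not BaseX's actual tokenizer): if the full-text tokenizer only emits runs of letters and digits, then "<" produces no token at all, so there is nothing for contains text to match.

```python
import unicodedata

def toy_tokens(s):
    # Crude model of a full-text tokenizer: keep runs of letters and
    # digits, and treat everything else (including '<') as a separator.
    tokens, cur = [], []
    for c in s:
        if unicodedata.category(c)[0] in ("L", "N"):
            cur.append(c)
        elif cur:
            tokens.append("".join(cur))
            cur = []
    if cur:
        tokens.append("".join(cur))
    return tokens
```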

I suspect ICU is the way to go; keeping an all-Unicode table up to
date involves more suffering than anyone should willingly undertake.
(I can probably still generate that table for you if you like.)

-- Graydon

On Sun, Nov 23, 2014 at 8:42 PM, Christian Grün
<christian.gr...@gmail.com> wrote:
> Hi Graydon,
>
> Thanks for your detailed reply, very appreciated.
>
> For today, I decided to choose a pragmatic solution that provides
> support for much more cases than before. I have added some more
> (glorious) mappings motivated by John Cowan's mail, which can now be
> found in a new class [1].
>
> However, to push things a bit further, I have rewritten the code for
> removing diacritics. Normalized tokens may now have a different byte
> length than the original token, as I'm removing combining marks as
> well (starting from 0300, and others).
>
> As a result, the following query will now yield the expected result (true):
>
>   (: U+00E9 vs. U+0065 U+0301 :)
>   let $e1 := codepoints-to-string(233)
>   let $e2 := codepoints-to-string((101, 769))
>   return $e1 contains text { $e2 }
>
> I will have some more thoughts on embracing the full Unicode
> normalization. I fully agree that it makes sense to use standards
> whenever appropriate. However, one disadvantage for us is that it
> usually works on String data, whereas most textual data in BaseX is
> internally represented in byte arrays. One more challenge is that
> Java's Unicode support is not up-to-date anymore. For example, I am
> checking diacritical combining marks from Unicode 7.0 that are not
> detected as such by current versions of Java (1AB0–1AFF).
>
> To be able to support the new requirements of XQuery 3.1 (see e.g.
> [2]), we are already working with ICU [3]; it will be requested
> dynamically if it's found in the classpath. In future, we could use it
> for all of our full-text operations as well, but the optional
> embedding comes at a price in terms of performance.
>
> Looking forward to your feedback on the new snapshot,
> Christian
>
> [1] 
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java
> [2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
> [3] http://site.icu-project.org/
