Thanks, Christian. What is the effective character set used when diacritics
are removed? Latin-1?

Tim


--
Tim A. Thompson (*he, him*)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson
timothy.thomp...@yale.edu

On Mon, Nov 22, 2021 at 2:53 PM Christian Grün <christian.gr...@gmail.com>
wrote:

> Hi Tim,
>
> > I have a question about the BaseX ft:normalize function. What kind of
> > Unicode normalization is performed by this function, and how might it be
> > implemented using standard XPath functions?
>
> The function is based on a custom BaseX tokenization, which includes
> normalization of case, removal of diacritics and (if enabled)
> language-based stemming. It would be rather challenging to implement
> the behavior with standard XPath (that’s mostly why we introduced
> ft:tokenize and ft:normalize). If you are looking for a starting
> point, you could begin with the FtTokenize Java class [1].
>
> Hope this helps,
> Christian
>
> [1]
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/query/func/ft/FtTokenize.java#L31-L51
>
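
As a rough illustration of two of the steps described above (case normalization and removal of diacritics), here is a minimal Java sketch. It is not the BaseX implementation and does not reproduce the FtTokenize logic: it assumes that diacritics can be approximated by decomposing the input to NFD and dropping combining marks, and it skips tokenization and stemming entirely. The class name NormalizeSketch and the method approximateNormalize are hypothetical names chosen for this example.

```java
import java.text.Normalizer;
import java.util.Locale;

public class NormalizeSketch {

    // Hypothetical helper, NOT part of BaseX: approximates case
    // normalization and diacritic removal with standard JDK classes.
    static String approximateNormalize(String input) {
        // Decompose precomposed characters, e.g. U+00FC becomes "u" + U+0308.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop all combining marks (Unicode general category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lower-case independently of the default locale.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Input is "Grün Café", written with escapes to avoid encoding issues.
        System.out.println(approximateNormalize("Gr\u00fcn Caf\u00e9")); // prints "grun cafe"
    }
}
```

In this sketch, letters that have no canonical decomposition (for example "ø") pass through unchanged, so the output is not limited to ASCII or Latin-1. Whether ft:normalize behaves the same way would need to be checked against the BaseX source.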
