Thanks again. Does BaseX support any Unicode block properties, such as \p{InCombiningDiacriticalMarks}, in regex functions? \p{Mn} works, but \p{InCombiningDiacriticalMarks} doesn't seem to.
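
For context, the `In` prefix is the Java/Perl spelling of a block escape. XPath regular expressions follow the XML Schema syntax, where blocks are written with an `Is` prefix (for example `\p{IsCombiningDiacriticalMarks}`), so that form may be worth trying; whether BaseX accepts it is not confirmed here. The sketch below is plain Java, not BaseX code, and it does not reproduce ft:normalize. It only illustrates the difference between stripping by general category (`\p{Mn}`) and stripping by block after NFD decomposition. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of BaseX.
final class DiacriticsSketch {

    // Decompose to NFD, then delete every nonspacing mark (general category Mn).
    static String stripByCategory(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    // Same idea, restricted to the Combining Diacritical Marks block (U+0300..U+036F).
    // Note the Java-style "In" prefix; XML Schema regexes spell this "Is".
    static String stripByBlock(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}", "");
    }

    public static void main(String[] args) {
        // "Gr\u00fcn caf\u00e9" is "Grün café" written with Unicode escapes.
        System.out.println(stripByCategory("Gr\u00fcn caf\u00e9"));
        System.out.println(stripByBlock("Gr\u00fcn caf\u00e9"));
    }
}
```

Characters without a canonical decomposition, such as "ø", pass through both methods unchanged, so this approach does not guarantee pure US-ASCII output.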
Tim

--
Tim A. Thompson (*he, him*)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson
timothy.thomp...@yale.edu

On Tue, Nov 23, 2021 at 11:16 AM Christian Grün <christian.gr...@gmail.com> wrote:

> It’s US-ASCII (7 bit).
>
> Tim Thompson <timat...@gmail.com> wrote on Tue, Nov 23, 2021, 17:07:
>
>> Thanks, Christian. What is the effective character set used when
>> diacritics are removed? Latin-1?
>>
>> Tim
>>
>> --
>> Tim A. Thompson (*he, him*)
>> Librarian for Applied Metadata Research
>> Yale University Library
>> www.linkedin.com/in/timathompson
>> timothy.thomp...@yale.edu
>>
>> On Mon, Nov 22, 2021 at 2:53 PM Christian Grün <christian.gr...@gmail.com>
>> wrote:
>>
>>> Hi Tim,
>>>
>>> > I have a question about the BaseX ft:normalize function. What kind of
>>> > Unicode normalization is performed by this function, and how might it be
>>> > implemented using standard XPath functions?
>>>
>>> The function is based on a custom BaseX tokenization, which includes
>>> normalization of case, removal of diacritics and (if enabled)
>>> language-based stemming. It would be rather challenging to implement
>>> the behavior with standard XPath (that’s mostly why we introduced
>>> ft:tokenize and ft:normalize). If you are looking for a starting
>>> point, you could begin with the FtTokenize Java class [1].
>>>
>>> Hope this helps,
>>> Christian
>>>
>>> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/query/func/ft/FtTokenize.java#L31-L51