Thanks again.

Does BaseX support any Unicode block properties, such as
\p{InCombiningDiacriticalMarks}, in regex functions? \p{Mn} works, but
\p{InCombiningDiacriticalMarks} doesn't seem to.

Tim


--
Tim A. Thompson (*he, him*)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson
timothy.thomp...@yale.edu


On Tue, Nov 23, 2021 at 11:16 AM Christian Grün <christian.gr...@gmail.com>
wrote:

> It’s US-ASCII (7 bit).
>
>
> Tim Thompson <timat...@gmail.com> wrote on Tue, Nov 23, 2021, 17:07:
>
>> Thanks, Christian. What is the effective character set used when
>> diacritics are removed? Latin-1?
>>
>> Tim
>>
>> On Mon, Nov 22, 2021 at 2:53 PM Christian Grün <christian.gr...@gmail.com>
>> wrote:
>>
>>> Hi Tim,
>>>
>>> > I have a question about the BaseX ft:normalize function. What kind of
>>> > Unicode normalization is performed by this function, and how might it be
>>> > implemented using standard XPath functions?
>>>
>>> The function is based on a custom BaseX tokenization, which includes
>>> normalization of case, removal of diacritics and (if enabled)
>>> language-based stemming. It would be rather challenging to implement
>>> the behavior with standard XPath (that’s mostly why we introduced
>>> ft:tokenize and ft:normalize). If you are looking for a starting
>>> point, you could begin with the FtTokenize Java class [1].
>>>
>>> Hope this helps,
>>> Christian
>>>
>>> [1]
>>> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/query/func/ft/FtTokenize.java#L31-L51
>>>
>>
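
The two questions in this thread (how diacritics get removed, and how
\p{Mn} relates to the Combining Diacritical Marks block) can be
illustrated with a minimal sketch. It is written in Python rather than
XQuery and is only an approximation under stated assumptions: it is not
the BaseX tokenizer, and the function names are hypothetical. It
decomposes text to NFD, drops characters of general category Mn, and
separately checks membership in the block U+0300 to U+036F.

```python
import unicodedata

# Code point range of the Unicode block "Combining Diacritical Marks".
BLOCK_START, BLOCK_END = 0x0300, 0x036F


def strip_diacritics(text: str) -> str:
    """Approximate diacritic removal: decompose to NFD, then drop
    nonspacing marks (general category Mn). Hypothetical helper, not
    the algorithm BaseX uses for ft:normalize."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def in_combining_diacritical_marks(ch: str) -> bool:
    """True if ch lies in the Combining Diacritical Marks block."""
    return BLOCK_START <= ord(ch) <= BLOCK_END


def is_nonspacing_mark(ch: str) -> bool:
    """True if ch has general category Mn (what \\p{Mn} matches)."""
    return unicodedata.category(ch) == "Mn"


if __name__ == "__main__":
    print(strip_diacritics("Christian Grün"))
    # U+0308 COMBINING DIAERESIS is in the block and is Mn.
    print(in_combining_diacritical_marks("\u0308"), is_nonspacing_mark("\u0308"))
    # U+05B0 HEBREW POINT SHEVA is Mn but lies outside the block.
    print(in_combining_diacritical_marks("\u05B0"), is_nonspacing_mark("\u05B0"))
```

The category Mn is broader than the block, so the two tests are not
interchangeable. On the syntax side, the XML Schema regular expression
flavor that XPath builds on writes block escapes as \p{IsBlockName},
while the \p{In...} spelling comes from Java, so
\p{IsCombiningDiacriticalMarks} may be worth trying in BaseX; that is
not confirmed anywhere in this thread.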
