Hi Everyone,
I am rather confused again about diacritic handling in BaseX. For
instance, with Full Text turned on a word like athgabáil will match
both athgabail and athgabáil with "diacritics insensitive" which is
what I would expect. However, if
Hi Chris,
Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do not seem to be properly normalized yet. I have added an issue
for that [1], and I hope I will soon have it fixed.
If you encounter some other surprising behavior like this, feel free to tell us.
Best,
Christian
Hi Christian,
Thanks for letting me know! I also need ḟ (U+1E1F).
All the best,
Chris
On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
> Hi Chris,
>
> Thanks for the observation. I can confirm that some characters like ṡ
> (U+1E61) d
What does "without diacritics" mean?
If it's equivalent to running
normalize-unicode(replace(normalize-unicode($token,'NFKD'),'\p{Mn}',''),'NFKC')
on the tokens we shouldn't expect all the diacritics to go away; cases
like U+00F8 ("latin small letter o with stroke"), despite the
descriptive nam
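For readers who want to try the expression above outside XQuery, here is a minimal Python sketch of the same three steps (NFKD, drop nonspacing marks, NFKC). The function name `strip_marks` is made up for this illustration, and the sketch says nothing about how BaseX itself treats tokens.

```python
import unicodedata

def strip_marks(token):
    """NFKD-decompose, drop nonspacing marks (category Mn), recompose with NFKC."""
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFKC", without_marks)

# U+1E61 and U+1E1F decompose into a base letter plus a combining dot above,
# so they lose the mark; U+00F8 has no decomposition and passes through as-is.
for word in ["athgab\u00e1il", "\u1e61", "\u1e1f", "\u00f8"]:
    print(word, "->", strip_marks(word))
```

Under this approach only characters that Unicode decomposes into a base plus combining marks are affected, which is the limitation the message points out for U+00F8.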
Hi Graydon,
I just had a look. In BaseX, "without diacritics" can be explained by
a single, glorious mapping table [1].
It's quite obvious that there are just too many cases which are not
covered by this mapping. We introduced this solution in the very
beginnings of our full-text implementat
I just found a mapping table proposed by John Cowan [1]. It's already
pretty old, so it doesn't cover newer Unicode versions, but it's
surely better than our current solution.
[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html
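To make the trade-off of a table-driven approach concrete, here is a small Python sketch. The table below is hypothetical and deliberately tiny; it is neither the BaseX table nor John Cowan's mapping, and the name `fold` is invented for this example.

```python
# Hypothetical mapping table: each key is one precomposed character,
# each value is the plain letter it should be folded to.
MAPPING = str.maketrans({
    "\u00e1": "a",  # a with acute
    "\u00e9": "e",  # e with acute
    "\u00ed": "i",  # i with acute
    "\u00f3": "o",  # o with acute
    "\u00fa": "u",  # u with acute
    "\u00f8": "o",  # o with stroke: a table can cover this even without a decomposition
})

def fold(token):
    """Replace every character listed in MAPPING; leave everything else alone."""
    return token.translate(MAPPING)

print(fold("athgab\u00e1il"))  # the accented a is listed, so it is folded
print(fold("\u1e61"))          # U+1E61 is not listed, so it comes back unchanged
```

A lookup table can fold characters that have no Unicode decomposition, but anything missing from the table silently stays as it was, which is the kind of gap reported at the start of this thread.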
On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün
wr
Hi Christian --
That is indeed a glorious table! :)
Unicode defines whether or not a character has a decomposition; so
e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a
combining acute accent). I think the presence of a decomposition is a
recoverable character property in Java.
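The decomposition property can also be inspected from Python's standard library, which may be handy for checking individual characters. This is only a sketch; the helper name `decomposition_of` is made up here, and the Java side is not shown.

```python
import unicodedata

def decomposition_of(ch):
    """Return the code points in ch's decomposition mapping, or [] if it has none."""
    fields = unicodedata.decomposition(ch).split()
    # Compatibility mappings carry a leading tag such as <compat>; skip it.
    return ["U+" + field for field in fields if not field.startswith("<")]

for ch in ["\u00e9", "\u1e61", "\u226e", "\u00f8"]:
    print(f"U+{ord(ch):04X}", decomposition_of(ch))
```

An empty list means the character has no decomposition mapping at all, so a purely decomposition-based approach cannot reduce it to a base letter.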
Hi Chris,
I am glad to report that the latest snapshot of BaseX [1] now provides
much better support for diacritical characters.
Please find more details in my next mail to Graydon.
Hope this helps,
Christian
[1] http://files.basex.org/releases/latest/
Hi Graydon,
Thanks for your detailed reply, much appreciated.
For today, I decided to choose a pragmatic solution that provides
support for many more cases than before. I have added some more
(glorious) mappings motivated by John Cowan's mail, which can now be
found in a new class [1].
However,
Hi Christian,
Great. Thank you for handling this so quickly. When is the next version
due out? I hesitate to run snapshots as my users are rather vocal when
things don't work right.
All the best,
Chris
On Mon, Nov 24, 2014 at 1:13 AM, Christian Grün
wrote:
> Hi Chris,
>
> I am glad to repor
Hi Chris,
> Great. Thank you for handling this so quickly. When is the next version
> due out? I hesitate to run snapshots as my users are rather vocal when
> things don't work right.
Our snapshots are usually very stable, so you should not have many
worries. The next official release is plann
Thanks. I will give it a spin on my test machine first. Darn, I will be
on holiday in Prague around that time but not at the actual conference.
Chris
On Mon, Nov 24, 2014 at 11:15 AM, Christian Grün
wrote:
> Hi Chris,
>
> > Great. Thank you for handling this so quickly. When is the next ver
Hi Christian --
After various adventures re-learning Perl's encoding management
quirks, I generated a simple XML file of all the codepoints between
0x20 and 0xD7FF; this isn't complete for XML but I thought it would be
enough to be interesting.
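A test file of this kind could be produced along the following lines. This is a Python sketch rather than the Perl script mentioned above, and the element and attribute names (`chars`, `c`, `cp`) as well as both function names are assumptions made for the example.

```python
from xml.sax.saxutils import escape

def codepoint_xml(first=0x20, last=0xD7FF):
    """Build an XML document with one element per code point in [first, last]."""
    # The names "chars", "c" and "cp" are invented for this sketch.
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<chars>"]
    for cp in range(first, last + 1):
        # escape() turns &, < and > into entities so the file stays well-formed.
        lines.append(f'  <c cp="U+{cp:04X}">{escape(chr(cp))}</c>')
    lines.append("</chars>")
    return "\n".join(lines) + "\n"

def write_codepoint_xml(path):
    """Write the document as UTF-8 to the given path."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(codepoint_xml())
```

Calling `write_codepoint_xml("codepoints.xml")` (the file name is arbitrary) yields a file that can be loaded into a database for experiments like the one described here.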
If I load that file into current BaseX dev version
(
Hi Graydon,
> //text()[contains(.,'<')]
>
> gives me three hits.
>
> I think there "should" be four against the relevant bit of XML
> with full-text search, since with no diacritics, U+226E should match.
So you would expect this node to be returned as well?
≮
For this, you'll probab
Hi Christian --
On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün
wrote:
> Hi Graydon,
>
>> //text()[contains(.,'<')]
>>
>> gives me three hits.
>>
>> I think there "should" be four against the relevant bit of XML
>> with full-text search, since with no diacritics, U+226E should match.
>
> S
Hi Graydon,
> So I would expect that, with a full text search that ignores
> diacritics, I'd get four hits.
By adding some collation hints to one of the standard string
functions, the comparison will succeed:
fn:compare('≮','<','?lang=en;strength=primary')
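As a rough way to see why such a comparison can succeed, the following Python sketch compares strings after removing combining characters and case. It is only an approximation built on the standard library, not a real collation, and the names `primary_key` and `loosely_equal` are invented for this example.

```python
import unicodedata

def primary_key(text):
    """Approximate a base-letters-only key: NFD, drop combining characters, casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

def loosely_equal(a, b):
    """True if both strings reduce to the same approximate key."""
    return primary_key(a) == primary_key(b)

# U+226E decomposes into '<' plus U+0338 (combining long solidus overlay),
# so dropping the combining character leaves a plain '<'.
print(loosely_equal("\u226e", "<"))
print(loosely_equal("Athgab\u00e1il", "athgabail"))
```

Characters without a decomposition, such as U+00F8, are left untouched by this sketch, whereas a real collation may treat them differently.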
In the example, I used the BaseX not