[basex-talk] More Diacritic Questions

2014-11-23 Thread Chris Yocum
Hi Everyone, I am rather confused again about diacritic handling in BaseX. For instance, with Full Text turned on, a word like athgabáil will match both athgabail and athgabáil with "diacritics insensitive", which is what I would expect. However, if

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris, Thanks for the observation. I can confirm that some characters like ṡ (U+1E61) do not seem to be properly normalized yet. I have added an issue for that [1], and I hope I will soon have it fixed. If you encounter some other surprising behavior like this, feel free to tell us. Best, Christ

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Chris Yocum
Hi Christian, Thanks for letting me know! I also need ḟ U+1E1F. All the best, Chris On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote: > Hi Chris, > > Thanks for the observation. I can confirm that some characters like ṡ > (U+1E61) d

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Graydon Saunders
What does "without diacritics" mean? If it's equivalent to running normalize-unicode(replace(normalize-unicode($token,'NFKD'),'\p{Mn}',''),'NFKC') on the tokens we shouldn't expect all the diacritics to go away; cases like U+00F8 ("latin small letter o with stroke"), despite the descriptive nam
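
As a rough illustration of the expression quoted above, here is a minimal Python sketch of the same decompose, strip, recompose pipeline. It is not the BaseX implementation: the function name `strip_marks` is hypothetical, and Python's `unicodedata` module stands in for XQuery's `normalize-unicode`.

```python
import unicodedata

def strip_marks(token):
    """Decompose (NFKD), drop nonspacing marks (category Mn), recompose (NFKC)."""
    decomposed = unicodedata.normalize("NFKD", token)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFKC", kept)

print(strip_marks("athgab\u00e1il"))  # athgabail
print(strip_marks("\u1e61\u1e1f"))    # sf  (s and f with dot above)
print(strip_marks("\u00f8"))          # ø   (U+00F8 has no decomposition, so it survives)
```

The last line shows the point made in the message: a character such as U+00F8 is left untouched, because the stroke is not encoded as a separate combining mark.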

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon, I just had a look. In BaseX, "without diacritics" can be explained by a single, glorious mapping table [1]. It's quite obvious that there are just too many cases which are not covered by this mapping. We introduced this solution in the very beginnings of our full-text implementat
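
To see why a fixed mapping table misses cases, consider this small Python sketch. The table below is hypothetical and deliberately tiny; it is not the actual BaseX mapping, and `fold_with_table` is an illustrative name only.

```python
# Hypothetical, deliberately incomplete table (NOT the BaseX mapping).
TABLE = str.maketrans({
    "\u00e1": "a",  # a with acute
    "\u00e9": "e",  # e with acute
    "\u00ed": "i",  # i with acute
    "\u00f3": "o",  # o with acute
    "\u00fa": "u",  # u with acute
})

def fold_with_table(token):
    """Replace only the characters listed in TABLE; everything else passes through."""
    return token.translate(TABLE)

print(fold_with_table("athgab\u00e1il"))  # athgabail
print(fold_with_table("\u1e61"))          # unchanged: U+1E61 is not in the table
```

Any character absent from the table is returned as-is, which would be consistent with the behavior reported for ṡ (U+1E61) earlier in the thread.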

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
I just found a mapping table proposed by John Cowan [1]. It's already pretty old, so it doesn't cover newer Unicode versions, but it's surely better than our current solution. [1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün wr

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Graydon Saunders
Hi Christian -- That is indeed a glorious table! :) Unicode defines whether or not a character has a decomposition; so e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a combining acute accent.) I think the presence of a decomposition is a recoverable character property in Java.
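
The decomposition property mentioned here can be inspected directly. The following is a Python sketch using the standard `unicodedata` module (the message itself refers to Java, so this is a stand-in); the helper name `decomposition_of` is hypothetical.

```python
import unicodedata

def decomposition_of(ch):
    """Return the decomposition mapping as 'U+XXXX' strings; empty list if none."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as '<compat>'; skip the tag.
    return ["U+" + part for part in mapping.split() if not part.startswith("<")]

print(decomposition_of("\u00e9"))  # ['U+0065', 'U+0301']
print(decomposition_of("\u00f8"))  # []
```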

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris, I am glad to report that the latest snapshot of BaseX [1] now provides much better support for diacritical characters. Please find more details in my next mail to Graydon. Hope this helps, Christian [1] http://files.basex.org/releases/latest/

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon, Thanks for your detailed reply, much appreciated. For today, I decided to choose a pragmatic solution that provides support for many more cases than before. I have added some more (glorious) mappings motivated by John Cowan's mail, which can now be found in a new class [1]. However,

Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christopher Yocum
Hi Christian, Great. Thank you for handling this so quickly. When is the next version due out? I hesitate to run snapshots as my users are rather vocal when things don't work right. All the best, Chris On Mon, Nov 24, 2014 at 1:13 AM, Christian Grün wrote: > Hi Chris, > > I am glad to repor

Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christian Grün
Hi Chris, > Great. Thank you for handling this so quickly. When is the next version > due out? I hesitate to run snapshots as my users are rather vocal when > things don't work right. Our snapshots are usually very stable, so you should not have many worries. The next official release is plann

Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christopher Yocum
Thanks. I will give it a spin on my test machine first. Darn, I will be on holiday to Prague around that time but not at the actual conference. Chris On Mon, Nov 24, 2014 at 11:15 AM, Christian Grün wrote: > Hi Chris, > > > Great. Thank you for handling this so quickly. When is the next ver

Re: [basex-talk] More Diacritic Questions

2014-11-29 Thread Graydon Saunders
Hi Christian -- After various adventures re-learning Perl's encoding management quirks, I generated a simple XML file of all the codepoints between 0x20 and 0xD7FF; this isn't complete for XML but I thought it would be enough to be interesting. If I load that file into the current BaseX dev version (
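
A test file of this kind could be generated along the following lines. This is a Python sketch, not the Perl script used in the message; the element names `chars` and `c`, the `cp` attribute, and the function names are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape

def codepoint_lines(first=0x20, last=0xD7FF):
    """Yield one XML element per codepoint in [first, last], wrapped in a root."""
    yield "<chars>"
    for cp in range(first, last + 1):
        # escape() turns &, < and > into entities so the file stays well-formed.
        yield '<c cp="U+%04X">%s</c>' % (cp, escape(chr(cp)))
    yield "</chars>"

def write_file(path):
    """Write the generated document as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write("\n".join(codepoint_lines()))
        out.write("\n")

# Example (hypothetical file name):
# write_file("codepoints.xml")
```

Every codepoint from 0x20 to 0xD7FF is a legal XML 1.0 character, so only the markup characters need escaping.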

Re: [basex-talk] More Diacritic Questions

2014-11-29 Thread Christian Grün
Hi Graydon, > //text()[contains(.,'<')] > > gives me three hits. > > I think there should "should" be four against the relevant bit of XML > with full-text search, since with no diacritics, U+226E should match. So you would expect this node to be returned as well? ≮ For this, you'll probab
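
The reasoning behind expecting U+226E to match can be checked with a short Python sketch (standard `unicodedata` only; the variable names are illustrative). It demonstrates the Unicode decomposition, not how BaseX evaluates the query.

```python
import unicodedata

NOT_LESS_THAN = "\u226e"  # the "not less-than" sign

# Canonical decomposition: '<' followed by U+0338 (combining long solidus overlay).
decomposed = unicodedata.normalize("NFD", NOT_LESS_THAN)
# Drop nonspacing marks (category Mn) to get the base character.
base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(["U+%04X" % ord(ch) for ch in decomposed])  # ['U+003C', 'U+0338']
print("<" in NOT_LESS_THAN)                       # False: a plain substring test misses it
print("<" in base)                                # True: matches once the mark is removed
```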

Re: [basex-talk] More Diacritic Questions

2014-11-30 Thread Graydon Saunders
Hi Christian -- On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün wrote: > Hi Graydon, > >> //text()[contains(.,'<')] >> >> gives me three hits. >> >> I think there should "should" be four against the relevant bit of XML >> with full-text search, since with no diacritics, U+226E should match. > > S

Re: [basex-talk] More Diacritic Questions

2014-11-30 Thread Christian Grün
Hi Graydon, > So I would expect that, with a full text search that ignores > diacritics, I'd get four hits. By adding some collation hints to one of the standard string functions, the comparison will succeed: fn:compare('≮','<','?lang=en;strength=primary') In the example, I used the BaseX not