Re: [basex-talk] More Diacritic Questions
Hi Graydon,

> So I would expect that, with a full text search that ignores
> diacritics, I'd get four hits.

By adding some collation hints to one of the standard string functions, the comparison will succeed:

fn:compare('≮', '<', '?lang=en;strength=primary')

In the example, I used the BaseX notation for collations [1] (it is similar to the notation in Saxon or eXist; in future, more and more people will probably switch to the newly introduced UCA collations).

> I don't think it's clear that "text" in "full text" means "groups of
> letters".

I agree. Once again, the XQFT spec does not dictate what a "token" in a full-text search is. Currently, we only have two tokenizers: one for Western languages and another one for Japanese (which gets along without whitespace). When we initially implemented the XQFT features some years ago, our major use case was search in a library catalog (comprising metadata on approx. 2 million titles).

Best,
Christian

[1] http://docs.basex.org/wiki/Full-Text#Collations
Re: [basex-talk] More Diacritic Questions
Hi Christian --

On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün wrote:
> Hi Graydon,
>
>> //text()[contains(.,'<')]
>>
>> gives me three hits.
>>
>> I think there "should" be four against the relevant bit of XML
>> with full-text search, since with no diacritics, U+226E should match.
>
> So you would expect this node to be returned as well?
>
> ≮
>
> For this, you'll probably have to call normalize-unicode first:
>
> //text()[contains(normalize-unicode(., 'NFD'), '<')]

With that query, absolutely, I should only get three hits. My expectation for "full text search" is that it searches the contents of text nodes. (I'm not sure there's a coherent way to describe "text" in XML that isn't "contents of text nodes".) So I would expect that, with a full text search that ignores diacritics, I'd get four hits.

>> for $x in //text()
>> where $x contains text { "<" }
>> return $x
>>
>> gives me nothing, presumably on the grounds that < isn't a letter.
>
> Exactly. With "contains text", only letters can be found. It would
> generally be possible to write a tokenizer that also returns other
> characters as tokens, but there has been no use for that until now
> (and it would generate many new questions in regard to normalization,
> with and without ICU).

Entirely understood that the tokenizer only recognizes letters. I don't think it's clear that "text" in "full text" means "groups of letters". Anything that isn't letters is sort of inherently partaking of an edge-case nature, but it's not too hard to imagine text with equations and strange effects from operators with a decomposable Unicode representation.

[snip]

> If you want to play around with our current ICU support, feel free to
> download the latest snapshot, add ICU to the classpath, and use the
> new XQuery 3.1 UCA collation. The new fn:collation-key() function is
> still work in progress, but all other collation features should
> already be available when using the XQuery default string functions.
That's very interesting; thank you! I shall see about taking a poke at that, and maybe trying to produce some performance numbers.

Thanks!
Graydon
Re: [basex-talk] More Diacritic Questions
Hi Graydon,

> //text()[contains(.,'<')]
>
> gives me three hits.
>
> I think there "should" be four against the relevant bit of XML
> with full-text search, since with no diacritics, U+226E should match.

So you would expect this node to be returned as well?

≮

For this, you'll probably have to call normalize-unicode first:

//text()[contains(normalize-unicode(., 'NFD'), '<')]

> for $x in //text()
> where $x contains text { "<" }
> return $x
>
> gives me nothing, presumably on the grounds that < isn't a letter.

Exactly. With "contains text", only letters can be found. It would generally be possible to write a tokenizer that also returns other characters as tokens, but there has been no use for that until now (and it would generate many new questions in regard to normalization, with and without ICU).

> (I can probably still generate that table for you if you like.)

I think that for now we will stick with the existing tokenization. If it turns out that we need more power, we could think about optionally providing support for ICU as well. However, I'll be glad to have your feedback if you find examples that are currently not, but should be, covered by our diacritics normalization mapping.

If you want to play around with our current ICU support, feel free to download the latest snapshot, add ICU to the classpath, and use the new XQuery 3.1 UCA collation. The new fn:collation-key() function is still work in progress, but all other collation features should already be available when using the XQuery default string functions.

Thanks for your feedback,
Christian
Re: [basex-talk] More Diacritic Questions
Hi Christian --

After various adventures re-learning Perl's encoding management quirks, I generated a simple XML file of all the codepoints between 0x20 and 0xD7FF; this isn't complete for XML, but I thought it would be enough to be interesting.

If I load that file into the current BaseX dev version (BaseX80-20141128.214728.zip) using the GUI, and *do* turn on Full Text indexing and *do not* turn on diacritics,

//text()[contains(.,'<')]

gives me three hits:

U+003C <
< U+003C
U+226E ≮ < U+003C

I think there "should" be four against the relevant bit of XML with full-text search, since with no diacritics, U+226E should match. (U+226E's ability to decompose into a less-than sign is one of my very favourite surprises involved in stripping diacritics. What do you mean the document stopped being well-formed...?)

How I get the full-text search to confirm this is not obvious;

for $x in //text()
where $x contains text { "A" }
return $x

happily gives me 101 results, case- and diacritic-insensitive;

for $x in //text()
where $x contains text { "<" }
return $x

gives me nothing, presumably on the grounds that < isn't a letter.

I suspect ICU is the way to go; having to keep an all-Unicode table up to date involves more suffering than anyone should willingly undertake. (I can probably still generate that table for you if you like.)

-- Graydon

On Sun, Nov 23, 2014 at 8:42 PM, Christian Grün wrote:
> Hi Graydon,
>
> Thanks for your detailed reply, very appreciated.
>
> For today, I decided to choose a pragmatic solution that provides
> support for many more cases than before. I have added some more
> (glorious) mappings motivated by John Cowan's mail, which can now be
> found in a new class [1].
>
> However, to push things a bit further, I have rewritten the code for
> removing diacritics. Normalized tokens may now have a different byte
> length than the original token, as I'm removing combining marks as
> well (starting from 0300, and others).
>
> As a result, the following query will now yield the expected result (true):
>
> (: U+00E9 vs. U+0065 U+0301 :)
> let $e1 := codepoints-to-string(233)
> let $e2 := codepoints-to-string((101, 769))
> return $e1 contains text { $e2 }
>
> I will have some more thoughts on embracing full Unicode
> normalization. I fully agree that it makes sense to use standards
> whenever appropriate. However, one disadvantage for us is that it
> usually works on String data, whereas most textual data in BaseX is
> internally represented as byte arrays. One more challenge is that
> Java's Unicode support is not up to date anymore. For example, I am
> checking diacritical combining marks from Unicode 7.0 that are not
> detected as such by current versions of Java (1AB0–1AFF).
>
> To be able to support the new requirements of XQuery 3.1 (see e.g.
> [2]), we are already working with ICU [3]; it will be requested
> dynamically if it's found in the classpath. In future, we could use it
> for all of our full-text operations as well, but the optional
> embedding comes at a price in terms of performance.
>
> Looking forward to your feedback on the new snapshot,
> Christian
>
> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java
> [2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
> [3] http://site.icu-project.org/
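[Editorial aside, not part of the original thread.] The decomposition behind Graydon's "fourth hit" is easy to verify outside BaseX; a minimal illustrative sketch in Python, using only the standard unicodedata module:

```python
import unicodedata

# U+226E NOT LESS-THAN canonically decomposes into U+003C LESS-THAN SIGN
# followed by U+0338 COMBINING LONG SOLIDUS OVERLAY.
nfd = unicodedata.normalize('NFD', '\u226e')
print([f'U+{ord(c):04X}' for c in nfd])   # ['U+003C', 'U+0338']

# Stripping the combining mark (general category Mn) leaves a bare '<',
# which is why a diacritic-insensitive search would match it.
stripped = ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
print(stripped == '<')                    # True
```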
Re: [basex-talk] More Diacritic Questions
Thanks. I will give it a spin on my test machine first.

Darn, I will be on holiday in Prague around that time but not at the actual conference.

Chris

On Mon, Nov 24, 2014 at 11:15 AM, Christian Grün wrote:
> Hi Chris,
>
>> Great. Thank you for handling this so quickly. When is the next version
>> due out? I hesitate to run snapshots as my users are rather vocal when
>> things don't work right.
>
> Our snapshots are usually very stable, so you should not have many
> worries. The next official release is planned in alignment with the
> XML Prague conference (Feb 13).
>
> Best,
> Christian
Re: [basex-talk] More Diacritic Questions
Hi Chris,

> Great. Thank you for handling this so quickly. When is the next version
> due out? I hesitate to run snapshots as my users are rather vocal when
> things don't work right.

Our snapshots are usually very stable, so you should not have many worries. The next official release is planned in alignment with the XML Prague conference (Feb 13).

Best,
Christian
Re: [basex-talk] More Diacritic Questions
Hi Christian,

Great. Thank you for handling this so quickly. When is the next version due out? I hesitate to run snapshots as my users are rather vocal when things don't work right.

All the best,
Chris

On Mon, Nov 24, 2014 at 1:13 AM, Christian Grün wrote:
> Hi Chris,
>
> I am glad to report that the latest snapshot of BaseX [1] now provides
> much better support for diacritical characters.
>
> Please find more details in my next mail to Graydon.
>
> Hope this helps,
> Christian
>
> [1] http://files.basex.org/releases/latest/
> __
>
> On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders wrote:
> > Hi Christian --
> >
> > That is indeed a glorious table! :)
> >
> > Unicode defines whether or not a character has a decomposition; so
> > e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a
> > combining acute accent). I think the presence of a decomposition is a
> > recoverable character property in Java. (It is in Perl. :)
> >
> > U+0386, "Greek Capital Alpha With Tonos", has a decomposition, so the
> > combining acute accent -- U+0301 again! -- would strip.
> >
> > If one is going to go all strict-and-high-church Unicode, "diacritic"
> > is "anything that decomposes into a combining (that is, non-spacing)
> > character code point when considering the decomposed normal form (NFD
> > or NFKD in the Unicode spec)". This would NOT convert U+00DF, "latin
> > small letter sharp s", into ss, because per the Unicode Consortium,
> > sharp s is a full letter, rather than a modified s. (Same with thorn
> > not decomposing into th, and so on for other things that are
> > considered full letters, which can get surprising in the Scandinavian
> > dotted A's and such.) The disadvantage is that users of BaseX might
> > expect the compare to work; the advantage is that the arbitrarily large
> > number of arguments, headaches, and natural-language edge cases can be
> > shifted off to the Unicode guys by saying "we're following the Unicode
> > character category rules".
> >
> > It also gives something that can be pointed to as an explanation and
> > works like the existing normalize-unicode function. This is not the
> > same as saying it's easy to understand, but it's something.
> >
> > How you do it efficiently, well, my knowledge of Java would probably
> > fit on the bottom of your shoe. On the plus side, Java regular
> > expressions support the \p{...} Unicode character category syntax, so
> > it's got to be in there somewhere. I'd think there's an efficient way
> > to load the huge horrible table once, and then filter the characters
> > by property -- if a character has a decomposition, you want the
> > members of the decomposition; keeping those for which
> > Character.isUnicodeIdentifierStart() returns true comes to mind as
> > something that might work.
> >
> > Did that make sense?
> >
> > -- Graydon
> >
> > On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün wrote:
> >> Hi Graydon,
> >>
> >> I just had a look. In BaseX, "without diacritics" can be explained by
> >> this single, glorious mapping table [1].
> >>
> >> It's quite obvious that there are just too many cases which are not
> >> covered by this mapping. We introduced this solution in the very
> >> beginnings of our full-text implementation, and I am just surprised
> >> that it survived for such a long time, probably because it was
> >> sufficient for most use cases our users came across so far.
> >>
> >> However, I would like to extend the current solution with something
> >> more general and, still, more efficient than full Unicode
> >> normalization (performance-wise, the current mapping is probably
> >> difficult to beat). As you already indicated, the XQFT spec left it to
> >> the implementers to decide what diacritics are.
> >>
> >>> I'd like to advocate for an equivalent to the "decomposed normal form,
> >>> strip the non-spacing modifier characters, recompose to composed
> >>> normal form" equivalence, because at least that one is plausibly well
> >>> understood.
> >>
> >> Shame on me; could you give me some quick tutoring on what this would
> >> mean? Would accents and dots from German umlauts, and other
> >> characters in the range of \C380-\C3BF, be stripped as well by that
> >> recomposition? And just in case you know more about it: what happens
> >> with characters like the German "ß" that is typically rewritten to two
> >> characters ("ss")?
> >>
> >> Thanks,
> >> Christian
> >>
> >> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
Re: [basex-talk] More Diacritic Questions
Hi Graydon,

Thanks for your detailed reply, very appreciated.

For today, I decided to choose a pragmatic solution that provides support for many more cases than before. I have added some more (glorious) mappings motivated by John Cowan's mail, which can now be found in a new class [1].

However, to push things a bit further, I have rewritten the code for removing diacritics. Normalized tokens may now have a different byte length than the original token, as I'm removing combining marks as well (starting from 0300, and others).

As a result, the following query will now yield the expected result (true):

(: U+00E9 vs. U+0065 U+0301 :)
let $e1 := codepoints-to-string(233)
let $e2 := codepoints-to-string((101, 769))
return $e1 contains text { $e2 }

I will have some more thoughts on embracing full Unicode normalization. I fully agree that it makes sense to use standards whenever appropriate. However, one disadvantage for us is that it usually works on String data, whereas most textual data in BaseX is internally represented as byte arrays. One more challenge is that Java's Unicode support is not up to date anymore. For example, I am checking diacritical combining marks from Unicode 7.0 that are not detected as such by current versions of Java (1AB0–1AFF).

To be able to support the new requirements of XQuery 3.1 (see e.g. [2]), we are already working with ICU [3]; it will be requested dynamically if it's found in the classpath. In future, we could use it for all of our full-text operations as well, but the optional embedding comes at a price in terms of performance.

Looking forward to your feedback on the new snapshot,
Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java
[2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
[3] http://site.icu-project.org/
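[Editorial aside, not part of the original thread.] For readers without a BaseX snapshot at hand, the canonical equivalence that query exercises (codepoint 233 vs. the sequence 101, 769) can be reproduced with plain Unicode normalization; an illustrative Python sketch:

```python
import unicodedata

e1 = chr(233)                        # U+00E9, precomposed 'é'
e2 = ''.join(map(chr, (101, 769)))   # U+0065 'e' + U+0301 combining acute

# Codepoint-for-codepoint the strings differ...
print(e1 == e2)                                   # False
# ...but they are canonically equivalent: NFC recomposes e2 into U+00E9,
# and NFD decomposes e1 into the two-codepoint sequence.
print(unicodedata.normalize('NFC', e2) == e1)     # True
print(unicodedata.normalize('NFD', e1) == e2)     # True
```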
Re: [basex-talk] More Diacritic Questions
Hi Chris,

I am glad to report that the latest snapshot of BaseX [1] now provides much better support for diacritical characters.

Please find more details in my next mail to Graydon.

Hope this helps,
Christian

[1] http://files.basex.org/releases/latest/
__

On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders wrote:
> Hi Christian --
>
> That is indeed a glorious table! :)
>
> Unicode defines whether or not a character has a decomposition; so
> e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a
> combining acute accent). I think the presence of a decomposition is a
> recoverable character property in Java. (It is in Perl. :)
>
> U+0386, "Greek Capital Alpha With Tonos", has a decomposition, so the
> combining acute accent -- U+0301 again! -- would strip.
>
> If one is going to go all strict-and-high-church Unicode, "diacritic"
> is "anything that decomposes into a combining (that is, non-spacing)
> character code point when considering the decomposed normal form (NFD
> or NFKD in the Unicode spec)". This would NOT convert U+00DF, "latin
> small letter sharp s", into ss, because per the Unicode Consortium,
> sharp s is a full letter, rather than a modified s. (Same with thorn
> not decomposing into th, and so on for other things that are
> considered full letters, which can get surprising in the Scandinavian
> dotted A's and such.) The disadvantage is that users of BaseX might
> expect the compare to work; the advantage is that the arbitrarily large
> number of arguments, headaches, and natural-language edge cases can be
> shifted off to the Unicode guys by saying "we're following the Unicode
> character category rules".
>
> It also gives something that can be pointed to as an explanation and
> works like the existing normalize-unicode function. This is not the
> same as saying it's easy to understand, but it's something.
>
> How you do it efficiently, well, my knowledge of Java would probably
> fit on the bottom of your shoe.
> On the plus side, Java regular
> expressions support the \p{...} Unicode character category syntax, so
> it's got to be in there somewhere. I'd think there's an efficient way
> to load the huge horrible table once, and then filter the characters
> by property -- if a character has a decomposition, you want the
> members of the decomposition; keeping those for which
> Character.isUnicodeIdentifierStart() returns true comes to mind as
> something that might work.
>
> Did that make sense?
>
> -- Graydon
>
> On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün wrote:
>> Hi Graydon,
>>
>> I just had a look. In BaseX, "without diacritics" can be explained by
>> this single, glorious mapping table [1].
>>
>> It's quite obvious that there are just too many cases which are not
>> covered by this mapping. We introduced this solution in the very
>> beginnings of our full-text implementation, and I am just surprised
>> that it survived for such a long time, probably because it was
>> sufficient for most use cases our users came across so far.
>>
>> However, I would like to extend the current solution with something
>> more general and, still, more efficient than full Unicode
>> normalization (performance-wise, the current mapping is probably
>> difficult to beat). As you already indicated, the XQFT spec left it to
>> the implementers to decide what diacritics are.
>>
>>> I'd like to advocate for an equivalent to the "decomposed normal form,
>>> strip the non-spacing modifier characters, recompose to composed
>>> normal form" equivalence, because at least that one is plausibly well
>>> understood.
>>
>> Shame on me; could you give me some quick tutoring on what this would
>> mean? Would accents and dots from German umlauts, and other
>> characters in the range of \C380-\C3BF, be stripped as well by that
>> recomposition?
>> And just in case you know more about it: what happens
>> with characters like the German "ß" that is typically rewritten to two
>> characters ("ss")?
>>
>> Thanks,
>> Christian
>>
>> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
Re: [basex-talk] More Diacritic Questions
Hi Christian --

That is indeed a glorious table! :)

Unicode defines whether or not a character has a decomposition; so e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a combining acute accent). I think the presence of a decomposition is a recoverable character property in Java. (It is in Perl. :)

U+0386, "Greek Capital Alpha With Tonos", has a decomposition, so the combining acute accent -- U+0301 again! -- would strip.

If one is going to go all strict-and-high-church Unicode, "diacritic" is "anything that decomposes into a combining (that is, non-spacing) character code point when considering the decomposed normal form (NFD or NFKD in the Unicode spec)". This would NOT convert U+00DF, "latin small letter sharp s", into ss, because per the Unicode Consortium, sharp s is a full letter, rather than a modified s. (Same with thorn not decomposing into th, and so on for other things that are considered full letters, which can get surprising in the Scandinavian dotted A's and such.) The disadvantage is that users of BaseX might expect the compare to work; the advantage is that the arbitrarily large number of arguments, headaches, and natural-language edge cases can be shifted off to the Unicode guys by saying "we're following the Unicode character category rules".

It also gives something that can be pointed to as an explanation and works like the existing normalize-unicode function. This is not the same as saying it's easy to understand, but it's something.

How you do it efficiently, well, my knowledge of Java would probably fit on the bottom of your shoe. On the plus side, Java regular expressions support the \p{...} Unicode character category syntax, so it's got to be in there somewhere.
I'd think there's an efficient way to load the huge horrible table once, and then filter the characters by property -- if a character has a decomposition, you want the members of the decomposition; keeping those for which Character.isUnicodeIdentifierStart() returns true comes to mind as something that might work.

Did that make sense?

-- Graydon

On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün wrote:
> Hi Graydon,
>
> I just had a look. In BaseX, "without diacritics" can be explained by
> this single, glorious mapping table [1].
>
> It's quite obvious that there are just too many cases which are not
> covered by this mapping. We introduced this solution in the very
> beginnings of our full-text implementation, and I am just surprised
> that it survived for such a long time, probably because it was
> sufficient for most use cases our users came across so far.
>
> However, I would like to extend the current solution with something
> more general and, still, more efficient than full Unicode
> normalization (performance-wise, the current mapping is probably
> difficult to beat). As you already indicated, the XQFT spec left it to
> the implementers to decide what diacritics are.
>
>> I'd like to advocate for an equivalent to the "decomposed normal form,
>> strip the non-spacing modifier characters, recompose to composed
>> normal form" equivalence, because at least that one is plausibly well
>> understood.
>
> Shame on me; could you give me some quick tutoring on what this would
> mean? Would accents and dots from German umlauts, and other
> characters in the range of \C380-\C3BF, be stripped as well by that
> recomposition? And just in case you know more about it: what happens
> with characters like the German "ß" that is typically rewritten to two
> characters ("ss")?
>
> Thanks,
> Christian
>
> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
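[Editorial aside, not part of the original thread.] The "presence of a decomposition" character property Graydon describes is indeed programmatically recoverable; an illustrative Python sketch, where unicodedata.decomposition() returns an empty string for what he calls "full letters":

```python
import unicodedata

# Characters with a decomposition mapping are "modified letters";
# an empty result marks a full letter in Graydon's sense.
print(unicodedata.decomposition('\u00e9'))   # '0065 0301' -- é decomposes
print(unicodedata.decomposition('\u0386'))   # '0391 0301' -- Alpha with tonos decomposes
print(unicodedata.decomposition('\u00f8'))   # ''          -- ø is a full letter
print(unicodedata.decomposition('\u00df'))   # ''          -- ß does not decompose to 'ss'
```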
Re: [basex-talk] More Diacritic Questions
I just found a mapping table proposed by John Cowan [1]. It's already pretty old, so it doesn't cover newer Unicode versions, but it's surely better than our current solution.

[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html

On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün wrote:
> Hi Graydon,
>
> I just had a look. In BaseX, "without diacritics" can be explained by
> this single, glorious mapping table [1].
>
> It's quite obvious that there are just too many cases which are not
> covered by this mapping. We introduced this solution in the very
> beginnings of our full-text implementation, and I am just surprised
> that it survived for such a long time, probably because it was
> sufficient for most use cases our users came across so far.
>
> However, I would like to extend the current solution with something
> more general and, still, more efficient than full Unicode
> normalization (performance-wise, the current mapping is probably
> difficult to beat). As you already indicated, the XQFT spec left it to
> the implementers to decide what diacritics are.
>
>> I'd like to advocate for an equivalent to the "decomposed normal form,
>> strip the non-spacing modifier characters, recompose to composed
>> normal form" equivalence, because at least that one is plausibly well
>> understood.
>
> Shame on me; could you give me some quick tutoring on what this would
> mean? Would accents and dots from German umlauts, and other
> characters in the range of \C380-\C3BF, be stripped as well by that
> recomposition? And just in case you know more about it: what happens
> with characters like the German "ß" that is typically rewritten to two
> characters ("ss")?
>
> Thanks,
> Christian
>
> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
Re: [basex-talk] More Diacritic Questions
Hi Graydon,

I just had a look. In BaseX, "without diacritics" can be explained by this single, glorious mapping table [1].

It's quite obvious that there are just too many cases which are not covered by this mapping. We introduced this solution in the very beginnings of our full-text implementation, and I am just surprised that it survived for such a long time, probably because it was sufficient for most use cases our users came across so far.

However, I would like to extend the current solution with something more general and, still, more efficient than full Unicode normalization (performance-wise, the current mapping is probably difficult to beat). As you already indicated, the XQFT spec left it to the implementers to decide what diacritics are.

> I'd like to advocate for an equivalent to the "decomposed normal form,
> strip the non-spacing modifier characters, recompose to composed
> normal form" equivalence, because at least that one is plausibly well
> understood.

Shame on me; could you give me some quick tutoring on what this would mean? Would accents and dots from German umlauts, and other characters in the range of \C380-\C3BF, be stripped as well by that recomposition? And just in case you know more about it: what happens with characters like the German "ß" that is typically rewritten to two characters ("ss")?

Thanks,
Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
Re: [basex-talk] More Diacritic Questions
What does "without diacritics" mean? If it's equivalent to running

normalize-unicode(replace(normalize-unicode($token,'NFKD'),'\p{Mn}',''),'NFKC')

on the tokens, we shouldn't expect all the diacritics to go away; a character like U+00F8 ("latin small letter o with stroke"), despite the descriptive name, doesn't have a decomposition (= it's a full letter). (Though both of Chris's dotted letters do decompose, so their dots should go away under that scheme.)

The documentation says "If diacritics is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical." without defining diacritics. So far as I can tell from a quick check, the XQFT spec doesn't say what a diacritic is, either; it just says there's a collation defined for the purpose and maybe it can't always cope algorithmically.

I'd like to advocate for an equivalent to the "decomposed normal form, strip the non-spacing modifier characters, recompose to composed normal form" equivalence, because at least that one is plausibly well understood. If that's not it, can we get the collation in the documentation somewhere?

-- Graydon, who hopes that made sense

On Sun, Nov 23, 2014 at 1:29 PM, Chris Yocum wrote:
> Hi Christian,
>
> Thanks for letting me know! I also need ḟ U+1E1F.
>
> All the best,
> Chris
>
> On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
>> Hi Chris,
>>
>> Thanks for the observation. I can confirm that some characters like ṡ
>> (U+1E61) do not seem to be properly normalized yet. I have added an issue
>> for that [1], and I hope I will soon have it fixed.
>>
>> If you encounter some other surprising behavior like this, feel free to tell us.
>>
>> Best,
>> Christian
>>
>> [1] https://github.com/BaseXdb/basex/issues/1029
>>
>>> Hi Everyone,
>>>
>>> I am rather confused again about diacritic handling in BaseX.
>>> For instance, with Full Text turned on, a word like athgabáil will match
>>> both athgabail and athgabáil with "diacritics insensitive", which is
>>> what I would expect. However, if I try to match a word like
>>> cúachṡnaidm (with s with a dot above), it will not match without
>>> "diacritics sensitive" turned on in the query itself. I am rather
>>> confused why it would match in the case of athgabáil but not in the
>>> case of cúachṡnaidm. Does anyone know why this is happening and how
>>> to make the one match like the other?
>>>
>>> All the best,
>>> Chris
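[Editorial aside, not part of the original thread.] The NFKD / strip-non-spacing-marks / NFKC pipeline Graydon spells out above can be sketched in a few lines. This is an illustrative Python version (the function name strip_diacritics is my own, not a BaseX API), showing both the cases from Chris's report and the full-letter exceptions:

```python
import unicodedata

def strip_diacritics(token: str) -> str:
    """Decompose (NFKD), drop non-spacing marks (Mn), recompose (NFKC)."""
    decomposed = unicodedata.normalize('NFKD', token)
    kept = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    return unicodedata.normalize('NFKC', kept)

print(strip_diacritics('athgab\u00e1il'))          # 'athgabail'
print(strip_diacritics('c\u00faach\u1e61naidm'))   # 'cuachsnaidm'
print(strip_diacritics('\u00f8'))                  # 'ø' -- no decomposition, unchanged
print(strip_diacritics('\u00df'))                  # 'ß' -- a full letter, not 'ss'
```

Note that ø and ß survive untouched, which is exactly the behavior Graydon flags as the cost of the strict Unicode definition of "diacritic".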
Re: [basex-talk] More Diacritic Questions
Hi Christian,

Thanks for letting me know! I also need ḟ U+1E1F.

All the best,
Chris

On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
> Hi Chris,
>
> Thanks for the observation. I can confirm that some characters like ṡ
> (U+1E61) do not seem to be properly normalized yet. I have added an issue
> for that [1], and I hope I will soon have it fixed.
>
> If you encounter some other surprising behavior like this, feel free to tell us.
>
> Best,
> Christian
>
> [1] https://github.com/BaseXdb/basex/issues/1029
>
>> Hi Everyone,
>>
>> I am rather confused again about diacritic handling in BaseX. For
>> instance, with Full Text turned on, a word like athgabáil will match
>> both athgabail and athgabáil with "diacritics insensitive", which is
>> what I would expect. However, if I try to match a word like
>> cúachṡnaidm (with s with a dot above), it will not match without
>> "diacritics sensitive" turned on in the query itself. I am rather
>> confused why it would match in the case of athgabáil but not in the
>> case of cúachṡnaidm. Does anyone know why this is happening and how
>> to make the one match like the other?
>>
>> All the best,
>> Chris
Re: [basex-talk] More Diacritic Questions
Hi Chris,

Thanks for the observation. I can confirm that some characters like ṡ (U+1E61) do not seem to be properly normalized yet. I have added an issue for that [1], and I hope I will soon have it fixed.

If you encounter some other surprising behavior like this, feel free to tell us.

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/1029

> Hi Everyone,
>
> I am rather confused again about diacritic handling in BaseX. For
> instance, with Full Text turned on, a word like athgabáil will match
> both athgabail and athgabáil with "diacritics insensitive", which is
> what I would expect. However, if I try to match a word like
> cúachṡnaidm (with s with a dot above), it will not match without
> "diacritics sensitive" turned on in the query itself. I am rather
> confused why it would match in the case of athgabáil but not in the
> case of cúachṡnaidm. Does anyone know why this is happening and how
> to make the one match like the other?
>
> All the best,
> Chris
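[Editorial aside, not part of the original thread.] Both characters reported in this exchange carry U+0307 COMBINING DOT ABOVE in decomposed form, which is exactly the kind of mark a hand-maintained Latin-1-oriented mapping table misses. An illustrative Python check:

```python
import unicodedata

# ṡ (U+1E61) and ḟ (U+1E1F) both decompose into a base letter
# plus U+0307 COMBINING DOT ABOVE.
for ch in ('\u1e61', '\u1e1f'):
    base, mark = unicodedata.normalize('NFD', ch)
    print(f'{ch} -> {base} + U+{ord(mark):04X}')
```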