[basex-talk] More Diacritic Questions
Hi Everyone, I am rather confused again about diacritic handling in BaseX. For instance, with Full Text turned on, a word like athgabáil will match both athgabail and athgabáil with diacritics insensitive, which is what I would expect. However, if I try to match a word like cúachṡnaidm (with an s with a dot above), it will not match unless diacritics sensitive is turned on in the query itself. I am rather confused why it would match in the case of athgabáil but not in the case of cúachṡnaidm. Does anyone know why this is happening and how to make it match like the other? All the best, Chris
Re: [basex-talk] More Diacritic Questions
Hi Chris, Thanks for the observation. I can confirm that some characters like ṡ (U+1E61) do not seem to be properly normalized yet. I have added an issue for that [1], and I hope to have it fixed soon. If you encounter any other surprising behavior like this, feel free to tell us. Best, Christian [1] https://github.com/BaseXdb/basex/issues/1029
Re: [basex-talk] More Diacritic Questions
Hi Christian, Thanks for letting me know! I also need ḟ (U+1E1F). All the best, Chris On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote: [...]
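For anyone curious about the underlying Unicode data: both the á in athgabáil and the ṡ in cúachṡnaidm have canonical decompositions into a base letter plus a combining mark, so a normalization-based matcher would treat them the same way. A quick check, using Python's unicodedata purely for illustration (this is not BaseX code):

```python
import unicodedata

# Each decomposition is "base codepoint + combining mark codepoint":
print(unicodedata.decomposition("á"))  # 0061 0301  (a + combining acute accent)
print(unicodedata.decomposition("ṡ"))  # 0073 0307  (s + combining dot above)
```

Since both characters decompose cleanly, the asymmetry Chris observed points at an incomplete mapping table in the full-text tokenizer rather than at the Unicode data itself.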
[basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform
Hi, I was using an XSLT 2 stylesheet to transform xqdoc XML to Markdown. I already had a stylesheet that did the same with HTML. This stylesheet is XSLT 2 and uses Saxon HE, which is located inside the BaseX lib directory. Saxon HE is picked up just fine with xslt:transform and xslt:transform-text. I used this as a basis for the conversion to Markdown. Then I added an xsl:function called i:code-block and suddenly got this error message: [bxerr:BXSL0001] ERROR: 'The first argument to the non-static Java function 'codeBlock' is not a valid object reference.' FATAL ERROR: 'Could not compile stylesheet' OK, so I changed it to a named template called code-block; then I got an error about the string-join used inside, so I rewrote it to not use this XPath function: [bxerr:BXSL0001] ERROR: 'Invalid conversion from 'com.sun.org.apache.xalan.internal.xsltc.dom.AdaptiveResultTreeImpl' to 'node-set'.' So something is causing a fallback to the Xalan XSLT 1 processor, but I cannot understand why. However, if I remove all functions and named templates, this same stylesheet is compiled and executed fine using Saxon again!? If necessary I can add some more code. --Marc
Re: [basex-talk] Full text score with or
Hi Andy, Thanks, that works for me. I always prefer less complex code :-) so it would be nice if this feature made a return at some point. So it did: in the latest snapshot [1], scores will again be propagated when using and, or, and predicates [2]. Cheers, Christian [1] http://files.basex.org/releases/latest/ [2] http://docs.basex.org/wiki/Full-Text#Scoring
Re: [basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform
Hmm, removing all functions and named templates didn't help either. Something is causing my HTML conversion stylesheet to use Saxon whereas the other stylesheet insists on using Xalan. Here's another error I got when I added a function and called it: [bxerr:BXSL0001] ERROR: 'Cannot find class '1.0'.' FATAL ERROR: 'Could not compile stylesheet' On Sun, Nov 23, 2014 at 10:40 PM, Marc van Grootel marc.van.groo...@gmail.com wrote: [...] --Marc
Re: [basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform
Stop the press. No, really. I probably have to stop now. I was switching BaseX versions and used a version where I hadn't added Saxon yet. The HTML stylesheet I used didn't use any XSLT 2 features. Face palm. Sorry for wasting your time. Can I get Google to forget this? ;-) --Marc
Re: [basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform
Can I get Google to forget this ;-) Never ever, sorry…
Re: [basex-talk] More Diacritic Questions
Hi Graydon, I just had a look. In BaseX, "without diacritics" is currently implemented by a single, glorious mapping table [1]. It's quite obvious that there are just too many cases which are not covered by this mapping. We introduced this solution in the very beginnings of our full-text implementation, and I am just surprised that it survived for such a long time, probably because it was sufficient for most use cases our users have come across so far. However, I would like to replace the current solution with something more general and, still, more efficient than full Unicode normalization (performance-wise, the current mapping is probably difficult to beat). As you already indicated, the XQFT spec leaves it to the implementers to decide what diacritics are. You advocated for an equivalent of "decompose to the decomposed normal form, strip the non-spacing modifier characters, recompose to the composed normal form", because at least that one is plausibly well understood. Shame on me; could you give me some quick tutoring on what this would mean? Would accents and the dots of German umlauts, and other characters in the range \C380-\C3BF, be stripped as well by that recomposition? And just in case you know more about it: what happens with characters like the German ß that is typically rewritten to two characters (ss)? Thanks, Christian [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
Re: [basex-talk] More Diacritic Questions
I just found a mapping table proposed by John Cowan [1]. It's already pretty old, so it doesn't cover newer Unicode versions, but it's surely better than our current solution. [1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün christian.gr...@gmail.com wrote: [...]
Re: [basex-talk] More Diacritic Questions
Hi Christian -- That is indeed a glorious table! :) Unicode defines whether or not a character has a decomposition; so e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a combining acute accent). I think the presence of a decomposition is a recoverable character property in Java. (It is in Perl. :) U+0386, Greek Capital Alpha With Tonos, has a decomposition, so the combining acute accent -- U+0301 again! -- would strip. If one is going to go all strict-and-high-church Unicode, a diacritic is anything that decomposes into a combining (that is, non-spacing) character code point when considering the decomposed normal form (NFD or NFKD in the Unicode spec). This would NOT convert U+00DF, latin small letter sharp s, into ss, because per the Unicode Consortium, sharp s is a full letter rather than a modified s. (Same with thorn not decomposing into th, and so on for other things that are considered full letters, which can get surprising with the Scandinavian dotted A's and such.) The disadvantage is that users of BaseX might expect that compare to work; the advantage is that the arbitrarily large number of arguments, headaches, and natural-language edge cases can be shifted off to the Unicode folks by saying we're following the Unicode character category rules. It also gives something that can be pointed to as an explanation, and it works like the existing normalize-unicode function. This is not the same as saying it's easy to understand, but it's something. How to do it efficiently, well, my knowledge of Java would probably fit on the bottom of your shoe. On the plus side, Java regular expressions support the \p{...} Unicode character category syntax, so it's got to be in there somewhere.
I'd think there's an efficient way to load the huge horrible table once and then filter the characters by property -- if a character has a decomposition, you then want the members of the decomposition that have the Unicode property Character.isUnicodeIdentifierStart() returning true; that comes to mind as something that might work. Did that make sense? -- Graydon On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün christian.gr...@gmail.com wrote: [...]
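Graydon's decompose-strip-recompose rule can be sketched in a few lines. Python's unicodedata is used here purely as an illustration of the Unicode semantics (BaseX itself works on byte arrays, so its implementation would look different):

```python
import unicodedata

def remove_diacritics(s: str) -> str:
    # 1. Decompose to NFD, 2. drop non-spacing marks (category Mn),
    # 3. recompose to NFC.
    nfd = unicodedata.normalize("NFD", s)
    kept = "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

print(remove_diacritics("café"))    # cafe
print(remove_diacritics("Ά"))       # Α  (U+0386 loses its tonos, U+0301)
print(remove_diacritics("straße"))  # straße (ß has no decomposition, untouched)
```

As Graydon notes, this approach leaves ß, thorn, and other "full letters" alone, since they have no canonical decomposition into base letter plus combining mark.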
Re: [basex-talk] More Diacritic Questions
Hi Chris, I am glad to report that the latest snapshot of BaseX [1] now provides much better support for diacritical characters. Please find more details in my next mail to Graydon. Hope this helps, Christian [1] http://files.basex.org/releases/latest/ On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders graydon...@gmail.com wrote: [...]
Re: [basex-talk] More Diacritic Questions
Hi Graydon, Thanks for your detailed reply, much appreciated. For today, I decided to choose a pragmatic solution that provides support for many more cases than before. I have added some more (glorious) mappings motivated by John Cowan's mail, which can now be found in a new class [1]. However, to push things a bit further, I have rewritten the code for removing diacritics. Normalized tokens may now have a different byte length than the original token, as I'm removing combining marks as well (starting from U+0300, and others). As a result, the following query will now yield the expected result (true):

(: U+00E9 vs. U+0065 U+0301 :)
let $e1 := codepoints-to-string(233)
let $e2 := codepoints-to-string((101, 769))
return $e1 contains text { $e2 }

I will have some more thoughts on embracing full Unicode normalization. I fully agree that it makes sense to use standards whenever appropriate. However, one disadvantage for us is that it usually works on String data, whereas most textual data in BaseX is internally represented as byte arrays. One more challenge is that Java's Unicode support is not up-to-date anymore. For example, I am checking diacritical combining marks from Unicode 7.0 (U+1AB0-U+1AFF) that are not detected as such by current versions of Java. To be able to support the new requirements of XQuery 3.1 (see e.g. [2]), we are already working with ICU [3]; it will be loaded dynamically if it's found in the classpath. In the future, we could use it for all of our full-text operations as well, but the optional embedding comes at a price in terms of performance. Looking forward to your feedback on the new snapshot, Christian [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java [2] http://www.w3.org/TR/xpath-functions-31/#uca-collations [3] http://site.icu-project.org/
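The equivalence that XQuery query exercises -- a precomposed character matching its base letter plus combining mark -- can be checked with any Unicode library. A quick illustration of the codepoints involved, again in Python rather than XQuery:

```python
import unicodedata

e1 = chr(0x00E9)                 # é, precomposed
e2 = chr(0x0065) + chr(0x0301)   # e followed by combining acute accent

# The raw strings differ, but they are canonically equivalent:
print(e1 == e2)                                # False: different codepoint sequences
print(unicodedata.normalize("NFC", e2) == e1)  # True: NFC composes e + U+0301 into é
```

This is exactly why a tokenizer that compares raw codepoints byte-for-byte needs either normalization or combining-mark stripping to make the two forms match.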