[basex-talk] More Diacritic Questions

2014-11-23 Thread Chris Yocum
Hi Everyone,

I am rather confused again about diacritic handling in BaseX.  For
instance, with Full Text turned on, a word like athgabáil will match
both athgabail and athgabáil when diacritics are insensitive, which is
what I would expect.  However, if I try to match a word like
cúachṡnaidm (with an s with a dot above), it will not match without
diacritics sensitive turned on in the query itself.  I am rather
confused why it matches in the case of athgabáil but not in the case
of cúachṡnaidm.  Does anyone know why this is happening and how to
make the second word match like the first?

All the best,
Chris


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris,

Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do not seem to be properly normalized yet. I have added an
issue for that [1], and I hope to have it fixed soon.

If you encounter some other surprising behavior like this, feel free to tell us.

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/1029




Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Chris Yocum
Hi Christian,

Thanks for letting me know!  I also need ḟ (U+1E1F).

All the best,
Chris



[basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform

2014-11-23 Thread Marc van Grootel
Hi,

I was using an XSLT 2.0 stylesheet to transform xqdoc XML to Markdown. I
already had a stylesheet that did the same for HTML. That stylesheet
is XSLT 2.0 and uses Saxon-HE, which is located inside the BaseX lib
directory. Saxon-HE is picked up just fine with xslt:transform and
xslt:transform-text. I used it as the basis for the conversion to
Markdown.

Then I added an xsl:function called i:code-block and suddenly got this
error message:

[bxerr:BXSL0001] ERROR:  'The first argument to the non-static Java
function 'codeBlock' is not a valid object reference.'
FATAL ERROR:  'Could not compile stylesheet'

OK, so I changed it to a named template called code-block; then I got
an error about the string-join used inside it. So I rewrote the code to
avoid that XPath function.

  [bxerr:BXSL0001] ERROR:  'Invalid conversion from
  'com.sun.org.apache.xalan.internal.xsltc.dom.AdaptiveResultTreeImpl'
  to 'node-set'.'

So something is causing a fallback to the Xalan XSLT 1.0 processor, but I
cannot understand why. However, if I remove all functions and named
templates, the same stylesheet compiles and executes fine with Saxon
again!?

If necessary I can add some more code.

--Marc


Re: [basex-talk] Full text score with or

2014-11-23 Thread Christian Grün
Hi Andy,

 Thanks, that works for me. I always prefer less complex code :-) so it would
 be nice if this feature made a return at some point.

So it did: in the latest snapshot [1], scores will again be propagated
when using and, or, and predicates [2].

Cheers,
Christian

[1] http://files.basex.org/releases/latest/
[2] http://docs.basex.org/wiki/Full-Text#Scoring


Re: [basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform

2014-11-23 Thread Marc van Grootel
Hmm, removing all functions and named templates didn't help either.
Something is causing my HTML conversion stylesheet to use Saxon
while the other stylesheet insists on using Xalan.

Here's another error I got when I added a function and called it.

  [bxerr:BXSL0001] ERROR:  'Cannot find class '1.0'.'
  FATAL ERROR:  'Could not compile stylesheet'





-- 
--Marc


Re: [basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform

2014-11-23 Thread Marc van Grootel
Stop the press.

No, really. I probably have to stop now. I was switching BaseX
versions and used a version where I hadn't added Saxon yet. The HTML
stylesheet I used didn't use any XSLT 2.0 features.

Face palm. Sorry for wasting your time.

Can I get Google to forget this ;-)

--Marc


Re: [basex-talk] Experiencing unexpected fallback to Xalan with xslt:transform

2014-11-23 Thread Christian Grün
 Can I get Google to forget this ;-)

Never ever, sorry…


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon,

I just had a look. In BaseX, "without diacritics" can be explained by
a single, glorious mapping table [1].

It's quite obvious that there are just too many cases which are not
covered by this mapping. We introduced this solution at the very
beginning of our full-text implementation, and I am just surprised
that it survived for such a long time, probably because it was
sufficient for most use cases our users came across so far.

However, I would like to extend the current solution with something
more general and, still, more efficient than full Unicode
normalization (performance-wise, the current mapping is probably
difficult to beat). As you already indicated, the XQFT spec left it to
the implementers to decide what diacritics are.

 I'd like to advocate for an equivalent to the decomposed normal form,
 strip the non-spacing modifier characters, recompose to composed
 normal form equivalence because at least that one is plausibly well
 understood.

Shame on me; could you give me some quick tutoring on what this would
mean?… Would accents and dots from German umlauts, and other
characters in the UTF-8 byte range C3 80–C3 BF (U+00C0–U+00FF), be
stripped as well by that recomposition? And just in case you know more
about it: what happens with characters like the German ß, which is
typically rewritten to two characters (ss)?

Thanks,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
I just found a mapping table proposed by John Cowan [1]. It's already
pretty old, so it doesn't cover newer Unicode versions, but it's
surely better than our current solution.

[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html
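For illustration, the mapping-table style of diacritic folding looks roughly like this in Java. This is a toy sketch covering only the Latin-1 block; the class name, method name, and table contents are my own invention, not BaseX's actual Token.java code:

```java
public class DiacriticMap {
    // Tiny excerpt of a Latin-1 fold table: index = code point - 0xC0.
    // A real table (like the one in BaseX's Token.java) covers far more
    // characters; non-letters such as U+00D7 and U+00F7 are mapped to
    // placeholder ASCII here.
    private static final char[] LATIN1 = {
        'A','A','A','A','A','A','A','C','E','E','E','E','I','I','I','I',
        'D','N','O','O','O','O','O','x','O','U','U','U','U','Y','T','s',
        'a','a','a','a','a','a','a','c','e','e','e','e','i','i','i','i',
        'd','n','o','o','o','o','o','/','o','u','u','u','u','y','t','y'
    };

    // Fold a single character: table lookup for U+00C0..U+00FF,
    // identity for everything else.
    static char fold(char ch) {
        return ch >= 0xC0 && ch <= 0xFF ? LATIN1[ch - 0xC0] : ch;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (char c : "athgabáil".toCharArray()) sb.append(fold(c));
        System.out.println(sb); // athgabail
    }
}
```

The appeal of this approach is speed: one array lookup per character, no decomposition pass. The drawback is exactly the one discussed in this thread: every character outside the table (such as ṡ, U+1E61) silently keeps its diacritic.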




Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Graydon Saunders
Hi Christian --

That is indeed a glorious table! :)

Unicode defines whether or not a character has a decomposition; so
e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a
combining acute accent). I think the presence of a decomposition is a
recoverable character property in Java. (It is in Perl. :)

U+0386, Greek Capital Letter Alpha With Tonos, has a decomposition, so the
combining acute accent -- U+0301 again! -- would strip.
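As a sketch (not BaseX code), this property can indeed be recovered in plain Java by checking whether NFD normalization changes a code point; the class and method names below are made up for illustration:

```java
import java.text.Normalizer;

public class HasDecomposition {
    // A code point has a canonical decomposition iff applying NFD
    // (canonical decomposition) changes its string form.
    static boolean hasDecomposition(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(hasDecomposition(0x00E9)); // é  -> true (e + U+0301)
        System.out.println(hasDecomposition(0x0386)); // Ά  -> true (Α + U+0301)
        System.out.println(hasDecomposition(0x00DF)); // ß  -> false (a full letter)
    }
}
```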

If one is going to go all strict-and-high-church Unicode, a "diacritic"
is anything that decomposes into a combining (that is, non-spacing)
character code point when considering the decomposed normal form (NFD
or NFKD in the Unicode spec). This would NOT convert U+00DF, latin
small letter sharp s, into "ss", because per the Unicode Consortium,
sharp s is a full letter rather than a modified s. (Same with thorn
not decomposing into "th", and so on for other things that are
considered full letters, which can get surprising with the Scandinavian
dotted A's and such.) The disadvantage is that users of BaseX might
expect those comparisons to work; the advantage is that the arbitrarily
large number of arguments, headaches, and natural-language edge cases
can be shifted off to the Unicode people by saying we're following the
Unicode character category rules.

It also gives you something that can be pointed to as an explanation,
and it works like the existing normalize-unicode functions. This is not
the same as saying it's easy to understand, but it's something.

As for how you do it efficiently, well, my knowledge of Java would
probably fit on the bottom of your shoe. On the plus side, Java regular
expressions support the \p{...} Unicode character category syntax, so
it's got to be in there somewhere. I'd think there's an efficient way
to load the huge horrible table once and then filter the characters
by property -- if a character has a decomposition, you then want the
members of the decomposition that are full letters; something like
keeping the code points for which Character.isUnicodeIdentifierStart()
returns true comes to mind as an approach that might work.
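A minimal sketch of the decompose/strip/recompose idea described above, using only java.text.Normalizer and the \p{Mn} (non-spacing mark) category. This is illustrative only; the names are invented and this is not BaseX's implementation:

```java
import java.text.Normalizer;

public class StripDiacritics {
    // Decompose to NFD, drop non-spacing combining marks (category Mn,
    // e.g. U+0301 combining acute, U+0307 combining dot above),
    // then recompose to NFC.
    static String stripDiacritics(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = nfd.replaceAll("\\p{Mn}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("athgabáil"));   // athgabail
        System.out.println(stripDiacritics("cúachṡnaidm")); // cuachsnaidm
        System.out.println(stripDiacritics("ß"));           // ß (no decomposition)
    }
}
```

Note that, as argued above, ß survives untouched: it has no canonical decomposition, so a strictly Unicode-based fold never turns it into "ss".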

Did that make sense?

-- Graydon




Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris,

I am glad to report that the latest snapshot of BaseX [1] now provides
much better support for diacritical characters.

Please find more details in my next mail to Graydon.

Hope this helps,
Christian

[1] http://files.basex.org/releases/latest/



Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon,

Thanks for your detailed reply, much appreciated.

For today, I decided to choose a pragmatic solution that provides
support for many more cases than before. I have added some more
(glorious) mappings, motivated by John Cowan's mail, which can now be
found in a new class [1].

However, to push things a bit further, I have rewritten the code for
removing diacritics. Normalized tokens may now have a different byte
length than the original token, as I am removing combining marks as
well (starting from U+0300, among others).

As a result, the following query will now yield the expected result (true):

  (: U+00E9 vs. U+0065 U+0301 :)
  let $e1 := codepoints-to-string(233)
  let $e2 := codepoints-to-string((101, 769))
  return $e1 contains text { $e2 }
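The canonical equivalence this query relies on can also be checked outside BaseX with java.text.Normalizer (a sketch; the class and method names are mine, not BaseX's):

```java
import java.text.Normalizer;

public class CanonicalEquivalence {
    // Two strings are canonically equivalent iff they agree after
    // normalizing both to the same form (NFC here).
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // é as a single code point
        String decomposed = "e\u0301";  // e + combining acute accent
        // Raw comparison sees two different code-point sequences...
        System.out.println(composed.equals(decomposed));            // false
        // ...but after canonical normalization they are identical.
        System.out.println(canonicallyEqual(composed, decomposed)); // true
    }
}
```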

I will give some more thought to embracing full Unicode
normalization. I fully agree that it makes sense to use standards
whenever appropriate. However, one disadvantage for us is that it
usually works on String data, whereas most textual data in BaseX is
internally represented as byte arrays. One more challenge is that
Java's Unicode support is not up-to-date anymore. For example, I am
checking diacritical combining marks from Unicode 7.0 (U+1AB0–U+1AFF)
that are not detected as such by current versions of Java.

To be able to support the new requirements of XQuery 3.1 (see e.g.
[2]), we are already working with ICU [3]; it will be loaded
dynamically if it is found in the classpath. In the future, we could
use it for all of our full-text operations as well, but the optional
embedding comes at a price in terms of performance.

Looking forward to your feedback on the new snapshot,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java
[2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
[3] http://site.icu-project.org/