Re: [basex-talk] More Diacritic Questions

2014-11-30 Thread Christian Grün
Hi Graydon,

> So I would expect that, with a full text search that ignores
> diacritics, I'd get four hits.

By adding some collation hints to one of the standard string
functions, the comparison will succeed:

  fn:compare('≮','<','?lang=en;strength=primary')

In the example, I used the BaseX notation for collations [1] (it is
similar to the notation in Saxon or eXist; in future, more and more
people will probably switch to the newly introduced UCA collation).
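
For illustration, the same hints can also be declared once as the
default collation of a query. A minimal sketch, assuming the full
collation URI documented in [1]:

  declare default collation
    'http://basex.org/collation?lang=en;strength=primary';
  compare('≮', '<')
  (: expected to yield 0, i.e. "equal", at primary strength :)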

> I don't think it's clear that "text" in "full text" means "groups of
> letters".

I agree. Once again, the XQFT spec does not dictate what a "token" in
a full-text search is. Currently, we only have two tokenizers: one for
Western languages and another one for Japanese (which gets along
without whitespace). When we initially implemented the XQFT features
some years ago, our major use case was searching a library catalog
(comprising metadata on approx. 2 million titles).

Best,
Christian

[1] http://docs.basex.org/wiki/Full-Text#Collations


Re: [basex-talk] More Diacritic Questions

2014-11-30 Thread Graydon Saunders
Hi Christian --

On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün
 wrote:
> Hi Graydon,
>
>> //text()[contains(.,'<')]
>>
>> gives me three hits.
>>
>> I think there "should" be four against the relevant bit of XML
>> with full-text search, since with no diacritics, U+226E should match.
>
> So you would expect this node to be returned as well?
>
>≮
>
> For this, you'll probably have to call normalize-unicode first:
>
>   //text()[contains(normalize-unicode(., 'NFD'),'<')]

With the plain contains() query, I agree I should only get three hits.

My expectation for "full text search" is that it searches the contents
of text nodes.  (Since I'm not sure there's a coherent way to describe
"text" in XML that isn't "contents of text nodes".)

So I would expect that, with a full text search that ignores
diacritics, I'd get four hits.

>> for $x in //text()
>> where $x contains text { "<" }
>> return $x
>>
>> gives me nothing, presumably on the grounds that < isn't a letter.
>
> Exactly. With "contains text", only letters can be found. It would
> generally be possible to write a tokenizer that also returns other
> characters as tokens, but there has been no use for that until now
> (and it would generate many new questions with regard to normalization,
> with and without ICU).

Entirely understood that the tokenizer only recognizes letters.

I don't think it's clear that "text" in "full text" means "groups of
letters".  Anything that isn't letters is inherently somewhat of an
edge case, but it's not too hard to imagine text with equations and
strange effects from operators with a decomposable Unicode
representation.
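
A quick sketch of what I mean (the not-equal sign behaves just like
U+226E):

  (: U+2260 (≠) decomposes under NFD into '=' (U+003D) plus the
     combining long solidus overlay (U+0338) :)
  string-to-codepoints(normalize-unicode(codepoints-to-string(8800), 'NFD'))
  (: returns (61, 824) :)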

[snip]
> If you want to play around with our current ICU support, feel free to
> download the latest snapshot, add ICU to the classpath, and use the
> new XQuery 3.1 UCA collation. The new fn:collation-key() function is
> still work in progress, but all other collation features should
> already be available when using the XQuery default string functions.

That's very interesting; thank you!

I shall see about taking a poke at that, and maybe trying to produce
some performance numbers.

Thanks!
Graydon


Re: [basex-talk] More Diacritic Questions

2014-11-29 Thread Christian Grün
Hi Graydon,

> //text()[contains(.,'<')]
>
> gives me three hits.
>
> I think there "should" be four against the relevant bit of XML
> with full-text search, since with no diacritics, U+226E should match.

So you would expect this node to be returned as well?

   ≮

For this, you'll probably have to call normalize-unicode first:

  //text()[contains(normalize-unicode(., 'NFD'),'<')]
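
As a small illustration of why the normalization helps (U+226E
decomposes under NFD into U+003C plus the combining long solidus
overlay U+0338):

  let $not-less := codepoints-to-string(8814)  (: ≮, U+226E :)
  return contains(normalize-unicode($not-less, 'NFD'), '<')
  (: true :)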


> for $x in //text()
> where $x contains text { "<" }
> return $x
>
> gives me nothing, presumably on the grounds that < isn't a letter.

Exactly. With "contains text", only letters can be found. It would
generally be possible to write a tokenizer that also returns other
characters as tokens, but there has been no use for that until now
(and it would generate many new questions with regard to normalization,
with and without ICU).


> (I can probably still generate that table for you if you like.)

I think that for now we will stick with the existing tokenization. If
it turns out that we need more power, we could think about optionally
providing support for ICU as well. However, I'll be glad to have your
feedback if you find examples that are currently not, but should be,
covered by our diacritics normalization mapping.

If you want to play around with our current ICU support, feel free to
download the latest snapshot, add ICU to the classpath, and use the
new XQuery 3.1 UCA collation. The new fn:collation-key() function is
still work in progress, but all other collation features should
already be available when using the XQuery default string functions.

Thanks for your feedback,
Christian


Re: [basex-talk] More Diacritic Questions

2014-11-29 Thread Graydon Saunders
Hi Christian --

After various adventures re-learning Perl's encoding management
quirks, I generated a simple XML file of all the codepoints between
0x20 and 0xD7FF; this isn't complete for XML but I thought it would be
enough to be interesting.
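
For the record, a rough XQuery equivalent of what the Perl script
produces would look something like this (element and attribute names
made up here, not the ones in my actual file):

  <chars>{
    for $cp in (32 to 55295)  (: 0x20 through 0xD7FF :)
    return <char dec="{ $cp }">{ codepoints-to-string($cp) }</char>
  }</chars>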

If I load that file into the current BaseX dev version
(BaseX80-20141128.214728.zip) using the GUI, and *do* turn on Full
Text indexing and *do not* turn on diacritics,

//text()[contains(.,'<')]

gives me three hits.


  U+003C
  <
  <
  U+003C


  U+226E
  ≮
  <
  U+003C


I think there "should" be four against the relevant bit of XML
with full-text search, since with no diacritics, U+226E should match.
(U+226E's ability to decompose into a less-than sign is one of my very
favourite surprises involved in stripping diacritics.  What do you
mean the document stopped being well-formed...?)

How I get the full-text search to confirm this is not obvious:

for $x in //text()
where $x contains text { "A" }
return $x

happily gives me 101 results, case- and diacritic-insensitive;

for $x in //text()
where $x contains text { "<" }
return $x

gives me nothing, presumably on the grounds that < isn't a letter.

I suspect ICU is the way to go; having to keep an all-Unicode table up
to date involves more suffering than anyone should willingly
undertake.  (I can probably still generate that table for you if you
like.)

-- Graydon

On Sun, Nov 23, 2014 at 8:42 PM, Christian Grün
 wrote:
> Hi Graydon,
>
> Thanks for your detailed reply, much appreciated.
>
> For today, I decided to choose a pragmatic solution that provides
> support for much more cases than before. I have added some more
> (glorious) mappings motivated by John Cowan's mail, which can now be
> found in a new class [1].
>
> However, to push things a bit further, I have rewritten the code for
> removing diacritics. Normalized tokens may now have a different byte
> length than the original token, as I'm removing combining marks as
> well (starting from 0300, and others).
>
> As a result, the following query will now yield the expected result (true):
>
>   (: U+00E9 vs. U+0065 U+0301 :)
>   let $e1 := codepoints-to-string(233)
>   let $e2 := codepoints-to-string((101, 769))
>   return $e1 contains text { $e2 }
>
> I will have some more thoughts on embracing the full Unicode
> normalization. I fully agree that it makes sense to use standards
> whenever appropriate. However, one disadvantage for us is that it
> usually works on String data, whereas most textual data in BaseX is
> internally represented in byte arrays. One more challenge is that
> Java's Unicode support is not up-to-date anymore. For example, I am
> checking diacritical combining marks from Unicode 7.0 that are not
> detected as such by current versions of Java (1AB0–1AFF).
>
> To be able to support the new requirements of XQuery 3.1 (see e.g.
> [2]), we are already working with ICU [3]; it will be requested
> dynamically if it's found in the classpath. In future, we could use it
> for all of our full-text operations as well, but the optional
> embedding comes at a price in terms of performance.
>
> Looking forward to your feedback on the new snapshot,
> Christian
>
> [1] 
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java
> [2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
> [3] http://site.icu-project.org/


Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christopher Yocum
Thanks.  I will give it a spin on my test machine first.  Darn, I will be
on holiday to Prague around that time but not at the actual conference.

Chris

On Mon, Nov 24, 2014 at 11:15 AM, Christian Grün 
wrote:

> Hi Chris,
>
> > Great.  Thank you for handling this so quickly.  When is the next version
> > due out?  I hesitate to run snapshots as my users are rather vocal when
> > things don't work right.
>
> Our snapshots are usually very stable, so you should not worry too
> much. The next official release is planned in alignment with the
> XML Prague conference (Feb 13).
>
> Best,
> Christian
>


Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christian Grün
Hi Chris,

> Great.  Thank you for handling this so quickly.  When is the next version
> due out?  I hesitate to run snapshots as my users are rather vocal when
> things don't work right.

Our snapshots are usually very stable, so you should not worry too
much. The next official release is planned in alignment with the
XML Prague conference (Feb 13).

Best,
Christian


Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christopher Yocum
Hi Christian,

Great.  Thank you for handling this so quickly.  When is the next version
due out?  I hesitate to run snapshots as my users are rather vocal when
things don't work right.

All the best,
Chris

On Mon, Nov 24, 2014 at 1:13 AM, Christian Grün 
wrote:

> Hi Chris,
>
> I am glad to report that the latest snapshot of BaseX [1] now provides
> much better support for diacritical characters.
>
> Please find more details in my next mail to Graydon.
>
> Hope this helps,
> Christian
>
> [1] http://files.basex.org/releases/latest/
> __
>
> On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders 
> wrote:
> > Hi Christian --
> >
> > That is indeed a glorious table! :)
> >
> > Unicode defines whether or not a character has a decomposition; so
> > e-with-acute, U+00E9, decomposes into U+0065 + U+0301  (an e and a
> > combining acute accent.)  I think the presence of a decomposition is a
> > recoverable character property in Java.  (it is in Perl. :)
> >
> > U+0386, "Greek Capital Alpha With Tonos", has a decomposition, so the
> > combining acute accent -- U+0301 again! -- would strip.
> >
> > If one is going to go all strict-and-high-church Unicode, "diacritic"
> > is "anything that decomposes into a combining (that is, non-spacing)
> > character code point when considering the decomposed normal form (NFD
> > or NFKD in the Unicode spec)".  This would NOT convert U+00DF, "latin
> > small letter sharp s", into ss, because per the Unicode Consortium,
> > sharp s is a full letter, rather than a modified s.  (Same with thorn
> > not decomposing into th, and so on for other things that are
> > considered full letters, which can get surprising in the Scandinavian
> > dotted A's and such.)  The disadvantage is that users of BaseX might
> > expect the compare to work; the advantage is that the arbitrarily large
> > number of arguments, headaches, and natural-language edge cases can be
> > shifted off to the Unicode guys by saying "we're following the Unicode
> > character category rules".
> >
> > It also gives something that can be pointed to as an explanation and
> > works like the existing normalize-unicode function.  This is not the
> > same as saying it's easy to understand but it's something.
> >
> > How you do it efficiently, well, my knowledge of Java would probably
> > fit on the bottom of your shoe.  On the plus side, Java regular
> > expressions support the \p{...} Unicode character category syntax so
> > it's got to be in there somewhere.  I'd think there's an efficient way
> > to load the huge horrible table once, and then filter the characters
> > by property -- if this character has a decomposition, you then
> > want the members of the decomposition that have the right Unicode
> > property; Character.isUnicodeIdentifierStart() returning true comes
> > to mind as something that might work.
> >
> > Did that make sense?
> >
> > -- Graydon
> >
> >
> > On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün
> >  wrote:
> >> Hi Graydon,
> >>
> >> I just had a look. In BaseX, "without diacritics" can be explained by
> >> this single, glorious mapping table [1].
> >>
> >> It's quite obvious that there are just too many cases which are not
> >> covered by this mapping. We introduced this solution in the very
> >> beginnings of our full-text implementation, and I am just surprised
> >> that it survived for such a long time, probably because it was
> >> sufficient for most use cases our users came across so far.
> >>
> >> However, I would like to extend the current solution with something
> >> more general and, still, more efficient than full Unicode
> >> normalizations (performance-wise, the current mapping is probably
> >> difficult to beat). As you already indicated, the XQFT spec left it to
> >> the implementers to decide what diacritics are.
> >>
> >>> I'd like to advocate for an equivalent to the "decomposed normal form,
> >>> strip the non-spacing modifier characters, recompose to composed
> >>> normal form" equivalence because at least that one is plausibly well
> >>> understood.
> >>
> >> Shame on me; could you give me some quick tutoring on what this would
> >> mean?… Would accents and dots from German umlauts, and other
> >> characters in the range of \C380-\C3BF, be stripped as well by that
> >> recomposition? And just in case you know more about it: What happens
> >> with characters like the German "ß" that is typically rewritten to two
> >> characters ("ss")?
> >>
> >> Thanks,
> >> Christian
> >>
> >> [1]
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420
>


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon,

Thanks for your detailed reply, much appreciated.

For today, I decided to choose a pragmatic solution that provides
support for much more cases than before. I have added some more
(glorious) mappings motivated by John Cowan's mail, which can now be
found in a new class [1].

However, to push things a bit further, I have rewritten the code for
removing diacritics. Normalized tokens may now have a different byte
length than the original token, as I'm removing combining marks as
well (starting from 0300, and others).

As a result, the following query will now yield the expected result (true):

  (: U+00E9 vs. U+0065 U+0301 :)
  let $e1 := codepoints-to-string(233)
  let $e2 := codepoints-to-string((101, 769))
  return $e1 contains text { $e2 }
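
The use case from Chris's first mail should therefore work as well now
(a quick sketch; full-text matching is diacritics-insensitive by
default):

  'athgabáil'   contains text { 'athgabail' },
  'cúachṡnaidm' contains text { 'cuachsnaidm' }
  (: both expected to yield true with the new mapping :)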

I will have some more thoughts on embracing the full Unicode
normalization. I fully agree that it makes sense to use standards
whenever appropriate. However, one disadvantage for us is that it
usually works on String data, whereas most textual data in BaseX is
internally represented in byte arrays. One more challenge is that
Java's Unicode support is not up-to-date anymore. For example, I am
checking diacritical combining marks from Unicode 7.0 that are not
detected as such by current versions of Java (1AB0–1AFF).

To be able to support the new requirements of XQuery 3.1 (see e.g.
[2]), we are already working with ICU [3]; it will be requested
dynamically if it's found in the classpath. In future, we could use it
for all of our full-text operations as well, but the optional
embedding comes at a price in terms of performance.

Looking forward to your feedback on the new snapshot,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/FTToken.java
[2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
[3] http://site.icu-project.org/


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris,

I am glad to report that the latest snapshot of BaseX [1] now provides
much better support for diacritical characters.

Please find more details in my next mail to Graydon.

Hope this helps,
Christian

[1] http://files.basex.org/releases/latest/
__

On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders  wrote:
> Hi Christian --
>
> That is indeed a glorious table! :)
>
> Unicode defines whether or not a character has a decomposition; so
> e-with-acute, U+00E9, decomposes into U+0065 + U+0301  (an e and a
> combining acute accent.)  I think the presence of a decomposition is a
> recoverable character property in Java.  (it is in Perl. :)
>
> U+0386, "Greek Capital Alpha With Tonos", has a decomposition, so the
> combining acute accent -- U+0301 again! -- would strip.
>
> If one is going to go all strict-and-high-church Unicode, "diacritic"
> is "anything that decomposes into a combining (that is, non-spacing)
> character code point when considering the decomposed normal form (NFD
> or NFKD in the Unicode spec)".  This would NOT convert U+00DF, "latin
> small letter sharp s", into ss, because per the Unicode Consortium,
> sharp s is a full letter, rather than a modified s.  (Same with thorn
> not decomposing into th, and so on for other things that are
> considered full letters, which can get surprising in the Scandinavian
> dotted A's and such.)  The disadvantage is that users of BaseX might
> expect the compare to work; the advantage is that the arbitrarily large
> number of arguments, headaches, and natural-language edge cases can be
> shifted off to the Unicode guys by saying "we're following the Unicode
> character category rules".
>
> It also gives something that can be pointed to as an explanation and
> works like the existing normalize-unicode function.  This is not the
> same as saying it's easy to understand but it's something.
>
> How you do it efficiently, well, my knowledge of Java would probably
> fit on the bottom of your shoe.  On the plus side, Java regular
> expressions support the \p{...} Unicode character category syntax so
> it's got to be in there somewhere.  I'd think there's an efficient way
> to load the huge horrible table once, and then filter the characters
> by property -- if this character has a decomposition, you then
> want the members of the decomposition that have the right Unicode
> property; Character.isUnicodeIdentifierStart() returning true comes
> to mind as something that might work.
>
> Did that make sense?
>
> -- Graydon
>
>
> On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün
>  wrote:
>> Hi Graydon,
>>
>> I just had a look. In BaseX, "without diacritics" can be explained by
>> this single, glorious mapping table [1].
>>
>> It's quite obvious that there are just too many cases which are not
>> covered by this mapping. We introduced this solution in the very
>> beginnings of our full-text implementation, and I am just surprised
>> that it survived for such a long time, probably because it was
>> sufficient for most use cases our users came across so far.
>>
>> However, I would like to extend the current solution with something
>> more general and, still, more efficient than full Unicode
>> normalizations (performance-wise, the current mapping is probably
>> difficult to beat). As you already indicated, the XQFT spec left it to
>> the implementers to decide what diacritics are.
>>
>>> I'd like to advocate for an equivalent to the "decomposed normal form,
>>> strip the non-spacing modifier characters, recompose to composed
>>> normal form" equivalence because at least that one is plausibly well
>>> understood.
>>
>> Shame on me; could you give me some quick tutoring on what this would
>> mean?… Would accents and dots from German umlauts, and other
>> characters in the range of \C380-\C3BF, be stripped as well by that
>> recomposition? And just in case you know more about it: What happens
>> with characters like the German "ß" that is typically rewritten to two
>> characters ("ss")?
>>
>> Thanks,
>> Christian
>>
>> [1] 
>> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Graydon Saunders
Hi Christian --

That is indeed a glorious table! :)

Unicode defines whether or not a character has a decomposition; so
e-with-acute, U+00E9, decomposes into U+0065 + U+0301  (an e and a
combining acute accent.)  I think the presence of a decomposition is a
recoverable character property in Java.  (it is in Perl. :)

U+0386, "Greek Capital Alpha With Tonos", has a decomposition, so the
combining acute accent -- U+0301 again! -- would strip.

If one is going to go all strict-and-high-church Unicode, "diacritic"
is "anything that decomposes into a combining (that is, non-spacing)
character code point when considering the decomposed normal form (NFD
or NFKD in the Unicode spec)".  This would NOT convert U+00DF, "latin
small letter sharp s", into ss, because per the Unicode Consortium,
sharp s is a full letter, rather than a modified s.  (Same with thorn
not decomposing into th, and so on for other things that are
considered full letters, which can get surprising in the Scandinavian
dotted A's and such.)  The disadvantage is that users of BaseX might
expect the compare to work; the advantage is that the arbitrarily large
number of arguments, headaches, and natural-language edge cases can be
shifted off to the Unicode guys by saying "we're following the Unicode
character category rules".

It also gives something that can be pointed to as an explanation and
works like the existing normalize-unicode function.  This is not the
same as saying it's easy to understand but it's something.

How you do it efficiently, well, my knowledge of Java would probably
fit on the bottom of your shoe.  On the plus side, Java regular
expressions support the \p{...} Unicode character category syntax so
it's got to be in there somewhere.  I'd think there's an efficient way
to load the huge horrible table once, and then filter the characters
by property -- if this character has a decomposition, you then
want the members of the decomposition that have the right Unicode
property; Character.isUnicodeIdentifierStart() returning true comes
to mind as something that might work.

Did that make sense?

-- Graydon


On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün
 wrote:
> Hi Graydon,
>
> I just had a look. In BaseX, "without diacritics" can be explained by
> this single, glorious mapping table [1].
>
> It's quite obvious that there are just too many cases which are not
> covered by this mapping. We introduced this solution in the very
> beginnings of our full-text implementation, and I am just surprised
> that it survived for such a long time, probably because it was
> sufficient for most use cases our users came across so far.
>
> However, I would like to extend the current solution with something
> more general and, still, more efficient than full Unicode
> normalizations (performance-wise, the current mapping is probably
> difficult to beat). As you already indicated, the XQFT spec left it to
> the implementers to decide what diacritics are.
>
>> I'd like to advocate for an equivalent to the "decomposed normal form,
>> strip the non-spacing modifier characters, recompose to composed
>> normal form" equivalence because at least that one is plausibly well
>> understood.
>
> Shame on me; could you give me some quick tutoring on what this would
> mean?… Would accents and dots from German umlauts, and other
> characters in the range of \C380-\C3BF, be stripped as well by that
> recomposition? And just in case you know more about it: What happens
> with characters like the German "ß" that is typically rewritten to two
> characters ("ss")?
>
> Thanks,
> Christian
>
> [1] 
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
I just found a mapping table proposed by John Cowan [1]. It's already
pretty old, so it doesn't cover newer Unicode versions, but it's
surely better than our current solution.

[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html


On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün
 wrote:
> Hi Graydon,
>
> I just had a look. In BaseX, "without diacritics" can be explained by
> this single, glorious mapping table [1].
>
> It's quite obvious that there are just too many cases which are not
> covered by this mapping. We introduced this solution in the very
> beginnings of our full-text implementation, and I am just surprised
> that it survived for such a long time, probably because it was
> sufficient for most use cases our users came across so far.
>
> However, I would like to extend the current solution with something
> more general and, still, more efficient than full Unicode
> normalizations (performance-wise, the current mapping is probably
> difficult to beat). As you already indicated, the XQFT spec left it to
> the implementers to decide what diacritics are.
>
>> I'd like to advocate for an equivalent to the "decomposed normal form,
>> strip the non-spacing modifier characters, recompose to composed
>> normal form" equivalence because at least that one is plausibly well
>> understood.
>
> Shame on me; could you give me some quick tutoring on what this would
> mean?… Would accents and dots from German umlauts, and other
> characters in the range of \C380-\C3BF, be stripped as well by that
> recomposition? And just in case you know more about it: What happens
> with characters like the German "ß" that is typically rewritten to two
> characters ("ss")?
>
> Thanks,
> Christian
>
> [1] 
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon,

I just had a look. In BaseX, "without diacritics" can be explained by
this single, glorious mapping table [1].

It's quite obvious that there are just too many cases which are not
covered by this mapping. We introduced this solution in the very
beginnings of our full-text implementation, and I am just surprised
that it survived for such a long time, probably because it was
sufficient for most use cases our users came across so far.

However, I would like to extend the current solution with something
more general and, still, more efficient than full Unicode
normalizations (performance-wise, the current mapping is probably
difficult to beat). As you already indicated, the XQFT spec left it to
the implementers to decide what diacritics are.

> I'd like to advocate for an equivalent to the "decomposed normal form,
> strip the non-spacing modifier characters, recompose to composed
> normal form" equivalence because at least that one is plausibly well
> understood.

Shame on me; could you give me some quick tutoring on what this would
mean?… Would accents and dots from German umlauts, and other
characters in the range of \C380-\C3BF, be stripped as well by that
recomposition? And just in case you know more about it: What happens
with characters like the German "ß" that is typically rewritten to two
characters ("ss")?

Thanks,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Graydon Saunders
What does "without diacritics" mean?

If it's equivalent to running

normalize-unicode(replace(normalize-unicode($token,'NFKD'),'\p{Mn}',''),'NFKC')

on the tokens, we shouldn't expect all the diacritics to go away; a case
like U+00F8 ("latin small letter o with stroke"), despite the
descriptive name, doesn't have a decomposition (= it's a full letter).
(Though both Chris' dotted letters do decompose so their dots should
go away under that scheme.)
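
Applying that expression to a few of the characters from this thread
shows the distinction (a sketch; the results follow from the
decompositions discussed here):

  for $token in ('é', 'ø', 'ṡ', 'ß')
  return normalize-unicode(
    replace(normalize-unicode($token, 'NFKD'), '\p{Mn}', ''), 'NFKC')
  (: yields "e", "ø", "s", "ß" -- ø and ß survive because they have no
     decomposition :)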

The documentation says "If diacritics is insensitive, characters with
and without diacritics (umlauts, characters with accents) are declared
as identical."  without defining diacritics. So far as I can tell from
a quick check, the XFTQ spec doesn't say what a diacritic is, either,
it just says there's a collation defined for the purpose and maybe it
can't always cope algorithmically.

I'd like to advocate for an equivalent to the "decomposed normal form,
strip the non-spacing modifier characters, recompose to composed
normal form" equivalence because at least that one is plausibly well
understood.  If that's not it, can we get the collation in the
documentation somewhere?

-- Graydon, who hopes that made sense

On Sun, Nov 23, 2014 at 1:29 PM, Chris Yocum  wrote:
>
> Hi Christian,
>
> Thanks for letting me know!  I also need ḟ U+1E1F.
>
> All the best,
> Chris
>
> On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
>> Hi Chris,
>>
>> Thanks for the observation. I can confirm that some characters like ṡ
>> (U+1E61) do not seem to be properly normalized yet. I have added an issue
>> for that [1], and I hope I will soon have it fixed.
>>
>> If you encounter some other surprising behavior like this, feel free to tell 
>> us.
>>
>> Best,
>> Christian
>>
>> [1] https://github.com/BaseXdb/basex/issues/1029
>>
>>
>> > Hi Everyone,
>> >
>> > I am rather confused again about diacritic handling in BaseX.  For
>> > instance, with Full Text turned on a word like athgabáil will match
>> > both athgabail and athgabáil with "diacritics insensitive" which is
>> > what I would expect.  However, if I try to match a word like
>> > cúachṡnaidm (with s with a dot above), it will not match without
>> > "diacritics sensitive" turned on in the query itself.  I am rather
>> > confused why it would match in the case of athgabáil but not in the
>> > case of cúachṡnaidm.  Does anyone know why this is happening and how
>> > to make it match like the other match?
>> >
>> > All the best,
>> > Chris


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Chris Yocum

Hi Christian,

Thanks for letting me know!  I also need ḟ U+1E1F.

All the best,
Chris

On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
> Hi Chris,
> 
> Thanks for the observation. I can confirm that some characters like ṡ
> (U+1E61) do not seem to be properly normalized yet. I have added an issue
> for that [1], and I hope I will soon have it fixed.
> 
> If you encounter some other surprising behavior like this, feel free to tell 
> us.
> 
> Best,
> Christian
> 
> [1] https://github.com/BaseXdb/basex/issues/1029
> 
> 
> > Hi Everyone,
> >
> > I am rather confused again about diacritic handling in BaseX.  For
> > instance, with Full Text turned on a word like athgabáil will match
> > both athgabail and athgabáil with "diacritics insensitive" which is
> > what I would expect.  However, if I try to match a word like
> > cúachṡnaidm (with s with a dot above), it will not match without
> > "diacritics sensitive" turned on in the query itself.  I am rather
> > confused why it would match in the case of athgabáil but not in the
> > case of cúachṡnaidm.  Does anyone know why this is happening and how
> > to make it match like the other match?
> >
> > All the best,
> > Chris


Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris,

Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do not seem to be properly normalized yet. I have added an issue
for that [1], and I hope I will soon have it fixed.
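
For reference, the report boils down to this (a sketch; full-text
matching is diacritics-insensitive by default):

  'athgabáil'   contains text { 'athgabail' },   (: true :)
  'cúachṡnaidm' contains text { 'cuachsnaidm' }  (: currently false, as ṡ is not mapped :)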

If you encounter some other surprising behavior like this, feel free to tell us.

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/1029


> Hi Everyone,
>
> I am rather confused again about diacritic handling in BaseX.  For
> instance, with Full Text turned on a word like athgabáil will match
> both athgabail and athgabáil with "diacritics insensitive" which is
> what I would expect.  However, if I try to match a word like
> cúachṡnaidm (with s with a dot above), it will not match without
> "diacritics sensitive" turned on in the query itself.  I am rather
> confused why it would match in the case of athgabáil but not in the
> case of cúachṡnaidm.  Does anyone know why this is happening and how
> to make it match like the other match?
>
> All the best,
> Chris