Re: [sword-devel] Normalize the search string (comparing front-end apps)

2018-03-22 Thread DM Smith
Re case sensitivity, it was just a very simple example of the principle.

If it doesn’t find them all, then the search request and the index were not 
normalized the same way. NFC and NFD are different normalizations.
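
A minimal sketch of that mismatch, using Python's standard `unicodedata` module purely for illustration (no front-end necessarily implements it this way):

```python
# NFC and NFD encode the same text differently, so a naive string
# comparison fails unless both sides are normalized to the same form.
import unicodedata

nfc = unicodedata.normalize("NFC", "Efra\u00edm")  # í as one precomposed code point
nfd = unicodedata.normalize("NFD", "Efra\u00edm")  # i followed by U+0301

assert len(nfc) == 6 and len(nfd) == 7  # the counts from the example below
assert nfc != nfd                       # naive comparison: no match
# Normalizing both sides to one form (either form) restores the match:
assert unicodedata.normalize("NFD", nfc) == nfd
```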

Note, stripping diacritics may be an appropriate normalization.
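
As a sketch of what such a normalization could look like (Python's `unicodedata`; the function name is hypothetical and this is not code from SWORD or JSword):

```python
# One common approach: decompose to NFD, then drop combining marks,
# so that accented and unaccented spellings index and search alike.
import unicodedata

def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert strip_diacritics("Efraím") == "Efraim"
```

Applied to both the index and the search request, this would let a user who types plain "Efraim" find the accented occurrences as well.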

JSword doesn’t properly handle Unicode either. 

— DM Smith
From my phone. Brief. Weird autocorrections. 

> On Mar 22, 2018, at 9:30 AM, David Haslam wrote:
> 
> Thanks, DM.
> 
> My question was not about case-sensitivity, but about Unicode normalization.
> The main issue is composition vs decomposition and the canonical ordering of 
> diacritics in each glyph.
> 
> e.g. Suppose the module contains 181 instances of the name "Efraím" which has 
> 6 characters.
> Suppose a user enters in the search box instead "E f r a i ́ m" - (NB, remove 
> the spaces!)
> That's 7 characters when normalized to NFD, the acute accent now being a 
> separate character (U+0301 COMBINING ACUTE ACCENT).
> 
> In each front-end, will the search function find all 181 instances (as 
> Eloquent does)?
> Or (as with Xiphos) will it find none?
> 
> DM, what does BibleDesktop do here?
> 
> Best regards,
> 
> David
> 
> PS. ProtonMail converts automatically to NFC even though the text was keyed 
> in as NFD, hence the above kludge with spaces.
> 
> Sent with ProtonMail Secure Email.
> 
> ‐‐‐ Original Message ‐‐‐
>> On 22 March 2018 1:14 PM, DM Smith wrote:
>> 
>> It doesn’t matter that a search doesn’t use Lucene. The principle is the 
>> same. The search request has to be normalized to the same form as the 
>> searched text. For example, a case-insensitive search normalizes both to a 
>> single case. If it isn’t done, even on the fly, then search will fail at 
>> times. As they say, “even a blind squirrel gets a nut sometimes.”
>> 
>> Regarding Lucene, there are multiple different analyzers (that’s what does 
>> the normalization in Lucene). Each normalizes differently. Each has its own 
>> documentation. The analyzer that SWORD uses was developed for, and is 
>> suited to, English texts. It is not appropriate for non-Latin texts. There 
>> is a much better multi-language analyzer, ICUAnalyzer, which follows UAX 
>> #29 for tokenization. For details see 
>> https://issues.apache.org/jira/browse/LUCENE-1488. You’ll note that I 
>> participate in its development.
>> 
>> The osis2mod proclivity for NFC is for display.
>> 
>> DM
>> 
>>> On Mar 22, 2018, at 8:19 AM, David Haslam wrote:
>>> 
>>> Thanks DM,
>>> 
>>> Not all searches make use of the Lucene index!
>>> 
>>> e.g. In Xiphos, the advanced search panel gives the user a choice of 
>>> search type; Lucene is only one of these mutually exclusive options.
>>> 
>>> btw. Where is it documented that the creation of a Lucene search index 
>>> normalizes the Unicode for the index?
>>> Do we know for certain that this would occur irrespective of whether 
>>> normalization was suppressed during module build?
>>> i.e. with the osis2mod option -N (“do not convert UTF-8 or normalize 
>>> UTF-8 to NFC”).
>>> 
>>> Best regards,
>>> 
>>> David
>>> 
>>> Sent with ProtonMail Secure Email.
>>> 
>>> ‐‐‐ Original Message ‐‐‐
>>>> On 22 March 2018 10:20 AM, DM Smith wrote:
>>>> 
>>>> The requirement is not that the search is normalized to NFC but rather 
>>>> that it is normalized the same as the index. This should not be a front 
>>>> end issue.
>>>> 
>>>> Btw, it doesn’t matter how Hebrew is stored in the module. Indexing 
>>>> should normalize it to a form that is internal to the engine. 
>>>> 
>>>> — DM Smith
>>>> From my phone. Brief. Weird autocorrections. 
>>>> 
>>>>> On Mar 22, 2018, at 5:22 AM, David Haslam wrote:
>>>>> Dear all,
>>>>> 
>>>>> Not all front-ends automatically normalize the search string to Unicode 
>>>>> NFC.
>>>>> e.g.
>>>>> Eloquent does
>>>>> Xiphos does not
>>>>> The data for this feature is incomplete in the table on our wiki page:
>>>>> https://wiki.crosswire.org/Choosing_a_SWORD_program#Search_and_Dictionary
>>>>> 
>>>>> Please would other front-end app developers supply the missing 
>>>>> information. Thanks.
>>>>> 
>>>>> Further thought:
>>>>> For front-ends that also have an Advanced search feature, would it not 
>>>>> be a useful enhancement to have a tick-box option for search-string 
>>>>> normalization?
>>>>> Then if we do make any Biblical Hebrew modules with custom 
>>>>> normalization, search could at least still work for the "corner cases" 
>>>>> in Hebrew, provided the user gave the proper input in the search box.
>>>>> 
>>>>> cf. The source text for the WLC at tanach.us is not normalized to NFC, 
>>>>> but our module is.
>>>>> I'll refrain from going into a lot more detail here. There's an issue in 
>>>>> our tracker that covers this.
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> David
>>>>> 
>>>>> Sent with ProtonMail Secure Email.
> 

___
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page