Re: [Koha] How to make the Koha/Zebra search ignore hyphens?

dcook Wed, 25 Sep 2019 18:52:32 -0700

Hi Michael,

Your experience suggests to me that you're using Zebra 2.0.59 (which is the 
package available in the official Debian repositories). That version has a bug 
that truncates a search term at the hyphen, so that "Sinti-Swing" becomes 
"Sinti". If you use the Indexdata Debian repository and upgrade to the latest 
version of Zebra (or any version higher than 2.0.59, such as 2.0.60), you 
shouldn't have that problem anymore. (Debian doesn't have an active maintainer 
for idzebra-2.0 if I recall correctly, so it's never going to be fixed in 
Debian unless someone new steps forward. I've thought about doing it, but I 
have enough responsibilities already, and the workaround here is fairly 
trivial. That said, we as a community should probably do more with the Koha 
instructions to warn about this problem...)


I warned in my original email that you will also have to modify words-icu.xml 
and phrases-icu.xml to get the behaviour you want. You'll want to add a 
"transliterate" or "transform" rule before the "tokenize" rule to remove the 
hyphens. I don't know the exact rule you'll need, so you'll have to experiment 
a bit. You can read more about that at 
https://software.indexdata.com/yaz/doc/yaz-icu.html. 

If you upgrade your Zebra and modify your ICU chain files, I think you should 
be able to achieve the behaviour you want. 

Take the time to fully read the documentation at 
https://software.indexdata.com/yaz/doc/yaz-icu.html, as you can use yaz-icu to 
test the ICU configuration directly without having to reindex Zebra every time. 
Note that you may have to install yaz-icu as I don't know that it's installed 
by default on Debian when you install idzebra. (I rarely use Debian/Ubuntu for 
Koha, so my exact experiences can be a bit different.)

Actually, I'm going to do a little test myself.

Standard words-icu.xml:
echo "Sinti-Swing" | yaz-icu -c words-icu.xml
1 1 'sinti' 'Sinti'
2 1 'swing' 'Swing'
See that there are two separate tokens there. 

Using the following words-icu.xml with a transform rule before the tokenize 
rule:
<icu_chain locale="">
  <transliterate rule="{ œ > oe "/>
  <transliterate rule="{ Œ > oe "/>
  <transliterate rule="{ æ > ae "/>
  <transliterate rule="{ Æ > ae "/>
  <transliterate rule="\'>\ "/>
  <transliterate rule="\u2019>\ "/>
  <transliterate rule="\u02BC>\ "/>
  <transliterate rule="[:Number:] { '-' > '' "/>
  <!-- Remove control characters except \t\n\r -->
  <transform rule="[\x00-\x08\x0B\x0C\x0E-\x1F\x7F] Any-Remove"/>
  <transform rule="[-] Any-Remove"/>
  <tokenize rule="l"/>
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <transform rule="NFD"/>
  <transform rule="[:Nonspacing Mark:] Remove"/>
  <transform rule="NFC"/>
  <display/>
  <casemap rule="l"/>
</icu_chain>

echo "Sinti-Swing" | yaz-icu -c words-icu.xml
1 1 'sintiswing' 'SintiSwing'
echo "Sintiswing" | yaz-icu -c words-icu.xml
1 1 'sintiswing' 'Sintiswing'

Now you can see that there's just 1 token. 
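
If you want to repeat that comparison for a whole list of word pairs, a small 
script can do it. The following is only a sketch in Python, not something 
that ships with Koha or YAZ. It assumes yaz-icu is on the PATH, that it reads 
from standard input as in the commands above, and that every output line has 
the shape shown above (two numbers, then the normalized token and the display 
form in single quotes). The function names are made up for illustration.

```python
import re
import subprocess

# Line shape assumed from the sample output in this message:
#   <number> <number> '<normalized token>' '<display form>'
TOKEN_LINE = re.compile(r"^(\d+)\s+(\d+)\s+'(.*)'\s+'(.*)'$")


def parse_tokens(output):
    """Return the normalized token from each matching output line."""
    tokens = []
    for line in output.splitlines():
        match = TOKEN_LINE.match(line.strip())
        if match:
            tokens.append(match.group(3))
    return tokens


def run_yaz_icu(text, config="words-icu.xml"):
    """Pipe text into yaz-icu, like `echo text | yaz-icu -c config`."""
    result = subprocess.run(
        ["yaz-icu", "-c", config],
        input=text + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tokens(result.stdout)


def same_tokens(a, b, tokenize=run_yaz_icu):
    """True if both spellings normalize to the same list of tokens."""
    return tokenize(a) == tokenize(b)
```

Going by the output above, same_tokens("Sinti-Swing", "Sintiswing") should 
come back True with the modified words-icu.xml and False with the standard 
one, but check it on your own system.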

If I were you, I'd experiment a bit, as I naively wrote that transform rule 
without thinking too much. There might be cases where you don't want to remove 
the hyphen. For example, a French search for "Mont-Royal" might want it to be 
normalized as "mont royal" and tokenized into "mont" and "royal", so that 
keyword searches for "mont" or "royal" will still match the record. 
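
To make that trade-off concrete, here is a rough Python sketch of the two 
strategies. It is not how Zebra or ICU work internally. It only lowercases 
and splits on whitespace, and both function names are made up for 
illustration.

```python
def tokens_hyphen_removed(text):
    """Drop hyphens before splitting: 'Mont-Royal' becomes one token."""
    return text.replace("-", "").lower().split()


def tokens_hyphen_as_break(text):
    """Treat hyphens as word breaks: 'Mont-Royal' becomes two tokens."""
    return text.replace("-", " ").lower().split()


for word in ("Sinti-Swing", "Mont-Royal"):
    print(word, tokens_hyphen_removed(word), tokens_hyphen_as_break(word))
```

With the hyphen removed, "Mont-Royal" only produces the token "montroyal", so 
a keyword search for "royal" has nothing to match. With the hyphen as a word 
break, "royal" matches, but "Sintiswing" no longer matches "Sinti-Swing".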

Note that the transliterate rules are very powerful. For example, you could 
replace that transform rule I added with one of the following:
<transliterate rule="([a-zA-Z]+) { '-' } ([a-zA-Z]+) > '' " />
<transliterate rule="([a-zA-Z]+)'-'([a-zA-Z]+) > $1$2" />

echo "Sinti-Swing" | yaz-icu -c words-icu.xml
1 1 'sintiswing' 'SintiSwing'

Take a look at <transliterate rule="[:Number:] { '-' > '' "/> which already 
exists to remove hyphens when they follow a number. 
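
If it helps to reason about what a context-sensitive rule like that does, 
here is an approximation in Python. Python regular expressions are not ICU 
transliteration rules, so treat this as a sketch of the idea (remove a hyphen 
only when it sits between two ASCII letters) and not as a prediction of what 
yaz-icu will output. The function name is made up.

```python
import re

# A hyphen with an ASCII letter directly before and after it.
# Hyphens next to digits, spaces or non-ASCII letters are left alone.
LETTER_HYPHEN = re.compile(r"(?<=[a-zA-Z])-(?=[a-zA-Z])")


def strip_letter_hyphens(text):
    """Remove hyphens that join two ASCII letters."""
    return LETTER_HYPHEN.sub("", text)


for sample in ("Sinti-Swing", "Tee-Ei", "2019-09", "Caf\u00e9-Bar"):
    print(sample, "->", strip_letter_hyphens(sample))
```

Note the last sample: in this approximation [a-zA-Z] does not cover accented 
letters or umlauts, so a hyphen right after one stays in place. It is worth 
testing words like that with yaz-icu before relying on an ASCII-only class.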

What I'm trying to say is that the ICU rules are very powerful, but you have to 
be careful with how you use them. While it's trivial to fix the Sinti-Swing 
example, creating that "fix" might actually "break" something else. I think it 
comes down to trade-offs, and that's something that you'll have to think about 
as you're configuring your ICU rules.

Remember that this file is used both at index time *and* search time (as far as 
I know). Rules that might make sense at index time might not make sense at 
search time. I'm not familiar with hyphen usage in German, so I wouldn't really 
know what would make sense. 
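
Here is a toy illustration in Python of why the index and the search have to 
agree, and why you need to reindex after changing the rules. It is a plain 
dictionary-based index, nothing like Zebra internally. The two normalize 
functions, the AND combination of query tokens, and the sample records are 
all assumptions made for the example.

```python
def normalize_old(text):
    """Hyphen acts as a word break."""
    return text.replace("-", " ").lower().split()


def normalize_new(text):
    """Hyphen is removed before tokenizing."""
    return text.replace("-", "").lower().split()


def build_index(records, normalize):
    """Map each token to the set of record ids that contain it."""
    index = {}
    for record_id, title in records.items():
        for token in normalize(title):
            index.setdefault(token, set()).add(record_id)
    return index


def search(index, query, normalize):
    """Return the records that contain every token of the query."""
    hits = None
    for token in normalize(query):
        found = index.get(token, set())
        hits = found if hits is None else hits & found
    return hits or set()


records = {1: "Sinti-Swing", 2: "Sintiswing"}
stale = build_index(records, normalize_old)  # built before the rule change
fresh = build_index(records, normalize_new)  # rebuilt after the rule change

print(search(stale, "Sinti-Swing", normalize_new))
print(search(fresh, "Sinti-Swing", normalize_new))
```

Against the stale index, the new search rules only find record 2, because 
record 1 was indexed as "sinti" and "swing". Against the rebuilt index, both 
records are found.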

Anyway, I hope that's more helpful!

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----
From: Michael Kuhn <m...@adminkuhn.ch> 
Sent: Thursday, 26 September 2019 4:59 AM
To: dc...@prosentient.com.au; koha@lists.katipo.co.nz
Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?

Hi David

 > I'm glad that I got you a bit further on your journey. It's a shame about
 > having to use the CHR indexing. You can find more information here at
 > https://software.indexdata.com/zebra/doc/character-map-files.html.
 >
 > After reading through that, I'm thinking perhaps that CHR indexing can't
 > help you.

Thanks for your assessment!

 > You could ask Indexdata for more information, but I'm guessing it can't
 > be done with CHR. It should be doable with ICU though.

So I tried to change the Koha standard CHR to ICU according to 
https://wiki.koha-community.org/wiki/ICU_chains_configuration, just using the 
original configuration of "words-icu.xml" and "phrases-icu.xml", then 
restarting Zebra and reindexing. But I'm getting a very unexpected result: Now 
a catalog search

* for "Sintiswing" shows 1 hit

* for "Sinti-Swing" shows 4'222 hits, the hyphen seems to be ignored completely 
and everything is found that contains either "Sinti" OR "Swing" or both

* for "Sinti Swing" shows 18 hits, the hyphen is used as a breaking character, 
so any record containing "Sinti-Swing" or "Sinti" AND "Swing"
  is found, but not "Sintiswing"

In short: The Koha standard configuration of ICU ("words-icu.xml" and
"phrases-icu.xml") seems defective to me. The results are much worse than what 
CHR gives. And of course the desired result isn't there yet anyway.

Do you maybe have a hint where to find some documentation about how to change 
the behaviour of ICU indexing in the desired way?

Best wishes: Michael
--
Managing Director · Certified Librarian BBS, IT Specialist with Federal Certificate
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Switzerland
T 0041 (0)61 261 55 61 · E m...@adminkuhn.ch · W www.adminkuhn.ch





On 25.09.19 at 08:34, dc...@prosentient.com.au wrote:
> Hi Michael,
> 
> I'm glad that I got you a bit further on your journey. It's a shame about 
> having to use the CHR indexing. You can find more information here at 
> https://software.indexdata.com/zebra/doc/character-map-files.html.
> 
> After reading through that, I'm thinking perhaps that CHR indexing can't help 
> you.
> 
> You could ask Indexdata for more information, but I'm guessing it can't be 
> done with CHR. It should be doable with ICU though.
> 
> David Cook
> Systems Librarian
> Prosentient Systems
> 72/330 Wattle St
> Ultimo, NSW 2007
> Australia
> 
> Office: 02 9212 0899
> Direct: 02 8005 0595
> 
> -----Original Message-----
> From: Michael Kuhn <m...@adminkuhn.ch>
> Sent: Wednesday, 25 September 2019 4:47 AM
> To: dc...@prosentient.com.au; koha@lists.katipo.co.nz
> Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?
> 
> Hi David
> 
> Many thanks for your reply and the hints!
> 
> After a standard installation of Koha 18.11 the CHR indexing is used, thus 
> the configuration is done in file "word-phrase-utf.chr".
> 
> A catalog search
> * for "Sintiswing" shows 1 hit
> * for "Sinti-Swing" shows 18 hits, the hyphen is used as a breaking 
> character, so any record containing "Sinti-Swing" or "Sinti" and "Swing"
> is found, but not "Sintiswing"
> 
> I changed the following line, omitting the hyphen (between comma and dot):
> 
> space
> {\001-\040}!"#$%&'\()*+,./:;<=>?@\[\\]^_`\{|}~’{\x88-\x89}{\x98-\x9C}¡¿«»
> 
> After a Zebra reindexing a catalog search
> * for "Sintiswing" shows 1 hit
> * for "Sinti-Swing" now shows only 8 hits, the hyphen is no longer used as a 
> breaking character, so any record containing "Sinti Swing" or "Sinti-Swing" 
> is found, but not "Sintiswing"
> 
> I also tried to add "map (-) @" but this leads to the original results.
> 
> In short: My change of configuration didn't lead to the desired result... If 
> searching for "Sintiswing" also "Sinti-Swing" should be found, and vice 
> versa. This is not the case.
> 
> Since I couldn't find any documentation about CHR indexing - does anyone know 
> where to find out more about the CHR way of indexing?
> 
> Best wishes: Michael
> --
> Managing Director · Certified Librarian BBS, IT Specialist with Federal Certificate
> Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Switzerland
> T 0041 (0)61 261 55 61 · E m...@adminkuhn.ch · W www.adminkuhn.ch
> 
> 
> 
> On 19.09.19 at 03:29, dc...@prosentient.com.au wrote:
>> Hi Michael,
>>
>> That's really interesting. I assume that you're using ICU indexing?
>>
>> You could update "phrases-icu.xml" and "words-icu.xml" to strip out hyphens. 
>> You would need to re-index all your records afterwards though.
>>
>> I haven't actually tested that particular change, but just taking a little 
>> look with both ICU and CHR and it looks like hyphens are used to tokenize. 
>> Currently, when you search "Tee-Ei", you're actually searching for "Tee" and 
>> "Ei".
>>
>> If you're using ICU, you could add a transform rule before the tokenize rule 
>> to remove the hyphen. This would prevent it from tokenizing and then 
>> "Tee-Ei" and "Teeei" should retrieve the same records.
>>
>> Beware also that this is a universal change. You might want to check to see 
>> if there are hyphens that shouldn't be removed. If so, you may need to make 
>> a more complex rule to try to just capture the desired cases.
>>
>> If you're using CHR, you can take a look at word-phrase-utf.chr and remove - 
>> from the "Breaking characters" section. You may or may not also need to map 
>> it. I'm less familiar with CHR indexing.
>>
>> Anyway, I hope that helps.
>>
>> David Cook
>> Systems Librarian
>> Prosentient Systems
>> 72/330 Wattle St
>> Ultimo, NSW 2007
>> Australia
>>
>> Office: 02 9212 0899
>> Direct: 02 8005 0595
>>
>> -----Original Message-----
>>
>> Date: Wed, 18 Sep 2019 22:46:15 +0200
>> From:        
>> To: "Koha : access" <koha@lists.katipo.co.nz>
>> Subject: [Koha] How to make the Koha/Zebra search ignore hyphens?
>> Message-ID: <5b63f3b4-76c1-c1f8-f35a-6a33e3b0a...@adminkuhn.ch>
>> Content-Type: text/plain; charset=utf-8; format=flowed
>>
>> Hi
>>
>> We have found that, at least in German, there are words or combinations
>> of words that can be written in different ways, where both are correct
>> and mean the same, e.g.
>>
>> * Ultraschallmessgerät = Ultraschall-Messgerät
>> * Sintiswing = Sinti-Swing
>> * Teeei = Tee-Ei
>> * Haftpflichtversicherungsgesellschaft =
>> Haftpflicht-Versicherungsgesellschaft
>>
>> This is a general concept in German, so it makes no sense to add a "used
>> for/see from:" in the authority data. Anyway, such words can exist
>> everywhere in the bibliographic record, not only in fields linked to
>> authority fields.
>>
>> Now the question: is there a way to teach Koha (or Zebra) to look
>> for the second term also when the first term is searched, and vice
>> versa? Or shorter: Just to ignore the hyphens? Using the standard
>> configuration Koha will not find the second term if the first one is
>> searched, and vice versa.
>>
>> We would appreciate any hint or tip!
>>
>> Best wishes: Michael
>>
> 
> 
>


_______________________________________________
Koha mailing list  http://koha-community.org
Koha@lists.katipo.co.nz
https://lists.katipo.co.nz/mailman/listinfo/koha
