Best practice - preparing search term for Lucene

Hrvoje Lončar Thu, 22 Sep 2022 09:37:48 -0700

Hi!

I'm using Hibernate Search / Lucene to index my entities in a Spring Boot
application.


One thing I'm not sure about is how to handle Croatian-specific letters.
The Croatian alphabet has a few additional letters: "*č* *Č* *ć* *Ć* *đ*
*Đ* *š* *Š* *ž* *Ž*".
The letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ*" when Croatian
letters are not available.

In my custom Hibernate bridge there is a step that replaces all Croatian
characters with their ASCII equivalents, so "*č*" becomes "*c*", "*š*"
becomes "*s*" and so on.
Later, when a user enters search text, the same replacement is applied to
the query so that it matches the values in the index.
This has one more benefit: some older users, who started using computers
back when Croatian letters were not available, type words without Croatian
letters, automatically replacing "*č*" with "*c*". That fits my logic and
still gives them good search results.

For example, the title of my entity is: "*juha s češnjakom i đumbirom*".
My custom Hibernate String bridge converts it to "*juha cesnjakom
dumbirom*".
Then the user enters "*juha s češnjakom*".
Before issuing the search, the same conversion is applied to the user's
query, and the text sent to Lucene is "*juha cesnjakom*".
This is how I implemented it, and it works fine.
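
To make the conversion concrete, here is a minimal sketch in plain Java of
what such a step could look like. It is not the actual bridge code: the
class name, the method names and the stop-word list ("s", "i", "u") are
assumptions made for illustration, and lowercasing is added so that index
and query text compare equal. The letters "*đ* *Đ*" are mapped by hand
because, unlike the other Croatian letters, they have no Unicode
decomposition into a base letter plus accent. They are written as Unicode
escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Hibernate Search or Lucene.
class SearchTextNormalizer {

    // Assumed stop words; the real list depends on the analyzer in use.
    private static final Set<String> STOP_WORDS = Set.of("s", "i", "u");

    // Replaces Croatian letters with their ASCII base letters.
    static String foldToAscii(String text) {
        // U+0111 / U+0110 (d with stroke) do not decompose under NFD,
        // so they are replaced explicitly before normalizing.
        String mapped = text.replace('\u0111', 'd').replace('\u0110', 'D');
        // NFD splits letters with a caron or acute into letter + combining mark.
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters.
        return decomposed.replaceAll("\\p{M}", "");
    }

    // The same steps are applied to indexed values and to user queries.
    static String normalize(String text) {
        List<String> kept = new ArrayList<>();
        String folded = foldToAscii(text).toLowerCase(Locale.ROOT);
        for (String token : folded.split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```

Under these assumptions, calling `normalize` on the title above gives
"juha cesnjakom dumbirom" and calling it on the query gives
"juha cesnjakom", so both sides end up in the same ASCII form.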

The other way would be to index the original text, then find the words
containing Croatian characters, convert them to ASCII and add them
alongside the originals.
The title "*juha s češnjakom i đumbirom*" would become "*juha češnjakom
đumbirom cesnjakom dumbirom*".
In that case there is no need to convert the user's search terms, because
both "*juha s češnjakom*" and "*juha s cesnjakom*" would return the same
result.
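
A rough sketch of this alternative, again in plain Java and under the same
assumptions as before (hypothetical class and method names, an assumed
stop-word list, lowercasing added): each keyword is kept as-is, and an
ASCII variant is appended only for keywords that actually change. As far
as I know, Lucene's ASCIIFoldingFilter has a preserveOriginal option that
may achieve something similar at the analyzer level; the code below only
illustrates the idea and does not use it.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Hibernate Search or Lucene.
class ParallelKeywordExpander {

    // Assumed stop words; the real list depends on the analyzer in use.
    private static final Set<String> STOP_WORDS = Set.of("s", "i", "u");

    // Replaces Croatian letters with their ASCII base letters.
    static String foldToAscii(String text) {
        // U+0111 / U+0110 (d with stroke) do not decompose under NFD.
        String mapped = text.replace('\u0111', 'd').replace('\u0110', 'D');
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Returns the original keywords followed by the ASCII variants
    // of those keywords that contain Croatian letters.
    static String expand(String text) {
        List<String> originals = new ArrayList<>();
        List<String> asciiVariants = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            originals.add(token);
            String ascii = foldToAscii(token);
            if (!ascii.equals(token)) {
                asciiVariants.add(ascii);
            }
        }
        originals.addAll(asciiVariants);
        return String.join(" ", originals);
    }
}
```

With these assumptions, `expand` turns the title above into
"juha češnjakom đumbirom cesnjakom dumbirom". The indexed text grows by
one extra keyword for every word that contains a Croatian letter.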

My question is:
Is there any reason to switch to this alternative logic and have original
keywords indexed in parallel with those converted to ASCII?

Thanks!

BR,
Hrvoje
