The problem with WhitespaceTokenizer is that it splits only on whitespace. If you have text like "This is, was some test." then you get tokens like "is," and "test.", including the punctuation.

This is the reason why StandardTokenizer is normally used for human-readable text. WhitespaceTokenizer is normally only used for special stuff like token lists (e.g., tags) or unique identifiers, ...
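To make the difference concrete, here is a minimal sketch that dumps the tokens both tokenizers produce. It assumes a recent Lucene version; on old versions such as 4.4 the tokenizer constructors additionally take a Version and/or Reader argument:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {

  // Print each token the given tokenizer emits for the given text.
  static void dump(Tokenizer tokenizer, String text) throws IOException {
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.print("[" + term + "] ");
    }
    tokenizer.end();
    tokenizer.close();
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    String text = "This is, was some test.";
    dump(new WhitespaceTokenizer(), text); // [This] [is,] [was] [some] [test.]
    dump(new StandardTokenizer(), text);   // [This] [is] [was] [some] [test]
  }
}
```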

As a quick workaround that still keeps the %, you can add a CharFilter like MappingCharFilter before the Tokenizer that replaces the "%" char by something else which is not stripped off. As this is applied during both indexing and searching, it does not hurt you. How about a "percent emoji"? :-)
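A minimal sketch of that idea, again assuming a recent Lucene version (imports and constructors differ on 4.4). The placeholder "_pct_" is my arbitrary choice, not anything standard; StandardTokenizer treats the underscore as a word character, so "50%" survives as the single token "50_pct_":

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PercentKeepingAnalyzer extends Analyzer {

  // Replace '%' before tokenization; "_pct_" is an arbitrary placeholder.
  private static final NormalizeCharMap PERCENT_MAP;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("%", "_pct_");
    PERCENT_MAP = builder.build();
  }

  // CharFilters are installed here, in front of the tokenizer.
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new MappingCharFilter(PERCENT_MAP, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, result);
  }
}
```

Because the same Analyzer runs at index and at query time, the query 50% is rewritten to 50_pct_ as well, so it matches exactly the docs that contained 50% and not the bare 50.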

Another common "workaround" is shown in some Solr default configurations typically used for product search: those use WhitespaceTokenizer, followed by WordDelimiterFilter. WDF then splits and re-joins tokens on intra-word delimiters and handles stuff like product numbers correctly. Through its character-type table you can possibly make sure that "%" survives.
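In Solr schema terms, such a chain might look like the sketch below; the field type name, the file name wdfftypes.txt, and the flag values are illustrative, not copied from any shipped configuration:

```xml
<fieldType name="text_products" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- wdfftypes.txt remaps the character type of '%', e.g. with the
         single line:   % => ALPHA
         so it is treated like a letter instead of a delimiter.
         splitOnNumerics="0" then keeps "50%" together as one token,
         while '-' still splits "40-50%" into "40" and "50%". -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="0" types="wdfftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```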

Uwe

On 20.09.2023 at 22:42, Amitesh Kumar wrote:
Thanks Mikhail!

I have tried all the other tokenizers from Lucene 4.4. In the case of
WhitespaceTokenizer, it loses romanization of special chars like '-' etc.


On Wed, Sep 20, 2023 at 16:39 Mikhail Khludnev <m...@apache.org> wrote:

Hello,
Check the whitespace tokenizer.

On Wed, Sep 20, 2023 at 7:46 PM Amitesh Kumar <amiteshk...@gmail.com> wrote:

Hi,

I am facing a requirement change to get the % sign retained in searches, e.g.:

Sample search docs:
1. Number of boys 50
2. My score was 50%
3. 40-50% for pass score

Search query: 50%
Expected results: Doc-2 and Doc-3, i.e.
1. My score was 50%
2. 40-50% for pass score

Actual result: all 3 documents (because the tokenizer strips off the %
during both indexing and searching, and hence matches every doc containing 50).

On the implementation front, I am using a set of filters like
LowerCaseFilter, EnglishPossessiveFilter, etc. in addition to the base
tokenizer StandardTokenizer.

Per my analysis, StandardTokenizer strips off the % sign, hence the
behavior. Has someone faced a similar requirement? Any help/guidance is
highly appreciated.


--
Sincerely yours
Mikhail Khludnev

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

