Re: How to retain % sign next to number during tokenization

2023-09-21 Thread Amitesh Kumar
Thank you! I will give it a try and share my findings with you all.

Regards
Amitesh

On Thu, Sep 21, 2023 at 08:18 Uwe Schindler  wrote:



Re: How to retain % sign next to number during tokenization

2023-09-21 Thread Uwe Schindler
The problem with WhitespaceTokenizer is that it splits only on 
whitespace. If you have text like "This is, was some test." then you get 
tokens like "is," and "test.", including the punctuation.
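This behavior can be seen with a plain whitespace split in Java, which mirrors what WhitespaceTokenizer does (a dependency-free sketch, not the Lucene class itself):

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceSplitDemo {
    public static void main(String[] args) {
        // Splitting only on whitespace leaves punctuation attached to the
        // surrounding tokens, just like WhitespaceTokenizer.
        String text = "This is, was some test.";
        List<String> tokens = Arrays.asList(text.split("\\s+"));
        System.out.println(tokens); // [This, is,, was, some, test.]
    }
}
```

A query for "is" would then fail to match the indexed token "is,", which is why this tokenizer is a poor fit for ordinary prose.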


This is the reason why StandardTokenizer is normally used for human 
readable text. WhitespaceTokenizer is normally only used for special 
stuff like token lists (like tags) or unique identifiers, ...


As a quick workaround while still keeping the %, you can add a CharFilter 
like MappingCharFilter before the Tokenizer that replaces the "%" char 
with something else which is not stripped off. As this is done for both 
indexing and searching, this does not hurt you. How about a "percent 
emoji"? :-)
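Sketched without Lucene, the trick is a text rewrite applied before tokenization, on both the index and the query side; with the real classes you would register the same mapping in a NormalizeCharMap and wrap the reader in a MappingCharFilter. The replacement string "pct" below is just an illustrative choice:

```java
import java.util.Arrays;
import java.util.List;

public class PercentMappingDemo {
    // Emulates a CharFilter: rewrite "%" before the tokenizer sees the text.
    static String mapPercent(String text) {
        return text.replace("%", "pct");
    }

    // Emulates StandardTokenizer-style splitting on non-alphanumeric runs.
    static List<String> tokenize(String text) {
        return Arrays.asList(mapPercent(text).toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("My score was 50%"));  // [my, score, was, 50pct]
        System.out.println(tokenize("Number of boys 50")); // [number, of, boys, 50]
        System.out.println(tokenize("50%"));               // query side: [50pct]
    }
}
```

Because "50" and "50pct" are now distinct terms, the query "50%" no longer matches "Number of boys 50".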


Another common "workaround" is also shown in some Solr default 
configurations typically used for product search: those use 
WhitespaceTokenizer followed by WordDelimiterFilter. WDF is then able 
to remove accents and handle stuff like product numbers correctly. There 
you can possibly make sure that "%" survives.
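The end state to aim for, whether via WDF's configurable types table (Solr's WordDelimiterFilterFactory accepts a types file with entries such as `% => ALPHA`) or a custom tokenizer, is that '%' counts as a token character. A simplified illustration:

```java
import java.util.Arrays;
import java.util.List;

public class PercentAwareSplitDemo {
    // Split on punctuation, but treat '%' as part of a token, the way a
    // WDF types table mapping '%' to ALPHA would.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9%]+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("40-50% for pass score")); // [40, 50%, for, pass, score]
        System.out.println(tokenize("My score was 50%"));      // [my, score, was, 50%]
        System.out.println(tokenize("Number of boys 50"));     // [number, of, boys, 50]
    }
}
```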


Uwe

Am 20.09.2023 um 22:42 schrieb Amitesh Kumar:



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to retain % sign next to number during tokenization

2023-09-21 Thread Mikhail Khludnev
Hello,
I'm surprised and doubt this can happen. Would you mind uploading a short
test that reproduces it?

On Wed, Sep 20, 2023 at 11:44 PM Amitesh Kumar 
wrote:



-- 
Sincerely yours
Mikhail Khludnev


Re: How to retain % sign next to number during tokenization

2023-09-20 Thread Amitesh Kumar
Thanks Mikhail!

I have tried all the other tokenizers from Lucene 4.4. In the case of
WhitespaceTokenizer, it loses the normalizing of special chars like "-" etc.


On Wed, Sep 20, 2023 at 16:39 Mikhail Khludnev  wrote:



Re: How to retain % sign next to number during tokenization

2023-09-20 Thread Mikhail Khludnev
Hello,
Check the whitespace tokenizer.

On Wed, Sep 20, 2023 at 7:46 PM Amitesh Kumar  wrote:



-- 
Sincerely yours
Mikhail Khludnev


How to retain % sign next to number during tokenization

2023-09-20 Thread Amitesh Kumar
Hi,

I am facing a requirement change to have the % sign retained in searches, e.g.:

Sample search docs:
1. Number of boys 50
2. My score was 50%
3. 40-50% for pass score

Search query: 50%
Expected results: Doc-2 and Doc-3, i.e.
1. My score was 50%
2. 40-50% for pass score

Actual result: All 3 documents (because the tokenizer strips off the % both
during indexing and searching, and hence matches every doc that contains
50).

On the implementation front, I am using a set of filters like
LowerCaseFilter, EnglishPossessiveFilter etc. in addition to the base
tokenizer, StandardTokenizer.
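The over-matching can be reproduced with a StandardTokenizer-like split that simply discards '%' (a dependency-free sketch of the analysis, not the actual Lucene pipeline):

```java
import java.util.Arrays;
import java.util.List;

public class PercentStrippedDemo {
    // Emulates an analyzer that drops '%': split on anything that is not a
    // letter or digit, for documents and queries alike.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        List<String> query = analyze("50%"); // the query reduces to just [50]
        String[] docs = {
            "Number of boys 50", "My score was 50%", "40-50% for pass score"
        };
        for (String doc : docs) {
            // every document contains the term "50", so all three match
            System.out.println(doc + " -> " + analyze(doc).containsAll(query));
        }
    }
}
```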


Per my analysis, StandardTokenizer strips off the % sign, hence the
behavior. Has someone faced a similar requirement? Any help/guidance is
highly appreciated.


Re: How to retain % sign next to number during tokenization

2023-07-18 Thread Amitesh Kumar
Sorry for duplicating the question.

On Tue, Jul 18, 2023 at 19:09 Amitesh Kumar  wrote:

-- 
Regards
Amitesh


How to retain % sign next to number during tokenization

2023-07-18 Thread Amitesh Kumar
I am facing a requirement change to have the % sign retained in searches, e.g.:

Sample search docs:
1. Number of boys 50
2. My score was 50%
3. 40-50% for pass score

Search query: 50%
Expected results: Doc-2 and Doc-3, i.e.
1. My score was 50%
2. 40-50% for pass score

Actual result: All 3 documents (because the tokenizer strips off the % both
during indexing and searching, and hence matches every doc that contains
50).

On the implementation front, I am using a set of filters like
LowerCaseFilter, EnglishPossessiveFilter etc. in addition to the base
tokenizer, StandardTokenizer.


Per my analysis, StandardTokenizer strips off the % sign, hence the
behavior. Has someone faced a similar requirement? Any help/guidance is
highly appreciated.

Regards
Amitesh
-- 
Regards,
Amitesh
Sent from Gmail Mobile
(Please ignore typos)