Re: any analyzer will keep punctuation？

Michael McCandless Mon, 06 Mar 2017 03:51:04 -0800

You could use ICUTokenizer and write a custom RuleBasedBreakIterator .rbbi
file to control precisely where splitting happens, but the rule language
is complex to get right ;)


Another option is to maybe make a CharFilter ahead of StandardTokenizer
that tries to rewrite the punctuation you want to keep into something that
StandardTokenizer would not split on.
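
As a rough illustration of the CharFilter idea, here is a plain-Java sketch
with no Lucene dependency. The class name, the placeholder strings (_bang_,
_dollar_) and the crude tokenizer standing in for StandardTokenizer are all
assumptions made for this example. In Lucene itself the rewrite step would
typically be a MappingCharFilter built from a NormalizeCharMap and applied
before the tokenizer; whether StandardTokenizer leaves a given placeholder
intact should be checked against the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeepPunctuationSketch {
    // Hypothetical mapping: punctuation to keep -> placeholder built only
    // from letters and underscores, so a word tokenizer will not split it.
    static final Map<String, String> KEEP = new LinkedHashMap<>();
    static {
        KEEP.put("!", "_bang_");
        KEEP.put("$", "_dollar_");
    }

    // Stand-in for the CharFilter step: rewrite the text before tokenizing.
    static String rewrite(String text) {
        String out = text;
        for (Map.Entry<String, String> e : KEEP.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Crude stand-in for StandardTokenizer plus lowercasing (an assumption,
    // not the real algorithm): split on anything that is not a letter,
    // a digit, or an underscore.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Rewrite first, then tokenize, mirroring CharFilter -> Tokenizer order.
    static List<String> analyze(String text) {
        return tokenize(rewrite(text));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("P!nk, Ke$ha")); // punctuation splits the names
        System.out.println(analyze("P!nk, Ke$ha"));  // placeholders keep them whole
    }
}
```

The same rewrite has to be applied at query time, otherwise a search for the
original punctuation will not match the placeholder terms in the index.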

Mike McCandless

http://blog.mikemccandless.com

On Mon, Mar 6, 2017 at 5:22 AM, Yonghui Zhao <[email protected]> wrote:

> Yes, the whitespace analyzer keeps punctuation, but it only breaks words
> on whitespace.
>
>
> I didn’t explain my requirement clearly.
>
> I want an analyzer that behaves like the standard analyzer but can keep
> some configured punctuation.
>
> 2017-03-06 18:03 GMT+08:00 Ahmet Arslan <[email protected]>:
>
> > Hi,
> >
> > Whitespace analyser/tokenizer for example.
> >
> > Ahmet
> >
> >
> >
> > On Monday, March 6, 2017 10:21 AM, Yonghui Zhao <[email protected]>
> > wrote:
> > Lucene's standard analyzer removes almost all punctuation.
> > In some cases we want to keep some punctuation. For example, in music
> > search some singer names and album names can themselves be punctuation.
> >
> > Is there an analyzer that lets us customize which punctuation is removed?
> >
>
