You could use ICUTokenizer with a custom RuleBasedBreakIterator rules (.rbbi) file to control precisely where splitting happens, but that rule language is complex to configure ;)
Another option might be to put a CharFilter ahead of StandardTokenizer that rewrites the punctuation you want to keep into something StandardTokenizer will not split on.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Mar 6, 2017 at 5:22 AM, Yonghui Zhao <zhaoyong...@gmail.com> wrote:

> Yes, the whitespace analyzer will keep punctuation, but it only breaks words
> on spaces.
>
> I didn’t explain my requirement clearly.
>
> I want an analyzer like the standard analyzer, but one that can keep some
> configured punctuation.
>
> 2017-03-06 18:03 GMT+08:00 Ahmet Arslan <iori...@yahoo.com.invalid>:
>
> > Hi,
> >
> > Whitespace analyser/tokenizer, for example.
> >
> > Ahmet
> >
> > On Monday, March 6, 2017 10:21 AM, Yonghui Zhao <zhaoyong...@gmail.com>
> > wrote:
> > The Lucene standard analyzer removes almost all punctuation.
> > In some cases we want to keep some punctuation; for example, in music
> > search, some singer names and album names can be punctuation.
> >
> > Is there any analyzer where we can customize which punctuation is removed?
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
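To make the CharFilter suggestion above a bit more concrete, here is a minimal plain-JDK sketch of the idea, not Lucene code. It assumes hypothetical letter-only placeholders ("xexclx", "xdollarx") for the punctuation to keep, and it uses a crude regex splitter as a stand-in for StandardTokenizer, whose real rules (UAX #29 word breaks) are more involved. In Lucene itself the rewrite step would live in a CharFilter, for example a MappingCharFilter built from a NormalizeCharMap and returned from Analyzer.initReader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PunctuationRewriteSketch {

    // Hypothetical placeholders: letter-only strings assumed not to occur in real text.
    static final Map<String, String> KEEP = new LinkedHashMap<>();
    static {
        KEEP.put("!", "xexclx");
        KEEP.put("$", "xdollarx");
    }

    // Step 1 (the CharFilter's job): rewrite the punctuation we want to keep
    // into text that the tokenizer will not split on.
    static String rewrite(String text) {
        String out = text;
        for (Map.Entry<String, String> e : KEEP.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Step 2: crude stand-in for StandardTokenizer plus lowercasing.
    // It splits on every run of characters that are not letters or digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Rewrite first, then tokenize, mirroring CharFilter -> Tokenizer order.
    static List<String> analyze(String text) {
        return tokenize(rewrite(text));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("P!nk - Try")); // without the rewrite step
        System.out.println(analyze("P!nk - Try"));  // with the rewrite step
    }
}
```

Note that the indexed term then contains the placeholder rather than the original character, so the same rewrite has to be applied at query time for searches to match.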