Re: Solr/Lucene Tokenizers - cannot get the behavior I need

Shawn Heisey Sat, 17 Nov 2012 13:48:00 -0800

On 11/16/2012 12:30 PM, Shawn Heisey wrote:

I am extremely interested in the Unicode behavior of ICUTokenizer, but I cannot disable the punctuation-splitting behavior and let WDF handle it properly, which causes recall problems. There is no filter that I can run after tokenization, either. Looking at ICUTokenizer.java, I do not see any way to write my own tokenizer that does what I need.
I have this problem with pretty much all of the tokenizers other than Whitespace. There are situations where I would like to use some of the others, but the punctuation-splitting behavior is a major problem for me.
Do I have any options? I have never looked at the ICU code from IBM, so I don't know if it would require major surgery there.
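
To make the complaint above concrete, here is a minimal Python sketch (not Lucene code) of why no filter can repair things after the fact. The two functions are hypothetical stand-ins: one splits on whitespace only, the other crudely imitates a punctuation-splitting tokenizer with a regex, which is an assumption and not the actual UAX #29 rules that StandardTokenizer or ICUTokenizer follow.

```python
import re


def whitespace_tokenize(text):
    # Stand-in for a whitespace-only tokenizer: punctuation stays inside tokens.
    return text.split()


def punctuation_splitting_tokenize(text):
    # Crude stand-in for a tokenizer that breaks on punctuation:
    # keep only runs of letters and digits, discard everything else.
    return re.findall(r"[^\W_]+", text)


# With whitespace tokenization the hyphen survives, so a later
# word-delimiter style filter can still see it and decide what to do.
print(whitespace_tokenize("wi-fi"))             # one token containing the hyphen
print(whitespace_tokenize("wi fi"))             # two tokens

# With punctuation splitting both inputs produce identical token streams,
# so a downstream filter cannot tell which one it was given.
print(punctuation_splitting_tokenize("wi-fi"))
print(punctuation_splitting_tokenize("wi fi"))
```

Once the delimiter has been discarded by the tokenizer, the token stream no longer carries the information a filter such as WDF would need to build its split and catenated variants.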

Related problem: The entire reason I started down this path is because I'd like to handle CJK better with CJKBigramFilter. It appears that unless you use StandardTokenizer, ClassicTokenizer, or ICUTokenizer, CJKBigramFilter doesn't work ... but none of these tokenizers will handle punctuation right for me.

I seem to remember a discussion some time ago around this, saying that a future version of CJKBigramFilter would drop the requirement that each token be tagged.


Do I need to file an issue about this, and/or start a new discussion thread?

Thanks,
Shawn
