On 11/16/2012 12:30 PM, Shawn Heisey wrote:
I am extremely interested in the Unicode behavior of ICUTokenizer, but I cannot disable its punctuation-splitting behavior and let WordDelimiterFilter (WDF) handle punctuation instead, and that causes recall problems. There is no filter that I can run after tokenization to fix this, either. Looking at ICUTokenizer.java, I do not see any way to write my own tokenizer that does what I need.

I have this problem with pretty much all of the tokenizers other than Whitespace. There are situations where I would like to use some of the others, but the punctuation-splitting behavior is a major problem for me.
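
To make the recall problem concrete, here is a rough sketch in Python. It is not Lucene code: the three functions are hypothetical stand-ins that only approximate a whitespace tokenizer, a tokenizer that breaks at punctuation, and WDF with catenation turned on. The point is that once the tokenizer has already split "wi-fi", WDF never sees the hyphen and cannot produce the catenated form "wifi".

```python
import re

def whitespace_tokenize(text):
    # Split on whitespace only; punctuation stays inside the tokens.
    return text.split()

def punctuation_splitting_tokenize(text):
    # Crude stand-in for a tokenizer that breaks at punctuation:
    # keep only runs of word characters.
    return re.findall(r"\w+", text)

def word_delimiter(tokens):
    # Crude stand-in for WDF with catenation: emit each alphanumeric part
    # of a token and, when there is more than one part, the joined form too.
    out = []
    for tok in tokens:
        parts = re.findall(r"[^\W_]+", tok)
        out.extend(parts)
        if len(parts) > 1:
            out.append("".join(parts))
    return out

text = "wi-fi router"
print(word_delimiter(whitespace_tokenize(text)))
# ['wi', 'fi', 'wifi', 'router']
print(word_delimiter(punctuation_splitting_tokenize(text)))
# ['wi', 'fi', 'router']
```

With the second chain, a query for "wifi" no longer matches a document containing "wi-fi", which is the kind of recall loss I mean.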

Do I have any options? I have never looked at the ICU code from IBM, so I don't know if it would require major surgery there.

Related problem: The entire reason I started down this path is that I'd like to handle CJK better with CJKBigramFilter. It appears that CJKBigramFilter doesn't work unless you use StandardTokenizer, ClassicTokenizer, or ICUTokenizer, but none of these tokenizers handles punctuation the way I need.

I seem to remember a discussion about this some time ago, saying that a future version of CJKBigramFilter would drop the requirement that each token be tagged with a type by the tokenizer.

Do I need to file an issue about this, and/or start a new discussion thread?

Thanks,
Shawn
