Generally, you don't need the preserveOriginal attribute for WDF (WordDelimiterFilter). If you generate both the word parts and the concatenated terms, queries should work fine without the original token. The word parts are indexed as a sequence, and at query time the same split produces a phrase query that matches the indexed sequence. Because the concatenated term is indexed as well, it can be queried directly too.
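
To make the reasoning concrete, here is a rough Python sketch of the idea. It is a toy model, not Lucene's or Solr's actual code: the function names (toy_wdf, index_text, phrase_match, matches) and the sample text are made up for illustration, it splits only on non-alphanumeric characters (the real filter can also split on case changes and letter/digit transitions), and placing the concatenated term at the position of the first part is an assumption.

```python
import re


def split_parts(token):
    """Lowercase a token and split it on runs of non-alphanumeric characters."""
    return [p for p in re.split(r"[^0-9A-Za-z]+", token.lower()) if p]


def toy_wdf(token, position):
    """Toy stand-in for WDF with word parts and catenation on, original off.

    Returns the emitted (term, position) pairs and the next free position.
    """
    parts = split_parts(token)
    emitted = [(part, position + i) for i, part in enumerate(parts)]
    if len(parts) > 1:
        # Assumption: the concatenated term shares the first part's position.
        emitted.append(("".join(parts), position))
    return emitted, position + max(len(parts), 1)


def index_text(text):
    """Whitespace-tokenize, run the toy filter, and build term -> positions."""
    index = {}
    position = 0
    for token in text.split():
        emitted, position = toy_wdf(token, position)
        for term, pos in emitted:
            index.setdefault(term, set()).add(pos)
    return index


def phrase_match(index, terms):
    """True if the terms occur at consecutive positions in the index."""
    for start in index.get(terms[0], ()):
        if all(start + i in index.get(t, ()) for i, t in enumerate(terms)):
            return True
    return False


def matches(index, query):
    """Each whitespace-separated query token becomes a phrase of its parts."""
    for token in query.split():
        parts = split_parts(token)
        if not parts or not phrase_match(index, parts):
            return False
    return True


if __name__ == "__main__":
    idx = index_text("Wi-Fi router")
    for q in ("wi-fi", "wifi", "wi", "fi-wi"):
        print(q, matches(idx, q))
```

In this sketch the original token "wi-fi" is never stored, yet a query for "wi-fi" still matches because its parts line up as a phrase, and "wifi" matches through the concatenated term.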

With that issue out of the way, is there a remaining issue here?

-- Jack Krupansky

-----Original Message-----
From: Shawn Heisey
Sent: Friday, November 16, 2012 11:30 AM
To: solr-user@lucene.apache.org
Subject: Solr/Lucene Tokenizers - cannot get the behavior I need

I cannot seem to get the combination of behaviors that I want from the
tokenizer/filter combinations in Solr.

Right now I am using WhitespaceTokenizer.  This does not split on
punctuation, which is the behavior I want, because I do this myself
later.  I use WordDelimiterFilter with preserveOriginal so that
documents with text in the format "Word1-Word2" can be located by a
search for word1word2 as well as the two words individually.

I am extremely interested in the Unicode behavior of ICUTokenizer, but I
cannot disable the punctuation-splitting behavior and let WDF handle it
properly, which causes recall problems.  There is no filter that I can
run after tokenization, either.  Looking at ICUTokenizer.java, I do not
see any way to write my own tokenizer that does what I need.

I have this problem with pretty much all of the tokenizers other than
Whitespace.  There are situations where I would like to use some of the
others, but the punctuation-splitting behavior is a major problem for me.

Do I have any options?  I have never looked at the ICU code from IBM, so
I don't know if it would require major surgery there.

Thanks,
Shawn
