On 11/16/2012 12:36 PM, Jack Krupansky wrote:
> Generally, you don't need the preserveOriginal attribute for WDF.
> Generate both the word parts and the concatenated terms, and queries
> should work fine without the original. The separated terms will be
> indexed as a sequence, and the split/separated terms will generate a
> phrase query that matches the indexed sequence. And if you index the
> concatenated terms, that can be queried as well.
>
> With that issue out of the way, is there a remaining issue here?

You're right, that's handled by catenateWords. I do need
preserveOriginal for other things, though. I think it's unimportant for
this discussion. I may consider removing it at a later stage, but right
now our assessment is that we need it.

The immediate problem is that when ICUTokenizer is done with an input of
"Word1-Word2" I am left with two tokens, Word1 and Word2. The
punctuation in the middle is gone. Even if WDF is the very next thing
in the analysis chain, there's nothing for it to do - the fact that
Word1 and Word2 were connected by punctuation is entirely lost.
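
To make that concrete, here is a toy sketch in Python. It is not Lucene
code: the function names are made up for illustration, the first
tokenizer only mimics the "break on punctuation" behavior I described
above, and the word-delimiter stand-in models only the punctuation split
plus the catenateWords and preserveOriginal options, none of the other
things real WDF does.

```python
import re


def punctuation_splitting_tokenizer(text):
    """Hypothetical stand-in for a tokenizer that breaks on punctuation.

    The hyphen is discarded here, so later filters never see it.
    """
    return re.findall(r"[A-Za-z0-9]+", text)


def whitespace_tokenizer(text):
    """Hypothetical stand-in for a tokenizer that splits on whitespace only."""
    return text.split()


def toy_word_delimiter(tokens, catenate_words=True, preserve_original=False):
    """Toy model of a word-delimiter step (assumption: punctuation splits only).

    For each token containing intra-word punctuation, emit the parts,
    optionally the original token, and optionally the concatenated form.
    """
    out = []
    for token in tokens:
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
        if len(parts) <= 1:
            # Nothing to split: pass the token through unchanged.
            out.extend(parts)
            continue
        if preserve_original:
            out.append(token)
        out.extend(parts)
        if catenate_words:
            out.append("".join(parts))
    return out


# Punctuation already removed by the tokenizer: no concatenated term appears.
print(toy_word_delimiter(punctuation_splitting_tokenizer("Word1-Word2")))
# ['Word1', 'Word2']

# Punctuation still present when the delimiter step runs.
print(toy_word_delimiter(whitespace_tokenizer("Word1-Word2")))
# ['Word1', 'Word2', 'Word1Word2']
```

In the first case the delimiter step receives two unrelated tokens and
has no way to produce Word1Word2, which is the situation I'm in.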
Thanks,
Shawn