Re: Why does the StandardTokenizer split hyphenated words?

Daniel Naber Wed, 15 Dec 2004 11:55:14 -0800

On Wednesday 15 December 2004 19:29, Mike Snare wrote:

> In my case, the words are keywords that must remain as is, searchable
> with the hyphen in place. It was easy enough to modify the tokenizer
> to do what I need, so I'm not really asking for help there. I'm
> really just curious as to why it is that "a-1" is considered a single
> token, but "a-b" is split.


a-1 is considered a typical product name that needs to be unchanged 
(there's a comment in the source that mentions this). Indexing 
"hyphen-word" as two tokens has the advantage that it can then be found 
with the following queries:
hyphen-word (will be turned into a phrase query internally)
"hyphen word" (phrase query)
(it cannot be found searching for hyphenword, however).
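
To make the behaviour concrete, here is a rough sketch in Python. It is not the actual StandardTokenizer grammar, only a simplified approximation that assumes a hyphenated chunk is kept whole when it contains a digit and is split at the hyphens otherwise. The names tokenize and contains_phrase are made up for this illustration and are not part of Lucene.

```python
import re

# Hypothetical approximation: runs of ASCII letters/digits joined by hyphens.
# The real tokenizer's grammar is more involved than this.
_CANDIDATE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into tokens, keeping digit-bearing hyphenated chunks whole."""
    tokens = []
    for match in _CANDIDATE.finditer(text):
        chunk = match.group()
        if any(ch.isdigit() for ch in chunk):
            # Looks like a product name such as "a-1": leave it unchanged.
            tokens.append(chunk)
        else:
            # Plain hyphenated words such as "a-b": one token per part.
            tokens.extend(chunk.split("-"))
    return tokens


def contains_phrase(doc_tokens, query_tokens):
    """Return True if query_tokens occur consecutively in doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )
```

With this sketch, a document containing "hyphen-word" is indexed as the two tokens "hyphen" and "word". Running the queries hyphen-word and "hyphen word" through the same tokenize function gives the same two-token phrase, so both match, while hyphenword stays a single token and does not.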

Regards
 Daniel

-- 
http://www.danielnaber.de

