The problem with WhitespaceAnalyzer is that if the text contains, say, the sentence
"lucene is indexing.", a query for "indexing" will produce no hits, because "." is
not a token delimiter: the indexed token is actually "indexing.". You would have to
search for "indexing*" instead.
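To illustrate the problem, here is a tiny standalone sketch (plain Java, no Lucene classes; the class and method names are made up for the example) showing that whitespace-only tokenization leaves the trailing period attached to the last token:

```java
// Hypothetical stand-in for whitespace-only tokenization, as
// WhitespaceAnalyzer does it: split on whitespace and nothing else,
// so punctuation stays glued to the adjacent word.
public class WhitespaceSplitDemo {
    public static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        for (String t : tokens("lucene is indexing.")) {
            System.out.println(t);  // prints: lucene / is / indexing.
        }
    }
}
```

The last token comes out as "indexing.", which is why an exact query for "indexing" cannot match it.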

For me, the solution was to write my own tokenizer/analyzer pair:

--snip

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public final class myTokenizer extends CharTokenizer {
  /** Constructs a new myTokenizer reading from the given Reader. */
  public myTokenizer(Reader in) {
    super(in);
  }

  /** Normalizes each token character to lower case. */
  protected char normalize(char c) {
    return Character.toLowerCase(c);
  }

  /** Collects only characters which satisfy
   * {@link Character#isLetterOrDigit(char)}; everything else
   * (whitespace and punctuation such as ".") delimits tokens. */
  protected boolean isTokenChar(char c) {
    return Character.isLetterOrDigit(c);
  }
}

public final class myAnalyzer extends Analyzer {
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    return new myTokenizer(reader);
  }
}
--snip
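With this pair, punctuation and whitespace both delimit tokens and everything is lower-cased, so a query for "indexing" matches. A small standalone sketch (plain Java, no Lucene dependency; the class name is invented) mimics the isTokenChar/normalize logic above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer above: keeps runs of
// letters/digits as tokens and lower-cases them, so "." acts as a
// delimiter just like whitespace does.
public class LetterDigitTokenizerDemo {
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c)); // normalize()
            } else if (cur.length() > 0) {            // delimiter ends the token
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Lucene is indexing."));
        // prints: [lucene, is, indexing]
    }
}
```

Note that the final period no longer produces an "indexing." token, and the upper-case "L" is normalized away.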


regards joe

"RAYMOND Romain" <[EMAIL PROTECTED]> writes on 
Tue, 26 Mar 2002 08:53:51 +0100 (MET):

> hello,
> 
> The solution we adopted is to use WhitespaceAnalyzer.
> If you print the result of a query after parsing it (with the parse
> method), you can see that the default tokenizers delete the numbers
> from the query.
> But WhitespaceAnalyzer tokenizes based only on ... spaces, so we can
> still search on number values ....
> 
> --
> To unsubscribe, e-mail: 
> <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>
> 


