I wrote a slightly modified version of the WhitespaceTokenizer that
lets me treat other characters as whitespace. My thought was that
this would be an easy way to make it tokenize on characters such as "-".
My tokenizer looks like this:
public class CustomWhiteSpaceTokenizer extends CharTokenizer
{
    // A character belongs to a token unless it is real whitespace or
    // one of the extra characters held in whiteSpaceChars_.
    protected boolean isTokenChar(char c)
    {
        return !(Character.isWhitespace(c)
                 || whiteSpaceChars_.contains(Character.valueOf(c)));
    }

    <snip other stuff>
}
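To make the intent concrete, here is a minimal stand-alone sketch of the splitting I expect from that character test. It is plain Java with no Lucene classes, so it only models the tokenizer and says nothing about what the QueryParser does with the result. The class name SeparatorSplitSketch and the method tokenize are made up for illustration, and it assumes "-" is the only extra whitespace character.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone model of the tokenizer; not Lucene's CharTokenizer.
class SeparatorSplitSketch {
    // Extra characters to treat as whitespace (assumed to be just "-" below).
    private final Set<Character> whiteSpaceChars = new HashSet<>();

    SeparatorSplitSketch(String extraSeparators) {
        for (char c : extraSeparators.toCharArray()) {
            whiteSpaceChars.add(c);
        }
    }

    // Same test as isTokenChar in the tokenizer above.
    boolean isTokenChar(char c) {
        return !(Character.isWhitespace(c) || whiteSpaceChars.contains(c));
    }

    // Collect maximal runs of token characters; separator runs are dropped.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        SeparatorSplitSketch sketch = new SeparatorSplitSketch("-");
        System.out.println(sketch.tokenize("john a"));   // prints [john, a]
        System.out.println(sketch.tokenize("john--a"));  // prints [john, a]
    }
}
```

In this model both inputs produce the same two tokens, which is why I expected the two queries below to parse the same way.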
When I use my Analyzer (which uses this tokenizer) in the QueryParser,
with the character "-" defined as whitespace, the following query is
parsed like this:
"title:(john a) body:(john a) " -> (title:john title:a) (body:john body:a)
which is what I expect. But the following query:
"title:(john--a) body:(john--a) " -> title:"john a" body:"john a"
comes out as phrase queries, which isn't what I want. I can't figure
out why the parser behaves differently for these two characters (space
vs. hyphen) when I am specifying both of them as non-token characters.
This is with the svn trunk as of yesterday.
Any help appreciated,
Thanks,
Dan
--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/