I wrote a slightly modified version of the WhitespaceTokenizer that lets me treat other characters as whitespace. My thought was that this would be an easy way to make it tokenize on characters such as "-".

My tokenizer looks like this:

public class CustomWhiteSpaceTokenizer extends CharTokenizer
{

   protected boolean isTokenChar(char c)
   {
       // A character belongs to a token unless it is real whitespace
       // or one of the extra characters configured as whitespace.
       return !(Character.isWhitespace(c) || whiteSpaceChars_.contains(Character.valueOf(c)));
   }

<snip other stuff>
}

When I pass an Analyzer built on this tokenizer to the QueryParser, with the character "-" defined as whitespace, the following query is parsed like this:

"title:(john  a) body:(john  a) " -> (title:john title:a) (body:john body:a)

which is what I expect. However, the following query:

"title:(john--a) body:(john--a) " -> title:"john a" body:"john a"

is not what I want: each field ends up with a single phrase query instead of two separate terms. I can't figure out why it behaves differently for these characters (space vs. hyphen) when I am specifying both of them as non-token characters.
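
To make the expectation concrete, here is a minimal standalone sketch of the splitting rule, with no Lucene dependency. The class name SplitSketch, the field extraWhitespace (a stand-in for whiteSpaceChars_), and the tokenize helper are made up for illustration and are not part of my real tokenizer or of Lucene. It only shows that the predicate itself does not distinguish a space from a hyphen:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitSketch {
    // Hypothetical stand-in for whiteSpaceChars_: extra characters treated as whitespace.
    private final Set<Character> extraWhitespace = new HashSet<>();

    public SplitSketch(String extraChars) {
        for (char c : extraChars.toCharArray()) {
            extraWhitespace.add(c);
        }
    }

    // Same rule as isTokenChar in the tokenizer above.
    public boolean isTokenChar(char c) {
        return !(Character.isWhitespace(c) || extraWhitespace.contains(c));
    }

    // Assumed behavior: collect maximal runs of token characters, drop everything else.
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        SplitSketch sketch = new SplitSketch("-");
        System.out.println(sketch.tokenize("john  a"));  // prints [john, a]
        System.out.println(sketch.tokenize("john--a"));  // prints [john, a]
    }
}
```

Both inputs produce the same two tokens here, so the different parse results presumably do not come from the character test itself.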

This is with the svn trunk as of yesterday.
Any help appreciated,

Thanks,

Dan

--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/

