WhiteSpace Tokenizer question

Dan Armbrust Tue, 23 Aug 2005 08:32:19 -0700

I wrote a slightly modified version of the WhiteSpaceTokenizer that allows me to treat other characters as whitespace. My thought was that this would be an easy way to make it tokenize on characters such as "-".


My tokenizer looks like this:


public class CustomWhiteSpaceTokenizer extends CharTokenizer
{

    protected boolean isTokenChar(char c)
    {
        if (Character.isWhitespace(c) || whiteSpaceChars_.contains(new Character(c)))
        {
            return false;
        }
        else
        {
            return true;
        }
    }

    <snip other stuff>
}
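
To show the token boundaries I expect from this rule, here is a minimal stand-alone sketch of the same character test without any Lucene classes. The class name DelimiterSplitDemo, the whiteSpaceChars field, and the tokenize method are made up for this illustration and are not part of my actual tokenizer or of Lucene; it only assumes that a token is a maximal run of characters accepted by isTokenChar.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DelimiterSplitDemo {
    // Hypothetical stand-in for the whiteSpaceChars_ field in the tokenizer above.
    private final Set<Character> whiteSpaceChars = new HashSet<>();

    public DelimiterSplitDemo(String extraWhitespace) {
        for (char c : extraWhitespace.toCharArray()) {
            whiteSpaceChars.add(c);
        }
    }

    // Same rule as isTokenChar above: real whitespace and the extra
    // characters are both treated as separators.
    boolean isTokenChar(char c) {
        return !(Character.isWhitespace(c) || whiteSpaceChars.contains(c));
    }

    // Collects maximal runs of token characters, dropping the separators.
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        DelimiterSplitDemo demo = new DelimiterSplitDemo("-");
        System.out.println(demo.tokenize("john  a"));  // prints [john, a]
        System.out.println(demo.tokenize("john--a"));  // prints [john, a]
    }
}
```

At the character level, then, "john  a" and "john--a" split into the same two tokens, which is why I expected the two queries below to parse the same way.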

When I use my Analyzer, which uses this tokenizer, in the QueryParser with the character "-" defined as whitespace, the following query gets parsed like this:


"title:(john  a) body:(john  a) " -> (title:john title:a) (body:john body:a)

which is what I expect.  But then the following query:

"title:(john--a) body:(john--a) " -> title:"john a" body:"john a"

isn't what I want. I can't figure out why it behaves differently on these two characters (space vs. hyphen) when I am specifying both of them as non-token characters.


This is with the svn trunk as of yesterday.
Any help appreciated,

Thanks,

Dan

--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/

