One of the token patterns defined by StandardTokenizer.jj is this:

<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
     | <HAS_DIGIT> <P> <ALPHANUM>
     | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      )

So basically, if you have sequences of characters separated by a "-"
character, a sequence that contains a digit is combined with the sequences
adjacent to it to form a single token. In WS-CA-PP00-PROCESS-YYMM, only PP00
contains a digit, so it pulls in its neighbors CA and PROCESS, while WS and
YYMM sit next to digit-free sequences and are left on their own. That
explains why the WS and YYMM sequences got separated out.

You can alter this behavior with some simple changes to
StandardTokenizer.jj.

----- Original Message -----
From: "Iain Young" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, December 16, 2003 7:46 AM
Subject: RE: Disabling modifiers?

> I think it is a problem with the indexing. I've found another example...
>
> WS-CA-PP00-PROCESS-YYMM
>
> I've looked at the index, and it has been tokenized into 3 words...
>
> WS
> CA-PP00-PROCESS
> YYMM
>
> Looks as though I might have to use a custom tokenizer as well as an
> analyzer then, but any ideas as to why the standard tokenizer would have
> split the variable up like this (i.e. why didn't it split the middle bit,
> only the word off either end)? The only thing I can think of is that there
> are several other variables in the source beginning with WS- or ending with
> -YYMM, so could the tokenizer have seen this and be doing something clever
> with them?
>
> Thanks,
> Iain
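
P.S. As a rough check on the explanation at the top of this reply, here is a small Python sketch that mimics the NUM rule with regular expressions. It is not the Lucene tokenizer: the character classes are ASCII-only stand-ins, the definition of <P> is my assumption about what the grammar uses (check your copy of StandardTokenizer.jj), all other token rules (e-mail, host, acronym and so on) are left out, and the name tokenize is purely illustrative. It tries every NUM alternative plus plain ALPHANUM at each position and keeps the longest match, which is how JavaCC chooses between competing token patterns.

```python
import re

# ASCII-only stand-ins for the grammar's character classes (assumption:
# the real grammar also accepts Unicode letters and digits).
A = r"[A-Za-z0-9]+"                   # <ALPHANUM>
H = r"[A-Za-z0-9]*[0-9][A-Za-z0-9]*"  # <HAS_DIGIT>: at least one digit
P = r"[_\-/.,]"                       # <P>: assumed separator set

# The six alternatives of the NUM rule, in the order quoted above,
# followed by plain ALPHANUM as the fallback word pattern.
PATTERNS = [re.compile(p) for p in (
    f"{A}{P}{H}",
    f"{H}{P}{A}",
    f"{A}(?:{P}{H}{P}{A})+",
    f"{H}(?:{P}{A}{P}{H})+",
    f"{A}{P}{H}(?:{P}{A}{P}{H})+",
    f"{H}{P}{A}(?:{P}{H}{P}{A})+",
    A,
)]


def tokenize(text):
    """Split text into tokens, taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = pos
        for pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m and m.end() > best:
                best = m.end()
        if best > pos:
            tokens.append(text[pos:best])
            pos = best
        else:
            pos += 1  # no pattern starts here (e.g. a lone "-"): skip it
    return tokens


print(tokenize("WS-CA-PP00-PROCESS-YYMM"))
```

With these assumptions the sketch reproduces the three tokens Iain saw in his index, and a name with no digits at all, such as WS-CA-PROCESS, comes out as three separate words.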