Would it be simpler just to modify the input with a regex rather than risk
messing with StandardANalyzer? Or wouldn't that do what you need?
On 1/11/07, Van Nguyen <[EMAIL PROTECTED]> wrote:
Hi,
I need to modify the StandardAnalyzer so that it will tokenize zip codes
that look like this:
92626-2646
I think the part I need to modify is in here - specifically:
<HAS_DIGIT> <P> <ALPHANUM>
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
| <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
| <HAS_DIGIT> <P> <ALPHANUM>
| <HAS_DIGIT> <M>
| <HAS_DIGIT> (<P> <HAS_DIGIT>)+ <M>
| <LETTER> (<P> <LETTER>)+
| <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
| <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
| <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
| <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
)
>
Is there a way to keep that line so that the StandardAnalyzer works as
is - but tokenize anything that looks like
(HAS_DIGITS) <P>) | (<HAS_DIGITS> <P> <HAS_DIGITS>) or even better:
(<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P>) |
<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P><DIGIT><DIGIT><DIGIT><DIGIT>) - I
have zip codes that look like 92626, 92626-, and 92626-2646
I've tried adding that both lines to the "SKIP" section - but to no
avail.