I think you're absolutely right, Erick.

Thanks for the insight - that's the direction I'll be heading.

Cheers,

-D

-----Original Message-----
From: Erick Erickson [mailto:[email protected]]
Sent: Friday, July 13, 2012 8:53 AM
To: [email protected]
Subject: Re: Pattern Analyzer

Sure, you can do it that way. But first I'd look over the zillion
tokenizers and filters that are available and string together the ones
that best suit your need. For instance, WhitespaceTokenizer and
PatternReplaceFilter might make your regex much easier since the
PatternReplaceFilter gets just the whitespace-delimited tokens to operate
on. You can hook arbitrary numbers of Filters into your chain, so you
could add LowerCaseFilter and so on.
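
To make the chain idea concrete, here is a rough plain-Java sketch of the three steps. It does not use Lucene at all: the class name TokenChainSketch, the analyze method, and the EDGE_PUNCT pattern are made up for illustration, and each step only stands in for the role that WhitespaceTokenizer, PatternReplaceFilter, and LowerCaseFilter would play. Dropping tokens that end up empty is a choice of this sketch, not something the filters are claimed to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class TokenChainSketch {
    // Hypothetical per-token pattern: runs of ASCII punctuation at the
    // start or at the end of a single whitespace-delimited token.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on whitespace (the tokenizer's role).
        for (String raw : text.trim().split("\\s+")) {
            // Step 2: strip punctuation at the token edges only
            // (the pattern-replace filter's role).
            String stripped = EDGE_PUNCT.matcher(raw).replaceAll("");
            // Step 3: lowercase (the lowercase filter's role).
            String lowered = stripped.toLowerCase(Locale.ROOT);
            // Sketch-only choice: skip tokens that were all punctuation.
            if (!lowered.isEmpty()) {
                tokens.add(lowered);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, it's, state-of-the-art]
        System.out.println(analyze("Hello, (World)! it's state-of-the-art."));
    }
}
```

Because the regex only ever sees one token at a time, it can be anchored with ^ and $ and never has to describe whitespace, which is what keeps it short. Punctuation inside a token, such as the apostrophe in "it's", is left alone.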

But unless your case is pretty unusual, I'd claim just using the pre-built
Tokenizers and Filters will probably work for you, or at least I'd check
that out first.
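
One thing worth checking before going the single-regex route: any alternative in a delimiter pattern that contains \w makes that word character part of the delimiter. The small check below uses only java.util.regex and assumes the pattern is applied the way String.split applies it; the class name DelimiterRegexCheck and its split helper are invented for this sketch and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DelimiterRegexCheck {
    // The delimiter pattern proposed in the original question.
    static final Pattern PROPOSED =
            Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}");

    // Splits the text the way String.split would with this pattern.
    static List<String> split(String text) {
        return Arrays.asList(PROPOSED.split(text));
    }

    public static void main(String[] args) {
        // prints [hell, , world]
        // "o," matches \w\p{Punct}, so the final "o" of "hello" is
        // swallowed along with the comma; the following space is a
        // second, adjacent delimiter, which leaves an empty piece.
        System.out.println(split("hello, world"));
    }
}
```

So with this pattern "hello," would come out as "hell" rather than "hello". Stripping punctuation per token after a whitespace split sidesteps the problem.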

Best
Erick

On Thu, Jul 12, 2012 at 2:20 PM, Dave Seltzer <[email protected]> wrote:
> Hello,
>
> I have a search project which uses the Lucene PatternAnalyzer for its
> text/query analysis.
>
> At the moment it's configured like so:
> analyzer = new PatternAnalyzer(Version.LUCENE_35,
> Pattern.compile("\\s+"), true, null);
>
> My goal here was to split words based on spaces and make things case
> insensitive.
>
> In thinking about this, however, I probably want to be a little bit more
> sophisticated. I'd like to ignore punctuation which occurs at the end
> or beginning of a word.
>
> Is this simply a matter of writing a regex which treats those cases
> the same as a space?
>
> Would I use something like this:
> analyzer = new PatternAnalyzer(Version.LUCENE_35,
> Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}"), true, null);
>
> Thanks so much!
>
> Dave
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
