Re: indexing multiple email addresses in one field

Matthew Hall Fri, 31 Jul 2009 05:53:24 -0700

And to address the stop word issue, you can override the stop word list that it uses.

Most analyzers that use stop words (Standard included) have an option to pass in an arbitrary list of stop words, which will override the defaults.

You could also just roll your own (which is what you are going to end up doing here anyhow). When you do, just don't include stop word removal in the processing of your token stream.
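
To illustrate why stop word removal matters for email addresses, here is a minimal plain-Java sketch. It is only a model of the idea, not Lucene's API: the class StopWordDemo, the method tokenize, the three-word stop list and the sample address are all made up for illustration. It splits on anything that is not a letter or digit, then drops whatever is in the supplied stop set. Passing an empty set corresponds to "don't include stop word removal".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordDemo {

    // Illustrative stop list only (an assumption); not the list shipped with any analyzer.
    static final Set<String> SAMPLE_STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "it", "and"));

    /** Lower-cases the text, splits on non-alphanumerics, and drops terms found in stopWords. */
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty() || stopWords.contains(term)) {
                continue; // skip empty fragments and stop words
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        String email = "the.it.team@example.com"; // hypothetical address

        // With stop word removal, part of the address silently disappears.
        System.out.println(tokenize(email, SAMPLE_STOP_WORDS));
        // prints [team, example, com]

        // With an empty stop set, every part of the address is kept.
        System.out.println(tokenize(email, Collections.<String>emptySet()));
        // prints [the, it, team, example, com]
    }
}
```

In a real analyzer the same effect comes from either handing it an empty stop word list or leaving the stop filter out of the token stream altogether.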


Matt

Phil Whelan wrote:

Hi Matthew / Paul,

On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan<co...@aconex.com> wrote:

Matthew Hall wrote:

Place a delimiter between the email addresses that doesn't get removed in
your analyzer.  (preferably something you know will never be searched on)

Or add them separately (rather than:
 doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo", ...));
use
 doc.add(new Field("email", "f...@bar.com", ...));
 doc.add(new Field("email", "b...@foo.com", ...));
 doc.add(new Field("email", "c...@bar.foo", ...));
), using an Analyzer that overrides getPositionIncrementGap(). This inserts
a 'gap' between each set of Tokens for the same Field, which stops phrase
queries from 'crossing the boundaries' between subsequent values.


I like the sound of that! I think I understand it.
getPositionIncrementGap() returns 0 by default, which keeps the "email"
field tokens sequential. Overriding it to return 1 adds an effective
blank token between the email addresses (returning 2 would leave two).
Similar to Matthew's delimiter token, but a bit neater.

So the tokens (with positions in brackets) would look something like this:

"foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)"

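The position arithmetic above can be checked with a small plain-Java simulation. This is a sketch of the bookkeeping only, not Lucene code: the class PositionGapDemo and its methods are hypothetical, the three addresses are inferred from the token listing, and the tokenizer is assumed to split on anything that is not a letter or digit. The gap is added once before each value after the first, and an exact phrase (no slop) matches only when its terms sit at consecutive positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PositionGapDemo {

    /** One token together with the position assigned to it. */
    static final class Tok {
        final String term;
        final int position;
        Tok(String term, int position) { this.term = term; this.position = position; }
    }

    /** Assigns positions to the tokens of several values added under the same field name. */
    static List<Tok> index(String[] values, int gap) {
        List<Tok> out = new ArrayList<>();
        int next = 0; // next free position
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                next += gap; // the gap applies between values, not before the first one
            }
            for (String term : values[i].toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    out.add(new Tok(term, next++));
                }
            }
        }
        return out;
    }

    /** Renders tokens as "term(position)" separated by spaces. */
    static String render(List<Tok> toks) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : toks) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.term).append('(').append(t.position).append(')');
        }
        return sb.toString();
    }

    /** True if the phrase terms occur at consecutive positions (exact phrase, no slop). */
    static boolean matchesPhrase(List<Tok> toks, String... phrase) {
        for (Tok start : toks) {
            if (!start.term.equals(phrase[0])) continue;
            boolean ok = true;
            for (int k = 1; k < phrase.length && ok; k++) {
                ok = hasTermAt(toks, phrase[k], start.position + k);
            }
            if (ok) return true;
        }
        return false;
    }

    static boolean hasTermAt(List<Tok> toks, String term, int position) {
        for (Tok t : toks) {
            if (t.position == position && t.term.equals(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] emails = {"foo@bar.com", "bar@foo.com", "com@bar.foo"};

        System.out.println(render(index(emails, 0)));
        // prints foo(0) bar(1) com(2) bar(3) foo(4) com(5) com(6) bar(7) foo(8)
        System.out.println(render(index(emails, 1)));
        // prints foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)

        // "com com" only exists across the boundary between the 2nd and 3rd address.
        System.out.println(matchesPhrase(index(emails, 0), "com", "com")); // prints true
        System.out.println(matchesPhrase(index(emails, 1), "com", "com")); // prints false
    }
}
```

Note that a gap of 1 only blocks exact phrases. A phrase query with slop could still bridge a one-position hole, so a larger gap is the safer choice if sloppy phrase queries are in play.
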
Up until now I've only been using the WhitespaceAnalyzer, as I've been
keeping quite a tight control over the fields going into the index
(not making best use of Lucene).

What Analyzer would you recommend I use for this? I'll also be
indexing IPs, and other things, but that's pretty much the same story.
It seems I have to use the same Analyzer for all the fields in the
index?

I've been looking at StandardAnalyzer, but I do not want to remove
stop words. I mainly want to keep letters and numbers, and also
override getPositionIncrementGap. Is there anything that does these
things already, or close to it? Overriding getPositionIncrementGap
shouldn't be difficult though.

Cheers,
Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012


