Re: indexing multiple email addresses in one field

Phil Whelan Fri, 31 Jul 2009 07:45:44 -0700

Thanks Matt. Thanks Paul. I'm up early (PST) and ready for a major
rewrite of my indexer. I think these changes are going to make a huge
difference.


Cheers,
Phil

On Fri, Jul 31, 2009 at 5:52 AM, Matthew Hall<mh...@informatics.jax.org> wrote:
> And to address the stop word issue, you can override the stop word list that
> it uses.
>
> Most analyzers that use stop words (Standard included) have an option to
> pass in an arbitrary list of stop words, which overrides the defaults.
>
> You could also just roll your own (which is what you are going to end up
> doing here anyhow). When you do, just don't include stop word removal in the
> processing of your token stream.
>
> Matt
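
To illustrate Matt's point about overriding the stop word list, here is a
minimal plain-Java sketch. It is not Lucene code: StopWordSketch, analyze()
and DEFAULT_STOP_WORDS are hypothetical names, and the default list is made
up for the example. It only shows that passing an empty stop word set keeps
every token, which is the effect described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of an analyzer with a replaceable stop word list.
// Assumption: this only mimics the idea; it does not use the Lucene API.
class StopWordSketch {

    // Hypothetical defaults for illustration, not Lucene's actual list.
    static final Set<String> DEFAULT_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "the", "to"));

    // Lowercase the text, split on anything that is not a letter or digit,
    // and drop tokens that appear in the supplied stop word set.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With the default list, "to" and "the" are removed.
        System.out.println(analyze("To the admin", DEFAULT_STOP_WORDS));
        // With an empty list, nothing is removed.
        System.out.println(analyze("To the admin", Collections.<String>emptySet()));
    }
}
```

The stop word set is a parameter rather than a constant inside analyze(),
so "no stop word removal" is just the empty-set case.
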
>
> Phil Whelan wrote:
>>
>> Hi Matthew / Paul,
>>
>> On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan<co...@aconex.com> wrote:
>>
>>>
>>> Matthew Hall wrote:
>>>
>>>>
>>>> Place a delimiter between the email addresses that doesn't get removed in
>>>> your analyzer (preferably something you know will never be searched on).
>>>>
>>>
>>> Or add them separately. Rather than:
>>>  doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo", ...));
>>> use:
>>>  doc.add(new Field("email", "f...@bar.com", ...));
>>>  doc.add(new Field("email", "b...@foo.com", ...));
>>>  doc.add(new Field("email", "c...@bar.foo", ...));
>>> with an Analyzer that overrides getPositionIncrementGap(). This inserts
>>> a 'gap' between each set of Tokens for the same Field, which stops phrase
>>> queries from 'crossing the boundaries' between subsequent values.
>>>
>>
>> I like the sound of that! I think I understand it.
>> getPositionIncrementGap() returns 0 by default, which keeps the "email"
>> field tokens sequential. Overriding it to return 1 adds an effective
>> blank token between the email addresses (returning 2 would leave
>> 2). This is similar to Matthew's delimiter token, but a bit neater.
>>
>> So the tokens (with positions in brackets) would look something like this:
>>
>> "foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)"
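
The position arithmetic above can be checked with a small plain-Java model.
This is a sketch, not Lucene code: PositionGapSketch, layout(), render() and
phraseMatches() are hypothetical names, and the token triples are taken from
the example line above. In Lucene itself the gap comes from overriding
getPositionIncrementGap(String fieldName) on the Analyzer; here it is just an
int parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of token positions in a multi-valued field.
// Assumption: this mimics the position bookkeeping only; it is not Lucene.
class PositionGapSketch {

    // Lay the tokens of each value out on a position axis. The list index is
    // the position; a null entry is an empty position left by the gap.
    static List<String> layout(List<List<String>> values, int gap) {
        List<String> slots = new ArrayList<>();
        for (List<String> value : values) {
            if (!slots.isEmpty()) {
                for (int i = 0; i < gap; i++) {
                    slots.add(null); // empty position between two values
                }
            }
            slots.addAll(value);
        }
        return slots;
    }

    // Render the non-empty positions as "token(position)" pairs.
    static String render(List<String> slots) {
        StringBuilder out = new StringBuilder();
        for (int pos = 0; pos < slots.size(); pos++) {
            if (slots.get(pos) == null) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(slots.get(pos)).append('(').append(pos).append(')');
        }
        return out.toString();
    }

    // An exact phrase matches only if its tokens sit at consecutive positions.
    static boolean phraseMatches(List<String> slots, String... phrase) {
        return Collections.indexOfSubList(slots, Arrays.asList(phrase)) >= 0;
    }

    public static void main(String[] args) {
        List<List<String>> values = Arrays.asList(
                Arrays.asList("foo", "bar", "com"),
                Arrays.asList("bar", "foo", "com"),
                Arrays.asList("com", "bar", "foo"));

        // Gap of 1 reproduces the positions in the example above.
        System.out.println(render(layout(values, 1)));
        // "com com" spans the second and third values: it matches with no gap
        // and stops matching once a gap is inserted.
        System.out.println(phraseMatches(layout(values, 0), "com", "com"));
        System.out.println(phraseMatches(layout(values, 1), "com", "com"));
    }
}
```

With a gap of 0 the phrase "com com" matches across the boundary between the
second and third addresses; with a gap of 1 it no longer does, which is the
'crossing the boundaries' behaviour Paul describes.
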
>>
>> Up until now I've only been using the WhitespaceAnalyzer, as I've been
>> keeping quite tight control over the fields going into the index
>> (not making best use of Lucene).
>>
>> What Analyzer would you recommend I use for this? I'll also be
>> indexing IPs and other things, but that's pretty much the same story.
>> It seems I have to use the same Analyzer for all the fields in the
>> index?
>>
>> I've been looking at StandardAnalyzer, but I do not want to remove
>> stop words. I mainly want to keep letters and numbers, and also
>> override getPositionIncrementGap. Is there anything that does these
>> things already, or close to it? Overriding getPositionIncrementGap
>> shouldn't be difficult though.
>>
>> Cheers,
>> Phil
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> mh...@informatics.jax.org
> (207) 288-6012
>



-- 
Mobile: +1  778-233-4935
Website: http://philw.co.uk
Skype: philwhelan76
Twitter: philwhln
Email : phil...@gmail.com
iChat: philw...@mac.com
