Thanks Matt. Thanks Paul. I'm up early (PST) and ready for a major rewrite of my indexer. I think these changes are going to make a huge difference.
Cheers, Phil On Fri, Jul 31, 2009 at 5:52 AM, Matthew Hall<mh...@informatics.jax.org> wrote: > And to address the stop word issue, you can override the stop word list that > it uses. > > Most analyzers that use stop words, (Standard included) has an option to > pass it an arbitrary list of StopWords which will override the defaults. > > You could also just roll your own (which is what you are going to end up > doing here anyhow) When you do, just don't include stop word removal in the > processing of your token stream. > > Matt > > Phil Whelan wrote: >> >> Hi Matthew / Paul, >> >> On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan<co...@aconex.com> wrote: >> >>> >>> Matthew Hall wrote: >>> >>>> >>>> Place a delimiter between the email addresses that doesn't get removed >>>> in >>>> your analyzer. (preferably something you know will never be searched >>>> on) >>>> >>> >>> Or add them separately (rather than: >>> doc.add(new Field("email", "f...@bar.com b...@foo.com c...@bar.foo" ...); >>> use >>> doc.add(new Field("email", "f...@bar.com"); >>> doc.add(new Field("email", "b...@foo.com"); >>> doc.add(new Field("email", "c...@bar.foo"); >>> ), using an Analyzer that overrides getPositionIncrementGap(). This >>> inserts >>> a 'gap' between each set of Tokens for the same Field, which stops phrase >>> queries from 'crossing the boundaries' between subsequent values. >>> >> >> I like the sound of that! I think I understand it. >> getPositionIncrementGap() returns 0 by default which keeps the "email" >> field tokens sequential. Overriding with 1, will add an effective >> blank token between the email addresses (overriding with 2 would leave >> 2). Similar to Matthew's delimiter token, but a bit neater. >> >> So the token (with positions in brackets) would look something like this. >> >> "foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)" >> >> Up until now I've only been using the WhiteSpaceAnalyzer, as I've been >> keeping quite a tight control over the fields going into the index >> (not making best use of Lucene). >> >> What Analyzer would you recommend I use for this. I'll also be >> indexing IPs, and other things, but that's pretty much the same story. >> It seems I have to use the same Analyzer for the all the fields in the >> index? >> >> I've been looking at StandardAnalyzer, but I do not want to remove >> stop words. I want to keep letters and numbers mainly, and also >> override getPositionIncrementGap? Is there anything that does these >> things already, or close to it? Overriding getPositionIncrementGap >> shouldn't be difficult though. >> >> Cheers, >> Phil >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > -- > Matthew Hall > Software Engineer > Mouse Genome Informatics > mh...@informatics.jax.org > (207) 288-6012 > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Mobile: +1 778-233-4935 Website: http://philw.co.uk Skype: philwhelan76 Twitter: philwhln Email : phil...@gmail.com iChat: philw...@mac.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org