Re: Is there a way to get absolutely exact phrase matching (no stop words, etc)

Steven Rowe Mon, 24 Oct 2005 13:33:20 -0700

Hi Bob,

StandardAnalyzer filters the token stream created by StandardTokenizer through StandardFilter, LowerCaseFilter, and then StopFilter. Unless you supply a stoplist to the StandardAnalyzer constructor, you get the default set of English stopwords, from StopAnalyzer:


  public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };

One approach to the problem you're seeing is to advance the token position in StopFilter with each stopword encountered, so that phrase queries like


   "group effect"

will fail to match against

   "...group of ~a- The effect..."

because the positions for tokens "group" and "effect" would not be adjacent.
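To make the position arithmetic concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Lucene API at all: the class PosIncrDemo, the Tok type, the three-word stoplist, and both methods are hypothetical stand-ins for illustration only, and the input is assumed to be already tokenized and lowercased (with "~a-" reduced to "a", per my reading below).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosIncrDemo {
    // Hypothetical stand-in for a token: term text plus absolute position.
    public static final class Tok {
        public final String term;
        public final int position;
        Tok(String term, int position) { this.term = term; this.position = position; }
    }

    // Tiny illustrative stoplist, not the full ENGLISH_STOP_WORDS array.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "of", "the"));

    // Drops stopwords. With keepGaps, each dropped stopword still advances
    // the position counter; without it, survivors are numbered consecutively.
    public static List<Tok> filter(List<String> terms, boolean keepGaps) {
        List<Tok> out = new ArrayList<>();
        int position = -1;
        for (String term : terms) {
            if (STOP.contains(term)) {
                if (keepGaps) position++; // leave a hole where the stopword was
                continue;
            }
            position++;
            out.add(new Tok(term, position));
        }
        return out;
    }

    // Exact two-term phrase match: second must sit at first's position + 1.
    public static boolean phraseMatches(List<Tok> toks, String first, String second) {
        for (Tok a : toks) {
            if (!a.term.equals(first)) continue;
            for (Tok b : toks) {
                if (b.term.equals(second) && b.position == a.position + 1) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("group", "of", "a", "the", "effect");
        // Positions renumbered: group=0, effect=1, so the phrase matches.
        System.out.println(phraseMatches(filter(terms, false), "group", "effect")); // true
        // Gaps kept: group=0, effect=4, so the phrase no longer matches.
        System.out.println(phraseMatches(filter(terms, true), "group", "effect"));  // false
    }
}
```
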

(My naive reading of StandardTokenizer.jj, the JavaCC grammar used to create StandardTokenizer.java, is that "~a-" will generate a single token "a", which will then be filtered out by StopFilter.)

A patch implementing this approach was actually applied to StopFilter.java in late 2003, but was reverted shortly afterward, because this approach conflicts with the QueryParser and PhraseQuery implementations.

See Doug Cutting's description of the problem with the position increment modification approach here:

<http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200312.mbox/[EMAIL PROTECTED]>

See a colored diff of StopFilter.java, just before and after the position increment modification patch was reverted, here:

<http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopFilter.java?rev=150152&r1=150150&r2=150152&diff_format=h>

This modification is simple and straightforward. You could make the same changes to a local copy of StopFilter (call it PosIncrStopFilter), then create and use a StandardAnalyzer clone that uses PosIncrStopFilter instead of StopFilter.


Good luck,
Steve Rowe

Bob Mason wrote:

We have a large body of documents that have xml
and ocr embedded within one of the xml fields.

Searches such as "group effect"

are returning hits for docs such as ones that include the following:

 ...group of ~a- The effect...

because, I take it, stop words like 'of' and 'the' and punctuation
are ignored. Is there anything I can do about this other
than write an alternative to the Standard Analyzer?

thanks,

Bob Mason
UCSF Tobacco Industry Digital Library

