I may be coming into this thread without knowing enough. I have implemented a phrase filter that indexes all token sequences that are 2 to N tokens long, where N is defined in the constructor.
It takes a stopword Trie as input because the policy I used, based on a published work I read, is that a phrase should neither begin nor end with a stopword. If you don't want that behavior, you can simply provide an empty trie. The Trie structure is found at http://www.graphbuilder.com/trie/

The SWPhraseFilter.java file is attached and is self-explanatory. I believe that if you change the condition phraseTerms.size() > 1 in its next() method to phraseTerms.size() > 0, you may get the result you need. Sorry, I do not have the time to test this out.

I have also included TriePhraseFilter.java. After I had indexed all the possible phrases, I went back and dumped the ones that occurred in more than 25 docs into a file (this turned out to be about 2% of all candidates), then placed these phrases in a Trie and reindexed the documents, using TriePhraseFilter to recognize the phrases I had dumped. The requirement that a phrase occur in at least 25 documents also comes from the paper I read about phrase indexing. By the way, I am 90 percent certain of the TriePhraseFilter code.

Best Regards,
Kamal Abou Mikhael

Quoting Nader Akhnoukh <[EMAIL PROTECTED]>:

> Yes, Chris is correct: the goal is to determine the most frequently
> occurring phrases in a document compared to the frequency of that phrase
> in the index. So there are only output phrases, no inputs.
>
> Also, performance is not really an issue; this would take place on an
> irregular basis and could run overnight if need be.
>
> So it sounds like the best approach would be to index all 1-, 2-, and
> 3-word phrases. Does anyone know of an Analyzer that does this? And if I
> can successfully index the phrases, would the term frequency vector
> contain all the combinations of phrases as terms along with their
> frequencies?
>
> Andrzej, can you discuss your approach in a little more detail? Are you
> suggesting manually traversing each document and doing a search on each
> phrase? That seems very intensive, as I have tens of thousands of
> documents.
>
> Thanks.
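For reference, here is a rough, untested sketch of the dump step described above, written against the same vintage Lucene API as the attached filters. The field name "contents", the file names "index" and "phrases.txt", and the PhraseDumper class itself are illustrative assumptions on my part, not part of my original code:

import java.io.*;
import org.apache.lucene.index.*;

public class PhraseDumper {

    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open("index");
        PrintWriter out = new PrintWriter(new FileWriter("phrases.txt"));
        // Position the enumeration at the first term of the phrase field.
        TermEnum terms = reader.terms(new Term("contents", ""));
        try {
            do {
                Term term = terms.term();
                if (term == null || !term.field().equals("contents")) {
                    break;
                }
                // Keep only multi-word candidates occurring in more than 25 docs.
                if (terms.docFreq() > 25 && term.text().indexOf(' ') != -1) {
                    out.println(term.text());
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        out.close();
        reader.close();
    }
}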
package analysis;

import java.io.IOException;
import java.util.Vector;

import com.graphbuilder.struc.*;
import org.apache.lucene.analysis.*;

/**
 * Emits every token sequence of 2 to N tokens that neither begins nor
 * ends with a stopword, by sliding a window of up to N tokens over the
 * input stream.
 */
public class SWPhraseFilter extends TokenFilter {

    private final static String SPACE = " ";

    private int length;
    private final Trie stopwords;
    private final StringBuffer candidate;
    private int windowPosition;
    private Vector phraseTerms;

    public SWPhraseFilter(TokenStream tokenStream, Trie stopwords, int length) {
        super(tokenStream);
        this.stopwords = stopwords;
        this.length = length;
        this.windowPosition = 0;
        this.phraseTerms = new Vector(length);
        this.candidate = new StringBuffer(32 * length);
    }

    public final Token next() throws IOException {
        Token t = null;
        Token result = null;
        boolean done = false;
        while (!done) {
            // Shift the window over if it has been fully inspected.
            if (windowPosition >= length) {
                phraseTerms.remove(0);
                windowPosition = 0;
            }
            // Keep the window full.
            if (phraseTerms.size() < length) {
                t = input.next();
                if (t != null) {
                    phraseTerms.add(t);
                } else if (length == 1) {
                    done = true;
                } else {
                    // Input exhausted: shrink the window and keep emitting
                    // the shorter phrases that remain.
                    windowPosition = 0;
                    length = length - 1;
                }
            }
            if (!done) {
                if (phraseTerms.size() > 0) {
                    // Shift the window over if the first token is a stopword.
                    String firstTerm = ((Token) phraseTerms.elementAt(0)).termText();
                    if (stopwords.contains(firstTerm)) {
                        phraseTerms.remove(0);
                        windowPosition = 0;
                    } else {
                        // Look for a phrase (window consisting of more than one term).
                        if (phraseTerms.size() > 1 && windowPosition > 0) {
                            String currentTerm =
                                ((Token) phraseTerms.elementAt(windowPosition)).termText();
                            // Make sure the phrase does not end in a stopword.
                            if (!stopwords.contains(currentTerm)) {
                                done = true;
                            }
                        }
                        windowPosition = windowPosition + 1;
                    }
                } else {
                    done = true;
                }
            }
        }
        if (done) {
            // The emitted phrase spans the first windowPosition tokens.
            candidate.delete(0, candidate.length());
            for (int i = 0; i < windowPosition; i = i + 1) {
                Token x = (Token) phraseTerms.elementAt(i);
                candidate.append(x.termText());
                candidate.append(SPACE);
            }
            if (candidate.length() > 0) {
                String text = candidate.substring(0, candidate.length() - 1);
                int start = ((Token) phraseTerms.elementAt(0)).startOffset();
                int end = ((Token) phraseTerms.elementAt(windowPosition - 1)).endOffset();
                result = new Token(text, start, end, "phrase");
            }
        }
        return result;
    }
}
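In case it helps, here is a minimal sketch of how SWPhraseFilter might be wired into an analyzer chain. The PhraseAnalyzer wrapper is hypothetical, and the choice of StandardTokenizer plus LowerCaseFilter is just one reasonable setup, not a requirement of the filter:

package analysis;

import java.io.Reader;
import com.graphbuilder.struc.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PhraseAnalyzer extends Analyzer {

    private final Trie stopwords;
    private final int maxLength;

    public PhraseAnalyzer(Trie stopwords, int maxLength) {
        this.stopwords = stopwords;
        this.maxLength = maxLength;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize and lowercase, then emit every 2..maxLength-token
        // sequence that neither begins nor ends with a stopword.
        TokenStream ts = new StandardTokenizer(reader);
        ts = new LowerCaseFilter(ts);
        return new SWPhraseFilter(ts, stopwords, maxLength);
    }
}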
package analysis;

import java.io.IOException;
import java.util.Vector;

import com.graphbuilder.struc.*;
import org.apache.lucene.analysis.*;

/**
 * Recognizes phrases from a known set (stored in a Trie) and emits each
 * recognized phrase as a single token; all other tokens pass through
 * unchanged.
 */
public class TriePhraseFilter extends TokenFilter {

    private final static String SPACE = " ";

    private final Trie phrases;
    private final StringBuffer candidateSB;
    private Vector candidateTokens;
    private Vector waitingTokens;

    public TriePhraseFilter(TokenStream tokenStream, Trie phrases) {
        super(tokenStream);
        this.phrases = phrases;
        this.candidateSB = new StringBuffer(256);
        this.waitingTokens = new Vector(7);
        this.candidateTokens = new Vector(7);
    }

    public final Token next() throws IOException {
        Token t = null;
        Token result = null;
        boolean done = false;

        // Grow the candidate one token at a time (from the waiting list
        // first, then from the input) for as long as it remains a prefix
        // of some phrase in the trie.
        while (!done && ((t = getNext()) != null)) {
            candidateSB.append(t.termText());
            candidateSB.append(SPACE);
            candidateTokens.add(t);
            done = !phrases.hasPrefix(candidateSB.toString());
        }

        // Drop the trailing space.
        int candidateLength = candidateSB.length();
        if (candidateLength > 0
                && candidateSB.lastIndexOf(SPACE) == candidateLength - 1) {
            candidateSB.deleteCharAt(candidateLength - 1);
        }

        // While the candidate is not itself a phrase, move its last token
        // back to the waiting list. Most of the time only the last token
        // prevents a match, but more than one trailing token may form a
        // prefix that does not exist as a phrase in itself.
        int insertionPoint = waitingTokens.size();
        while (candidateSB.length() > 0
                && !phrases.contains(candidateSB.toString())) {
            int end = candidateTokens.size() - 1;
            waitingTokens.insertElementAt(candidateTokens.remove(end), insertionPoint);
            int spaceIndex = candidateSB.lastIndexOf(SPACE);
            if (spaceIndex == -1) {
                candidateSB.delete(0, candidateSB.length());
            } else {
                candidateSB.delete(spaceIndex, candidateSB.length());
            }
        }

        // If the candidate survived, pass it on as a phrase token;
        // otherwise pass on the first token on the waiting list.
        if (candidateSB.length() > 0) {
            Token first = (Token) candidateTokens.firstElement();
            Token last = (Token) candidateTokens.lastElement();
            result = new Token(candidateSB.toString(),
                    first.startOffset(), last.endOffset(), "phrase");
            candidateSB.delete(0, candidateSB.length());
            candidateTokens.clear();
        } else if (!waitingTokens.isEmpty()) {
            result = (Token) waitingTokens.remove(0);
        }
        return result;
    }

    // Take the next token from the waiting list if one is queued,
    // otherwise read from the input stream.
    private Token getNext() throws IOException {
        if (!waitingTokens.isEmpty()) {
            return (Token) waitingTokens.remove(0);
        }
        return input.next();
    }
}
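Finally, a hypothetical loader for the dumped phrase file, to build the Trie that TriePhraseFilter expects. PhraseTrieLoader is made up for illustration, and I am assuming the graphbuilder Trie exposes an add(String) method; check the library's API before relying on this:

package analysis;

import java.io.*;
import com.graphbuilder.struc.*;

public class PhraseTrieLoader {

    // Reads the dumped phrase file (one space-separated phrase per line)
    // into a Trie for the reindexing pass.
    public static Trie load(String fileName) throws IOException {
        Trie phrases = new Trie();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() > 0) {
                phrases.add(line);
            }
        }
        in.close();
        return phrases;
    }
}

The reindexing pass would then build its token stream as something like
new TriePhraseFilter(new LowerCaseFilter(new StandardTokenizer(reader)), PhraseTrieLoader.load("phrases.txt")).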