I may be coming into this thread without knowing enough. I have implemented a phrase filter that indexes all token sequences that are 2 to N tokens long, where N is defined in the constructor.
It takes a stopword Trie as input because the policy I used, based on a published work I read, is that a phrase should neither begin nor end with a stopword. If you don't want that behavior, you can simply provide an empty trie. The Trie structure is found at http://www.graphbuilder.com/trie/

The SWPhraseFilter.java file is attached and is self-explanatory. I believe that if you change the condition phraseTerms.size() > 1 in its next() method to phraseTerms.size() > 0, you may get the result you need. Sorry, I do not have the time to test this out.

I have also included TriePhraseFilter.java. After I had indexed all the possible phrases, I went back and dumped the ones that occurred in more than 25 docs into a file (this turned out to be about 2% of all candidates), then placed these phrases in a Trie and reindexed the documents, using TriePhraseFilter to recognize the phrases I had dumped. The requirement that a phrase occur in at least 25 documents also comes from the paper I read about phrase indexing. By the way, I am 90 percent certain of the TriePhraseFilter code.

Best Regards,
Kamal Abou Mikhael

Quoting Nader Akhnoukh <[EMAIL PROTECTED]>:

> Yes, Chris is correct: the goal is to determine the most frequently
> occurring phrases in a document compared to the frequency of that phrase
> in the index. So there are only output phrases, no inputs.
>
> Also, performance is not really an issue; this would take place on an
> irregular basis and could run overnight if need be.
>
> So it sounds like the best approach would be to index all 1-, 2-, and
> 3-word phrases. Does anyone know of an Analyzer that does this? And if I
> can successfully index the phrases, would the term frequency vector
> contain all the combinations of phrases as terms along with their
> frequencies?
>
> Andrzej, can you discuss your approach in a little more detail? Are you
> suggesting manually traversing each document and doing a search on each
> phrase? That seems very intensive, as I have tens of thousands of
> documents.
>
> Thanks.
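For reference, here is a rough, untested sketch of the dump step described above, written against the same vintage Lucene API as the attached filters. The field name "contents", the file names "index" and "phrases.txt", and the PhraseDumper class itself are illustrative assumptions on my part, not part of my original code:

import java.io.*;
import org.apache.lucene.index.*;

public class PhraseDumper {

    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open("index");
        PrintWriter out = new PrintWriter(new FileWriter("phrases.txt"));
        // Position the enumeration at the first term of the phrase field.
        TermEnum terms = reader.terms(new Term("contents", ""));
        try {
            do {
                Term term = terms.term();
                if (term == null || !term.field().equals("contents")) {
                    break;
                }
                // Keep only multi-word candidates occurring in more than 25 docs.
                if (terms.docFreq() > 25 && term.text().indexOf(' ') != -1) {
                    out.println(term.text());
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        out.close();
        reader.close();
    }
}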
package analysis;

import java.io.IOException;
import java.util.Vector;

import com.graphbuilder.struc.*;
import org.apache.lucene.analysis.*;

/**
 * Emits every token sequence of 2 to N tokens that neither begins nor
 * ends with a stopword, by sliding a window of up to N tokens over the
 * input stream.
 */
public class SWPhraseFilter extends TokenFilter {

    private final static String SPACE = " ";

    private int length;
    private final Trie stopwords;
    private final StringBuffer candidate;
    private int windowPosition;
    private Vector phraseTerms;

    public SWPhraseFilter(TokenStream tokenStream, Trie stopwords, int length) {
        super(tokenStream);
        this.stopwords = stopwords;
        this.length = length;
        this.windowPosition = 0;
        this.phraseTerms = new Vector(length);
        this.candidate = new StringBuffer(32 * length);
    }

    public final Token next() throws IOException {
        Token t = null;
        Token result = null;
        boolean done = false;
        while (!done) {
            // Shift the window over if it has been fully inspected.
            if (windowPosition >= length) {
                phraseTerms.remove(0);
                windowPosition = 0;
            }
            // Keep the window full.
            if (phraseTerms.size() < length) {
                t = input.next();
                if (t != null) {
                    phraseTerms.add(t);
                } else if (length == 1) {
                    done = true;
                } else {
                    // Input exhausted: shrink the window and keep emitting
                    // the shorter phrases that remain.
                    windowPosition = 0;
                    length = length - 1;
                }
            }
            if (!done) {
                if (phraseTerms.size() > 0) {
                    // Shift the window over if the first token is a stopword.
                    String firstTerm = ((Token) phraseTerms.elementAt(0)).termText();
                    if (stopwords.contains(firstTerm)) {
                        phraseTerms.remove(0);
                        windowPosition = 0;
                    } else {
                        // Look for a phrase (window consisting of more than one term).
                        if (phraseTerms.size() > 1 && windowPosition > 0) {
                            String currentTerm =
                                ((Token) phraseTerms.elementAt(windowPosition)).termText();
                            // Make sure the phrase does not end in a stopword.
                            if (!stopwords.contains(currentTerm)) {
                                done = true;
                            }
                        }
                        windowPosition = windowPosition + 1;
                    }
                } else {
                    done = true;
                }
            }
        }
        if (done) {
            // The emitted phrase spans the first windowPosition tokens.
            candidate.delete(0, candidate.length());
            for (int i = 0; i < windowPosition; i = i + 1) {
                Token x = (Token) phraseTerms.elementAt(i);
                candidate.append(x.termText());
                candidate.append(SPACE);
            }
            if (candidate.length() > 0) {
                String text = candidate.substring(0, candidate.length() - 1);
                int start = ((Token) phraseTerms.elementAt(0)).startOffset();
                int end = ((Token) phraseTerms.elementAt(windowPosition - 1)).endOffset();
                result = new Token(text, start, end, "phrase");
            }
        }
        return result;
    }
}
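In case it helps, here is a minimal sketch of how SWPhraseFilter might be wired into an analyzer chain. The PhraseAnalyzer wrapper is hypothetical, and the choice of StandardTokenizer plus LowerCaseFilter is just one reasonable setup, not a requirement of the filter:

package analysis;

import java.io.Reader;
import com.graphbuilder.struc.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PhraseAnalyzer extends Analyzer {

    private final Trie stopwords;
    private final int maxLength;

    public PhraseAnalyzer(Trie stopwords, int maxLength) {
        this.stopwords = stopwords;
        this.maxLength = maxLength;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize and lowercase, then emit every 2..maxLength-token
        // sequence that neither begins nor ends with a stopword.
        TokenStream ts = new StandardTokenizer(reader);
        ts = new LowerCaseFilter(ts);
        return new SWPhraseFilter(ts, stopwords, maxLength);
    }
}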
package analysis;

import java.io.IOException;
import java.util.Vector;

import com.graphbuilder.struc.*;
import org.apache.lucene.analysis.*;

/**
 * Recognizes phrases from a known set (stored in a Trie) and emits each
 * recognized phrase as a single token; all other tokens pass through
 * unchanged.
 */
public class TriePhraseFilter extends TokenFilter {

    private final static String SPACE = " ";

    private final Trie phrases;
    private final StringBuffer candidateSB;
    private Vector candidateTokens;
    private Vector waitingTokens;

    public TriePhraseFilter(TokenStream tokenStream, Trie phrases) {
        super(tokenStream);
        this.phrases = phrases;
        this.candidateSB = new StringBuffer(256);
        this.waitingTokens = new Vector(7);
        this.candidateTokens = new Vector(7);
    }

    public final Token next() throws IOException {
        Token t = null;
        Token result = null;
        boolean done = false;

        // Grow the candidate one token at a time (from the waiting list
        // first, then from the input) for as long as it remains a prefix
        // of some phrase in the trie.
        while (!done && ((t = getNext()) != null)) {
            candidateSB.append(t.termText());
            candidateSB.append(SPACE);
            candidateTokens.add(t);
            done = !phrases.hasPrefix(candidateSB.toString());
        }

        // Drop the trailing space.
        int candidateLength = candidateSB.length();
        if (candidateLength > 0
                && candidateSB.lastIndexOf(SPACE) == candidateLength - 1) {
            candidateSB.deleteCharAt(candidateLength - 1);
        }

        // While the candidate is not itself a phrase, move its last token
        // back to the waiting list. Most of the time only the last token
        // prevents a match, but more than one trailing token may form a
        // prefix that does not exist as a phrase in itself.
        int insertionPoint = waitingTokens.size();
        while (candidateSB.length() > 0
                && !phrases.contains(candidateSB.toString())) {
            int end = candidateTokens.size() - 1;
            waitingTokens.insertElementAt(candidateTokens.remove(end), insertionPoint);
            int spaceIndex = candidateSB.lastIndexOf(SPACE);
            if (spaceIndex == -1) {
                candidateSB.delete(0, candidateSB.length());
            } else {
                candidateSB.delete(spaceIndex, candidateSB.length());
            }
        }

        // If the candidate survived, pass it on as a phrase token;
        // otherwise pass on the first token on the waiting list.
        if (candidateSB.length() > 0) {
            Token first = (Token) candidateTokens.firstElement();
            Token last = (Token) candidateTokens.lastElement();
            result = new Token(candidateSB.toString(),
                    first.startOffset(), last.endOffset(), "phrase");
            candidateSB.delete(0, candidateSB.length());
            candidateTokens.clear();
        } else if (!waitingTokens.isEmpty()) {
            result = (Token) waitingTokens.remove(0);
        }
        return result;
    }

    // Take the next token from the waiting list if one is queued,
    // otherwise read from the input stream.
    private Token getNext() throws IOException {
        if (!waitingTokens.isEmpty()) {
            return (Token) waitingTokens.remove(0);
        }
        return input.next();
    }
}
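Finally, a hypothetical loader for the dumped phrase file, to build the Trie that TriePhraseFilter expects. PhraseTrieLoader is made up for illustration, and I am assuming the graphbuilder Trie exposes an add(String) method; check the library's API before relying on this:

package analysis;

import java.io.*;
import com.graphbuilder.struc.*;

public class PhraseTrieLoader {

    // Reads the dumped phrase file (one space-separated phrase per line)
    // into a Trie for the reindexing pass.
    public static Trie load(String fileName) throws IOException {
        Trie phrases = new Trie();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() > 0) {
                phrases.add(line);
            }
        }
        in.close();
        return phrases;
    }
}

The reindexing pass would then build its token stream as something like
new TriePhraseFilter(new LowerCaseFilter(new StandardTokenizer(reader)), PhraseTrieLoader.load("phrases.txt")).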