I may be coming into this thread without knowing enough. I have implemented a
phrase filter that indexes all token sequences from 2 to N tokens long, where
N is set in the constructor.
It takes a stopword Trie as input because the policy I followed, based on
published work I read, is that a phrase should neither begin nor end with a
stopword. This policy is optional in practice, since one can simply provide an
empty trie.
The Trie structure can be found at http://www.graphbuilder.com/trie/
The SWPhraseFilter.java file is attached and is self-explanatory.
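In case it is useful, this is roughly how I wire the filter into an analyzer.
It is a sketch from memory rather than part of the attachment: the
StandardTokenizer/LowerCaseFilter chain is just one sensible choice, and the
Trie's add() method is an assumption you should check against the graphbuilder
API.

package analysis;

import java.io.Reader;
import com.graphbuilder.struc.Trie;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class SWPhraseAnalyzer extends Analyzer {
    private final Trie stopwords;
    private final int maxLength;

    public SWPhraseAnalyzer(Trie stopwords, int maxLength) {
        this.stopwords = stopwords;
        this.maxLength = maxLength;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Emits every 2..maxLength token window that neither begins nor
        // ends with a stopword; pass an empty Trie to disable the policy.
        return new SWPhraseFilter(stream, stopwords, maxLength);
    }
}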
I believe that if you change the condition on line 67 from phraseTerms.size()
> 1 to phraseTerms.size() > 0, you may get the result you need. Sorry, I do
not have the time to test this out.
I have also included TriePhraseFilter.java. After indexing all the possible
phrases, I went back and dumped the ones that occurred in more than 25
documents into a file (these turned out to be about 2% of all candidates),
then placed them in a Trie and reindexed the documents.
So I used this TriePhraseFilter to recognize the phrases I had dumped into the
file. The requirement that a phrase occur in that many documents also comes
from the paper I read about phrase indexing.
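The dump step itself was just a scan over the term dictionary. Rewritten from
memory against the old IndexReader/TermEnum API, it looked something like
this; treat the index path, the field name "contents", and the output file
name as placeholders:

package analysis;

import java.io.*;
import org.apache.lucene.index.*;

public class PhraseDumper {
    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open("phrase-index");
        PrintWriter out = new PrintWriter(new FileWriter("phrases.txt"));
        TermEnum terms = reader.terms(new Term("contents", ""));
        try {
            do {
                Term term = terms.term();
                if (term == null || !term.field().equals("contents")) {
                    break;
                }
                // Keep multi-word candidates (the filter joins terms with a
                // space) that occur in more than 25 documents.
                if (terms.docFreq() > 25 && term.text().indexOf(' ') != -1) {
                    out.println(term.text());
                }
            } while (terms.next());
        } finally {
            terms.close();
            out.close();
            reader.close();
        }
    }
}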
By the way, I am only about 90 percent certain of the TriePhraseFilter code.
Best Regards,
Kamal Abou Mikhael
Quoting Nader Akhnoukh <[EMAIL PROTECTED]>:
> Yes, Chris is correct, the goal is to determine the most frequently occurring
> phrases in a document compared to the frequency of that phrase in the
> index. So there are only output phrases, no inputs.
>
> Also performance is not really an issue, this would take place on an
> irregular basis and could run overnight if need be.
>
> So it sounds like the best approach would be to index all 1, 2, and 3 word
> phrases. Does anyone know of an Analyzer that does this? And if I can
> successfully index the phrases would the term frequency vector contain all
> the combination of phrases as terms along with their frequencies?
>
> Andrzej, can you discuss your approach in a little more detail. Are you
> suggesting manually traversing each document and doing a search on each
> phrase? That seems very intensive as I have tens of thousands of documents.
>
> Thanks.
>
package analysis;

import stem.*;
import java.util.Vector;
import java.io.IOException;
import com.graphbuilder.struc.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;

/**
 * Emits every token window of 2 to N terms that neither begins nor ends
 * with a stopword. N is the length passed to the constructor.
 */
public class SWPhraseFilter extends TokenFilter {

    private final static String SPACE = " ";

    public SWPhraseFilter(TokenStream tokenStream, Trie stopwords, int length) {
        super(tokenStream);
        this.stopwords = stopwords;
        this.length = length;
        this.windowPosition = 0;
        this.phraseTerms = new Vector(length);
        this.candidate = new StringBuffer(32 * length);
    }

    public final Token next() throws IOException {
        Token t = null;
        Token result = null;
        boolean done = false;
        while (!done) {
            // Shift the window over if it has been fully inspected.
            if (windowPosition >= length) {
                phraseTerms.remove(0);
                windowPosition = 0;
            }
            // Keep the window full.
            if (phraseTerms.size() < length) {
                t = input.next();
                if (t != null) {
                    phraseTerms.add(t);
                } else if (length == 1) {
                    // Input exhausted and no shorter windows left to flush.
                    done = true;
                } else {
                    // Input exhausted: flush the remaining shorter windows.
                    windowPosition = 0;
                    length = length - 1;
                }
            }
            if (!done) {
                if (phraseTerms.size() > 0) {
                    // Shift the window over if the first token is a stopword.
                    String firstTerm = ((Token) phraseTerms.elementAt(0)).termText();
                    if (stopwords.contains(firstTerm)) {
                        phraseTerms.remove(0);
                        windowPosition = 0;
                    } else {
                        // Look for a phrase (window consisting of more than one term).
                        if (phraseTerms.size() > 1 && windowPosition > 0) {
                            String currentTerm =
                                ((Token) phraseTerms.elementAt(windowPosition)).termText();
                            // Make sure the phrase does not end in a stopword.
                            if (!stopwords.contains(currentTerm)) {
                                done = true;
                            }
                        }
                        windowPosition = windowPosition + 1;
                    }
                } else {
                    done = true;
                }
            }
        }
        // Assemble the phrase text from the first windowPosition terms.
        candidate.delete(0, candidate.length());
        for (int i = 0; i < windowPosition; i = i + 1) {
            Token x = (Token) phraseTerms.elementAt(i);
            candidate.append(x.termText());
            candidate.append(SPACE);
        }
        if (candidate.length() > 0) {
            String text = candidate.substring(0, candidate.length() - 1);
            int start = ((Token) phraseTerms.elementAt(0)).startOffset();
            // Take the end offset from the last term actually in the phrase,
            // not from the last term buffered in the window.
            int end = ((Token) phraseTerms.elementAt(windowPosition - 1)).endOffset();
            result = new Token(text, start, end, "phrase");
        }
        return result;
    }

    private int length;
    private final Trie stopwords;
    private final StringBuffer candidate;
    private int windowPosition;
    private Vector phraseTerms;
}
package analysis;

import stem.*;
import java.util.Vector;
import java.io.IOException;
import com.graphbuilder.struc.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;

/**
 * Replaces runs of tokens that match a phrase in the supplied Trie with a
 * single "phrase" token; all other tokens pass through unchanged.
 */
public class TriePhraseFilter extends TokenFilter {

    private final static String SPACE = " ";

    public TriePhraseFilter(TokenStream tokenStream, Trie phrases) {
        super(tokenStream);
        this.phrases = phrases;
        this.candidateSB = new StringBuffer(256);
        this.waitingTokens = new Vector(7);
        this.candidateTokens = new Vector(7);
    }

    public final Token next() throws IOException {
        Token t = null;
        Token result = null;
        boolean done = false;
        // Grow the candidate phrase one token at a time (from the waiting
        // list first, then the input) for as long as it remains a prefix of
        // some phrase in the trie.
        while (!done && ((t = getNext()) != null)) {
            candidateSB.append(t.termText());
            candidateSB.append(SPACE);
            candidateTokens.add(t);
            done = !phrases.hasPrefix(candidateSB.toString());
        }
        // Drop the trailing separator.
        int candidateLength = candidateSB.length();
        if (candidateLength > 0 &&
                (candidateSB.lastIndexOf(SPACE) == candidateLength - 1)) {
            candidateSB.deleteCharAt(candidateLength - 1);
        }
        // Shrink the candidate from the right until what remains is a
        // complete phrase (checked with contains()). Usually only the last
        // token is in the way, but more than one token may form a prefix
        // that is not a phrase in itself. Rejected tokens go back onto the
        // front of the waiting list, since they were read before (and so
        // precede) anything still waiting there.
        while (candidateSB.length() > 0 &&
                !phrases.contains(candidateSB.toString())) {
            int end = candidateTokens.size() - 1;
            waitingTokens.insertElementAt(candidateTokens.remove(end), 0);
            int spaceIndex = candidateSB.lastIndexOf(SPACE);
            if (spaceIndex == -1) {
                candidateSB.delete(0, candidateSB.length());
            } else {
                candidateSB.delete(spaceIndex, candidateSB.length());
            }
        }
        // If the candidate phrase has survived, pass it on as the result;
        // otherwise pass on the first token from the waiting list.
        if (candidateSB.length() > 0) {
            Token first = (Token) candidateTokens.firstElement();
            Token last = (Token) candidateTokens.lastElement();
            result = new Token(candidateSB.toString(),
                first.startOffset(), last.endOffset(), "phrase");
            candidateSB.delete(0, candidateSB.length());
            candidateTokens.clear();
        } else if (!waitingTokens.isEmpty()) {
            result = (Token) waitingTokens.remove(0);
        }
        return result;
    }

    // Prefer pushed-back tokens from the waiting list before reading new
    // tokens from the input stream.
    private Token getNext() throws IOException {
        Token next = null;
        if (!waitingTokens.isEmpty()) {
            next = (Token) waitingTokens.remove(0);
        } else {
            next = input.next();
        }
        return next;
    }

    private final Trie phrases;
    private final StringBuffer candidateSB;
    private Vector candidateTokens;
    private Vector waitingTokens;
}
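For completeness, this is roughly how the second pass fits together: load the
dumped phrases into a Trie and analyze with TriePhraseFilter. Again a sketch
from memory; the file format (one phrase per line) matches what the dumper
above writes, and the Trie constructor and add() method are assumptions.

package analysis;

import java.io.*;
import com.graphbuilder.struc.Trie;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class TriePhraseAnalyzer extends Analyzer {
    private final Trie phrases;

    public TriePhraseAnalyzer(Trie phrases) {
        this.phrases = phrases;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Collapses known phrases into single "phrase" tokens.
        return new TriePhraseFilter(stream, phrases);
    }

    // Loads one phrase per line from the dump file produced earlier.
    public static Trie loadPhrases(String fileName) throws IOException {
        Trie trie = new Trie(); // assumed constructor
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.length() > 0) {
                    trie.add(line); // assumed method name; check the Trie docs
                }
            }
        } finally {
            in.close();
        }
        return trie;
    }
}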