I need to index bigrams and trigrams in a document. Here is an example:
Text:
This is a text document written by someone. Read this and post your comments
Terms that must be indexed:
text
document
written
someone
read
post
your
comments
text document
document written
post your
your comments
text document written
post your comments
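
To make the expected output concrete, here is a minimal plain-Java sketch (not Lucene code, and not the tokenizer change described below) that produces exactly the list above from the example text. The class name NGramSketch, the stop-word set, and the rule that n-grams never span a stop word or a sentence end are assumptions inferred from the example, not anything Lucene prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class NGramSketch {
    // Assumed stop words: chosen so that the example text yields the list above.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("this", "is", "a", "by", "and"));

    // Returns all 1..maxN-grams built from runs of consecutive non-stop words.
    static List<String> ngrams(String text, int maxN) {
        List<String> out = new ArrayList<>();
        // Split into sentences first so that no n-gram crosses a sentence end.
        for (String sentence : text.split("[.!?]+")) {
            List<String> run = new ArrayList<>();
            for (String word : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                if (STOP_WORDS.contains(word)) {
                    // A stop word ends the current run of indexable words.
                    emit(run, maxN, out);
                    run.clear();
                } else {
                    run.add(word);
                }
            }
            emit(run, maxN, out);
        }
        return out;
    }

    // Adds every n-gram (n = 1..maxN) of one run to the output list.
    static void emit(List<String> run, int maxN, List<String> out) {
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= run.size(); i++) {
                out.add(String.join(" ", run.subList(i, i + n)));
            }
        }
    }

    public static void main(String[] args) {
        String text = "This is a text document written by someone. "
                    + "Read this and post your comments";
        for (String term : ngrams(text, 3)) {
            System.out.println(term);
        }
    }
}
```

Note that "written someone" and "someone read" are not produced, because "by" and the sentence end break the runs; this matches the list above.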
So, I made changes to StandardAnalyzer.java and StandardTokenizer.jj to try
and achieve this.
I increased the LOOKAHEAD option value to 4:
options {
    LOOKAHEAD = 4;
    FORCE_LA_CHECK = true;
    .
    .
}
I changed next() in StandardTokenizer.jj as follows:
org.apache.lucene.analysis.Token next() throws IOException :
:
:
{
    if (token.kind == EOF) {
        return null;
    }
    else if (token.kind == ALPHANUM) {
        Token nextToken = token.next;
        if (token.next.kind == ALPHANUM) {
            return
                new org.apache.lucene.analysis.Token(token.image + " " + nextToken.image,
                                                     token.beginColumn, nextToken.endColumn,
                                                     tokenImage[token.kind]);
        }
    }
    else {
        return
            new org.apache.lucene.analysis.Token(token.image,
                                                 token.beginColumn, token.endColumn,
                                                 tokenImage[token.kind]);
    }
}
That is, I am using token.next to get information about the following token,
but token.next is null. Why is that, and is there a better way of doing this?