DictionaryCompoundWordTokenFilter flag onlyLongestMatch has no effect
---------------------------------------------------------------------
Key: LUCENE-3022
URL: https://issues.apache.org/jira/browse/LUCENE-3022
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 3.1, 2.9.4
Reporter: Johann Höchtl
Priority: Minor
When using the DictionaryCompoundWordTokenFilter with a German dictionary, I
got strange behaviour:
The German word "streifenbluse" (blouse with stripes) was decompounded to
"streifen" (stripe) and "reifen" (tire), which makes no sense at all.
I thought the flag onlyLongestMatch would fix this, because "streifen" is
longer than "reifen", but it had no effect.
So I reviewed the source code and found the problem: longestMatchToken is
declared inside the outer loop, so it is reset for every start offset and the
longest match is only chosen among subwords that begin at the same position:
[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.length() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
  for (int i=0;i<token.length()-this.minSubwordSize;++i) {
    Token longestMatchToken=null;
    for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
      if(i+j>token.length()) {
        break;
      }
      if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken!=null) {
            if (longestMatchToken.length()<j) {
              longestMatchToken=createToken(i,j,token);
            }
          } else {
            longestMatchToken=createToken(i,j,token);
          }
        } else {
          tokens.add(createToken(i,j,token));
        }
      }
    }
    if (this.onlyLongestMatch && longestMatchToken!=null) {
      tokens.add(longestMatchToken);
    }
  }
}
[/code]
This should be changed to:
[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.termLength() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
  // Declared outside the outer loop so the longest match is tracked
  // across all start offsets, not per offset
  Token longestMatchToken=null;
  for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
    for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
      if(i+j>token.termLength()) {
        break;
      }
      if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken!=null) {
            if (longestMatchToken.termLength()<j) {
              longestMatchToken=createToken(i,j,token);
            }
          } else {
            longestMatchToken=createToken(i,j,token);
          }
        } else {
          tokens.add(createToken(i,j,token));
        }
      }
    }
  }
  // Added once, after every start offset has been examined
  if (this.onlyLongestMatch && longestMatchToken!=null) {
    tokens.add(longestMatchToken);
  }
}
[/code]
With this change, only the longest token is indexed, and the onlyLongestMatch
flag has the intended effect.
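
To make the difference concrete, here is a minimal standalone sketch that does not use Lucene at all. The class LongestMatchSketch, its two methods, and the two-word dictionary are hypothetical and exist only for illustration; the loop bounds are simplified compared to the filter's real ones. It contrasts "longest match per start offset" (the behaviour described above) with "single longest match over the whole word" (the proposed behaviour).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain strings and a HashSet stand in for Token and the
// filter's dictionary. All names here are hypothetical.
public class LongestMatchSketch {

    // Keeps the longest dictionary match for EACH start offset
    // (mirrors resetting longestMatchToken inside the outer loop).
    static List<String> longestPerStart(String word, Set<String> dict,
                                        int minSub, int maxSub) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + minSub <= word.length(); i++) {
            String longest = null;
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String cand = word.substring(i, i + len);
                if (dict.contains(cand)) {
                    longest = cand; // len only grows, so the last hit is the longest
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    // Keeps ONE longest dictionary match over all start offsets
    // (mirrors declaring longestMatchToken before the outer loop).
    static List<String> longestOverall(String word, Set<String> dict,
                                       int minSub, int maxSub) {
        String longest = null;
        for (int i = 0; i + minSub <= word.length(); i++) {
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String cand = word.substring(i, i + len);
                if (dict.contains(cand)
                        && (longest == null || cand.length() > longest.length())) {
                    longest = cand;
                }
            }
        }
        List<String> out = new ArrayList<>();
        if (longest != null) {
            out.add(longest);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("streifen", "reifen"));
        System.out.println(longestPerStart("streifenbluse", dict, 2, 15)); // [streifen, reifen]
        System.out.println(longestOverall("streifenbluse", dict, 2, 15));  // [streifen]
    }
}
```

Note that, as written, the proposed variant emits at most one subword per input token, so a second component such as "bluse" would not be emitted even if it were in the dictionary.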