DictionaryCompoundWordTokenFilter flag onlyLongestMatch has no effect
---------------------------------------------------------------------
Key: LUCENE-3022
URL: https://issues.apache.org/jira/browse/LUCENE-3022
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 3.1, 2.9.4
Reporter: Johann Höchtl
Priority: Minor

When using the DictionaryCompoundWordTokenFilter with a German dictionary, I got strange behaviour: the German word "streifenbluse" (blouse with stripes) was decompounded to "streifen" (stripe) and "reifen" (tire), which makes no sense at all. I thought the flag onlyLongestMatch would fix this, because "streifen" is longer than "reifen", but it had no effect.

So I reviewed the source code and found the problem: longestMatchToken is declared inside the outer loop, so it is reset for every start offset i and one match per offset is added.

[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.length() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
  for (int i=0;i<token.length()-this.minSubwordSize;++i) {
    // declared per start offset i, so it is reset on every iteration
    Token longestMatchToken=null;
    for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
      if(i+j>token.length()) {
        break;
      }
      if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken!=null) {
            if (longestMatchToken.length()<j) {
              longestMatchToken=createToken(i,j,token);
            }
          } else {
            longestMatchToken=createToken(i,j,token);
          }
        } else {
          tokens.add(createToken(i,j,token));
        }
      }
    }
    if (this.onlyLongestMatch && longestMatchToken!=null) {
      tokens.add(longestMatchToken);
    }
  }
}
[/code]

should be changed to

[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.termLength() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
  // declared once, so it tracks the longest match over all start offsets
  Token longestMatchToken=null;
  for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
    for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
      if(i+j>token.termLength()) {
        break;
      }
      if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken!=null) {
            if (longestMatchToken.termLength()<j) {
              longestMatchToken=createToken(i,j,token);
            }
          } else {
            longestMatchToken=createToken(i,j,token);
          }
        } else {
          tokens.add(createToken(i,j,token));
        }
      }
    }
  }
  if (this.onlyLongestMatch && longestMatchToken!=null) {
    tokens.add(longestMatchToken);
  }
}
[/code]

so that only the longest token is really indexed and the onlyLongestMatch flag makes sense.
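To make the difference between the two variants easier to see, here is a minimal stand-alone sketch that does not depend on Lucene. The class CompoundSketch, its method names, and the three-word dictionary are hypothetical and exist only for illustration. The sketch also simplifies the loops: it scans subword lengths from minSub up to maxSub, rather than reproducing the exact bounds of the filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not the Lucene filter.
class CompoundSketch {

    // Current behaviour: the longest match is tracked separately
    // for each start offset i, so several tokens can be emitted.
    static List<String> longestPerOffset(String word, Set<String> dict,
                                         int minSub, int maxSub) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < word.length() - minSub; i++) {
            String longest = null; // reset for every start offset
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)) {
                    longest = candidate; // len only grows, so the last hit is the longest
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    // Proposed behaviour: a single longest match is tracked
    // across all start offsets, so at most one token is emitted.
    static List<String> longestOverall(String word, Set<String> dict,
                                       int minSub, int maxSub) {
        String longest = null; // kept across all start offsets
        for (int i = 0; i < word.length() - minSub; i++) {
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)
                        && (longest == null || candidate.length() > longest.length())) {
                    longest = candidate;
                }
            }
        }
        List<String> out = new ArrayList<>();
        if (longest != null) {
            out.add(longest);
        }
        return out;
    }
}
```

With the assumed dictionary {"streifen", "reifen", "bluse"}, a minimum subword size of 2 and a maximum of 15, longestPerOffset("streifenbluse", ...) returns streifen, reifen and bluse, while longestOverall returns only streifen. In this sketch the second variant therefore also drops "bluse", which may be worth keeping in mind when evaluating the change.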