[ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3022:
--------------------------------

    Attachment: LUCENE-3022.patch

Hi Johann, in my opinion your patch is completely correct; thanks for fixing
this.

I noticed, though, that a Solr test failed because its factory defaults to
this value being "on" (and the previous behavior was broken!).

Because of this, I propose we default this behavior to "off" in the Solr
factory and add an upgrading note. Decompounding in Solr previously defaulted
to the buggy behavior, but I think we should index all compound components by
default, since that seems to be what the intended behavior was (and it mostly
worked only because of the bug!).

I'll leave the issue open for a few days to see if anyone objects to this plan.
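
For reference, here is a minimal sketch of how the filter is typically
driven. This is a sketch only: it assumes the 3.1-era constructor taking a
Version, a String[] dictionary, and the onlyLongestMatch flag, and the toy
dictionary is mine, not from the issue.

[code]
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CompoundDemo {
  public static void main(String[] args) throws Exception {
    // Toy dictionary: "streifen" and "reifen" both occur inside "streifenbluse".
    String[] dict = { "streifen", "reifen", "bluse" };
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_31,
        new StringReader("streifenbluse"));
    ts = new DictionaryCompoundWordTokenFilter(Version.LUCENE_31, ts, dict,
        DictionaryCompoundWordTokenFilter.DEFAULT_MIN_WORD_SIZE,
        DictionaryCompoundWordTokenFilter.DEFAULT_MIN_SUBWORD_SIZE,
        DictionaryCompoundWordTokenFilter.DEFAULT_MAX_SUBWORD_SIZE,
        true); // onlyLongestMatch
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    // Before the patch this prints the compound plus both "streifen" and
    // "reifen"; with the patch applied, only the longest match should survive.
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
[/code]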


> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no effect
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3022
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Johann Höchtl
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3022.patch, LUCENE-3022.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a German dictionary, I
> ran into strange behaviour:
> the German word "streifenbluse" (blouse with stripes) was decompounded to
> "streifen" (stripe) and "reifen" (tire), which makes no sense at all.
> I thought the onlyLongestMatch flag would fix this, because "streifen" is
> longer than "reifen", but it had no effect.
> So I reviewed the source code and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.length() < this.minWordSize) {
>       return;
>     }
>     
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>     
>     for (int i=0;i<token.length()-this.minSubwordSize;++i) {
>         Token longestMatchToken=null;
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.length()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.length()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             } 
>         }
>         if (this.onlyLongestMatch && longestMatchToken!=null) {
>           tokens.add(longestMatchToken);
>         }
>     }
>   }
> [/code]
> The problem is that longestMatchToken is reset on every pass of the outer
> loop over start offsets, so the filter emits one "longest match" per
> position ("streifen" at offset 0 and "reifen" at offset 2) instead of a
> single longest match for the whole word. It should be changed to:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.termLength() < this.minWordSize) {
>       return;
>     }
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
>     Token longestMatchToken=null;
>     for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.termLength()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.termLength()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             }
>         }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>         tokens.add(longestMatchToken);
>     }
>   }
> [/code]
> This way only the longest match is actually indexed, and the
> onlyLongestMatch flag makes sense.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
