[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

Mark Lassau (JIRA) Tue, 02 Sep 2008 18:00:39 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627873#action_12627873
 ]


Mark Lassau commented on LUCENE-1373:
-------------------------------------

I would be willing to contribute a patch to make these Analyzers work in the 
next point release.

I see two ways to do this:
1) Introduce a static method on StandardTokenizerImpl that sets the "default" 
value of the replaceInvalidAcronym flag.
Client code could then call setDefaultForReplaceInvalidAcronym(true) once, and 
from then on any tokenizer created through the old constructor would get 
replaceInvalidAcronym=true.
2) Add the replaceInvalidAcronym flag to all of the above Analyzers.
Some of these have multiple constructors already, so I would probably just add 
a setter/getter to them.
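
To make the two options concrete, here is a rough, self-contained sketch. 
SketchTokenizer and SketchAnalyzer are hypothetical stand-ins I made up for 
illustration; they are not the real StandardTokenizerImpl or any contrib 
Analyzer, and an actual patch may well look different.

```java
public class AcronymFlagSketch {

    // Hypothetical stand-in for the tokenizer; NOT the real Lucene class.
    static class SketchTokenizer {
        // Option 1: a process-wide default, initially the old (buggy) behaviour.
        private static volatile boolean defaultReplaceInvalidAcronym = false;

        private final boolean replaceInvalidAcronym;

        static void setDefaultForReplaceInvalidAcronym(boolean value) {
            defaultReplaceInvalidAcronym = value;
        }

        // "Old" constructor: picks up the static default at construction time.
        SketchTokenizer() {
            this(defaultReplaceInvalidAcronym);
        }

        // Explicit constructor: the caller decides.
        SketchTokenizer(boolean replaceInvalidAcronym) {
            this.replaceInvalidAcronym = replaceInvalidAcronym;
        }

        boolean isReplaceInvalidAcronym() {
            return replaceInvalidAcronym;
        }
    }

    // Hypothetical stand-in for a contrib analyzer (option 2).
    static class SketchAnalyzer {
        private boolean replaceInvalidAcronym = false;

        void setReplaceInvalidAcronym(boolean value) {
            this.replaceInvalidAcronym = value;
        }

        boolean isReplaceInvalidAcronym() {
            return replaceInvalidAcronym;
        }

        // The analyzer passes its own flag on to every tokenizer it creates.
        SketchTokenizer newTokenizer() {
            return new SketchTokenizer(replaceInvalidAcronym);
        }
    }

    public static void main(String[] args) {
        // Option 1: one call changes what the old constructor produces.
        SketchTokenizer.setDefaultForReplaceInvalidAcronym(true);
        System.out.println(new SketchTokenizer().isReplaceInvalidAcronym());

        // Option 2: the flag is configured per analyzer instance.
        SketchAnalyzer analyzer = new SketchAnalyzer();
        analyzer.setReplaceInvalidAcronym(true);
        System.out.println(analyzer.newTokenizer().isReplaceInvalidAcronym());
    }
}
```

In this sketch, option 1 changes behaviour for every caller in the JVM that 
uses the old constructor, while option 2 keeps the choice local to each 
analyzer instance.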

The question is, which of the above would be preferred?
Personally, I think the first is the least work, and also the easiest to back 
out when you move on to v3.x and the "deprecated" behaviour is removed.
However, doing 2) means the least disruption to core code.

Also, judging by the "Fix Version/s" field above, I am guessing that a v2.3.3 
release is planned. Should I therefore provide a patch for the 2.3 branch as 
well as for trunk, which will end up as 2.4?

> Most of the contributed Analyzers suffer from invalid recognition of acronyms.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-1373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis, contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Mark Lassau
>            Priority: Minor
>
> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like 
> "www.apache.org." would be incorrectly tokenized as an acronym (note the dot 
> at the end).
> Unfortunately, keeping the "backward compatibility" of a bug turns out to 
> harm us.
> StandardTokenizer has a couple of ways to indicate "fix this bug", but 
> unfortunately the default behaviour is still to be buggy.
> Most of the non-English analyzers provided in lucene-analyzers utilize the 
> StandardTokenizer, and in v2.3.2 not one of these provides a way to get the 
> non-buggy behaviour :(
> I refer to:
> * BrazilianAnalyzer
> * CzechAnalyzer
> * DutchAnalyzer
> * FrenchAnalyzer
> * GermanAnalyzer
> * GreekAnalyzer
> * ThaiAnalyzer
> I would be willing to contribute a patch to make these Analyzers work in the 
> next point release.
> I see two ways to do this:
> 1) Introduce a static method to StandardTokenizerImpl, whereby you could set 
> the "default" value of the replaceInvalidAcronym flag.
>     One could then call setDefaultForReplaceInvalidAcronym(true) one time 
> from your code,  and then whenever anyone uses the old Constructor, it would 
> set replaceInvalidAcronym=true
> 2) Add the replaceInvalidAcronym flag to all of the above Analyzers.
>     Some of these have multiple constructors already, so I would probably 
> just add a setter/getter to them.
> The question is, which of the above would be preferred?
> Personally, I think the first is the least amount of work to do, and also the 
> easiest to back out when you move onto v3.x, and the "deprecated" behaviour 
> is removed.
> However, doing 2) means the least disruption to core code.
> Also, judging by the "Fix Version/s" field above, I am guessing that a v2.3.3 
> release is planned, therefore I guess I should provide a patch for the 2.3 
> branch as well as trunk which will end up as 2.4?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


