Most of the contributed Analyzers suffer from invalid recognition of acronyms.
------------------------------------------------------------------------------
                 Key: LUCENE-1373
                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis, contrib/analyzers
    Affects Versions: 2.3.2
            Reporter: Mark Lassau
            Priority: Minor

LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end).

Unfortunately, keeping "backward compatibility" with a bug turns out to harm us. StandardTokenizer has a couple of ways to request the fixed behaviour, but the default is still the buggy behaviour.

Most of the non-English analyzers provided in lucene-analyzers use StandardTokenizer, and in v2.3.2 not one of them provides a way to get the non-buggy behaviour :(

I refer to:
* BrazilianAnalyzer
* CzechAnalyzer
* DutchAnalyzer
* FrenchAnalyzer
* GermanAnalyzer
* GreekAnalyzer
* ThaiAnalyzer

I would be willing to contribute a patch to make these Analyzers work in the next point release. I see two ways to do this:

1) Introduce a static method on StandardTokenizerImpl that sets the "default" value of the replaceInvalidAcronym flag. You could then call setDefaultForReplaceInvalidAcronym(true) once from your code, and from then on anything that uses the old constructor would get replaceInvalidAcronym=true.

2) Add the replaceInvalidAcronym flag to all of the above Analyzers. Some of these already have multiple constructors, so I would probably just add a setter/getter to them.

The question is, which of the above would be preferred?

Personally, I think the first is the least work, and also the easiest to back out when you move on to v3.x and the "deprecated" behaviour is removed. However, doing 2) means the least disruption to core code.

Also, judging by the "Fix Version/s" field above, I am guessing that a v2.3.3 release is planned, so I should presumably provide a patch for the 2.3 branch as well as for trunk, which will end up as 2.4?
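To make the two options concrete, here is a rough sketch of how each might look. It is not a patch: SketchTokenizer and SketchLanguageAnalyzer are hypothetical stand-ins rather than the real Lucene classes, and the token classification is a toy rule that only illustrates what the flag changes. Only the names replaceInvalidAcronym and setDefaultForReplaceInvalidAcronym come from the proposal above.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a tokenizer; NOT the real Lucene StandardTokenizer. */
class SketchTokenizer {
    // Toy rule (assumption): a true acronym is single letters each followed by a dot, e.g. "U.S.A."
    private static final Pattern TRUE_ACRONYM = Pattern.compile("(?:\\p{L}\\.){2,}");

    // Option 1: a process-wide default that the legacy constructor consults.
    private static volatile boolean defaultReplaceInvalidAcronym = false;

    private final boolean replaceInvalidAcronym;

    static void setDefaultForReplaceInvalidAcronym(boolean value) {
        defaultReplaceInvalidAcronym = value;
    }

    /** Legacy constructor: picks up whatever default is set at construction time. */
    SketchTokenizer() {
        this(defaultReplaceInvalidAcronym);
    }

    /** Explicit constructor: the caller chooses the behaviour directly. */
    SketchTokenizer(boolean replaceInvalidAcronym) {
        this.replaceInvalidAcronym = replaceInvalidAcronym;
    }

    /** Toy classification of a single token; real tokenization is far more involved. */
    String typeOf(String token) {
        if (!token.endsWith(".")) {
            return "OTHER";
        }
        if (!replaceInvalidAcronym || TRUE_ACRONYM.matcher(token).matches()) {
            // With the flag off, every dotted token ending in a dot is called an acronym.
            return "ACRONYM";
        }
        return "HOST";
    }
}

/** Option 2: a hypothetical contrib-style analyzer carrying its own flag. */
class SketchLanguageAnalyzer {
    private boolean replaceInvalidAcronym = false;

    boolean isReplaceInvalidAcronym() {
        return replaceInvalidAcronym;
    }

    void setReplaceInvalidAcronym(boolean replaceInvalidAcronym) {
        this.replaceInvalidAcronym = replaceInvalidAcronym;
    }

    /** Passes the per-analyzer flag on to the tokenizer it creates. */
    SketchTokenizer newTokenizer() {
        return new SketchTokenizer(replaceInvalidAcronym);
    }
}
```

The trade-off shows up in the sketch: option 1 is one static field but changes behaviour for every caller in the JVM, while option 2 keeps the choice local to each analyzer instance at the cost of repeating the setter/getter in all seven classes.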