[ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627903#action_12627903 ]
Mark Lassau commented on LUCENE-1373:
-------------------------------------

Had a closer look at the code, including changes in {{StandardAnalyzer}}.
The static default idea would need a reworking of {{StandardAnalyzer.reusableTokenStream()}}, and so I think it is safer to just add the {{replaceInvalidAcronym}} flag to the affected Analyzers.

> Most of the contributed Analyzers suffer from invalid recognition of acronyms.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-1373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis, contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Mark Lassau
>            Priority: Minor
>
> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like
> "www.apache.org." would be incorrectly tokenized as an acronym (note the dot
> at the end).
> Unfortunately, keeping the "backward compatibility" of a bug turns out to
> harm us.
> StandardTokenizer has a couple of ways to indicate "fix this bug", but
> unfortunately the default behaviour is still to be buggy.
> Most of the non-English analyzers provided in lucene-analyzers utilize the
> StandardTokenizer, and in v2.3.2 not one of these provides a way to get the
> non-buggy behaviour :(
> I refer to:
> * BrazilianAnalyzer
> * CzechAnalyzer
> * DutchAnalyzer
> * FrenchAnalyzer
> * GermanAnalyzer
> * GreekAnalyzer
> * ThaiAnalyzer
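
To illustrate the flag-based approach proposed in the comment, here is a minimal, self-contained sketch of an analyzer that forwards a {{replaceInvalidAcronym}} switch to its tokenizer. It is a toy model and does not use Lucene: {{SketchTokenizer}}, {{SketchAnalyzer}}, {{Token}}, the two regular expressions, and the "ACRONYM"/"HOST"/"WORD" labels are all assumptions made for illustration, loosely following the behaviour described for LUCENE-1068.

```java
import java.util.regex.Pattern;

// Toy model only: these classes are hypothetical and do not use Lucene.
final class AcronymFlagSketch {

    /** A term plus the type label assigned to it. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Stand-in for a tokenizer that has a "fix the acronym bug" switch. */
    static final class SketchTokenizer {
        // Single letters, each followed by a dot, e.g. "U.S.A."
        private static final Pattern ACRONYM = Pattern.compile("(\\p{Alpha}\\.){2,}");
        // Any dot-terminated run of alphanumeric segments, e.g. "www.apache.org."
        private static final Pattern DOTTED = Pattern.compile("(\\p{Alnum}+\\.){2,}");

        private final boolean replaceInvalidAcronym;

        SketchTokenizer(boolean replaceInvalidAcronym) {
            this.replaceInvalidAcronym = replaceInvalidAcronym;
        }

        Token classify(String term) {
            if (ACRONYM.matcher(term).matches()) {
                return new Token(term, "ACRONYM");
            }
            if (DOTTED.matcher(term).matches()) {
                if (replaceInvalidAcronym) {
                    // Fixed behaviour: label as a host name and drop the trailing dot.
                    return new Token(term.substring(0, term.length() - 1), "HOST");
                }
                // Legacy (buggy) behaviour: mislabel the term as an acronym.
                return new Token(term, "ACRONYM");
            }
            return new Token(term, "WORD");
        }
    }

    /** Stand-in for a language analyzer that forwards the flag to its tokenizer. */
    static final class SketchAnalyzer {
        private final boolean replaceInvalidAcronym;

        /** Default keeps the legacy behaviour, so existing callers are unaffected. */
        SketchAnalyzer() {
            this(false);
        }

        SketchAnalyzer(boolean replaceInvalidAcronym) {
            this.replaceInvalidAcronym = replaceInvalidAcronym;
        }

        Token analyze(String term) {
            return new SketchTokenizer(replaceInvalidAcronym).classify(term);
        }
    }

    public static void main(String[] args) {
        Token legacy = new SketchAnalyzer().analyze("www.apache.org.");
        Token fixed = new SketchAnalyzer(true).analyze("www.apache.org.");
        System.out.println(legacy.text + " -> " + legacy.type);
        System.out.println(fixed.text + " -> " + fixed.type);
    }
}
```

Keeping the flag on each analyzer instance, with a no-argument constructor that defaults to the old behaviour, avoids shared static state and leaves existing callers untouched; that is one plausible reading of why the per-analyzer flag is described as the safer option.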