[
https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081374#comment-13081374
]
Robert Muir commented on LUCENE-3366:
-------------------------------------
the purpose of the filter is "Normalizes tokens extracted with
StandardTokenizer".
currently this is a no-op, but we can always improve it in the spirit
of the standard this thing implements.
The TODO currently refers to this statement:
"For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use
spaces between words, a good implementation should not depend on the default
word boundary specification. It should use a more sophisticated mechanism ...
Ideographic scripts such as Japanese and Chinese are even more complex"
There is no problem having a TODO in this filter; we don't need to do a rush
job for any reason...
Some of the preparation for this (e.g. improving the default behavior for CJK)
was already done in LUCENE-2911. We now tag all these special token types,
so in the meantime anyone who wants to do their own downstream processing
can already do so.
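As a rough illustration of that downstream processing, a custom TokenFilter can inspect the token types StandardTokenizer assigns since LUCENE-2911 (e.g. {{<IDEOGRAPHIC>}}, {{<HIRAGANA>}}) via TypeAttribute. This is only a hedged sketch against the 3.x analysis API; the class name and the ideographic-only handling are hypothetical, not anything the issue itself proposes:

{code:lang=java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical filter: passes tokens through unchanged, but shows where
// custom handling of StandardTokenizer's tagged types could be plugged in.
public final class MyIdeographicFilter extends TokenFilter {
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public MyIdeographicFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // StandardTokenizer.TOKEN_TYPES maps the type constants to their
    // string names, e.g. "<IDEOGRAPHIC>".
    String type = typeAtt.type();
    if (StandardTokenizer.TOKEN_TYPES[StandardTokenizer.IDEOGRAPHIC].equals(type)) {
      // do your own processing for ideographic tokens here
    }
    return true;
  }
}
{code}

Such a filter would simply be appended after StandardTokenizer in an Analyzer's chain, the same place StandardFilter sits today.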
> StandardFilter only works with ClassicTokenizer and only when version < 3.1
> ---------------------------------------------------------------------------
>
> Key: LUCENE-3366
> URL: https://issues.apache.org/jira/browse/LUCENE-3366
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 3.3
> Reporter: David Smiley
>
> The StandardFilter used to remove periods from acronyms and trailing
> apostrophe-S's where they occurred. And it used to work in conjunction with
> the StandardTokenizer. Presently, it only does this with ClassicTokenizer and
> when the lucene match version is before 3.1. Here is an excerpt from the code:
> {code:lang=java}
> public final boolean incrementToken() throws IOException {
>   if (matchVersion.onOrAfter(Version.LUCENE_31))
>     return input.incrementToken(); // TODO: add some niceties for the new grammar
>   else
>     return incrementTokenClassic();
> }
> {code}
> It seems to me that in the great refactor of the standard tokenizer,
> LUCENE-2167, something was forgotten here. I think that if someone uses the
> ClassicTokenizer, then no matter what the version is, this filter should do
> what it used to do. And the TODO suggests someone forgot to make this filter
> do something useful for the StandardTokenizer. Or perhaps that idea should
> be discarded and this class should be renamed ClassicTokenFilter.
> In any event, the javadocs for this class appear out of date as there is no
> mention of ClassicTokenizer, and the wiki is out of date too.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira