[
https://issues.apache.org/jira/browse/LUCENE-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992910#comment-16992910
]
Jim Ferenczi commented on LUCENE-9088:
--------------------------------------
I don't think this behavior is documented. The javadocs say:
{noformat}
* Also notice that token attributes such as
* {@link org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute},
* {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute},
* {@link org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute} and
* {@link org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute} are left
* unchanged and will inherit the values of the last token used to compose the normalized
* number and can be wrong. Hence, for 10万 (10000), we will have
* {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute}
* set to マン. This is a known issue and is subject to a future improvement.
* <p>
{noformat}
but that doesn't explain why we use the POS of the token following a grouped
number. IMO this is a bug that we should fix to ensure that the POS
stop filter can be used to remove the punctuation that was needed to detect
the numbers.
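To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It does not use Lucene's API: `Token`, `normalize`, and `sampleInput` are hypothetical names, and the idea that the grouped number picks up the POS of the lookahead token is only a model of the behavior described above, not a reading of the actual filter code. The tags are the English renderings used in the issue.

```java
import java.util.ArrayList;
import java.util.List;

public class NumberFilterPosSketch {
    /** Hypothetical stand-in for a term plus its part-of-speech tag. */
    public static final class Token {
        public final String term;
        public final String pos;

        public Token(String term, String pos) {
            this.term = term;
            this.pos = pos;
        }

        @Override
        public String toString() {
            return term + "/" + pos;
        }
    }

    public static final String NUMERIC = "noun-numeric";

    static boolean isNumber(Token t) {
        return !t.term.isEmpty() && t.term.chars().allMatch(Character::isDigit);
    }

    /**
     * Groups consecutive number tokens into one token.
     * With fixPos == false the grouped number takes the POS of the token the
     * scan stopped on (the lookahead), which models the reported behavior.
     * With fixPos == true the grouped number is always tagged noun-numeric.
     */
    public static List<Token> normalize(List<Token> in, boolean fixPos) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            if (!isNumber(in.get(i))) {
                out.add(in.get(i++));
                continue;
            }
            StringBuilder number = new StringBuilder();
            while (i < in.size() && isNumber(in.get(i))) {
                number.append(in.get(i++).term);
            }
            // The last token inspected is the lookahead at index i, if there is one;
            // at end of input it is the last number token itself.
            String inherited = i < in.size() ? in.get(i).pos : in.get(i - 1).pos;
            out.add(new Token(number.toString(), fixPos ? NUMERIC : inherited));
        }
        return out;
    }

    /** Tokens for the input text "2008 2009" as the tokenizer would tag them. */
    public static List<Token> sampleInput() {
        return List.of(
            new Token("2008", NUMERIC),
            new Token(" ", "symbol-space"),
            new Token("2009", NUMERIC));
    }

    public static void main(String[] args) {
        System.out.println(normalize(sampleInput(), false));
        System.out.println(normalize(sampleInput(), true));
    }
}
```

In this model, the first call tags "2008" with the POS of the space that follows it, while the second call keeps it as noun-numeric, which is the behavior I would expect from the filter.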
> JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute
> ----------------------------------------------------------
>
> Key: LUCENE-9088
> URL: https://issues.apache.org/jira/browse/LUCENE-9088
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Reporter: Christoph Büscher
> Priority: Major
>
> According to the JapaneseNumberFilter javadocs, it uses the attribute values
> of the last token used to compose the normalized number, which can be wrong.
> While this is documented, it leads to a number of incompatibilities with other
> Japanese token filters.
> For example, inheriting the PartOfSpeechAttribute of the last token for an
> input text of "2008 2009" leads to the following output (some attributes
> left out...):
> ```
> {
> "token" : "2008",
> "start_offset" : 0,
> "end_offset" : 4,
> "type" : "word",
> [...]
> "partOfSpeech" : "記号-空白",
> "partOfSpeech (en)" : "symbol-space"
> [...]
> },
> {
> "token" : " ",
> "start_offset" : 4,
> "end_offset" : 5,
> "type" : "word",
> [...]
> "partOfSpeech" : "記号-空白",
> "partOfSpeech (en)" : "symbol-space",
> [...]
> },
> {
> "token" : "2009",
> "start_offset" : 5,
> "end_offset" : 9,
> "type" : "word",
> [...]
> "partOfSpeech" : "名詞-数",
> "partOfSpeech (en)" : "noun-numeric",
> }
> ```
> so that e.g. a following `kuromoji_part_of_speech`
> filter will eliminate the "2008" token erroneously tagged as "symbol-space".
> Even without fixing the other token attributes, the POS attribute should
> IMHO be set to "noun-numeric", since that's what the filter is supposed to
> detect.
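
The downstream effect on a part-of-speech stop filter can be sketched without Lucene as follows. This is an illustration only: `Token` and `keptTerms` are hypothetical names, and the stop-tag set containing just "symbol-space" is an assumption chosen to match the example above, not the default configuration of any real filter.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PosStopFilterSketch {
    /** Hypothetical stand-in for a term plus its part-of-speech tag. */
    public static final class Token {
        public final String term;
        public final String pos;

        public Token(String term, String pos) {
            this.term = term;
            this.pos = pos;
        }
    }

    /** Returns the terms of all tokens whose POS tag is not in stopTags. */
    public static List<String> keptTerms(List<Token> tokens, Set<String> stopTags) {
        return tokens.stream()
            .filter(t -> !stopTags.contains(t.pos))
            .map(t -> t.term)
            .collect(Collectors.toList());
    }

    /** Tags as shown in the output above: "2008" carries the tag of the space. */
    public static List<Token> reportedTags() {
        return List.of(
            new Token("2008", "symbol-space"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));
    }

    /** Tags as proposed: every normalized number is noun-numeric. */
    public static List<Token> proposedTags() {
        return List.of(
            new Token("2008", "noun-numeric"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));
    }

    public static void main(String[] args) {
        Set<String> stopTags = Set.of("symbol-space"); // assumed stop tag
        System.out.println(keptTerms(reportedTags(), stopTags));
        System.out.println(keptTerms(proposedTags(), stopTags));
    }
}
```

With the reported tags only "2009" survives the filter; with the proposed tags both numbers survive and only the space is dropped.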