Christian Moen created LUCENE-6216:
--------------------------------------
Summary: Make it easier to modify Japanese token attributes
downstream
Key: LUCENE-6216
URL: https://issues.apache.org/jira/browse/LUCENE-6216
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Reporter: Christian Moen
Priority: Minor
Japanese-specific token attributes such as {{PartOfSpeechAttribute}},
{{BaseFormAttribute}}, etc. get their values from a
{{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method. This
makes it cumbersome to change these token attributes later on in the analysis
chain since the {{Token}} instances are difficult to instantiate (sort of
read-only objects).
I've run into this issue in LUCENE-3922 (JapaneseNumberFilter), where it would
be appropriate to update token attributes to also reflect Japanese number
normalization.
I think it might be more practical to allow setting a specific value for these
token attributes directly rather than through a {{Token}}, since this makes the
APIs simpler, makes it easier to change attributes downstream, and makes it
easier to support additional dictionaries.
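To make the idea concrete, here is a minimal sketch of an attribute that accepts
either a token or a directly set value. All class and method names below
({{SketchToken}}, {{SketchBaseFormAttribute}}, {{setBaseForm()}}) are hypothetical
stand-ins for illustration; they are not the existing Lucene classes and not a
proposed final API.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a dictionary-backed token (not Lucene's Token). */
final class SketchToken {
    private final Supplier<String> baseFormLookup;

    SketchToken(Supplier<String> baseFormLookup) {
        this.baseFormLookup = baseFormLookup;
    }

    /** Resolves the base form from the (simulated) dictionary on each call. */
    String getBaseForm() {
        return baseFormLookup.get();
    }
}

/** Hypothetical attribute that can be fed by a token or by a direct value. */
final class SketchBaseFormAttribute {
    private SketchToken token;
    private String baseForm;
    private boolean explicit;

    /** Token-backed path: the value is looked up lazily when read. */
    void setToken(SketchToken token) {
        this.token = token;
        this.baseForm = null;
        this.explicit = false;
    }

    /** Direct path (assumed addition): a downstream filter overrides the value. */
    void setBaseForm(String baseForm) {
        this.baseForm = baseForm;
        this.explicit = true;
    }

    String getBaseForm() {
        if (explicit) {
            return baseForm;
        }
        return token == null ? null : token.getBaseForm();
    }
}

/** Shows a downstream component replacing the tokenizer-provided value. */
final class SketchDemo {
    static String run() {
        SketchBaseFormAttribute attr = new SketchBaseFormAttribute();
        attr.setToken(new SketchToken(() -> "dictionary-form"));
        String before = attr.getBaseForm();
        attr.setBaseForm("normalized-form"); // e.g. after number normalization
        return before + " -> " + attr.getBaseForm();
    }
}
```

In this sketch a later {{setToken()}} call clears any explicit value, so the
tokenizer can keep reusing the attribute instance across tokens.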
The one drawback I can think of with this approach is a performance hit, since
we would lose the inherent lazy retrieval of these token attributes from the
{{Token}} object (and the underlying dictionary/buffer).
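The concern can be illustrated with a toy comparison that counts simulated
dictionary reads. This is only a sketch under assumptions: {{CountingDictionary}}
and the two methods are hypothetical, the "dictionary" is a counter rather than
the real one, and it says nothing about actual timings in Lucene.

```java
import java.util.function.Supplier;

final class LazyVersusEagerSketch {
    /** Simulated dictionary that counts how often it is read. */
    static final class CountingDictionary {
        int lookups = 0;

        String partOfSpeech(int wordId) {
            lookups++;
            return "pos-" + wordId;
        }
    }

    /** Lazy style: keep a reference and resolve only when a consumer reads it. */
    static int lazyLookups(int tokenCount, int readEvery) {
        CountingDictionary dict = new CountingDictionary();
        for (int i = 0; i < tokenCount; i++) {
            final int wordId = i;
            Supplier<String> pos = () -> dict.partOfSpeech(wordId);
            if (i % readEvery == 0) {
                pos.get(); // only some tokens have the attribute read downstream
            }
        }
        return dict.lookups;
    }

    /** Eager style: copy the value into the attribute for every token. */
    static int eagerLookups(int tokenCount) {
        CountingDictionary dict = new CountingDictionary();
        for (int i = 0; i < tokenCount; i++) {
            String pos = dict.partOfSpeech(i); // resolved whether or not it is read
        }
        return dict.lookups;
    }
}
```

With 1000 tokens and the attribute read for every tenth token, the lazy variant
performs 100 simulated lookups and the eager variant performs 1000. How much
that difference matters in practice is what the testing would need to show.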
I'd like to do some testing to better understand the performance impact of this
change. Happy to hear your thoughts on this.