[
https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304342#comment-14304342
]
Christian Moen commented on LUCENE-6216:
----------------------------------------
Thanks, Robert.
I had the same idea and I tried this out last night. The advantage of the
approach is that we only read the buffer data for the token attributes we use,
but it leaves the API slightly awkward in my opinion, since we would have
both a {{setToken()}} and a {{setPartOfSpeech()}}. That said, this is still
perhaps the best way to go, both for performance reasons and because these
APIs are very low-level and not commonly used.
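
A minimal sketch of the idea above, not the actual Lucene code: the attribute keeps the lazy {{Token}}-backed lookup but lets an explicitly set value take precedence. The class names {{SketchToken}}, {{SketchPartOfSpeechAttribute}} and {{LazyPartOfSpeechDemo}} are hypothetical stand-ins, and resetting the explicit value in {{setToken()}} is an assumption about how the two setters would interact.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for org.apache.lucene.analysis.ja.Token: the
// part-of-speech read is deferred until a caller actually asks for it.
final class SketchToken {
    private final Supplier<String> posLookup;
    int lookups = 0; // counts simulated dictionary/buffer reads

    SketchToken(Supplier<String> posLookup) {
        this.posLookup = posLookup;
    }

    String getPartOfSpeech() {
        lookups++;
        return posLookup.get();
    }
}

// Hypothetical attribute implementation that offers both setters.
final class SketchPartOfSpeechAttribute {
    private SketchToken token;   // lazy source, set by the tokenizer
    private String explicit;     // value set directly by a downstream filter
    private boolean hasExplicit;

    // Assumption: a new token discards any explicitly set value.
    void setToken(SketchToken token) {
        this.token = token;
        this.explicit = null;
        this.hasExplicit = false;
    }

    void setPartOfSpeech(String partOfSpeech) {
        this.explicit = partOfSpeech;
        this.hasExplicit = true;
    }

    String getPartOfSpeech() {
        if (hasExplicit) {
            return explicit; // no buffer read needed
        }
        return token == null ? null : token.getPartOfSpeech();
    }
}

final class LazyPartOfSpeechDemo {
    public static void main(String[] args) {
        SketchToken token = new SketchToken(() -> "noun-numeral");
        SketchPartOfSpeechAttribute att = new SketchPartOfSpeechAttribute();

        att.setToken(token);
        System.out.println(token.lookups);          // 0: nothing read yet
        System.out.println(att.getPartOfSpeech());  // noun-numeral
        System.out.println(token.lookups);          // 1

        att.setPartOfSpeech("noun-general");        // downstream change
        System.out.println(att.getPartOfSpeech());  // noun-general
        System.out.println(token.lookups);          // still 1
    }
}
```

The awkwardness mentioned above shows up in {{getPartOfSpeech()}}: the attribute has to track which of the two setters was used last.
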
For the sake of exploring an alternative idea, a different approach could be to
have separate token filters set these attributes. The tokenizer would set a
{{CharTermAttribute}}, etc. and a {{JapaneseTokenAttribute}} (or something
suitably named) that holds the {{Token}}. A separate
{{JapanesePartOfSpeechFilter}} would be responsible for setting the
{{PartOfSpeechAttribute}} by getting the data from the
{{JapaneseTokenAttribute}} using a {{getToken()}} method. We'd still need logic
similar to the above to deal with {{setPartOfSpeech()}}, etc. so I don't think
we gain anything by taking this approach, and it's a big change, too.
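
The alternative could look roughly like the sketch below. It is an illustration only and does not use the real Lucene {{TokenFilter}} or attribute APIs: {{DictToken}}, {{TokenHolderAttribute}}, {{PlainPartOfSpeechAttribute}} and {{PartOfSpeechFilterSketch}} are hypothetical stand-ins for {{Token}}, {{JapaneseTokenAttribute}}, {{PartOfSpeechAttribute}} and {{JapanesePartOfSpeechFilter}}.

```java
// Hypothetical stand-in for the dictionary-backed Token.
final class DictToken {
    private final String partOfSpeech;

    DictToken(String partOfSpeech) {
        this.partOfSpeech = partOfSpeech;
    }

    String getPartOfSpeech() {
        return partOfSpeech;
    }
}

// Hypothetical JapaneseTokenAttribute: it only carries the Token.
final class TokenHolderAttribute {
    private DictToken token;

    void setToken(DictToken token) {
        this.token = token;
    }

    DictToken getToken() {
        return token;
    }
}

// Hypothetical plain-value part-of-speech attribute.
final class PlainPartOfSpeechAttribute {
    private String partOfSpeech;

    void setPartOfSpeech(String partOfSpeech) {
        this.partOfSpeech = partOfSpeech;
    }

    String getPartOfSpeech() {
        return partOfSpeech;
    }
}

// Hypothetical JapanesePartOfSpeechFilter step: copies the value from the
// token holder into the part-of-speech attribute for the current token.
final class PartOfSpeechFilterSketch {
    private final TokenHolderAttribute tokenAtt;
    private final PlainPartOfSpeechAttribute posAtt;

    PartOfSpeechFilterSketch(TokenHolderAttribute tokenAtt,
                             PlainPartOfSpeechAttribute posAtt) {
        this.tokenAtt = tokenAtt;
        this.posAtt = posAtt;
    }

    void processCurrentToken() {
        DictToken token = tokenAtt.getToken();
        posAtt.setPartOfSpeech(token == null ? null : token.getPartOfSpeech());
    }
}

final class SeparateFilterDemo {
    public static void main(String[] args) {
        TokenHolderAttribute tokenAtt = new TokenHolderAttribute();
        PlainPartOfSpeechAttribute posAtt = new PlainPartOfSpeechAttribute();
        PartOfSpeechFilterSketch filter =
            new PartOfSpeechFilterSketch(tokenAtt, posAtt);

        tokenAtt.setToken(new DictToken("noun-numeral")); // tokenizer's job
        filter.processCurrentToken();                     // filter's job
        System.out.println(posAtt.getPartOfSpeech());     // noun-numeral
    }
}
```

Note that this sketch reads the value eagerly in the filter. Keeping the read lazy would bring back the same either/or logic inside the attribute, which is the point made above.
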
> Make it easier to modify Japanese token attributes downstream
> -------------------------------------------------------------
>
> Key: LUCENE-6216
> URL: https://issues.apache.org/jira/browse/LUCENE-6216
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Christian Moen
> Priority: Minor
>
> Japanese-specific token attributes such as {{PartOfSpeechAttribute}},
> {{BaseFormAttribute}}, etc. get their values from a
> {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.
> This makes it cumbersome to change these token attributes later on in the
> analysis chain since the {{Token}} instances are difficult to instantiate
> (sort of read-only objects).
> I've run into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would
> be appropriate to update token attributes to also reflect Japanese number
> normalization.
> I think it might be more practical to allow setting a specific value for
> these token attributes directly rather than through a {{Token}} since it
> makes the APIs simpler, makes it easier to change attributes downstream, and
> also makes it easier to support additional dictionaries.
> The drawback with the approach that I can think of is a performance hit as we
> will miss out on the inherent lazy retrieval of these token attributes from
> the {{Token}} object (and the underlying dictionary/buffer).
> I'd like to do some testing to better understand the performance impact of
> this change. Happy to hear your thoughts on this.