[ https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304342#comment-14304342 ]
Christian Moen commented on LUCENE-6216:
----------------------------------------

Thanks, Robert.

I had the same idea and tried it out last night. The advantage of the approach is that we only read the buffer data for the token attributes we use, but in my opinion it leaves the API slightly awkward, since we would have both a {{setToken()}} and a {{setPartOfSpeech()}}. That said, this is still perhaps the best way to go, both for performance reasons and because these APIs are very low-level and not commonly used.

For the sake of exploring an alternative idea, a different approach could be to have separate token filters set these attributes. The tokenizer would set a {{CharTermAttribute}}, etc. and a {{JapaneseTokenAttribute}} (or something suitably named) that holds the {{Token}}. A separate {{JapanesePartOfSpeechFilter}} would be responsible for setting the {{PartOfSpeechAttribute}} by getting the data from the {{JapaneseTokenAttribute}} using a {{getToken()}} method.

We'd still need logic similar to the above to deal with {{setPartOfSpeech()}}, etc., so I don't think we gain anything by taking this approach, and it's a big change, too.

> Make it easier to modify Japanese token attributes downstream
> -------------------------------------------------------------
>
>                 Key: LUCENE-6216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Priority: Minor
>
> Japanese-specific token attributes such as {{PartOfSpeechAttribute}},
> {{BaseFormAttribute}}, etc. get their values from a
> {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.
> This makes it cumbersome to change these token attributes later on in the
> analysis chain since the {{Token}} instances are difficult to instantiate
> (sort of read-only objects).
>
> I've run into this issue in LUCENE-3922 (JapaneseNumberFilter), where it
> would be appropriate to update token attributes to also reflect Japanese
> number normalization.
>
> I think it might be more practical to allow setting a specific value for
> these token attributes directly rather than through a {{Token}}, since it
> makes the APIs simpler, makes it easier to change attributes downstream, and
> also makes it easier to support additional dictionaries.
>
> The drawback with this approach that I can think of is a performance hit, as
> we will miss out on the inherent lazy retrieval of these token attributes
> from the {{Token}} object (and the underlying dictionary/buffer).
>
> I'd like to do some testing to better understand the performance impact of
> this change. Happy to hear your thoughts on this.
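
For illustration only, here is a minimal sketch of how an attribute could offer both {{setToken()}} and {{setPartOfSpeech()}} while keeping the lazy read from the token. {{SketchToken}} and {{SketchPartOfSpeechAttribute}} are hypothetical stand-ins written for this example; they are not the Lucene classes, and the counter merely simulates a read from the dictionary buffer. Whether {{setToken()}} should clear an earlier explicit value is an assumption made here, not something the issue settles.

```java
// Hypothetical stand-in for org.apache.lucene.analysis.ja.Token.
// Each call to getPartOfSpeech() counts as one simulated buffer read.
final class SketchToken {
    private final String storedPartOfSpeech;
    private int bufferReads = 0;

    SketchToken(String storedPartOfSpeech) {
        this.storedPartOfSpeech = storedPartOfSpeech;
    }

    String getPartOfSpeech() {
        bufferReads++; // simulated dictionary/buffer access
        return storedPartOfSpeech;
    }

    int bufferReadCount() {
        return bufferReads;
    }
}

// Hypothetical attribute: an explicit value wins, otherwise the value is
// read lazily from the token at the time it is requested.
final class SketchPartOfSpeechAttribute {
    private SketchToken token;
    private String explicitValue;

    // Called by the tokenizer; assumed here to discard any earlier override.
    void setToken(SketchToken token) {
        this.token = token;
        this.explicitValue = null;
    }

    // Called by a downstream filter; passing null falls back to the token.
    void setPartOfSpeech(String partOfSpeech) {
        this.explicitValue = partOfSpeech;
    }

    String getPartOfSpeech() {
        if (explicitValue != null) {
            return explicitValue; // no buffer read needed
        }
        return token == null ? null : token.getPartOfSpeech();
    }
}

final class LazyAttributeDemo {
    public static void main(String[] args) {
        SketchToken token = new SketchToken("noun-numeral");
        SketchPartOfSpeechAttribute attribute = new SketchPartOfSpeechAttribute();

        attribute.setToken(token);                 // nothing is read yet
        System.out.println(token.bufferReadCount());

        attribute.setPartOfSpeech("noun-general"); // downstream override
        System.out.println(attribute.getPartOfSpeech());
        System.out.println(token.bufferReadCount());
    }
}
```

In this sketch a filter that overrides the value never triggers a read from the token, which is the property the lazy approach is meant to preserve.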