[jira] [Commented] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream

Christian Moen (JIRA) Tue, 03 Feb 2015 15:58:07 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304342#comment-14304342
 ]


Christian Moen commented on LUCENE-6216:
----------------------------------------

Thanks, Robert.

I had the same idea and tried it out last night.  The advantage of the 
approach is that we only read the buffer data for the token attributes we use, 
but it leaves the API slightly awkward in my opinion, since we would have 
both a {{setToken()}} and a {{setPartOfSpeech()}}.  That said, this is still 
perhaps the best way to go, both for performance reasons and because these APIs 
are very low-level and not commonly used.
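
To make the trade-off concrete, here is a minimal sketch of an attribute that 
keeps the lazy read through {{setToken()}} but also accepts an explicit value 
through {{setPartOfSpeech()}}.  The class names {{SketchToken}} and 
{{SketchPartOfSpeechAttribute}} are hypothetical stand-ins so the example 
compiles without Lucene; they are not the real 
{{org.apache.lucene.analysis.ja.Token}} or the real attribute implementation, 
and the actual patch may well look different.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the dictionary-backed Token; not the Lucene class. */
final class SketchToken {
    private final Supplier<String> posLookup; // stands in for the dictionary/buffer read
    int lookups = 0;                          // counts how often the buffer is actually read

    SketchToken(Supplier<String> posLookup) {
        this.posLookup = posLookup;
    }

    String getPartOfSpeech() {
        lookups++;
        return posLookup.get();
    }
}

/** Hypothetical attribute offering both setToken() and setPartOfSpeech(). */
final class SketchPartOfSpeechAttribute {
    private SketchToken token;   // lazy source, set by the tokenizer
    private String explicitPos;  // explicit value, set by a downstream filter
    private boolean overridden;

    /** Called by the tokenizer; a new token discards any earlier explicit value. */
    void setToken(SketchToken token) {
        this.token = token;
        this.explicitPos = null;
        this.overridden = false;
    }

    /** Called downstream to replace the dictionary value. */
    void setPartOfSpeech(String partOfSpeech) {
        this.explicitPos = partOfSpeech;
        this.overridden = true;
    }

    /** Reads the buffer only when no explicit value has been set. */
    String getPartOfSpeech() {
        if (overridden) {
            return explicitPos;
        }
        return token == null ? null : token.getPartOfSpeech();
    }
}
```

The awkward part is visible in the sketch: the attribute has two setters whose 
interaction has to be defined (here, {{setToken()}} resets the explicit value), 
which is the price of keeping the lazy read.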

For the sake of exploring an alternative idea, a different approach could be to 
have separate token filters set these attributes.  The tokenizer would set a 
{{CharTermAttribute}}, etc. and a {{JapaneseTokenAttribute}} (or something 
suitably named) that holds the {{Token}}.  A separate 
{{JapanesePartOfSpeechFilter}} would be responsible for setting the 
{{PartOfSpeechAttribute}} by getting the data from the 
{{JapaneseTokenAttribute}} using a {{getToken()}} method. We'd still need logic 
similar to the above to deal with {{setPartOfSpeech()}}, etc. so I don't think 
we gain anything by taking this approach, and it's a big change, too.
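
A rough sketch of how the separate-filter idea might be wired up, using plain 
Java classes so it compiles without Lucene.  All names below 
({{DictToken}}, {{TokenHolderAttribute}}, {{PosAttribute}}, {{PosFilterSketch}}) 
are hypothetical placeholders for the {{Token}}, {{JapaneseTokenAttribute}}, 
{{PartOfSpeechAttribute}} and {{JapanesePartOfSpeechFilter}} mentioned above, 
and the real {{TokenFilter}} plumbing is left out.

```java
/** Hypothetical stand-in for the dictionary-backed Token. */
final class DictToken {
    private final String partOfSpeech;

    DictToken(String partOfSpeech) {
        this.partOfSpeech = partOfSpeech;
    }

    String getPartOfSpeech() {
        return partOfSpeech;
    }
}

/** Hypothetical attribute that only carries the Token set by the tokenizer. */
final class TokenHolderAttribute {
    private DictToken token;

    void setToken(DictToken token) {
        this.token = token;
    }

    DictToken getToken() {
        return token;
    }
}

/** Hypothetical part-of-speech attribute holding a plain value. */
final class PosAttribute {
    private String partOfSpeech;

    void setPartOfSpeech(String partOfSpeech) {
        this.partOfSpeech = partOfSpeech;
    }

    String getPartOfSpeech() {
        return partOfSpeech;
    }
}

/** Hypothetical filter step that copies the value from the Token into the attribute. */
final class PosFilterSketch {
    private final TokenHolderAttribute tokenAtt;
    private final PosAttribute posAtt;

    PosFilterSketch(TokenHolderAttribute tokenAtt, PosAttribute posAtt) {
        this.tokenAtt = tokenAtt;
        this.posAtt = posAtt;
    }

    /** Would run once per token, roughly where incrementToken() sits in a real filter. */
    void processCurrentToken() {
        DictToken token = tokenAtt.getToken();
        posAtt.setPartOfSpeech(token == null ? null : token.getPartOfSpeech());
    }
}
```

In this sketch the filter reads the value eagerly for every token, and a 
filter further down the chain can still call {{setPartOfSpeech()}} to change it.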

> Make it easier to modify Japanese token attributes downstream
> -------------------------------------------------------------
>
>                 Key: LUCENE-6216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Priority: Minor
>
> Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
> {{BaseFormAttribute}}, etc. get their values from a 
> {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  
> This makes it cumbersome to change these token attributes later on in the 
> analysis chain since the {{Token}} instances are difficult to instantiate 
> (sort of read-only objects).
> I've run into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would 
> be appropriate to update token attributes to also reflect Japanese number 
> normalization.
> I think it might be more practical to allow setting a specific value for 
> these token attributes directly rather than through a {{Token}} since it 
> makes the APIs simpler, makes it easier to change attributes downstream, and 
> also makes it easier to support additional dictionaries.
> The drawback with the approach that I can think of is a performance hit as we 
> will miss out on the inherent lazy retrieval of these token attributes from 
> the {{Token}} object (and the underlying dictionary/buffer).
> I'd like to do some testing to better understand the performance impact of 
> this change. Happy to hear your thoughts on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
