[jira] Resolved: (LUCENE-1181) Token reuse is not ideal for avoiding array copies

Michael McCandless (JIRA) Wed, 23 Apr 2008 06:26:49 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless resolved LUCENE-1181.
----------------------------------------

    Resolution: Won't Fix

> Token reuse is not ideal for avoiding array copies
> --------------------------------------------------
>
>                 Key: LUCENE-1181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1181
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Trejkaz
>
> The way the Token API is currently written results in two unnecessary array 
> copies which could be avoided by changing the way it works.
> 1. setTermBuffer(char[],int,int) calls resizeTermBuffer(int) which copies the 
> original term text even though it's about to be overwritten.
> #1 should be trivially fixable by introducing a private 
> resizeTermBuffer(int,boolean) where the new boolean parameter specifies 
> whether the existing term data gets copied over or not.
> 2. setTermBuffer(char[],int,int) copies what you pass in, instead of actually 
> setting the term buffer.
> Setting aside the fact that the setTermBuffer method is misleadingly named, 
> consider a token filter which performs Unicode normalisation on each token.
> How it has to be implemented at present:
>   once:
>     - create a reusable char[] for storing the normalisation result
>   every token:
>     - use getTermBuffer() and getTermLength() to get the buffer and relevant length
>     - normalise the original string into our temporary buffer (if it isn't big enough, grow the temp buffer)
>     - setTermBuffer(char[],int,int) - this does an extra copy.
> The following sequence would be much better:
>   once:
>     - create a reusable char[] for storing the normalisation result
>   every token:
>     - use getTermBuffer() and getTermLength() to get the buffer and relevant length
>     - normalise the original string into our temporary buffer (if it isn't big enough, grow the temp buffer)
>     - setTermBuffer(char[],int,int) sets our buffer into the Token by reference
>     - keep the term buffer which used to be in the Token as our new temp buffer.
> The latter sequence results in no copying with the exception of the 
> normalisation itself, which is unavoidable.
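
The two proposals in the description can be sketched in Java as follows. This is only an illustration, not Lucene's Token class: SketchToken, termBuffer(), termLength() and swapTermBuffer(char[],int) are hypothetical names invented here, and only resizeTermBuffer(int,boolean) follows the signature suggested in the report.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a reusable token; NOT Lucene's Token class. */
final class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength = 0;

    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }

    /** Point 1: grow the buffer, copying the old contents only when asked to. */
    private void resizeTermBuffer(int newSize, boolean preserve) {
        if (newSize <= termBuffer.length) {
            return; // already large enough, nothing to allocate
        }
        int grown = Math.max(newSize, termBuffer.length * 2);
        termBuffer = preserve ? Arrays.copyOf(termBuffer, grown) : new char[grown];
    }

    /** Copying setter: the old text is about to be overwritten, so it is not preserved. */
    void setTermBuffer(char[] src, int offset, int length) {
        resizeTermBuffer(length, false);
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    /** Point 2 (assumed API): adopt the caller's array by reference, return the old one. */
    char[] swapTermBuffer(char[] newBuffer, int length) {
        if (length < 0 || length > newBuffer.length) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        char[] old = termBuffer;
        termBuffer = newBuffer;
        termLength = length;
        return old;
    }
}
```

Under these assumptions, a normalising filter would write its result into a temporary array and then call something like `temp = token.swapTermBuffer(temp, n)`, so the array that used to live in the token becomes the next temporary buffer and no extra copy is made.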

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


