[ https://issues.apache.org/jira/browse/LUCENE-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-1181.
----------------------------------------

    Resolution: Won't Fix

> Token reuse is not ideal for avoiding array copies
> --------------------------------------------------
>
>                 Key: LUCENE-1181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1181
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Trejkaz
>
> The way the Token API is currently written results in two unnecessary array
> copies, both of which could be avoided by changing the way it works.
>
> 1. setTermBuffer(char[],int,int) calls resizeTermBuffer(int), which copies the
> original term text even though it is about to be overwritten.
>
> #1 should be trivially fixable by introducing a private
> resizeTermBuffer(int,boolean), where the new boolean parameter specifies
> whether the existing term data gets copied over or not.
>
> 2. setTermBuffer(char[],int,int) copies what you pass in, instead of actually
> setting the term buffer.
>
> Setting aside the fact that the setTermBuffer method is misleadingly named,
> consider a token filter which performs Unicode normalisation on each token.
>
> How it has to be implemented at present:
>
> once:
>   - create a reusable char[] for storing the normalisation result
> every token:
>   - use getTermBuffer() and getTermLength() to get the buffer and relevant
>     length
>   - normalise the original string into our temporary buffer (if it isn't
>     big enough, grow the temp buffer size.)
>   - setTermBuffer(char[],int,int) - this does an extra copy.
>
> The following sequence would be much better:
>
> once:
>   - create a reusable char[] for storing the normalisation result
> every token:
>   - use getTermBuffer() and getTermLength() to get the buffer and relevant
>     length
>   - normalise the original string into our temporary buffer (if it isn't
>     big enough, grow the temp buffer size.)
>   - setTermBuffer(char[],int,int) stores our buffer in the Token by reference
>   - keep the term buffer which used to be in the Token as our new temp
>     buffer.
>
> The latter sequence results in no copying with the exception of the
> normalisation itself, which is unavoidable.
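
For illustration only, the two proposals in the issue can be sketched as below. This is not Lucene's Token class: SketchToken, SketchFilter, TokenBufferDemo and swapTermBuffer are hypothetical names invented for this example, and upper-casing stands in for Unicode normalisation so that the sketch needs no extra libraries.

```java
/** Hypothetical stand-in for a reusable token; NOT the real Lucene Token class. */
final class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength = 0;

    char[] getTermBuffer() { return termBuffer; }
    int getTermLength() { return termLength; }

    /** Growing for callers that still need the old text: contents are preserved. */
    char[] resizeTermBuffer(int newSize) {
        resizeTermBuffer(newSize, true);
        return termBuffer;
    }

    /** Proposal 1: grow the buffer, copying the old term only when asked to. */
    private void resizeTermBuffer(int newSize, boolean preserve) {
        if (newSize <= termBuffer.length) {
            return; // already large enough
        }
        char[] grown = new char[Math.max(newSize, termBuffer.length * 2)];
        if (preserve) {
            System.arraycopy(termBuffer, 0, grown, 0, termLength);
        }
        termBuffer = grown;
    }

    /** Copying setter: the old term is about to be overwritten, so skip preserving it. */
    void setTermBuffer(char[] src, int offset, int length) {
        resizeTermBuffer(length, false);
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    /**
     * Proposal 2 (hypothetical method): adopt the caller's buffer by reference
     * and hand back the previous buffer so the caller can reuse it as scratch.
     */
    char[] swapTermBuffer(char[] replacement, int length) {
        if (length < 0 || length > replacement.length) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        char[] previous = termBuffer;
        termBuffer = replacement;
        termLength = length;
        return previous;
    }
}

/** Hypothetical filter; upper-casing is a stand-in for Unicode normalisation. */
final class SketchFilter {
    private char[] scratch = new char[16]; // created once, reused for every token

    void process(SketchToken token) {
        char[] in = token.getTermBuffer();
        int len = token.getTermLength();
        if (scratch.length < len) {
            scratch = new char[len]; // grow the temp buffer when it is too small
        }
        for (int i = 0; i < len; i++) {
            scratch[i] = Character.toUpperCase(in[i]);
        }
        // No copy: the token takes our buffer, and we keep its old one as scratch.
        scratch = token.swapTermBuffer(scratch, len);
    }
}

final class TokenBufferDemo {
    public static void main(String[] args) {
        SketchToken token = new SketchToken();
        token.setTermBuffer("lucene".toCharArray(), 0, 6);
        new SketchFilter().process(token);
        System.out.println(new String(token.getTermBuffer(), 0, token.getTermLength()));
    }
}
```

The trade-off with a by-reference setter is ownership: after the swap the caller must stop touching the array it handed over, which may be one reason a copying setter is the safer default.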