"Michael McCandless" <[EMAIL PROTECTED]> wrote: > I agree the situation is not ideal, and it's confusing. > > This comes back to LUCENE-969. > > At the time, we decided to keep both String & char[] only to avoid > performance cost for those analyzer chains that use String tokens > exclusively. > > The idea was to allow Token to keep both text or char[] and sometimes > both (if they are storing the same characters, as happens if > termBuffer() is called when it's a String being stored) > > Then, in 3.0, we would make the change you are proposing (to only > store char[] internally). That was the plan, anyway. Accelerating > this plan (to store only char[] today) is compelling, but I worry > about the performance hit to legacy analyzer chains...

I'd like to suggest another implementation that uses StringBuilder or
CharBuffer instead of char[], because then we would not need to maintain
the length separately from the character sequence itself. If we use
char[], we have to handle the char[], the offset, and the sequence length,
and the methods we implement become complex. I think those should be
packed into one object.

I have not tested whether using StringBuilder or CharBuffer hurts
performance, but I suspect the cost would not be that bad.

> More responses below:
>
> DM Smith <[EMAIL PROTECTED]> wrote:

-snip-

> > I was looking at this in light of TokenFilter's next(Token) method and how
> > it was being used. In looking at the contrib filters, they have not been
> > modified. Further, most of them, if they work with the content analysis and
> > generation, do their work in strings. Some of these appear to be good
> > candidates for using char[] rather than strings, such as the NGram filter.
> > But others look like they'd just as well remain with String manipulation.
>
> It would be great to upgrade all contrib filters to use the re-use APIs.

I'll contribute, too. :-)

> > I'd like to suggest that internally, that Token be changed to only use
> > char[] termBuffer and eliminate termText.
>
> The question is what performance cost we are incurring eg on the
> contrib (& other) sources/filters? Every time setTermText is called,
> we copy out the chars (instead of holding a reference to the String).
> Every time getText() is called we create a new String(...) from the
> char[]. I think it's potentially a high cost, and so maybe we should
> wait until 3.0 when we drop the deprecated APIs?
>
> > And also, that termText be restored as not deprecated.
>
> It made me nervous keeping this method because it looks like it should
> be cheap to call, and in the past it was very cheap to call.
> But, maybe we could keep it, if we mark very very clearly in the
> javadocs the performance cost you incur by using this method (it makes
> a new String() every time)?

I'd like to suggest changing the method definition to:

    public void setTermText(CharSequence text)

> > But, in TokenFilter, next() should be deprecated, IMHO.
>
> I think this is a good idea. After all if people don't want to bother
> using the passed in Token, they are still allowed to return a new
> one.

I did not follow this point. Could you explain why next() should be
deprecated?
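To make the setTermText(CharSequence) suggestion a bit more concrete, here
is a rough sketch of what I have in mind. It is only an illustration:
SketchToken is a made-up class, not Lucene's Token, and the buffer growth
policy is an assumption of mine. The method names termBuffer(),
termLength() and termText() just mirror the ones discussed in this thread.

```java
// Hypothetical sketch only; SketchToken is NOT Lucene's Token class.
final class SketchToken {
    private char[] termBuffer = new char[16]; // backing storage, reused across calls
    private int termLength;                   // number of valid chars in termBuffer

    // Accepts String, StringBuilder, CharBuffer, or any other CharSequence.
    // The chars are copied into the reusable buffer; no reference is kept.
    void setTermText(CharSequence text) {
        int len = text.length();
        if (len > termBuffer.length) {
            // Assumed growth policy: at least double, or exactly what is needed.
            termBuffer = new char[Math.max(len, termBuffer.length * 2)];
        }
        for (int i = 0; i < len; i++) {
            termBuffer[i] = text.charAt(i);
        }
        termLength = len;
    }

    char[] termBuffer() { return termBuffer; }

    int termLength() { return termLength; }

    // Allocates a new String on every call, which is the cost raised above.
    String termText() { return new String(termBuffer, 0, termLength); }

    public static void main(String[] args) {
        SketchToken t = new SketchToken();
        t.setTermText("analyzer");                 // a String works
        t.setTermText(new StringBuilder("token")); // so does a StringBuilder
        System.out.println(t.termText() + " " + t.termLength()); // prints "token 5"
    }
}
```

Note that this does not remove the costs Michael described: the setter
still copies the chars, and termText() still creates a new String each
time. It only lets callers pass whichever character sequence type they
already have.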