[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748071#action_12748071 ]
Tim Smith edited comment on LUCENE-1859 at 8/26/09 11:31 AM:
-------------------------------------------------------------

bq. The worst-case scenario seems kind of theoretical

100% agree, but even if one extremely large token gets added to the stream (and possibly dropped prior to indexing), the char[] grows without ever shrinking back, so memory usage can grow whenever "bad" content is thrown in (and people have no shortage of bad content).

bq. Is a priority of "major" justified?

Major is just the default priority (feel free to change it).

bq. I assume that, based on this report, TermAttributeImpl never gets reset or discarded/recreated over the course of an indexing session?

Using a reusable TokenStream will never cause the buffer to be nulled (as far as I can tell) for the lifetime of the thread (please correct me if I'm wrong on this).

I would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing it to be updated statically), just as a means to bound the maximum memory used here. Currently, the memory use is bounded only by Integer.MAX_VALUE (which is really big). If someone feeds in a large text document with no spaces or other delimiting characters, a "non-intelligent" tokenizer would treat it as one big token (and grow the char[] accordingly).

> TermAttributeImpl's buffer will never "shrink" if it grows too big
> ------------------------------------------------------------------
>
>                 Key: LUCENE-1859
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1859
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> This was also an issue with Token previously as well.
> If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory.
> Obviously, it can be argued that Tokenizers should never emit "large" tokens; however, it seems that TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down once the next token smaller than MAX_BUFFER_SIZE is set.
> I don't think I have actually encountered issues with this yet; however, it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst-case scenario).
> Perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
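The shrink behavior proposed above could look roughly like the following sketch. This is not Lucene's actual TermAttributeImpl code; the class name, the default MAX_BUFFER_SIZE value, and the choice to shrink back to the incoming term's length are all illustrative assumptions.

```java
// Hypothetical sketch of a term buffer that grows on demand but also
// shrinks once a huge token has left an oversized buffer behind.
public class ShrinkingTermBuffer {
    // Assumed cap; the comment above suggests making this statically tunable.
    public static int MAX_BUFFER_SIZE = 16 * 1024;

    private char[] termBuffer = new char[16];
    private int termLength;

    public void setTermBuffer(char[] src, int offset, int length) {
        if (termBuffer.length < length) {
            // Grow: allocate at least 'length', doubling for amortized cost.
            termBuffer = new char[Math.max(length, termBuffer.length * 2)];
        } else if (termBuffer.length > MAX_BUFFER_SIZE && length <= MAX_BUFFER_SIZE) {
            // Shrink: a previous oversized token inflated the buffer beyond
            // the cap; reclaim the memory now that the new term fits under it.
            termBuffer = new char[Math.max(length, 16)];
        }
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    public int bufferCapacity() { return termBuffer.length; }

    public String term() { return new String(termBuffer, 0, termLength); }

    public static void main(String[] args) {
        ShrinkingTermBuffer b = new ShrinkingTermBuffer();
        b.setTermBuffer(new char[100000], 0, 100000); // one pathological token
        System.out.println("after huge token: " + b.bufferCapacity());
        b.setTermBuffer("hello".toCharArray(), 0, 5); // next normal token
        System.out.println("after small token: " + b.bufferCapacity());
    }
}
```

Without the shrink branch, the 100000-char allocation would persist for the lifetime of the (reused) attribute; with it, the next ordinary token drops the buffer back under the cap.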