Re: OutOfMemoryError indexing large documents

2014-11-26 Thread ryanb
I've had success limiting the number of documents indexed at once based on their size, and indexing them one at a time works OK with a 2 GB heap. I'm also hoping to understand why memory usage would be so high to begin with, or maybe this is expected? I agree that indexing 100+ MB of text is a bit silly, but the use case is a legal
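
For anyone following along, here is a minimal sketch of that kind of size-based throttling, assuming a fixed byte budget; the 200 MB budget and the class and method names are illustrative, not from the original post:

import java.io.IOException;
import java.util.concurrent.Semaphore;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Hypothetical sketch: cap the total megabytes of raw text being analyzed at
// once, so several 100+ MB documents cannot be in flight at the same time.
public class SizeBoundedIndexer {
    // Assumed budget: roughly 200 MB of document text in flight at any time.
    private static final int BUDGET_MB = 200;
    private final Semaphore permits = new Semaphore(BUDGET_MB);

    public void index(IndexWriter writer, Document doc, int docSizeMb)
            throws IOException, InterruptedException {
        // A very large document takes the whole budget, forcing it to run alone.
        int needed = Math.min(docSizeMb, BUDGET_MB);
        permits.acquire(needed);
        try {
            writer.addDocument(doc);
        } finally {
            permits.release(needed);
        }
    }
}

Small documents still index concurrently under this scheme; only the oversized ones end up serialized.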

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread ryanb
100 MB of text for a single Lucene document, into a single analyzed field. The analyzer is basically the StandardAnalyzer, with minor changes: 1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't split URLs and email addresses (so we can do it ourselves in the next step). 2.
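
A sketch of what such an analyzer might look like against Lucene 4.9, assuming the rest of the chain mirrors StandardAnalyzer (StandardFilter, LowerCaseFilter, StopFilter); the class name is made up:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.util.Version;

// Like StandardAnalyzer, but tokenizes with UAX29URLEmailTokenizer so that
// URLs and email addresses come through as single tokens.
public final class UrlEmailAnalyzer extends Analyzer {
    private static final Version V = Version.LUCENE_4_9;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new UAX29URLEmailTokenizer(V, reader);
        TokenStream sink = new StandardFilter(V, source);
        sink = new LowerCaseFilter(V, sink);
        sink = new StopFilter(V, sink, StandardAnalyzer.STOP_WORDS_SET);
        return new TokenStreamComponents(source, sink);
    }
}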

OutOfMemoryError indexing large documents

2014-11-25 Thread ryanb
Hello, We use vanilla Lucene 4.9.0 on a 64-bit Linux OS. We sometimes need to index large documents (100+ MB), but this results in extremely high memory usage, to the point of OutOfMemoryError even with a 17 GB heap. We allow up to 20 documents to be indexed simultaneously, but the text to be
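
Not part of the original message, but for reference, a sketch of an IndexWriter setup for Lucene 4.9 with the RAM-related knobs spelled out; the 256/512 MB figures are placeholders, and note that flushes happen at document boundaries, so the in-memory postings for a single huge document are still built in full before anything can be flushed:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Sketch: open an IndexWriter with explicit RAM limits (Lucene 4.9 API).
public class WriterSetup {
    public static IndexWriter open(String path) throws Exception {
        Directory dir = FSDirectory.open(new File(path));
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_4_9, new StandardAnalyzer(Version.LUCENE_4_9));
        cfg.setRAMBufferSizeMB(256);           // flush the in-memory segment at ~256 MB
        cfg.setRAMPerThreadHardLimitMB(512);   // force a flush per indexing thread at 512 MB
        return new IndexWriter(dir, cfg);
    }
}

With 20 documents of 100+ MB each being analyzed concurrently, the per-document buffering alone can plausibly exceed a heap of that size, which is why throttling by size (as in the earlier reply) helps more than raising the RAM buffer.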