Re: OutOfMemoryError indexing large documents
Is that 100MB for a single Lucene document? And is it all in a single field? Is that field analyzed text? How complex is the analyzer? Like, does it do ngrams or something else that is token- or memory-intensive? Posting the analyzer might help us see what the issue might be.

Try indexing only one document at a time - maybe GC is occurring due to activity on one stream, and then the parallel streams are trying to index while the GC is in progress. Alternatively, try running with a much smaller heap, since a large heap means GC will take longer.

You might consider a strategy where only one large document can be processed at a time - have other threads pause if a large document is currently being processed, or maybe allow only a few large documents to be processed at the same time.

What is your average document size? I mean, are the large documents a rarity, so that the above strategy would be reasonable, or do you need to process large numbers of large documents?

-- Jack Krupansky

-----Original Message-----
From: ryanb
Sent: Tuesday, November 25, 2014 7:39 PM
To: java-user@lucene.apache.org
Subject: OutOfMemoryError indexing large documents

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
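The "only a few large documents at a time" strategy Jack suggests can be sketched with a plain java.util.concurrent.Semaphore that gates only oversized documents (a hypothetical illustration - the class name, threshold, and permit count are invented, not from this thread):

```java
import java.util.concurrent.Semaphore;

// Hypothetical throttle: documents at or above a size threshold must take
// one of a small number of "large document" permits before being indexed,
// so at most maxLarge oversized documents are analyzed concurrently.
// Small documents pass through unthrottled.
public class LargeDocThrottle {
    private final long thresholdBytes;
    private final Semaphore largePermits;

    public LargeDocThrottle(long thresholdBytes, int maxLarge) {
        this.thresholdBytes = thresholdBytes;
        this.largePermits = new Semaphore(maxLarge);
    }

    /** Call before IndexWriter.addDocument(); blocks if too many large docs are in flight. */
    public void acquire(long docSizeBytes) {
        if (docSizeBytes >= thresholdBytes) {
            largePermits.acquireUninterruptibly();
        }
    }

    /** Call in a finally block after indexing the document completes. */
    public void release(long docSizeBytes) {
        if (docSizeBytes >= thresholdBytes) {
            largePermits.release();
        }
    }

    public int availableLargePermits() {
        return largePermits.availablePermits();
    }
}
```

With maxLarge = 1 this degenerates to Jack's "one large document at a time" variant, while small documents still index in parallel.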
Re: OutOfMemoryError indexing large documents
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson erickerick...@gmail.com wrote:

> Well 2 seriously consider the utility of indexing a 100+M file. Assuming
> it's mostly text, lots and lots and lots of queries will match it, and
> it'll score pretty low due to length normalization. And you probably
> can't return it to the user. And highlighting it will be a performance
> problem. And may blow out memory too. And...

Meanwhile, some of our users have expressed concern that they can't view a 2GB text file which was returned in a Lucene result. They even want to see the term hits, and expect that to somehow perform the same as a small file. Totally unreasonable. :)

TX
Re: OutOfMemoryError indexing large documents
I've had success limiting the number of documents by size, and indexing them one at a time works OK with a 2G heap. I'm also hoping to understand why memory usage would be so high to begin with - or maybe this is expected?

I agree that indexing 100+M of text is a bit silly, but the use case is a legal context, where you need to be able to find, and eventually view, all of the documents matching a query (even if they are 100+M).

Thanks Erick!

--
View this message in context: http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983p4171212.html
Re: OutOfMemoryError indexing large documents
100MB of text for a single Lucene document, into a single analyzed field. The analyzer is basically the StandardAnalyzer, with minor changes:

1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't split URLs and email addresses (so we can do it ourselves in the next step).
2. Split tokens into components, e.g. f...@bar.com emits all of f...@bar.com, foo, bar, bar.com. So the full token, the individual parts, and some 2-grams.

I've been doing all my testing with the HotSpot ParallelOldGC collector, which is entirely stop-the-world, so I don't think indexing can be simultaneous with GC. However, I tried indexing one document at a time with a smaller 2G heap, and that works. I'm also having success with a strategy that limits the number of documents being indexed by their size - this was a good idea.

I still don't understand how the RAM buffer size of 64MB can be exceeded by so much, though. Average document size is much smaller, definitely below 100K. Handling large documents is relatively atypical, but when we do get them, there are a relatively large number of them to be processed together.

--
View this message in context: http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983p4171218.html
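The component-splitting step in (2) can be illustrated with plain string logic (a hypothetical sketch, not the poster's actual TokenFilter; exactly which 2-grams get emitted is an assumption based on the example given):

```java
import java.util.ArrayList;
import java.util.List;

public class ComponentSplitter {
    // Sketch of splitting an email/URL token into components: emit the full
    // token, each individual component, and adjacent 2-gram components
    // rejoined with their original separator ('@' or '.').
    public static List<String> splitComponents(String token) {
        List<String> out = new ArrayList<>();
        out.add(token); // the full token itself

        String[] parts = token.split("[@.]");
        List<String> seps = new ArrayList<>();
        for (char c : token.toCharArray()) {
            if (c == '@' || c == '.') seps.add(String.valueOf(c));
        }

        for (String p : parts) out.add(p); // individual components
        for (int i = 0; i + 1 < parts.length; i++) {
            // adjacent 2-grams, e.g. "bar.com" from "foo@bar.com"
            out.add(parts[i] + seps.get(i) + parts[i + 1]);
        }
        return out;
    }
}
```

Note that every component of a 100MB document's tokens multiplies the token stream this way, which adds to the per-document analysis footprint.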
OutOfMemoryError indexing large documents
Hello,

We use vanilla Lucene 4.9.0 on a 64-bit Linux OS. We sometimes need to index large documents (100+ MB), but this results in extremely high memory usage, to the point of OutOfMemoryError even with 17GB of heap. We allow up to 20 documents to be indexed simultaneously, but the text to be analyzed and indexed is streamed, not loaded into memory all at once. Any suggestions for how to troubleshoot, or ideas about the problem, are greatly appreciated!

Some details about our setup (let me know what other information will help):

- We use MMapDirectory wrapped in a NRTCachingDirectory
- RAM buffer size of 64MB
- No compound files
- We commit every 20 seconds

Thanks,
Ryan

--
View this message in context: http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983.html
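For reference, the directory and writer setup described above might look roughly like this under Lucene 4.9-era APIs (a sketch, not the poster's code; the index path, NRT cache sizes, and analyzer choice are placeholders):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NRTCachingDirectory;
import org.apache.lucene.util.Version;

public class WriterSetup {
    static IndexWriter openWriter(File indexDir) throws IOException {
        // MMapDirectory wrapped in an NRTCachingDirectory, as described:
        // NRTCachingDirectory(delegate, maxMergeSizeMB, maxCachedMB)
        Directory base = new MMapDirectory(indexDir);
        Directory dir = new NRTCachingDirectory(base, 5.0, 60.0);

        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        iwc.setRAMBufferSizeMB(64.0);   // RAM buffer size of 64MB
        iwc.setUseCompoundFile(false);  // no compound files

        return new IndexWriter(dir, iwc);
    }
}
```

Worth noting: the RAM buffer only bounds the flushed in-memory index structures; the per-document analysis state and any transient buffers created while consuming the token stream are on top of that, which is one reason heap use can exceed 64MB by a wide margin during indexing.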
Re: OutOfMemoryError indexing large documents
Well,

1) Don't send 20 docs at once. Or send docs over some size N by themselves.

2) Seriously consider the utility of indexing a 100+M file. Assuming it's mostly text, lots and lots and lots of queries will match it, and it'll score pretty low due to length normalization. And you probably can't return it to the user. And highlighting it will be a performance problem. And may blow out memory too. And...

May be an XY problem.

Best,
Erick

On Tue, Nov 25, 2014 at 4:39 PM, ryanb ryanbl...@everlaw.com wrote: