Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Jack Krupansky
Is that 100MB for a single Lucene document? And is that 100MB for a single 
field? Is that field analyzed text? How complex is the analyzer? Like, does 
it do ngrams or something else that is token- or memory-intensive? Posting 
the analyzer might help us see what the issue is.


Try indexing only one document at a time - maybe GC is being triggered by 
activity on one stream while the parallel streams keep trying to index 
during the collection.


Alternatively, try running with a much smaller heap, since a larger heap 
means each GC cycle takes longer.
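On a standalone JVM that's just the -Xmx flag (the jar name below is only a 
placeholder, not anything from this thread):

  java -Xmx2g -XX:+PrintGCDetails -jar your-indexer.jar

-XX:+PrintGCDetails is optional, but it makes it easy to see how long the 
pauses actually are.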


You might consider a strategy where only one large document can be processed 
at a time - have other threads pause if a large document is currently being 
processed, or maybe allow only a few large documents to be processed at the 
same time.
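Something like this, just as a sketch - the class name, the 10MB cutoff, and 
the permit count are placeholders, and it assumes a shared IndexWriter:

  import java.util.concurrent.Semaphore;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  public class LargeDocumentThrottle {
      // Assumption: anything over 10MB of text counts as "large".
      private static final long LARGE_THRESHOLD_BYTES = 10L * 1024 * 1024;
      // Allow at most two large documents to be analyzed/indexed at once.
      private final Semaphore largePermits = new Semaphore(2);

      public void index(IndexWriter writer, Document doc, long textSizeBytes)
              throws java.io.IOException, InterruptedException {
          boolean large = textSizeBytes > LARGE_THRESHOLD_BYTES;
          if (large) {
              largePermits.acquire(); // small documents never block here
          }
          try {
              writer.addDocument(doc);
          } finally {
              if (large) {
                  largePermits.release();
              }
          }
      }
  }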


What is your average document size? I mean, are the large documents rare 
enough that the above strategy would be reasonable, or do you need to 
process large numbers of large documents?


-- Jack Krupansky

-Original Message- 
From: ryanb

Sent: Tuesday, November 25, 2014 7:39 PM
To: java-user@lucene.apache.org
Subject: OutOfMemoryError indexing large documents

Hello,

We use vanilla Lucene 4.9.0 in a 64 bit Linux OS. We sometimes need to index
large documents (100+ MB), but this results in extremely high memory usage,
to the point of OutOfMemoryError even with 17GB of heap. We allow up to 20
documents to be indexed simultaneously, but the text to be analyzed and
indexed is streamed, not loaded into memory all at once.

Any suggestions for how to troubleshoot or ideas about the problem are
greatly appreciated!

Some details about our setup (let me know what other information will help):
- Use MMapDirectory wrapped in an NRTCachingDirectory
- RamBufferSize 64MB
- No compound files
- We commit every 20 seconds

Thanks,
Ryan



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Trejkaz
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson erickerick...@gmail.com wrote:
 Well
 2) Seriously consider the utility of indexing a 100+MB file. Assuming
 it's mostly text, lots and lots and lots of queries will match it, and
 it'll score pretty low due to length normalization. And you probably
 can't return it to the user. And highlighting it will be a performance
 problem. And may blow out memory too. And...

Meanwhile, some of our users have expressed concern that they can't
view a 2GB text file which was returned in a Lucene result. They even
want to see the term hits and expect that to somehow perform the same
as a small file. Totally unreasonable. :)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: OutOfMemoryError indexing large documents

2014-11-26 Thread ryanb
I've had success limiting the number of documents by size, and doing them one
at a time works OK with a 2GB heap. I'm also hoping to understand why memory
usage would be so high to begin with, or maybe this is expected?

I agree that indexing 100+MB of text is a bit silly, but the use case is a
legal context where you need to be able to see, and eventually review, all
of the documents matching a query (even if they are 100+MB).

Thanks Erick!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983p4171212.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: OutOfMemoryError indexing large documents

2014-11-26 Thread ryanb
100MB of text for a single Lucene document, into a single analyzed field. The
analyzer is basically the StandardAnalyzer, with minor changes:
1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't
split URLs and email addresses (so we can do it ourselves in the next step).
2. Split tokens that have components, e.g. foo@bar.com emits all of
foo@bar.com, foo, bar, and bar.com - so the full token, its individual
parts, and some 2-grams.
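
For reference, roughly what that chain looks like against Lucene 4.9 (not the
actual code - EmailComponentsFilter is a stand-in for the custom splitting
step in item 2):

  import java.io.Reader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.core.StopFilter;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.standard.StandardFilter;
  import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
  import org.apache.lucene.util.Version;

  public final class EmailAwareAnalyzer extends Analyzer {
      private static final Version MATCH_VERSION = Version.LUCENE_4_9;

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
          // Keep URLs and email addresses as single tokens instead of splitting them.
          UAX29URLEmailTokenizer source = new UAX29URLEmailTokenizer(MATCH_VERSION, reader);
          TokenStream stream = new StandardFilter(MATCH_VERSION, source);
          stream = new LowerCaseFilter(MATCH_VERSION, stream);
          stream = new StopFilter(MATCH_VERSION, stream, StandardAnalyzer.STOP_WORDS_SET);
          // Hypothetical custom filter that re-emits foo@bar.com as
          // foo@bar.com, foo, bar, bar.com (plus some 2-grams):
          // stream = new EmailComponentsFilter(stream);
          return new TokenStreamComponents(source, stream);
      }
  }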

I've been doing all my testing with the HotSpot ParallelOldGC collector,
which is entirely stop-the-world, so I don't think indexing can run
concurrently with GC. However, I tried indexing one document at a time with
a smaller 2GB heap, and that works. I'm also having success with a strategy
that limits the number of documents being indexed by their size - this was a
good idea. I still don't understand how the 64MB RAM buffer can be exceeded
by so much, though.

Average document size is much smaller, definitely below 100KB. Large
documents are relatively atypical, but when we do get them, there tend to be
a large number of them to process together.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983p4171218.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



OutOfMemoryError indexing large documents

2014-11-25 Thread ryanb
Hello,

We use vanilla Lucene 4.9.0 in a 64 bit Linux OS. We sometimes need to index
large documents (100+ MB), but this results in extremely high memory usage,
to the point of OutOfMemoryError even with 17GB of heap. We allow up to 20
documents to be indexed simultaneously, but the text to be analyzed and
indexed is streamed, not loaded into memory all at once.

Any suggestions for how to troubleshoot or ideas about the problem are
greatly appreciated!

Some details about our setup (let me know what other information will help):
- Use MMapDirectory wrapped in an NRTCachingDirectory
- RamBufferSize 64MB
- No compound files
- We commit every 20 seconds
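
For reference, roughly what that setup looks like in Lucene 4.9 (not our
actual code - the index path, analyzer, and NRT cache sizes are just
placeholders):

  import java.io.File;
  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.MMapDirectory;
  import org.apache.lucene.store.NRTCachingDirectory;
  import org.apache.lucene.util.Version;

  public class IndexSetup {
      public static IndexWriter openWriter(File indexDir) throws IOException {
          // MMapDirectory wrapped in an NRTCachingDirectory (cache sizes are placeholders).
          MMapDirectory mmap = new MMapDirectory(indexDir);
          NRTCachingDirectory dir = new NRTCachingDirectory(mmap, 5.0, 60.0);

          IndexWriterConfig iwc = new IndexWriterConfig(
                  Version.LUCENE_4_9, new StandardAnalyzer(Version.LUCENE_4_9));
          iwc.setRAMBufferSizeMB(64.0);   // flush the in-memory buffer at ~64MB
          iwc.setUseCompoundFile(false);  // no compound files

          // A separate scheduled task calls writer.commit() every 20 seconds.
          return new IndexWriter(dir, iwc);
      }
  }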

Thanks,
Ryan



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: OutOfMemoryError indexing large documents

2014-11-25 Thread Erick Erickson
Well
1) Don't send 20 docs at once. Or send docs over some size N by themselves.

2) Seriously consider the utility of indexing a 100+MB file. Assuming
it's mostly text, lots and lots and lots of queries will match it, and
it'll score pretty low due to length normalization. And you probably
can't return it to the user. And highlighting it will be a performance
problem. And may blow out memory too. And...

May be an XY problem.
Best,
Erick

On Tue, Nov 25, 2014 at 4:39 PM, ryanb ryanbl...@everlaw.com wrote:
 Hello,

 We use vanilla Lucene 4.9.0 in a 64 bit Linux OS. We sometimes need to index
 large documents (100+ MB), but this results in extremely high memory usage,
 to the point of OutOfMemoryError even with 17GB of heap. We allow up to 20
 documents to be indexed simultaneously, but the text to be analyzed and
 indexed is streamed, not loaded into memory all at once.

 Any suggestions for how to troubleshoot or ideas about the problem are
 greatly appreciated!

 Some details about our setup (let me know what other information will help):
 - Use MMapDirectory wrapped in an NRTCachingDirectory
 - RamBufferSize 64MB
 - No compound files
 - We commit every 20 seconds

 Thanks,
 Ryan



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org