100MB of text for a single Lucene document, into a single analyzed field. The
analyzer is basically the StandardAnalyzer, with minor changes:
1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This keeps URLs
and email addresses as single tokens (so we can split them ourselves in the
next step).
2. Split tokens into components, e.g. f...@bar.com emits all of f...@bar.com,
foo, bar, and bar.com: the full token, its individual parts, and some
2-grams. (A rough sketch of the analyzer chain is below.)
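
For concreteness, here is roughly what that chain could look like. This is
only a sketch, not our actual code: the ComponentSplittingFilter name and its
splitting rules are made up (it splits on '@' and '.' and adds adjacent
2-grams, which is close to but not exactly the behaviour described above),
and it assumes the Lucene 5.x-style createComponents(String) API; in 4.x the
tokenizer takes a Reader.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class UrlEmailComponentAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Keeps URLs and email addresses as single tokens instead of splitting them.
        Tokenizer source = new UAX29URLEmailTokenizer();
        // StandardAnalyzer's LowerCaseFilter/StopFilter would normally go here; omitted.
        TokenStream stream = new ComponentSplittingFilter(source);
        return new TokenStreamComponents(source, stream);
    }

    // For a token such as foo@bar.com, also emits the parts (foo, bar, com) and
    // adjacent 2-grams (foo.bar, bar.com) stacked at the same position.
    static final class ComponentSplittingFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);
        private final Deque<String> pending = new ArrayDeque<>();
        private State state;

        ComponentSplittingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!pending.isEmpty()) {
                restoreState(state);                      // reuse the original token's offsets
                termAtt.setEmpty().append(pending.poll());
                posIncAtt.setPositionIncrement(0);        // same position as the full token
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAtt.toString();
            if (term.indexOf('@') >= 0 || term.indexOf('.') >= 0) {
                String[] parts = term.split("[@.]");
                for (String part : parts) {
                    if (!part.isEmpty()) {
                        pending.add(part);
                    }
                }
                for (int i = 0; i + 1 < parts.length; i++) {
                    pending.add(parts[i] + "." + parts[i + 1]); // simple 2-grams over the parts
                }
                state = captureState();
            }
            return true;                                  // the full token is emitted unchanged
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending.clear();
            state = null;
        }
    }
}

Stacking the components with a position increment of 0 keeps them at the same
position as the full token; the exact set of parts and 2-grams to emit is
whatever your rules say.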

I've been doing all my testing with the HotSpot ParallelOldGC collector, which
is entirely stop-the-world, so I don't think indexing can run concurrently
with GC. However, I tried indexing one document at a time with a smaller 2G
heap, and that works. I'm also having success with a strategy that limits the
number of documents being indexed concurrently by their size; this was a good
idea (a rough sketch is below). I still don't understand how the 64MB RAM
buffer can be exceeded by so much, though.
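
In case it helps anyone else, the size limiting can be as simple as a shared
budget that indexing threads reserve before calling addDocument. The class
below is only a sketch of that idea, not our actual code; the names and the
kilobyte granularity are made up:

import java.util.concurrent.Semaphore;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

final class SizeThrottledIndexer {
    private final IndexWriter writer;
    private final Semaphore budget;   // permits are kilobytes of in-flight text
    private final int maxPermits;

    SizeThrottledIndexer(IndexWriter writer, int maxInFlightKb) {
        this.writer = writer;
        this.budget = new Semaphore(maxInFlightKb, true);
        this.maxPermits = maxInFlightKb;
    }

    void index(Document doc, long approxSizeBytes) throws Exception {
        // A document bigger than the whole budget is still indexed, just on its own.
        int permits = (int) Math.min(maxPermits, Math.max(1, approxSizeBytes / 1024));
        budget.acquire(permits);
        try {
            writer.addDocument(doc);
        } finally {
            budget.release(permits);
        }
    }
}

The RAM buffer itself is still configured separately, via
IndexWriterConfig.setRAMBufferSizeMB(64); the throttle only limits how much
raw text is being analyzed at any one time.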

The average document size is much smaller, definitely below 100K. Large
documents are relatively atypical, but when we do get them, there tend to be
a lot of them to process together.


