Re: Lucene Indexing out of memory

2010-03-15 Thread ajay_gupta
Hi Michael and others, I did get some hint for my problem. There was a bug in the code which was eating up the memory, which I figured out after a lot of effort. Thanks to all of you for your suggestions. Regards, Ajay Michael McCandless-2 wrote: I agree, memory profiler or heap dump or small

Re: Lucene Indexing out of memory

2010-03-15 Thread ajay_gupta
Erick, I did get some hint for my problem. There was a bug in the code which was eating up the memory, which I figured out after a lot of effort. Thanks to all of you for your suggestions. But I still feel it takes a lot of time to index documents. It's taking around an hour or more for indexing 330 MB

Re: Lucene Indexing out of memory

2010-03-15 Thread Michael McCandless
Try the ideas here? http://wiki.apache.org/lucene-java/ImproveIndexingSpeed Mike On Mon, Mar 15, 2010 at 1:51 AM, ajay_gupta ajay...@gmail.com wrote: Erick, I did get some hint for my problem. There was a bug in the code which was eating up the memory which I figured out after lot of
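A minimal sketch of the kind of settings that wiki page recommends, assuming the Lucene 2.9/3.0-era API; the buffer size and merge factor values below are illustrative, not tuned:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Sketch only: open one IndexWriter, reuse it for all documents, and
    // let it flush by RAM usage rather than by document count.
    FSDirectory dir = FSDirectory.open(new File("index"));
    IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(64.0);   // larger buffer = fewer flushes
    writer.setMergeFactor(10);         // higher values speed indexing, slow searching
    // ... addDocument() calls go here, all through the single reused writer ...
    writer.close();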

Re: Lucene Indexing out of memory

2010-03-04 Thread Ian Lea
Have you run it through a memory profiler yet? Seems the obvious next step. If that doesn't help, cut it down to the simplest possible self-contained program that demonstrates the problem and post it here. -- Ian. On Thu, Mar 4, 2010 at 6:04 AM, ajay_gupta ajay...@gmail.com wrote: Erick,

Re: Lucene Indexing out of memory

2010-03-04 Thread Michael McCandless
I agree, memory profiler or heap dump or small test case is the next step... the code looks fine. This is always a single thread adding docs? Are you really certain that the iterator only iterates over 2500 docs? What analyzer are you using? Mike On Thu, Mar 4, 2010 at 4:50 AM, Ian Lea

Re: Lucene Indexing out of memory

2010-03-03 Thread ajay_gupta
Ian, the OOM exception point varies; it is not fixed. It could come anywhere once memory exceeds a certain point. I have allocated 1 GB of memory for the JVM. I haven't used a profiler. When I said it fails after 70K docs, I meant approximately 70K documents, but if I reduce memory then it will OOM before 70K, so it's not

Re: Lucene Indexing out of memory

2010-03-03 Thread Ian Lea
Lucene doesn't load everything into memory and can carry on running consecutive searches or loading documents forever without hitting OOM exceptions. So if it isn't failing on a specific document, the most likely cause is that your program is hanging on to something it shouldn't. Previous docs?

Re: Lucene Indexing out of memory

2010-03-03 Thread Erick Erickson
Interpolating from your data (and, by the way, some code examples would help a lot), if you're reopening the index reader to pick up recent additions but not closing it if a different one is returned from reopen, you'll consume resources. From the JavaDocs... IndexReader new = r.reopen(); if
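Spelled out, the reopen pattern being quoted looks roughly like this (the old JavaDocs use "new" as a placeholder variable name; a compilable sketch, assuming the pre-3.5 reopen() API):

    // If reopen() returns a different instance, the old reader must be
    // closed, otherwise its files and memory are never released.
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
        reader.close();      // release resources held by the stale reader
        reader = newReader;  // search against the refreshed reader from now on
    }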

Re: Lucene Indexing out of memory

2010-03-03 Thread Michael McCandless
The worst case RAM usage for Lucene is a single doc with many unique terms. Lucene allocates ~60 bytes per unique term (plus space to hold that term's characters = 2 bytes per char). And, Lucene cannot flush within one document -- it must flush after the doc has been fully indexed. This past
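As a rough back-of-the-envelope check using those figures (the term count and average term length below are made-up numbers, only to show the arithmetic):

    // Illustrative estimate: ~60 bytes per unique term + 2 bytes per character.
    long uniqueTerms = 1000000;   // hypothetical unique terms in a single document
    long avgTermLen  = 8;         // hypothetical average term length in chars
    long bytes = uniqueTerms * (60 + 2 * avgTermLen);
    System.out.println(bytes / (1024 * 1024) + " MB");   // ~72 MB for that one document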

Re: Lucene Indexing out of memory

2010-03-03 Thread ajay_gupta
Mike, actually my documents are very small in size. We have CSV files where each record represents a document, which is not very large, so I don't think document size is an issue. For each record I tokenize it, and for each token I keep 3 neighbouring tokens in a Hashtable. After X
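A minimal sketch of the kind of neighbouring-token bookkeeping described here (the identifiers, the one-token window on each side, and the chunked clearing are assumptions, not the poster's actual code):

    // Sketch only: collect neighbouring tokens per token for one chunk of
    // CSV records, then clear the map so it cannot grow across chunks.
    Map<String, List<String>> context = new HashMap<String, List<String>>();
    String[] tokens = record.split(",");   // hypothetical tokenisation of one record
    for (int i = 0; i < tokens.length; i++) {
        List<String> neighbours = context.get(tokens[i]);
        if (neighbours == null) {
            neighbours = new ArrayList<String>();
            context.put(tokens[i], neighbours);
        }
        if (i > 0) neighbours.add(tokens[i - 1]);
        if (i + 1 < tokens.length) neighbours.add(tokens[i + 1]);
    }
    // ... after X records: index the accumulated context, then ...
    context.clear();   // without this the map keeps growing across chunks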

Re: Lucene Indexing out of memory

2010-03-03 Thread Erick Erickson
The first place I'd look is how big your strings got. w_context and context_str come to mind. My first suspicion is that you're building ever-longer strings, and by around 70K documents your strings are large enough to produce OOMs. FWIW Erick On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta

Re: Lucene Indexing out of memory

2010-03-03 Thread ajay_gupta
Erick, w_context and context_str are local to this method and are used only for 2500 documents, not the entire 70K. I am clearing the HashMap after processing each chunk of 2500 docs, and I also printed the memory consumed by the HashMap, which is roughly constant for each chunk. For each invocation of

Re: Lucene Indexing out of memory

2010-03-02 Thread Erick Erickson
I'm not following this entirely, but these docs may be huge by the time you add context for every word in them. You say that you "search the existing indices then I get the content and append". So is it possible that after 70K documents your additions become so huge that you're blowing up? Have

RE: Lucene Indexing out of memory

2010-03-02 Thread Murdoch, Paul
Ajay, I've posted a few times on OOM issues. Here is one thread. http://mail-archives.apache.org/mod_mbox//lucene-java-user/200909.mbox/%3c5b20def02611534db08854076ce825d803626...@sc1exc2.corp.emainc.com%3e I'll try and get some more links to you from some other threads I started for OOM

RE: Lucene Indexing out of memory

2010-03-02 Thread Murdoch, Paul
Ajay, Here is another thread I started on the same issue. http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files Paul -Original Message- From: java-user-return-45254-paul.b.murdoch=saic@lucene.apache.org

Re: Lucene Indexing out of memory

2010-03-02 Thread ajay_gupta
Hi Erick, I tried setting setRAMBufferSizeMB to 200-500 MB as well, but it still hits an OOM error. I thought it's file-based indexing so memory shouldn't be an issue, but you might be right that searching might be using a lot of memory? Is there a way to load documents in chunks or some other way
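For reference, what that setting looks like in code, as a sketch (assuming the 2.9/3.0-era IndexWriter API; 256 is just one value in the 200-500 MB range mentioned):

    // Sketch: a larger RAM buffer makes Lucene flush less often, but it only
    // bounds Lucene's own indexing buffer -- it does not cap memory used by
    // application-side data structures that grow across documents.
    writer.setRAMBufferSizeMB(256.0);
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);  // flush by RAM only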

Re: Lucene Indexing out of memory

2010-03-02 Thread Ian Lea
Where exactly are you hitting the OOM exception? Have you got a stack trace? How much memory are you allocating to the JVM? Have you run a profiler to find out what is using the memory? If it runs OK for 70K docs then fails, 2 possibilities come to mind: either the 70K + 1 doc is particularly
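Two standard ways of getting at Ian's first questions, sketched below: raise the JVM heap explicitly and have HotSpot write a heap dump at the moment of the OOM so it can be opened in a profiler afterwards (the class name, dump path, and heap size are examples only):

    java -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/tmp/indexer-oom.hprof MyIndexer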

Re: Lucene Indexing out of memory

2010-03-02 Thread Erick Erickson
It's not searching that I'm wondering about. The memory size, as far as I understand, really only has document resolution. That is, you can't index a part of a document, flush to disk, then index the rest of the document. The entire document is parsed into memory, and only then flushed to disk if