Have you run it through a memory profiler yet? Seems the obvious next step.
If that doesn't help, cut it down to the simplest possible self-contained program that demonstrates the problem and post it here.

-- Ian.

On Thu, Mar 4, 2010 at 6:04 AM, ajay_gupta <ajay...@gmail.com> wrote:
>
> Erick,
> w_context and context_str are local to this method and are used only for
> one 2,500-document chunk, not the entire 70k. I clear the hashmap after
> each 2,500-doc chunk, and I also print the memory consumed by the hashmap,
> which stays roughly constant from chunk to chunk. So each invocation of
> update_context should use a roughly constant amount of memory, yet after
> each invocation heap usage grows by a few MB, and after about 70k documents
> it goes OOM. Something inside update_context -- some operation like the
> search, update, or add of a document -- seems to be allocating memory that
> is not released when the method returns.
>
> -Ajay
>
>
> Erick Erickson wrote:
>>
>> The first place I'd look is how big your strings
>> get. w_context and context_str come to mind. My
>> first suspicion is that you're building ever-longer
>> strings, and around 70K documents your strings
>> are large enough to produce OOMs.
>>
>> FWIW
>> Erick
>>
>> On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta <ajay...@gmail.com> wrote:
>>
>>> Mike,
>>> Actually my documents are very small. We have CSV files where each
>>> record represents a document, and the records are not very large, so I
>>> don't think document size is the issue.
>>> I tokenize each record, and for each token I keep 3 neighbouring tokens
>>> in a Hashtable. After every X documents (currently X = 2500) I update
>>> the index with the following code:
>>>
>>> // Initialization, done only once at startup
>>> cram = FSDirectory.open(new File("lucenetemp2"));
>>> context_writer = new IndexWriter(cram, analyzer, true,
>>>         IndexWriter.MaxFieldLength.LIMITED);
>>>
>>> // Called after each 2500 docs
>>> void update_context() throws IOException
>>> {
>>>     context_writer.commit();
>>>     context_writer.optimize();
>>>
>>>     IndexSearcher is = new IndexSearcher(cram);
>>>     IndexReader ir = is.getIndexReader();
>>>     Iterator<String> it = context.keySet().iterator();
>>>
>>>     while (it.hasNext())
>>>     {
>>>         String word = it.next();
>>>         // All the context of "word" for the current 2500 docs
>>>         StringBuffer w_context = context.get(word);
>>>         Term t = new Term("Word", word);
>>>         TermQuery tq = new TermQuery(t);
>>>         TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>>>         is.search(tq, collector);
>>>         ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>>
>>>         if (hits.length != 0)
>>>         {
>>>             int id = hits[0].doc;
>>>             TermFreqVector tfv = ir.getTermFreqVector(id, "Context");
>>>             // Builds the context string from the TermFreqVector.
>>>             // E.g. if the TermFreqVector is word1(2), word2(1), word3(2),
>>>             // then context_str = "word1 word1 word2 word3 word3"
>>>             String context_str = getContextString(tfv);
>>>
>>>             w_context.append(context_str);
>>>             Document new_doc = new Document();
>>>             new_doc.add(new Field("Word", word, Field.Store.YES,
>>>                     Field.Index.NOT_ANALYZED));
>>>             new_doc.add(new Field("Context", w_context.toString(),
>>>                     Field.Store.YES, Field.Index.ANALYZED,
>>>                     Field.TermVector.YES));
>>>             context_writer.updateDocument(t, new_doc);
>>>         } else {
>>>             Document new_doc = new Document();
>>>             new_doc.add(new Field("Word", word, Field.Store.YES,
>>>                     Field.Index.NOT_ANALYZED));
>>>             new_doc.add(new Field("Context", w_context.toString(),
>>>                     Field.Store.YES, Field.Index.ANALYZED,
>>>                     Field.TermVector.YES));
>>>             context_writer.addDocument(new_doc);
>>>         }
>>>     }
>>>     ir.close();
>>>     is.close();
>>> }
>>>
>>> I print memory after each invocation of this method, and I observe that
>>> it increases after every call; when the count reaches around 65-70k
>>> documents it goes out of memory. I expected each invocation to take a
>>> constant amount of memory, not to grow cumulatively. After each
>>> invocation of update_context I also call System.gc() to release memory,
>>> and I tried smaller values for
>>> context_writer.setMaxBufferedDocs(),
>>> context_writer.setMaxMergeDocs(), and
>>> context_writer.setRAMBufferSizeMB(),
>>> but nothing worked.
>>>
>>> Any hint would be very helpful.
>>>
>>> Thanks
>>> Ajay
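[A note on the memory printing Ajay describes: a per-invocation heap reading can be taken with the standard Runtime API. The driver below is a hypothetical sketch -- indexChunk and updateContext are stand-ins for the code above, not names from the thread -- but the measurement itself is plain Java:]

    // Hypothetical driver: measure heap growth per 2500-doc chunk.
    public class MemoryProbe {

        // Used heap in MB. System.gc() is only a hint to the JVM, so the
        // readings are approximate, but good enough to spot a steady leak.
        static long usedHeapMb() {
            System.gc();
            Runtime rt = Runtime.getRuntime();
            return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        }

        public static void main(String[] args) throws Exception {
            long before = usedHeapMb();
            for (int chunk = 0; chunk < 28; chunk++) { // ~70k docs / 2500 per chunk
                // indexChunk(chunk);   // stand-in: index the next 2500 docs
                // updateContext();     // stand-in: the method shown above
                long after = usedHeapMb();
                System.out.println("chunk " + chunk + ": used=" + after
                        + " MB, delta=" + (after - before) + " MB");
                before = after;
            }
        }
    }

[If the delta stays positive and roughly constant per chunk, something allocated inside the loop is being retained across invocations.]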
>>>
>>> Michael McCandless-2 wrote:
>>> >
>>> > The worst-case RAM usage for Lucene is a single doc with many unique
>>> > terms. Lucene allocates ~60 bytes per unique term, plus space to hold
>>> > that term's characters (2 bytes per char). And Lucene cannot flush
>>> > within one document -- it must flush after the doc has been fully
>>> > indexed. For scale, a single document with one million unique
>>> > 10-character terms would need roughly 1,000,000 x (60 + 20) bytes,
>>> > i.e. about 80 MB of heap, before it could flush.
>>> >
>>> > This past thread (also from Paul) delves into some of the details:
>>> >
>>> > http://lucene.markmail.org/thread/pbeidtepentm6mdn
>>> >
>>> > But it's not clear whether that is the issue affecting Ajay -- I think
>>> > more details about the docs, or some code fragments, could help shed
>>> > light.
>>> >
>>> > Mike
>>> >
>>> > On Tue, Mar 2, 2010 at 8:47 AM, Murdoch, Paul <paul.b.murd...@saic.com> wrote:
>>> >> Ajay,
>>> >>
>>> >> Here is another thread I started on the same issue:
>>> >>
>>> >> http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files
>>> >>
>>> >> Paul
>>> >>
>>> >> -----Original Message-----
>>> >> From: ajay_gupta (via java-user@lucene.apache.org)
>>> >> Sent: Tuesday, March 02, 2010 8:28 AM
>>> >> To: java-user@lucene.apache.org
>>> >> Subject: Lucene Indexing out of memory
>>> >>
>>> >> Hi,
>>> >> It might be a general question, but I couldn't find the answer yet.
>>> >> I have around 90k documents totalling around 350 MB. Each document
>>> >> contains a record with some text content. For each word in this text
>>> >> I want to store that word's context and index it, so I read each
>>> >> document and, for each word, append a fixed number of surrounding
>>> >> words. To do that, I first search the existing index to see whether
>>> >> the word already exists; if it does, I get its content, append the
>>> >> new context, and update the document. If no context exists yet, I
>>> >> create a document with the fields "word" and "context", set to the
>>> >> word and its context value.
>>> >>
>>> >> I tried this in RAM, but after a certain number of docs it gave an
>>> >> out-of-memory error, so I switched to the FSDirectory approach --
>>> >> yet surprisingly it also gave an OOM error after 70k documents. I
>>> >> have enough disk space but still get this error, and I am not sure
>>> >> why disk-based indexing would hit it. I expected disk-based indexing
>>> >> to be slow but at least scalable.
>>> >> Could someone suggest what the issue could be?
>>> >>
>>> >> Thanks
>>> >> Ajay
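[For readers following along: the context-window collection Ajay describes earlier in the thread -- keeping 3 neighbouring tokens per word in a hashtable of StringBuffers -- might look roughly like the sketch below. The whitespace tokenizer, the window size, and all names here are assumptions; the thread does not show this part of the code.]

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the per-record context collection: for each
    // token, append its neighbouring tokens to that word's StringBuffer.
    public class ContextWindows {

        static final int WINDOW = 3; // neighbours kept on each side (assumed)

        static void collect(String text, Map<String, StringBuffer> context) {
            String[] tokens = text.split("\\s+"); // naive whitespace tokenizer
            for (int i = 0; i < tokens.length; i++) {
                StringBuffer sb = context.get(tokens[i]);
                if (sb == null) {
                    sb = new StringBuffer();
                    context.put(tokens[i], sb);
                }
                int from = Math.max(0, i - WINDOW);
                int to = Math.min(tokens.length - 1, i + WINDOW);
                for (int j = from; j <= to; j++) {
                    if (j != i) sb.append(tokens[j]).append(' ');
                }
            }
        }

        public static void main(String[] args) {
            Map<String, StringBuffer> context = new HashMap<String, StringBuffer>();
            collect("the quick brown fox jumps over the lazy dog", context);
            System.out.println(context.get("fox")); // -> "the quick brown jumps over the "
        }
    }

[Note that buffers like these grow with every occurrence of a word, which is exactly the kind of cumulative growth Erick raises above.]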
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
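[Following Ian's advice at the top of the thread, a minimal self-contained repro might look like the skeleton below. The synthetic documents and field values are hypothetical, but the Lucene calls match the 2.9/3.0-era API used elsewhere in the thread. If heap still grows steadily with this stripped-down version, the leak is in the indexing path rather than in the application's hashtable.]

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Hypothetical minimal repro: index 70k small synthetic docs in
    // 2500-doc chunks and print heap usage at each chunk boundary.
    public class OomRepro {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("oom-repro-index"));
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_29), true,
                    IndexWriter.MaxFieldLength.LIMITED);
            Runtime rt = Runtime.getRuntime();
            for (int i = 0; i < 70000; i++) {
                Document doc = new Document();
                doc.add(new Field("Word", "word" + i, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
                doc.add(new Field("Context", "some context text " + i,
                        Field.Store.YES, Field.Index.ANALYZED,
                        Field.TermVector.YES));
                writer.addDocument(doc);
                if (i % 2500 == 0) {
                    writer.commit();
                    System.out.println(i + " docs, used heap = "
                            + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
                            + " MB");
                }
            }
            writer.close();
        }
    }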