Have you run it through a memory profiler yet? Seems the obvious next step.
If that doesn't help, cut it down to the simplest possible self-contained program that demonstrates the problem and post it here.

-- Ian.

On Thu, Mar 4, 2010 at 6:04 AM, ajay_gupta <ajay...@gmail.com> wrote:
>
> Erick,
> w_context and context_str are local to this method and are used only for
> one 2,500-document chunk, not the entire 70k. I clear the hashmap after
> each 2,500-doc chunk, and I also print the memory consumed by the hashmap,
> which stays roughly constant from chunk to chunk. So each invocation of
> update_context should use a roughly constant amount of memory, yet after
> each invocation heap usage grows by a few MB, and after about 70k documents
> it goes OOM. Something inside update_context -- some operation like the
> search, update, or add of a document -- seems to be allocating memory that
> is not released when the method returns.
>
> -Ajay
>
>
> Erick Erickson wrote:
>>
>> The first place I'd look is how big your strings
>> get. w_context and context_str come to mind. My
>> first suspicion is that you're building ever-longer
>> strings, and around 70K documents your strings
>> are large enough to produce OOMs.
>>
>> FWIW
>> Erick
>>
>> On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta <ajay...@gmail.com> wrote:
>>
>>> Mike,
>>> Actually my documents are very small. We have CSV files where each
>>> record represents a document, and the records are not very large, so I
>>> don't think document size is the issue.
>>> I tokenize each record, and for each token I keep 3 neighbouring tokens
>>> in a Hashtable. After every X documents (currently X = 2500) I update
>>> the index with the following code:
>>>
>>> // Initialization, done only once at startup
>>> cram = FSDirectory.open(new File("lucenetemp2"));
>>> context_writer = new IndexWriter(cram, analyzer, true,
>>>         IndexWriter.MaxFieldLength.LIMITED);
>>>
>>> // Called after each 2500 docs
>>> void update_context() throws IOException
>>> {
>>>     context_writer.commit();
>>>     context_writer.optimize();
>>>
>>>     IndexSearcher is = new IndexSearcher(cram);
>>>     IndexReader ir = is.getIndexReader();
>>>     Iterator<String> it = context.keySet().iterator();
>>>
>>>     while (it.hasNext())
>>>     {
>>>         String word = it.next();
>>>         // All the context of "word" for the current 2500 docs
>>>         StringBuffer w_context = context.get(word);
>>>         Term t = new Term("Word", word);
>>>         TermQuery tq = new TermQuery(t);
>>>         TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
>>>         is.search(tq, collector);
>>>         ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>>
>>>         if (hits.length != 0)
>>>         {
>>>             int id = hits[0].doc;
>>>             TermFreqVector tfv = ir.getTermFreqVector(id, "Context");
>>>             // Builds the context string from the TermFreqVector.
>>>             // E.g. if the TermFreqVector is word1(2), word2(1), word3(2),
>>>             // then context_str = "word1 word1 word2 word3 word3"
>>>             String context_str = getContextString(tfv);
>>>
>>>             w_context.append(context_str);
>>>             Document new_doc = new Document();
>>>             new_doc.add(new Field("Word", word, Field.Store.YES,
>>>                     Field.Index.NOT_ANALYZED));
>>>             new_doc.add(new Field("Context", w_context.toString(),
>>>                     Field.Store.YES, Field.Index.ANALYZED,
>>>                     Field.TermVector.YES));
>>>             context_writer.updateDocument(t, new_doc);
>>>         } else {
>>>             Document new_doc = new Document();
>>>             new_doc.add(new Field("Word", word, Field.Store.YES,
>>>                     Field.Index.NOT_ANALYZED));
>>>             new_doc.add(new Field("Context", w_context.toString(),
>>>                     Field.Store.YES, Field.Index.ANALYZED,
>>>                     Field.TermVector.YES));
>>>             context_writer.addDocument(new_doc);
>>>         }
>>>     }
>>>     ir.close();
>>>     is.close();
>>> }
>>>
>>> I print memory after each invocation of this method, and I observe that
>>> it increases after every call; when the count reaches around 65-70k
>>> documents it goes out of memory. I expected each invocation to take a
>>> constant amount of memory, not to grow cumulatively. After each
>>> invocation of update_context I also call System.gc() to release memory,
>>> and I tried smaller values for
>>> context_writer.setMaxBufferedDocs(),
>>> context_writer.setMaxMergeDocs(), and
>>> context_writer.setRAMBufferSizeMB(),
>>> but nothing worked.
>>>
>>> Any hint would be very helpful.
>>>
>>> Thanks
>>> Ajay
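[A note on the memory printing Ajay describes: a per-invocation heap reading can be taken with the standard Runtime API. The driver below is a hypothetical sketch -- indexChunk and updateContext are stand-ins for the code above, not names from the thread -- but the measurement itself is plain Java:]

    // Hypothetical driver: measure heap growth per 2500-doc chunk.
    public class MemoryProbe {

        // Used heap in MB. System.gc() is only a hint to the JVM, so the
        // readings are approximate, but good enough to spot a steady leak.
        static long usedHeapMb() {
            System.gc();
            Runtime rt = Runtime.getRuntime();
            return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        }

        public static void main(String[] args) throws Exception {
            long before = usedHeapMb();
            for (int chunk = 0; chunk < 28; chunk++) { // ~70k docs / 2500 per chunk
                // indexChunk(chunk);   // stand-in: index the next 2500 docs
                // updateContext();     // stand-in: the method shown above
                long after = usedHeapMb();
                System.out.println("chunk " + chunk + ": used=" + after
                        + " MB, delta=" + (after - before) + " MB");
                before = after;
            }
        }
    }

[If the delta stays positive and roughly constant per chunk, something allocated inside the loop is being retained across invocations.]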
>>>
>>> Michael McCandless-2 wrote:
>>> >
>>> > The worst-case RAM usage for Lucene is a single doc with many unique
>>> > terms. Lucene allocates ~60 bytes per unique term, plus space to hold
>>> > that term's characters (2 bytes per char). And Lucene cannot flush
>>> > within one document -- it must flush after the doc has been fully
>>> > indexed. For scale, a single document with one million unique
>>> > 10-character terms would need roughly 1,000,000 x (60 + 20) bytes,
>>> > i.e. about 80 MB of heap, before it could flush.
>>> >
>>> > This past thread (also from Paul) delves into some of the details:
>>> >
>>> > http://lucene.markmail.org/thread/pbeidtepentm6mdn
>>> >
>>> > But it's not clear whether that is the issue affecting Ajay -- I think
>>> > more details about the docs, or some code fragments, could help shed
>>> > light.
>>> >
>>> > Mike
>>> >
>>> > On Tue, Mar 2, 2010 at 8:47 AM, Murdoch, Paul <paul.b.murd...@saic.com> wrote:
>>> >> Ajay,
>>> >>
>>> >> Here is another thread I started on the same issue:
>>> >>
>>> >> http://stackoverflow.com/questions/1362460/why-does-lucene-cause-oom-when-indexing-large-files
>>> >>
>>> >> Paul
>>> >>
>>> >> -----Original Message-----
>>> >> From: ajay_gupta (via java-user@lucene.apache.org)
>>> >> Sent: Tuesday, March 02, 2010 8:28 AM
>>> >> To: java-user@lucene.apache.org
>>> >> Subject: Lucene Indexing out of memory
>>> >>
>>> >> Hi,
>>> >> It might be a general question, but I couldn't find the answer yet.
>>> >> I have around 90k documents totalling around 350 MB. Each document
>>> >> contains a record with some text content. For each word in this text
>>> >> I want to store that word's context and index it, so I read each
>>> >> document and, for each word, append a fixed number of surrounding
>>> >> words. To do that, I first search the existing index to see whether
>>> >> the word already exists; if it does, I get its content, append the
>>> >> new context, and update the document. If no context exists yet, I
>>> >> create a document with the fields "word" and "context", set to the
>>> >> word and its context value.
>>> >>
>>> >> I tried this in RAM, but after a certain number of docs it gave an
>>> >> out-of-memory error, so I switched to the FSDirectory approach --
>>> >> yet surprisingly it also gave an OOM error after 70k documents. I
>>> >> have enough disk space but still get this error, and I am not sure
>>> >> why disk-based indexing would hit it. I expected disk-based indexing
>>> >> to be slow but at least scalable.
>>> >> Could someone suggest what the issue could be?
>>> >>
>>> >> Thanks
>>> >> Ajay
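[For readers following along: the context-window collection Ajay describes earlier in the thread -- keeping 3 neighbouring tokens per word in a hashtable of StringBuffers -- might look roughly like the sketch below. The whitespace tokenizer, the window size, and all names here are assumptions; the thread does not show this part of the code.]

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the per-record context collection: for each
    // token, append its neighbouring tokens to that word's StringBuffer.
    public class ContextWindows {

        static final int WINDOW = 3; // neighbours kept on each side (assumed)

        static void collect(String text, Map<String, StringBuffer> context) {
            String[] tokens = text.split("\\s+"); // naive whitespace tokenizer
            for (int i = 0; i < tokens.length; i++) {
                StringBuffer sb = context.get(tokens[i]);
                if (sb == null) {
                    sb = new StringBuffer();
                    context.put(tokens[i], sb);
                }
                int from = Math.max(0, i - WINDOW);
                int to = Math.min(tokens.length - 1, i + WINDOW);
                for (int j = from; j <= to; j++) {
                    if (j != i) sb.append(tokens[j]).append(' ');
                }
            }
        }

        public static void main(String[] args) {
            Map<String, StringBuffer> context = new HashMap<String, StringBuffer>();
            collect("the quick brown fox jumps over the lazy dog", context);
            System.out.println(context.get("fox")); // -> "the quick brown jumps over the "
        }
    }

[Note that buffers like these grow with every occurrence of a word, which is exactly the kind of cumulative growth Erick raises above.]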
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
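[Following Ian's advice at the top of the thread, a minimal self-contained repro might look like the skeleton below. The synthetic documents and field values are hypothetical, but the Lucene calls match the 2.9/3.0-era API used elsewhere in the thread. If heap still grows steadily with this stripped-down version, the leak is in the indexing path rather than in the application's hashtable.]

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Hypothetical minimal repro: index 70k small synthetic docs in
    // 2500-doc chunks and print heap usage at each chunk boundary.
    public class OomRepro {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("oom-repro-index"));
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_29), true,
                    IndexWriter.MaxFieldLength.LIMITED);
            Runtime rt = Runtime.getRuntime();
            for (int i = 0; i < 70000; i++) {
                Document doc = new Document();
                doc.add(new Field("Word", "word" + i, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
                doc.add(new Field("Context", "some context text " + i,
                        Field.Store.YES, Field.Index.ANALYZED,
                        Field.TermVector.YES));
                writer.addDocument(doc);
                if (i % 2500 == 0) {
                    writer.commit();
                    System.out.println(i + " docs, used heap = "
                            + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
                            + " MB");
                }
            }
            writer.close();
        }
    }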