Paul, I saw your last post and now understand the issues you face.
I don't think there has been any effort to produce a reduced-memory-footprint
configurable (RMFC) Lucene. With the many mobile, embedded, and other
reduced-memory devices out there, should this perhaps be one of the areas the
Lucene community looks into?

-Glen

2009/9/11 <paul_murd...@emainc.com>:
> Thanks Glen!
>
> I will take a look at your project. Unfortunately I will only have 512 MB
> to 1024 MB to work with, as Lucene is only one component in a larger
> software system running on one machine. I agree with you on the C/C++
> comment; that is what I would normally use for memory-intensive software.
> It turns out that the larger the file you want to index, the larger the
> heap space you will need. What I would like to see is a way to "throttle"
> the indexing process to control the memory footprint. I understand that
> this will take longer, but if I perform the task during off hours it
> shouldn't matter. At least the file will be indexed correctly.
>
> Thanks,
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42272-paul_murdoch=emainc....@lucene.apache.org
> [mailto:java-user-return-42272-paul_murdoch=emainc....@lucene.apache.org]
> On Behalf Of Glen Newton
> Sent: Friday, September 11, 2009 9:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Indexing large files? - No answers yet...
>
> In this project:
> http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
> I concatenate all the text of all the articles of a single journal into
> a single text file. This can create a text file that is 500 MB in size.
> Lucene is OK indexing files of this size (in parallel, even), but I have
> a heap size of 8 GB.
>
> I would suggest increasing your heap to as large as your machine can
> reasonably take. The reality is that Java programs (like Lucene) take up
> more memory than a similar C or even C++ program. Java may approach
> C/C++ in speed, but not in memory.
>
> We don't use Java because of its memory footprint! ;-)
>
> See:
> Programming language shootout, speed:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
> Programming language shootout, memory:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>
> -glen
>
> 2009/9/11 Dan OConnor <docon...@acquiremedia.com>:
>> Paul:
>>
>> My first suggestion would be to update your JVM to the latest version
>> (or at least update 14). There were several garbage-collection-related
>> issues resolved in updates 10 through 13 (especially dealing with large
>> heaps).
>>
>> Next, your IndexWriter parameters would help figure out why you are
>> using so much RAM:
>>
>> getMaxFieldLength()
>> getMaxBufferedDocs()
>> getMaxMergeDocs()
>> getRAMBufferSizeMB()
>>
>> How often are you calling commit?
>> Do you close your IndexWriter after every document?
>> How many documents of this size are you indexing?
>> Have you used Luke to look at your index?
>> If this is a large index, have you optimized it recently?
>> Are there any searches going on while you are indexing?
>>
>> Regards,
>> Dan
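All four of the getters Dan lists exist on IndexWriter in Lucene 2.4, along
with setter counterparts. A minimal diagnostic along those lines might look
like the sketch below; the index path is a placeholder, and the 16 MB cap is
only an illustration of the kind of "throttle" Paul asks about above (a
smaller RAM buffer makes Lucene flush its buffered postings to disk sooner,
trading indexing speed for a smaller footprint).

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class WriterDiagnostics {
        public static void main(String[] args) throws Exception {
            // Placeholder path: point this at the real index directory.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/path/to/index"),
                    new StandardAnalyzer(),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // The four parameters Dan asks about.
            System.out.println("maxFieldLength  = " + writer.getMaxFieldLength());
            System.out.println("maxBufferedDocs = " + writer.getMaxBufferedDocs());
            System.out.println("maxMergeDocs    = " + writer.getMaxMergeDocs());
            System.out.println("ramBufferSizeMB = " + writer.getRAMBufferSizeMB());

            // Illustrative value only: cap the in-memory postings buffer
            // so large inputs are flushed to disk early rather than being
            // held in RAM.
            writer.setRAMBufferSizeMB(16.0);

            writer.close();
        }
    }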
>> -----Original Message-----
>> From: paul_murd...@emainc.com [mailto:paul_murd...@emainc.com]
>> Sent: Friday, September 11, 2009 7:57 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: Indexing large files? - No answers yet...
>>
>> This issue is still open. Any suggestions/help with this would be
>> greatly appreciated.
>>
>> Thanks,
>>
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42080-paul_murdoch=emainc....@lucene.apache.org
>> [mailto:java-user-return-42080-paul_murdoch=emainc....@lucene.apache.org]
>> On Behalf Of paul_murd...@emainc.com
>> Sent: Monday, August 31, 2009 10:28 AM
>> To: java-user@lucene.apache.org
>> Subject: Indexing large files?
>>
>> Hi,
>>
>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I'm
>> consistently receiving "OutOfMemoryError: Java heap space" when trying
>> to index large text files.
>>
>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>> max. heap size. So I increased the max. heap size to 512 MB. This
>> worked for the 5 MB text file, but Lucene still used 84 MB of heap
>> space to do it. Why so much?
>>
>> The class FreqProxTermsWriterPerField appears to be by far the biggest
>> memory consumer, according to JConsole and the TPTP Memory Profiling
>> plugin for Eclipse Ganymede.
>>
>> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
>> max. heap size. Increasing the max. heap size to 1024 MB works, but
>> Lucene uses 826 MB of heap space while doing it. That still seems like
>> far too much memory, and I'm sure larger files would hit the error
>> again, since the memory use appears to grow with file size.
>>
>> I'm on a Windows XP SP2 platform with 2 GB of RAM. So what is the best
>> practice for indexing large files? Here is a code snippet that I'm
>> using:
>>
>> // Index the content of a text file.
>> private boolean saveTXTFile(File textFile, Document textDocument)
>>         throws CIDBException {
>>     try {
>>         boolean isFile = textFile.isFile();
>>         boolean hasTextExtension = textFile.getName().endsWith(".txt");
>>
>>         if (isFile && hasTextExtension) {
>>             System.out.println("File " + textFile.getCanonicalPath()
>>                     + " is being indexed");
>>             Reader textFileReader = new FileReader(textFile);
>>             if (textDocument == null)
>>                 textDocument = new Document();
>>             // Reader-based Field: the content is streamed when the
>>             // document is added, not loaded as a single String.
>>             textDocument.add(new Field("content", textFileReader));
>>             indexWriter.addDocument(textDocument); // BREAKS HERE!!!!
>>         }
>>     } catch (FileNotFoundException fnfe) {
>>         System.out.println(fnfe.getMessage());
>>         return false;
>>     } catch (CorruptIndexException cie) {
>>         throw new CIDBException("The index has become corrupt.");
>>     } catch (IOException ioe) {
>>         System.out.println(ioe.getMessage());
>>         return false;
>>     }
>>     return true;
>> }
>>
>> Thanks much,
>>
>> Paul
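Pulling the thread's suggestions together, the calling side of a method like
Paul's saveTXTFile() might look roughly like the sketch below: one long-lived
IndexWriter shared across all documents, a capped RAM buffer, and a single
commit and close per batch instead of per document. The paths and the 32 MB
cap are placeholders, and the loop body is a simplified stand-in for the
snippet above, so treat this as a shape to adapt rather than a tested fix.

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // One writer for the whole batch; opening and closing a
            // writer per document is one of the anti-patterns Dan's
            // questions are probing for.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/path/to/index"), // placeholder
                    new StandardAnalyzer(),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // Illustrative cap on the in-memory postings buffer, so
            // buffered documents are flushed to disk in pieces instead
            // of accumulating in RAM.
            writer.setRAMBufferSizeMB(32.0);

            File folder = new File("/path/to/textfiles"); // placeholder
            File[] files = folder.listFiles();
            if (files != null) {
                for (File f : files) {
                    if (!f.isFile() || !f.getName().endsWith(".txt"))
                        continue;
                    Reader reader = new FileReader(f);
                    Document doc = new Document();
                    // Stream the content; the tokenizer closes the
                    // reader once the document has been consumed.
                    doc.add(new Field("content", reader));
                    writer.addDocument(doc);
                }
            }

            writer.commit(); // one commit per batch, not per document
            writer.close();
        }
    }

One caveat, if I remember the 2.4 internals right: the RAM buffer is only
checked between documents, so the in-memory postings for a single huge
document (the FreqProxTermsWriterPerField allocations Paul profiled) still
have to fit in the heap until that document finishes. The cap bounds memory
across documents, not within one.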