Paul, I saw your last post and now understand the issues you face.
I don't think there has been any effort to produce a reduced-memory-footprint
configurable (RMFC) Lucene. With the many mobile, embedded, and other
reduced-memory devices out there, should this perhaps be one of the areas the
Lucene community looks into?

-Glen

2009/9/11 <paul_murd...@emainc.com>:
> Thanks Glen!
>
> I will take a look at your project. Unfortunately I will only have 512 MB
> to 1024 MB to work with, as Lucene is only one component in a larger
> software system running on one machine. I agree with you on the C/C++
> comment; that is what I would normally use for memory-intensive software.
> It turns out that the larger the file you want to index, the larger the
> heap space you will need. What I would like to see is a way to "throttle"
> the indexing process to control the memory footprint. I understand that
> this will take longer, but if I perform the task during off hours it
> shouldn't matter. At least the file will be indexed correctly.
>
> Thanks,
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42272-paul_murdoch=emainc....@lucene.apache.org
> [mailto:java-user-return-42272-paul_murdoch=emainc....@lucene.apache.org]
> On Behalf Of Glen Newton
> Sent: Friday, September 11, 2009 9:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Indexing large files? - No answers yet...
>
> In this project:
> http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
> I concatenate all the text of all the articles of a single journal into
> a single text file. This can create a text file that is 500 MB in size.
> Lucene is OK indexing files of this size (in parallel, even), but I have
> a heap size of 8 GB.
>
> I would suggest increasing your heap to as large as your machine can
> reasonably take. The reality is that Java programs (like Lucene) take up
> more memory than a similar C or even C++ program. Java may approach
> C/C++ in speed, but not in memory.
>
> We don't use Java because of its memory footprint! ;-)
>
> See:
> Programming language shootout, speed:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
> Programming language shootout, memory:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>
> -glen
>
> 2009/9/11 Dan OConnor <docon...@acquiremedia.com>:
>> Paul:
>>
>> My first suggestion would be to update your JVM to the latest version
>> (or at least update 14). There were several garbage-collection-related
>> issues resolved in updates 10 through 13 (especially dealing with large
>> heaps).
>>
>> Next, your IndexWriter parameters would help figure out why you are
>> using so much RAM:
>>
>> getMaxFieldLength()
>> getMaxBufferedDocs()
>> getMaxMergeDocs()
>> getRAMBufferSizeMB()
>>
>> How often are you calling commit?
>> Do you close your IndexWriter after every document?
>> How many documents of this size are you indexing?
>> Have you used Luke to look at your index?
>> If this is a large index, have you optimized it recently?
>> Are there any searches going on while you are indexing?
>>
>> Regards,
>> Dan
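All four of the getters Dan lists exist on IndexWriter in Lucene 2.4, along
with setter counterparts. A minimal diagnostic along those lines might look
like the sketch below; the index path is a placeholder, and the 16 MB cap is
only an illustration of the kind of "throttle" Paul asks about above (a
smaller RAM buffer makes Lucene flush its buffered postings to disk sooner,
trading indexing speed for a smaller footprint).

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class WriterDiagnostics {
        public static void main(String[] args) throws Exception {
            // Placeholder path: point this at the real index directory.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/path/to/index"),
                    new StandardAnalyzer(),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // The four parameters Dan asks about.
            System.out.println("maxFieldLength  = " + writer.getMaxFieldLength());
            System.out.println("maxBufferedDocs = " + writer.getMaxBufferedDocs());
            System.out.println("maxMergeDocs    = " + writer.getMaxMergeDocs());
            System.out.println("ramBufferSizeMB = " + writer.getRAMBufferSizeMB());

            // Illustrative value only: cap the in-memory postings buffer
            // so large inputs are flushed to disk early rather than being
            // held in RAM.
            writer.setRAMBufferSizeMB(16.0);

            writer.close();
        }
    }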
>> -----Original Message-----
>> From: paul_murd...@emainc.com [mailto:paul_murd...@emainc.com]
>> Sent: Friday, September 11, 2009 7:57 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: Indexing large files? - No answers yet...
>>
>> This issue is still open. Any suggestions/help with this would be
>> greatly appreciated.
>>
>> Thanks,
>>
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42080-paul_murdoch=emainc....@lucene.apache.org
>> [mailto:java-user-return-42080-paul_murdoch=emainc....@lucene.apache.org]
>> On Behalf Of paul_murd...@emainc.com
>> Sent: Monday, August 31, 2009 10:28 AM
>> To: java-user@lucene.apache.org
>> Subject: Indexing large files?
>>
>> Hi,
>>
>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I'm
>> consistently receiving "OutOfMemoryError: Java heap space" when trying
>> to index large text files.
>>
>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>> max. heap size. So I increased the max. heap size to 512 MB. This
>> worked for the 5 MB text file, but Lucene still used 84 MB of heap
>> space to do it. Why so much?
>>
>> The class FreqProxTermsWriterPerField appears to be by far the biggest
>> memory consumer, according to JConsole and the TPTP Memory Profiling
>> plugin for Eclipse Ganymede.
>>
>> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
>> max. heap size. Increasing the max. heap size to 1024 MB works, but
>> Lucene uses 826 MB of heap space while doing it. That still seems like
>> far too much memory, and I'm sure larger files would hit the error
>> again, since the memory use appears to grow with file size.
>>
>> I'm on a Windows XP SP2 platform with 2 GB of RAM. So what is the best
>> practice for indexing large files? Here is a code snippet that I'm
>> using:
>>
>> // Index the content of a text file.
>> private boolean saveTXTFile(File textFile, Document textDocument)
>>         throws CIDBException {
>>     try {
>>         boolean isFile = textFile.isFile();
>>         boolean hasTextExtension = textFile.getName().endsWith(".txt");
>>
>>         if (isFile && hasTextExtension) {
>>             System.out.println("File " + textFile.getCanonicalPath()
>>                     + " is being indexed");
>>             Reader textFileReader = new FileReader(textFile);
>>             if (textDocument == null)
>>                 textDocument = new Document();
>>             // Reader-based Field: the content is streamed when the
>>             // document is added, not loaded as a single String.
>>             textDocument.add(new Field("content", textFileReader));
>>             indexWriter.addDocument(textDocument); // BREAKS HERE!!!!
>>         }
>>     } catch (FileNotFoundException fnfe) {
>>         System.out.println(fnfe.getMessage());
>>         return false;
>>     } catch (CorruptIndexException cie) {
>>         throw new CIDBException("The index has become corrupt.");
>>     } catch (IOException ioe) {
>>         System.out.println(ioe.getMessage());
>>         return false;
>>     }
>>     return true;
>> }
>>
>> Thanks much,
>>
>> Paul
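Pulling the thread's suggestions together, the calling side of a method like
Paul's saveTXTFile() might look roughly like the sketch below: one long-lived
IndexWriter shared across all documents, a capped RAM buffer, and a single
commit and close per batch instead of per document. The paths and the 32 MB
cap are placeholders, and the loop body is a simplified stand-in for the
snippet above, so treat this as a shape to adapt rather than a tested fix.

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // One writer for the whole batch; opening and closing a
            // writer per document is one of the anti-patterns Dan's
            // questions are probing for.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/path/to/index"), // placeholder
                    new StandardAnalyzer(),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // Illustrative cap on the in-memory postings buffer, so
            // buffered documents are flushed to disk in pieces instead
            // of accumulating in RAM.
            writer.setRAMBufferSizeMB(32.0);

            File folder = new File("/path/to/textfiles"); // placeholder
            File[] files = folder.listFiles();
            if (files != null) {
                for (File f : files) {
                    if (!f.isFile() || !f.getName().endsWith(".txt"))
                        continue;
                    Reader reader = new FileReader(f);
                    Document doc = new Document();
                    // Stream the content; the tokenizer closes the
                    // reader once the document has been consumed.
                    doc.add(new Field("content", reader));
                    writer.addDocument(doc);
                }
            }

            writer.commit(); // one commit per batch, not per document
            writer.close();
        }
    }

One caveat, if I remember the 2.4 internals right: the RAM buffer is only
checked between documents, so the in-memory postings for a single huge
document (the FreqProxTermsWriterPerField allocations Paul profiled) still
have to fit in the heap until that document finishes. The cap bounds memory
across documents, not within one.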