I am aware of the issues with parsing certain PDF documents. I am currently refactoring PDFBox to deal with large documents; you will see this in the next release. I would like to thank everyone for the feedback and for sending problem documents.
Ben Litchfield
http://www.pdfbox.org

On Tue, 18 Feb 2003, Pinky Iyer wrote:

> I am having a similar problem, but indexing PDF documents using the PDFBox
> parser (available at www.pdfbox.com). I get an exception saying "Exception
> in thread "main" java.lang.OutOfMemoryError". Has anybody implemented the
> above code? Any help appreciated.
>
> Thanks!
> PI
>
> Rob Outar <[EMAIL PROTECTED]> wrote:
>
> We are aware of DOM limitations/memory problems, but I am using SAX to
> parse the file and index elements and attributes in my content handler.
>
> Thanks,
>
> Rob
>
> -----Original Message-----
> From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 14, 2003 8:18 PM
> To: Lucene Users List
> Subject: Re: OutOfMemoryException while Indexing an XML file
>
> On Friday 14 February 2003 07:27, Aaron Galea wrote:
>
> > I had this problem when using Xerces to parse XML documents. The problem,
> > I think, lies in the Java garbage collector. The way I solved it was to
> > create
>
> It's unlikely that GC is the culprit. Current collectors are good at
> purging objects that are unreachable, and only throw an OutOfMemoryError
> when they really have no other choice. Usually it's the application that
> holds dangling references to objects, which prevents the GC from collecting
> objects that are no longer useful.
>
> However, it's worth noting that Xerces (and DOM parsers in general)
> generally use more memory than the input XML files they process; this is
> because they usually have to keep the whole document structure in memory,
> and there is overhead on top of the text segments. So memory use is likely
> to be at least 2x the input file size (files usually use UTF-8, which most
> of the time takes 1 byte per character; in memory, 16-bit Unicode chars are
> used for performance), plus some additional overhead for storing element
> structure information and all that.
>
> And since the default maximum Java heap size is 64 MB, big XML files can
> cause problems.
>
> More likely, however, is that references to already-processed DOM trees
> are not nulled in a loop, or something like that? Especially if running
> one JVM process per item solves the problem.
>
> > a shell script that invokes a java program for each XML file that adds
> > it to the index.
>
> -+ Tatu +-
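
For what it's worth, here is a minimal sketch of the SAX-based approach Rob
describes. The class name is made up and it only prints element text instead
of adding it to a Lucene index; the point is that the document is streamed
and never built up in memory the way a DOM tree is:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Illustrative only: streams the XML and handles one element at a time,
    // so memory use stays roughly constant regardless of file size.
    public class StreamingXmlHandler extends DefaultHandler {

        private final StringBuffer text = new StringBuffer();

        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            text.setLength(0);   // start collecting text for this element
        }

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String localName, String qName) {
            // In a real indexer this is where the element name and text
            // would go into a Lucene Document; here we just print them.
            System.out.println(qName + ": " + text.toString().trim());
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File(args[0]), new StreamingXmlHandler());
        }
    }

If the DOM approach has to be kept, the other suggestion from the thread
still applies: raise the default 64 MB heap with a flag such as -Xmx256m.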