I am aware of the issues with parsing certain PDF documents. I am currently refactoring PDFBox to deal with large documents; you will see this in the next release. I would like to thank everyone for the feedback and for sending problem documents.
Ben Litchfield
http://www.pdfbox.org

On Tue, 18 Feb 2003, Pinky Iyer wrote:

> I am having a similar problem, but indexing PDF documents using the PDFBox
> parser (available at www.pdfbox.com). I get an exception saying "Exception
> in thread "main" java.lang.OutOfMemoryError". Has anybody implemented the
> above code? Any help appreciated.
>
> Thanks!
> PI
>
> Rob Outar <[EMAIL PROTECTED]> wrote:
>
> We are aware of DOM limitations/memory problems, but I am using SAX to
> parse the file and index elements and attributes in my content handler.
>
> Thanks,
>
> Rob
>
> -----Original Message-----
> From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 14, 2003 8:18 PM
> To: Lucene Users List
> Subject: Re: OutOfMemoryException while Indexing an XML file
>
> On Friday 14 February 2003 07:27, Aaron Galea wrote:
>
> > I had this problem when using Xerces to parse XML documents. The problem,
> > I think, lies in the Java garbage collector. The way I solved it was to
> > create
>
> It's unlikely that GC is the culprit. Current collectors are good at
> purging objects that are unreachable, and only throw an OutOfMemoryError
> when they really have no other choice. Usually it's the application that
> holds dangling references to objects, which prevents the GC from collecting
> objects that are no longer useful.
>
> However, it's worth noting that Xerces (and DOM parsers in general)
> generally use more memory than the input XML files they process; this is
> because they usually have to keep the whole document structure in memory,
> and there is overhead on top of the text segments. So memory use is likely
> to be at least 2x the input file size (files usually use UTF-8, which most
> of the time takes 1 byte per character; in memory, 16-bit Unicode chars are
> used for performance), plus some additional overhead for storing element
> structure information and all that.
>
> And since the default maximum Java heap size is 64 MB, big XML files can
> cause problems.
>
> More likely, however, is that references to already-processed DOM trees
> are not nulled in a loop, or something like that? Especially if running
> one JVM process per item solves the problem.
>
> > a shell script that invokes a java program for each XML file that adds
> > it to the index.
>
> -+ Tatu +-
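
For what it's worth, here is a minimal sketch of the SAX-based approach Rob
describes. The class name is made up and it only prints element text instead
of adding it to a Lucene index; the point is that the document is streamed
and never built up in memory the way a DOM tree is:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Illustrative only: streams the XML and handles one element at a time,
    // so memory use stays roughly constant regardless of file size.
    public class StreamingXmlHandler extends DefaultHandler {

        private final StringBuffer text = new StringBuffer();

        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            text.setLength(0);   // start collecting text for this element
        }

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String localName, String qName) {
            // In a real indexer this is where the element name and text
            // would go into a Lucene Document; here we just print them.
            System.out.println(qName + ": " + text.toString().trim());
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File(args[0]), new StreamingXmlHandler());
        }
    }

If the DOM approach has to be kept, the other suggestion from the thread
still applies: raise the default 64 MB heap with a flag such as -Xmx256m.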