The same parser is fine to reuse; it should even be safe in multithreaded applications.

Do not reuse ContentHandler or Metadata objects.
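
For example, here is a minimal sketch of that pattern (the class and method names are mine, but the Tika calls are the usual ones):

    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class SafeReuse {
        // One parser instance, reused across files (and across threads if need be).
        private final AutoDetectParser parser = new AutoDetectParser();

        public String extract(File file) throws Exception {
            // Fresh handler and metadata for every parse; do not reuse these.
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (TikaInputStream stream = TikaInputStream.get(file)) {
                parser.parse(stream, handler, metadata);
            }
            return handler.toString();
        }
    }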

As a side note, if you are handling a bunch of files from the wild in a 
production environment, I encourage running Tika in a separate jvm rather than 
tying it into your post-processing. Consider tika-batch, writing a separate 
text file for each file processed (not very efficient, but exceedingly robust).  
If this is demo code, or you know your document set well enough, you should be 
good to go keeping Tika and your post-processing steps in the same jvm.
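
If you go the tika-batch route, the simplest entry point is the batch mode 
built into the tika-app jar; something like the following (the jar version and 
directory names here are placeholders):

    java -jar tika-app-<version>.jar -i /path/to/input_dir -o /path/to/output_dir

It writes one extract file per input file, so a crash on any single document 
doesn't take down your whole run.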

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 09, 2016 10:35 AM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Thanks Tim!!  You helped me find the defect in my code.

Yes, I'm using one BodyContentHandler.  When I changed my code to create a new 
BodyContentHandler for each XML file I'm parsing, I no longer see the OOM.  It 
is weird that I see this issue with XML files only.

For completeness, can you confirm if I have an issue in re-using a single 
instance of AutoDetectParser and Metadata throughout the life of my 
application?  The reason why I'm reusing a single instance is to cut down on 
overhead (I have yet to time this).

Steve


On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
In your actual code, are you using one BodyContentHandler for all of your 
files, or are you creating a new BodyContentHandler for each file?  If the 
former, then yes, there’s a problem with your code; if the latter, that’s not 
something I’ve seen before.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 4:56 PM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Hi Tim,

The code I showed is a minimal example that reproduces the issue I'm running 
into: memory keeps growing.

In production, the loop that you see will read files off a file system and 
parse them using logic close to what I showed.  I use 
contentHandler.toString() to get back the raw text so I can save it.  Even if I 
get rid of that call, I run into the OOM.

Note that if I test the exact same code against PDF or PPT or ODP or RTF (I 
still have far more formats to test), I do *NOT* see the OOM issue, even when I 
increase the loop to 1000: memory usage remains steady and stable.  This is 
why, in my original email, I asked if there is an issue with XML files or with 
my code, such as whether I'm failing to close or release something.

Here is the full call stack when I get the OOM:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
    at java.lang.StringBuffer.append(StringBuffer.java:114)
    at java.io.StringWriter.write(StringWriter.java:106)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
    at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)

Thanks

Steve


On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
I’m not sure why you’d want to append document contents across documents into 
one handler.  Typically, you’d use a new ContentHandler and a new Metadata 
object for each parse.  Calling “toString()” does not clear the content 
handler, so by your final loop you’ll have 20 copies of the extracted content 
in it.
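
A tiny sketch of that accumulation (the loop here is hypothetical, just to 
illustrate the point):

    BodyContentHandler handler = new BodyContentHandler(-1);
    AutoDetectParser parser = new AutoDetectParser();
    for (int i = 0; i < 20; i++) {
        try (TikaInputStream stream = TikaInputStream.get(new File("test.xml"))) {
            // Each parse appends to the same underlying StringWriter.
            parser.parse(stream, handler, new Metadata());
        }
        // handler.toString() now contains i + 1 concatenated copies of the
        // document text; toString() reads the buffer, it does not clear it.
    }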

The file type shouldn’t matter here: you are appending a new copy of the 
extracted text with each loop regardless.  You might simply not be seeing the 
memory growth because your other file types aren’t big enough and you are only 
doing 20 loops.

But the larger question…what are you trying to accomplish?

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 1:38 PM
To: user@tika.apache.org
Subject: Preventing OutOfMemory exception

Hi everyone,

I'm integrating Tika with my application and need your help to figure out if 
the OOM I'm getting is due to the way I'm using Tika or if it is an issue with 
parsing XML files.

The following example code causes an OOM on the 7th iteration with -Xmx2g.  The 
test passes with -Xmx4g.  The XML file I'm trying to parse is 51 MB in size.  
I do not see this issue with the other file types I have tested so far: memory 
usage keeps growing with XML files, but stays constant with everything else.

    public class Extractor {
        private BodyContentHandler contentHandler = new BodyContentHandler(-1);
        private AutoDetectParser parser = new AutoDetectParser();
        private Metadata metadata = new Metadata();

        public String extract(File file) throws Exception {
            TikaInputStream stream = null;
            try {
                stream = TikaInputStream.get(file);
                parser.parse(stream, contentHandler, metadata);
                return contentHandler.toString();
            }
            finally {
                if (stream != null) {
                    stream.close();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Extractor extractor = new Extractor();
        File file = new File("C:\\temp\\test.xml");
        for (int i = 0; i < 20; i++) {
            extractor.extract(file);
        }
    }

Any idea if this is an issue with XML files, or if the issue is in my code?

Thanks

Steve


