I’m not sure why you’d want to append the contents of multiple documents into 
one handler.  Typically, you’d use a new ContentHandler and a new Metadata 
object for each parse.  Calling “toString()” does not clear the content 
handler, so on your final loop you should have 20 copies of the extracted content.

File type shouldn’t make any difference here: you are appending a new copy of 
the extracted text with each loop either way.  You might not be seeing the 
memory growth if your other files aren’t big enough and if you are only doing 
20 loops.
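
To make the accumulation concrete, here is a minimal sketch that uses only the JDK's SAX parser rather than Tika. TextCollector, HandlerReuseDemo and the tiny XML string are hypothetical stand-ins I made up for illustration; TextCollector only imitates the append-and-never-clear behavior described above and is not Tika's BodyContentHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerReuseDemo {

    // Hypothetical stand-in for a text-collecting handler: it appends every
    // character event to one buffer and never clears it, not even in toString().
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Tiny illustrative document; its text content is 5 characters long.
    static final String XML = "<doc>hello</doc>";

    // Parses XML into the given handler and returns everything it holds so far.
    static String parseInto(DefaultHandler handler) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(XML)), handler);
        return handler.toString();
    }

    // One handler shared by every parse: each loop adds another copy of the text.
    static int lengthAfterReuse(int loops) throws Exception {
        TextCollector shared = new TextCollector();
        String last = "";
        for (int i = 0; i < loops; i++) {
            last = parseInto(shared);
        }
        return last.length();
    }

    // A new handler for every parse: only the current document's text is held.
    static int lengthAfterFresh(int loops) throws Exception {
        String last = "";
        for (int i = 0; i < loops; i++) {
            last = parseInto(new TextCollector());
        }
        return last.length();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared handler: " + lengthAfterReuse(20) + " chars");
        System.out.println("fresh handler:  " + lengthAfterFresh(20) + " chars");
    }
}
```

With the shared handler the returned text grows with every loop, while the per-parse handler stays at the size of one document.  In your Extractor, the corresponding change would be to create the BodyContentHandler and the Metadata inside extract() instead of keeping them as fields.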

But the larger question…what are you trying to accomplish?

From: Steven White [mailto:[email protected]]
Sent: Monday, February 08, 2016 1:38 PM
To: [email protected]
Subject: Preventing OutOfMemory exception

Hi everyone,

I'm integrating Tika with my application and need your help to figure out if 
the OOM I'm getting is due to the way I'm using Tika or if it is an issue with 
parsing XML files.

The following example code is causing OOM on the 7th iteration with -Xmx2g.  
The test will pass with -Xmx4g.  The XML file I'm trying to parse is 51 MB in 
size.  I do not see this issue with the other file types that I have tested so 
far.  Memory usage keeps on growing with XML file types, but stays constant 
with other file types.

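    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class Extractor {
        // These three objects are shared by every call to extract()
        private BodyContentHandler contentHandler = new BodyContentHandler(-1);
        private AutoDetectParser parser = new AutoDetectParser();
        private Metadata metadata = new Metadata();

        public String extract(File file) throws Exception {
            // try-with-resources declares the stream and closes it after the parse
            try (TikaInputStream stream = TikaInputStream.get(file)) {
                parser.parse(stream, contentHandler, metadata);
                return contentHandler.toString();
            }
        }
    }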

    public static void main(...) {
        Extractor extractor = new Extractor();
        File file = new File("C:\\temp\\test.xml");
        for (int i = 0; i < 20; i++) {
            extractor.extract(file);
        }
    }

Any idea if this is an issue with XML files or if the issue is in my code?

Thanks

Steve
