Hi Tim,

The code I showed is a minimal example that reproduces the issue I'm running into: memory keeps growing.
In production, the loop that you see will read files off a file system and parse them using logic close to what I showed. I use contentHandler.toString() to get back the raw text so I can save it. Even if I get rid of that call, I run into OOM.

Note that if I test the exact same code against PDF, PPT, ODP or RTF (I still have far more formats to test) I do *NOT* see the OOM issue, even when I increase the loop to 1000: memory usage remains steady and stable. This is why in my original email I asked whether there is an issue with XML files or with my code, such as failing to close / release something.

Here is the full call stack when I get the OOM:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
    at java.lang.StringBuffer.append(StringBuffer.java:114)
    at java.io.StringWriter.write(StringWriter.java:106)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
    at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)

Thanks

Steve

On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. <talli...@mitre.org> wrote:

> I’m not sure why you’d want to append document contents across documents
> into one handler. Typically, you’d use a new ContentHandler and new
> Metadata object for each parse. Calling “toString()” does not clear the
> content handler, and you should have 20 copies of the extracted content on
> your final loop.
>
> There shouldn’t be any difference across file types in the fact that you
> are appending a new copy of the extracted text with each loop. You might
> not be seeing the memory growth if your other file types aren’t big enough
> and if you are only doing 20 loops.
>
> But the larger question…what are you trying to accomplish?
>
> *From:* Steven White [mailto:swhite4...@gmail.com]
> *Sent:* Monday, February 08, 2016 1:38 PM
> *To:* user@tika.apache.org
> *Subject:* Preventing OutOfMemory exception
>
> Hi everyone,
>
> I'm integrating Tika with my application and need your help to figure out
> if the OOM I'm getting is due to the way I'm using Tika or if it is an
> issue with parsing XML files.
>
> The following example code is causing OOM on the 7th iteration with -Xmx2g.
> The test will pass with -Xmx4g. The XML file I'm trying to parse is 51 MB
> in size. I do not see this issue with other file types that I tested so
> far. Memory usage keeps on growing with XML file types, but stays constant
> with other file types.
>
> public class Extractor {
>     private BodyContentHandler contentHandler = new BodyContentHandler(-1);
>     private AutoDetectParser parser = new AutoDetectParser();
>     private Metadata metadata = new Metadata();
>
>     public String extract(File file) throws Exception {
>         TikaInputStream stream = TikaInputStream.get(file);
>         try {
>             parser.parse(stream, contentHandler, metadata);
>             return contentHandler.toString();
>         }
>         finally {
>             stream.close();
>         }
>     }
> }
>
> public static void main(...) {
>     Extractor extractor = new Extractor();
>     File file = new File("C:\\temp\\test.xml");
>     for (int i = 0; i < 20; i++) {
>         extractor.extract(file);
>     }
> }
>
> Any idea if this is an issue with XML files or if the issue is in my code?
>
> Thanks
>
> Steve
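
The point made in the reply above (toString() does not clear the handler, so a handler shared across parses accumulates one copy of the text per call) can be illustrated with a small stand-alone sketch. This is not Tika code: it uses a plain java.io.StringWriter as a stand-in for the handler's internal buffer, on the assumption, suggested by the stack trace, that the text ends up in a StringWriter. The class and method names (HandlerReuseDemo, reusedBufferLength, freshBufferLength) are hypothetical.

```java
import java.io.StringWriter;

// Illustrative stand-in only: a StringWriter plays the role of the
// text buffer behind a content handler. This is not Tika's implementation.
class HandlerReuseDemo {

    // One buffer shared across all iterations, like a handler held in a field.
    static int reusedBufferLength(String text, int iterations) {
        StringWriter shared = new StringWriter();
        for (int i = 0; i < iterations; i++) {
            shared.write(text);   // each "parse" appends another copy
            shared.toString();    // reading the text does not clear the buffer
        }
        return shared.getBuffer().length();
    }

    // A new buffer per iteration, like a new handler for each parse.
    static int freshBufferLength(String text, int iterations) {
        int last = 0;
        for (int i = 0; i < iterations; i++) {
            StringWriter fresh = new StringWriter();
            fresh.write(text);
            last = fresh.getBuffer().length();
            // 'fresh' becomes unreachable here and can be garbage collected
        }
        return last;
    }

    public static void main(String[] args) {
        String text = "extracted text";
        System.out.println("reused buffer length: " + reusedBufferLength(text, 20));
        System.out.println("fresh buffer length:  " + freshBufferLength(text, 20));
    }
}
```

With 20 iterations the shared buffer ends up 20 times as long as the fresh one. In the Extractor example, the analogous change would be to create the BodyContentHandler and Metadata inside extract() rather than holding them as fields, as the reply suggests.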