Hi Tim,

The code I showed is a minimal example that reproduces the issue I'm running
into: memory keeps growing.

In production, the loop you see will read files off a file system and
parse them using logic close to what I showed.  I use
contentHandler.toString() to get back the raw text so I can save it.  Even
if I get rid of that call, I still run into the OOM.

Note that if I run the exact same code against PDF, PPT, ODP or RTF
(I still have far more formats to test), I do *NOT* see the OOM issue even
when I increase the loop to 1000 -- memory usage remains steady and
stable.  This is why in my original email I asked whether there is an issue
with XML files or with my code, such as failing to close / release
something.

Here is the full call stack when I get the OOM:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
    at java.lang.StringBuffer.append(StringBuffer.java:114)
    at java.io.StringWriter.write(StringWriter.java:106)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
    at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
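
The top frames of the trace are StringWriter / StringBuffer growth underneath
ToTextContentHandler, so here is a small sketch of what I understand the
"toString() does not clear the handler" point to mean.  It is not Tika code:
it uses a plain java.io.StringWriter as a stand-in for the handler's internal
buffer (my assumption, based only on the trace), and the class and method
names are made up for illustration.

```java
import java.io.StringWriter;

// Hypothetical demo class; StringWriter stands in for the handler's buffer.
final class BufferGrowthDemo {

    // Reuses ONE writer for every iteration, like a handler kept in a field.
    static int sharedWriterLength(String text, int iterations) {
        StringWriter shared = new StringWriter();
        for (int i = 0; i < iterations; i++) {
            shared.write(text);
            shared.toString(); // returns a copy; the buffer is not cleared
        }
        return shared.getBuffer().length();
    }

    // Creates a fresh writer per iteration and reports the largest size seen.
    static int freshWriterMaxLength(String text, int iterations) {
        int max = 0;
        for (int i = 0; i < iterations; i++) {
            StringWriter fresh = new StringWriter();
            fresh.write(text);
            max = Math.max(max, fresh.getBuffer().length());
        }
        return max;
    }

    public static void main(String[] args) {
        String text = "0123456789";
        System.out.println(sharedWriterLength(text, 20));   // grows with each pass
        System.out.println(freshWriterMaxLength(text, 20)); // stays at one copy
    }
}
```

If the real handler behaves like the shared writer here, dropping the
toString() call would not help, since the text is appended during parsing
either way.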

Thanks

Steve


On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> I’m not sure why you’d want to append document contents across documents
> into one handler.  Typically, you’d use a new ContentHandler and new
> Metadata object for each parse.  Calling “toString()” does not clear the
> content handler, and you should have 20 copies of the extracted content on
> your final loop.
>
>
>
> There shouldn’t be any difference across file types in the fact that you
> are appending a new copy of the extracted text with each loop.  You might
> not be seeing the memory growth if your other file types aren’t big enough
> and if you are only doing 20 loops.
>
>
>
> But the larger question…what are you trying to accomplish?
>
>
>
> *From:* Steven White [mailto:swhite4...@gmail.com]
> *Sent:* Monday, February 08, 2016 1:38 PM
> *To:* user@tika.apache.org
> *Subject:* Preventing OutOfMemory exception
>
>
>
> Hi everyone,
>
>
>
> I'm integrating Tika with my application and need your help to figure out
> if the OOM I'm getting is due to the way I'm using Tika or if it is an
> issue with parsing XML files.
>
>
>
> The following example code causes an OOM on the 7th iteration with -Xmx2g.
> The test passes with -Xmx4g.  The XML file I'm trying to parse is 51 MB
> in size.  I do not see this issue with the other file types I have tested so
> far.  Memory usage keeps growing with XML files, but stays constant
> with other file types.
>
>
>
>     public class Extractor {
>         private BodyContentHandler contentHandler = new BodyContentHandler(-1);
>         private AutoDetectParser parser = new AutoDetectParser();
>         private Metadata metadata = new Metadata();
>
>         public String extract(File file) throws Exception {
>             InputStream stream = TikaInputStream.get(file);
>             try {
>                 // The same handler and metadata instances are reused on every call.
>                 parser.parse(stream, contentHandler, metadata);
>                 return contentHandler.toString();
>             }
>             finally {
>                 stream.close();
>             }
>         }
>     }
>
>     public static void main(...) {
>         Extractor extractor = new Extractor();
>         File file = new File("C:\\temp\\test.xml");
>         for (int i = 0; i < 20; i++) {
>             extractor.extract(file);
>         }
>     }
>
>
>
> Any idea if this is an issue with XML files or if the issue is in my code?
>
>
>
> Thanks
>
>
>
> Steve
>
>
>
