[
https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955416#comment-17955416
]
Tim Barrett commented on TIKA-4427:
-----------------------------------
It looks to me as though reset isn't working. The pooled instances all seem to
hold a reference chain of the form:
saxParser->xmlreader->fDocumentSource->fDocumentHandler->fDocumentSource->…
which goes down to a great depth.
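A quick way to probe this from the public API is to check what the reader
still references after reset(). This is only a sketch: the internal
fDocumentSource/fDocumentHandler chain itself only shows up in a heap dump or
via reflection, but getContentHandler() is a cheap first check:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ResetCheck {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream("<a/>".getBytes()), new DefaultHandler());
        // What does the reader still reference after a parse?
        System.out.println("after parse: " + parser.getXMLReader().getContentHandler());
        parser.reset();
        // If reset() did its job, the handler above should no longer be reachable here.
        System.out.println("after reset: " + parser.getXMLReader().getContentHandler());
    }
}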
This happened originally with Java 21; I went back to Java 11 but the problem
remained.
When I said this didn't happen under Tika 2.x, I'm not sure that was accurate:
the last time we ran a big data test was a few months ago, and I'm not sure
which version of Tika we were on then.
I have put a workaround in XMLReaderUtils: I always call getSAXParser()
instead of acquiring a parser from the pool. There is no noticeable
performance hit and no more memory is leaked, which begs the question of why
a pool was needed in the first place. (A rough benchmark sketch follows the
method below.)
/**
 * This checks the context for a user-specified {@link SAXParser}. If one is
 * not found, this reuses a SAXParser from the pool.
 *
 * @param is InputStream to parse
 * @param contentHandler handler to use; this is wrapped in an
 *                       {@link OfflineContentHandler} as an extra layer of
 *                       defense against external entity vulnerabilities
 * @param context context to use
 * @throws TikaException
 * @throws IOException
 * @throws SAXException
 * @since Apache Tika 1.19
 *
 * Workaround Tim Barrett Nalanda 31/05/2025 - always get a new SAX
 * parser due to memory leak in XMLReader
 */
public static void parseSAX(InputStream is, ContentHandler contentHandler,
        ParseContext context)
        throws TikaException, IOException, SAXException {
    SAXParser saxParser = context.get(SAXParser.class);
    PoolSAXParser poolSAXParser = null;
    if (saxParser == null) {
        // Workaround 31/05/2025: bypass the pool and build a fresh parser
        // every time. A user-specified parser from the context is still
        // honored by the surrounding null check.
        // poolSAXParser = acquireSAXParser();
        // if (poolSAXParser != null) {
        //     saxParser = poolSAXParser.getSAXParser();
        // } else {
        saxParser = getSAXParser();
        // }
    }
    try {
        saxParser.parse(is, new OfflineContentHandler(contentHandler));
    } finally {
        if (poolSAXParser != null) {
            releaseParser(poolSAXParser);
        }
    }
}
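A rough micro-benchmark along these lines is one way to sanity-check the "no
noticeable performance hit" observation. This is a sketch against the default
JAXP factory, not against Tika's pool; absolute numbers will vary by JVM and
parser implementation:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParserCostCheck {
    private static final byte[] DOC = "<root><a>x</a></root>".getBytes();

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        int n = 10_000;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // fresh parser per document, as in the workaround above
            factory.newSAXParser().parse(new ByteArrayInputStream(DOC), new DefaultHandler());
        }
        long fresh = System.nanoTime() - t0;

        SAXParser reused = factory.newSAXParser();
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // reset-and-reuse, roughly what a pool is meant to do
            reused.reset();
            reused.parse(new ByteArrayInputStream(DOC), new DefaultHandler());
        }
        long pooled = System.nanoTime() - t1;

        System.out.printf("fresh: %d ms, reused: %d ms%n",
                fresh / 1_000_000, pooled / 1_000_000);
    }
}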
> Memory Leak when parsing a large (110K+) number of documents
> --------------------------------------------------------------
>
> Key: TIKA-4427
> URL: https://issues.apache.org/jira/browse/TIKA-4427
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 3.2.0
> Reporter: Tim Barrett
> Priority: Major
> Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot
> 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
>
> When parsing a very large number of documents, including a lot of eml
> files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding a
> massive amount of memory: 3.28 GB. This is a static pool of cached SAXParser
> instances, each of which is holding onto a substantial amount of memory,
> apparently in the fDocumentHandler field.
> This is a big data test we run regularly; the memory issues did not occur in
> Tika version 2.x.
>
> I have attached JVM monitor screenshots.