[
https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955416#comment-17955416
]
Tim Barrett commented on TIKA-4427:
-----------------------------------
It looks to me as though reset isn't working. The pooled instances all seem to
hold a reference chain of the form:
saxParser->xmlreader->fDocumentSource->fDocumentHandler->fDocumentSource->…
which goes down to a great depth.
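A quick way to probe this from the public API is to check what the reader
still references after reset(). This is only a sketch: the internal
fDocumentSource/fDocumentHandler chain itself only shows up in a heap dump or
via reflection, but getContentHandler() is a cheap first check:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ResetCheck {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream("<a/>".getBytes()), new DefaultHandler());
        // What does the reader still reference after a parse?
        System.out.println("after parse: " + parser.getXMLReader().getContentHandler());
        parser.reset();
        // If reset() did its job, the handler above should no longer be reachable here.
        System.out.println("after reset: " + parser.getXMLReader().getContentHandler());
    }
}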
This happened originally with Java 21; I went back to Java 11 but the problem
remained.
When I said this didn't happen under Tika 2.x, I'm not sure that was accurate:
the last time we ran a big data test was a few months ago, and I'm not sure
which version of Tika we were on then.
I have put a workaround in XMLReaderUtils: I always call getSAXParser()
instead of acquiring a parser from the pool. There is no noticeable
performance hit and no more memory is leaked, which begs the question of why
a pool was needed in the first place. (A rough benchmark sketch follows the
method below.)
/**
 * This checks the context for a user-specified {@link SAXParser}. If one is
 * not found, this reuses a SAXParser from the pool.
 *
 * @param is InputStream to parse
 * @param contentHandler handler to use; this is wrapped in an
 *                       {@link OfflineContentHandler} as an extra layer of
 *                       defense against external entity vulnerabilities
 * @param context context to use
 * @throws TikaException
 * @throws IOException
 * @throws SAXException
 * @since Apache Tika 1.19
 *
 * Workaround Tim Barrett Nalanda 31/05/2025 - always get a new SAX
 * parser due to memory leak in XMLReader
 */
public static void parseSAX(InputStream is, ContentHandler contentHandler,
        ParseContext context)
        throws TikaException, IOException, SAXException {
    SAXParser saxParser = context.get(SAXParser.class);
    PoolSAXParser poolSAXParser = null;
    if (saxParser == null) {
        // Workaround 31/05/2025: bypass the pool and build a fresh parser
        // every time. A user-specified parser from the context is still
        // honored by the surrounding null check.
        // poolSAXParser = acquireSAXParser();
        // if (poolSAXParser != null) {
        //     saxParser = poolSAXParser.getSAXParser();
        // } else {
        saxParser = getSAXParser();
        // }
    }
    try {
        saxParser.parse(is, new OfflineContentHandler(contentHandler));
    } finally {
        if (poolSAXParser != null) {
            releaseParser(poolSAXParser);
        }
    }
}
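A rough micro-benchmark along these lines is one way to sanity-check the "no
noticeable performance hit" observation. This is a sketch against the default
JAXP factory, not against Tika's pool; absolute numbers will vary by JVM and
parser implementation:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParserCostCheck {
    private static final byte[] DOC = "<root><a>x</a></root>".getBytes();

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        int n = 10_000;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // fresh parser per document, as in the workaround above
            factory.newSAXParser().parse(new ByteArrayInputStream(DOC), new DefaultHandler());
        }
        long fresh = System.nanoTime() - t0;

        SAXParser reused = factory.newSAXParser();
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // reset-and-reuse, roughly what a pool is meant to do
            reused.reset();
            reused.parse(new ByteArrayInputStream(DOC), new DefaultHandler());
        }
        long pooled = System.nanoTime() - t1;

        System.out.printf("fresh: %d ms, reused: %d ms%n",
                fresh / 1_000_000, pooled / 1_000_000);
    }
}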
> Memory Leak when parsing a large (110K+) number of documents
> --------------------------------------------------------------
>
> Key: TIKA-4427
> URL: https://issues.apache.org/jira/browse/TIKA-4427
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 3.2.0
> Reporter: Tim Barrett
> Priority: Major
> Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot
> 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
>
> When parsing a very large number of documents, including a lot of eml
> files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding a
> massive amount of memory: 3.28 GB. This is a static pool of cached SAXParser
> instances, each of which is holding onto a substantial amount of memory,
> apparently in the fDocumentHandler field.
> This is a big data test we run regularly; the memory issues did not occur in
> Tika version 2.x.
>
> I have attached JVM monitor screenshots.