[
https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955280#comment-17955280
]
Tim Allison commented on TIKA-4427:
-----------------------------------
Thank you for opening this issue and identifying the source of the memory leak.
We're calling reset on the parser when we return it to the pool:
https://github.com/apache/tika/blob/branch_3x/tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java#L1064
Any idea why this might not be enough? Is reset not working as expected?
From the last screenshot, you're using the default java xml parser, and it
looks like the cache only has 10 of them...so that's good.
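For illustration, the pool-and-reset pattern discussed above can be sketched roughly as follows. This is a minimal standalone sketch, not Tika's actual XMLReaderUtils code; the class and method names (SaxParserPool, acquire, release) are hypothetical. The key step is the reset() call before the parser goes back into the pool, which restores the parser to its original factory configuration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SaxParserPool {

    private final ArrayBlockingQueue<SAXParser> pool;

    public SaxParserPool(int size) throws Exception {
        pool = new ArrayBlockingQueue<>(size);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        for (int i = 0; i < size; i++) {
            // Pre-populate the bounded pool with parser instances.
            pool.offer(factory.newSAXParser());
        }
    }

    /** Blocks until a parser is available. */
    public SAXParser acquire() throws InterruptedException {
        return pool.take();
    }

    /** Resets the parser to factory configuration, then returns it to the pool. */
    public void release(SAXParser parser) {
        parser.reset();
        pool.offer(parser);
    }

    public static void main(String[] args) throws Exception {
        SaxParserPool pool = new SaxParserPool(2);
        SAXParser parser = pool.acquire();
        // ... parse a document with a per-call ContentHandler here ...
        pool.release(parser);
    }
}
```

If a parser retained through a static pool like this still references its last ContentHandler after reset() (e.g. via an internal field such as the fDocumentHandler mentioned in the report), every pooled parser would pin whatever that handler holds, which matches the symptom described in the issue.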
> Memory Leak when parsing a large (110K+) number of documents
> --------------------------------------------------------------
>
> Key: TIKA-4427
> URL: https://issues.apache.org/jira/browse/TIKA-4427
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 3.2.0
> Reporter: Tim Barrett
> Priority: Major
> Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot
> 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
>
> When parsing a very large number of documents, including a lot of eml
> files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding
> a massive amount of memory: 3.28 GB. This is a static pool of cached
> SAXParser instances, each of which is holding onto substantial amounts of
> memory, apparently in the fDocumentHandler field.
> This is a big data test we run regularly; the memory issues did not occur
> in Tika version 2.x.
>
> I have attached JVM monitor screenshots.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)