[ 
https://issues.apache.org/jira/browse/DAFFODIL-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18014734#comment-18014734
 ] 

Steve Lawrence commented on DAFFODIL-3030:
------------------------------------------

I believe I've found a major source of our memory leaks. Calling 
DataProcessor.withFoo() can leak memory in some cases, and calling it many 
times can leak enough memory to cause out of memory errors.

The underlying cause is that Daffodil uses ThreadLocals in a couple of places. 
One is the "regexMatchState", which can grow to be pretty large for files that 
do a lot of large regex matching. Another is within the Schematron and Xerces 
validators.

A potential gotcha with ThreadLocals is that the internal implementation is a 
per-thread map whose keys (the ThreadLocals) are weakly referenced but whose 
values are not. This means that a ThreadLocal that is no longer referenced can 
be garbage collected, which sets the key of its entry to null. But even though 
the key becomes null, the value is still strongly referenced by the map and 
cannot be GCed. Instead, the map is scanned for entries whose keys have become 
null and those values are removed, finally allowing them to be garbage 
collected. But this only happens as a side effect of calling a function on a 
ThreadLocal in that thread (e.g. get(), set(), remove()). If we don't call any 
of these, then the values will persist in memory. Making things even worse, 
each Thread holds a strong reference to its map, which means that even if a 
ThreadLocal can no longer be directly referenced, its values are still 
reachable from the Thread and will not be GCed for as long as the Thread lives 
and the stale entries are not cleaned up. Ultimately, this means that if we 
ever lose access to a ThreadLocal, any values that were added to it but not 
explicitly removed could be a memory leak.
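
As a rough illustration of the pattern described above, here is a minimal 
sketch. Holder, use(), and release() are made-up names for this example and are 
not Daffodil code; the only real API used is java.lang.ThreadLocal. Whether the 
first loop actually retains memory at any given moment depends on GC timing and 
on when stale entries happen to be cleaned, so no output is claimed.

```scala
// Hypothetical stand-in for an object that owns its own ThreadLocal,
// loosely like a copied DataProcessor. Not Daffodil code.
final class Holder(size: Int) {
  private val buffer: ThreadLocal[Array[Byte]] = new ThreadLocal[Array[Byte]] {
    override def initialValue(): Array[Byte] = new Array[Byte](size)
  }

  // The first call on a thread allocates the value and stores it in that
  // thread's ThreadLocal map.
  def use(): Int = buffer.get().length

  // Explicitly drops the current thread's value for this ThreadLocal.
  def release(): Unit = buffer.remove()
}

object ThreadLocalLeakSketch {
  def main(args: Array[String]): Unit = {
    // Leak-prone: each Holder is dropped without remove(), so its value stays
    // in the current thread's map until the stale entry happens to be cleaned
    // up or the thread exits.
    for (_ <- 1 to 100) {
      val h = new Holder(64 * 1024)
      h.use()
    }

    // Safer: remove the value before losing access to the ThreadLocal.
    for (_ <- 1 to 100) {
      val h = new Holder(64 * 1024)
      try h.use()
      finally h.release()
    }
  }
}
```

Note that remove() only clears the value for the calling thread, so this only 
helps when the owner is released on the same thread that used it.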

And I think this is exactly what is happening.

For every test, our TDMLRunner does something like this:

val dp = originalDP.withValidation(validation)
val pr = dp.parse(...)

It then throws away dp, no longer having a need for it and expecting it to be 
GCed.

But the problem is withValidation("xerces") causes a XercesValidator instance 
to be stored in a ThreadLocal. Daffodil doesn't know the dp will no longer be 
used, so it keeps the XercesValidator around in a ThreadLocal for future 
parses. But since the TDML runner no longer uses the XercesValidator instance, 
it just becomes a memory leak, lasting until the Thread exits.

Another problem, and potentially a bigger memory leak, is the "regexMatchState" 
in the data processor. This is a CharBuffer/LongBuffer tuple stored in a 
ThreadLocal that can grow pretty large, especially if schemas have unbounded 
regex length patterns. As before, every time we call withValidation() we 
create a new DataProcessor, which requires new regexMatchState buffers to be 
allocated, and the previous ones become a memory leak. Fortunately 
regexMatchState is lazy, so it at least shouldn't affect schemas that don't use 
regexes.

Note that not all of our ThreadLocal uses have this issue. For example, Parsers 
that use a ThreadLocal are safe because calling DataProcessor.withFoo will use 
the same Parsers with the same ThreadLocals, so there is no memory leak.
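
To make the difference concrete, here is a small sketch of the safe case, where 
a copy reuses the component that owns the ThreadLocal instead of creating a new 
one. SharedScratch and Proc are hypothetical names for illustration only and do 
not reflect how Daffodil's Parsers or DataProcessor are actually written.

```scala
// Hypothetical component that owns a ThreadLocal. Not Daffodil code.
final class SharedScratch {
  private val local: ThreadLocal[Array[Char]] = new ThreadLocal[Array[Char]] {
    override def initialValue(): Array[Char] = new Array[Char](1024)
  }

  def get(): Array[Char] = local.get()
}

// Hypothetical stand-in for a processor whose copies share one SharedScratch.
final class Proc(val validation: String, val scratch: SharedScratch) {
  // The copy is handed the same SharedScratch, so no new ThreadLocal (and no
  // new per-thread value) is created by making a copy.
  def withValidation(v: String): Proc = new Proc(v, scratch)
}

object SharedThreadLocalSketch {
  def main(args: Array[String]): Unit = {
    val original = new Proc("off", new SharedScratch)
    val copy = original.withValidation("xerces")
    // Same thread and same ThreadLocal, so both see the same buffer instance.
    println(copy.scratch.get() eq original.scratch.get()) // prints "true"
  }
}
```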

Also, this really only occurs if you do a lot of DataProcessor.withFoo(). That 
is generally rare, but buggy code, like in the TDML runner, could do this. 
There may also be other scenarios where it seems like a reasonable thing to do 
(DataProcessor.withFoo is supposed to be efficient).

So the short term fix is likely to fix the TDML runner so it doesn't call 
DataProcessor.withFoo() so much. A long term fix is to fix our thread locals, 
or stop using them, so we do not leak memory.
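
One possible shape for the short term fix is to derive a processor once per 
validation mode and reuse it across tests. The sketch below is only an 
assumption about how that could look; FakeProcessor and ProcessorCache are 
invented names that stand in for the real classes, and the counter exists only 
to show how many copies get made.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a DataProcessor-like object. Not Daffodil's API.
final class FakeProcessor(val validation: String) {
  def withValidation(v: String): FakeProcessor = {
    FakeProcessor.created += 1 // count how many derived copies are made
    new FakeProcessor(v)
  }
}

object FakeProcessor {
  var created: Int = 0
}

// Keeps one derived processor per validation mode instead of making a new
// copy for every test.
final class ProcessorCache(original: FakeProcessor) {
  private val byValidation = mutable.Map.empty[String, FakeProcessor]

  def forValidation(v: String): FakeProcessor = synchronized {
    // The second argument is only evaluated when the key is missing.
    byValidation.getOrElseUpdate(v, original.withValidation(v))
  }
}

object ProcessorCacheSketch {
  def main(args: Array[String]): Unit = {
    val cache = new ProcessorCache(new FakeProcessor("off"))
    for (_ <- 1 to 1000) cache.forValidation("xerces")
    println(FakeProcessor.created) // prints "1"
  }
}
```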

> Investigate increased memory usage
> ----------------------------------
>
>                 Key: DAFFODIL-3030
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-3030
>             Project: Daffodil
>          Issue Type: Bug
>            Reporter: Josh Adams
>            Priority: Major
>             Fix For: 4.0.0
>
>
> I'm seeing a significant increase in memory required for a particular DFDL 
> schema project (P8).  Prior to 4.0.0 the test suite (running "sbt test") 
> required around 10GB of heap space to complete.  After 4.0.0 it needs over 
> 20GB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
