Casimir Giesler created SPARK-55416:
---------------------------------------

             Summary: Streaming Python Data Source memory leak when offset 
is not updated
                 Key: SPARK-55416
                 URL: https://issues.apache.org/jira/browse/SPARK-55416
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Structured Streaming
    Affects Versions: 4.1.1, 4.0.2, 4.0.1, 4.0.0, 4.1.0
            Reporter: Casimir Giesler


This only becomes a bug / memory leak if a user implements the offset 
handling incorrectly and never advances the offset, as in the sketch below.
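
A minimal sketch of such an incorrect implementation (hypothetical names; 
assumes the Python data source {{SimpleDataSourceStreamReader}} API, where 
{{read(start)}} returns an iterator of rows together with the next start 
offset):

{code:python}
from pyspark.sql.datasource import DataSource, SimpleDataSourceStreamReader


class StuckOffsetStreamReader(SimpleDataSourceStreamReader):
    def initialOffset(self):
        return {"offset": 0}

    def read(self, start):
        # Incorrect: the returned next-start offset is identical to the
        # current start, so every trigger reports the same offset and the
        # stream never advances.
        rows = iter([(start["offset"],)])
        return rows, start

    def readBetweenOffsets(self, start, end):
        return iter([(start["offset"],)])


class StuckOffsetDataSource(DataSource):
    @classmethod
    def name(cls):
        return "stuck_offset"

    def schema(self):
        return "value INT"

    def simpleStreamReader(self, schema):
        return StuckOffsetStreamReader()
{code}

Registering this source with {{spark.dataSource.register(StuckOffsetDataSource)}} 
and running a streaming query against it should reproduce the growth described 
below.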

The commit logic in 
[datasource_internal.py|https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource_internal.py#L106-L114]
 fails to clean up the cache if the offset never increases.

In this case, {{end}} always matches the first cache entry, so the 
{{end_idx > 0}} condition is never satisfied and the cache is never pruned, 
while the {{latestOffset}} function keeps appending new entries to it.

This leads to an unboundedly growing cache and will ultimately result in an OOM.
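
To illustrate the interaction, here is a standalone simulation of the 
prefetch-cache bookkeeping (not the actual wrapper code; names and structure 
are simplified approximations):

{code:python}
# Simplified simulation of the prefetch cache kept by the stream reader wrapper.
cache = []  # entries appended on each latestOffset() call


def latest_offset(start, end):
    # Each trigger prefetches a batch and caches it, even when end == start.
    cache.append((start, end))
    return end


def commit(end):
    # Find the cached entry whose end offset matches the committed offset.
    end_idx = -1
    for idx, (_, entry_end) in enumerate(cache):
        if entry_end == end:
            end_idx = idx
            break
    # With a non-advancing offset, the match is always the first entry
    # (idx 0), so the `end_idx > 0` guard never fires and nothing is evicted.
    if end_idx > 0:
        del cache[:end_idx]


offset = {"offset": 0}
for _ in range(5):
    latest_offset(offset, offset)  # offset never advances
    commit(offset)

print(len(cache))  # 5 -> one new entry per trigger, growing without bound
{code}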



