Casimir Giesler created SPARK-55416:
---------------------------------------
Summary: Streaming Python Data Source memory leak when offset is not updated
Key: SPARK-55416
URL: https://issues.apache.org/jira/browse/SPARK-55416
Project: Spark
Issue Type: Bug
Components: PySpark, Structured Streaming
Affects Versions: 4.1.1, 4.0.2, 4.0.1, 4.0.0, 4.1.0
Reporter: Casimir Giesler
This only becomes a bug / memory leak if users implement the offset logic
incorrectly, so that the offset never increases.
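For example, a {{SimpleDataSourceStreamReader}} whose {{read()}} keeps returning the same end offset reproduces the problem. The sketch below is illustrative only (hypothetical class and format names, written against the public Python data source API), not code from this report:
{code:python}
from pyspark.sql.datasource import DataSource, SimpleDataSourceStreamReader


class StuckOffsetStreamReader(SimpleDataSourceStreamReader):
    """Hypothetical reader whose offset never advances."""

    def initialOffset(self) -> dict:
        return {"offset": 0}

    def read(self, start: dict):
        # BUG: the returned end offset equals the start offset, so the stream
        # never makes progress while the internal wrapper keeps caching batches.
        return iter([(1,)]), start

    def readBetweenOffsets(self, start: dict, end: dict):
        return iter([(1,)])


class StuckOffsetDataSource(DataSource):
    @classmethod
    def name(cls):
        return "stuck_offset"

    def schema(self):
        return "value INT"

    def simpleStreamReader(self, schema):
        return StuckOffsetStreamReader()
{code}
Registering this source with {{spark.dataSource.register(StuckOffsetDataSource)}} and running a query against {{spark.readStream.format("stuck_offset").load()}} makes every microbatch append another entry to the internal prefetch cache described below.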
The commit logic in
[datasource_internal.py|https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource_internal.py#L106-L114]
fails to clean up the cache if the offset never increases.
In this case, {{end}} always matches the first cache entry, so the
{{end_idx > 0}} condition is never satisfied and the cache is never trimmed,
while {{latestOffset}} keeps appending new entries.
This leads to an unbounded cache that grows until it ultimately results in an OOM.
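For illustration, here is a condensed paraphrase of that cleanup path (the {{end_idx}} guard follows the linked lines; everything else is simplified and not the exact source):
{code:python}
# Paraphrased sketch of the wrapper's commit() in datasource_internal.py;
# illustrative only, see the linked lines for the real implementation.
def commit(self, end: dict) -> None:
    end_idx = -1
    for idx, entry in enumerate(self.cache):
        if entry.end == end:
            end_idx = idx
            break
    if end_idx > 0:
        # Evict cache entries that precede the committed offset. When the offset
        # never advances, `end` always matches the first entry, end_idx stays 0,
        # this branch never runs, and the cache only keeps growing.
        self.cache = self.cache[end_idx:]
{code}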
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]