Behroz Sikander created SPARK-26302:
---------------------------------------

             Summary: retainedBatches configuration can cause memory leak
                 Key: SPARK-26302
                 URL: https://issues.apache.org/jira/browse/SPARK-26302
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 2.4.0
            Reporter: Behroz Sikander
         Attachments: heap_dump_detail.png

The documentation for configuration "spark.streaming.ui.retainedBatches" says

"How many batches the Spark Streaming UI and status APIs remember before 
garbage collecting"

The default for this configuration is 1000.
From our experience, the documentation is incomplete, and we found this out
the hard way.

The size of a single BatchUIData object is around 750KB in our job. Increasing
"spark.streaming.ui.retainedBatches" to something like 5000 raises the total
size of the retained batches to ~4GB on the driver.
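
As a rough back-of-the-envelope check of the numbers above, the retained size
can be estimated by multiplying the per-batch size by the configured batch
count. This is only a sketch: the 750KB figure is what we observed in our job
and will differ for other workloads, and the function name below is made up
for illustration.

```python
# Rough estimate only. Assumes ~750 KB per retained BatchUIData, which is
# the size observed in this report; the real size depends on the job.
BATCH_UI_DATA_KB = 750


def retained_batches_heap_gb(retained_batches, batch_kb=BATCH_UI_DATA_KB):
    """Approximate driver heap held by retained batches, in GB.

    Uses 1 GB = 1024 * 1024 KB. Hypothetical helper, not part of Spark.
    """
    return retained_batches * batch_kb / (1024 * 1024)


if __name__ == "__main__":
    # Compare the default (1000) with the raised value from this report (5000).
    for n in (1000, 5000):
        print(f"retainedBatches={n}: ~{retained_batches_heap_gb(n):.2f} GB")
```

With these assumptions the default of 1000 already accounts for roughly 0.7GB,
and 5000 lands in the 3.5GB to 4GB range, which is consistent with what the
heap dump showed.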

If your driver heap is not big enough, the job starts to slow down, has
frequent GCs and has long scheduling delays. Once the heap is full, the job
cannot be recovered.

A note of caution should be added to the documentation to let users know the 
impact of this seemingly harmless configuration property.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
