Zhilong Hong created FLINK-23354:
------------------------------------
Summary: Limit the size of blob cache on TaskExecutor
Key: FLINK-23354
URL: https://issues.apache.org/jira/browse/FLINK-23354
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Zhilong Hong
Fix For: 1.14.0
Currently, a TaskExecutor uses BlobCache to cache the blobs transported from
the JobManager. The cached blobs are local files stored on the TaskExecutor. A
cached blob is not cleaned up until one hour after the related job is finished.
At present, JobInformation and TaskInformation are transported via blob. If a
lot of jobs are submitted, the blob cache will occupy a large amount of disk
space. In FLINK-23218, we are going to distribute the cached ShuffleDescriptors
via blob. When a large number of failovers happen, a lot of cached blobs will
be stored on local disk. In extreme cases, the blobs could fill up the disk.
So we need to add a size limit for the blob cache on TaskExecutor, as described
in the comments of FLINK-23218. The main idea is to add a size limit and
delete blobs in LRU order if the size limit is exceeded. Before a blob item is
cached, the TaskExecutor first checks the overall size of the cache. If the
overall size exceeds the limit, blobs are deleted in LRU order until the
limit is no longer exceeded. If a deleted blob is needed afterwards, it is
downloaded from the blob server again.
The default value of the size limit of the blob cache on TaskExecutor will be
10GiB.
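The eviction scheme described above could look roughly like the following
sketch. It is only an illustration, not the proposed implementation: the class
name LruBlobSizeTracker, its methods, and the use of plain String keys are all
hypothetical, and it assumes that the size check includes the size of the blob
about to be cached. The caller is expected to delete the returned blobs from
local disk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of size tracking with LRU eviction; not Flink's actual code. */
final class LruBlobSizeTracker {
    /** Default limit taken from this issue: 10 GiB. */
    static final long DEFAULT_SIZE_LIMIT = 10L * 1024 * 1024 * 1024;

    private final long sizeLimit;
    private long totalSize;

    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Long> sizes =
            new LinkedHashMap<>(16, 0.75f, true);

    LruBlobSizeTracker(long sizeLimit) {
        this.sizeLimit = sizeLimit;
    }

    /** Marks a cached blob as recently used. */
    void touch(String blobKey) {
        sizes.get(blobKey); // a get() moves the entry to the most-recent end
    }

    /**
     * Called before caching a new blob. Stops tracking least recently used
     * blobs until the new blob fits (or nothing is left to evict) and returns
     * their keys so the caller can delete the files.
     */
    List<String> makeRoomFor(long newBlobSize) {
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = sizes.entrySet().iterator();
        while (totalSize + newBlobSize > sizeLimit && it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            evicted.add(entry.getKey());
            totalSize -= entry.getValue();
            it.remove();
        }
        return evicted;
    }

    /** Starts tracking a blob after it has been written to the cache. */
    void track(String blobKey, long size) {
        Long previous = sizes.put(blobKey, size);
        totalSize += size - (previous == null ? 0L : previous);
    }

    long totalSize() {
        return totalSize;
    }
}
```

A blob evicted this way is simply no longer tracked; if it is requested later,
the cache would miss and fetch it from the blob server again, as described
above.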
--
This message was sent by Atlassian Jira
(v8.3.4#803005)