[ 
https://issues.apache.org/jira/browse/FLINK-11937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986861#comment-16986861
 ] 

Congxian Qiu(klion26) commented on FLINK-11937:
-----------------------------------------------

Retarget to 1.11.0 to give more time for review and testing. There is an early 
version 
[here|https://github.com/klion26/flink/tree/small_files_with_pluggable_sharedregistry]
 and please refer to it if anyone meets this same problem and would like a 
quick fix. Thanks.
 

> Resolve small file problem in RocksDB incremental checkpoint
> ------------------------------------------------------------
>
>                 Key: FLINK-11937
>                 URL: https://issues.apache.org/jira/browse/FLINK-11937
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently when incremental checkpoint is enabled in RocksDBStateBackend a 
> separate file will be generated on DFS for each sst file. This may cause 
> “file flood” when running intensive workload (many jobs with high 
> parallelism) in big cluster. According to our observation in Alibaba 
> production, such file flood introduces at lease two drawbacks when using HDFS 
> as the checkpoint storage FileSystem: 1) huge number of RPC request issued to 
> NN which may burst its response queue; 2) huge number of files causes big 
> pressure on NN’s on-heap memory.
> In Flink we ever noticed similar small file flood problem and tried to 
> resolved it by introducing ByteStreamStateHandle(FLINK-2808), but this 
> solution has its limitation that if we configure the threshold too low there 
> will still be too many small files, while if too high the JM will finally 
> OOM, thus could hardly resolve the issue in case of using RocksDBStateBackend 
> with incremental snapshot strategy.
> We propose a new OutputStream called 
> FileSegmentCheckpointStateOutputStream(FSCSOS) to fix the problem. FSCSOS 
> will reuse the same underlying distributed file until its size exceeds a 
> preset threshold. We
> plan to complete the work in 3 steps: firstly introduce FSCSOS, secondly 
> resolve the specific storage amplification issue on FSCSOS, and lastly add an 
> option to reuse FSCSOS across multiple checkpoints to further reduce the DFS 
> file number.
> More details please refer to the attached design doc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to