[ https://issues.apache.org/jira/browse/FLINK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849615#comment-16849615 ]
Yingjie Cao commented on FLINK-12070:
-------------------------------------

I am now testing the latest implementation and have found some problems. When the data volume is too large to fit into memory, memory can run out and no new physical pages can be mapped. The TM then appears to block and stops responding (I tried to check the stack with jstack but failed). Tools like free and top are also affected (they do not respond in time), and the CPU usage of the TM and of the kernel swap process increases suddenly. Meanwhile, the disk I/O speed is low; it seems the kernel flush is affected as well. This continues for several or even tens of minutes, after which the job either succeeds or hits a heartbeat timeout (I use a large heartbeat timeout and akka timeout).

The old implementation (spilled subpartition) does not have this problem. Although the old implementation can also leverage the page cache to accelerate writes, the page cache is not required: if no more memory is left, data can be written to disk directly.

Later, I will post more test results under this JIRA.

Concerning the bug I mentioned earlier: I meant that the file was deleted (saying it was closed was wrong) when the EOF event was sent. That is a bug in the earlier mmappartition branch; there is no problem with the master branch.

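To make the contrast between the two write paths concrete, here is a minimal Java sketch. It is not Flink code: the class name `WritePathSketch` and both method names are hypothetical, and it only illustrates the general difference between writing through a `FileChannel` and writing through a memory-mapped region, under the assumption that the new implementation relies on memory-mapped files.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical illustration only; names are not taken from the Flink code base.
final class WritePathSketch {

    // Path 1: explicit write() calls. The kernel may buffer the data in the
    // page cache, but the process does not depend on the pages staying mapped.
    static void writeWithChannel(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }

    // Path 2: memory-mapped region. Writes touch mapped pages directly, so the
    // kernel has to supply physical pages for every page that is written.
    static void writeWithMmap(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            MappedByteBuffer region =
                    ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            region.put(data);
            region.force(); // ask the OS to flush the dirty pages to the file
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[8192];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        Path viaChannel = Files.createTempFile("channel", ".bin");
        Path viaMmap = Files.createTempFile("mmap", ".bin");
        writeWithChannel(viaChannel, data);
        writeWithMmap(viaMmap, data);
        // Both files should end up with identical contents.
        System.out.println(Arrays.equals(
                Files.readAllBytes(viaChannel), Files.readAllBytes(viaMmap)));
    }
}
```

Both paths produce the same file contents; what differs is how memory pressure shows up. This small example does not reproduce the stall described above, which only appears when the written data far exceeds the available memory.
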
> Make blocking result partitions consumable multiple times
> ---------------------------------------------------------
>
>                 Key: FLINK-12070
>                 URL: https://issues.apache.org/jira/browse/FLINK-12070
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Till Rohrmann
>            Assignee: Stephan Ewen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>         Attachments: image-2019-04-18-17-38-24-949.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In order to avoid writing produced results multiple times for multiple
> consumers and in order to speed up batch recoveries, we should make the
> blocking result partitions to be consumable multiple times. At the moment a
> blocking result partition will be released once the consumers has processed
> all data. Instead the result partition should be released once the next
> blocking result has been produced and all consumers of a blocking result
> partition have terminated. Moreover, blocking results should not hold on slot
> resources like network buffers or memory as it is currently the case with
> {{SpillableSubpartitions}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)