[ 
https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885251#comment-16885251
 ] 

zhijiang commented on FLINK-13245:
----------------------------------

Thanks for finding this potential issue and for the investigation! [~Zentol] 
[~azagrebin] 
After reviewing the relevant code, there are actually two issues here:
 * Regarding the handling of {{CancelPartitionRequest}}, I guess we had two 
assumptions before. One assumption is that `availableReaders` is always 
equivalent to `allReaders`, but this no longer holds because of credit-based 
flow control. The other assumption is that the {{CancelPartitionRequest}} 
logic is best-effort, because the last {{RemoteInputChannel}} would send a 
{{CloseRequest}} message, after which the {{PartitionRequestQueue}} would 
always release all the view readers in `allReaders` in the end. The details 
are in {{PartitionRequestQueue#close}}. If so, there seems to be no problem 
even when a previous {{CancelPartitionRequest}} does not take effect. But I 
think it would be more proper/strict to handle the {{CancelPartitionRequest}} 
via `allReaders` instead.

 * The other root problem is that 
{{ReleaseOnConsumptionResultPartition#onConsumedSubpartition}} does not work 
for the {{BoundedBlockingSubpartition}} implementation. 
{{onConsumedSubpartition}} is triggered by a consumer notification via the 
network, which only happens for {{RemoteInputChannel}}. In the case of 
{{LocalInputChannel}}, {{ResultSubpartitionView#releaseAllResources}} is 
called directly instead. So in the {{SlotCountExceedingParallelismTest}} there 
are some local channels, which means the reference counter in 
{{ReleaseOnConsumptionResultPartition}} never decreases to 0, so it never 
releases the {{BoundedBlockingSubpartition}}. But 
{{BoundedBlockingSubpartition#releaseReaderReference}} checks the `isReleased` 
flag before deleting the file, hence the conflict. In fact this issue already 
existed in the previous {{SpillableSubpartition}}, but in the test the 
spillable subpartition always stayed in memory, which hid the potential 
problem.
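The first point above can be illustrated with a minimal sketch of the reader 
bookkeeping (the class and field names here are illustrative, not Flink's 
actual implementation): with credit-based flow control, a reader without 
credit sits in the registered set but not in the available queue, so a cancel 
that only scans the available queue can miss it.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified model of the two reader collections discussed above.
// With credit-based flow control, availableReaders only holds readers
// that currently have credit, so it is no longer equivalent to allReaders.
class ReaderQueueModel {
    final Set<String> allReaders = new LinkedHashSet<>();
    final ArrayDeque<String> availableReaders = new ArrayDeque<>();

    void register(String reader, boolean hasCredit) {
        allReaders.add(reader);
        if (hasCredit) {
            availableReaders.add(reader);
        }
    }

    // Lenient variant: a cancel scanning only availableReaders can miss
    // a registered reader that happens to have no credit at that moment.
    boolean cancelViaAvailable(String reader) {
        boolean found = availableReaders.remove(reader);
        if (found) {
            allReaders.remove(reader);
        }
        return found;
    }

    // Stricter variant, as suggested above: cancel against allReaders,
    // which always contains every registered reader.
    boolean cancelViaAll(String reader) {
        availableReaders.remove(reader);
        return allReaders.remove(reader);
    }
}
```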
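The second point can be modeled with a toy release-on-consumption counter 
(again illustrative names, not the real classes): each consumption 
notification decrements the counter and the partition is released at zero, so 
a local channel that frees its view directly, without going through the 
notification path, leaves the counter stuck above zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the release-on-consumption ref-counting described above.
// The partition (and its spilled file) is only released once every
// subpartition has reported consumption.
class ConsumptionCounterModel {
    private final AtomicInteger pendingConsumptions;
    private boolean released;

    ConsumptionCounterModel(int numSubpartitions) {
        this.pendingConsumptions = new AtomicInteger(numSubpartitions);
    }

    // Path taken for remote consumers: the network notification
    // decrements the counter, and the last one triggers the release.
    void onConsumedSubpartition() {
        if (pendingConsumptions.decrementAndGet() == 0) {
            released = true; // the spilled file would be deleted here
        }
    }

    // Path taken for local consumers: the view's resources are freed
    // directly, but the counter is never decremented, so it can never
    // reach 0 and the partition is never released.
    void releaseViewDirectly() {
        // intentionally no decrement
    }

    boolean isReleased() {
        return released;
    }
}
```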

> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: zhijiang
>            Priority: Blocker
>             Fix For: 1.9.0
>
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows a large 
> number of {{.channel}} files continue to reside in a 
> {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far these files are still being used by a 
> {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses 
> ref-counting to ensure we don't release data while a reader is still present. 
> However, at the end of the job this count has not reached 0, and thus nothing 
> is being released.
> The same issue is also present on the {{ResultPartition}} level; the 
> {{ReleaseOnConsumptionResultPartition}} is also being released while the 
> ref-count is greater than 0.
> Overall it appears like there's some issue with the notifications for 
> partitions being consumed.
> It is plausible that this issue has recently caused failures on Travis where 
> builds were failing due to a lack of disk space.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
