[ 
https://issues.apache.org/jira/browse/FLINK-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880465#comment-16880465
 ] 

zhijiang edited comment on FLINK-13100 at 7/8/19 3:29 PM:
----------------------------------------------------------

Yes, the previous SpilledSubpartitionView really had a deadlock issue based on 
two buffers. Now we have a similar issue as before, but it surfaces as an 
IOException instead.

Your understanding of the above two parts is right, but there exists another case:
 * When the view is created, it is enqueued into the netty thread loop for the 
first time, because at that point it has both data and credit, as you mentioned.

 * Then a buffer is fetched from the view and writeAndFlush() is called. If the 
view is still available in the queue, it must have credit, because the next 
buffer is always available thanks to the read-ahead. This case is fine, because 
we only trigger the next writeAndFlush() after the previous one is done, as 
long as the current queue is not empty.

 * But if the view is not available (no credit) when writeAndFlush() is called, 
the queue in the netty thread loop is empty. Before the future of 
writeAndFlush() is done, the netty thread can still process an AddCredit 
message, which makes the view available again; it is then re-added into the 
queue and the next writeAndFlush() is triggered. That means the previous buffer 
has not been recycled yet, but the next write is already triggered, which 
causes the problem. The sequence might look like this: first writeAndFlush() 
pending -> AddCredit (triggers the second writeAndFlush()) -> first 
writeAndFlush() finishes and recycles the buffer.

 * I think this might have been introduced by the improvement of reusing the 
flink buffer in the netty stack since release-1.5. writeAndFlush() can be 
broken into two steps, write and flush. Previously, once the write step 
finished, the flink buffer had already been copied into netty's internal 
ByteBuffer and could be recycled right away, so triggering the second 
writeAndFlush() before the first one completed caused no problem. But now the 
write step keeps referencing the flink buffer in the netty stack until the 
flush is done.
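The write/flush distinction in the last bullet can be sketched in plain Java. 
This is illustrative only; the `Buffer` class and both method names are made-up 
simplifications, not the actual Flink or Netty APIs:

```java
import java.util.Arrays;

// Illustrative sketch: models why the zero-copy change moves the point at
// which a flink buffer can be recycled. Not the real Flink/Netty types.
public class BufferLifetimeSketch {

    // A reference-counted buffer, loosely modelled on Flink's network Buffer.
    static class Buffer {
        final byte[] data = {1, 2, 3};
        private int refCount = 1;
        void release() { refCount--; }
        boolean isRecycled() { return refCount == 0; }
    }

    // Pre-1.5 style: write() copies the bytes into netty-internal memory,
    // so the flink buffer can already be recycled before the flush completes.
    static boolean recycledBeforeFlushWithCopy() {
        Buffer buffer = new Buffer();
        byte[] nettyInternal = Arrays.copyOf(buffer.data, buffer.data.length);
        buffer.release(); // safe: netty only needs its own copy from here on
        boolean recycledBeforeFlush = buffer.isRecycled();
        // ... flush() would later send nettyInternal over the wire ...
        return recycledBeforeFlush && nettyInternal.length == 3;
    }

    // Zero-copy style: write() keeps a reference to the same flink buffer,
    // which is only released when the flush future completes.
    static boolean recycledBeforeFlushWithZeroCopy() {
        Buffer buffer = new Buffer();
        Buffer heldByNettyStack = buffer; // no copy: same object is referenced
        boolean recycledBeforeFlush = buffer.isRecycled(); // false: in flight
        heldByNettyStack.release(); // models the flush-completion listener
        return recycledBeforeFlush;
    }

    public static void main(String[] args) {
        System.out.println("copying write recycles early: "
                + recycledBeforeFlushWithCopy());
        System.out.println("zero-copy write recycles early: "
                + recycledBeforeFlushWithZeroCopy());
    }
}
```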

I will try to reproduce this process in the relevant unit tests to verify it, 
and I might submit the test tomorrow to make it easier to understand.
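A rough, Netty-free sketch of the test I have in mind is below. 
`FakeFileBufferReader` and the two-segment counting model are simplifications 
for illustration, not the real classes:

```java
// Netty-free sketch of the suspected race. FakeFileBufferReader only models
// the two fixed memory segments of FileBufferReader; it is not the real class.
public class WriteAndFlushRaceSketch {

    static class FakeFileBufferReader {
        private int freeSegments = 2; // the reader owns two reusable segments

        // Hands a buffer to the netty stack; mirrors FileBufferReader#nextBuffer,
        // which throws when no segment is free for the read-ahead.
        void nextBuffer() {
            if (freeSegments == 0) {
                throw new IllegalStateException(
                        "nextBuffer() called before the previous buffer was recycled");
            }
            freeSegments--;
        }

        // In the real code this runs in the writeAndFlush() completion listener.
        void recycle() { freeSegments++; }
    }

    static boolean reproduceRace() {
        FakeFileBufferReader reader = new FakeFileBufferReader();
        reader.nextBuffer(); // 1st writeAndFlush(): buffer in flight, future pending
        reader.nextBuffer(); // read-ahead occupies the second segment
        try {
            // AddCredit is processed before the pending future completes, so the
            // next writeAndFlush() is triggered without any recycle() in between.
            reader.nextBuffer();
            return false; // no exception: race not reproduced
        } catch (IllegalStateException expected) {
            return true;  // the unexpected-exception situation from this issue
        }
    }

    static boolean correctOrdering() {
        FakeFileBufferReader reader = new FakeFileBufferReader();
        reader.nextBuffer();
        reader.nextBuffer();
        reader.recycle();    // first future completes before the next write
        reader.nextBuffer(); // now succeeds
        return true;
    }

    public static void main(String[] args) {
        System.out.println("race reproduced: " + reproduceRace());
        System.out.println("correct ordering ok: " + correctOrdering());
    }
}
```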



> Fix the unexpected IOException during FileBufferReader#nextBuffer
> -----------------------------------------------------------------
>
>                 Key: FLINK-13100
>                 URL: https://issues.apache.org/jira/browse/FLINK-13100
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>            Reporter: zhijiang
>            Priority: Blocker
>
> In the implementation of FileBufferReader#nextBuffer, we expect the next 
> memory segment to always be available, based on the assumption that 
> nextBuffer can only be called after the previous buffer has been recycled. 
> Otherwise the current implementation throws an IOException.
> In fact, this assumption does not hold once the credit-based flow control 
> and zero-copy features of the network stack are taken into account. The 
> detailed process is as follows:
>  * The netty thread finishes calling channel.writeAndFlush() in 
> PartitionRequestQueue and adds a listener to handle the ChannelFuture later. 
> Until the future is done, the corresponding buffer is not recycled, because 
> of the zero-copy improvement.
>  * Before the previous future is done, the netty thread can trigger the 
> next writeAndFlush() by processing an addCredit message; 
> FileBufferReader#nextBuffer then throws an exception because the previous 
> buffer has not been recycled.
> We thought of several ways to solve this potential bug:
>  * Do not trigger the next writeAndFlush() before the previous future is 
> done. To do so we would have to maintain the future state and check it in 
> the relevant actions. I worry this might cause a performance regression in 
> network throughput and add extra state management.
>  * Adjust the implementation of the current FileBufferReader. We previously 
> regarded the blocking partition view as always available, based on reading 
> the next buffer ahead, so it was always added into the available queue in 
> PartitionRequestQueue. Actually, this read-ahead only simplifies the process 
> of BoundedBlockingSubpartitionReader#notifyDataAvailable. The view's 
> availability could instead be judged by the available buffers inside 
> FileBufferReader. When a buffer is recycled back into FileBufferReader after 
> writeAndFlush() is done, the reader could call notifyDataAvailable to add 
> the view into the available queue in PartitionRequestQueue.
> I prefer the second way because it does not bring any negative impact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
