Till Rohrmann created FLINK-1604:
------------------------------------

             Summary: Livelock in PartitionRequestClientFactory
                 Key: FLINK-1604
                 URL: https://issues.apache.org/jira/browse/FLINK-1604
             Project: Flink
          Issue Type: Bug
            Reporter: Till Rohrmann


During a job restart, we observed a livelock in 
{{PartitionRequestClientFactory.createPartitionRequestClient}}. We suspect the 
following cause:

To obtain a new {{PartitionRequestClient}}, a new {{ConnectingChannel}} is 
created. This channel acts as a future for the client. The channel is inserted 
into a {{ConcurrentHashMap}} so that other threads trying to create a client 
for the same address wait on the future. Once the initially requesting thread 
has obtained the client, it replaces the {{ConnectingChannel}} in the 
{{ConcurrentHashMap}} with the client. When the client is disposed, it is 
removed from the {{ConcurrentHashMap}}, but only if the client itself is still 
the value stored in the map. 
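
The intended protocol described above can be sketched roughly as follows. This 
is a minimal illustration, not the actual Flink code: {{ClientCacheSketch}}, 
{{Client}}, {{getOrCreate}} and {{dispose}} are hypothetical names, a 
{{CompletableFuture}} stands in for the {{ConnectingChannel}}, and a plain 
{{String}} stands in for the {{RemoteAddress}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the caching protocol; not the Flink implementation.
final class ClientCacheSketch {

    /** Hypothetical stand-in for PartitionRequestClient. */
    static final class Client {}

    // The value is either a pending future (stand-in for ConnectingChannel) or a Client.
    final ConcurrentMap<String, Object> entries = new ConcurrentHashMap<>();

    Client getOrCreate(String address) {
        Object entry = entries.get(address);
        if (entry == null) {
            CompletableFuture<Client> connecting = new CompletableFuture<>();
            Object raced = entries.putIfAbsent(address, connecting);
            if (raced == null) {
                // This thread won the race: "connect", then swap the future for the client.
                Client client = new Client();
                connecting.complete(client);
                entries.replace(address, connecting, client);
                return client;
            }
            entry = raced; // another thread inserted an entry first
        }
        if (entry instanceof Client) {
            return (Client) entry;
        }
        @SuppressWarnings("unchecked")
        CompletableFuture<Client> pending = (CompletableFuture<Client>) entry;
        return pending.join(); // wait on the future created by the other thread
    }

    /** Removes the entry only if the map still holds this exact client. */
    void dispose(String address, Client client) {
        entries.remove(address, client);
    }

    public static void main(String[] args) {
        ClientCacheSketch cache = new ClientCacheSketch();
        Client first = cache.getOrCreate("host:1");
        System.out.println(first == cache.getOrCreate("host:1"));
    }
}
```

The conditional {{remove(key, value)}} in {{dispose}} is the detail that matters 
here: it is a no-op whenever the map holds anything other than the given client.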

And here is where things can go wrong. If the requesting thread is interrupted 
after it created the {{ConnectingChannel}} and inserted it into the 
{{ConcurrentHashMap}} but before inserting the {{PartitionRequestClient}} into 
the same map, then the map entry for the given {{RemoteAddress}} remains the 
{{ConnectingChannel}}. Assume now that another thread waited on this channel 
and eventually obtained the client from this future. In the wake of cancelling 
the job, the client would be disposed by the corresponding 
{{RemoteInputChannel}}. Once the job has been restarted, new threads want to 
connect to the {{RemoteAddress}} and they find the {{ConnectingChannel}} with 
the disposed {{PartitionRequestClient}} as future result in the hash map. They 
retrieve the channel and see that the client has already been disposed. Now 
they try to delete the client from the {{ConcurrentHashMap}} to make room for a 
new one. However, this deletion fails because the map still contains the 
{{ConnectingChannel}} rather than the client, so every subsequent attempt 
presumably finds the same stale entry again.
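
The suspected sequence can be replayed in a single thread to show why the 
removal never succeeds. This is only a sketch under the assumptions of the 
theory above: {{StaleFutureSketch}}, {{Client}}, {{ConnectingFuture}} and 
{{countFailedAttempts}} are hypothetical stand-ins, and the retry loop is 
bounded so that the example terminates.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical replay of the suspected interleaving; not the Flink implementation.
final class StaleFutureSketch {

    /** Hypothetical stand-in for PartitionRequestClient. */
    static final class Client {
        volatile boolean disposed;
    }

    /** Hypothetical stand-in for ConnectingChannel: a completed future holding the client. */
    static final class ConnectingFuture {
        final Client client;
        ConnectingFuture(Client client) { this.client = client; }
    }

    /** Returns how many of maxAttempts lookups ended with a failed removal. */
    static int countFailedAttempts(int maxAttempts) {
        ConcurrentMap<String, Object> clients = new ConcurrentHashMap<>();
        String address = "remote:1234"; // stand-in for a RemoteAddress

        // 1. The requesting thread inserts the future and is then interrupted,
        //    so the future is never replaced by the client.
        Client client = new Client();
        ConnectingFuture future = new ConnectingFuture(client);
        clients.putIfAbsent(address, future);

        // 2. A waiting thread got the client from the future; on job
        //    cancellation the client is disposed and removal is attempted.
        client.disposed = true;
        clients.remove(address, client); // no-op: the map holds the future

        // 3. After the restart, new threads find the stale future again and again.
        int failed = 0;
        for (int i = 0; i < maxAttempts; i++) {
            Object entry = clients.get(address);
            Client found = entry instanceof ConnectingFuture
                    ? ((ConnectingFuture) entry).client
                    : (Client) entry;
            if (found == null || !found.disposed) {
                break; // would create or reuse a client here
            }
            // Conditional removal compares against the client, not the future.
            if (!clients.remove(address, found)) {
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(countFailedAttempts(5));
    }
}
```

In this replay every attempt fails, because {{remove(key, value)}} only deletes 
the entry when the stored value equals the given client, and the stored value 
is still the future.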

That is currently our best theory for the livelock we observed on Travis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
