zhijiangW edited a comment on issue #8242: [FLINK-6227][network] Introduce the 
DataConsumptionException for downstream task failure
URL: https://github.com/apache/flink/pull/8242#issuecomment-490011255
 
 
   Yes, the blacklist could solve the hardware problems in the future.
   
   As for `d`, you are right that the `IOException` might be caused by multiple 
factors, so we do not cover it in this PR.
   
   As for `f`, if I understand correctly, the scenario is similar to case `a` 
that I mentioned above. 
   If the TM is unreachable, the first phase, establishing the connection, 
would fail with `ConnectionTimeoutException`, `ConnectionRefusedException`, 
etc. For `a` I proposed `DataConnectionException` rather than 
`PartitionNotFound`, because the failure might be caused by a temporary network 
problem, and retrying the connection later might succeed. 
   
   In other words, from the `ConnectionTimeoutException` or 
`ConnectionRefusedException` it receives, the consumer cannot tell whether the 
producer TM is really lost or there is only a temporary network issue. 
   
   Once the connection is established, the second phase is to send the logical 
`PartitionRequest` message. If the producer TM becomes unreachable at this 
point and causes a network exception, we still cannot distinguish whether the 
TM is really lost or the network issue is temporary. The only exception is when 
the consumer receives a specific `PartitionNotFoundException` over the network, 
which covers cases `b` and `c`.
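   
   To make the distinction concrete, here is a rough sketch of how the consumer 
side could classify what it sees. It is not code from this PR: 
`PartitionNotFoundSketchException`, `FailureCause`, and 
`ConsumerFailureClassifier` are hypothetical names, and the JDK's 
`ConnectException` and `SocketTimeoutException` stand in for the connection 
failures mentioned above.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical stand-in for the PartitionNotFoundException that the producer
// sends back over the network (cases b and c). Not Flink's real class.
class PartitionNotFoundSketchException extends IOException {
    PartitionNotFoundSketchException(String message) {
        super(message);
    }
}

// Hypothetical classification of what the consumer can conclude.
enum FailureCause {
    // The producer answered explicitly: the partition is gone (cases b, c).
    PRODUCER_PARTITION_MISSING,
    // Producer TM lost or temporary network issue; the consumer cannot tell.
    AMBIGUOUS
}

final class ConsumerFailureClassifier {
    private ConsumerFailureClassifier() {}

    static FailureCause classify(Throwable error) {
        // Check the explicit signal first, since it is also an IOException.
        if (error instanceof PartitionNotFoundSketchException) {
            return FailureCause.PRODUCER_PARTITION_MISSING;
        }
        // Phase 1 (connect) failures: refused or timed out.
        if (error instanceof ConnectException
                || error instanceof SocketTimeoutException) {
            return FailureCause.AMBIGUOUS;
        }
        // Phase 2 (request) network errors and any other IOException
        // (case d) can have multiple causes, so they stay ambiguous too.
        return FailureCause.AMBIGUOUS;
    }
}
```

   Under these assumptions, only the explicit partition-not-found signal maps 
to a definite producer-side failure; everything else stays ambiguous and is a 
candidate for retrying.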
   
   So maybe we should focus only on `b` and `c` for now?

