[GitHub] [flink] zhijiangW edited a comment on issue #8242: [FLINK-6227][network] Introduce the DataConsumptionException for downstream task failure

GitBox Mon, 06 May 2019 21:21:02 -0700

zhijiangW edited a comment on issue #8242: [FLINK-6227][network] Introduce the 
DataConsumptionException for downstream task failure
URL: https://github.com/apache/flink/pull/8242#issuecomment-489901309
 
 
   @tillrohrmann thanks for the further suggestions! I agree with your overall 
ideas.
   
   ->What kind of environment/hardware issues do you have in mind that could 
cause repeated failures of reading the result data? 
   
   I mean disk or network hardware problems on the producer side that cannot 
be recovered in a short time, so it is better to restart the producer on 
another machine. We have encountered this corner case in production.
   
   -> Did I understand you correctly, that the `PartitionNotFoundException` is 
good enough and, thus, we don't need to introduce a new exception?
   
   The information in the current `PartitionNotFoundException` is enough for 
the `JobMaster` to restart the producer, but it cannot cover all the cases. So 
I would like to list all the possible cases to confirm them with you first:
   
   - a. TCP connection failure: it might be caused by a lost producer TM or a 
network hardware issue. We might introduce `DataConnectionException` for this.
   
   - b. `PartitionNotFound`: the `ResultPartition` has been released from the 
`ResultPartitionManager`, so the producer needs to be restarted immediately.
   
   - c. `ResultSubpartition#createReaderView` throws `IOException`: it might 
be caused by a corrupt or deleted disk file for `BlockingResultSubpartition`. 
It could also be wrapped into the existing `PartitionNotFound`.
   
   - d. `BlockingResultSubpartitionView#getNextBuffer` throws `IOException`: 
the cause is the same as in c above. `PartitionNotFound` might also be used 
here.
   
   - e. Network server exception while transferring data: this case seems more 
complicated. It might be caused by a lost producer TM, a temporary network 
problem, a server hardware issue, etc. The consumer, as the client, might 
notice it via an inactive network channel or an `ErrorResponse` from the 
server. We could introduce `DataTransferException` to cover all of these.
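
To make cases c and d concrete, here is a minimal sketch of wrapping a low-level `IOException` into a `PartitionNotFound`-style exception. All types below are simplified stand-ins written for illustration, not the real Flink classes; in particular the `(String, Throwable)` constructor and the `Subpartition` interface are assumptions.

```java
import java.io.IOException;

/** Sketch only: every type here is a simplified stand-in, not the real Flink class. */
final class ReaderViewWrappingSketch {

    /** Stand-in for PartitionNotFoundException; this constructor shape is an assumption. */
    static final class PartitionNotFoundException extends IOException {
        PartitionNotFoundException(String partitionId, Throwable cause) {
            super("Partition " + partitionId + " not found or unreadable.", cause);
        }
    }

    /** Hypothetical minimal view of a subpartition whose backing file may be corrupt or deleted. */
    interface Subpartition {
        Object createReaderView() throws IOException;
    }

    /**
     * Cases c/d: translate a low-level IOException into the exception type
     * that already signals "restart the producer".
     */
    static Object createReaderViewOrFail(String partitionId, Subpartition subpartition)
            throws PartitionNotFoundException {
        try {
            return subpartition.createReaderView();
        } catch (PartitionNotFoundException e) {
            // Already the right type, do not wrap it a second time.
            throw e;
        } catch (IOException e) {
            // Keep the original failure as the cause for debugging.
            throw new PartitionNotFoundException(partitionId, e);
        }
    }
}
```

The same pattern could apply to `getNextBuffer` in case d; whether the original `IOException` should be kept as the cause is a design choice left open here.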
   
   Cases {b, c, d} above could be handled by restarting the producer 
immediately, and the current `PartitionNotFound` could be used to cover them.
   
   For cases {a, e}, we might introduce new exceptions to cover them, and the 
failover strategy might apply different rules to them.
   
   So I think there are two options: 
   
   - If we want to cover all of {a, b, c, d, e} ATM, it might be necessary to 
define an abstract `DataConsumptionException` as the parent of the above 
`PartitionNotFoundException`, `DataConnectionException` and 
`DataTransferException`. 
   
   - Or we only address {b, c, d} in the first step ({a, e} might be 
considered in the future if necessary, or in other ways); then the current 
`PartitionNotFound` is enough and no new exceptions are needed ATM. 
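
For the first option, the hierarchy and a possible failover rule could be sketched roughly as below. This is only an illustration: the class bodies, the `FailoverAction` enum, and the threshold-based rule for {a, e} are assumptions made for the example, not existing Flink code or an agreed design.

```java
import java.io.IOException;

/** Sketch only: names follow the discussion, but none of this is existing Flink code. */
final class DataConsumptionSketch {

    /** Proposed abstract parent for all failures of consuming result data (option 1). */
    abstract static class DataConsumptionException extends IOException {
        DataConsumptionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Cases b, c, d: the partition is released, or its data is corrupt or deleted. */
    static final class PartitionNotFoundException extends DataConsumptionException {
        PartitionNotFoundException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Case a: the TCP connection to the producer could not be established. */
    static final class DataConnectionException extends DataConsumptionException {
        DataConnectionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Case e: the transfer failed after the connection was established. */
    static final class DataTransferException extends DataConsumptionException {
        DataTransferException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical outcomes a failover strategy could choose between. */
    enum FailoverAction { RESTART_PRODUCER, RESTART_CONSUMER }

    /**
     * Assumed rule: {b, c, d} restart the producer immediately, while {a, e}
     * restart it only after repeated failures reach a threshold.
     */
    static FailoverAction decide(Throwable failure, int repeatedFailures, int threshold) {
        if (failure instanceof PartitionNotFoundException) {
            return FailoverAction.RESTART_PRODUCER;
        }
        if (failure instanceof DataConsumptionException) {
            return repeatedFailures >= threshold
                    ? FailoverAction.RESTART_PRODUCER
                    : FailoverAction.RESTART_CONSUMER;
        }
        // Anything else is treated as an ordinary consumer-side failure.
        return FailoverAction.RESTART_CONSUMER;
    }
}
```

With the second option, only the `PartitionNotFoundException` branch of `decide` would be needed in the first step.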
   
   Both options make sense to me, so I would like to hear your final opinion, 
or whether you have other options in mind. :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
