zhijiangW commented on issue #8242: [FLINK-6227][network] Introduce the 
DataConsumptionException for downstream task failure
URL: https://github.com/apache/flink/pull/8242#issuecomment-489389513
 
 
   Thanks for your professional reviews and valuable suggestions @tillrohrmann .
   
    I think there are mainly two issues to be confirmed:
   1 . Whether we need a new special `DataConsumptionException` or why not use 
existing `PartitionNotFoundException`?
   In semantic/conceptional aspect, `PartitionNotFoundException` is actually a 
subtype of `DataConsumptionException`. In functional aspect they provide the 
same information currently, so I agree to maintain only one type of exception 
for simplification.  That means to use `PartitionNotFoundException` directly or 
rename it for covering more semantics.
   
   2. In which cases we should report this `PartitionNotFoundException` for 
JobMaster to restart the producer?
   
   Your above mentioned cases are very guidable and I am trying to further 
specify every case:
   
   -> TaskExecutor no longer reachable
   - TE unavailable during requesting sub partition: Executor shutdown/killed 
would cause exception in establishing network connection or requesting sub 
partition, which is covered by `RemoteInputChannel#requestSubpartition` in PR.
   
   - TE unavailable during transferring partition data: Executor 
shutdown/killed would cause network channel inactive, which is also covered by 
`RemoteInputChannel#onError` in PR.
   
   -> TaskExecutor reachable but result partition not available
   It is covered in `RemoteInputChannel#failPartitionRequest` in PR.
   
   -> TaskExecutor reachable, result partition available but sub partition view 
cannot be created (spilled file has been removed)
   The producer would response `ErrorResponse` for requesting sub partition, 
then the consumer would call `RemoteInputChannel#onError` for covering this 
case. Because this `ErrorResponse` is fatal error, then it would bring all the 
remote input channels report `DataConsumptionException`. So maybe it is better 
to wrap `DataConsumptionException` on producer side in this case to avoid 
affecting all the remote input channels.
   
   Besides above three cases, there also exists another cause of partition data 
corrupt/deleted during transferring, which is covered by 
`PartitionRequestQueue#exceptionCaught` in PR ATM. Then the consumer would 
receive `ErrorResponse` to call `RemoteInputChannel#onError`. 
   
   In this case I ever considered wrapping `DataConsumptionException` directly 
on producer side in specific `SpilledSubpartitionView#getNextBuffer` to avoid 
reusing the general `ErrorResponse`, because you might think some internal 
network exceptions via `ErrorResponse` do not need to restart the producer 
side. My previous thought was some network exceptions including 
environment/hardware issues might also be suitable to restart producer in other 
executors, otherwise the consumer might still encounter the same problem after 
restarting to consume data from previous executor. So the `ErrorResponse` with 
fatal property is easy to be reused to represent the partition problem here. Or 
we could regard network environment/hardware as separate issue to be solved, 
then wrap the special `DataConsumptionException` in specific views to 
distinguish with existing `ErrorResponse`.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to