zhijiangW commented on issue #8242: [FLINK-6227][network] Introduce the DataConsumptionException for downstream task failure URL: https://github.com/apache/flink/pull/8242#issuecomment-489389513 Thanks for your professional reviews and valuable suggestions @tillrohrmann . I think there are mainly two issues to be confirmed: 1 . Whether we need a new special `DataConsumptionException` or why not use existing `PartitionNotFoundException`? In semantic/conceptional aspect, `PartitionNotFoundException` is actually a subtype of `DataConsumptionException`. In functional aspect they provide the same information currently, so I agree to maintain only one type of exception for simplification. That means to use `PartitionNotFoundException` directly or rename it for covering more semantics. 2. In which cases we should report this `PartitionNotFoundException` for JobMaster to restart the producer? Your above mentioned cases are very guidable and I am trying to further specify every case: -> TaskExecutor no longer reachable - TE unavailable during requesting sub partition: Executor shutdown/killed would cause exception in establishing network connection or requesting sub partition, which is covered by `RemoteInputChannel#requestSubpartition` in PR. - TE unavailable during transferring partition data: Executor shutdown/killed would cause network channel inactive, which is also covered by `RemoteInputChannel#onError` in PR. -> TaskExecutor reachable but result partition not available It is covered in `RemoteInputChannel#failPartitionRequest` in PR. -> TaskExecutor reachable, result partition available but sub partition view cannot be created (spilled file has been removed) The producer would response `ErrorResponse` for requesting sub partition, then the consumer would call `RemoteInputChannel#onError` for covering this case. Because this `ErrorResponse` is fatal error, then it would bring all the remote input channels report `DataConsumptionException`. So maybe it is better to wrap `DataConsumptionException` on producer side in this case to avoid affecting all the remote input channels. Besides above three cases, there also exists another cause of partition data corrupt/deleted during transferring, which is covered by `PartitionRequestQueue#exceptionCaught` in PR ATM. Then the consumer would receive `ErrorResponse` to call `RemoteInputChannel#onError`. In this case I ever considered wrapping `DataConsumptionException` directly on producer side in specific `SpilledSubpartitionView#getNextBuffer` to avoid reusing the general `ErrorResponse`, because you might think some internal network exceptions via `ErrorResponse` do not need to restart the producer side. My previous thought was some network exceptions including environment/hardware issues might also be suitable to restart producer in other executors, otherwise the consumer might still encounter the same problem after restarting to consume data from previous executor. So the `ErrorResponse` with fatal property is easy to be reused to represent the partition problem here. Or we could regard network environment/hardware as separate issue to be solved, then wrap the special `DataConsumptionException` in specific views to distinguish with existing `ErrorResponse`.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services