StephanEwen commented on pull request #13574:
URL: https://github.com/apache/flink/pull/13574#issuecomment-718210093


   @stevenzwu You are right, this tradeoff exists. It exists in lots of places 
in Flink (and I believe in other systems as well).
   Either you have synchronous error reporting on job submission, or you 
support long initialization phases.
   
   Flink has generally moved to supporting longer initialization phases, 
because they happen all the time (lots of files to enumerate, blocking 
connections to S3 / Kafka, etc.). The CLI and SQL client switch to status 
polling immediately after submitting the job, so they still report errors fast.
   
   File enumeration already happens asynchronously to job submission in the 
current code, because the whole execution graph construction and job 
initialization is already asynchronous to the job submission. At least it is in 
Flink 1.12. That change was also made with state backend initialization, 
savepoint loading, etc. in mind. So if these parts take long, it no longer 
leads to a request timeout for the job submission. But it does mean some errors 
are no longer returned on the "submit job" call, but only on a later status 
poll. Tradeoffs :-/
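
   To make the tradeoff concrete, here is a minimal sketch in plain Java. It is 
not Flink code: `AsyncSubmissionSketch`, `submit`, `poll` and the `Status` enum 
are hypothetical names made up for illustration. It only shows the shape of the 
behavior: the submit call returns right away, and an initialization failure 
becomes visible on a later status poll.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a job dispatcher; none of these names are Flink APIs.
class AsyncSubmissionSketch {
    enum Status { INITIALIZING, RUNNING, FAILED }

    private final Map<Integer, CompletableFuture<Void>> jobs = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    // Returns immediately with a job id. The (possibly slow) initialization
    // runs in the background, so it cannot cause a submit timeout.
    int submit(Runnable initialization) {
        int id = nextId.incrementAndGet();
        jobs.put(id, CompletableFuture.runAsync(initialization));
        return id;
    }

    // Status poll: initialization errors become visible here, not on submit().
    Status poll(int id) {
        CompletableFuture<Void> init = jobs.get(id);
        if (init == null) {
            throw new IllegalArgumentException("Unknown job id: " + id);
        }
        if (!init.isDone()) {
            return Status.INITIALIZING;
        }
        return init.isCompletedExceptionally() ? Status.FAILED : Status.RUNNING;
    }

    // What a client could do right after submitting: poll until initialization ends.
    Status pollUntilInitialized(int id) {
        Status status;
        while ((status = poll(id)) == Status.INITIALIZING) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
        return status;
    }
}
```

   A client that polls right after submitting sees `FAILED` almost as quickly 
as it would have seen a synchronous error, which is why the CLI and SQL client 
still report errors fast.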
   
   Moving file enumeration into the `SplitEnumerator` and doing it 
asynchronously there would be totally fine; you get similar behavior as now, 
with the added benefit that the job starts scheduling tasks faster. Enumerators 
are initialized as part of the execution graph initialization, and the 
scheduling of tasks starts only after that initialization completes.
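
   A rough sketch of that idea, again in plain Java rather than against the 
real Flink interfaces: `AsyncFileEnumeratorSketch`, `fileLister` and 
`awaitSplits` are hypothetical names. The point is only that `start()` stays 
cheap, and that a listing error surfaces after `start()` has returned.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical enumerator; it mirrors the idea, not Flink's SplitEnumerator interface.
class AsyncFileEnumeratorSketch {
    // Possibly slow listing, e.g. walking a large directory tree on S3.
    private final Callable<List<String>> fileLister;
    private CompletableFuture<List<String>> listing;

    AsyncFileEnumeratorSketch(Callable<List<String>> fileLister) {
        this.fileLister = fileLister;
    }

    // Cheap: only kicks off the listing, so initialization is not held up by it.
    void start() {
        listing = CompletableFuture.supplyAsync(() -> {
            try {
                return fileLister.call();
            } catch (Exception e) {
                // Carry checked exceptions through the future.
                throw new CompletionException(e);
            }
        });
    }

    // Blocks until the listing is done. A listing failure is thrown here
    // (as a CompletionException), i.e. after start() has already returned.
    List<String> awaitSplits() {
        if (listing == null) {
            throw new IllegalStateException("start() has not been called");
        }
        return listing.join();
    }
}
```

   In a real enumerator the result would be handed back to the enumerator's own 
thread instead of being awaited with a blocking call; the blocking 
`awaitSplits()` is just there to keep the sketch short.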


