amoghrajesh commented on issue #24171: URL: https://github.com/apache/airflow/issues/24171#issuecomment-4515061930
Yes, this is directly related. Quick summary of what I think can be done:

https://github.com/apache/airflow/pull/65991 solves the memory problem here: it terminates `spark-submit` early, once YARN accepts the submission, and then polls via `yarn application -status`. That is the "non-blocking submit" split.

https://github.com/apache/airflow/issues/67168 aims to solve the crash recovery problem on top of that split: once `spark-submit` returns the app ID immediately, we persist it to `task_state` and reconnect on retry instead of resubmitting a duplicate job.

The two are complementary layers. #65991 is a prerequisite for my work in a sense, i.e. it makes the hook return the app ID early, which is what we need to persist.

One coordination point worth discussing: #65991 uses `yarn application -status` (CLI subprocess) for polling. My plan was to use the YARN RM REST API (`GET /ws/v1/cluster/apps/{id}`). REST avoids spawning a subprocess and does not require the `yarn` CLI on the worker, but it is worth aligning rather than having two different polling mechanisms in the same codebase.

Happy to sync if useful @nailo2c
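For illustration, REST-based polling against `GET /ws/v1/cluster/apps/{id}` could look roughly like the sketch below. It is not code from #65991 or #67168. The function names, the injectable `fetch` callable, and the default interval are hypothetical, and it assumes an unauthenticated ResourceManager endpoint (no Kerberos/SPNEGO, no RM HA failover).

```python
import json
import time
import urllib.request
from typing import Callable

# YARN application states after which the application no longer changes state.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def fetch_app(rm_url: str, app_id: str, timeout: float = 10.0) -> dict:
    """GET /ws/v1/cluster/apps/{id} and return the "app" object of the response."""
    url = f"{rm_url.rstrip('/')}/ws/v1/cluster/apps/{app_id}"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["app"]


def poll_until_terminal(
    app_id: str,
    fetch: Callable[[str], dict],
    interval: float = 10.0,   # assumed default, not taken from either PR
    max_polls: int = 360,     # assumed default, not taken from either PR
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Poll until the app reaches a terminal state; return (state, finalStatus)."""
    for _ in range(max_polls):
        app = fetch(app_id)
        state = app["state"]
        if state in TERMINAL_STATES:
            return state, app.get("finalStatus", "UNDEFINED")
        sleep(interval)
    raise TimeoutError(f"{app_id} did not reach a terminal state after {max_polls} polls")
```

With a reachable ResourceManager, this would be called as `poll_until_terminal(app_id, lambda a: fetch_app("http://rm-host:8088", a))`, where the host and port are placeholders. Keeping `fetch` injectable means the same loop could wrap either the REST call or the `yarn application -status` subprocess, which may make it easier to align the two approaches.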
