potiuk commented on code in PR #33355:
URL: https://github.com/apache/airflow/pull/33355#discussion_r1324802446
##########
airflow/operators/python.py:
##########
@@ -606,7 +614,60 @@ def _prepare_venv(self, venv_path: Path) -> None:
index_urls=self.index_urls,
)
+ def _calculate_cache_hash(self) -> str:
Review Comment:
It's not only a race condition. I think it's expected that things will change
while running. The backfill case is actually valid. The "latest" you mention is
right, but when you run a backfill for multiple dag runs over the span of hours,
the `latest` might very easily change mid-flight. I do expect people to bump
versions of dependencies in their DAGs frequently. So it might well be that
someone updates the venv while someone else runs the backfill, and we have to
have a way to handle that case. For now I do not see how we could handle the
situation if we keep a single venv in this case.
But there is also a much more important case. It's often that people will be
using venvs for different tasks, even in completely different DAGs. And if we
allow caching and specifying where to cache the venv, then there is a very
high potential for different people using the same paths to cache, causing
conflicts. We currently have no mechanism (and we do not plan one) to avoid or
resolve such conflicts if we do not use hashes.
Imagine:
```
DAG1:
PVO(venv_cache_path=/cache/my_venv,requirements="x")
some other team:
DAG2:
PVO(venv_cache_path=/cache/my_venv,requirements="y")
```
When you get this and keep the venv in the same folder, some of the tasks
will start randomly failing (depending on the state of the cache on the worker
we are at). If the cache there was created by DAG1, then DAG2's tasks will
fail; if the cache was created by DAG2, DAG1's tasks will start failing on that
worker. The venv will be there, but it will be a completely different venv. In
this case we cannot delete/recreate the existing venv when we see "wrong
requirements" - because the other DAG's tasks might be using it. So the only
option we have, in case we detect that we want "y" but we see "x", is to fail
such a task.
I don't think your proposal, @uranusjr, has a good solution for that case.
On the other hand, the `hash` approach is WAY better - it solves this very
nicely, and it has multiple other advantages that enhance the effect of caching
and decrease the venv footprint by allowing the same venv to be shared between
seemingly unrelated tasks.
This is precisely why pre-commit hooks have a rather small footprint: a
similar "hash" approach is exactly how it works there - multiple unrelated
pre-commit hooks with the same requirements share the same venv under-the-hood.
The way the hash approach will work:
* it will not fail randomly - it will **just** work
* even better - it will reuse DAG1's task venv in DAG2 if DAG2 has "x"
as its requirement
* the organisation might even choose to dedicate a volume for all caches
this way - to reuse and optimise caching - and ask ALL their teams to use the
same CACHE_FOLDER. This way there is no need to agree on additional namespacing
and conventions between the teams. The "hash" will automatically create or
reuse the venv when needed in the most optimized way possible.
I really don't see how we can implement it without using a hash (but maybe I
am missing something?).
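To make the idea concrete, here is a minimal sketch of the hash approach I
have in mind (illustrative only - the function name and hashing details are my
assumptions, not the actual implementation in this PR): derive the venv
directory from a hash of the requirements (and the Python version), so
identical requirement sets share one venv and differing ones never collide,
even inside a single shared CACHE_FOLDER.

```python
# Illustrative sketch, NOT the PR's actual code: derive a per-requirements
# venv directory under a shared cache folder from a content hash.
import hashlib
from pathlib import Path


def venv_cache_dir(cache_folder: str, requirements: list[str], python_version: str) -> Path:
    """Return a venv directory keyed by the hash of requirements + Python version."""
    hasher = hashlib.sha256()
    # Sort so that requirement ordering does not change the hash.
    for req in sorted(requirements):
        hasher.update(req.encode("utf-8"))
    hasher.update(python_version.encode("utf-8"))
    return Path(cache_folder) / f"venv-{hasher.hexdigest()[:16]}"


# DAG1 ("x") and DAG2 ("y") get different directories - no conflict:
d1 = venv_cache_dir("/cache/my_venv", ["x"], "3.11")
d2 = venv_cache_dir("/cache/my_venv", ["y"], "3.11")
assert d1 != d2
# ...while unrelated tasks with identical requirements share one venv:
assert venv_cache_dir("/cache/my_venv", ["x"], "3.11") == d1
```

With this scheme the cache path the user configures is only the parent folder;
the hash-derived subdirectory does the namespacing automatically.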
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]