potiuk commented on code in PR #33355:
URL: https://github.com/apache/airflow/pull/33355#discussion_r1323758553
##########
airflow/operators/python.py:
##########
@@ -606,7 +614,60 @@ def _prepare_venv(self, venv_path: Path) -> None:
index_urls=self.index_urls,
)
+ def _calculate_cache_hash(self) -> str:
Review Comment:
> I wonder if this is that worthwhile… Instead of relying on the directory
name, we could just dump the values into a JSON file inside the virtual
environment directory, and read it out later for cache validation. That could
help debugging if caching doesn’t work as expected too.
I agree with @jens-scheffler-bosch's "write once, use many" statement.
Validation alone is not nearly enough. What will you do if the validation
fails? Are we going to delete the venv, or recreate it? What happens if we are
running in a Celery Worker (multiple tasks) and each of them is using an
"almost" identical environment? For example, someone submits a change to the
DAG that bumps a dependency version and the new task is scheduled, but on the
same Celery worker we are still running an old version of the same task from
the previous version of the DAG.
Should we delete the venv while the other task is running and recreate it? Or
should we fail the task? Or maybe wait until the other task completes (how?)
I really think venvs should be "write once, use many" and immutable after
they are created. This is precisely what @jens-scheffler-bosch addressed well:
even if we have two tasks attempting to create an identical venv, we have the
file lock that prevents them from doing so - only one will create it and the
other will use it. But once it is already created, you should not modify the
existing venv. This is a different story - you don't want to prevent parallel
"read/modify", you want to prevent parallel "create". You actually want
multiple tasks to use the same venv in parallel once it is created - that's
almost the whole point of it.
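To make the "create once, use many in parallel" idea concrete, here is a
minimal, hedged sketch of the pattern, not the actual operator code. It
assumes POSIX advisory locking via `fcntl` (the real locking mechanism in the
PR may differ), and the names `get_or_create_venv` and the `.created` marker
file are illustrative inventions:

```python
# Sketch of "write once, use many": the first process to take the lock builds
# the venv; everyone else blocks on the lock, then finds it already built and
# simply reuses it. fcntl is POSIX-only; names here are hypothetical.
import fcntl
from pathlib import Path


def get_or_create_venv(cache_dir: Path, cache_hash: str) -> Path:
    venv_path = cache_dir / f"venv-{cache_hash}"
    marker = venv_path / ".created"  # written last => venv is complete
    lock_file = cache_dir / f"venv-{cache_hash}.lock"
    cache_dir.mkdir(parents=True, exist_ok=True)

    with open(lock_file, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until we hold the lock
        try:
            if not marker.exists():
                # We are the single creator; build the venv here.
                venv_path.mkdir(parents=True, exist_ok=True)
                # ... run `python -m venv` + `pip install` in real code ...
                marker.touch()  # only now is the venv usable
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return venv_path  # treated as immutable from here on
```

Note that the lock guards only creation; readers never take it once the
marker exists, so parallel use of a finished venv is unrestricted.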
IMHO validation is not cutting it, because there is no good action you can
take once validation fails. We MUST have different venvs if they are
different.
Interestingly, this is the very same thing that `pre-commit` does.
Pre-commit also calculates a hash for the venv, based on the pre-commit
definition. If you have two or more hooks with the same "dependencies" and
"python version", they will re-use the existing venv; the venvs are even
re-used between multiple branches you might work on (see ~/.cache/pre-commit),
and they also get fully recreated, with a different hash, when anything
changes.
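In that spirit, a cache key for `_calculate_cache_hash()` could be derived
from everything that influences the venv contents, so different inputs always
map to different, immutable venv directories. The sketch below is a hedged
illustration, not the PR's actual implementation; the function name and
parameters are assumptions:

```python
# Hypothetical cache-key computation: hash everything that defines the venv
# (requirements, python version, index URLs), so equal specs share one venv
# and any change yields a fresh directory instead of mutating the old one.
import hashlib
import json


def calculate_cache_hash(requirements, python_version, index_urls=None):
    """Return a stable hash over everything that defines the venv contents."""
    # Sort inputs so that equivalent specs hash identically regardless of
    # the order they were written in.
    spec = {
        "requirements": sorted(requirements),
        "python_version": python_version,
        "index_urls": sorted(index_urls or []),
    }
    blob = json.dumps(spec, sort_keys=True).encode("utf-8")
    # A short prefix keeps directory names readable; collisions are unlikely.
    return hashlib.sha256(blob).hexdigest()[:16]
```

With this scheme, bumping a dependency version in the DAG produces a new hash
and therefore a brand-new venv, while the old task keeps running against the
old, untouched one.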
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]