potiuk commented on PR #35097: URL: https://github.com/apache/airflow/pull/35097#issuecomment-1773761513
I like the idea of mentioning and being explicit about expensive APIs, but I think there is value in mentioning the imports too, because not many people are aware of how big an impact such imports can have.

I believe `numpy` was indeed not a good example. It does import slowly the first time, but mostly because it has to load a lot of C (.so) libraries into memory. numpy is just a thin wrapper around mostly C code, so once those .so libraries are loaded, the import is fast. Generally anything < 0.2 s seems instantaneous (and numpy imports faster than that).

I suggest adding the "expensive" operation as well, but also keeping the "slow import" example, replacing `numpy` with `pandas` to show the much bigger effect it can have. Pandas is written mostly in Python (and uses numpy under the hood, among others) and it is notorious for being slow to import, as it imports ~700 Python files (and that is after the `__pycache__` `.pyc` bytecode files have been computed and the numpy shared .so libraries loaded into memory).

Some experiments:

It takes some 0.3 - 0.4 s to import on my macOS:

```
python -c 'import pandas' 0.46s user 1.91s system 649% cpu 0.364 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 0.65s user 1.73s system 647% cpu 0.367 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 0.72s user 1.46s system 658% cpu 0.331 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 0.45s user 1.69s system 628% cpu 0.341 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 1.08s user 1.34s system 562% cpu 0.430 total
```

So ~ 0.4 s on my macOS.
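For anyone who wants to repeat this kind of measurement without relying on the shell's `time`, here is a minimal sketch. It is not how the numbers in this comment were produced, and `time_import` and `runs` are names made up for the example. It starts a fresh interpreter for every run, so the module is never already in `sys.modules`.

```python
import statistics
import subprocess
import sys
import time


def time_import(module: str, runs: int = 5) -> list[float]:
    """Wall-clock seconds for `python -c 'import <module>'`, one fresh process per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new interpreter each time, so the module is never already imported,
        # the same as repeating `time python -c 'import pandas'` in a shell.
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # "json" is only a placeholder that exists everywhere; swap in "pandas" or "numpy".
    for name in ("json",):
        samples = time_import(name)
        print(
            f"{name}: median {statistics.median(samples):.3f}s, "
            f"min {min(samples):.3f}s, max {max(samples):.3f}s"
        )
```

As with the shell runs, each sample includes interpreter start-up, so the numbers are comparable between modules rather than a pure import cost.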
And around 0.3 s in my Docker container:

```
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.323s
user 0m0.781s
sys 0m0.066s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.334s
user 0m0.780s
sys 0m0.079s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.291s
user 0m0.760s
sys 0m0.056s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.291s
user 0m0.742s
sys 0m0.075s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.284s
user 0m0.744s
sys 0m0.057s
```

Importing Pandas results in opening around 750 files:

```
strace python -c 'import pandas' 2>&1 | grep openat | wc
750 3972 96013
```

The same exercise for numpy shows that it imports much faster in the container (~0.1 s) and opens far fewer files:

```
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.105s
user 0m0.342s
sys 0m0.028s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.141s
user 0m0.571s
sys 0m0.026s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.126s
user 0m0.352s
sys 0m0.038s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.122s
user 0m0.341s
sys 0m0.044s
```

Opened files:

```
strace python -c 'import numpy' 2>&1 | grep openat | wc
291 1593 35597
```

Fragment of the strace output for pandas, showing that it imports a lot of code:
```
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/expressions.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/check.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/missing.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/dispatch.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/invalid.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/common.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/docstrings.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/mask_ops.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/arrays/__pycache__/_arrow_string_mixins.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/__pycache__/compute.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/_compute.cpython-311-aarch64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3
```
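Where `strace` is not available (macOS, for example), a rough stand-in for the file count is the number of modules an import adds to `sys.modules`. The sketch below is an approximation, not an equivalent: `openat` also counts directories, shared libraries and failed lookups, so the two numbers will differ. `modules_loaded_by` and `_PROBE` are names invented for this example.

```python
import json
import subprocess
import sys

# Runs in a fresh interpreter: snapshot sys.modules, import the target, report the delta.
# `json` is imported before the snapshot, so json and its dependencies are not counted.
_PROBE = """
import json, sys
before = set(sys.modules)
import {module}
print(json.dumps(sorted(set(sys.modules) - before)))
"""


def modules_loaded_by(module: str) -> list[str]:
    """Names of the modules that `import <module>` adds to sys.modules."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE.format(module=module)],
        check=True,
        capture_output=True,
        text=True,
    )
    # Use the last stdout line in case the imported module prints something itself.
    return json.loads(result.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    # "decimal" is only a placeholder that exists everywhere; swap in "pandas" or "numpy".
    for name in ("decimal",):
        print(f"{name}: {len(modules_loaded_by(name))} new modules")
```

For a per-module breakdown of where the time goes, `python -X importtime -c 'import pandas'` prints cumulative and self import times to stderr.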