potiuk commented on PR #35097:
URL: https://github.com/apache/airflow/pull/35097#issuecomment-1773761513

   I like the idea of mentioning and being explicit about expensive APIs, but I 
think there is value in mentioning the imports too, because not many people are 
aware of how big an impact such imports might have.
   
   I believe `numpy` was indeed not a good example. It does import slowly 
the first time, but mostly because it has to load a lot of C (.so) 
libraries into memory. numpy is just a thin wrapper around mostly C code, so 
once those .so libraries are loaded, the import is fast. Generally 
anything < 0.2 s seems instantaneous (and numpy imports faster than that).
   
   I suggest adding the "expensive" operation as well, but also keeping the "slow 
import" example, replacing `numpy` with `pandas` to show how much bigger an effect 
it might have. Pandas is written mostly in Python (and uses numpy under the 
hood, among others) and is notorious for being slow to import, as it 
imports ~700 Python files (and that is all after the `__pycache__` `.pyc` bytecode 
files have been computed and the numpy shared .so libraries loaded into memory).
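
To illustrate why the import cost matters, here is a minimal sketch of deferring a slow import into the function that needs it, so the cost is paid when the function runs rather than every time the file is parsed. The function name `summarize_records` and its body are hypothetical and only for illustration; this is one common way to do it, not necessarily the wording the docs should use.

```python
def summarize_records(records):
    """Hypothetical helper: build summary statistics for a list of dicts."""
    # Local import: pandas is loaded on the first call of this function,
    # not when the module that defines it is imported or parsed.
    import pandas as pd

    return pd.DataFrame(records).describe()
```

Defining `summarize_records` does not import pandas; only calling it does, and repeated calls reuse the already loaded module.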
   
   Some experiments:
   
   It takes some 0.3 - 0.4 s to import on my macOS machine:
   
   ```
   python -c 'import pandas'  0.46s user 1.91s system 649% cpu 0.364 total
   [jarek:~] [airflow-3.11] % time python -c 'import pandas'
   python -c 'import pandas'  0.65s user 1.73s system 647% cpu 0.367 total
   [jarek:~] [airflow-3.11] % time python -c 'import pandas'
   python -c 'import pandas'  0.72s user 1.46s system 658% cpu 0.331 total
   [jarek:~] [airflow-3.11] % time python -c 'import pandas'
   python -c 'import pandas'  0.45s user 1.69s system 628% cpu 0.341 total
   [jarek:~] [airflow-3.11] % time python -c 'import pandas'
   python -c 'import pandas'  1.08s user 1.34s system 562% cpu 0.430 total
   ```
   
   So ~0.3 - 0.4 s wall-clock time on my macOS machine.
   
   And ~0.3 s in my Docker container:
   
   ```
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
   
   real 0m0.323s
   user 0m0.781s
   sys  0m0.066s
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
   
   real 0m0.334s
   user 0m0.780s
   sys  0m0.079s
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
   
   real 0m0.291s
   user 0m0.760s
   sys  0m0.056s
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
   
   real 0m0.291s
   user 0m0.742s
   sys  0m0.075s
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
   
   real 0m0.284s
   user 0m0.744s
   sys  0m0.057s
   ```
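
For anyone who wants to repeat the timing without a shell `time` builtin, the same measurement can be sketched in Python. This is an assumption-laden equivalent, not what I ran above: the helper name `time_import` is made up, and like `time python -c '...'` it includes interpreter startup in each number.

```python
import subprocess
import sys
import time


def time_import(module_name: str, runs: int = 5) -> list[float]:
    """Return wall-clock seconds for `import module_name`, one entry per run.

    Each run starts a fresh interpreter so that modules cached in the
    current process do not hide the cost.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `time_import("pandas")` and `time_import("numpy")` in the same environment should give numbers comparable to the `real` values shown here, provided both packages are installed.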
   
   Importing pandas results in around 750 `openat` calls (files and directories opened or probed):
   
   ```
   strace python -c 'import pandas' 2>&1 | grep openat | wc
       750    3972   96013
   ```
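
`strace` is Linux-only, so as a rough cross-platform proxy one can count how many modules an import adds to `sys.modules`. This is a sketch under that assumption: it counts modules rather than `openat` calls, so the numbers will not match the strace output, and the helper name `count_new_modules` is made up.

```python
import subprocess
import sys

# Child program: snapshot sys.modules, import the target module,
# then print how many module entries were added by that import.
_CHILD = (
    "import sys, importlib; "
    "before = set(sys.modules); "
    "importlib.import_module(sys.argv[1]); "
    "print(len(set(sys.modules) - before))"
)


def count_new_modules(module_name: str) -> int:
    """Count modules newly loaded by importing module_name in a fresh interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", _CHILD, module_name],
        check=True,
        capture_output=True,
        text=True,
    )
    # Use the last line in case the imported module prints something itself.
    return int(result.stdout.strip().splitlines()[-1])
```

Comparing `count_new_modules("pandas")` with `count_new_modules("numpy")` gives a quick feel for how much more Python code pandas pulls in.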
   
   The same exercise for numpy shows that it imports much faster in the container 
(~0.1 s) and opens far fewer files:
   
   ```
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
   
   real 0m0.105s
   user 0m0.342s
   sys  0m0.028s
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
   
   real 0m0.141s
   user 0m0.571s
   sys  0m0.026s
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
   
   real 0m0.126s
   user 0m0.352s
   sys  0m0.038s
   root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
   
   real 0m0.122s
   user 0m0.341s
   sys  0m0.044s
   ```
   
   Opened files:
   
   ```
   strace python -c 'import numpy' 2>&1 | grep openat | wc
       291    1593   35597
   ```
   
   
   A fragment of the strace output for pandas, showing that it imports a lot of code:
   
   ```
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/expressions.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/check.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/missing.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/dispatch.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/invalid.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/common.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/docstrings.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/mask_ops.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/arrays/__pycache__/_arrow_string_mixins.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/__pycache__/compute.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
   openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/_compute.cpython-311-aarch64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3
   ```

