[ https://issues.apache.org/jira/browse/FLINK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yik San Chan updated FLINK-22519:
---------------------------------
Description:

[python-archives|https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/python_config.html#python-archives] currently only takes zip.

In our use case, we want to package the whole conda environment into python-archives, similar to how the [docs|https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/faq.html#cluster] suggest using venv (Python virtual environment). As we use PyFlink for ML, there are inevitably a few large dependencies (tensorflow, torch, pyarrow) as well as many small ones.

This pattern is not friendly to zip. According to this [post|https://superuser.com/a/173825], zip compresses each file independently, so it does not perform well on a large number of small files. tar, on the other hand, simply bundles all files into a tarball, and applying gzip to the whole tarball compresses redundancy across files, yielding a smaller result. This may explain why the official packaging tool, [conda pack|https://conda.github.io/conda-pack/], produces tar.gz by default, even though zip is available as an option.

To further prove the idea, I ran an experiment on my laptop (macOS 10.15.7) with a conda env:
# Create an environment.yaml as well as a requirements.txt
# Run `conda env create -f environment.yaml` to create the conda env
# Run `conda pack` to produce a tar.gz
# Run `conda pack -o featflow-ml-env.zip` to produce a zip

More details:

environment.yaml
{code:yaml}
name: featflow-ml-env
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pytorch=1.8.0
  - scikit-learn=0.23.2
  - pip
  - pip:
    - -r file:requirements.txt
{code}

requirements.txt
{code}
apache-flink==1.12.0
deepctr-torch==0.2.6
black==20.8b1
confluent-kafka==1.6.0
pytest==6.2.2
testcontainers==3.4.0
kafka-python==2.0.2
{code}

End result: the tar.gz is 854M, the zip is 1.6G.

So, long story short, python-archives only supports zip, while zip is not a good choice for packaging ML libs. Let's change this by adding tar.gz support to python-archives.

The change will happen in this way: in ProcessPythonEnvironmentManager.java, check the file suffix; if it is tar.gz, unarchive it using a gzip decompressor.
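The per-file vs. whole-archive compression effect described above can be illustrated with a small stdlib-only Python sketch (hypothetical file names and contents, not the actual conda environment): it packs many small, similar files both ways and compares sizes. Because gzip compresses the whole tarball as one stream, redundancy across files is squeezed out, while zip deflates each entry independently and pays per-entry header overhead.

```python
import io
import tarfile
import zipfile

# Many small, similar files -- roughly the shape of a site-packages tree.
files = {f"pkg/module_{i}.py": b"import os\nVALUE = 42\n" * 5 for i in range(500)}

# zip: each entry is deflated independently, plus per-entry header overhead.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# tar.gz: bundle first, then gzip the whole tarball as one stream,
# so redundancy *across* files is compressed away.
tgz_buf = io.BytesIO()
with tarfile.open(fileobj=tgz_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

print(f"zip:    {len(zip_buf.getvalue())} bytes")
print(f"tar.gz: {len(tgz_buf.getvalue())} bytes")
```

On redundant inputs like these, the tar.gz comes out far smaller than the zip, mirroring the 854M vs. 1.6G gap seen in the conda experiment.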
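The proposed suffix dispatch can be sketched in Python with the standard library (the real change would live in ProcessPythonEnvironmentManager.java on the Java side; this is only a sketch of the logic, and `extract_archive` is a hypothetical helper name, not an existing Flink function):

```python
import io
import tarfile
import tempfile
import zipfile
from pathlib import Path

def extract_archive(archive_path: str, target_dir: str) -> None:
    """Hypothetical helper: choose the unarchiver from the file suffix."""
    if archive_path.endswith((".tar.gz", ".tgz")):
        # Mode "r:gz" chains the gzip decompressor with the tar unbundler.
        with tarfile.open(archive_path, mode="r:gz") as tf:
            tf.extractall(target_dir)
    elif archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(target_dir)
    else:
        raise ValueError(f"unsupported archive suffix: {archive_path}")

# Demo: round-trip a one-file tar.gz through the helper.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "env.tar.gz"
    payload = b"#!/bin/sh\necho hello\n"
    with tarfile.open(archive, mode="w:gz") as tf:
        info = tarfile.TarInfo(name="bin/activate")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    out = Path(tmp) / "extracted"
    extract_archive(str(archive), str(out))
    restored = (out / "bin" / "activate").read_bytes()
    print(restored == payload)  # round-trip intact
```

In Java, the equivalent would likely lean on a library such as Apache Commons Compress rather than hand-rolled stream handling.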
> Have python-archives also take tar.gz
> -------------------------------------
>
> Key: FLINK-22519
> URL: https://issues.apache.org/jira/browse/FLINK-22519
> Project: Flink
> Issue Type: New Feature
> Components: API / Python
> Reporter: Yik San Chan
> Priority: Major
>
--
This message was sent by Atlassian Jira (v8.3.4#803005)