[ 
https://issues.apache.org/jira/browse/FLINK-22519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yik San Chan updated FLINK-22519:
---------------------------------
    Description: 
[python-archives|https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/python_config.html#python-archives]
 currently only takes zip.

In our use case, we want to package the whole conda environment into 
python-archives, similar to what the 
[docs|https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/faq.html#cluster]
 suggest for venv (Python virtual environments). As we use PyFlink for ML, 
there are inevitably a few large dependencies (tensorflow, torch, pyarrow), 
as well as a lot of small ones.

This pattern is not friendly to zip. According to this 
[post|https://superuser.com/a/173825], zip compresses each file independently, 
so it does not perform well when dealing with a lot of small files. Tar, on the 
other hand, simply bundles all files into a tarball, and gzip can then be applied 
to the whole tarball to achieve a smaller size. This may explain why the official 
packaging tool, [conda pack|https://conda.github.io/conda-pack/], produces tar.gz 
by default, even though zip is available as an option.
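
To see the effect in isolation, here is a small, self-contained comparison that I would expect to show the same pattern. It is purely illustrative (not Flink code) and assumes Apache Commons Compress is on the classpath: the same set of tiny files is written once into a zip, where each entry is deflated on its own, and once into a gzip-compressed tar, where one compression stream covers the whole bundle.

```
// Purely illustrative comparison (not Flink code); assumes Apache Commons Compress
// (org.apache.commons:commons-compress) is on the classpath.
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipVsTarGz {
    public static void main(String[] args) throws Exception {
        byte[] content = "def f(x):\n    return x + 1\n".getBytes(StandardCharsets.UTF_8);

        // zip: every entry is deflated independently, so redundancy across files is lost
        ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(zipBytes)) {
            for (int i = 0; i < 10_000; i++) {
                zip.putNextEntry(new ZipEntry("pkg/module_" + i + ".py"));
                zip.write(content);
                zip.closeEntry();
            }
        }

        // tar.gz: one gzip stream over the whole tarball, so repeated content compresses together
        ByteArrayOutputStream tgzBytes = new ByteArrayOutputStream();
        try (TarArchiveOutputStream tar =
                new TarArchiveOutputStream(new GzipCompressorOutputStream(tgzBytes))) {
            for (int i = 0; i < 10_000; i++) {
                TarArchiveEntry entry = new TarArchiveEntry("pkg/module_" + i + ".py");
                entry.setSize(content.length);
                tar.putArchiveEntry(entry);
                tar.write(content);
                tar.closeArchiveEntry();
            }
        }

        System.out.println("zip:    " + zipBytes.size() + " bytes");
        System.out.println("tar.gz: " + tgzBytes.size() + " bytes");
    }
}
```

On highly repetitive content like thousands of small .py files, the tar.gz side should come out much smaller, which matches the conda pack numbers below.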

To further prove the idea, I ran an experiment with a conda env on my laptop 
(macOS 10.15.7):
 # Create an environment.yaml as well as a requirements.txt
 # Run `conda env create -f environment.yaml` to create the conda env
 # Run `conda pack` to produce a tar.gz
 # Run `conda pack -o featflow-ml-env.zip` to produce a zip

More details

```
# environment.yaml
name: featflow-ml-env
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- python=3.7
- pytorch=1.8.0
- scikit-learn=0.23.2
- pip
- pip:
  - -r file:requirements.txt
```
```
#requirements.txt
apache-flink==1.12.0
deepctr-torch==0.2.6
black==20.8b1
confluent-kafka==1.6.0
pytest==6.2.2
testcontainers==3.4.0
kafka-python==2.0.2
```
 
End result: the tar.gz is 854M, the zip is 1.6G

So, long story short: python-archives only supports zip, and zip is not a good 
choice for packaging ML libraries. Let's change this by adding tar.gz support to 
python-archives.

The change would work like this: in ProcessPythonEnvironmentManager.java, check 
the file suffix; if it is tar.gz, unarchive it using a gzip decompressor.
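
A minimal sketch of that idea, assuming Apache Commons Compress is available on the classpath; the class and method names below are illustrative only, not the actual ProcessPythonEnvironmentManager API:

```
// Illustrative sketch only; class and method names are hypothetical, not the actual
// ProcessPythonEnvironmentManager API. Assumes Apache Commons Compress on the classpath.
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PythonArchiveExtractor {

    /** Dispatches on the archive suffix; zip keeps its existing code path. */
    public static void extract(String archivePath, String targetDir) throws Exception {
        if (archivePath.endsWith(".tar.gz") || archivePath.endsWith(".tgz")) {
            extractTarGz(archivePath, targetDir);
        } else {
            extractZip(archivePath, targetDir); // existing zip handling stays as-is
        }
    }

    private static void extractTarGz(String archivePath, String targetDir) throws Exception {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(archivePath)));
                TarArchiveInputStream tar =
                        new TarArchiveInputStream(new GzipCompressorInputStream(in))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                Path target = Paths.get(targetDir, entry.getName());
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                    continue;
                }
                Path parent = target.getParent();
                if (parent != null) {
                    Files.createDirectories(parent);
                }
                Files.copy(tar, target);
                // keep the executable bit, e.g. for bin/python inside the packed conda env
                if ((entry.getMode() & 0100) != 0) {
                    target.toFile().setExecutable(true);
                }
            }
        }
    }

    private static void extractZip(String archivePath, String targetDir) {
        // placeholder for the current zip-based extraction
    }
}
```

Preserving file permissions is worth calling out: a packed conda env contains executables such as bin/python that the Python worker has to run, and a plain file copy alone would drop the executable bit.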

> Have python-archives also take tar.gz
> -------------------------------------
>
>                 Key: FLINK-22519
>                 URL: https://issues.apache.org/jira/browse/FLINK-22519
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / Python
>            Reporter: Yik San Chan
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)