Hi,
I have a PyFlink job that consists of:
- Multiple Python files.
- Multiple 3rdparty Python dependencies, specified in a
`requirements.txt` file.
- A few Java dependencies, mainly for external connectors.
- An overall job config YAML file.
Here's a simplified structure of the code layout:

flink/
├── deps
│   ├── jar
│   │   ├── flink-connector-kafka_2.11-1.12.2.jar
│   │   └── kafka-clients-2.4.1.jar
│   └── pip
│       └── requirements.txt
├── conf
│   └── job.yaml
└── job
    ├── some_file_x.py
    ├── some_file_y.py
    └── main.py
I'm able to execute this job locally by invoking something like:
python main.py --config <path_to_job_yaml>
I'm loading the jars inside the Python code, using env.add_jars(...).
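For context, the jar loading currently looks roughly like this (a minimal
sketch; the file:// paths are placeholders for my local deps/jar directory):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Register the connector jars with the job; paths below are local placeholders.
env.add_jars(
    "file:///path/to/flink/deps/jar/flink-connector-kafka_2.11-1.12.2.jar",
    "file:///path/to/flink/deps/jar/kafka-clients-2.4.1.jar",
)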
Now, the next step is to submit this job to a Flink cluster running on K8s.
I'm looking for the best practices people tend to follow for packaging and
specifying dependencies. As per the documentation here [1], the various
Python files, including the conf YAML, can be specified using the --pyFiles
option, and Java dependencies can be specified using the --jarfile option.
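Based on my reading of [1], I'd expect the submission to look something like
this (just a sketch; paths are relative to my layout above, the Kubernetes
target options are omitted, and I'm assuming --jarfile takes a single jar):

./bin/flink run \
    --python job/main.py \
    --pyFiles job/,conf/job.yaml \
    --jarfile deps/jar/flink-connector-kafka_2.11-1.12.2.jar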
So, how can I specify third-party Python package dependencies? According to
another piece of documentation here [2], I should be able to specify the
requirements.txt directly inside the code and submit it via the --pyFiles
option. Is that right?
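In other words, something along these lines inside the job code (again just
a sketch based on my reading of [2]; the relative path to requirements.txt
is an assumption about how the shipped files end up on the cluster):

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
# Declare the pip dependencies so they get installed for the Python workers.
t_env.set_python_requirements("deps/pip/requirements.txt")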
Are there any other best practices folks use to package/submit jobs?
Thanks,
Sumeet
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/cli.html#submitting-pyflink-jobs
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/table-api-users-guide/dependency_management.html#python-dependency-in-python-program