Hello,

Why not just add the init container through the pod mutation hook? The
container can use a google/cloud-sdk image and run the gsutil -m
rsync ... command (the -m flag enables parallel transfers). Then we
would not have to write any code, and the solution stays close to the
Kubernetes philosophy, where one container contains only one tool.
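
Roughly something like the sketch below in airflow_local_settings.py
(just an illustration: the bucket path and volume name are placeholders,
and the exact pod object passed to pod_mutation_hook depends on the
Airflow version; here I assume the kubernetes client V1Pod models):

# Sketch only; bucket path and volume name are placeholders.
from kubernetes.client import models as k8s

DAGS_BUCKET = "gs://my-dags-bucket/dags"  # placeholder bucket path
DAGS_VOLUME = "airflow-dags"              # assumed name of the dags volume


def pod_mutation_hook(pod):
    # Add an init container that syncs dags from GCS into the shared
    # dags volume before the worker container starts.
    sync = k8s.V1Container(
        name="dags-sync",
        image="google/cloud-sdk:slim",
        command=["gsutil", "-m", "rsync", "-r", DAGS_BUCKET, "/dags"],
        volume_mounts=[
            k8s.V1VolumeMount(name=DAGS_VOLUME, mount_path="/dags"),
        ],
    )
    pod.spec.init_containers = (pod.spec.init_containers or []) + [sync]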

Best,

On Fri, Oct 18, 2019 at 8:21 PM Maulik Soneji <maulik.son...@gojek.com> wrote:
>
> *[Proposal]*
> Create a new *syncer* command to sync dags from any remote folder, which
> will be used as the initContainer command in the KubernetesExecutor.
> It would be just like the initdb command, but it would copy dags from the
> remote folder before the dag is run.
>
> *[Problem]*
> Currently, only two ways of mounting dags in the pod created by the
> KubernetesExecutor are supported: *GCS* and PersistentVolumeClaim (*PVC*).
>
> When using the PVC option, it becomes difficult to update the dags in the
> persistent volume correctly.
> Normally, we run an rsync/cp command to copy the dags from a remote folder
> to the volume, but this results in errors when a file is read while it is
> still being written.
> We hit this issue in our production Airflow deployment. Link to the
> mailing list discussion:
> https://mail-archives.apache.org/mod_mbox/airflow-dev/201908.mbox/browser
>
> Reading dags from sources like S3/GCS is not natively supported in
> Airflow, so one has to write custom code to pull data from the remote
> folder.
>
> Since we cannot know, from within the pod, when the dags will be updated,
> having an initContainer copy the dags from the remote location when the
> pod is instantiated becomes the better choice.
>
> *[Implementation]*
> We can create a new command called *syncer* in order to sync dags from a
> remote location.
> We will add two settings to airflow.cfg: *dags_syncer_conn_id*, the
> connection used to reach the remote storage, and *dags_syncer_folder*,
> which denotes the remote location from which the files can be copied.
>
> In the initContainer we copy all contents from this remote folder into
> the pod's dags folder.
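>
> A very rough sketch of the copy step (GCS only here; the hook usage and
> the [core] config section are only illustrative, and the real command
> would support S3 etc. based on the connection type):
>
> import os
>
> from airflow.configuration import conf
> from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
>
>
> def sync_dags(dags_folder):
>     # dags_syncer_conn_id / dags_syncer_folder as proposed above; the
>     # [core] section is just an assumption for this sketch.
>     conn_id = conf.get("core", "dags_syncer_conn_id")
>     remote = conf.get("core", "dags_syncer_folder")  # e.g. gs://bucket/dags
>     bucket, prefix = remote.replace("gs://", "").split("/", 1)
>
>     hook = GoogleCloudStorageHook(google_cloud_storage_conn_id=conn_id)
>     for obj in hook.list(bucket, prefix=prefix):
>         if obj.endswith("/"):  # skip directory placeholder objects
>             continue
>         target = os.path.join(dags_folder, os.path.relpath(obj, prefix))
>         if not os.path.isdir(os.path.dirname(target)):
>             os.makedirs(os.path.dirname(target))
>         hook.download(bucket, obj, filename=target)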
>
> *[Benefit]*
> a. In our pipelines for uploading dags to Airflow, we can simply write to
> the remote folder and rest assured that the dags will be picked up by
> Airflow.
>
> b. We can provide an S3 or GCS folder as the source of dags and just need
> to publish to this folder.
> Airflow itself will take care of syncing dags from this remote folder, so
> there is native support for pulling dags from these sources.
>
> Please share your views/comments.
>
> Regards,
> Maulik
