Hi Kamil,

Thank you very much for this suggestion. I will certainly try this out.

There is one more aspect to this: the airflow deployments of the webserver
and scheduler also need to be updated with the latest dags. Any thoughts on
how we can support updating the dags there? Both are long-running processes
and have no knowledge of when the dags will be updated.

With the syncer command, we could run another container in the webserver
and scheduler deployments that syncs the dags every 5 minutes or so. The
syncer command should have rsync-style functionality so that the dags are
copied only when they have changed at the source.
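
Something along these lines could work as the sync step (a minimal Python
sketch, not existing Airflow code; it only copies files that are missing or
newer at the source, and the paths are placeholders):

    # Minimal sketch of rsync-style sync logic (illustrative only).
    # A file is copied only when it is missing or newer at the source.
    import os
    import shutil
    import time

    def sync_dags(source_dir, dest_dir):
        for root, _, files in os.walk(source_dir):
            for name in files:
                src = os.path.join(root, name)
                dst = os.path.join(dest_dir, os.path.relpath(src, source_dir))
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                if (not os.path.exists(dst)
                        or os.path.getmtime(src) > os.path.getmtime(dst)):
                    # copy2 preserves the mtime, so unchanged files are
                    # skipped on the next run
                    shutil.copy2(src, dst)

    if __name__ == "__main__":
        while True:
            sync_dags("/remote-dags", "/opt/airflow/dags")  # placeholder paths
            time.sleep(300)  # roughly every 5 minutes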

Regards,
Maulik

On Sat, Oct 19, 2019 at 12:57 AM Kamil Breguła <kamil.breg...@polidea.com>
wrote:

> Hello,
>
> Why not just add an init container via the pod mutation hook? The
> container can use the google/cloud-sdk image and run a gsutil -m rsync
> ... command. Then we will not have to write any code, and it is a
> solution in line with the Kubernetes philosophy, where one container
> contains only one tool.
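>
> For illustration, a rough sketch of such a hook in airflow_local_settings.py
> (this assumes the mutation hook receives a kubernetes.client V1Pod, which
> depends on the Airflow version; the bucket and volume names are placeholders):
>
>     # Sketch only: add a dag-sync init container through the pod mutation hook.
>     from kubernetes.client import models as k8s
>
>     def pod_mutation_hook(pod):
>         sync_dags = k8s.V1Container(
>             name="sync-dags",
>             image="google/cloud-sdk:alpine",
>             # -m enables parallel transfers; the bucket path is a placeholder
>             command=["gsutil", "-m", "rsync", "-r",
>                      "gs://example-bucket/dags", "/dags"],
>             volume_mounts=[
>                 k8s.V1VolumeMount(name="airflow-dags", mount_path="/dags"),
>             ],
>         )
>         init_containers = pod.spec.init_containers or []
>         init_containers.append(sync_dags)
>         pod.spec.init_containers = init_containers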
>
> Best,
>
> On Fri, Oct 18, 2019 at 8:21 PM Maulik Soneji <maulik.son...@gojek.com>
> wrote:
> >
> > *[Proposal]*
> > Create a new *syncer* command to sync dags from any remote folder, to be
> > used as the initContainer command in the KubernetesExecutor.
> > It is just like the initdb command, but it copies dags from the remote
> > folder before the dag is run.
> >
> > *[Problem]*
> > Currently, only two ways are supported to mount dags in the pod created
> > by the kubernetes executor: *GCS* and PersistentVolumeClaim (*PVC*).
> >
> > When using the PVC option, it becomes difficult to update the dags in the
> > persistent volume correctly.
> > Normally, we run an rsync/cp command to copy the dags from a remote folder
> > to the volume, but this results in errors from reading and writing the
> > same file at the same time.
> > We encountered this issue in our production airflow, where files were read
> > and written to at the same time. Link to the mailing list discussion:
> > https://mail-archives.apache.org/mod_mbox/airflow-dev/201908.mbox/browser
> >
> > Reading from sources like S3/GCS is not natively supported in airflow,
> > so one has to write custom code to pull the dags from the remote folder.
> >
> > Since we cannot know within the pod when the dags will be updated, having
> > an initContainer copy the dags from the remote location when the pod is
> > created becomes the better choice.
> >
> > *[Implementation]*
> > We can create a new command called *syncer* in order to sync dags from a
> > remote location.
> > We will add *dags_syncer_conn_id* to airflow.cfg to identify the connection
> > to use, and *dags_syncer_folder* to denote the remote location from which
> > the files can be copied.
> >
> > In the initContainer, we copy all contents from the *dags_syncer_folder*
> > into the pod's dags folder.
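> >
> > As a rough sketch of what the syncer body could look like (GCS is used
> > here only as an example backend; the dags_syncer_* options are the ones
> > proposed above and do not exist yet):
> >
> >     # Sketch only: sync dags from a GCS "folder" into the local dags_folder.
> >     import os
> >
> >     from airflow.configuration import conf
> >     from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
> >
> >     def sync_dags_from_remote():
> >         conn_id = conf.get("core", "dags_syncer_conn_id")  # proposed option
> >         remote = conf.get("core", "dags_syncer_folder")    # e.g. gs://bucket/dags
> >         dags_folder = conf.get("core", "dags_folder")
> >
> >         bucket, prefix = remote.replace("gs://", "").split("/", 1)
> >         hook = GoogleCloudStorageHook(google_cloud_storage_conn_id=conn_id)
> >         for obj in hook.list(bucket, prefix=prefix):
> >             if obj.endswith("/"):
> >                 continue  # skip "directory" placeholder objects
> >             local_path = os.path.join(dags_folder, os.path.relpath(obj, prefix))
> >             os.makedirs(os.path.dirname(local_path), exist_ok=True)
> >             hook.download(bucket, obj, local_path)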
> >
> > *[Benefit]*
> > a. In our pipelines for uploading dags to airflow, we can simply write to
> > the remote folder and rest assured that the dags will be picked up by
> > airflow.
> >
> > b. We can provide an S3 or GCS folder as the source of dags and only need
> > to publish to this folder.
> > Airflow itself will take care of syncing dags from this remote folder, so
> > there is native support for pulling dags from these sources.
> >
> > Please share your views/comments.
> >
> > Regards,
> > Maulik
>
