Hello,

Why not just add the init container with a pod mutation? The container can use a google/cloud-sdk image and run a gsutil -m rsync ... command. Then we would not have to write any code, and the solution would fit the Kubernetes way of thinking, where one container contains only one tool.
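For illustration, here is a rough sketch of what such a mutation could look like in airflow_local_settings.py. It assumes pod_mutation_hook is handed a kubernetes.client V1Pod-style object (the exact pod type differs between Airflow versions), and the bucket name, volume name and mount path are only placeholders:

from kubernetes.client import models as k8s


def pod_mutation_hook(pod):
    """Inject an init container that rsyncs dags from GCS into the dags volume."""
    sync_dags = k8s.V1Container(
        name="sync-dags",
        image="google/cloud-sdk:slim",
        # -m parallelises the copy; bucket path and target dir are placeholders
        command=["gsutil", "-m", "rsync", "-r", "-d",
                 "gs://my-dag-bucket/dags", "/opt/airflow/dags"],
        volume_mounts=[
            k8s.V1VolumeMount(name="airflow-dags", mount_path="/opt/airflow/dags"),
        ],
    )
    # Newer Airflow passes a V1Pod here; 1.10.x passes its own Pod wrapper,
    # so the attribute access below would need adjusting there.
    pod.spec.init_containers = (pod.spec.init_containers or []) + [sync_dags]

Since init containers must finish before the worker container starts, the dags folder is already populated by the time the task runs.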
Best,

On Fri, Oct 18, 2019 at 8:21 PM Maulik Soneji <maulik.son...@gojek.com> wrote:
>
> *[Proposal]*
> Create a new *syncer* command to sync dags from any remote folder, which
> will be used as the initContainer command in the KubernetesExecutor.
> It is just like the initdb command, but it copies dags from the remote
> folder before running the dag.
>
> *[Problem]*
> Currently, only two ways of mounting dags in the pod created by the
> kubernetes executor are supported: *GCS* and PersistentVolumeClaim (*PVC*).
>
> When using the PVC option, it becomes difficult to update the dags in the
> persistent volume correctly.
> Normally, we run an rsync/cp command to copy the dags from a remote folder
> to the volume, but this results in errors from reading and writing the
> same file at the same time.
> We encountered this issue in our production airflow, where files were read
> and written to at the same time. Link to the mailing list discussion:
> https://mail-archives.apache.org/mod_mbox/airflow-dev/201908.mbox/browser
>
> Reading dags from sources like S3/GCS is not natively supported in
> airflow, so one has to write custom code to pull them from the remote
> folder.
>
> Since we cannot know within the pod when the dags will be updated, having
> an initContainer copy the dags from the remote location when the pod is
> instantiated becomes the better choice.
>
> *[Implementation]*
> We can create a new command called *syncer* in order to sync dags from a
> remote location.
> We will add *dags_syncer_conn_id* to airflow.cfg, and *dags_syncer_folder*
> will denote the remote location from which the files are copied.
>
> In the initContainer, we copy all contents from the remote folder into the
> local dags folder.
>
> *[Benefit]*
> a. In our pipelines that upload dags to airflow, we can simply write to
> the remote folder and rest assured that the dags will be picked up by
> airflow.
>
> b. We can provide an S3 or GCS folder as the source of dags and only need
> to publish to this folder.
> Airflow itself will take care of syncing dags from this remote folder, so
> pulling dags from these sources is natively supported.
>
> Please share your views/comments.
>
> Regards,
> Maulik
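To make the [Implementation] section above a bit more concrete, here is a rough sketch of what the syncer command could do for a GCS-backed dags_syncer_folder. This is only an illustration, not a proposed implementation: the config section, the sync_dags function and the CLI wiring are hypothetical, and only the dags_syncer_conn_id / dags_syncer_folder keys come from the proposal.

import os

from airflow.configuration import conf
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook


def sync_dags():
    """Copy every object under dags_syncer_folder into the local dags folder."""
    conn_id = conf.get("kubernetes", "dags_syncer_conn_id")  # config section is an assumption
    remote = conf.get("kubernetes", "dags_syncer_folder")    # e.g. gs://bucket/dags
    dags_folder = conf.get("core", "dags_folder")

    bucket, _, prefix = remote.replace("gs://", "").partition("/")
    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id=conn_id)

    for obj in hook.list(bucket, prefix=prefix):
        if obj.endswith("/"):  # skip directory placeholder objects
            continue
        local_path = os.path.join(dags_folder, os.path.relpath(obj, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        hook.download(bucket, obj, filename=local_path)

Something in this shape would run as the initContainer entrypoint, and an S3 variant could follow the same structure using S3Hook.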