Hello,

Cloud Composer stores the source code for your workflows (DAGs) and
their dependencies in specific folders in Cloud Storage and uses Cloud
Storage FUSE to map those folders to the Airflow instances in your
Cloud Composer environment.

More info:
https://cloud.google.com/composer/docs/concepts/overview
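
For reference, getting new dags into such an environment only requires copying
them into the dags/ folder of the environment's bucket. A minimal sketch with
the google-cloud-storage client (the bucket name and file paths below are
placeholders, not anything Composer defines):

    # upload_dag.py - sketch: push a DAG file into the dags/ folder of a
    # Composer environment's bucket (bucket name and paths are placeholders).
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("us-central1-example-environment-bucket")
    blob = bucket.blob("dags/my_dag.py")  # Composer watches the dags/ prefix
    blob.upload_from_filename("my_dag.py")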

Best Regards,

On Fri, Oct 18, 2019 at 10:26 PM Maulik Soneji <maulik.son...@gojek.com> wrote:
>
> Hi Kamil,
>
> Thank you very much for this suggestion. I will certainly try this out.
>
> There is one more aspect to this: the webserver and scheduler deployments
> also need to be updated with the latest dags. Any thoughts on how we can
> support updating the dags there? Both are long-running processes and have
> no knowledge of when the dags are updated.
>
> With the syncer command, we can have another container in the webserver and
> scheduler pods that syncs the dags every 5 minutes or so. The syncer command
> should have rsync-like functionality so that the dags are copied only when
> they have changed at the source.
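>
> A rough sketch of the loop such a sidecar could run (assuming the sidecar
> image ships gsutil; the bucket, local path and interval below are
> placeholders):
>
>     # dag_sync_sidecar.py - rough sketch of a sidecar sync loop
>     import subprocess
>     import time
>
>     REMOTE = "gs://example-dags-bucket/dags"
>     LOCAL = "/opt/airflow/dags"
>     INTERVAL = 300  # seconds
>
>     while True:
>         # gsutil rsync only transfers objects that differ from the destination
>         subprocess.run(["gsutil", "-m", "rsync", "-r", REMOTE, LOCAL], check=False)
>         time.sleep(INTERVAL)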
>
> Regards,
> Maulik
>
> On Sat, Oct 19, 2019 at 12:57 AM Kamil Breguła <kamil.breg...@polidea.com>
> wrote:
>
> > Hello,
> >
> > Why not just add an init container via pod mutation? The container can
> > use the google/cloud-sdk image and run a gsutil -m rsync ... command.
> > Then we will not have to write any code, and the solution follows the
> > Kubernetes philosophy that each container contains only one tool.
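> >
> > A rough sketch of what that mutation could look like (this assumes the pod
> > object passed to pod_mutation_hook is a kubernetes.client V1Pod; the bucket
> > name and mount path below are placeholders):
> >
> >     # airflow_local_settings.py - sketch: add a dag-sync init container
> >     from kubernetes.client import models as k8s
> >
> >     def pod_mutation_hook(pod):
> >         # Shared emptyDir volume between the init container and the worker.
> >         dags_volume = k8s.V1Volume(name="dags",
> >                                    empty_dir=k8s.V1EmptyDirVolumeSource())
> >         dags_mount = k8s.V1VolumeMount(name="dags", mount_path="/opt/airflow/dags")
> >
> >         sync = k8s.V1Container(
> >             name="dag-sync",
> >             image="google/cloud-sdk",
> >             command=["gsutil", "-m", "rsync", "-r",
> >                      "gs://example-dags-bucket/dags", "/opt/airflow/dags"],
> >             volume_mounts=[dags_mount],
> >         )
> >
> >         pod.spec.volumes = (pod.spec.volumes or []) + [dags_volume]
> >         pod.spec.init_containers = (pod.spec.init_containers or []) + [sync]
> >         for container in pod.spec.containers:
> >             container.volume_mounts = (container.volume_mounts or []) + [dags_mount]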
> >
> > Best,
> >
> > On Fri, Oct 18, 2019 at 8:21 PM Maulik Soneji <maulik.son...@gojek.com>
> > wrote:
> > >
> > > *[Proposal]*
> > > Create a new *syncer* command to sync dags from any remote folder, which
> > > will be used as the initContainer command in the KubernetesExecutor.
> > > It is just like the initdb command, but it copies dags from the remote
> > > folder before the dag is run.
> > >
> > > *[Problem]*
> > > Currently, only two ways of mounting dags into the pods created by the
> > > KubernetesExecutor are supported: *GCS* and a
> > > PersistentVolumeClaim (*PVC*).
> > >
> > > When using the PVC option, it is difficult to update the dags in the
> > > persistent volume correctly.
> > > Normally, we run an rsync/cp command to copy the dags from a remote folder
> > > to the volume, but this can result in a file being read and written at the
> > > same time.
> > > We ran into this issue in our production airflow, where files were read
> > > and written to simultaneously. Link to the mailing list discussion:
> > >
> > > https://mail-archives.apache.org/mod_mbox/airflow-dev/201908.mbox/browser
> > >
> > > Reading dags from sources like S3/GCS is not natively supported in
> > > airflow, so one has to write custom code to pull them from the remote
> > > folder.
> > >
> > > Since we cannot know within the pod when the dags will be updated, having
> > > an initContainer copy the dags from the remote location when the pod is
> > > instantiated becomes the better choice.
> > >
> > > *[Implementation]*
> > > We can create a new command called syncer in order to sync dags from a
> > > remote location.
> > > We will add two settings to airflow.cfg: *dags_syncer_conn_id*, the
> > > connection used to access the remote storage, and *dags_syncer_folder*,
> > > the remote location from which the files are copied.
> > >
> > > In the initContainer, we copy all contents from the remote folder into the
> > > local dags folder (rough sketch below).
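> > >
> > > As an illustration, the copy step for a GCS-backed folder could look
> > > roughly like this (the import assumes the Google provider's GCSHook;
> > > bucket, prefix and local path are placeholders):
> > >
> > >     # syncer.py - illustrative sketch of the copy step for a GCS source
> > >     import os
> > >     from airflow.providers.google.cloud.hooks.gcs import GCSHook
> > >
> > >     def sync_dags(conn_id="google_cloud_default",
> > >                   bucket="example-dags-bucket", prefix="dags/",
> > >                   local_dir="/opt/airflow/dags"):
> > >         hook = GCSHook(gcp_conn_id=conn_id)
> > >         for object_name in hook.list(bucket, prefix=prefix):
> > >             if object_name.endswith("/"):
> > >                 continue  # skip "directory" placeholder objects
> > >             target = os.path.join(local_dir, os.path.relpath(object_name, prefix))
> > >             os.makedirs(os.path.dirname(target), exist_ok=True)
> > >             hook.download(bucket_name=bucket, object_name=object_name,
> > >                           filename=target)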
> > >
> > > *[Benefit]*
> > > a. In our pipelines for uploading dags to airflow, we can simply write to
> > > the remote folder and rest assured that the dags will be picked up by
> > > airflow.
> > >
> > > b. We can provide an S3 or GCS folder as the source of dags and only need
> > > to publish to this folder.
> > > Airflow itself will take care of syncing dags from this remote folder, so
> > > pulling dags from these sources is natively supported.
> > >
> > > Please share your views/comments.
> > >
> > > Regards,
> > > Maulik
> >
