> For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Something sort of like Heroku build packs?

-a

On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:

> This reminds me of the "DagFetcher" idea. Basically a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on S3", "pex on GFS", "docker on Artifactory" or whatever makes sense to your organization. In the proposal I wrote about using a universal URI scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest
>
> One challenge is around *not* serializing Airflow-specific code in the artifact/docker, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and the airflow package at run time.
>
> Max
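(Purely as an illustration of that idea, here is a minimal sketch of what such a pluggable fetcher could look like. The class names, the scheme registry and the S3/"@latest" handling are hypothetical, not an existing Airflow API.)

# Hypothetical sketch of a pluggable "DagFetcher": each fetcher handles one
# URI scheme and makes the DAG artifact available on the local filesystem.
# None of these names exist in Airflow today.
import abc
import os
import tempfile
from urllib.parse import urlparse


class DagFetcher(abc.ABC):
    @abc.abstractmethod
    def fetch(self, uri: str) -> str:
        """Fetch the artifact behind `uri` and return a local path to load DAGs from."""


class S3ZipDagFetcher(DagFetcher):
    """Handles URIs like s3://company_dagbag/some_dag@latest ("zip on S3")."""

    def fetch(self, uri: str) -> str:
        import boto3  # assumed to be available on the worker

        parsed = urlparse(uri)
        key, _, version = parsed.path.lstrip("/").partition("@")
        # How "@latest" (or a concrete version) maps to S3 object versions or a
        # naming convention is left open here; only `key` is used in this sketch.
        local_zip = os.path.join(tempfile.mkdtemp(prefix="dagbag_"), "dag.zip")
        boto3.client("s3").download_file(parsed.netloc, key, local_zip)
        return local_zip


# A registry keyed by URI scheme would let an organization plug in its own
# fetchers without touching core code.
FETCHERS = {"s3": S3ZipDagFetcher()}


def fetch_dag_artifact(uri: str) -> str:
    return FETCHERS[urlparse(uri).scheme].fetch(uri)

"zip on S3", "pex on GFS" or "docker on Artifactory" would then just be additional entries in that registry.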
> On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:
>
>> The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place, so that we avoid the current `if format == zip then: else:` and don't tightly couple to specific DAG-sourcing implementations. Personally I feel that Docker makes more sense than wheels (since Docker images are fully self-contained, even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.
>>
>> On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:
>>
>> > Hi Jarek,
>> >
>> > This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?
>> >
>> > I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package metadata.
>> >
>> > Is that what you're envisioning? If so, I'd be happy to support you with the implementation!
>> >
>> > Also, while I think the idea of creating a temporary virtual environment for running tasks is very useful, I'd like this to be optional, as it can also add a lot of overhead to running tasks.
>> >
>> > Cheers,
>> >
>> > Björn
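(To make the entry_point idea concrete, a minimal sketch. The entry-point group name "airflow.dags" is invented for this illustration; whatever name the PR/AIP settles on would replace it.)

# setup.py of a hypothetical DAG distribution. DAG-specific dependencies
# travel in the package metadata via install_requires.
from setuptools import setup, find_packages

setup(
    name="example-sales-report-dag",
    version="1.0.0",
    packages=find_packages(),
    install_requires=["requests>=2.22", "pandas>=0.25"],
    entry_points={
        "airflow.dags": [
            "sales_report = example_dags.sales_report:dag",
        ],
    },
)

Discovery on the Airflow side could then be a simple iteration over that group (again, only a sketch):

# Sketch of how a DagBag-like loader could discover entry-point DAGs.
import pkg_resources


def load_entry_point_dags():
    dags = {}
    for entry_point in pkg_resources.iter_entry_points("airflow.dags"):
        dag = entry_point.load()  # imports the module and returns the DAG object
        dags[dag.dag_id] = dag
    return dags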
>> > > On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:
>> > >
>> > > I had a lot of interesting discussions over the last few days with Apache Airflow users at PyDataWarsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in the wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs, but they think it could be improved with the wheel format (which is also a .zip, BTW). Maybe it was already mentioned in some discussion before, but I have not found any.
>> > >
>> > > *Context:*
>> > >
>> > > We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems with implementing AIP-8 (splitting hooks/operators into separate packages) is dependencies. Different operators/hooks might have different dependencies if maintained separately. Currently we have a common set of dependencies, as we have only one setup.py, but if we split into separate packages, this might change.
>> > >
>> > > *Proposal:*
>> > >
>> > > Our users, who love the .zip DAG distribution, proposed that we package the DAGs and all related packages in a wheel package instead of a plain .zip. This would allow the users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately-packaged operators.
>> > >
>> > > The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem with AIP-8.
>> > >
>> > > I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21), and then we could have wheel packages with all the "extra" dependencies for each group of operators.
>> > >
>> > > A worker executing an operator could have the "core" dependencies installed initially; when it is supposed to run an operator, it could create a virtualenv, install the required "extra" from wheels, run the task for this operator in that virtualenv, and remove the virtualenv afterwards. We could have such wheel packages prepared (one wheel package per operator group) and distributed either the same way as DAGs or via some shared binary repository (and cached on the worker).
>> > >
>> > > Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies, they could be embedded in the DAG wheel, installed from it into this virtualenv, and the virtualenv would be removed after the task is finished.
>> > >
>> > > The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.
>> > >
>> > > For me the whole idea sounds pretty cool.
>> > >
>> > > Let me know what you think.
>> > >
>> > > J.
>> > >
>> > > --
>> > > Jarek Potiuk
>> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
>> > >
>> > > M: +48 660 796 129
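(To illustrate the "operator group" -> extra -> "pip packages" mapping from Jarek's mail, a sketch of how it could look in an operator package's setup.py. The group names and version pins are examples, not a proposal.)

# Sketch of per-operator-group extras in a hypothetical operator package.
from setuptools import setup, find_packages

EXTRAS_REQUIRE = {
    # operator group: pip packages that group needs
    "gcp": ["google-cloud-storage>=1.16", "google-cloud-bigquery>=1.21"],
    "aws": ["boto3>=1.10"],
    "postgres": ["psycopg2-binary>=2.8"],
}

setup(
    name="apache-airflow-operators-example",  # illustrative name only
    version="0.0.1",
    packages=find_packages(),
    extras_require=EXTRAS_REQUIRE,
)

# Wheels with all dependencies of one group could then be pre-built, e.g. with
# `pip wheel ".[gcp]" --wheel-dir wheels/gcp/`, and shipped to workers the same
# way as DAGs or via a binary repository.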

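(And a rough sketch of the worker-side mechanics of the ephemeral virtualenv idea. The wheel arguments and the final "airflow run ..." invocation are placeholders, and system_site_packages=True is just one assumption for reusing the worker's "core" install while keeping the extras isolated.)

# Illustrative only: create a throwaway virtualenv, install the DAG wheel and
# the per-operator-group "extra" wheels into it, run the task, then clean up.
import shutil
import subprocess
import tempfile
import venv


def run_task_in_ephemeral_venv(dag_wheel, extra_wheels, dag_id, task_id, execution_date):
    venv_dir = tempfile.mkdtemp(prefix="airflow_task_env_")
    try:
        # Reuse the worker's core install via system site-packages, so only the
        # extras and the DAG package land in the throwaway environment.
        venv.EnvBuilder(with_pip=True, system_site_packages=True).create(venv_dir)
        python = f"{venv_dir}/bin/python"

        subprocess.run(
            [python, "-m", "pip", "install", dag_wheel, *extra_wheels],
            check=True,
        )

        # Placeholder for however the worker actually hands the task off;
        # shown as a CLI-style invocation purely for illustration.
        subprocess.run(
            [python, "-m", "airflow", "run", dag_id, task_id, execution_date],
            check=True,
        )
    finally:
        # The virtualenv is removed once the task finishes.
        shutil.rmtree(venv_dir, ignore_errors=True)

Whether the per-task environment creation is worth the overhead would presumably be configurable, along the lines Björn suggests above.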