Thanks all for the input and thanks Zhou too for the detailed AIP. The WIP PR can be a good first step to overall optimization.
Let's sync up on the progress you have already made and what we want to target.

@Jarek Potiuk <jarek.pot...@polidea.com> & @Fokko - If we manage to make it
entirely backward-compatible with an enable/disable flag, as we mentioned, we
can think of including it in 1.10.5. However, I am in favor of cleaning up
things like pickles, dropping Python 2 support, cutting Airflow 2.0, and
including this change there.

On Mon, Jul 29, 2019 at 1:03 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Actually, I have also been doing a lot of v1-10-test merges over the last few
> months (probably several tens of them already). In fact, the conflicts are
> rarely difficult to solve. We usually have small, localised changes, and
> until we go for full Black file re-formatting we should be OK (and the
> change from Zhou seems rather small and localised).
>
> J.
>
> On Mon, Jul 29, 2019 at 9:25 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
>
> > I would be hesitant to merge it into 1.10.5. When I try to backport
> > anything into the 1.x branch, I get a whole bunch of merge conflicts, even
> > on the trivial tickets. For me, the only one who can really comment on this
> > would be Ash, since he's doing the bulk of the conflict resolving. Apart
> > from that, I'm really excited to make this happen!
> >
> > Cheers, Fokko
> >
> > On Sun, Jul 28, 2019 at 8:23 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > > Some thoughts I have after looking at the proposal from Zhou.
> > >
> > > I think this is one of the most important things feature-wise for Airflow.
> > > It looks like we have several in-progress attempts to solve the problem,
> > > and I guess we should agree on a common approach.
> > >
> > > I very much like the approach of Zhou (AIP-24).
> > > It does seem to minimise the changes needed in Airflow, and with some
> > > optimisations (the caching mentioned by Fokko) it can solve the major pain
> > > points relatively quickly, I think, and is potentially portable to 1.10.5
> > > if we have it.
> > >
> > > I wonder how much it overlaps with or differs from Kaxil's and Ash's ideas.
> > > If I read it correctly, it sounds like their idea will contain some more
> > > "fundamental" changes: ones that are likely less backwards-compatible and
> > > potentially take longer to implement and test, and that likely solve
> > > some of the problems better or even solve other problems. Am I right with
> > > my assumptions?
> > >
> > > I think more information on this might be helpful so that we all know
> > > whether those are two different AIPs or whether they can be joined in one
> > > effort, and how they relate to AIP-18/AIP-19 (should those be deprecated or
> > > independently implemented?). Also, since the 2.0.0 release is half a year
> > > ahead, we should consider how it impacts the roadmap.
> > >
> > > I can see three approaches here that we as a community can follow (maybe I
> > > am missing some :) ):
> > >
> > > 1) focus our work on a single "complete" solution that will take longer
> > >    and targets 2.0.0.
> > > 2) work on two of them: one quick/fast, potentially portable to 1.10.5,
> > >    and one longer-term for 2.0.0.
> > > 3) decide that the simple solution we have from Zhou (maybe with some
> > >    modifications) is our target solution (for both 1.10.5, if we have it,
> > >    and 2.0.0).
> > >
> > > J.
> > >
> > > On Sat, Jul 27, 2019 at 11:43 AM Kevin Yang <yrql...@gmail.com> wrote:
> > >
> > > > Nice job Zhou!
> > > >
> > > > Really excited, exactly what we wanted for the webserver scaling issue.
> > > > I want to add another big driver that previously got Airbnb thinking
> > > > about this, in support of the effort: it can bring consistency not only
> > > > between webservers but also between the webserver and the
> > > > scheduler/workers. It may be less of a problem if total DAG parsing time
> > > > is small, but for us the total DAG parsing time is 15+ mins and we had to
> > > > set the webserver (gunicorn subprocesses) restart interval to 20 mins,
> > > > which leads to a worst-case 15+20+15=50 mins delay between the scheduler
> > > > starting to schedule things and users being able to see their deployed
> > > > DAGs/changes...
> > > >
> > > > I'm not so sure about the scheduler performance improvement: currently we
> > > > already feed the main scheduler process with SimpleDag through
> > > > DagFileProcessorManager running in a subprocess; in the future we would
> > > > feed it with data from the DB, which is likely slower (though the
> > > > difference should have negligible impact on scheduler performance). In
> > > > fact, if we keep the existing behavior of trying to schedule only freshly
> > > > parsed DAGs, then we may need to deal with a consistency issue: the DAG
> > > > processor and the scheduler race to update the flag indicating whether
> > > > the DAG is newly parsed. No big deal there, just some thoughts off the
> > > > top of my head that will hopefully be helpful.
> > > >
> > > > And good idea on pre-rendering the template; I believe template rendering
> > > > was the biggest concern in the previous discussion.
> > > > We've also chosen the pre-rendering+JSON approach in our smart sensor API
> > > > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization>
> > > > and it seems to be working fine: a supporting case for your proposal ;)
> > > > There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it
> > > > just in case you are interested; maybe we can even share some logic.
> > > >
> > > > Thumbs-up again for this, and please don't hesitate to reach out if you
> > > > want to discuss further with us or need any help from us.
> > > >
> > > > Cheers,
> > > > Kevin Y
> > > >
> > > > On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > > >
> > > > > Looks great Zhou,
> > > > >
> > > > > I have one thing that pops into my mind while reading the AIP: should
> > > > > we keep the caching at the webserver level? As the famous quote goes:
> > > > > *"There are only two hard things in Computer Science: cache
> > > > > invalidation and naming things." -- Phil Karlton*
> > > > >
> > > > > Right now, the fundamental change that is being proposed in the AIP is
> > > > > fetching the DAGs from the database in a serialized format, and not
> > > > > parsing the Python files all the time. This will already give a great
> > > > > performance improvement on the webserver side because it removes a lot
> > > > > of the processing. However, since we're still fetching the DAGs from
> > > > > the database at a regular interval and caching them in the local
> > > > > process, we still have the two issues that Airflow is suffering from
> > > > > right now:
> > > > >
> > > > >    1. No snappy UI, because it is still polling the database at a
> > > > >    regular interval.
> > > > >    2. Inconsistency between webservers, because they might poll at
> > > > >    different intervals. I think we've all seen this:
> > > > >    https://www.youtube.com/watch?v=sNrBruPS3r4
> > > > >
> > > > > As I also mentioned in the Slack channel, I strongly feel that we
> > > > > should be able to render most views from the tables in the database,
> > > > > without touching the blob. For specific views, we could just pull the
> > > > > blob from the database. In this case we always have the latest version,
> > > > > and we tackle the second point above.
> > > > >
> > > > > To tackle the first one, I also have an idea. We should change the DAG
> > > > > parser from a loop to something that uses inotify:
> > > > > https://pypi.org/project/inotify_simple/. This will change it from
> > > > > polling to an event-driven design, which is much more performant and
> > > > > less resource hungry. But this would be an AIP on its own.
> > > > >
> > > > > Again, great design and a comprehensive AIP, but I would include the
> > > > > caching on the webserver to greatly improve the user experience in the
> > > > > UI. Looking forward to the opinion of others on this.
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > > On Sat, Jul 27, 2019 at 1:44 AM Zhou Fang <zhouf...@google.com.invalid> wrote:
> > > > >
> > > > > > Hi Kaxil,
> > > > > >
> > > > > > Just sent out the AIP:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > > > > >
> > > > > > Thanks!
> > > > > > Zhou
> > > > > >
> > > > > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhouf...@google.com> wrote:
> > > > > >
> > > > > > > Hi Kaxil,
> > > > > > >
> > > > > > > We are also working on persisting DAGs into the DB using JSON for
> > > > > > > the Airflow webserver in Google Composer. We aim to minimize the
> > > > > > > changes to the current Airflow code. Happy to get synced on this!
> > > > > > >
> > > > > > > Here is our progress:
> > > > > > > (1) Serializing DAGs using Pickle to be used in the webserver
> > > > > > > It has been launched in Composer. I am working on the PR to
> > > > > > > upstream it: https://github.com/apache/airflow/pull/5594
> > > > > > > Currently it does not support non-Airflow operators and we are
> > > > > > > working on a fix.
> > > > > > >
> > > > > > > (2) Caching pickled DAGs in the DB to be used by the webserver
> > > > > > > We have a proof-of-concept implementation and are working on an
> > > > > > > AIP now.
> > > > > > >
> > > > > > > (3) Using JSON instead of Pickle in (1) and (2)
> > > > > > > We decided to use JSON because Pickle is neither secure nor
> > > > > > > human-readable. The serialization approach is very similar to (1).
> > > > > > >
> > > > > > > I will update the PR (https://github.com/apache/airflow/pull/5594)
> > > > > > > to replace Pickle with JSON, and send our design of (2) as an AIP
> > > > > > > next week. Glad to check together whether our implementation makes
> > > > > > > sense and to make improvements on it.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Zhou
> > > > > > >
> > > > > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> We, at Astronomer, are going to spend time working on DAG
> > > > > > >> Serialisation.
> > > > > > >> There are 2 AIPs that are somewhat related to what we plan to
> > > > > > >> work on:
> > > > > > >>
> > > > > > >>    - AIP-18 Persist all information from DAG file in DB
> > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> > > > > > >>    - AIP-19 Making the webserver stateless
> > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> > > > > > >>
> > > > > > >> We plan to use JSON as the Serialisation format and store it as
> > > > > > >> a blob in the metadata DB.
> > > > > > >>
> > > > > > >> *Goals:*
> > > > > > >>
> > > > > > >>    - Make Webserver Stateless
> > > > > > >>    - Use the same version of the DAG across Webserver & Scheduler
> > > > > > >>    - Keep backward compatibility and have a flag (globally & at
> > > > > > >>    DAG level) to turn this feature on/off
> > > > > > >>    - Enable DAG Versioning (extended Goal)
> > > > > > >>
> > > > > > >> We will be preparing a proposal (AIP) after some research and
> > > > > > >> some initial work, and will open it for suggestions from the
> > > > > > >> community.
> > > > > > >>
> > > > > > >> We already had some good brain-storming sessions with Twitter
> > > > > > >> folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas) &
> > > > > > >> Alex (from Uber), which will be a good starting point for us.
> > > > > > >>
> > > > > > >> If anyone in the community is interested in it or has some
> > > > > > >> experience with it and wants to collaborate, please let me know
> > > > > > >> and join the #dag-serialisation channel on Airflow Slack.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Kaxil
> > >
> > > --
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > M: +48 660 796 129

--
*Kaxil Naik*
*Big Data Consultant | DevOps Data Engineer*
*Certified* Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j Developer
*LinkedIn*: https://www.linkedin.com/in/kaxil
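
For readers following the thread, here is a minimal sketch of the idea discussed above: serialize a DAG description to JSON instead of Pickle, store it as a blob in a database, and guard the write path with an enable/disable flag. It is only an illustration under stated assumptions and not the code from AIP-24 or the WIP PR. The names `TaskInfo`, `DagInfo`, `save_dag`, `load_dag`, the `enabled` parameter and the `serialized_dag` table are all hypothetical, and an in-memory SQLite database stands in for the metadata DB.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TaskInfo:
    # Hypothetical, simplified stand-in for an operator; not an Airflow class.
    task_id: str
    operator: str
    downstream: List[str] = field(default_factory=list)


@dataclass
class DagInfo:
    # Hypothetical, simplified stand-in for a DAG; not an Airflow class.
    dag_id: str
    schedule: Optional[str]
    tasks: List[TaskInfo] = field(default_factory=list)


def to_json(dag: DagInfo) -> str:
    """Serialize to a human-readable JSON string (no pickle involved)."""
    return json.dumps(asdict(dag), sort_keys=True)


def from_json(blob: str) -> DagInfo:
    """Rebuild the dataclasses from a JSON blob produced by to_json."""
    raw = json.loads(blob)
    tasks = [TaskInfo(**task) for task in raw["tasks"]]
    return DagInfo(dag_id=raw["dag_id"], schedule=raw["schedule"], tasks=tasks)


def save_dag(conn: sqlite3.Connection, dag: DagInfo, enabled: bool = True) -> None:
    """Writer side: store the blob only when the (assumed) feature flag is on."""
    if not enabled:
        return
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag.dag_id, to_json(dag)),
    )


def load_dag(conn: sqlite3.Connection, dag_id: str) -> Optional[DagInfo]:
    """Reader side: fetch the blob by id without parsing any DAG file."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return from_json(row[0]) if row else None


# In-memory SQLite stands in for the metadata DB in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")

example = DagInfo(
    dag_id="example",
    schedule="@daily",
    tasks=[
        TaskInfo("extract", "BashOperator", ["load"]),
        TaskInfo("load", "PythonOperator"),
    ],
)
save_dag(conn, example)
restored = load_dag(conn, "example")
```

The reader never imports or executes the DAG file; it only needs the JSON blob. A real implementation would also have to handle operator arguments, templates and callables that do not map directly onto JSON, which this sketch deliberately leaves out.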