The one added complexity in back-porting this to 1.10.x is that we have two webservers (classic and RBAC), so either we only add this feature to the RBAC path for a 1.10.5 release (which I am okay with), or someone other than me ports the changes to the classic UI once it's merged to master ;)
-ash

On 30 Jul 2019, at 09:23, Jarek Potiuk <[email protected]> wrote:

I think Zhou's change is pretty much backwards-compatible with 1.10.x: it's basically an optimisation that people might find really useful until 2.0 is out. I believe (correct me if I am wrong) it does not require any change from the user's perspective. Airflow will continue to behave the same way as 1.10.4, only becoming more resilient in the case of huge DAG folders. It also seems battle-tested already, and I think it's a small-ish effort to continue reviewing it and add the few changes that are being discussed in the PR.

I think, then, that a viable approach is to ONLY implement it in the v1-10-test branch and make a 1.10.5 release much faster than by the end of the year. If we can get 1.10.5 out with only that change (and maybe a few other bug fixes) in a month or two from now, that would be beneficial for users: they will get much better performance sooner, and they will have more time to prepare for the backwards-incompatible 2.0.0 migration later on. I guess it will take an additional few months for anyone migrating to 2.0.0 to convert their DAGs and do additional testing.

There is a small risk (and a bigger divergence between 1.10.x and 2.0.0), but I think the risk is very small here, especially since I am sure Zhou and the Composer team will find, fix, and back-port any issues they encounter in Composer (and that is one of the best testing grounds for Airflow ;)).

I volunteer to continue merging simple changes to the 1.10 branch if we go in that direction (and to help with the 1.10.5 release when the time comes). I have kind of got used to it already, and it's a bit of a habit to cherry-pick those changes back to the v1-10-test branch.

J.

On Tue, Jul 30, 2019 at 8:28 AM Kaxil Naik <[email protected]> wrote:

Thanks Kevin for the input.

I am working on this full-time as well, with help from Ash.
Persistence to DB is what we want to achieve while making the webserver stateless, hence mitigating the issues you described.

My opinion is that we should work towards that goal and aim to include it in Airflow 2.0. However, I am not completely against a short-term solution that is completely optional, though that might increase the work.

The diff between master and the current release branch is huge, so we should aim to cut 2.0 by the end of this year with the DAG serialisation and DB persistence feature.

We can then make minor releases directly from master and follow SemVer as well.

Regards,
Kaxil

On Tue, Jul 30, 2019, 09:02 Kevin Yang <[email protected]> wrote:

For sure! I'll try my best to squeeze some time for it during the weekend and see how I can help facilitate the effort (don't get your hopes up too much though, I've got ~2 million things piled up in my TODO list :P). I will bring this up to the team and see if my team can help too.

You guys have shown pretty solid understanding and skills, so I believe you can handle it well without me, but just in case you need me, don't hesitate to shoot me a direct mail.

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 8:19 PM Zhou Fang <[email protected]> wrote:

Hi Kevin, that makes sense. Thanks for the explanation! I hope we can get DB persistence moving faster.

Zhou

On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <[email protected]> wrote:

Oops, s/consistent file/consistent file order/

On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <[email protected]> wrote:

Hi Zhou,

Totally understood, thank you for that. The streaming logic does cover most cases, though we still have worst cases where os.walk doesn't give us a consistent file order, and where file/dir additions or renames cause a different result order from list_py_file_paths (e.g. right after we parse the first dir it is renamed, so it gets parsed last in the second DAG load; or we merge a new file right after the file paths are collected). Maybe there's a way to guarantee the order of parsing, but I'm not sure it's worth the effort, given that it is less of a problem if the end-to-end parsing time is small enough. I understand this may have started as a short-term improvement, but since it should not be much more complicated, we'd rather start with the unified long-term pattern.

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <[email protected]> wrote:

Hi Kevin,

Yes, DAG persistence in DB is definitely the way to go. I referred to the async DAG loader because it may alleviate your current problem (since the code is ready).

It actually reduces the time to 15 min, because DAGs are refreshed by the background process in a streaming way and you don't need to restart the webserver every 20 min.

Thanks,
Zhou

On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <[email protected]> wrote:

Hi Zhou,

Thank you for the pointer. This solves the issue of the gunicorn restart rate throttling the webserver refresh rate, but not the long DAG parsing time issue, right? Worst case we still wait 30 mins for a change to show up, compared to the previous 35 mins (I was wrong on the number: it should be 35 mins instead of 55 mins, as the clock starts whenever the webserver restarts). I believe in the previous discussion we first proposed this local webserver DAG parsing optimization, using the same DAG parsing logic as the scheduler to speed up the parsing.
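[As an aside, the ordering caveat Kevin raises above can be made concrete: os.walk makes no guarantee about the order in which it yields directories and files, so two scans of the same tree can return different orders. Sorting in place makes the scan deterministic. This is only an illustrative sketch; collect_py_files is a made-up name, not Airflow's actual list_py_file_paths.]

```python
import os

def collect_py_files(root):
    """Collect .py files under root in a deterministic order.

    os.walk yields entries in arbitrary, filesystem-dependent order;
    sorting dirnames in place (os.walk honours in-place edits) and
    sorting filenames makes two scans of an unchanged tree identical.
    """
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # controls the order os.walk descends into subdirs
        for name in sorted(filenames):
            if name.endswith(".py"):
                paths.append(os.path.join(dirpath, name))
    return paths
```

This only stabilises the order for an unchanged tree; as noted above, renames between scans can still reorder results.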
Then the stateless webserver proposal came up, and we were brought around to the idea that it is better to persist DAGs into the DB and read directly from the DB, for better DAG definition consistency and webserver cluster consistency. I'm all supportive of the proposed structure in AIP-24, but -1 on just feeding the webserver from a single subprocess parsing the DAGs. I would imagine there won't be too much additional work to fetch from the DB instead of a subprocess, would there? (I haven't looked into the serialization format part, but I assume they are the same/similar.)

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <[email protected]> wrote:

Hi Kevin,

The problem that DAG parsing takes a long time can be solved by asynchronous DAG loading: https://github.com/apache/airflow/pull/5594

The idea is that a background process parses the DAG files and sends DAGs to the webserver process every [webserver] dagbag_sync_interval = 10s.

We have launched it in Composer, so our users can set the webserver worker restart interval to 1 hour (or longer). The background DAG parsing process refreshes all DAGs every [webserver] collect_dags_interval = 30s.

If parsing all DAGs takes 15 min, you can see DAGs being gradually refreshed with this feature.

Thanks,
Zhou

On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <[email protected]> wrote:

Nice job Zhou!

Really excited, this is exactly what we wanted for the webserver scaling issue.
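[For reference, the knobs Zhou describes would look roughly like this in airflow.cfg. Note that dagbag_sync_interval and collect_dags_interval come from PR #5594 as described above and are not in released Airflow; worker_refresh_interval is the existing gunicorn worker restart interval.]

```ini
[webserver]
# Existing option: how often gunicorn workers are restarted.
# With async DAG loading this can be raised to an hour or longer.
worker_refresh_interval = 3600

# From PR #5594 (not in released Airflow): how often the background
# loader pushes its parsed DagBag to the webserver workers.
dagbag_sync_interval = 10

# From PR #5594 (not in released Airflow): how often the background
# process re-collects DAGs from the DAGs folder.
collect_dags_interval = 30
```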
I want to add another big driver that previously got Airbnb thinking about this, in support of the effort: it can bring not only consistency between webservers but also consistency between the webserver and the scheduler/workers. It may be less of a problem if the total DAG parsing time is small, but for us the total DAG parsing time is 15+ mins and we had to set the webserver (gunicorn subprocess) restart interval to 20 mins, which leads to a worst-case 15+20+15=50 min delay between the scheduler starting to schedule things and users seeing their deployed DAGs/changes...

I'm not so sure about the scheduler performance improvement: currently we already feed the main scheduler process with SimpleDags through the DagFileProcessorManager running in a subprocess; in the future we would feed it with data from the DB, which is likely slower (though the difference should have negligible impact on scheduler performance). In fact, if we keep the existing behavior of trying to schedule only freshly parsed DAGs, we may need to deal with a consistency issue: the DAG processor and the scheduler race to update the flag indicating whether a DAG is newly parsed. No big deal, just some thoughts off the top of my head that will hopefully be helpful.

And good idea on pre-rendering the template; I believe template rendering was the biggest concern in the previous discussion.
We also chose the pre-rendering+JSON approach in our smart sensor AIP <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization> and it seems to be working fine, a supporting case for your proposal ;) There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it, just in case you are interested; maybe we can even share some logic.

Thumbs-up again for this, and please don't hesitate to reach out if you want to discuss further with us or need any help from us.

Cheers,
Kevin Y

On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <[email protected]> wrote:

Looks great Zhou,

I have one thing that pops into my mind while reading the AIP: whether we should keep the caching at the webserver level. As the famous quote goes: "There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton

Right now, the fundamental change being proposed in the AIP is fetching the DAGs from the database in a serialized format rather than parsing the Python files all the time. This will already give a great performance improvement on the webserver side, because it removes a lot of the processing. However, since we're still fetching the DAGs from the database at a regular interval and caching them in the local process, we still have the two issues that Airflow is suffering from right now:

1. No snappy UI, because it is still polling the database at a regular interval.
2. Inconsistency between webservers, because they might poll at different intervals. I think we've all seen this: https://www.youtube.com/watch?v=sNrBruPS3r4

As I also mentioned in the Slack channel, I strongly feel that we should be able to render most views from the tables in the database, so without touching the blob. For specific views, we could just pull the blob from the database. In that case we always have the latest version, and we tackle the second point above.

To tackle the first one, I also have an idea. We should change the DAG parser from a loop to something that uses inotify (https://pypi.org/project/inotify_simple/). This will change it from polling to an event-driven design, which is much more performant and less resource-hungry. But this would be an AIP on its own.

Again, great design and a comprehensive AIP, but I would include the caching on the webserver to greatly improve the user experience in the UI. Looking forward to the opinions of others on this.

Cheers, Fokko

On Sat, 27 Jul 2019 at 01:44, Zhou Fang <[email protected]> wrote:

Hi Kaxil,

Just sent out the AIP:

https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler

Thanks!
Zhou

On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <[email protected]> wrote:

Hi Kaxil,

We are also working on persisting DAGs into the DB using JSON for the Airflow webserver in Google Composer. We are aiming to minimize the change to the current Airflow code. Happy to get synced on this!

Here is our progress:

(1) Serializing DAGs using Pickle to be used in the webserver.
It has been launched in Composer, and I am working on the PR to upstream it: https://github.com/apache/airflow/pull/5594
Currently it does not support non-Airflow operators, and we are working on a fix.

(2) Caching pickled DAGs in the DB to be used by the webserver.
We have a proof-of-concept implementation and are working on an AIP now.

(3) Using JSON instead of Pickle in (1) and (2).
We decided to use JSON because Pickle is neither secure nor human-readable. The serialization approach is very similar to (1).

I will update the PR (https://github.com/apache/airflow/pull/5594) to replace Pickle with JSON, and send our design for (2) as an AIP next week. Glad to check together whether our implementation makes sense and to make improvements on it.

Thanks!
Zhou

On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <[email protected]> wrote:

Hi all,

We, at Astronomer, are going to spend time working on DAG Serialisation.
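[To make point (3) above concrete: the gist of JSON over Pickle is that a JSON blob can be loaded without executing user code, and it can be inspected by eye. A minimal sketch follows; the field names are made up for illustration and are not the format from PR #5594 or the AIP.]

```python
import json

def serialize_dag(dag):
    # Flatten a DAG into plain data. Unlike pickle.loads, json.loads
    # cannot execute arbitrary code, and the blob is human-readable.
    return json.dumps({
        "dag_id": dag.dag_id,
        "tasks": [
            {"task_id": t.task_id, "operator": type(t).__name__}
            for t in dag.tasks
        ],
    }, sort_keys=True)
```

The webserver could then render views from json.loads(blob) without ever importing the user's DAG file.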
There are 2 AIPs that are somewhat related to what we plan to work on:

- AIP-18: Persist all information from DAG file in DB
  <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
- AIP-19: Making the webserver stateless
  <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>

We plan to use JSON as the serialisation format and store it as a blob in the metadata DB.

*Goals:*

- Make the webserver stateless
- Use the same version of the DAG across webserver & scheduler
- Keep backward compatibility and have a flag (globally & at DAG level) to turn this feature on/off
- Enable DAG versioning (extended goal)

We will be preparing a proposal (AIP) after some research and some initial work, and will open it for suggestions from the community.

We have already had some good brainstorming sessions with the Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas), and Alex (from Uber), which will be a good starting point for us.

If anyone in the community is interested in this, or has experience with it and wants to collaborate, please let me know and join the #dag-serialisation channel on Airflow Slack.
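[The backward-compatibility flag in the goals above might take the shape of a config option like the following; the option name and section here are hypothetical, and the actual name would be settled in the AIP.]

```ini
[core]
# Hypothetical opt-in switch: when False, the webserver keeps the
# current behaviour of parsing DAG files itself instead of reading
# serialized DAGs from the metadata DB.
store_serialized_dags = False
```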
Regards,
Kaxil

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129
