Hi Kevin, that makes sense. Thanks for the explanation! Hope we can get the DB persistence work moving faster.
Zhou

On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <[email protected]> wrote:

> oops, s/consistent file/consistent file order/
>
> On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <[email protected]> wrote:
>
>> Hi Zhou,
>>
>> Totally understood, thank you for that. The streaming logic does cover most cases, though we still have worst cases where os.walk doesn't give us a consistent file order, and file/dir additions/renames cause a different result order from list_py_file_paths (e.g. right after we parsed the first dir, it was renamed and will be parsed last in the 2nd DAG loading round, or we merged a new file right after the file paths were collected). Maybe there's a way to guarantee the order of parsing, but I'm not sure it's worth the effort, given that it is less of a problem if the end-to-end parsing time is small enough. I understand it may have started as a short-term improvement, but since it should not be too much more complicated, we'd rather start with the unified long-term pattern.
>>
>> Cheers,
>> Kevin Y
>>
>> On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <[email protected]> wrote:
>>
>>> Hi Kevin,
>>>
>>> Yes. DAG persistence in DB is definitely the way to go. I referred to the async DAG loader because it may alleviate your current problem (since the code is ready).
>>>
>>> It actually reduces the time to 15 min, because DAGs are refreshed by the background process in a streaming way and you don't need to restart the webserver every 20 min.
>>>
>>> Thanks,
>>> Zhou
>>>
>>> On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <[email protected]> wrote:
>>>
>>>> Hi Zhou,
>>>>
>>>> Thank you for the pointer. This solves the issue that the gunicorn restart rate throttles the webserver refresh rate, but not the long DAG parsing time issue, right?
>>>> Worst case we still wait 30 mins for a change to show up, compared to the previous 35 mins (I was wrong on the number; it should be 35 mins instead of 55 mins, as the clock starts whenever the webserver restarts). I believe in the previous discussion we first proposed this local webserver DAG parsing optimization to reuse the scheduler's DAG parsing logic to speed up parsing. Then the stateless webserver proposal came up, and we were persuaded that it is a better idea to persist DAGs into the DB and read directly from the DB, for better DAG definition consistency and webserver cluster consistency. I'm all supportive of the proposed structure in AIP-24, but -1 on just feeding the webserver from a single subprocess parsing the DAGs. I would imagine there won't be too much additional work to fetch from the DB instead of a subprocess, would there? (I haven't looked into the serialization format part, but I'm assuming they are the same/similar.)
>>>>
>>>> Cheers,
>>>> Kevin Y
>>>>
>>>> On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <[email protected]> wrote:
>>>>
>>>>> Hi Kevin,
>>>>>
>>>>> The problem that DAG parsing takes a long time can be solved by asynchronous DAG loading: https://github.com/apache/airflow/pull/5594
>>>>>
>>>>> The idea is that a background process parses DAG files and sends DAGs to the webserver process every [webserver] dagbag_sync_interval = 10s.
>>>>>
>>>>> We have launched it in Composer, so our users can set the webserver worker restart interval to 1 hour (or longer). The background DAG parsing process refreshes all DAGs per [webserver] collect_dags_interval = 30s.
>>>>>
>>>>> If parsing all DAGs takes 15 min, you can see DAGs being gradually refreshed with this feature.
>>>>>
>>>>> Thanks,
>>>>> Zhou
>>>>>
>>>>> On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <[email protected]> wrote:
>>>>>
>>>>>> Nice job Zhou!
>>>>>> Really excited, this is exactly what we wanted for the webserver scaling issue.
>>>>>> I want to add another big driver that got Airbnb thinking about this previously, to support the effort: it can not only bring consistency between webservers but also bring consistency between the webserver and scheduler/workers. It may be less of a problem if the total DAG parsing time is small, but for us the total DAG parsing time is 15+ mins and we had to set the webserver (gunicorn subprocess) restart interval to 20 mins, which leads to a worst-case 15+20+15=50 min delay between when the scheduler starts to schedule things and when users can see their deployed DAGs/changes...
>>>>>>
>>>>>> I'm not so sure about the scheduler performance improvement: currently we already feed the main scheduler process with SimpleDag through the DagFileProcessorManager running in a subprocess; in the future we'd feed it with data from the DB, which is likely slower (though the diff should have negligible impact on scheduler performance). In fact, if we keep the existing behavior and try to schedule only freshly parsed DAGs, we may need to deal with some consistency issues: the DAG processor and the scheduler race to update the flag indicating whether the DAG is newly parsed. No big deal, but just some thoughts off the top of my head that hopefully can be helpful.
>>>>>>
>>>>>> And good idea on pre-rendering the template; I believe template rendering was the biggest concern in the previous discussion.
>>>>>> We've also chosen the pre-rendering+JSON approach in our smart sensor API <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization> and it seems to be working fine--a supporting case for your proposal ;) There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it just in case you are interested--maybe we can even share some logic.
>>>>>>
>>>>>> Thumbs-up again for this, and please don't hesitate to reach out if you want to discuss further with us or need any help from us.
>>>>>>
>>>>>> Cheers,
>>>>>> Kevin Y
>>>>>>
>>>>>> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <[email protected]> wrote:
>>>>>>
>>>>>> > Looks great Zhou,
>>>>>> >
>>>>>> > I have one thing that popped into my mind while reading the AIP: whether we should keep the caching at the webserver level. As the famous quote goes: *"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton*
>>>>>> >
>>>>>> > Right now, the fundamental change proposed in the AIP is fetching the DAGs from the database in a serialized format, instead of parsing the Python files all the time. This will already give a great performance improvement on the webserver side because it removes a lot of the processing. However, since we're still fetching the DAGs from the database at a regular interval and caching them in the local process, we still have the two issues that Airflow is suffering from right now:
>>>>>> >
>>>>>> > 1. No snappy UI, because it is still polling the database at a regular interval.
>>>>>> > 2. Inconsistency between webservers, because they might poll at a different interval. I think we've all seen this: https://www.youtube.com/watch?v=sNrBruPS3r4
>>>>>> >
>>>>>> > As I also mentioned in the Slack channel, I strongly feel that we should be able to render most views from the tables in the database, so without touching the blob. For specific views, we could just pull the blob from the database. In this case we always have the latest version, and we tackle the second point above.
>>>>>> >
>>>>>> > To tackle the first one, I also have an idea. We should change the DAG parser from a loop to something that uses inotify: https://pypi.org/project/inotify_simple/. This will change it from polling to an event-driven design, which is much more performant and less resource-hungry. But this would be an AIP on its own.
>>>>>> >
>>>>>> > Again, great design and a comprehensive AIP, but I would reconsider the caching on the webserver to greatly improve the user experience in the UI. Looking forward to the opinions of others on this.
>>>>>> >
>>>>>> > Cheers, Fokko
>>>>>> >
>>>>>> > On Sat, Jul 27, 2019 at 01:44, Zhou Fang <[email protected]> wrote:
>>>>>> >
>>>>>> > > Hi Kaxil,
>>>>>> > >
>>>>>> > > Just sent out the AIP:
>>>>>> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>>>>> > >
>>>>>> > > Thanks!
>>>>>> > > Zhou
>>>>>> > >
>>>>>> > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <[email protected]> wrote:
>>>>>> > >
>>>>>> > > > Hi Kaxil,
>>>>>> > > >
>>>>>> > > > We are also working on persisting DAGs into the DB using JSON for the Airflow webserver in Google Composer.
>>>>>> > > > We aim to minimize changes to the current Airflow code. Happy to get synced on this!
>>>>>> > > >
>>>>>> > > > Here is our progress:
>>>>>> > > >
>>>>>> > > > (1) Serializing DAGs using Pickle to be used in the webserver
>>>>>> > > > It has been launched in Composer. I am working on the PR to upstream it: https://github.com/apache/airflow/pull/5594
>>>>>> > > > Currently it does not support non-Airflow operators, and we are working on a fix.
>>>>>> > > >
>>>>>> > > > (2) Caching pickled DAGs in the DB to be used by the webserver
>>>>>> > > > We have a proof-of-concept implementation and are working on an AIP now.
>>>>>> > > >
>>>>>> > > > (3) Using JSON instead of Pickle in (1) and (2)
>>>>>> > > > We decided to use JSON because Pickle is neither secure nor human-readable. The serialization approach is very similar to (1).
>>>>>> > > >
>>>>>> > > > I will update the PR (https://github.com/apache/airflow/pull/5594) to replace Pickle with JSON, and send our design for (2) as an AIP next week. Glad to check together whether our implementation makes sense and make improvements on it.
>>>>>> > > >
>>>>>> > > > Thanks!
>>>>>> > > > Zhou
>>>>>> > > >
>>>>>> > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <[email protected]> wrote:
>>>>>> > > >
>>>>>> > > >> Hi all,
>>>>>> > > >>
>>>>>> > > >> We, at Astronomer, are going to spend time working on DAG Serialisation.
>>>>>> > > >> There are 2 AIPs that are somewhat related to what we plan to work on:
>>>>>> > > >>
>>>>>> > > >> - AIP-18 Persist all information from DAG file in DB
>>>>>> > > >>   <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
>>>>>> > > >> - AIP-19 Making the webserver stateless
>>>>>> > > >>   <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
>>>>>> > > >>
>>>>>> > > >> We plan to use JSON as the serialisation format and store it as a blob in the metadata DB.
>>>>>> > > >>
>>>>>> > > >> *Goals:*
>>>>>> > > >>
>>>>>> > > >> - Make the webserver stateless
>>>>>> > > >> - Use the same version of the DAG across webserver & scheduler
>>>>>> > > >> - Keep backward compatibility and have a flag (globally & at DAG level) to turn this feature on/off
>>>>>> > > >> - Enable DAG versioning (extended goal)
>>>>>> > > >>
>>>>>> > > >> We will be preparing a proposal (AIP) after some research and some initial work, and will open it for suggestions from the community.
>>>>>> > > >>
>>>>>> > > >> We already had some good brainstorming sessions with Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas), and Alex (from Uber), which will be a good starting point for us.
>>>>>> > > >>
>>>>>> > > >> If anyone in the community is interested or has some experience in this area and wants to collaborate, please let me know and join the #dag-serialisation channel on Airflow Slack.
>>>>>> > > >>
>>>>>> > > >> Regards,
>>>>>> > > >> Kaxil
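For readers following the asynchronous DAG loading discussion in the thread above, here is a minimal sketch of the idea: a background process periodically collects DAGs and streams them to the webserver process, which merges them into its local dagbag on a sync interval. All names and data shapes here are hypothetical illustrations, not the actual implementation in https://github.com/apache/airflow/pull/5594.

```python
# Sketch of async DAG loading: a background process "parses" DAG files on a
# collect interval and pushes the results to the webserver through a queue,
# so the webserver never blocks on a full parse of every file.
import multiprocessing
import queue as queue_mod
import time


def collect_dags(dag_files):
    """Stand-in for real DAG file parsing; returns {dag_id: definition}."""
    return {f"dag_from_{name}": {"file": name} for name in dag_files}


def dag_loader(q, dag_files, collect_dags_interval, rounds):
    """Background process: re-collect all DAGs every collect_dags_interval."""
    for _ in range(rounds):  # a real loader would loop forever
        q.put(collect_dags(dag_files))
        time.sleep(collect_dags_interval)


def webserver_sync(q, dagbag, timeout):
    """Webserver side: merge freshly parsed DAGs into the in-process cache."""
    try:
        dagbag.update(q.get(timeout=timeout))
    except queue_mod.Empty:
        pass  # nothing new arrived within this sync interval


if __name__ == "__main__":
    q = multiprocessing.Queue()
    loader = multiprocessing.Process(
        target=dag_loader, args=(q, ["a.py", "b.py"], 0.1, 1))
    loader.start()
    dagbag = {}  # the webserver's local cache of parsed DAGs
    webserver_sync(q, dagbag, timeout=5)
    loader.join()
    print(sorted(dagbag))  # ['dag_from_a.py', 'dag_from_b.py']
```

With this shape, a slow full parse only delays how fresh the background snapshot is; the webserver itself keeps serving whatever it last synced, which is the streaming behavior Zhou describes.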
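To make the JSON-persistence idea concrete, here is a minimal round-trip sketch: flatten a DAG's tasks and dependencies into a JSON string that could be stored as a blob in the metadata DB and rehydrated by a stateless webserver. The field names are illustrative assumptions, not the schema from AIP-24 or PR #5594.

```python
# Sketch of serializing a DAG to a JSON blob and rebuilding a lightweight
# view from it. Unlike Pickle, the blob is human-readable and does not
# execute code on load, which is the motivation given in the thread.
import json


def serialize_dag(dag_id, tasks, deps):
    """tasks: {task_id: operator_name}; deps: list of (upstream, downstream)."""
    return json.dumps({
        "dag_id": dag_id,
        "tasks": [{"task_id": t, "operator": op}
                  for t, op in sorted(tasks.items())],
        "dependencies": [{"upstream": u, "downstream": d} for u, d in deps],
    })


def deserialize_dag(blob):
    """Rebuild (dag_id, {task_id: operator}) -- enough for most list views."""
    data = json.loads(blob)
    return data["dag_id"], {t["task_id"]: t["operator"] for t in data["tasks"]}


if __name__ == "__main__":
    blob = serialize_dag(
        "example_etl",
        {"extract": "PythonOperator", "load": "PythonOperator"},
        [("extract", "load")])
    dag_id, tasks = deserialize_dag(blob)
    print(dag_id, sorted(tasks))  # example_etl ['extract', 'load']
```

This also illustrates Fokko's point: most webserver views only need the lightweight fields, so they could be served from regular DB columns, with the full blob pulled only for the few views that need it.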
