Hi Kevin, that makes sense. Thanks for the explanation! Hope we can get the DB persistence work moving faster.
Zhou

On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <[email protected]> wrote:

> oops, s/consistent file/consistent file order/
>
> On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <[email protected]> wrote:
>
>> Hi Zhou,
>>
>> Totally understood, thank you for that. The streaming logic does cover most cases, though we still have worst cases where os.walk doesn't give us a consistent file order, and file/dir additions/renames cause a different result order from list_py_file_paths (e.g. right after we parsed the first dir, it was renamed and will be parsed last in the 2nd DAG loading round, or we merged a new file right after the file paths were collected). Maybe there's a way to guarantee the order of parsing, but I'm not sure it's worth the effort, given that it is less of a problem if the end-to-end parsing time is small enough. I understand it may have started as a short-term improvement, but since it should not be too much more complicated, we'd rather start with the unified long-term pattern.
>>
>> Cheers,
>> Kevin Y
>>
>> On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <[email protected]> wrote:
>>
>>> Hi Kevin,
>>>
>>> Yes. DAG persistence in DB is definitely the way to go. I referred to the async DAG loader because it may alleviate your current problem (since the code is ready).
>>>
>>> It actually reduces the time to 15 min, because DAGs are refreshed by the background process in a streaming way and you don't need to restart the webserver every 20 min.
>>>
>>> Thanks,
>>> Zhou
>>>
>>> On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <[email protected]> wrote:
>>>
>>>> Hi Zhou,
>>>>
>>>> Thank you for the pointer. This solves the issue that the gunicorn restart rate throttles the webserver refresh rate, but not the long DAG parsing time issue, right?
>>>> Worst case we still wait 30 mins for a change to show up, compared to the previous 35 mins (I was wrong on the number; it should be 35 mins instead of 55 mins, as the clock starts whenever the webserver restarts). I believe in the previous discussion we first proposed this local webserver DAG parsing optimization to reuse the scheduler's DAG parsing logic to speed up parsing. Then the stateless webserver proposal came up, and we were persuaded that it is a better idea to persist DAGs into the DB and read directly from the DB, for better DAG definition consistency and webserver cluster consistency. I'm all supportive of the proposed structure in AIP-24, but -1 on just feeding the webserver from a single subprocess parsing the DAGs. I would imagine there won't be too much additional work to fetch from the DB instead of a subprocess, would there? (I haven't looked into the serialization format part, but I'm assuming they are the same/similar.)
>>>>
>>>> Cheers,
>>>> Kevin Y
>>>>
>>>> On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <[email protected]> wrote:
>>>>
>>>>> Hi Kevin,
>>>>>
>>>>> The problem that DAG parsing takes a long time can be solved by asynchronous DAG loading: https://github.com/apache/airflow/pull/5594
>>>>>
>>>>> The idea is that a background process parses DAG files and sends DAGs to the webserver process every [webserver] dagbag_sync_interval = 10s.
>>>>>
>>>>> We have launched it in Composer, so our users can set the webserver worker restart interval to 1 hour (or longer). The background DAG parsing process refreshes all DAGs per [webserver] collect_dags_interval = 30s.
>>>>>
>>>>> If parsing all DAGs takes 15 min, you can see DAGs being gradually refreshed with this feature.
>>>>>
>>>>> Thanks,
>>>>> Zhou
>>>>>
>>>>> On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <[email protected]> wrote:
>>>>>
>>>>>> Nice job Zhou!
>>>>>> Really excited, this is exactly what we wanted for the webserver scaling issue.
>>>>>> I want to add another big driver that got Airbnb thinking about this previously, to support the effort: it can not only bring consistency between webservers but also bring consistency between the webserver and scheduler/workers. It may be less of a problem if the total DAG parsing time is small, but for us the total DAG parsing time is 15+ mins and we had to set the webserver (gunicorn subprocess) restart interval to 20 mins, which leads to a worst-case 15+20+15=50 min delay between when the scheduler starts to schedule things and when users can see their deployed DAGs/changes...
>>>>>>
>>>>>> I'm not so sure about the scheduler performance improvement: currently we already feed the main scheduler process with SimpleDag through the DagFileProcessorManager running in a subprocess; in the future we'd feed it with data from the DB, which is likely slower (though the diff should have negligible impact on scheduler performance). In fact, if we keep the existing behavior and try to schedule only freshly parsed DAGs, we may need to deal with some consistency issues: the DAG processor and the scheduler race to update the flag indicating whether the DAG is newly parsed. No big deal, but just some thoughts off the top of my head that hopefully can be helpful.
>>>>>>
>>>>>> And good idea on pre-rendering the template; I believe template rendering was the biggest concern in the previous discussion.
>>>>>> We've also chosen the pre-rendering+JSON approach in our smart sensor API <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization> and it seems to be working fine--a supporting case for your proposal ;) There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it just in case you are interested--maybe we can even share some logic.
>>>>>>
>>>>>> Thumbs-up again for this, and please don't hesitate to reach out if you want to discuss further with us or need any help from us.
>>>>>>
>>>>>> Cheers,
>>>>>> Kevin Y
>>>>>>
>>>>>> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <[email protected]> wrote:
>>>>>>
>>>>>> > Looks great Zhou,
>>>>>> >
>>>>>> > I have one thing that popped into my mind while reading the AIP: whether we should keep the caching at the webserver level. As the famous quote goes: *"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton*
>>>>>> >
>>>>>> > Right now, the fundamental change proposed in the AIP is fetching the DAGs from the database in a serialized format, instead of parsing the Python files all the time. This will already give a great performance improvement on the webserver side because it removes a lot of the processing. However, since we're still fetching the DAGs from the database at a regular interval and caching them in the local process, we still have the two issues that Airflow is suffering from right now:
>>>>>> >
>>>>>> > 1. No snappy UI, because it is still polling the database at a regular interval.
>>>>>> > 2. Inconsistency between webservers, because they might poll at a different interval. I think we've all seen this: https://www.youtube.com/watch?v=sNrBruPS3r4
>>>>>> >
>>>>>> > As I also mentioned in the Slack channel, I strongly feel that we should be able to render most views from the tables in the database, so without touching the blob. For specific views, we could just pull the blob from the database. In this case we always have the latest version, and we tackle the second point above.
>>>>>> >
>>>>>> > To tackle the first one, I also have an idea. We should change the DAG parser from a loop to something that uses inotify: https://pypi.org/project/inotify_simple/. This will change it from polling to an event-driven design, which is much more performant and less resource-hungry. But this would be an AIP on its own.
>>>>>> >
>>>>>> > Again, great design and a comprehensive AIP, but I would reconsider the caching on the webserver to greatly improve the user experience in the UI. Looking forward to the opinions of others on this.
>>>>>> >
>>>>>> > Cheers, Fokko
>>>>>> >
>>>>>> > On Sat, Jul 27, 2019 at 01:44, Zhou Fang <[email protected]> wrote:
>>>>>> >
>>>>>> > > Hi Kaxil,
>>>>>> > >
>>>>>> > > Just sent out the AIP:
>>>>>> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>>>>> > >
>>>>>> > > Thanks!
>>>>>> > > Zhou
>>>>>> > >
>>>>>> > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <[email protected]> wrote:
>>>>>> > >
>>>>>> > > > Hi Kaxil,
>>>>>> > > >
>>>>>> > > > We are also working on persisting DAGs into the DB using JSON for the Airflow webserver in Google Composer.
>>>>>> > > > We aim to minimize changes to the current Airflow code. Happy to get synced on this!
>>>>>> > > >
>>>>>> > > > Here is our progress:
>>>>>> > > >
>>>>>> > > > (1) Serializing DAGs using Pickle to be used in the webserver
>>>>>> > > > It has been launched in Composer. I am working on the PR to upstream it: https://github.com/apache/airflow/pull/5594
>>>>>> > > > Currently it does not support non-Airflow operators, and we are working on a fix.
>>>>>> > > >
>>>>>> > > > (2) Caching pickled DAGs in the DB to be used by the webserver
>>>>>> > > > We have a proof-of-concept implementation and are working on an AIP now.
>>>>>> > > >
>>>>>> > > > (3) Using JSON instead of Pickle in (1) and (2)
>>>>>> > > > We decided to use JSON because Pickle is neither secure nor human-readable. The serialization approach is very similar to (1).
>>>>>> > > >
>>>>>> > > > I will update the PR (https://github.com/apache/airflow/pull/5594) to replace Pickle with JSON, and send our design for (2) as an AIP next week. Glad to check together whether our implementation makes sense and make improvements on it.
>>>>>> > > >
>>>>>> > > > Thanks!
>>>>>> > > > Zhou
>>>>>> > > >
>>>>>> > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <[email protected]> wrote:
>>>>>> > > >
>>>>>> > > >> Hi all,
>>>>>> > > >>
>>>>>> > > >> We, at Astronomer, are going to spend time working on DAG Serialisation.
>>>>>> > > >> There are 2 AIPs that are somewhat related to what we plan to work on:
>>>>>> > > >>
>>>>>> > > >> - AIP-18 Persist all information from DAG file in DB
>>>>>> > > >>   <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
>>>>>> > > >> - AIP-19 Making the webserver stateless
>>>>>> > > >>   <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
>>>>>> > > >>
>>>>>> > > >> We plan to use JSON as the serialisation format and store it as a blob in the metadata DB.
>>>>>> > > >>
>>>>>> > > >> *Goals:*
>>>>>> > > >>
>>>>>> > > >> - Make the webserver stateless
>>>>>> > > >> - Use the same version of the DAG across webserver & scheduler
>>>>>> > > >> - Keep backward compatibility and have a flag (globally & at DAG level) to turn this feature on/off
>>>>>> > > >> - Enable DAG versioning (extended goal)
>>>>>> > > >>
>>>>>> > > >> We will be preparing a proposal (AIP) after some research and some initial work, and will open it for suggestions from the community.
>>>>>> > > >>
>>>>>> > > >> We already had some good brainstorming sessions with Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas), and Alex (from Uber), which will be a good starting point for us.
>>>>>> > > >>
>>>>>> > > >> If anyone in the community is interested or has some experience in this area and wants to collaborate, please let me know and join the #dag-serialisation channel on Airflow Slack.
>>>>>> > > >>
>>>>>> > > >> Regards,
>>>>>> > > >> Kaxil
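For readers following the asynchronous DAG loading discussion in the thread above, here is a minimal sketch of the idea: a background process periodically collects DAGs and streams them to the webserver process, which merges them into its local dagbag on a sync interval. All names and data shapes here are hypothetical illustrations, not the actual implementation in https://github.com/apache/airflow/pull/5594.

```python
# Sketch of async DAG loading: a background process "parses" DAG files on a
# collect interval and pushes the results to the webserver through a queue,
# so the webserver never blocks on a full parse of every file.
import multiprocessing
import queue as queue_mod
import time


def collect_dags(dag_files):
    """Stand-in for real DAG file parsing; returns {dag_id: definition}."""
    return {f"dag_from_{name}": {"file": name} for name in dag_files}


def dag_loader(q, dag_files, collect_dags_interval, rounds):
    """Background process: re-collect all DAGs every collect_dags_interval."""
    for _ in range(rounds):  # a real loader would loop forever
        q.put(collect_dags(dag_files))
        time.sleep(collect_dags_interval)


def webserver_sync(q, dagbag, timeout):
    """Webserver side: merge freshly parsed DAGs into the in-process cache."""
    try:
        dagbag.update(q.get(timeout=timeout))
    except queue_mod.Empty:
        pass  # nothing new arrived within this sync interval


if __name__ == "__main__":
    q = multiprocessing.Queue()
    loader = multiprocessing.Process(
        target=dag_loader, args=(q, ["a.py", "b.py"], 0.1, 1))
    loader.start()
    dagbag = {}  # the webserver's local cache of parsed DAGs
    webserver_sync(q, dagbag, timeout=5)
    loader.join()
    print(sorted(dagbag))  # ['dag_from_a.py', 'dag_from_b.py']
```

With this shape, a slow full parse only delays how fresh the background snapshot is; the webserver itself keeps serving whatever it last synced, which is the streaming behavior Zhou describes.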
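To make the JSON-persistence idea concrete, here is a minimal round-trip sketch: flatten a DAG's tasks and dependencies into a JSON string that could be stored as a blob in the metadata DB and rehydrated by a stateless webserver. The field names are illustrative assumptions, not the schema from AIP-24 or PR #5594.

```python
# Sketch of serializing a DAG to a JSON blob and rebuilding a lightweight
# view from it. Unlike Pickle, the blob is human-readable and does not
# execute code on load, which is the motivation given in the thread.
import json


def serialize_dag(dag_id, tasks, deps):
    """tasks: {task_id: operator_name}; deps: list of (upstream, downstream)."""
    return json.dumps({
        "dag_id": dag_id,
        "tasks": [{"task_id": t, "operator": op}
                  for t, op in sorted(tasks.items())],
        "dependencies": [{"upstream": u, "downstream": d} for u, d in deps],
    })


def deserialize_dag(blob):
    """Rebuild (dag_id, {task_id: operator}) -- enough for most list views."""
    data = json.loads(blob)
    return data["dag_id"], {t["task_id"]: t["operator"] for t in data["tasks"]}


if __name__ == "__main__":
    blob = serialize_dag(
        "example_etl",
        {"extract": "PythonOperator", "load": "PythonOperator"},
        [("extract", "load")])
    dag_id, tasks = deserialize_dag(blob)
    print(dag_id, sorted(tasks))  # example_etl ['extract', 'load']
```

This also illustrates Fokko's point: most webserver views only need the lightweight fields, so they could be served from regular DB columns, with the full blob pulled only for the few views that need it.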
