The one added complexity in back-porting this to 1.10.x is that we have two webservers (classic and RBAC), so either we only add this feature to the RBAC path for a 1.10.5 release (which I am okay with), or someone other than me ports the changes to the classic UI once it's merged to master ;)
-ash

On 30 Jul 2019, at 09:23, Jarek Potiuk <[email protected]> wrote:

I think Zhou's change is pretty much backwards-compatible with 1.10.x: it's basically an optimisation that people might find really useful until 2.0 is out. I believe (correct me if I am wrong) it does not require any change from the user's perspective. Airflow will continue to behave the same way as 1.10.4, only becoming more resilient in the case of huge DAG folders. It also seems battle-tested already, and I think it's a small-ish effort to continue reviewing it and add the few changes that are being discussed in the PR.

I think, then, that a viable approach is to ONLY implement it in the v1-10-test branch and make a 1.10.5 release much faster than by the end of the year. If we can get 1.10.5 out with only that change (and maybe a few other bug fixes) in a month or two from now, that would be beneficial for users: they will get much better performance sooner, and they will have more time to prepare for the backwards-incompatible 2.0.0 migration later on. I guess it will take an additional few months for anyone migrating to 2.0.0 to convert their DAGs and do additional testing.

There is a small risk (and a bigger divergence between 1.10.x and 2.0.0), but I think the risk is very small here, especially since I am sure Zhou and the Composer team will find, fix, and back-port any issues they encounter in Composer (and that is one of the best testing grounds for Airflow ;)).

I volunteer to continue merging simple changes to the 1.10 branch if we go in that direction (and to help with the 1.10.5 release when the time comes). I have kind of got used to it already, and it's a bit of a habit to cherry-pick those changes back to the v1-10-test branch.

J.

On Tue, Jul 30, 2019 at 8:28 AM Kaxil Naik <[email protected]> wrote:

Thanks Kevin for the input.

I am working on this full-time as well, with help from Ash.
Persistence to DB is what we want to achieve while making the webserver stateless, hence mitigating the issues you described.

My opinion is that we should work towards that goal and aim to include it in Airflow 2.0. However, I am not completely against a short-term solution that is completely optional, though that might increase the work.

The diff between master and the current release branch is huge, so we should aim to cut 2.0 by the end of this year with the DAG serialisation and DB persistence feature.

We can then make minor releases directly from master and follow SemVer as well.

Regards,
Kaxil

On Tue, Jul 30, 2019, 09:02 Kevin Yang <[email protected]> wrote:

For sure! I'll try my best to squeeze some time for it during the weekend and see how I can help facilitate the effort (don't get your hopes up too much though, I've got ~2 million things piled up in my TODO list :P). I will bring this up to the team and see if my team can help too.

You guys have shown pretty solid understanding and skills, so I believe you can handle it well without me, but just in case you need me, don't hesitate to shoot me a direct mail.

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 8:19 PM Zhou Fang <[email protected]> wrote:

Hi Kevin, that makes sense. Thanks for the explanation! I hope we can get DB persistence moving faster.

Zhou

On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <[email protected]> wrote:

Oops, s/consistent file/consistent file order/

On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <[email protected]> wrote:

Hi Zhou,

Totally understood, thank you for that. The streaming logic does cover most cases, though we still have worst cases where os.walk doesn't give us a consistent file order, and where file/dir additions or renames cause a different result order from list_py_file_paths (e.g. right after we parse the first dir it is renamed, so it gets parsed last in the second DAG load; or we merge a new file right after the file paths are collected). Maybe there's a way to guarantee the order of parsing, but I'm not sure it's worth the effort, given that it is less of a problem if the end-to-end parsing time is small enough. I understand this may have started as a short-term improvement, but since it should not be much more complicated, we'd rather start with the unified long-term pattern.

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <[email protected]> wrote:

Hi Kevin,

Yes, DAG persistence in DB is definitely the way to go. I referred to the async DAG loader because it may alleviate your current problem (since the code is ready).

It actually reduces the time to 15 min, because DAGs are refreshed by the background process in a streaming way and you don't need to restart the webserver every 20 min.

Thanks,
Zhou

On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <[email protected]> wrote:

Hi Zhou,

Thank you for the pointer. This solves the issue of the gunicorn restart rate throttling the webserver refresh rate, but not the long DAG parsing time issue, right? Worst case we still wait 30 mins for a change to show up, compared to the previous 35 mins (I was wrong on the number: it should be 35 mins instead of 55 mins, as the clock starts whenever the webserver restarts). I believe in the previous discussion we first proposed this local webserver DAG parsing optimization, using the same DAG parsing logic as the scheduler to speed up the parsing.
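[As an aside, the ordering caveat Kevin raises above can be made concrete: os.walk makes no guarantee about the order in which it yields directories and files, so two scans of the same tree can return different orders. Sorting in place makes the scan deterministic. This is only an illustrative sketch; collect_py_files is a made-up name, not Airflow's actual list_py_file_paths.]

```python
import os

def collect_py_files(root):
    """Collect .py files under root in a deterministic order.

    os.walk yields entries in arbitrary, filesystem-dependent order;
    sorting dirnames in place (os.walk honours in-place edits) and
    sorting filenames makes two scans of an unchanged tree identical.
    """
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # controls the order os.walk descends into subdirs
        for name in sorted(filenames):
            if name.endswith(".py"):
                paths.append(os.path.join(dirpath, name))
    return paths
```

This only stabilises the order for an unchanged tree; as noted above, renames between scans can still reorder results.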
Then the stateless webserver proposal came up, and we were brought around to the idea that it is better to persist DAGs into the DB and read directly from the DB, for better DAG definition consistency and webserver cluster consistency. I'm all supportive of the proposed structure in AIP-24, but -1 on just feeding the webserver from a single subprocess parsing the DAGs. I would imagine there won't be too much additional work to fetch from the DB instead of a subprocess, would there? (I haven't looked into the serialization format part, but I assume they are the same/similar.)

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <[email protected]> wrote:

Hi Kevin,

The problem that DAG parsing takes a long time can be solved by asynchronous DAG loading: https://github.com/apache/airflow/pull/5594

The idea is that a background process parses the DAG files and sends DAGs to the webserver process every [webserver] dagbag_sync_interval = 10s.

We have launched it in Composer, so our users can set the webserver worker restart interval to 1 hour (or longer). The background DAG parsing process refreshes all DAGs every [webserver] collect_dags_interval = 30s.

If parsing all DAGs takes 15 min, you can see DAGs being gradually refreshed with this feature.

Thanks,
Zhou

On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <[email protected]> wrote:

Nice job Zhou!

Really excited, this is exactly what we wanted for the webserver scaling issue.
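[For reference, the knobs Zhou describes would look roughly like this in airflow.cfg. Note that dagbag_sync_interval and collect_dags_interval come from PR #5594 as described above and are not in released Airflow; worker_refresh_interval is the existing gunicorn worker restart interval.]

```ini
[webserver]
# Existing option: how often gunicorn workers are restarted.
# With async DAG loading this can be raised to an hour or longer.
worker_refresh_interval = 3600

# From PR #5594 (not in released Airflow): how often the background
# loader pushes its parsed DagBag to the webserver workers.
dagbag_sync_interval = 10

# From PR #5594 (not in released Airflow): how often the background
# process re-collects DAGs from the DAGs folder.
collect_dags_interval = 30
```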
I want to add another big driver that previously got Airbnb thinking about this, in support of the effort: it can bring not only consistency between webservers but also consistency between the webserver and the scheduler/workers. It may be less of a problem if the total DAG parsing time is small, but for us the total DAG parsing time is 15+ mins and we had to set the webserver (gunicorn subprocess) restart interval to 20 mins, which leads to a worst-case 15+20+15=50 min delay between the scheduler starting to schedule things and users seeing their deployed DAGs/changes...

I'm not so sure about the scheduler performance improvement: currently we already feed the main scheduler process with SimpleDags through the DagFileProcessorManager running in a subprocess; in the future we would feed it with data from the DB, which is likely slower (though the difference should have negligible impact on scheduler performance). In fact, if we keep the existing behavior of trying to schedule only freshly parsed DAGs, we may need to deal with a consistency issue: the DAG processor and the scheduler race to update the flag indicating whether a DAG is newly parsed. No big deal, just some thoughts off the top of my head that will hopefully be helpful.

And good idea on pre-rendering the template; I believe template rendering was the biggest concern in the previous discussion.
We also chose the pre-rendering+JSON approach in our smart sensor AIP <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization> and it seems to be working fine, a supporting case for your proposal ;) There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it, just in case you are interested; maybe we can even share some logic.

Thumbs-up again for this, and please don't hesitate to reach out if you want to discuss further with us or need any help from us.

Cheers,
Kevin Y

On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <[email protected]> wrote:

Looks great Zhou,

I have one thing that pops into my mind while reading the AIP: whether we should keep the caching at the webserver level. As the famous quote goes: "There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton

Right now, the fundamental change being proposed in the AIP is fetching the DAGs from the database in a serialized format rather than parsing the Python files all the time. This will already give a great performance improvement on the webserver side, because it removes a lot of the processing. However, since we're still fetching the DAGs from the database at a regular interval and caching them in the local process, we still have the two issues that Airflow is suffering from right now:

1. No snappy UI, because it is still polling the database at a regular interval.
2. Inconsistency between webservers, because they might poll at different intervals. I think we've all seen this: https://www.youtube.com/watch?v=sNrBruPS3r4

As I also mentioned in the Slack channel, I strongly feel that we should be able to render most views from the tables in the database, so without touching the blob. For specific views, we could just pull the blob from the database. In that case we always have the latest version, and we tackle the second point above.

To tackle the first one, I also have an idea. We should change the DAG parser from a loop to something that uses inotify (https://pypi.org/project/inotify_simple/). This will change it from polling to an event-driven design, which is much more performant and less resource-hungry. But this would be an AIP on its own.

Again, great design and a comprehensive AIP, but I would include the caching on the webserver to greatly improve the user experience in the UI. Looking forward to the opinions of others on this.

Cheers, Fokko

On Sat, 27 Jul 2019 at 01:44, Zhou Fang <[email protected]> wrote:

Hi Kaxil,

Just sent out the AIP:

https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler

Thanks!
Zhou

On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <[email protected]> wrote:

Hi Kaxil,

We are also working on persisting DAGs into the DB using JSON for the Airflow webserver in Google Composer. We are aiming to minimize the change to the current Airflow code. Happy to get synced on this!

Here is our progress:

(1) Serializing DAGs using Pickle to be used in the webserver.
It has been launched in Composer, and I am working on the PR to upstream it: https://github.com/apache/airflow/pull/5594
Currently it does not support non-Airflow operators, and we are working on a fix.

(2) Caching pickled DAGs in the DB to be used by the webserver.
We have a proof-of-concept implementation and are working on an AIP now.

(3) Using JSON instead of Pickle in (1) and (2).
We decided to use JSON because Pickle is neither secure nor human-readable. The serialization approach is very similar to (1).

I will update the PR (https://github.com/apache/airflow/pull/5594) to replace Pickle with JSON, and send our design for (2) as an AIP next week. Glad to check together whether our implementation makes sense and to make improvements on it.

Thanks!
Zhou

On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <[email protected]> wrote:

Hi all,

We, at Astronomer, are going to spend time working on DAG Serialisation.
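[To make point (3) above concrete: the gist of JSON over Pickle is that a JSON blob can be loaded without executing user code, and it can be inspected by eye. A minimal sketch follows; the field names are made up for illustration and are not the format from PR #5594 or the AIP.]

```python
import json

def serialize_dag(dag):
    # Flatten a DAG into plain data. Unlike pickle.loads, json.loads
    # cannot execute arbitrary code, and the blob is human-readable.
    return json.dumps({
        "dag_id": dag.dag_id,
        "tasks": [
            {"task_id": t.task_id, "operator": type(t).__name__}
            for t in dag.tasks
        ],
    }, sort_keys=True)
```

The webserver could then render views from json.loads(blob) without ever importing the user's DAG file.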
There are 2 AIPs that are somewhat related to what we plan to work on:

- AIP-18: Persist all information from DAG file in DB
  <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
- AIP-19: Making the webserver stateless
  <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>

We plan to use JSON as the serialisation format and store it as a blob in the metadata DB.

*Goals:*

- Make the webserver stateless
- Use the same version of the DAG across webserver & scheduler
- Keep backward compatibility and have a flag (globally & at DAG level) to turn this feature on/off
- Enable DAG versioning (extended goal)

We will be preparing a proposal (AIP) after some research and some initial work, and will open it for suggestions from the community.

We have already had some good brainstorming sessions with the Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas), and Alex (from Uber), which will be a good starting point for us.

If anyone in the community is interested in this, or has experience with it and wants to collaborate, please let me know and join the #dag-serialisation channel on Airflow Slack.
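[The backward-compatibility flag in the goals above might take the shape of a config option like the following; the option name and section here are hypothetical, and the actual name would be settled in the AIP.]

```ini
[core]
# Hypothetical opt-in switch: when False, the webserver keeps the
# current behaviour of parsing DAG files itself instead of reading
# serialized DAGs from the metadata DB.
store_serialized_dags = False
```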
Regards,
Kaxil

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129
