Hi Julian, I'm definitely aligned with you guys on making the webserver independent of DAG parsing, just the end goal to me would be to build a complete story around serializing DAG--and move with the story in mind. I feel like you guys may already have a list of dynamic features we need to deprecate/change, if that is the case feel free to open the discussion on what we do to them with DAG serialization.
Julian, Ash, Dan, on 2nd thought I do agree that if we can meet the requirements Dan mentioned, it would be nice to have them stored in the DB. Some combined solutions like having a column of serialized graph in the serialized dag table can potentially meet all requirements. What format we end up using to represent DAG between components is now less important IMO--fine to refactor those endpoints only need DagModel to use only DagModel, easy to do a batch replacement if we decide otherwise later. More important is to define this source of truth for serialized DAG. Ash, ty for the email list, I'll tune my filters accordingly :D I'm leaning towards having a separate process for the parser so we got no scheduler dependency etc for this parser but we can discuss this in another thread. On Fri, Mar 8, 2019 at 8:57 AM Dan Davydov <ddavy...@twitter.com.invalid> wrote: > > > > Personally I don’t understand why people are pushing for a JSON-based DAG > > representation > > It sounds like you agree that DAGs should be serialized (just in the DB > instead of JSON), so will only address why JSON is better than MySQL (AKA > serializing at the DAG level vs the task level) as far as I can see, and > not why we need serialization. If you zoom out and look at all the use > cases of serialized DAGs, e.g. having the scheduler use them instead of > parsing DAGs directly, then it becomes clear that we need all appropriate > metadata in these DAGs, (operator params, DAG properties, etc), in which > case it's not clear how it will fit nicely into a DB table (unless you > wanted to do something like (parent_task_id, task_id, task_params), also > keep in mind that we will need to store different versions of each DAG in > the future so that we can ensure consistency in a dagrun, i.e. each task in > a dagrun uses the same version of a DAG. > > I think some of our requirements should be: > 1. The data model will lead to acceptable performance in all of its > consumers (scheduler, webserver, workers), i.e. no n+1 access patterns (my > biggest concern about serializing at task level as you propose vs at DAG > level) > 2. We can have versioning of serialized DAGs > 3. The ability to separate DAGs into their own data store (e.g. no reliance > on joins between the new table and the old one) > 4. One source of truth/serialized representation for DAGs (currently we > have SimpleDAG) > > If we can full-fill all of these requirements and serialize at the task > level rather than the DAG level in the DB, then I agree that probably makes > more sense. > > > > In the proposed PR’s we (Peter, Bas and me) aim to avoid re-parsing DAG > > files by querying all the required information from the database. In one > or > > two cases this may however not be possible, in which case we might either > > have to fall back on the DAG file or add the missing information into the > > database. We can tackle these problems as we encounter them. > > I think you would have the support of many of committers in removing any > use-cases that stand in the way of full serialization, that being said if > we need to remove features we need to do this carefully and thoughtfully, > and ideally with proposed alternatives/work-arounds to cover the removals. > > The counter argument: this PR removes the need for the confusing "Refresh" > > button from the UI, and in general you only pay the cost for the > expensive > > DAGs when you ask about them. (I don't know what/when we call the > > /pickle_info endpoint of the top of my head) > > Probably worth splitting out into a separate thread, but I'm actually not > sure the refresh button does anything, I think we should double check... I > think about 2 years ago there was a commit made that made gunicorn > webservers automatically rotate underneath flask (each one would reparse > the DAGbag). Even if it works we should probably remove it since the > webserver refresh interval is pretty fast, and it just causes confusion to > users and implies that the DAGs are not refreshed automatically. > > Do you mean https://json5.org/ or is this a typo? That might be okay for a > > nicer user front end, but the "canonical" version stored in the DB should > > be something "plainer" like just JSON. > > I think he got this from my reply, and it was just an example, but you are > right, I agree JSON would be better than JSON5. > > On Fri, Mar 8, 2019 at 8:53 AM Ash Berlin-Taylor <a...@apache.org> wrote: > > > Comments inline. > > > > > On 8 Mar 2019, at 11:28, Kevin Yang <yrql...@gmail.com> wrote: > > > > > > Hi all, > > > When I was preparing some work related to this AIP I found something > > very concerning. I noticed this JIRA ticket < > > https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove > > the dependency of dagbag from webserver, which is awesome--we wanted > badly > > but never got to start work on. However when I looked at some subtasks of > > it, which try to remove dagbag dependency from each endpoint, I found the > > way we remove the dependency of dagbag is not very ideal. For example > this > > PR <https://github.com/apache/airflow/pull/4867/files> will require us > to > > parse the dag file each time we hit the endpoint. > > > > The counter argument: this PR removes the need for the confusing > "Refresh" > > button from the UI, and in general you only pay the cost for the > expensive > > DAGs when you ask about them. (I don't know what/when we call the > > /pickle_info endpoint of the top of my head) > > > > This end point may be one to hold off on (as it can ask for multiple > dags) > > but there are some that def don't need a full dag bag or to even parse > the > > dag file, the current DAG model has enough info. > > > > > > > > > > > If we go down this path, we indeed can get rid of the dagbag dependency > > easily, but we will have to 1. increase the DB load( not too concerning > at > > the moment ), 2. wait the DAG file to be parsed before getting the page > > back, potentially multiple times. DAG file can sometimes take quite a > while > > to parse, e.g. we have some framework DAG files generating large number > of > > DAGs from some static config files or even jupyter notebooks and they can > > take 30+ seconds to parse. Yes we don't like large DAG files but people > do > > see the beauty of code as config and sometimes heavily abuseleverage it. > > Assuming all users have the same nice small python file that can be > parsed > > fast, I'm still a bit worried about this approach. Continuing on this > path > > means we've chosen DagModel to be the serialized representation of DAG > and > > DB columns to hold different properties--it can be one candidate but I > > don't know if we should settle on that now. I would personally prefer a > > more compact, e.g. JSON5, and easy to scale representation( such that > > serializing new fields != DB upgrade). > > > > Do you mean https://json5.org/ or is this a typo? That might be okay for > > a nicer user front end, but the "canonical" version stored in the DB > should > > be something "plainer" like just JSON. > > > > I'm not sure that "serializing new fields != DB upgrade" is that big of a > > concern, as we don't add fields that often. One possible way of dealing > > with it if we do is to have a hybrid approach - a few distinct columns, > but > > then a JSON blob. (and if we were only to support postgres we could just > > use JSONb. But I think our friends at Google may object ;) ) > > > > Adding a new column in a DB migration with a default NULL shouldn't be an > > expensive operation, or difficult to achieve. > > > > > > > > > > In my imagination we would have to collect the list of dynamic features > > depending on unserializable fields of a DAG and start a discussion/vote > on > > dropping support of them( I'm working on this but if anyone has already > > done so please take over), decide on the serialized representation of a > DAG > > and then replace dagbag with it in webserver. Per previous discussion and > > some offline discussions with Dan, one future of DAG serialization that I > > like would look similar to this: > > > > > > > > https://imgur.com/ncqqQgc > > > > Something I've thought about before for other things was to embed an API > > server _into_ the scheduler - this would be useful for k8s healthchecks, > > native Prometheus metrics without needed statsd bridge, and could have > > endpoints to get information such as this directly. > > > > I was thinking it would be _in_ the scheduler process using either > threads > > (ick. Python's still got a GIL doesn't it?) or using async/twisted etc. > > (not a side-car process like we have with the logs webserver for `airflow > > worker`). > > > > (This is possibly an unrelated discussion, but might be worth talking > > about?) > > > > > We can still discuss/vote which approach we want to take but I don't > > want the door to above design to be shut right now or we have to spend a > > lot effort switch path later. > > > > > > Bas and Peter, I'm very sorry to extend the discussion but I do think > > this is tightly related to the AIP and PRs behind it. And my sincere > > apology for bringing this up so late( I only pull the open PR list > > occasionally, if there's a way to subscribe to new PR event I'd love to > > know how). > > > > It's noisy, but you can subscribe to comm...@airflow.apache.org (but be > > warned, this also includes all Jira tickets, edits of every comment on > > github etc.). > > > > > > > > > > Cheers, > > > Kevin Y > > > > > > On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <pjrvant...@gmail.com > > <mailto:pjrvant...@gmail.com>> wrote: > > > Hi all, > > > > > > Just some comments one the point Bolke dit give in relation of my PR. > > > > > > At first, the main focus is: making the webserver stateless. > > > > > > > 1) Make the webserver stateless: needs the graph of the *current* dag > > > > > > This is the main goal but for this a lot more PR’s will be coming once > > my current is merged. For edges and graph view this is covered in my PR > > already. > > > > > > > 2) Version dags: for consistency mainly and not requiring parsing of > > the > > > > dag on every loop > > > > > > In my PR the historical graphs will be stored for each DagRun. This > > means that you can see if an older DagRun was the same graph structure, > > even if some tasks does not exists anymore in the current graph. > Especially > > for dynamic DAG’s this is very useful. > > > > > > > 3) Make the scheduler not require DAG files. This could be done if > the > > > > edges contain all information when to trigger the next task. We can > > then > > > > have event driven dag parsing outside of the scheduler loop, ie. by > the > > > > cli. Storage can also be somewhere else (git, artifactory, > filesystem, > > > > whatever). > > > > > > The scheduler is almost untouched in this PR. The only thing that is > > added is that this edges are saved to the database but the scheduling > > itself din’t change. The scheduler depends now still on the DAG object. > > > > > > > 4) Fully serialise the dag so it becomes transferable to workers > > > > > > It nice to see that people has a lot of idea’s about this. But as Fokko > > already mentioned this is out of scope for the issue what we are trying > to > > solve. I also have some idea’s about this but I like to limit this PR/AIP > > to the webserver. > > > > > > For now my PR does solve 1 and 2 and the rest of the behaviour (like > > scheduling) is untouched. > > > > > > Gr, > > > Peter > > > > > > > >