Re: [PROPOSAL][AIP-36 DAG Versioning]

2020-07-28 Thread Dan Davydov
Strongly agree with Max's points. I also feel the right way to go about this is for Airflow schedulers/webservers/workers to read serialized representations of the DAGs (e.g. a JSON representation in the Airflow DB) instead of reading DAG Python files. Instead of DAG owners pushing

Re: [AIP-35] Add Signal Based Scheduling To Airflow

2020-06-18 Thread Dan Davydov
Metacomment: You might want to consider moving this discussion to a google doc or something since it seems like a lot of divergent threads are being created due to the scope of this change, which can be a bit hard to keep track of in email. If folks are concerned about history we can dump the

Re: [UPDATE] AIP-31 .output update

2020-06-18 Thread Dan Davydov
+1 (binding) On Thu, Jun 18, 2020 at 12:32 PM Daniel Imberman wrote: > +1 (binding) > > On Wed, Jun 17, 2020 at 12:13 AM, Tomasz Urbaszek > wrote: > +1 (binding) > > On Wed, Jun 17, 2020 at 3:39

Re: [DISCUSS] Parametrized DAGs

2020-06-16 Thread Dan Davydov
> > > > > > > Regards, > > > > > Kaxil > > > > > > > > > > On Mon, Jun 15, 2020 at 11:48 PM Gerard Casas Saez > > > > > wrote: > > > > > > > > > > > I do not think we should support RunTimeParams to modif

Re: [AIP-34] Rewrite SubDagOperator

2020-06-12 Thread Dan Davydov
Agree with James (and think it's actually the more important issue to fix), but I am still convinced Ash's idea is the right way forward (it just might require a bit more work to deprecate than adding visual grouping in the UI). There was a previous thread about this FYI with more context on why

Re: [DISCUSS] Parametrized DAGs

2020-06-12 Thread Dan Davydov
I think this is a great idea! One thing that I think we should figure out before implementing is how to do so alongside DAG serialization, i.e. letting these params modify DAG topology might make it hard to store serialized representations for the Airflow services to consume and render, though

Re: Proposal: Change TI from having execution date to dag_run_id (in API at least)

2020-05-14 Thread Dan Davydov
+1 but in the future I think better would be /dags/{dag_id}/dagRuns/{execution_date}/{run_number}. That would give an automatic ordering between two runs, is a lot simpler than "backfill_2020-03-16T00:00:00+00:00" and helps enable the multiple dagruns per execution date that you mention. On Thu,
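A minimal sketch of the ordering argument above, assuming a run is identified by an (execution_date, run_number) pair. The names RunKey and run_path are hypothetical and are not part of any Airflow API; only the path layout comes from the message.

```python
from datetime import datetime
from typing import NamedTuple


class RunKey(NamedTuple):
    """Hypothetical identifier for a DAG run: tuples compare field by field."""

    execution_date: datetime
    run_number: int


def run_path(dag_id: str, key: RunKey) -> str:
    # Path layout taken from the message; the helper itself is an assumption.
    return f"/dags/{dag_id}/dagRuns/{key.execution_date.isoformat()}/{key.run_number}"


# Two runs for the same execution date sort by run number automatically.
first = RunKey(datetime(2020, 3, 16), 1)
second = RunKey(datetime(2020, 3, 16), 2)
later_date = RunKey(datetime(2020, 3, 17), 1)
ordered = sorted([later_date, second, first])
```

Because the key is a plain tuple, no string parsing of names like "backfill_2020-03-16T00:00:00+00:00" is needed to order runs.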

Re: Setting to add choice of schedule at end or schedule at start of interval

2020-05-11 Thread Dan Davydov
I strongly agree with Ash, I also think we should strive to decrease the complexity of core Airflow components and not offer customization/extensibility especially in the form of plugins where it is not needed to make Airflow more robust and easier to reason about (less testing configuration). I

Re: [VOTE] AIP-31: Airflow functional DAG definition

2020-03-25 Thread Dan Davydov
Sorry +1 (binding) :). On Wed, Mar 25, 2020 at 4:53 PM Dan Davydov wrote: > +1 > > On Wed, Mar 25, 2020 at 4:40 PM Tomasz Urbaszek < > tomasz.urbas...@polidea.com> wrote: > >> Hello everyone! >> >> This email calls for a vote on the design proposed in

Re: [VOTE] AIP-31: Airflow functional DAG definition

2020-03-25 Thread Dan Davydov
+1 On Wed, Mar 25, 2020 at 4:40 PM Tomasz Urbaszek wrote: > Hello everyone! > > This email calls for a vote on the design proposed in AIP-31: > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-31:+Airflow+functional+DAG+definition > > Discussion threads: > >

Re: [PROPOSAL][AIP-15 Support Multiple-Schedulers for HA & Better Scheduling Performance]

2020-03-16 Thread Dan Davydov
Haven't checked the math in the AIP but I believe with the given formula, with 5 schedulers and 100 DAGs there is already a 9% chance of conflict and the larger users of Airflow have many more DAGs than that. I'm a bit concerned about putting more load on the DB which is already a

Re: Proposal: SIG-Kubernetes

2020-02-26 Thread Dan Davydov
+1 would love to listen in on these On Wed, Feb 26, 2020 at 11:52 AM Tomasz Urbaszek < tomasz.urbas...@polidea.com> wrote: > +1 for the idea. Should Airflow k8s operator be included in those > discussions? > > Also, I'm not sure if we have any more need to have autoscaling-only SIG? > Should we

Re: Airflow and Machine Learning

2020-02-24 Thread Dan Davydov
our approach to that. We save and > > > show in > > > > > > the Airflow UI every specific version of the DAG. This is > important > > > in ML > > > > > > use cases because of the data science experimentation cycle and > the > >

Re: Airflow and Machine Learning

2020-02-19 Thread Dan Davydov
Twitter uses Airflow primarily for ML, to create automated pipelines for retraining data, but also for more ad-hoc training jobs. The biggest gaps are on the experimentation side. It takes too long for a new user to set up and run a pipeline and then iterate on it. This problem is a bit more

Re: [DISCUSS] Airflow's extensibility options ?

2020-02-18 Thread Dan Davydov
I think the kinds of plugins you are talking about make sense in some contexts (e.g. custom views when clicking on a task in the UI, e.g. ability to visualize the data an ETL job provides) but we should be careful allowing extensions to more core parts, it will become very hard to change/maintain the

Re: [DISCUSS] Airflow functional DAGs

2020-02-05 Thread Dan Davydov
> > On Wed, 5 Feb 2020 at 22:12, Dan Davydov > wrote: > > > Traditionally we've done this in confluence within the AIP although I > think > > I would prefer google docs at some point in the future maybe :). I would > > use confluence though for this. > >

Re: [DISCUSS] Airflow functional DAGs

2020-02-05 Thread Dan Davydov
Traditionally we've done this in confluence within the AIP although I think I would prefer google docs at some point in the future maybe :). I would use confluence though for this. On Wed, Feb 5, 2020 at 3:52 PM Gerard Casas Saez wrote: > Happy to drive this. What would be a good place to put

Re: [DISCUSS] Airflow functional DAGs

2020-02-03 Thread Dan Davydov
I like it : ). I think the difficulty in creating operators and chaining them together is one of the most common complaints about Airflow compared to other frameworks. Would be curious to see a comparison to other interfaces e.g. Dagster as well. I would be curious to see what other committers

Re: [DISCUSS] Lineage improvements and standardization of operator signatures

2020-01-22 Thread Dan Davydov
Just want to preface my reply with the fact that I haven't thought about data lineage very much. This is an awesome idea :)! I like something like 1) personally, e.g. operators could optionally define a .outlet() and .inlet() interface which would return the inlets and outlets of a given task,

Re: Autolink References enabled for Airflow

2020-01-06 Thread Dan Davydov
This is awesome, especially for committers :)! Thank you Kaxil! On Mon, Jan 6, 2020 at 3:11 PM Kaxil Naik wrote: > Hi all, > > A couple of days back I opened an Issue with Apache Infra to autolink Jira > references in commits: https://issues.apache.org/jira/browse/INFRA-19655 > > This is based

Re: [DISCUSS] Packaging DAG/operator dependencies in wheels

2019-12-16 Thread Dan Davydov
The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place so we avoid the current `if format == zip then: else:` and make sure that we don't tightly
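One way to picture the "right interface" mentioned above is a small source abstraction, so callers never branch on the packaging format. This is only a sketch: DagSource, InMemoryDagSource, ZipDagSource and load_all are hypothetical names, not existing Airflow classes.

```python
import io
import zipfile
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class DagSource(ABC):
    """Hypothetical interface: anything that can yield DAG file contents."""

    @abstractmethod
    def read_files(self) -> Dict[str, str]:
        """Return a mapping of file name to Python source text."""


class InMemoryDagSource(DagSource):
    """Serves files from a plain dict (stands in for a directory on disk)."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = dict(files)

    def read_files(self) -> Dict[str, str]:
        return dict(self._files)


class ZipDagSource(DagSource):
    """Serves the .py members of a zip archive held in memory."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def read_files(self) -> Dict[str, str]:
        with zipfile.ZipFile(io.BytesIO(self._data)) as archive:
            return {
                name: archive.read(name).decode("utf-8")
                for name in archive.namelist()
                if name.endswith(".py")
            }


def load_all(sources: Iterable[DagSource]) -> Dict[str, str]:
    # The caller treats every source the same way; no format checks here.
    files: Dict[str, str] = {}
    for source in sources:
        files.update(source.read_files())
    return files
```

Adding another packaging format (for example wheels) would then mean adding one more DagSource subclass rather than another branch in the loader.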

Re: Improving the Airflow UI

2019-11-27 Thread Dan Davydov
+1 to everything you said, it all sounds like awesome work : ). Hopefully will be easier to make the front-end code testable as well. Another thing to maybe think about in the future is plugin/customization of the UI. E.g. being able to have custom UI widgets for operators that e.g. visualize data

Re: Closing JIRA Issue for Merged PRs

2019-11-18 Thread Dan Davydov
Wait this doesn't happen automatically!? I thought way-back-when someone wrote a script to automatically close the JIRA tickets (maybe that script is not run when changes are merged via the UI). My apologies, will close JIRAs in the future, I don't think I've closed any JIRA tickets manually. On

Re: [DISCUSS] Back to (some) dependency pinning

2019-11-18 Thread Dan Davydov
My 2 cents on the long-term plan: Once Airflow has Dag Isolation (i.e. DAG python dependencies are completely decoupled from Airflow), we should pin the core Airflow deps, and package operators separately with version ranges (for the reasons Ash mentioned about libraries vs applications). On Sat,

Re: Drop Python 3.5 support?

2019-11-12 Thread Dan Davydov
+1 On Tue, Nov 12, 2019 at 4:46 PM Jarek Potiuk wrote: > Yep. It was actually a '+1' in disguise Bolke :). > > On Tue, Nov 12, 2019 at 10:44 PM Christian Lellmann > wrote: > > > +1 from my side too! > > > > Bolke de Bruin wrote on Tue, Nov 12, 2019, 22:39: > > > > > I guess that's a +1

Re: [PROPOSAL] Migrate to Pytest

2019-10-16 Thread Dan Davydov
+1 On Wed, Oct 16, 2019 at 12:01 PM Christian Lellmann wrote: > +1 from my side too. > > Regards, > > Chris > > Driesprong, Fokko wrote on Wed, Oct 16, 2019, > 17:01: > > > +1 > > > > On Mon, Oct 14, 2019 at 16:50, Felix Uellendall > > > > wrote: > > > > > +1, successfully using pytest for

Re: [VOTE] AIP-24: Persisting serialized DAG in DB for webserver scalability

2019-10-16 Thread Dan Davydov
sing the existing tables) > > > - How are we going to do state evolution when we extend the JSON model > > > > The top level object we're storing has a __version field (for example > `{"__version": 1, "dag": { ... } }`) so we can detect older versions and
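The versioned envelope quoted above can be sketched roughly as follows. Only the `__version` and `dag` keys come from the message; SUPPORTED_VERSION, wrap_dag and load_dag are hypothetical names, and rejecting unknown versions is an assumption about how a reader might react (the quoted message is cut off before it says what happens to older versions).

```python
import json
from typing import Any, Dict

# Assumption: a single supported schema version, matching the quoted example.
SUPPORTED_VERSION = 1


def wrap_dag(dag: Dict[str, Any]) -> str:
    """Serialize a DAG dict inside a versioned top-level object."""
    return json.dumps({"__version": SUPPORTED_VERSION, "dag": dag})


def load_dag(payload: str) -> Dict[str, Any]:
    """Read the envelope and refuse versions this reader does not know."""
    document = json.loads(payload)
    version = document.get("__version")
    if version != SUPPORTED_VERSION:
        # Assumption: fail loudly; a real system might migrate instead.
        raise ValueError(f"unsupported serialized DAG version: {version!r}")
    return document["dag"]
```

Keeping the version at the top level means a reader can decide how to handle a row before touching the DAG payload itself.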

Re: [VOTE] AIP-24: Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread Dan Davydov
into a system that causes > > more pain in the future. (I can't say for sure that it does, but I can't > > say that it doesn't either). I don't think the proposal is necessarily > > wrong or bad, but I think we need some more detailed planning around > future > > milestones. >

Re: [VOTE] AIP-24: Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread Dan Davydov
-1 (binding), this may sound a bit FUD-y but I don't feel this has been thought through enough... Having both a SimpleDagBag representation and the JSON representation doesn't make sense to me at the moment: "Quoting from Airflow code, it is “a simplified representation of a DAG that contains

Re: Proposed roadmap for Airflow 2.0

2019-09-24 Thread Dan Davydov
I think along with "Improve Webserver Performance" we should solve the serialization and task execution isolation problems a little bit more completely since I imagine there could be backwards compatibility issues. e.g. mapping each task JSON to a Docker image or other serialized representation

Re: Setting to add choice of schedule at end or schedule at start of interval

2019-08-23 Thread Dan Davydov
to bite the bullet at some point for more intuitive behavior overall for new users. On Fri, Aug 23, 2019 at 10:29 AM Dan Davydov wrote: > I am for this change, since I feel like in general the start of the > interval is more intuitive (I have been working on Airflow for 3 years and > t

Re: Setting to add choice of schedule at end or schedule at start of interval

2019-08-23 Thread Dan Davydov
I am for this change, since I feel like in general the start of the interval is more intuitive (I have been working on Airflow for 3 years and this still trips me up). That being said I'm not sure how I feel about allowing customization at DAG level instead of cluster level as it makes it harder

Re: [VOTE] Release Apache Airflow 1.10.4 from RC4

2019-08-01 Thread Dan Davydov
Haven't taken a look at the bad cherry pick, but if it's my fault LMK will take a look and submit a patch (I'll be out after tomorrow though). On Thu, Aug 1, 2019 at 2:53 PM Ash Berlin-Taylor wrote: > We've just noticed two problems: > > 1. If a dag parser process takes too long the

Re: Airflow DAG Serialisation

2019-07-31 Thread Dan Davydov
An idea for serialization of dynamic DAGs is moving the serialization to the actual clients. This would require having a python Airflow API that the clients could call like dag.publish(). This enables a couple of things: 1) Clients can serialize as often as they like, and can even serialize in an

Re: Airflow DAG Serialisation

2019-07-26 Thread Dan Davydov
This is awesome, and will bring tons of value to all Airflow users, thank you for driving this! On Fri, Jul 26, 2019 at 5:37 PM Kaxil Naik wrote: > Hi all, > > We, at Astronomer, are going to spend time working on DAG Serialisation. > There are 2 AIPs that are somewhat related to what we plan

Re: Does anybody deploy DAGs in zip files?

2019-06-10 Thread Dan Davydov
I know the code around this is pretty hacky (if use_zip_file then... instead of an abstraction). I know when it was added it was a bit controversial, I would be +1 on removing it. That being said I feel the entire DAG parsing process needs to be moved to the client-side (users who write DAGs),

Re: Announcement: I'm joining Astronomer!

2019-05-31 Thread Dan Davydov
Nice! On Fri, May 31, 2019 at 3:39 AM morefreeze wrote: > Congrats! > > On Fri, May 31, 2019 at 3:18 PM Bolke de Bruin wrote: > > > Awesome! > > > > Sent from my iPhone > > > > > On 31 May 2019, at 08:15, Sumit Maheshwari > > wrote: > > > > > > Congrats Daniel. Really good news for Airflow. >

Re: [DISCUSS] period_start/period_end instead of execution_date/next_execution_date

2019-04-15 Thread Dan Davydov
be > > direct - the time the code started to run. This is why so many people > > misunderstand the execution_date in the terms of Airflow. Airflow took a > > word that is well defined in our conscious and replaced its meaning. > > > > > > ‐‐‐ Original Message ‐‐

Re: [DISCUSS] period_start/period_end instead of execution_date/next_execution_date

2019-04-15 Thread Dan Davydov
I think if the mission of Airflow is to be a generic Workflow engine, the current semantics of execution date aren't a good default. This might be an unpopular opinion given past threads on this topic :). The execution_date = end_date semantics make sense for the ETL use case but not for other

Re: [2.0 spring cleaning] Deprecate subdags

2019-04-15 Thread Dan Davydov
I don't think fixing subdags to run in the scheduler is enough, although it's a huge improvement over the current implementation (especially the part that lets Subdags specify custom executors). From my experience with Subdags, I think what makes more sense is adding various operators to allow

Re: [DISCUSS] AIP-12 Persist DAG into DB

2019-02-27 Thread Dan Davydov
SDK > > > > as the way to go. > > > > > > > > Since it's pretty clear we need SimpleDAG serialization, and we can > see > > > > through the requirements, people can pretty much get started on this. > > > > > > >

Re: [DISCUSS] AIP-12 Persist DAG into DB

2019-02-27 Thread Dan Davydov
> > * on the topic of serialization, let's be clear whether we're talking about > unidirectional serialization and *not* deserialization back to the object. > This works for making the web server stateless, but isn't a solution around > how DAG definition get shipped around on the cluster (which

Re: Short Airflow user survey

2019-02-25 Thread Dan Davydov
This is very interesting and useful, big thanks for conducting the survey! On Mon, Feb 25, 2019 at 12:24 PM Ash Berlin-Taylor wrote: > Thanks for all those who answered, there's some useful answers in there. > > I've done a short write up >

Re: AIP-12 Persist DAG into DB

2019-02-01 Thread Dan Davydov
fic > > workload has nothing to do with scheduling), would need to serialize the > > DAGs periodically, likely to the database, so that the web server can get > > freshly serialized metadata from the database during the scope of web > > requests. > > > > Max >

Re: AIP-12 Persist DAG into DB

2019-01-31 Thread Dan Davydov
rst, but falls down with possible code > changes in operators between one task and the next. > > (I would like this, but there are definite complexities) > > -ash > > > On 31 January 2019 16:56:54 GMT, Dan Davydov > wrote: > >I feel the right higher-level solution

Re: AIP-12 Persist DAG into DB

2019-01-31 Thread Dan Davydov
I feel the right higher-level solution to this problem (which is "Adding Consistency to Airflow") is DAG serialization, that is all DAGs should be represented as e.g. JSON (similar to the current SimpleDAGBag object used by the Scheduler). This solves the webserver issue, and also adds consistency

Re: [DISCUSS] Deprecate KnownEvent/KnownEventType

2019-01-02 Thread Dan Davydov
+1 to removing On Wed, Jan 2, 2019 at 10:48 PM Driesprong, Fokko wrote: > Hi all, > > Recently I've opened up a PR > to remove the > KnownEvent and KnownEventType for Apache Airflow 2.0. My feeling was that > not a lot of people are using