Exactly that Ash! This awesome community never fails to impress. Thanks folks!
On Tue, Sep 25, 2018 at 3:29 AM Ash Berlin-Taylor <a...@apache.org> wrote: > > > On 24 Sep 2018, at 23:12, Alex Tronchin-James 949-412-7220 > <(949)%20412-7220> <alex.n.ja...@gmail.com> wrote: > > > > Re: [Brian Greene] "How does filename matter? Frankly I wish the > filename > > was REQUIRED to be the dag name so people would quit confusing themselves > > by mismatching them !" > > > > FWIW in the Facebook predecessor to airflow, the file path/name WAS the > dag > > name. E.g. if your dag resided in best_team/new_project/sweet_dag.py then > > the dag name would be best_team.new_project.sweet_dag > > All tasks were identified by their variable name after that prefix: E.g. > if > > best_team.new_project.sweet_dag defines an operator in a variable named > > task1, then the respective task_id is > best_team.new_project.sweet_dag.task1. > > > > Airflow provides additional flexibility to specify DAG and task names to > > avoid the sometimes annoyingly long task names this resulted in and allow > > DAG/task names without forcing a code directory structure and python's > > variable naming restrictions, and I think this is a Good Thing. > > > > It seems like airflowuser is trying to provide additional metadata beyond > > the DAG/task names (so far, a DAG 'title' distinct from the ID). I've > > provided this through a README.md included in the DAG source directory, > but > > maybe it would be a win to instead add a DAG parameter named 'readme' of > > string type which can include a docstring or even markdown to provide any > > desired additional metadata? This could then be displayed by the UI to > > simplify access to any such provided DAG documentation. > > You mean like https://airflow.apache.org/concepts.html#documentation-notes > <https://airflow.apache.org/concepts.html#documentation-notes> ? ✨ > > > > 🍿 > > > > > > > > On Thu, Sep 20, 2018 at 10:45 PM Brian Greene < > > br...@heisenbergwoodworking.com> wrote: > > > >> Prior to using airflow for much, on first inspection, I think I may have > >> agreed with you. > >> > >> After a bit of use I’d agree with Fokko and others - this isn’t really a > >> problem, and separating them seems to do more harm than good related to > >> deployment. > >> > >> I was gonna stop there, but why? > >> > >> You can add a task to a dag that’s deployed and has run and still view > >> history. The “new” task shows up white Squares in the old dags. nobody > >> said you’re required to also rename the dag when you do so this. If > your > >> process or desire or design determines you need to rename it, well then > by > >> definition... isn’t it a new thing without a history? Airflow is > >> implementing exactly that. > >> > >> One could argue that renaming to reflect exact purpose is good practice. > >> Yes, I’d agree, but again following that logic if it’s a small enough > >> change to “slip in” then the name likely shouldn’t change. If it’s big > >> enough I want to change the name then it’s a big enough change that I’m > >> functionally running something “new”, and I expect to need to account > for > >> that. Airflow is enforcing that logic by coupling the name to the > >> deployment of what you said was a new process. > >> > >> One might put forth that changing the name to be more descriptive In the > >> ui makes it easier for support staff. I think perhaps if that’s your > >> challenge it’s not airflow that’s a problem. Dags are of course > documented > >> elsewhere besides their name, right? Yeah it’s self documenting (and > the > >> graphs are cool), but I have to assume there’s something besides the > NAME > >> to tell people what it does. Additionally, far more than the name is > >> required for even an operator or monitor watcher to take action - you > don’t > >> expect them to know which tasks to rerun or how to troubleshoot failures > >> just based on your “now most descriptive name in the UI” do you? > >> > >> I spent time In an informatica shop where all the jobs were numbered. > >> Numbered. Let’s be more exact... their NAMES were NUMBERS like 56709. > >> Terrible, but 100% worked, because while a descriptive name would have > been > >> useful, the name is the thing that’s supposed to NOT CHANGE (see code of > >> Abibarshim), and all the other information can attach to that in places > >> where you write... other information. People would curse a number > “F’ing > >> 6291 failed again” - everyone knew what they were talking about.. I > digress. > >> > >> You might decide to document “dag ID 12” or just “12” on your wiki - I’m > >> going to document “daily_sales_import”. And when things start failing > at > >> 3am it’s not my dag “56” that’s failing, it’s the sales_export dag. > But if > >> you document “12”, that’s still it’s name, and it’d better be 12 in all > >> your environments and documents. This also means the actual db IDs from > >> your proposal are almost certainly NOT the same across your > environments, > >> making the 12 unchangeable name! > >> > >> There are lots of languages (most of them) where the name of a thing is > >> important and hard to change. It’s not a bad thing, and I’d assume that > >> deploying a thing by name has some significance in many systems. Go > rename > >> a class in... pick a language... tell me how that should be easier to do > >> willy-nilly so it’s easier In the UI. > >> > >> I suppose you could view it as a limitation, But i don’t think you’ve > >> illuminated a single use case where it’s an actual technical constraint > or > >> limitation. > >> > >> The BEST argument against the current implementation is db performance. > >> It’s a hogwash argument. Basic key indexes on low cardinality string > >> columns are plenty fast for the airflow workload, and if your task load > is > >> so high airflow can’t keep up or your seeing super-fast tasks and > airflow > >> db/tracking latency is too much... perhaps a messaging or queue > processing > >> solution is better suited to those workloads. We see scheduler > bottlenecks > >> long before the database for our “quick task” scenarios. Additionally, > >> reading through this list you’ll find people running airflow at > substantial > >> scale - I’ve not seen anyone complaining of production performance > issues > >> based on this design decision. At first I hated it. String keys are > >> dirty, we’re all taught that as good little programmers. Except when > >> performance won’t be a huge consideration since it’s not OLTP and easy > of > >> queryabilty is more important because it’s a growing system... good > >> decision - whoever made it. > >> > >> How does filename matter? Frankly I wish the filename was REQUIRED to > be > >> the dag name so people would quit confusing themselves by mismatching > them > >> ! We’ve renamed dag files with no issue as long as the content doesn’t > >> change, so again, not a real use case. And really - name your stuff > >> careful before you get to prod man. > >> > >> I gotta ask - airflowuser - are you gonna use airflow for anything, or > >> just poke it with a stick from a distance and ask semi-inane questions > of > >> these fine folks that wrote and spend time working on this cool piece of > >> kit? > >> > >> B > >> > >> Sent from a device with less than stellar autocorrect > >> > >>> On Sep 20, 2018, at 3:12 PM, Driesprong, Fokko <fo...@driesprong.frl> > >> wrote: > >>> > >>> I like the dag_id for both the name and as an unique identifier. If you > >>> change the dag in such a way, that it deserves a new name, you probably > >>> want to create a new dag anyway. If you want to give some additional > >>> context, you can use the description field: > >>> > >> > https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L3131-L3132 > >>> > >>> The name of the file of dag does not have any influence. > >>> > >>> My 2¢ > >>> > >>> Cheers, Fokko > >>> > >>> Op do 20 sep. 2018 om 19:40 schreef James Meickle > >>> <jmeic...@quantopian.com.invalid>: > >>> > >>>> I'm personally against having some kind of auto-increment numeric ID > for > >>>> DAGs. While this makes a lot of sense for systems where creation is a > >>>> database activity (like a POST request), in Airflow, DAG creation is > >>>> actually a code ship activity. There are all kinds of complex > scenarios > >>>> around that: > >>>> > >>>> - I revert a commit and a DAG disappears or is renamed > >>>> - I run the same file, twice, with multiple parameters to create two > >> DAGs > >>>> - I create the DAG in both staging and prod, but they wind up with > >>>> different IDs > >>>> > >>>> It's just too hard to automatically track these scenarios. > >>>> > >>>> If we really wanted to put something like this in place, it would > first > >>>> make more sense to decouple DAG creation from code shipping, and > instead > >>>> prefer creation of a DAG outside of code (but with a definition that > >>>> references which git repo/committish/file/arguments/etc. to use). Then > >> if > >>>> you do something like rename a file, the DAG breaks, but at least > still > >>>> exists in the db with that ID and history still makes sense once you > >> update > >>>> the DAG definition with the new code location. > >>>> > >>>> On Thu, Sep 20, 2018 at 4:52 AM airflowuser > >>>> <airflowu...@protonmail.com.invalid> wrote: > >>>> > >>>>> Hi, > >>>>> though this could have been explained on Jira I think this should be > >>>>> discussed first. > >>>>> > >>>>> The problem: > >>>>> Airflow mixes DAG name with id. It uses same filed for both purposes. > >>>>> > >>>>> I assume that most of you use the dag_id to describe what the DAG > >>>> actually > >>>>> does. > >>>>> For example: > >>>>> > >>>>> dag = DAG( > >>>>> dag_id='cost_report_daily', > >>>>> ... > >>>>> ) > >>>>> > >>>>> This dag_id is reflected to the dag id column in the UI. > >>>>> Now, lets say that you want to add another task to this specific dag > - > >>>> You > >>>>> are to be extremely careful when you change the dag_id to represent > the > >>>> new > >>>>> functionality for example : dag_id='cost_expenses_reports_daily' . > This > >>>>> will break the history of the DAG. > >>>>> > >>>>> Or even with simpler use case.. the user just want to change the name > >> he > >>>>> sees on the UI. > >>>>> > >>>>> I suggest to have a discussion if the dag_id should be split into id > >> (an > >>>>> actual id) and name to reflect what it does. When the "connection" is > >>>> done > >>>>> by id's - names can change as much as you want without breaking > >>>> anything. > >>>>> essentially it becomes a field uses for display purpose only. > >>>>> > >>>>> * I didn't mention also the issue of DAG file name which can also > cause > >>>>> trouble if someone wants to change it. > >>>>> > >>>>> Sent with [ProtonMail](https://protonmail.com) Secure Email. > >>>> > >> > >