Did we come to any conclusion on this topic?
> On 12 Apr 2019, at 10:26, Ash Berlin-Taylor <a...@apache.org> wrote:
>
> Does anyone actually use end_date (of either spelling) given its value is
> currently the same as `ds`:
> https://github.com/apache/airflow/blob/3020d9733cb189f091489a62f58f0f586dc8d4a9/airflow/models/taskinstance.py#L1183-L1184
>
> 'END_DATE': ds,
> 'end_date': ds,
>
> That seems like it is just the wrong value - I could see it being useful as
> either the same value as `next_execution_date` or `dag.end_date` but the
> current value (from a quick thought) doesn't seem like it is useful?
>
> Given what the value is currently set to I vote for deprecating/removing
> these two.
>
>> On 12 Apr 2019, at 09:39, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>
>> I don’t think guarding inlets and outlets makes sense, because if they are
>> not defined the code won’t be executed anyway. I.e. the logic is already
>> there, no need for a config flag.
>>
>> End-date is actually used by some beyond Airbnb as far as I know. What is
>> the reason for deprecating it?
>>
>> Execution_date in finance is actually very well known. A transaction has an
>> execution date; it is not the time of the sending of the data. I honestly
>> think that a change here would make it a thing that techies understand but
>> does not make business sense. What is “today’s data?” It might not be the
>> data that was sent today.
>>
>> Sent from my iPad
>>
>>> On 12 Apr 2019, at 08:35, Maxime Beauchemin
>>> <maximebeauche...@gmail.com> wrote:
>>>
>>> It should be labelled as a "bridge" or "transition" release somehow, and
>>> tell the community to absolutely go there first and address all
>>> deprecation warnings prior to upgrading to 2.0. I wonder if semver has
>>> something for that.
>>>
>>> It might be the time to deprecate more things. Spring cleaning!
>>>
>>> Should we start another thread labeled "[2.0 spring cleaning] what Airflow
>>> features should we deprecate!?"
>>>
>>> Max
>>>
>>>> On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>
>>>> I am happy to do another small(!) 1.10.x release.
>>>>
>>>> (There was a small bug I introduced where I broke the rendering of doc_md
>>>> on dags)
>>>>
>>>> On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <fo...@driesprong.frl>
>>>> wrote:
>>>>> I agree with Max here, we should be careful.
>>>>>
>>>>> Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>>>> tomorrow_ds_nodash: I'm not against having more readable shorthands; my
>>>>> concern is more that having a tomorrow_ds doesn't make sense when we
>>>>> have an hourly (or weekly) job.
>>>>>
>>>>> I'm against guarding the {in,out}lets, since the guarding will add extra
>>>>> logic, and this is not worth the added complexity (and another flag) in
>>>>> my opinion.
>>>>>
>>>>> - We start by putting deprecation warnings on tables, latest_date,
>>>>> end_date and END_DATE and remove them in Airflow 2.0.
>>>>>
>>>>> This is only possible if there is another version before 2.0. Currently,
>>>>> we're preparing to move to 2.0.
>>>>>
>>>>> - We add a lineage_enabled config which is false by default and thus
>>>>> inlets/outlets aren’t provided, unless set to true.
>>>>>
>>>>> I'm against guarding the {in,out}lets, since the guarding will add extra
>>>>> logic, and this is not worth the added complexity (and another flag) in
>>>>> my opinion.
>>>>>
>>>>> - We continue discussion about clarification of execution_date and
>>>>> better named variables such as something_start and something_end
>>>>>
>>>>> https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E
>>>>>
>>>>> Cheers, Fokko
>>>>>
>>>>> On Thu, 11 Apr 2019 at 08:44, Bas Harenslak
>>>>> <basharens...@godatadriven.com> wrote:
>>>>>
>>>>>> Great discussion, let’s stay on track.
>>>>>> If I can summarise:
>>>>>>
>>>>>>   * yesterday_ds, yesterday_ds_nodash, tomorrow_ds, tomorrow_ds_nodash
>>>>>>     * Arthur: some users use these for convenience
>>>>>>     * Bas/Fokko: these are values that can be easily derived in a
>>>>>>       one-liner
>>>>>>
>>>>>>   * tables
>>>>>>     * nobody?
>>>>>>
>>>>>>   * latest_date
>>>>>>     * Arthur: used by internal backfill framework at Airbnb, but no
>>>>>>       issue removing them.
>>>>>>
>>>>>>   * inlets/outlets
>>>>>>     * Arthur: still in development, could be guarded behind feature
>>>>>>       flag.
>>>>>>
>>>>>>   * end_date/END_DATE
>>>>>>     * Arthur: used by internal backfill framework at Airbnb, but no
>>>>>>       issue removing them.
>>>>>>
>>>>>> And to add to the discussion:
>>>>>>
>>>>>>   * execution_date
>>>>>>     * The meaning of execution_date is ambiguous and super confusing
>>>>>>       to everybody because it’s not the date of execution.
>>>>>>     * Suggestions to add period_start/period_end, or
>>>>>>       interval_start/interval_end, or something like that, to provide
>>>>>>       datetime variables for start & end datetime of a DAG run.
>>>>>>   * Other:
>>>>>>     * Removal of variables should be done in major version and
>>>>>>       deprecation warnings should be added.
>>>>>>
>>>>>> So how about the following:
>>>>>> - We start by putting deprecation warnings on tables, latest_date,
>>>>>> end_date and END_DATE and remove them in Airflow 2.0.
>>>>>> - We add a lineage_enabled config which is false by default and thus
>>>>>> inlets/outlets aren’t provided, unless set to true.
>>>>>> - We continue discussion about yesterday* and tomorrow*
>>>>>> - We continue discussion about clarification of execution_date and
>>>>>> better named variables such as something_start and something_end
>>>>>>
>>>>>> Cheers,
>>>>>> Bas
>>>>>>
>>>>>>
>>>>>> On 11 Apr 2019, at 04:52, Maxime Beauchemin
>>>>>> <maximebeauche...@gmail.com> wrote:
>>>>>>
>>>>>> Making backwards incompatible changes that require altering the
>>>>>> thousands (millions?!) of DAGs in the wild will alienate the community
>>>>>> and prevent many from orchestrating an upgrade. Upgrading hundreds of
>>>>>> DAGs and Airflow atomically would be hard and dangerous.
>>>>>>
>>>>>> To mitigate this, changes to the DAG / core API should be backward
>>>>>> compatible, at least for a release range, and alert on upcoming
>>>>>> deprecation (if needed)
>>>>>>
>>>>>> To be clear, if `execution_date` is renamed to `foo`, we should have a
>>>>>> version range where both work just as well, but using `execution_date`
>>>>>> would result in warning messages about its upcoming deprecation, maybe
>>>>>> something like "It appears you're using the task parameter
>>>>>> `execution_date` which will be deprecated in favor of `foo` in upcoming
>>>>>> release 2.0.0, find more info at link.to/foo"
>>>>>>
>>>>>> Max
>>>>>>
>>>>>> On Tue, Apr 9, 2019 at 7:58 AM James Meickle
>>>>>> <jmeic...@quantopian.com.invalid> wrote:
>>>>>>
>>>>>> I agree with Ash here. The naming of "execution_date" is incredibly
>>>>>> confusing to people who are new to Airflow, who think it has something
>>>>>> to do with... execution.
>>>>>>
>>>>>> However, I think that there's still room for improvement with
>>>>>> "period_start" and "period_end".
>>>>>> Think about manually triggered tasks - they'd have a "period_start",
>>>>>> but no useful "period_end" until the next trigger occurs. And mid-day
>>>>>> ETL tasks that are dated to "today" still have to reference
>>>>>> "period_end" to get the correct date, even though the DAGrun won't be
>>>>>> over yet!
>>>>>>
>>>>>> If we're going to go through and rename "execution_date", I'd rather
>>>>>> see a wider-ranging rework that understands a mapping from a date to a
>>>>>> window (like "every day, process the data from one week ago", or "every
>>>>>> Saturday, process today's data"). Maybe something closer to how Beam
>>>>>> does it (but retaining simplicity for the 99% of cases that are just
>>>>>> "run daily, for the past day").
>>>>>>
>>>>>> https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
>>>>>>
>>>>>> I could see something where there are two default DAG parameters:
>>>>>> schedule="@daily"
>>>>>> window=-1 (or some schedule_delta object that provides a windowing
>>>>>> impl, or None if there is no window)
>>>>>>
>>>>>> In this world, all DAGRuns would have a "schedule_date" (the date this
>>>>>> would have normally started at), and an "execution_date" (when it was
>>>>>> actually executed). If a given DAGrun has a window (one-off DAGs may or
>>>>>> may not!), there would be variables for "window_start" (the start of
>>>>>> the window) and "window_end" (the end of the window; this would default
>>>>>> to the same as schedule_date, and to the next DAGrun's window_start).
>>>>>>
>>>>>> Disconnecting schedule date, execution date, and window/period would
>>>>>> also open a pathway for more advanced users to implement irregular
>>>>>> schedules and windows, without having to rely on hacks.
>>>>>> For example: a DAG that runs at "the end of the week" to produce an
>>>>>> end-of-week output would normally run on Friday, but on Friday
>>>>>> holidays, instead complete a day early (and scan one fewer day in its
>>>>>> window).
>>>>>>
>>>>>> I know there's some historical baggage here, but I think the above is
>>>>>> much more natural to new users than what we're offering today. And I
>>>>>> think from a current user perspective, there would be breaking variable
>>>>>> name changes, but no logical differences for the majority of DAGs.
>>>>>>
>>>>>> On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <a...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> To (slightly) hijack this thread:
>>>>>>
>>>>>> On the subject of execution_date: as I'm sure we're all aware the
>>>>>> concept of execution_date is confusing to newcomers to Airflow (there
>>>>>> are many questions about "why hasn't my DAG run yet?", "Why is my dag a
>>>>>> day behind?" etc.) and although we mention this in the docs it's a
>>>>>> confusing concept.
>>>>>>
>>>>>> What do people think about adding two new parameters: `period_start`
>>>>>> and `period_end` and making these the preferred terms in place of
>>>>>> execution_date and next_execution_date?
>>>>>>
>>>>>> This hopefully avoids any ambiguous terms like "execution" or "run"
>>>>>> which are understandably easy to conflate with the time the task is
>>>>>> being run (i.e. `now()`)
>>>>>>
>>>>>> If people think this naming is better and less confusing I would
>>>>>> suggest we update all the docs and examples to use these terms (but
>>>>>> still mention the old names somewhere, probably in the macros docs)
>>>>>>
>>>>>> Thoughts?
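
To make the period_start/period_end proposal above concrete, here is a rough sketch for a fixed-interval schedule, in plain Python with no Airflow imports. `Period` and `period_for` are hypothetical names made up for illustration; the only assumption taken from the proposal is that period_start would hold what execution_date holds today and period_end what next_execution_date holds.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Period(NamedTuple):
    """Hypothetical container for the two proposed template variables."""
    period_start: datetime  # what execution_date holds today
    period_end: datetime    # what next_execution_date holds today


def period_for(execution_date: datetime, interval: timedelta) -> Period:
    """Derive the data period covered by a run on a fixed-interval schedule."""
    return Period(execution_date, execution_date + interval)


# A daily DAG run "for" 8 April covers 8 April 00:00 up to 9 April 00:00.
daily = period_for(datetime(2019, 4, 8), timedelta(days=1))
print(daily.period_start.isoformat(), daily.period_end.isoformat())
```

The run for a period is only scheduled once that period has ended, which is the "why is my DAG a day behind?" effect mentioned above. Cron-style or irregular schedules would need more than a fixed timedelta, so this only covers the simple case.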
>>>>>>
>>>>>> -ash
>>>>>>
>>>>>>
>>>>>> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Bas,
>>>>>>
>>>>>> 1) I am aware of a few places where those parameters are used in
>>>>>> production in a few hundred jobs. I highly recommend we don't deprecate
>>>>>> them unless we do it in a major version.
>>>>>>
>>>>>> 2) As James mentioned, inlets and outlets are a lineage annotation
>>>>>> feature which is still under development. Let's leave them in, but we
>>>>>> can guard them behind a feature flag if you prefer.
>>>>>>
>>>>>> 3) The yesterday*/tomorrow* params are convenience ones if you use a
>>>>>> daily ETL. I agree with you that they are simple to compute, but not
>>>>>> everyone using Apache Airflow is amazing with Python. Some users are
>>>>>> only trying to get a query scheduled and rely on a couple of niceties
>>>>>> like these to get by.
>>>>>>
>>>>>> 4) latest_date, end_date (I feel like there used to be start_date, but
>>>>>> maybe it got lost) were a blend of things which were used by a backfill
>>>>>> framework used internally at Airbnb. Latest date was used if you needed
>>>>>> to join to a dimension for which you only wanted the latest version of
>>>>>> the attributes in your backfill. end_date was used for time ranges
>>>>>> where several days were processed together in a range to save on
>>>>>> compute. I don't see an issue with removing them.
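
For reference on point 3, the "one-liner" in question would look roughly like the sketch below. It is plain Python rather than Airflow code, and `ds_shift` is a hypothetical helper name used only for illustration; the only assumption is that execution_date is a datetime and that `ds` is its YYYY-MM-DD rendering.

```python
from datetime import datetime, timedelta


def ds_shift(execution_date: datetime, days: int) -> str:
    """Shift execution_date by `days` and format it the way `ds` is formatted."""
    return (execution_date + timedelta(days=days)).strftime("%Y-%m-%d")


execution_date = datetime(2019, 4, 8)

# Hand-rolled equivalents of the variables the PR proposes to remove.
yesterday_ds = ds_shift(execution_date, -1)
tomorrow_ds = ds_shift(execution_date, 1)
yesterday_ds_nodash = yesterday_ds.replace("-", "")
tomorrow_ds_nodash = tomorrow_ds.replace("-", "")

print(yesterday_ds, tomorrow_ds, yesterday_ds_nodash, tomorrow_ds_nodash)
```

Inside a template the same thing can be written with the existing date macro, e.g. `{{ macros.ds_add(ds, -1) }}`, so users who only write queries would not need any Python of their own.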
>>>>>>
>>>>>> Best regards,
>>>>>> Arthur
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak
>>>>>> <basharens...@godatadriven.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Following Tao Feng’s question to discuss this PR
>>>>>> <https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192
>>>>>> <https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
>>>>>> here if you agree/disagree/would change.
>>>>>>
>>>>>> -----------
>>>>>>
>>>>>> The summary of the PR:
>>>>>>
>>>>>> I was confused by the task context values and suggest cleaning up and
>>>>>> clarifying these variables. Some are derivations from other variables,
>>>>>> some are undocumented and unused, some are wrong (name doesn’t match
>>>>>> the value). Please discuss what you think of the removal of these
>>>>>> variables:
>>>>>>
>>>>>>   * Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
>>>>>>     tomorrow_ds_nodash. IMO the next_* and previous_* variables are
>>>>>>     useful since these require complex logic to compute the next
>>>>>>     execution date, however I would leave computing the yesterday* and
>>>>>>     tomorrow* variables up to the user since they are simple one-liners
>>>>>>     and don't relate to the DAG interval.
>>>>>>   * Removed tables. This is a field in params, and is thus also
>>>>>>     accessible by the user ({{ params.tables }}). Also, it was
>>>>>>     undocumented.
>>>>>>   * Removed latest_date. It's the same as ds and was also
>>>>>>     undocumented.
>>>>>>   * Removed inlets and outlets. Also undocumented, and have the
>>>>>>     inlets/outlets ever worked/ever been used by anybody?
>>>>>>   * Removed end_date and END_DATE. Both have the same value, so it
>>>>>>     doesn't make sense to have both variables. Also, the value is ds
>>>>>>     which contains the start date of the interval, so the naming didn't
>>>>>>     make sense to me.
>>>>>>     However, if anybody argues in favour of adding "start_date" and
>>>>>>     "end_date" to provide the start and end datetime of task instance
>>>>>>     intervals, I'd be happy to add them.
>>>>>>
>>>>>> Cheers,
>>>>>> Bas
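
If the outcome ends up being "deprecate first, remove in 2.0" rather than the outright removal in the PR, the warning side could be sketched as below. This is not how Airflow builds its task context; `DeprecatingContext` and `DEPRECATED_KEYS` are hypothetical names, and the key list is simply the one proposed in the summary in this thread.

```python
import warnings

# Keys proposed for deprecation in this thread (an assumption, not a decision).
DEPRECATED_KEYS = {"tables", "latest_date", "end_date", "END_DATE"}


class DeprecatingContext(dict):
    """A dict that still serves deprecated keys but warns when they are read."""

    def __getitem__(self, key):
        if key in DEPRECATED_KEYS:
            warnings.warn(
                f"Context variable '{key}' is deprecated and will be removed "
                "in Airflow 2.0.",
                DeprecationWarning,
                stacklevel=2,
            )
        return super().__getitem__(key)


context = DeprecatingContext(ds="2019-04-08", end_date="2019-04-08")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = context["end_date"]  # still works, but records a warning
    _ = context["ds"]            # not deprecated, no warning

print(value, [str(w.message) for w in caught])
```

Note that only `context[key]` lookups are covered here; `dict.get()` and iteration bypass `__getitem__`, so a real implementation would need to handle however the template engine actually reads the context.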