I don’t think guarding inlets and outlets makes sense, cause if they are not defined they code won’t be executed anyway. Ie. The logic is already there, no need for a config flag.
End-date is actually used by some beyond Airbnb as far as I know. What is the reason for deprecating it? Execution_date in finance is actually very well known. A transaction has an execution date it is not the time of the sending of the data. I honestly think that a change here would make it a thing that techies understand but does not make business sense. What is “today’s data?” it might not be the data that was sent today. Verstuurd vanaf mijn iPad > Op 12 apr. 2019 om 08:35 heeft Maxime Beauchemin <maximebeauche...@gmail.com> > het volgende geschreven: > > It should be labelled as a "bridge" or "transition" release somehow, and > tell the community to absolutely to go there first and address all > deprecation warnings prior to upgrade to 2.0. I wonder if semver has > something for that. > > It might be the time to deprecate more things. Spring cleaning! > > Should we start another thread labeled "[2.0 spring cleaning] what Airflow > features should we deprecate!?" > > Max > >> On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <a...@apache.org> wrote: >> >> I am happy to do another small(!) 1.10.x release. >> >> (There was a small bug I introduced where I broke the rendering of doc_md >> on dags) >> >> On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <fo...@driesprong.frl> >> wrote: >>> I agree with Max here, we should be careful. >>> >>> Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds, >>> tomorrow_ds_nodash. I'm not against having better readable shorthands, >>> but >>> more about the fact that having a tomorrow_ds doesn't make sense when >>> we >>> have an hourly (or weekly) job. >>> >>> I'm against guarding the {in,out}lets. Since the guarding will add >>> extra >>> logic, and this is not worth the added complexity (and another flag) in >>> my >>> opinion. >>> >>> - We start by putting deprecation warnings on tables, latest_date, >>> end_date >>> and END_DATE and remove them in Airflow 2.0. >>> >>> This is only possible if there is another version before 2.0. >>> Currently, >>> we're preparing to move to 2.0 >>> >>> - We add a lineage_enabled config which is false by default and thus >>> inlets/outlets aren’t provided, unless set to true. >>> >>> I'm against guarding the {in,out}lets. Since the guarding will add >>> extra >>> logic, and this is not worth the added complexity (and another flag) in >>> my >>> opinion. >>> >>> - We continue discussion about clarification of execution_date and >>> better >>> named variables such as something_start and something_end >>> >>> >> https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E >>> >>> Cheers, Fokko >>> >>> Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak < >>> basharens...@godatadriven.com>: >>> >>>> Great discussion, let’s stay on track. If I can summarise: >>>> >>>> >>>> * yesterday_ds, yesterday_ds_nodash, tomorrow_ds, >>> tomorrow_ds_nodash >>>> * Arthur: some users use these for convenience >>>> * Bas/Fokko: these are values that can be easily derived in a >>>> one-liner >>>> >>>> * tables >>>> * nobody? >>>> >>>> * latest_date >>>> * Arthur: used by internal backfill framework at Airbnb, but >>> no >>>> issue removing them. >>>> >>>> * inlets/outlets >>>> * Arthur: still in development, could be guarded behind >>> feature >>>> flag. >>>> >>>> * end_date/END_DATE >>>> * Arthur: used by internal backfill framework at Airbnb, but >>> no >>>> issue removing them. >>>> >>>> And to add to the discussion: >>>> >>>> >>>> * execution_date >>>> * The meaning of execution_date is ambivalent and super >>> confusing >>>> to everybody because it’s not the date of execution. >>>> * Suggestions to add period_start/period_end, or >>>> interval_start/interval_end, or something like that, to provide >>> datetime >>>> variables for start & end datetime of a DAG run. >>>> * Other: >>>> * Removal of variables should be done in major version and >>>> deprecation warnings should be added. >>>> >>>> So how about the following: >>>> - We start by putting deprecation warnings on tables, latest_date, >>>> end_date and END_DATE and remove them in Airflow 2.0. >>>> - We add a lineage_enabled config which is false by default and thus >>>> inlets/outlets aren’t provided, unless set to true. >>>> - We continue discussion about yesterday* and tomorrow* >>>> - We continue discussion about clarification of execution_date and >>> better >>>> named variables such as something_start and something_end >>>> >>>> Cheers, >>>> Bas >>>> >>>> >>>> On 11 Apr 2019, at 04:52, Maxime Beauchemin >>> <maximebeauche...@gmail.com >>>> <mailto:maximebeauche...@gmail.com>> wrote: >>>> >>>> Making backwards incompatible changes that require altering the >>> thousands >>>> (millions?!) of DAGs in the wild will alienate the community and >>> prevent >>>> many from orchestrating an upgrade. Upgrading hundreds of DAGs and >>> Airflow >>>> atomically would be hard and dangerous. >>>> >>>> To mitigate this, changes to the DAG / core API should be backward >>>> compatible, at least for a release range, and alert on upcoming >>> deprecation >>>> (if needed) >>>> >>>> To be clear if `execution_date` is renamed to `foo`, we should have a >>>> version range where both work just as well, but using >>> `execution_date` >>>> would result is warning messages about its upcoming deprecation, >>> maybe >>>> something like "It appears you're using the task parameter >>> `execution_date` >>>> which will be deprecated in favor or `foo` in upcoming release 2.0.0, >>> find >>>> more info at link.to/foo<http://link.to/foo>" >>>> >>>> Max >>>> >>>> On Tue, Apr 9, 2019 at 7:58 AM James Meickle >>>> >>> <jmeic...@quantopian.com.invalid<mailto:jmeic...@quantopian.com.invalid>> >>>> wrote: >>>> >>>> I agree with Ash here. The naming of "execution_date" is incredibly >>>> confusing to people who are new to Airflow, who think it has >>> something to >>>> do with... execution. >>>> >>>> However, I think that there's still room for improvement with >>>> "period_start" and "period_end". Think about manually triggered tasks >>> - >>>> they'd have a "period_start", but no useful "period_end" until the >>> next >>>> trigger occurs. And mid-day ETL tasks that are dated to "today" still >>> have >>>> to reference "period_end" to get the correct date, even though the >>> DAGrun >>>> won't be over yet! >>>> >>>> If we're going to go through and rename "execution_date", I'd rather >>> see a >>>> wider-ranging rework that understands a mapping from a date to a >>> window >>>> (like "every day, process the data from one week ago", or "every >>> Saturday, >>>> process today's data"). Maybe something closer to how Beam does it >>>> (but retaining simplicity for the 99% of cases that are just "run >>> daily, >>>> for the past day"). >>>> >>>> >>>> >>> >> https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions >>>> >>>> I could see something where there are two default DAG parameters are: >>>> schedule="@daily" >>>> window=-1 (or some schedule_delta object that provides a windowing >>> impl, or >>>> None if there is no window) >>>> >>>> In this world, all DAGRuns would have a "schedule_date" (the date >>> this >>>> would have normally started at), and an "execution_date" (when it was >>>> actually executed). If a given DAGrun has a window (one-off DAGs may >>> or may >>>> not!), there would be variables for "window_start" (the start of the >>>> window) and "window_end" (the end of the window; this would default >>> to the >>>> same as schedule_date, and to the next DAGrun's window_start). >>>> >>>> Disconnecting schedule date, execution date, and window/period would >>> also >>>> open a pathway for more advanced users to implement irregular >>> schedules and >>>> windows, without having to rely on hacks. For example: a DAG that >>> runs at >>>> "the end of the week" to produce a end-of-week would normally run on >>>> Friday, but on Friday holidays, instead complete a day early (and >>> scan one >>>> fewer day in its window). >>>> >>>> I know there's some historical baggage here, but I think the above is >>> much >>>> more natural to new users than what we're offering today. And I think >>> from >>>> a current user perspective, there would be breaking variable name >>> changes, >>>> but no logical differences for the majority of DAGs. >>>> >>>> On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <a...@apache.org> >>> wrote: >>>> >>>> To (slightly) hijack this thread: >>>> >>>> On the subject of execuction_date: as I'm sure we're all aware the >>>> concept >>>> of execution_date is confusing to new-commers to Airflow (there are >>> many >>>> questions about "why hasn't my DAG run yet"? "Why is my dag a day >>>> behind?" >>>> etc.) and although we mention this in the docs it's a confusing >>> concept. >>>> >>>> What to people think about adding two new parameters: `period_start` >>> and >>>> `period_end` and making these the preferred terms in place of >>>> execution_date and next_execution_date? >>>> >>>> This hopefully avoids any ambitious terms like "execution" or "run" >>> which >>>> is understandably easy to conflate with the time the task is being >>> run >>>> (i.e. `now()`) >>>> >>>> If people think this naming is better and less confusing I would >>> suggest >>>> we update all the docs and examples to use these terms (but still >>> mention >>>> the old names somewhere, probably in the macros docs) >>>> >>>> Thoughts? >>>> >>>> -ash >>>> >>>> >>>> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com> >>>> wrote: >>>> >>>> Hi Bas, >>>> >>>> 1) I am aware of a few places where those parameters are used in >>>> production >>>> in a few hundred jobs. I highly recommend we don't deprecate them >>>> unless >>>> we >>>> do it in a major version. >>>> >>>> 2) As James mentioned, inlets and outlets are a lineage annotation >>>> feature >>>> which is still under development. Let's leave them in, but we can >>> guard >>>> them behind a feature flag if you prefer. >>>> >>>> 3) the yesterday*/tomorrow* params are convenience ones if you use a >>>> daily >>>> ETL. I agree with you that they are simple to compute, but not >>> everyone >>>> using Apache Airflow is amazing with Python. Some users are only >>> trying >>>> to >>>> get a query scheduled and rely on a couple of niceties like these to >>>> get >>>> by. >>>> >>>> 4) latest_date, end_date (I feel like there used to be start_date, >>> but >>>> maybe it got lost) were a blend of things which were used by a >>> backfill >>>> framework used internally at Airbnb. Latest date was used if you >>> needed >>>> to >>>> join to a dimension for which you only wanted the latest version of >>> the >>>> attributes in you backfill. end_date was used for time ranges where >>>> several >>>> days were processed together in a range to save on compute. I don't >>> see >>>> an >>>> issue with removing them. >>>> >>>> Best regards, >>>> Arthur >>>> >>>> >>>> >>>> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak < >>>> basharens...@godatadriven.com> >>>> wrote: >>>> >>>> Hi all, >>>> >>>> Following Tao Feng’s question to discuss this PR< >>>> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192< >>>> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss >>>> here >>>> if you agree/disagree/would change. >>>> >>>> ----------- >>>> >>>> The summary of the PR: >>>> >>>> I was confused by the task context values and suggest to clean up and >>>> clarify these variables. Some are derivations from other variables, >>>> some >>>> are undocumented and unused, some are wrong (name doesn’t match the >>>> value). >>>> Please discuss what you think of the removal of these variables: >>>> >>>> >>>> * Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds, >>>> tomorrow_ds_nodash. IMO the next_* and previous_* variables are >>> useful >>>> since these require complex logic to compute the next execution date, >>>> however would leave computing the yesterday* and tomorrow* variables >>>> up >>>> to >>>> the user since they are simple one-liners and don't relate to the DAG >>>> interval. >>>> * Removed tables. This is a field in params, and is thus also >>>> accessible by the user ({{ params.tables }}). Also, it was >>>> undocumented. >>>> * Removed latest_date. It's the same as ds and was also >>>> undocumented. >>>> * Removed inlets and outlets. Also undocumented, and have the >>>> inlets/outlets ever worked/ever been used by anybody? >>>> * Removed end_date and END_DATE. Both have the same value, so it >>>> doesn't make sense to have both variables. Also, the value is ds >>> which >>>> contains the start date of the interval, so the naming didn't make >>>> sense to >>>> me. However, if anybody argues in favour of adding "start_date" and >>>> "end_date" to provide the start and end datetime of task instance >>>> intervals, I'd be happy to add them. >>>> >>>> Cheers, >>>> Bas >>>> >>>> >>>> >>>> >>>> >>>> >>