Re: API spec questions

Daniel (Daniel Lamblin ) [Data Infrastructure] Wed, 08 Apr 2020 17:01:23 -0700

Having been bit by accidentally having two connections by the same name or 
conn_id, I'd prefer if were made unique. In my experience there's little 
utility in having multiple connections by the same name. Tasks that use a 
connection do to fairly randomly choose one, rather they seem pretty consistent 
in using the one created earliest, which often has the lower id integer.

Circling back to how this is used by the API, from a user perspective, the 
following in path integer fields were ones I'd have expected to be strings 
instead:
ConnectionID 
https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1845
DAGID 
https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1853
PoolID 
https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1901

Though it’s a url-encoding hassle, I also expected that DAGRunID would be more 
like the Run ID E.G.
"scheduled__2020-04-08T23:10:00+00:00", "manual__2020-04-08T23:00:56+00:00",  
"manual__2020-04-08T16:24:56.692868+00:00" , 
"backfill_2020-04-08T22:05:00+00:00" etc

Then TaskID is confusing to me; AFAIK the PK to task instances are task_id, 
dag_id, and execution_date and the api call appears to align with that having 
the pattern:
/dags/{dag_id}/taskInstances/{task_id}/{execution_date}:
But if task_id is a numbered id, then… execution_date isn't even needed… I'm 
thinking it should have been a string.

An aside to this, I've always wondered what happens if an externally triggered 
DAG Run has the same execution date as a pre-existing scheduled DAG Run. They'd 
have different run_ids, EG "scheduled__2020-04-08T23:10:00+00:00" vs 
"manual__2020-04-08T23:10:00+00:00" but then task instances for those runs 
might not be unique.

Lastly, the UI and CLI operation of clearing tasks seems analogous to the 
delete task instance API end point. But probably it's not, and this could 
become confusing to users.
There should be a non-db-model api call for clearing tasks like you can from 
the UI and the CLI. If I read it right, clearing does the following: it sets 
the state of the task instance to None unless the state was Running then 
instead it sets it and related job ids to Shutdown. It deletes reschedules of 
the TI and it sets the dag runs for those task instances back to Running.
This would be a lot to do for a user using PATCH calls to change Task Instances 
and Dag Runs together (and there's no API for Jobs).

-Daniel L.
P.S. as far as renaming all parameters on operators and hooks with *_conn_id, I 
do not want to see that breaking change happen. But IF it HAS TO, I'm still of 
the opinion that the default_args use for X_conn_id is not preferable to having 
all operators take simpler source_conn_id and optional target_conn_id parameter 
names that are consistent across all operators.

On 4/8/20, 9:47 AM, "Ash Berlin-Taylor" <[email protected]> wrote:

    [Warning]: This email originated from an external source. Do not open links 
or attachments unless you know the content is safe.
    [경고]: 본 이메일은 회사 외부에서 유입되었습니다. 내용이 안전한지 확인하기 전까지는 링크나 첨부파일을 열지 마십시오.

    To expand on the "so I think we need to do one of":

    - we need to update the name of "conn_id" somehow. I can't think of a
    better option, and given all the operators have "x_conn_id" I don't
    think that change should be made lightly.

    - make conn_id unique (this "poor" HA has been a source of confusion in
    the past) and the ID we use in the API

    Or a third option:

    - Have the API take conn_id and either return all conns for that
    conn_id, or conn_id plus an ordinal (?idx=0 type thing) to return a
    single value.

    On Apr 8 2020, at 5:42 pm, Ash Berlin-Taylor <[email protected]> wrote:

    > Hi everyone,
    >
    > So as I mentioned in the AIP voting thread, I think we need to give
    > some more thought to how we are exposing connection ids in the API.
    >
    > Right now as proposed (and merged without approval, not cool. The AIP
    > we voted on did not contain a PR against apache/airflow.) it has an
    > end point of `/connections/{connection_id} ` 
    >
    > My issue here is as I said in the previous thread: that is going to be
    > mightly confusing to our users because there is a "conn_id" concept
    > that is exposed, so people are going to try putting "aws_default" etc
    > in there. I don't care what the API spec says, I care what our users
    > expect, and having a connection_id/id and a conn_id fields is just 
confusing
    >
    > So I think we need to do one of:
    > - we need to update the name of "conn_id" somehow, make conn_id unique
    > (this "poor" HA has been a source of confusion in the past)
    >
    > There are similar problems for the DAG run -- the spec has the type as
    > an integer, but everything else about the Airflow system uses the
    > (unique) run_id parameter, and I would expect the API to use that. The
    > autoinc. id on the run column is literally never used in the code
    > base, so exposing that via the API seems odd.
    >
    > A few other smaller points:
    >
    > EventLog: Those are "audit"/action logs, so we probably shouldn't let
    > people delete them via the API!
    >
    > pool_id: still an integer. It should be the "name".
    >
    > How should add/delete variables and connections work in the API  with
    > the addition of the new "Secrets Backends"?
    >
    > xcomValues: task_id is listed as an integer.
    >
    >
    > -ash

Re: API spec questions

Reply via email to