Same here; load balancing should be the responsibility of the application itself,
not Airflow.

+1 for making connections unique

Bas


On 11 Apr 2020, at 11:13, Robin Edwards <r...@bidnamic.com> wrote:

On Sat, 11 Apr 2020, 01:20 QP Hou, <q...@scribd.com> wrote:

It looks like we need to first make a decision on whether we want to
enforce uniqueness for conn_id and implement connection load balance
properly in another way.


Yeah, I feel there are plenty of other tools and services for implementing
load balancing, and this particular feature doesn't really need to be
provided by Airflow. I'd be in favour of making them unique.


Also +1 on making audit log read only from API point of view.

Thanks,
QP Hou

On Thu, Apr 9, 2020 at 9:47 AM Ash Berlin-Taylor <a...@apache.org> wrote:

They are, but we _can_ make a choice to remove that feature -- it is not
widely used and is confusing to many when they stumble on it.

It's not something we should do lightly, but it is a possibility.

I think I'm probably leaning towards the "ordinal" concept:

/connections/hdfs_default -> list of connections with that ID
/connections/hdfs_default/0 -> first connection with that ID

Something like that.
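To make the "ordinal" idea concrete, here is a minimal sketch (not Airflow code; the function and field names are hypothetical) of how a lookup keyed by conn_id plus an optional ordinal could behave:

```python
# Illustrative sketch: several stored connections may share one conn_id,
# and an optional ordinal selects a single one by position.

def resolve_connection(connections, conn_id, ordinal=None):
    """Return all connections for conn_id, or the one at `ordinal`.

    `connections` is assumed to be a list of dicts ordered by creation
    time; the field names here are hypothetical, not Airflow's models.
    """
    matches = [c for c in connections if c["conn_id"] == conn_id]
    if ordinal is None:
        return matches          # e.g. GET /connections/hdfs_default
    return matches[ordinal]     # e.g. GET /connections/hdfs_default/0

conns = [
    {"conn_id": "hdfs_default", "host": "hdfs-a"},
    {"conn_id": "hdfs_default", "host": "hdfs-b"},
]
print(resolve_connection(conns, "hdfs_default", 0)["host"])  # -> hdfs-a
```

The ordinal variant keeps the response a single object, which avoids clients having to special-case a list when they only ever configured one connection per ID.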

On Apr 9 2020, at 2:31 pm, Shaw, Damian P. <damian.sha...@credit-suisse.com> wrote:

FYI if you look back at the thread "Re: [2.0 spring cleaning] Require
unique conn_id" on 2019-04-14 you can see a message from Kevin Yang
stating that this random choice of connections is a "feature" used to
load balance connections in AirBnB. So users are relying on this
behavior.


-----Original Message-----
From: Daniel (Daniel Lamblin) [Data Infrastructure] <lamb...@coupang.com>
Sent: Wednesday, April 8, 2020 20:01
To: dev@airflow.apache.org
Subject: Re: API spec questions

Having been bitten by accidentally having two connections with the same
name or conn_id, I'd prefer it were made unique. In my experience
there's little utility in having multiple connections with the same
name. Tasks that use a connection don't seem to choose one at random;
rather, they seem pretty consistent in using the one created earliest,
which often has the lower integer id.

Circling back to how this is used by the API, from a user perspective,
the following in path integer fields were ones I'd have expected to be
strings instead:
ConnectionID
https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1845
DAGID
https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1853
PoolID
https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1901

Though it’s a url-encoding hassle, I also expected that DAGRunID would
be more like the Run ID, e.g.
"scheduled__2020-04-08T23:10:00+00:00",
"manual__2020-04-08T23:00:56+00:00",
"manual__2020-04-08T16:24:56.692868+00:00",
"backfill_2020-04-08T22:05:00+00:00" etc.
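The url-encoding hassle mentioned above is real but mechanical: run_ids contain ":" and "+", which must be percent-encoded when used as a path segment. A quick sketch with the standard library:

```python
from urllib.parse import quote

# run_ids contain ':' and '+', so they must be percent-encoded before
# being placed in a URL path segment; safe="" encodes everything that
# is not an unreserved character.
run_id = "scheduled__2020-04-08T23:10:00+00:00"
encoded = quote(run_id, safe="")
print(encoded)
# -> scheduled__2020-04-08T23%3A10%3A00%2B00%3A00
```

In particular the "+" must be encoded, since a literal "+" in a URL is often decoded as a space by server frameworks.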

Then TaskID is confusing to me; AFAIK the PK for task instances is
(task_id, dag_id, execution_date), and the API call appears to align
with that, having the pattern:
/dags/{dag_id}/taskInstances/{task_id}/{execution_date}:
But if task_id is a numbered id, then execution_date isn't even
needed, so I'm thinking it should have been a string.

An aside to this, I've always wondered what happens if an externally
triggered DAG Run has the same execution date as a pre-existing
scheduled DAG Run. They'd have different run_ids, EG
"scheduled__2020-04-08T23:10:00+00:00" vs
"manual__2020-04-08T23:10:00+00:00" but then task instances for those
runs might not be unique.
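The collision in the aside above can be shown with a tiny sketch (dag and task names are made up): two DAG runs with distinct run_ids but the same execution_date collapse to one task-instance key.

```python
# Two DAG runs share an execution_date but have different run_ids.
execution_date = "2020-04-08T23:10:00+00:00"
run_ids = ["scheduled__" + execution_date, "manual__" + execution_date]

# Task instances are keyed by (dag_id, task_id, execution_date), so both
# runs map to the same key; the set collapses to a single entry.
ti_keys = {("my_dag", "my_task", execution_date) for _ in run_ids}
print(len(run_ids), len(ti_keys))  # -> 2 1
```

Two runs, one task-instance key: the task instances for those runs would indeed not be unique.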

Lastly, the UI and CLI operation of clearing tasks seems analogous to
the delete task instance API endpoint. But it probably isn't, and
this could become confusing to users.
There should be a non-db-model API call for clearing tasks like you
can do from the UI and the CLI. If I read it right, clearing does the
following: it sets the state of the task instance to None, unless the
state was Running, in which case it sets it (and the related job ids) to
Shutdown. It deletes reschedules of the TI, and it sets the DAG runs
for those task instances back to Running.
This would be a lot to do for a user using PATCH calls to change Task
Instances and DAG Runs together (and there's no API for Jobs).
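The clearing behaviour described above can be sketched as follows; this is a hedged reading of the semantics, with illustrative dicts standing in for the actual Airflow models:

```python
# Hedged sketch of "clear" semantics as described in this thread; the
# dict fields and state strings are illustrative, not Airflow's models.

def clear_task_instances(task_instances, dag_runs, reschedules):
    for ti in task_instances:
        if ti["state"] == "running":
            # Running TIs (and their associated jobs) go to shutdown.
            ti["state"] = "shutdown"
        else:
            # Any other state is reset to no state.
            ti["state"] = None
    # Reschedule records for the cleared TIs are deleted.
    reschedules.clear()
    # DAG runs containing those TIs are set back to running.
    for run in dag_runs:
        run["state"] = "running"

tis = [{"state": "running"}, {"state": "failed"}]
runs = [{"state": "failed"}]
resched = [{"ti": 0}]
clear_task_instances(tis, runs, resched)
print(tis, runs, resched)
```

Packaging all of this behind one dedicated endpoint is exactly why a PATCH-per-resource approach would be so painful for clients.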

-Daniel L.
P.S. as far as renaming all parameters on operators and hooks with
*_conn_id, I do not want to see that breaking change happen. But IF it
HAS TO, I'm still of the opinion that the default_args use for
X_conn_id is not preferable to having all operators take simpler
source_conn_id and optional target_conn_id parameter names that are
consistent across all operators.

On 4/8/20, 9:47 AM, "Ash Berlin-Taylor" <a...@apache.org> wrote:



  To expand on the "so I think we need to do one of":


  - we need to update the name of "conn_id" somehow. I can't think of a
better option, and given all the operators have "x_conn_id" I don't
think that change should be made lightly.

  - make conn_id unique (this "poor" HA has been a source of confusion
in the past) and the ID we use in the API

  Or a third option:

  - Have the API take conn_id and either return all conns for that
  conn_id, or conn_id plus an ordinal (?idx=0 type thing) to return a
  single value.

  On Apr 8 2020, at 5:42 pm, Ash Berlin-Taylor <a...@apache.org>
wrote:

Hi everyone,

So as I mentioned in the AIP voting thread, I think we need to give
some more thought to how we are exposing connection ids in the API.

Right now as proposed (and merged without approval, not cool. The AIP
we voted on did not contain a PR against apache/airflow.) it has an
endpoint of `/connections/{connection_id}`

My issue here is as I said in the previous thread: that is going to be
mightily confusing to our users, because there is a "conn_id" concept
that is exposed, so people are going to try putting "aws_default" etc.
in there. I don't care what the API spec says, I care what our users
expect, and having both a connection_id/id and a conn_id field is
just confusing.

So I think we need to do one of:
- update the name of "conn_id" somehow
- make conn_id unique (this "poor" HA has been a source of confusion
in the past)

There are similar problems for the DAG run -- the spec has the type as
an integer, but everything else about the Airflow system uses the
(unique) run_id parameter, and I would expect the API to use that. The
auto-incrementing id on the run table is literally never used in the
code base, so exposing it via the API seems odd.

A few other smaller points:

EventLog: Those are "audit"/action logs, so we probably shouldn't let
people delete them via the API!

pool_id: still an integer. It should be the "name".

How should add/delete variables and connections work in the API with
the addition of the new "Secrets Backends"?

xcomValues: task_id is listed as an integer.


-ash






