Re: API spec questions

QP Hou Mon, 11 May 2020 10:37:41 -0700

I would recommend option III.

IMHO, hierarchy based endpoints are for CURD operations on a single
entity. The flat endpoint, like /taskinstance, is for read-only
multi-facet search queries.


For example:

* to create a resource Foo: "POST /api/parent/{parent_id}/foo"
* to update a resource Foo: "PUT /api/parent/{parent_id}/foo/{foo_id}"
* to get a single Foo entity: "GET /api/parent/{parent_id}/foo/{foo_id}"
* to search for multiple Foo entities using multiple filters: "GET
/api/foo?parent_id=id1,id2,id3&owner=user1"

On Mon, May 11, 2020 at 5:08 AM Kamil Breguła <kamil.breg...@polidea.com> wrote:
>
> Hello,
>
> I apologize for my very negative behavior. I misunderstood the rules.
> I updated the specification based on the discussion. I hope that now I
> have not missed any suggestions that I had to make. I tried to make
> every change in a separate commit, so you can review the changes one
> by one.
>
> Here is new PR:  https://github.com/apache/airflow/pull/8721
>
> Short summary:
> * Changed type for DAGID, DAGRunID, TaskID. Now we use string instead
> of an integer.
> * EventLog collection is read-only
> * Used ExcutionDate in DagRuns endpoints
> * Used custom action to control task instances - Removed PATCH
> /dags/{dag_id}/taskInstances/{task_id}/{execution_date} endpoint and
> added PATCH /dags/{dag_id}/clearTaskInstanaces.
> * Added endpoint - POST /dagRuns "Trigger a DAG Run"
> * Added filter parameters to GET /dags/{dag_id}/taskInstances.
> * Added filter parameters to GET /dags/{dag_id}/dagRuns
> * Removed DELETE /taskInstances endpoint
> * The connection ID is a unique key that identifies the connection
> * Moved TaskInstances resource under DagRuns resource
> * Fixed many typos
>
> There is one more topic for discussion - reading resources across
> multiple collections.
> In other words - how do I retrieve task instances for multiple DAGs?
> We have several solutions.
> I) Currently, the endpoint that receives a list of task instances is
> available at /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances:
> This endpoint support reading resources across multiple DAGs by
> specifying a "-" as a dag_id or an execution_date.
> This is based on Google recommendations - AIP159 [1]. I relied on
> these recommendations because it is the most comprehensive study on
> API design principles. Ash Berlin-Taylor rightly pointed out that this
> would be a backward-incompatible change.
> II) We can use a different character that will have the same role -
> '*'. This character cannot be in Dag/Task ID, so it's safe.
> III) Ash proposed to add a separate endpoint that will not include the
> DAG ID in the address - /taskInstances
>
> If we want to choose one solution then I think it's worth looking at
> what the endpoints for DAGs look like.
> /dags
> /dags/{dag_id}
> /dags/{dag_id}/clearTaskInstanaces
> /dags/{dag_id}/dagRuns
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/links
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/logs/{task_try_number}
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/xcomEntries
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/xcomEntries/{key}
> /dags/{dag_id}/dagRuns/{execution_date}
> /dags/{dag_id}/structure
> /dags/{dag_id}/tasks
> /dags/{dag_id}/tasks/{task_id}
> In my opinion, here is a clear hierarchy of resources that will
> facilitate the use of API. The third solution causes that this
> hierarchy is disturbed and the endpoints that are used to receive the
> list of elements will be at the highest level.
> We will have the following endpoints:
> /dags
> /dags/{dag_id}
> /dags/{dag_id}/clearTaskInstanaces
> /dagRuns
> /dags/{dag_id}/dagRuns
> /taskInstances
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/links
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/logs/{task_try_number}
> /xcomEntries
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/xcomEntries
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/xcomEntries/{key}
> /dags/{dag_id}/dagRuns/{execution_date}
> /dags/{dag_id}/structure
> /dags/{dag_id}/tasks
> /dags/{dag_id}/tasks/{task_id}
>
> 4) Some endpoints will have similar behavior, so we can delete them.
> Then we will have the following list of endpoints.
> /dags
> /dags/{dag_id}
> /dags/{dag_id}/clearTaskInstanaces
> /dagRuns
> /taskInstances
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/links
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/logs/{task_try_number}
> /xcomEntries
> /dags/{dag_id}/dagRuns/{execiton_date}/taskInstances/{task_id}/xcomEntries/{key}
> /dags/{dag_id}/dagRuns/{execution_date}
> /dags/{dag_id}/structure
> /dags/{dag_id}/tasks
> /dags/{dag_id}/tasks/{task_id}
>
> Which solution do you like the most? I, II, III, or IV?
>
> Best regards,
> Kamil Breguła
>
> [1] https://aip.dev/159
>
> On Wed, Apr 15, 2020 at 8:32 PM Daniel (Daniel Lamblin ) [Data
> Infrastructure] <lamb...@coupang.com> wrote:
> >
> > I'm glad the general mood is that connection_id should be unique.
> > FWIW when I had multiple connections in v1.8.2 with mysql I didn't seem to 
> > be getting any randomized loadbalancing anyway. Then again, maybe random 
> > was just 100 selections of 1 over 2.
> > There are many other ways to load balance connections, each specific to the 
> > service type, so I don't see it Airflow's place to provide a semi-generic 
> > option to do it.
> >
> > +1 for connection ID being unique.
> >
> > Pardon outlook for changing links to the ConnectionID, DagID and PoolID 
> > being integers in a version of the API.
> > Are we past that decision already; I'd expect to use string names.
> >
> > I'd also asked about DAG run ID, task ID, and finally whether there'd be an 
> > endpoint with which to clear tasks, because crud operations don't model the 
> > interplay of task instance, jobs, and dag run state involved.
> > -Daniel
> >
> > On 4/14/20, 8:49 AM, "Xinbin Huang" <bin.huan...@gmail.com> wrote:
> >
> >     [Warning]: This email originated from an external source. Do not open 
> > links or attachments unless you know the content is safe.
> >     [경고]: 본 이메일은 회사 외부에서 유입되었습니다. 내용이 안전한지 확인하기 전까지는 링크나 첨부파일을 열지 마십시오.
> >
> >
> >     +1 on making connection IDs unique. It's confusing to have Airflow 
> > handled
> >     load balancing here.
> >
> >     On Tue, Apr 14, 2020, 4:59 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> >
> >     > +1 to make connection ids unique
> >     >
> >     > On Tue, Apr 14, 2020 at 11:59 AM Jarek Potiuk 
> > <jarek.pot...@polidea.com>
> >     > wrote:
> >     >
> >     > > I am also supporting converting the connection to be unique.
> >     > >
> >     > > I've worked with similar approach long time ago (10s of years) and 
> > it was
> >     > > fine then where we have not yet figured out how to scale 
> > client/server
> >     > > architecture and we did not have all the nice infrastructure like
> >     > > load/balancing, cloud services etc. I believe in most of the current
> >     > > services/systems load-balancing should be handled by the service 
> > itself
> >     > or
> >     > > some kind of proxy between - not by the client side - in production
> >     > > environments.
> >     > >
> >     > > It's far more robust and might provide much better control (you can
> >     > control
> >     > > multiple independent client's connection and do not rely on random
> >     > > distribution of connections). There are scenarios that would not 
> > work
> >     > well
> >     > > in this case - for example physical proximity of the workers, 
> > current
> >     > load
> >     > > on different services etc. etc. It is very limiting to rely on this
> >     > > feature.
> >     > >
> >     > > On Tue, Apr 14, 2020 at 11:36 AM Kamil Breguła <
> >     > kamil.breg...@polidea.com>
> >     > > wrote:
> >     > >
> >     > > > Hello,
> >     > > >
> >     > > > This can cause big problems with idempotence. According to 
> > RFC-7231,
> >     > > > the DELETE method should be idempotent.
> >     > > >
> >     > > > For example:
> >     > > > If you want to delete items with index from 1 to 4, you should 
> > set the
> >     > > > following request
> >     > > > DELETE /connections/hdfs_default/4
> >     > > > DELETE /connections/hdfs_default/3
> >     > > > DELETE /connections/hdfs_default/2
> >     > > > DELETE /connections/hdfs_default/1
> >     > > >
> >     > > > However, if the user sends these requests in a different order, 
> > they
> >     > > > will only delete the first and third items.
> >     > > > DELETE /connections/hdfs_default/1
> >     > > > DELETE /connections/hdfs_default/2
> >     > > > DELETE /connections/hdfs_default/3
> >     > > > DELETE /connections/hdfs_default/4
> >     > > >
> >     > > > If you use asynchronous HTTP clients (a popular in Node), the 
> > order of
> >     > > > requests will not be kept. It will also be a big problem with
> >     > > > simultaneous modifications.
> >     > > >
> >     > > > I am also in favor of abandoning support for many connections. 
> > This
> >     > > > can be solved on a different layer.
> >     > > >
> >     > > > Best regards,
> >     > > > Kamil
> >     > > >
> >     > > >
> >     > > > On Thu, Apr 9, 2020 at 6:47 PM Ash Berlin-Taylor <a...@apache.org>
> >     > wrote:
> >     > > > >
> >     > > > > They are, but we _can_ make a choice to remove that feature -- 
> > it is
> >     > > not
> >     > > > > widely used and is confusing to many when they stumble on it.
> >     > > > >
> >     > > > > It's not something we should do lightly, but it is a 
> > possibility.
> >     > > > >
> >     > > > > I think I'm probably leaning towards the "ordinal" concept:
> >     > > > >
> >     > > > > /connections/hdfs_default -> list of connections with that ID
> >     > > > > /connections/hdfs_default/0 first connection of that type
> >     > > > >
> >     > > > > Something like that.
> >     > > > >
> >     > > > > On Apr 9 2020, at 2:31 pm, Shaw, Damian P.
> >     > > > > <damian.sha...@credit-suisse.com> wrote:
> >     > > > >
> >     > > > > > FYI if you look back at the thread "Re: [2.0 spring cleaning]
> >     > Require
> >     > > > > > unique conn_id" on 2019-04-14 you can see a message from 
> > Kevin Yang
> >     > > > > > stating that this random choice of connections is a "feature" 
> > used
> >     > to
> >     > > > > > load balance connections in AirBnB. So users are relying on 
> > this
> >     > > > behavior.
> >     > > > > >
> >     > > > > >
> >     > > > > > -----Original Message-----
> >     > > > > > From: Daniel (Daniel Lamblin ) [Data Infrastructure]
> >     > > > > > <lamb...@coupang.com>
> >     > > > > > Sent: Wednesday, April 8, 2020 20:01
> >     > > > > > To: dev@airflow.apache.org
> >     > > > > > Subject: Re: API spec questions
> >     > > > > >
> >     > > > > > Having been bit by accidentally having two connections by the 
> > same
> >     > > > > > name or conn_id, I'd prefer if were made unique. In my 
> > experience
> >     > > > > > there's little utility in having multiple connections by the 
> > same
> >     > > > > > name. Tasks that use a connection do to fairly randomly 
> > choose one,
> >     > > > > > rather they seem pretty consistent in using the one created
> >     > earliest,
> >     > > > > > which often has the lower id integer.
> >     > > > > >
> >     > > > > > Circling back to how this is used by the API, from a user
> >     > > perspective,
> >     > > > > > the following in path integer fields were ones I'd have 
> > expected to
> >     > > be
> >     > > > > > strings instead:
> >     > > > > > ConnectionID
> >     > > >
> >     > >
> >     > 
> > https://kor01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fairflow%2Fblob%2F11f1e0cad996d5596e3e4fb440eb4ec52c024f70%2Fopenapi.yaml%23L1845&amp;data=02%7C01%7Clamblin%40coupang.com%7Cd8716c2432c54849f3d708d7e08b68f2%7Ce3098f96361b47c6a9f4ab7bafcaffe9%7C0%7C0%7C637224761681223746&amp;sdata=ur3AAggOlQ7BskKBxQxRq0i%2FS36Ol72c7H7ddwXo5C4%3D&amp;reserved=0
> >     > > > > > DAGID
> >     > > >
> >     > >
> >     > 
> > https://kor01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fairflow%2Fblob%2F11f1e0cad996d5596e3e4fb440eb4ec52c024f70%2Fopenapi.yaml%23L1853&amp;data=02%7C01%7Clamblin%40coupang.com%7Cd8716c2432c54849f3d708d7e08b68f2%7Ce3098f96361b47c6a9f4ab7bafcaffe9%7C0%7C0%7C637224761681223746&amp;sdata=vyhFiV6Fd3pnVvlAoxnscSvvcy2NX%2BMASGmuHKu%2Fy84%3D&amp;reserved=0
> >     > > > > > PoolID
> >     > > >
> >     > >
> >     > 
> > https://kor01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fairflow%2Fblob%2F11f1e0cad996d5596e3e4fb440eb4ec52c024f70%2Fopenapi.yaml%23L1901&amp;data=02%7C01%7Clamblin%40coupang.com%7Cd8716c2432c54849f3d708d7e08b68f2%7Ce3098f96361b47c6a9f4ab7bafcaffe9%7C0%7C0%7C637224761681223746&amp;sdata=4Fog92AZju0%2BN44FoBdU4fj2DwhgCbUfKYUbdjeWpgo%3D&amp;reserved=0
> >     > > > > >
> >     > > > > > Though it’s a url-encoding hassle, I also expected that 
> > DAGRunID
> >     > > would
> >     > > > > > be more like the Run ID E.G.
> >     > > > > > "scheduled__2020-04-08T23:10:00+00:00",
> >     > > > > > "manual__2020-04-08T23:00:56+00:00",
> >     > > > > > "manual__2020-04-08T16:24:56.692868+00:00" ,
> >     > > > > > "backfill_2020-04-08T22:05:00+00:00" etc
> >     > > > > >
> >     > > > > > Then TaskID is confusing to me; AFAIK the PK to task 
> > instances are
> >     > > > > > task_id, dag_id, and execution_date and the api call appears 
> > to
> >     > align
> >     > > > > > with that having the pattern:
> >     > > > > > /dags/{dag_id}/taskInstances/{task_id}/{execution_date}:
> >     > > > > > But if task_id is a numbered id, then… execution_date isn't 
> > even
> >     > > > > > needed… I'm thinking it should have been a string.
> >     > > > > >
> >     > > > > > An aside to this, I've always wondered what happens if an
> >     > externally
> >     > > > > > triggered DAG Run has the same execution date as a 
> > pre-existing
> >     > > > > > scheduled DAG Run. They'd have different run_ids, EG
> >     > > > > > "scheduled__2020-04-08T23:10:00+00:00" vs
> >     > > > > > "manual__2020-04-08T23:10:00+00:00" but then task instances 
> > for
> >     > those
> >     > > > > > runs might not be unique.
> >     > > > > >
> >     > > > > > Lastly, the UI and CLI operation of clearing tasks seems 
> > analogous
> >     > to
> >     > > > > > the delete task instance API end point. But probably it's 
> > not, and
> >     > > > > > this could become confusing to users.
> >     > > > > > There should be a non-db-model api call for clearing tasks 
> > like you
> >     > > > > > can from the UI and the CLI. If I read it right, clearing 
> > does the
> >     > > > > > following: it sets the state of the task instance to None 
> > unless
> >     > the
> >     > > > > > state was Running then instead it sets it and related job ids 
> > to
> >     > > > > > Shutdown. It deletes reschedules of the TI and it sets the 
> > dag runs
> >     > > > > > for those task instances back to Running.
> >     > > > > > This would be a lot to do for a user using PATCH calls to 
> > change
> >     > Task
> >     > > > > > Instances and Dag Runs together (and there's no API for Jobs).
> >     > > > > >
> >     > > > > > -Daniel L.
> >     > > > > > P.S. as far as renaming all parameters on operators and hooks 
> > with
> >     > > > > > *_conn_id, I do not want to see that breaking change happen. 
> > But IF
> >     > > it
> >     > > > > > HAS TO, I'm still of the opinion that the default_args use for
> >     > > > > > X_conn_id is not preferable to having all operators take 
> > simpler
> >     > > > > > source_conn_id and optional target_conn_id parameter names 
> > that are
> >     > > > > > consistent across all operators.
> >     > > > > >
> >     > > > > > On 4/8/20, 9:47 AM, "Ash Berlin-Taylor" <a...@apache.org> 
> > wrote:
> >     > > > > >
> >     > > > > >    [Warning]: This email originated from an external source. 
> > Do not
> >     > > > > > open links or attachments unless you know the content is safe.
> >     > > > > >    [경고]: 본 이메일은 회사 외부에서 유입되었습니다. 내용이 안전한지 확인하기 전까지는 링크나 첨부파일을 
> > 열지
> >     > > 마십시오.
> >     > > > > >
> >     > > > > >
> >     > > > > >    To expand on the "so I think we need to do one of":
> >     > > > > >
> >     > > > > >
> >     > > > > >    - we need to update the name of "conn_id" somehow. I can't 
> > think
> >     > > of
> >     > > > a
> >     > > > > >    better option, and given all the operators have 
> > "x_conn_id" I
> >     > > don't
> >     > > > > >    think that change should be made lightly.
> >     > > > > >
> >     > > > > >    - make conn_id unique (this "poor" HA has been a source of
> >     > > > > > confusion in
> >     > > > > >    the past) and the ID we use in the API
> >     > > > > >
> >     > > > > >    Or a third option:
> >     > > > > >
> >     > > > > >    - Have the API take conn_id and either return all conns 
> > for that
> >     > > > > >    conn_id, or conn_id plus an ordinal (?idx=0 type thing) to
> >     > return
> >     > > a
> >     > > > > >    single value.
> >     > > > > >
> >     > > > > >    On Apr 8 2020, at 5:42 pm, Ash Berlin-Taylor 
> > <a...@apache.org>
> >     > > > wrote:
> >     > > > > >
> >     > > > > >    > Hi everyone,
> >     > > > > >    >
> >     > > > > >    > So as I mentioned in the AIP voting thread, I think we 
> > need to
> >     > > > give
> >     > > > > >    > some more thought to how we are exposing connection ids 
> > in the
> >     > > > API.
> >     > > > > >    >
> >     > > > > >    > Right now as proposed (and merged without approval, not 
> > cool.
> >     > > The
> >     > > > AIP
> >     > > > > >    > we voted on did not contain a PR against 
> > apache/airflow.) it
> >     > has
> >     > > > an
> >     > > > > >    > end point of `/connections/{connection_id} `
> >     > > > > >    >
> >     > > > > >    > My issue here is as I said in the previous thread: that 
> > is
> >     > going
> >     > > > > > to be
> >     > > > > >    > mightly confusing to our users because there is a 
> > "conn_id"
> >     > > > concept
> >     > > > > >    > that is exposed, so people are going to try putting
> >     > > "aws_default"
> >     > > > etc
> >     > > > > >    > in there. I don't care what the API spec says, I care 
> > what our
> >     > > > users
> >     > > > > >    > expect, and having a connection_id/id and a conn_id 
> > fields is
> >     > > > > > just confusing
> >     > > > > >    >
> >     > > > > >    > So I think we need to do one of:
> >     > > > > >    > - we need to update the name of "conn_id" somehow, make
> >     > conn_id
> >     > > > unique
> >     > > > > >    > (this "poor" HA has been a source of confusion in the 
> > past)
> >     > > > > >    >
> >     > > > > >    > There are similar problems for the DAG run -- the spec 
> > has the
> >     > > > > > type as
> >     > > > > >    > an integer, but everything else about the Airflow system 
> > uses
> >     > > the
> >     > > > > >    > (unique) run_id parameter, and I would expect the API to 
> > use
> >     > > > > > that. The
> >     > > > > >    > autoinc. id on the run column is literally never used in 
> > the
> >     > > code
> >     > > > > >    > base, so exposing that via the API seems odd.
> >     > > > > >    >
> >     > > > > >    > A few other smaller points:
> >     > > > > >    >
> >     > > > > >    > EventLog: Those are "audit"/action logs, so we probably
> >     > > shouldn't
> >     > > > let
> >     > > > > >    > people delete them via the API!
> >     > > > > >    >
> >     > > > > >    > pool_id: still an integer. It should be the "name".
> >     > > > > >    >
> >     > > > > >    > How should add/delete variables and connections work in 
> > the
> >     > API
> >     > > > with
> >     > > > > >    > the addition of the new "Secrets Backends"?
> >     > > > > >    >
> >     > > > > >    > xcomValues: task_id is listed as an integer.
> >     > > > > >    >
> >     > > > > >    >
> >     > > > > >    > -ash
> >     > > > > >
> >     > > > > >
> >     > > > > >
> >     > > > > >
> >     > > > > >
> >     > > > > >
> >     > > > >
> >     > > >
> >     > >
> >     > 
> > ===============================================================================
> >     > > > > > Please access the attached hyperlink for an important 
> > electronic
> >     > > > > > communications disclaimer:
> >     > > > > > 
> > https://kor01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.credit-suisse.com%2Flegal%2Fen%2Fdisclaimer_email_ib.html&amp;data=02%7C01%7Clamblin%40coupang.com%7Cd8716c2432c54849f3d708d7e08b68f2%7Ce3098f96361b47c6a9f4ab7bafcaffe9%7C0%7C0%7C637224761681223746&amp;sdata=6e9Ck952oFmXDkoDYnM0XgCWsplYb7Dua37BkAPt22A%3D&amp;reserved=0
> >     > > > > >
> >     > > > >
> >     > > >
> >     > >
> >     > 
> > ===============================================================================
> >     > > > > >
> >     > > >
> >     > >
> >     > >
> >     > > --
> >     > >
> >     > > Jarek Potiuk
> >     > > Polidea 
> > <https://kor01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.polidea.com%2F&amp;data=02%7C01%7Clamblin%40coupang.com%7Cd8716c2432c54849f3d708d7e08b68f2%7Ce3098f96361b47c6a9f4ab7bafcaffe9%7C0%7C0%7C637224761681223746&amp;sdata=tMOZFKjLGoXh9eAVW5rXhmJeJw8kYhlWAfwFICN%2Fpfw%3D&amp;reserved=0>
> >  | Principal Software Engineer
> >     > >
> >     > > M: +48 660 796 129 <+48660796129>
> >     > > [image: Polidea] 
> > <https://kor01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.polidea.com%2F&amp;data=02%7C01%7Clamblin%40coupang.com%7Cd8716c2432c54849f3d708d7e08b68f2%7Ce3098f96361b47c6a9f4ab7bafcaffe9%7C0%7C0%7C637224761681223746&amp;sdata=tMOZFKjLGoXh9eAVW5rXhmJeJw8kYhlWAfwFICN%2Fpfw%3D&amp;reserved=0>
> >     > >
> >     >
> >
> >

Re: API spec questions

Reply via email to