+1 on making connection IDs unique. It's confusing to have Airflow handled
load balancing here.

On Tue, Apr 14, 2020, 4:59 AM Kaxil Naik <[email protected]> wrote:

> +1 to make connection ids unique
>
> On Tue, Apr 14, 2020 at 11:59 AM Jarek Potiuk <[email protected]>
> wrote:
>
> > I am also supporting converting the connection to be unique.
> >
> > I've worked with similar approach long time ago (10s of years) and it was
> > fine then where we have not yet figured out how to scale client/server
> > architecture and we did not have all the nice infrastructure like
> > load/balancing, cloud services etc. I believe in most of the current
> > services/systems load-balancing should be handled by the service itself
> or
> > some kind of proxy between - not by the client side - in production
> > environments.
> >
> > It's far more robust and might provide much better control (you can
> control
> > multiple independent client's connection and do not rely on random
> > distribution of connections). There are scenarios that would not work
> well
> > in this case - for example physical proximity of the workers, current
> load
> > on different services etc. etc. It is very limiting to rely on this
> > feature.
> >
> > On Tue, Apr 14, 2020 at 11:36 AM Kamil Breguła <
> [email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > This can cause big problems with idempotence. According to RFC-7231,
> > > the DELETE method should be idempotent.
> > >
> > > For example:
> > > If you want to delete items with index from 1 to 4, you should set the
> > > following request
> > > DELETE /connections/hdfs_default/4
> > > DELETE /connections/hdfs_default/3
> > > DELETE /connections/hdfs_default/2
> > > DELETE /connections/hdfs_default/1
> > >
> > > However, if the user sends these requests in a different order, they
> > > will only delete the first and third items.
> > > DELETE /connections/hdfs_default/1
> > > DELETE /connections/hdfs_default/2
> > > DELETE /connections/hdfs_default/3
> > > DELETE /connections/hdfs_default/4
> > >
> > > If you use asynchronous HTTP clients (a popular in Node), the order of
> > > requests will not be kept. It will also be a big problem with
> > > simultaneous modifications.
> > >
> > > I am also in favor of abandoning support for many connections. This
> > > can be solved on a different layer.
> > >
> > > Best regards,
> > > Kamil
> > >
> > >
> > > On Thu, Apr 9, 2020 at 6:47 PM Ash Berlin-Taylor <[email protected]>
> wrote:
> > > >
> > > > They are, but we _can_ make a choice to remove that feature -- it is
> > not
> > > > widely used and is confusing to many when they stumble on it.
> > > >
> > > > It's not something we should do lightly, but it is a possibility.
> > > >
> > > > I think I'm probably leaning towards the "ordinal" concept:
> > > >
> > > > /connections/hdfs_default -> list of connections with that ID
> > > > /connections/hdfs_default/0 first connection of that type
> > > >
> > > > Something like that.
> > > >
> > > > On Apr 9 2020, at 2:31 pm, Shaw, Damian P.
> > > > <[email protected]> wrote:
> > > >
> > > > > FYI if you look back at the thread "Re: [2.0 spring cleaning]
> Require
> > > > > unique conn_id" on 2019-04-14 you can see a message from Kevin Yang
> > > > > stating that this random choice of connections is a "feature" used
> to
> > > > > load balance connections in AirBnB. So users are relying on this
> > > behavior.
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Daniel (Daniel Lamblin ) [Data Infrastructure]
> > > > > <[email protected]>
> > > > > Sent: Wednesday, April 8, 2020 20:01
> > > > > To: [email protected]
> > > > > Subject: Re: API spec questions
> > > > >
> > > > > Having been bit by accidentally having two connections by the same
> > > > > name or conn_id, I'd prefer if were made unique. In my experience
> > > > > there's little utility in having multiple connections by the same
> > > > > name. Tasks that use a connection do to fairly randomly choose one,
> > > > > rather they seem pretty consistent in using the one created
> earliest,
> > > > > which often has the lower id integer.
> > > > >
> > > > > Circling back to how this is used by the API, from a user
> > perspective,
> > > > > the following in path integer fields were ones I'd have expected to
> > be
> > > > > strings instead:
> > > > > ConnectionID
> > >
> >
> https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1845
> > > > > DAGID
> > >
> >
> https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1853
> > > > > PoolID
> > >
> >
> https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1901
> > > > >
> > > > > Though it’s a url-encoding hassle, I also expected that DAGRunID
> > would
> > > > > be more like the Run ID E.G.
> > > > > "scheduled__2020-04-08T23:10:00+00:00",
> > > > > "manual__2020-04-08T23:00:56+00:00",
> > > > > "manual__2020-04-08T16:24:56.692868+00:00" ,
> > > > > "backfill_2020-04-08T22:05:00+00:00" etc
> > > > >
> > > > > Then TaskID is confusing to me; AFAIK the PK to task instances are
> > > > > task_id, dag_id, and execution_date and the api call appears to
> align
> > > > > with that having the pattern:
> > > > > /dags/{dag_id}/taskInstances/{task_id}/{execution_date}:
> > > > > But if task_id is a numbered id, then… execution_date isn't even
> > > > > needed… I'm thinking it should have been a string.
> > > > >
> > > > > An aside to this, I've always wondered what happens if an
> externally
> > > > > triggered DAG Run has the same execution date as a pre-existing
> > > > > scheduled DAG Run. They'd have different run_ids, EG
> > > > > "scheduled__2020-04-08T23:10:00+00:00" vs
> > > > > "manual__2020-04-08T23:10:00+00:00" but then task instances for
> those
> > > > > runs might not be unique.
> > > > >
> > > > > Lastly, the UI and CLI operation of clearing tasks seems analogous
> to
> > > > > the delete task instance API end point. But probably it's not, and
> > > > > this could become confusing to users.
> > > > > There should be a non-db-model api call for clearing tasks like you
> > > > > can from the UI and the CLI. If I read it right, clearing does the
> > > > > following: it sets the state of the task instance to None unless
> the
> > > > > state was Running then instead it sets it and related job ids to
> > > > > Shutdown. It deletes reschedules of the TI and it sets the dag runs
> > > > > for those task instances back to Running.
> > > > > This would be a lot to do for a user using PATCH calls to change
> Task
> > > > > Instances and Dag Runs together (and there's no API for Jobs).
> > > > >
> > > > > -Daniel L.
> > > > > P.S. as far as renaming all parameters on operators and hooks with
> > > > > *_conn_id, I do not want to see that breaking change happen. But IF
> > it
> > > > > HAS TO, I'm still of the opinion that the default_args use for
> > > > > X_conn_id is not preferable to having all operators take simpler
> > > > > source_conn_id and optional target_conn_id parameter names that are
> > > > > consistent across all operators.
> > > > >
> > > > > On 4/8/20, 9:47 AM, "Ash Berlin-Taylor" <[email protected]> wrote:
> > > > >
> > > > >    [Warning]: This email originated from an external source. Do not
> > > > > open links or attachments unless you know the content is safe.
> > > > >    [경고]: 본 이메일은 회사 외부에서 유입되었습니다. 내용이 안전한지 확인하기 전까지는 링크나 첨부파일을 열지
> > 마십시오.
> > > > >
> > > > >
> > > > >    To expand on the "so I think we need to do one of":
> > > > >
> > > > >
> > > > >    - we need to update the name of "conn_id" somehow. I can't think
> > of
> > > a
> > > > >    better option, and given all the operators have "x_conn_id" I
> > don't
> > > > >    think that change should be made lightly.
> > > > >
> > > > >    - make conn_id unique (this "poor" HA has been a source of
> > > > > confusion in
> > > > >    the past) and the ID we use in the API
> > > > >
> > > > >    Or a third option:
> > > > >
> > > > >    - Have the API take conn_id and either return all conns for that
> > > > >    conn_id, or conn_id plus an ordinal (?idx=0 type thing) to
> return
> > a
> > > > >    single value.
> > > > >
> > > > >    On Apr 8 2020, at 5:42 pm, Ash Berlin-Taylor <[email protected]>
> > > wrote:
> > > > >
> > > > >    > Hi everyone,
> > > > >    >
> > > > >    > So as I mentioned in the AIP voting thread, I think we need to
> > > give
> > > > >    > some more thought to how we are exposing connection ids in the
> > > API.
> > > > >    >
> > > > >    > Right now as proposed (and merged without approval, not cool.
> > The
> > > AIP
> > > > >    > we voted on did not contain a PR against apache/airflow.) it
> has
> > > an
> > > > >    > end point of `/connections/{connection_id} `
> > > > >    >
> > > > >    > My issue here is as I said in the previous thread: that is
> going
> > > > > to be
> > > > >    > mightly confusing to our users because there is a "conn_id"
> > > concept
> > > > >    > that is exposed, so people are going to try putting
> > "aws_default"
> > > etc
> > > > >    > in there. I don't care what the API spec says, I care what our
> > > users
> > > > >    > expect, and having a connection_id/id and a conn_id fields is
> > > > > just confusing
> > > > >    >
> > > > >    > So I think we need to do one of:
> > > > >    > - we need to update the name of "conn_id" somehow, make
> conn_id
> > > unique
> > > > >    > (this "poor" HA has been a source of confusion in the past)
> > > > >    >
> > > > >    > There are similar problems for the DAG run -- the spec has the
> > > > > type as
> > > > >    > an integer, but everything else about the Airflow system uses
> > the
> > > > >    > (unique) run_id parameter, and I would expect the API to use
> > > > > that. The
> > > > >    > autoinc. id on the run column is literally never used in the
> > code
> > > > >    > base, so exposing that via the API seems odd.
> > > > >    >
> > > > >    > A few other smaller points:
> > > > >    >
> > > > >    > EventLog: Those are "audit"/action logs, so we probably
> > shouldn't
> > > let
> > > > >    > people delete them via the API!
> > > > >    >
> > > > >    > pool_id: still an integer. It should be the "name".
> > > > >    >
> > > > >    > How should add/delete variables and connections work in the
> API
> > > with
> > > > >    > the addition of the new "Secrets Backends"?
> > > > >    >
> > > > >    > xcomValues: task_id is listed as an integer.
> > > > >    >
> > > > >    >
> > > > >    > -ash
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> ===============================================================================
> > > > > Please access the attached hyperlink for an important electronic
> > > > > communications disclaimer:
> > > > > http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> > > > >
> > > >
> > >
> >
> ===============================================================================
> > > > >
> > >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>

Reply via email to