Kevin Yang has done some benchmarking as well. On Wed, Apr 10, 2019 at 2:32 PM Gabriel Silk <gs...@dropbox.com.invalid> wrote:
> The main concern I have with FK's is the potential performance impact. > > When evaluating Airflow for use at Dropbox, I ran benchmarks that showed > several bottlenecks in the query patterns being used by the scheduler, > which led me to open https://issues.apache.org/jira/browse/AIRFLOW-2430 > and > submit a PR. > > If we move forward with FKs, we need to do a robust set of benchmarks to > understand how this will affect users who are running large Airflow > clusters. > > On Wed, Apr 10, 2019 at 1:05 PM Alex Guziel <alex.guz...@airbnb.com > .invalid> > wrote: > > > I'm not a huge fan of having foreign keys. I know Airbnb has and > definitely > > still has problems with DB load. I don't see any real convincing > arguments > > for how adding referential integrity will improve Airflow meaningfully > > (yet). > > > > On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak < > > basharens...@godatadriven.com> wrote: > > > > > In my experience it could go either way; in some cases FKs could impair > > > the performance and in some cases FKs can help the query optimizer > > improve > > > query performance. Each case is different and without testing it’s just > > > guessing. > > > > > > I’m in favour of adding FKs and value referential integrity over > > > performance. If you’re sacrificing integrity for performance you’re > doing > > > either advanced funky stuff or the wrong thing. I haven’t seen the > > database > > > being a bottleneck in Airflow, even with large setups (+-5000 DAGs). So > > why > > > not add FKs and performance test some Airflow queries, to know for sure > > :-) > > > > > > Bas > > > > > > On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbr...@gmail.com<mailto: > > > bdbr...@gmail.com>> wrote: > > > > > > Please not that Fks can be quite slow... > > > > > > Verstuurd vanaf mijn iPad > > > > > > Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <a...@apache.org > <mailto: > > > a...@apache.org>> het volgende geschreven: > > > > > > I am all for FKs. > > > > > > How do you think we should handle the case of "Xcom but missing TIs" > (or > > > similar) that we might run into on installs when a user runs `airflow > > > upgradedb`? > > > > > > -a > > > > > > On 10 Apr 2019, at 18:44, Driesprong, Fokko <fo...@driesprong.frl > > <mailto: > > > fo...@driesprong.frl>> wrote: > > > > > > Reviving this discussion again :-) > > > > > > Lately, I was digging into the PR of Julian regarding adding FK's to > the > > > database: https://github.com/apache/airflow/pull/4922 > > > > > > After digging into the details, I've noticed that the current situation > > > regarding referential integrity is bad. It is not uncommon the have > > > DagRun's without having the DAG in the database. For example, you can > do > > a > > > backfill job before the scheduler persisted the DAG in the database. I > > also > > > think this is often the case in the UI, where we the nuke when some of > > the > > > models haven't been persisted in the database. Therefore I'd like to > > > suggest to enforce consistency by foreign keys. This will prevent us > from > > > having DagRuns without DAGs, but also removing xcom objects of > > > TaskInstances that are already removed. > > > > > > To create an overview of the FK's, I've created subtasks the ticket of > > > Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904 > > > > > > WDYT? > > > > > > Cheers, Fokko > > > > > > > > > > > > Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin < > > > maximebeauche...@gmail.com<mailto:maximebeauche...@gmail.com>>: > > > > > > The database migration creating the FK will/would need to have > something > > > that either creates dummy missing PKs first, or delete the orphaned > keys > > to > > > insure the operation of creating the FK doesn't error out. Seems like > > > adding dummy keys is a better approach. Then you'll have to make sure > > that > > > everywhere where FKs are created that there are no edge cases of > missing > > > PKs. Then some delete operations in some cases may have to "cascade" > > > properly. > > > > > > The Django Admin had this nice confirm screen on delete that would show > > you > > > clearly the scope of the cascading operation when deleting objects. To > my > > > knowledge Flask-Admin and FAB don't have such a feature. Personally I > > > wouldn't allow cascade unless such a feature is implemented in some > way. > > > Note that SQLAlchemy has builtin semantics for specifying how/whether > > > cascading should take place. > > > > > > Personally I think referential integrity is overrated in some cases, > > > especially when using meaningful "business keys" (as opposed to > > > auto-increted "surrogate" keys) as PKs. It also slows down insert > > > operations. For data warehousing (this clearly doesn't apply to the > > Airflow > > > metadata store), the best practice on most db engine is to **not** > > enforce > > > FK constraints as it slows down inserts in fact tables and straight out > > > prevents bulk loading. > > > > > > Another approach is to avoid deleting in general, especially referenced > > > fks, and setting up some activity/visibility flag to false instead. > > > > > > Max > > > > > > On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko > <fo...@driesprong.frl > > > <mailto:fo...@driesprong.frl>> > > > wrote: > > > > > > I'm in favor of having referential integrity. It will add some load in > > > having to enforce the referential integrity, but it will also make sure > > > that the database stays clean. Also in Airflow we use transactions > which > > > will make sure that the integrity checks are not validated on every > > > statement, but after the commit. I'm happy to help with this as well. > > > > > > Cheers, Fokko > > > > > > Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbr...@gmail.com > > > <mailto:bdbr...@gmail.com>>: > > > > > > Adding these kind of checks which work for integrity well make database > > > access pretty slow. In addition it isnt there because in the past there > > > was > > > no strong connection between for example tasks and dagruns, it was more > > > or > > > less just coincidental. There also so some bisecting tools that > > > probably > > > have difficulty functioning in a new regime. In other words it is not > > > an > > > easy change and it will have operational challenges. > > > > > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <a...@apache.org<mailto: > > > a...@apache.org>> wrote: > > > > > > Ooh good spot. > > > > > > Yes I would be in favour of adding these, but as you say we need to > > > thing about how we might migrate old data. > > > > > > Doing this at 2.0.0 and providing a cleanup script (or doing it as > > > part > > > of the migration?) is probably the way to go. > > > > > > -ash- > > > > > > On 17 Sep 2018, at 19:56, Stefan Seelmann <m...@stefan-seelmann.de > > <mailto: > > > m...@stefan-seelmann.de>> > > > wrote: > > > > > > Hi, > > > > > > looking into the DB schema there is almost no referral integrity > > > enforced at the database level. Many foreign key constraints between > > > dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make > > > sense > > > IMO. > > > > > > Is there a particular reason why that's not implemented? > > > > > > Introducing it now will be hard, probably any real-world setup has > > > some > > > violations. But I'm still in favor of this additional safety net. > > > > > > Kind Regards, > > > Stefan > > > > > > > > > > > > > > > > > > > > > > > > > > >