Please not that Fks can be quite slow...

Verstuurd vanaf mijn iPad

> Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <a...@apache.org> het 
> volgende geschreven:
> 
> I am all for FKs.
> 
> How do you think we should handle the case of "Xcom but missing TIs" (or 
> similar) that we might run into on installs when a user runs `airflow 
> upgradedb`?
> 
> -a
> 
>> On 10 Apr 2019, at 18:44, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>> 
>> Reviving this discussion again :-)
>> 
>> Lately, I was digging into the PR of Julian regarding adding FK's to the
>> database: https://github.com/apache/airflow/pull/4922
>> 
>> After digging into the details, I've noticed that the current situation
>> regarding referential integrity is bad. It is not uncommon the have
>> DagRun's without having the DAG in the database. For example, you can do a
>> backfill job before the scheduler persisted the DAG in the database. I also
>> think this is often the case in the UI, where we the nuke when some of the
>> models haven't been persisted in the database. Therefore I'd like to
>> suggest to enforce consistency by foreign keys. This will prevent us from
>> having DagRuns without DAGs, but also removing xcom objects of
>> TaskInstances that are already removed.
>> 
>> To create an overview of the FK's, I've created subtasks the ticket of
>> Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
>> 
>> WDYT?
>> 
>> Cheers, Fokko
>> 
>> 
>> 
>> Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
>> maximebeauche...@gmail.com>:
>> 
>>> The database migration creating the FK will/would need to have something
>>> that either creates dummy missing PKs first, or delete the orphaned keys to
>>> insure the operation of creating the FK doesn't error out. Seems like
>>> adding dummy keys is a better approach. Then you'll have to make sure that
>>> everywhere where FKs are created that there are no edge cases of missing
>>> PKs. Then some delete operations in some cases may have to "cascade"
>>> properly.
>>> 
>>> The Django Admin had this nice confirm screen on delete that would show you
>>> clearly the scope of the cascading operation when deleting objects. To my
>>> knowledge Flask-Admin and FAB don't have such a feature. Personally I
>>> wouldn't allow cascade unless such a feature is implemented in some way.
>>> Note that SQLAlchemy has builtin semantics for specifying how/whether
>>> cascading should take place.
>>> 
>>> Personally I think referential integrity is overrated in some cases,
>>> especially when using meaningful "business keys" (as opposed to
>>> auto-increted "surrogate" keys) as PKs. It also slows down insert
>>> operations. For data warehousing (this clearly doesn't apply to the Airflow
>>> metadata store), the best practice on most db engine is to **not** enforce
>>> FK constraints as it slows down inserts in fact tables and straight out
>>> prevents bulk loading.
>>> 
>>> Another approach is to avoid deleting in general, especially referenced
>>> fks, and setting up some activity/visibility flag to false instead.
>>> 
>>> Max
>>> 
>>> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fo...@driesprong.frl>
>>> wrote:
>>> 
>>>> I'm in favor of having referential integrity. It will add some load in
>>>> having to enforce the referential integrity, but it will also make sure
>>>> that the database stays clean. Also in Airflow we use transactions which
>>>> will make sure that the integrity checks are not validated on every
>>>> statement, but after the commit. I'm happy to help with this as well.
>>>> 
>>>> Cheers, Fokko
>>>> 
>>>> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbr...@gmail.com>:
>>>> 
>>>>> Adding these kind of checks which work for integrity well make database
>>>>> access pretty slow. In addition it isnt there because in the past there
>>>> was
>>>>> no strong connection between for example tasks and dagruns, it was more
>>>> or
>>>>> less just coincidental. There also so some bisecting tools that
>>> probably
>>>>> have difficulty functioning in a new regime. In other words it is not
>>> an
>>>>> easy change and it will have operational challenges.
>>>>> 
>>>>>> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>>> 
>>>>>> Ooh good spot.
>>>>>> 
>>>>>> Yes I would be in favour of adding these, but as you say we need to
>>>>> thing about how we might migrate old data.
>>>>>> 
>>>>>> Doing this at 2.0.0 and providing a cleanup script (or doing it as
>>> part
>>>>> of the migration?) is probably the way to go.
>>>>>> 
>>>>>> -ash-
>>>>>> 
>>>>>>> On 17 Sep 2018, at 19:56, Stefan Seelmann <m...@stefan-seelmann.de>
>>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> looking into the DB schema there is almost no referral integrity
>>>>>>> enforced at the database level. Many foreign key constraints between
>>>>>>> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
>>>> sense
>>>>>>> IMO.
>>>>>>> 
>>>>>>> Is there a particular reason why that's not implemented?
>>>>>>> 
>>>>>>> Introducing it now will be hard, probably any real-world setup has
>>>> some
>>>>>>> violations. But I'm still in favor of this additional safety net.
>>>>>>> 
>>>>>>> Kind Regards,
>>>>>>> Stefan
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
> 

Reply via email to