Let me start separate threads from the main discussion. It will be easier to follow.
I'd really like to understand the actors involved and what kind of threats the allowlist is supposed to protect us against. I think it would be great to see the "attack" scenario we think the allowlist will protect us against - explaining the actors involved and what actions they could take that the allowlist prevents. It might be that I do not see it (but, as experience from our security team has shown, having a clear step-by-step explanation of what happens allows us to reason better about it and see whether the protection is actually "functional"). I think for now it's not doing much (except making it a bit harder for someone who would like to exploit it), but I might be mistaken (that happened many times during our security discussions).

> Pickle and the likes can execute arbitrary code that is inside the
> serialized object.

Yep. This is super dangerous indeed.

> Airflow's serializers do not execute arbitrary code. Also your assumption
> of the 'serious' security problem is incorrect.

I think that's a misunderstanding. I simply think that whatever we are trying to protect against is simply not effective, because it might be easily bypassed by DAG authors (if DAG authors' actions are what we are trying to protect against). I am not sure if it is serious or not, because I am not sure what we are trying to protect against. But I believe whatever we do might simply not work as intended, and even more - I think it's quite unnecessary. So I want to know the intentions, actors and scenarios :). I think clearly spelling out the actors involved and the scenario of the attack we are trying to protect against should help with clarifying it. I think it's been hinted at in the next few paragraphs, but I wanted to make sure that I understand it.

> The deserializer will need
> to be able to load the class by using 'import_string' after verifying that
> it matches the allowed pattern (by default 'airflow.*'). If a DAG author
> does something like `airflow.providers.bla.bla` that class cannot be loaded
> neither can the DAG author make that class available to a different DAG in
> a different context (which is where a real attack vector would be).

So do I understand correctly that this is aimed at protecting against one DAG author (writing DAG A and storing data in XCom) - to not allow them to have that XCom read in DAG B and execute code that DAG A's author wanted DAG B to execute during deserialization? In other words - we want to make sure that DAGs are somewhat isolated from each other? For now we are ignoring the fact that tasks from different DAGs are potentially executed on the same worker, in a fork of the same process, which has potentially other ways in which one task can influence what other tasks are doing. This is a different scenario / layer, and I do understand the concept of layered protection, but I wanted to make sure that it's been a deliberate consideration. Is this a fair assessment? Are there any other reasons for that allowlist? Finally - did we consider the potential attack scenario where imported sub-packages might come from multiple locations on PYTHONPATH? I am not sure it is possible to exploit in our case, but per https://peps.python.org/pep-0420/ you can import sub-packages of a package from several locations (I put small illustrations of both points in the PS at the end of this mail).

> An attack vector is considered (as mentioned above) if you can change
> something in someone else's context. So, if I would be able to serialize
> something and then influence / adjust its deserialization in a different
> context which should not be under my control.

Yeah.
I think here the "my control" and "others" and "different context" (i.e. the actors involved) were not clear to me (and still are not entirely). We have a security model (I just updated it to clarify some of the DAG author expectations) where currently we do not distinguish between different DAG authors - we assume that all DAG authors have access to the same execution environment and, generally speaking, we cannot assume that DAG authors are isolated from each other. It's simply impossible (currently) for multiple reasons. We are working on changing that with AIP-44 and one more not-yet-written AIP, and maybe it is good that this serialization adds an extra layer of security, but I think it has the potential of giving our users a false sense of security if we do not properly describe it in the model. So we need to be very clear when we are describing the intended level of isolation provided.

J.
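
PS. To make the two technical points above easier to discuss, here are two small sketches. They only illustrate my understanding - the pattern-matching helper and the "acme" package below are hypothetical examples of mine, not the actual implementation.

The first one is roughly the kind of allowlist check I understand the deserializer to perform before loading a class (assuming a simple glob-style pattern and the 'airflow.*' default mentioned in the quote above):

    from fnmatch import fnmatch

    from airflow.utils.module_loading import import_string

    # Assumed default allowlist, per the description quoted above.
    ALLOWED_PATTERNS = ["airflow.*"]

    def load_allowed_class(qualified_name: str):
        # Refuse to import anything whose fully qualified name does not
        # match one of the allowed patterns.
        if not any(fnmatch(qualified_name, pattern) for pattern in ALLOWED_PATTERNS):
            raise ImportError(f"{qualified_name} is not allowed for deserialization")
        return import_string(qualified_name)

The second one demonstrates the PEP 420 behaviour I am referring to: sub-packages of a namespace package can be assembled from several sys.path locations, so an allowed-looking prefix does not by itself pin down where the imported code comes from. The "acme" package is made up purely for the demonstration:

    import os
    import sys
    import tempfile

    root = tempfile.mkdtemp()
    for location, module in [("loc1", "mod_a"), ("loc2", "mod_b")]:
        # Note: no __init__.py anywhere, so "acme" becomes a namespace package.
        pkg_dir = os.path.join(root, location, "acme")
        os.makedirs(pkg_dir)
        with open(os.path.join(pkg_dir, f"{module}.py"), "w") as f:
            f.write(f"WHO = '{location}'\n")
        sys.path.append(os.path.join(root, location))

    import acme.mod_a
    import acme.mod_b

    # Both sub-modules import under the same "acme" namespace, even though
    # they live in two unrelated directories on sys.path.
    print(acme.mod_a.WHO, acme.mod_b.WHO)  # -> loc1 loc2
    print(list(acme.__path__))             # both locations contribute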
