>
>
> I am reaching out to initiate a discussion on a topic that has been causing
> some confusion within the community: serialization in Apache Airflow. The
> purpose of this email is to shed light on the current state of
> serialization and, if necessary, open the floor for discussions on the way
> forward.
>
Cool!
> 2. *XCom Serializer (airflow.serialization.serde)* – Since 2.5:
>
It's rather cool - I agree, and it has noble goals.
> Other serializers, like pickle and dill, are selectively employed in
> certain areas, probably due to the lack of serialization support for a
> particular object at the time. While both serializers support encoding
> into JSON, the misconception is that we serialize directly into JSON. We
> serialize into a dict of primitives, which is then encoded into JSON if
> needed, or into something else if we would like to.
>
Yep. Serializing is surprisingly difficult and has a lot of traps.
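To make the dict-of-primitives point concrete, here is a minimal sketch
(assuming a recent Airflow 2.x where `airflow.serialization.serde` is
importable):

```python
# A minimal sketch of the two-step flow: serde first produces a dict of
# primitives; JSON encoding is a separate, optional step.
import json
from datetime import datetime

from airflow.serialization.serde import deserialize, serialize

data = serialize({"when": datetime(2024, 1, 1)})
# `data` is a plain primitive structure (with classname/version markers
# for complex objects), not a JSON string:
text = json.dumps(data)
roundtrip = deserialize(json.loads(text))
```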
> So, let's rewind a bit and talk about how the XCom serializer came onto
> the Apache Airflow scene. It wasn't just a random addition; the intention
> was to have the superhero version of serialization. Unlike its sibling,
> the DAG serializer, the XCom serializer was born to handle tricky stuff
> that the JSONEncoder struggled with. It wasn't just about being fast and
> slick; the XCom serializer wanted to be the cool kid on the block,
> offering better versioning and an easier way to add new features. It was
> meant to get out of the way of users as much as possible, so you could
> just use TaskFlow and not think about how to share the results of tasks
> with the next task. For example, you can just share DataFrames and it
> handles them like a champ.
>
Yep. I think the serializer as it is used now is great - especially in
TaskFlow, where we want to make sure that the producer produces an output
whose serialized form the consumer can deserialize into exactly the same
object. I really like the transparency here, and I do remember the
problems we had to solve before it came into being.
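For readers following along, a minimal sketch of that transparency
(assuming a recent Airflow 2.x where the pandas serializer is registered;
the DAG and task names are made up):

```python
import pandas as pd
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def share_dataframe():
    @task
    def produce() -> pd.DataFrame:
        # serde turns this into a dict of primitives behind the scenes...
        return pd.DataFrame({"a": [1, 2, 3]})

    @task
    def consume(df: pd.DataFrame):
        # ...and the consumer gets the same DataFrame back, transparently.
        print(df.describe())

    consume(produce())


share_dataframe()
```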
> *Security*
>
> In addressing security concerns, there are inherent risks associated with
> serialization libraries such as pickle, dill, and cloudpickle due to their
> potential for executing arbitrary code during deserialization. This is the
> main raison d'être for both the DAG serializer and the XCom serializer. To
> mitigate the risks stemming from deserialization of arbitrary objects, the
> XCom serializer employs an allow list. This list specifies the permitted
> classes during deserialization, minimizing the threat of potential
> malicious exploits. The DAG serializer has a fixed scheme of inputs and
> thus is limited during deserialization to those inputs.
>
Actually, this is one of the things I do not understand (and I even had a
discussion on this very subject with Elad recently). I think we have a
serious security problem here, if this mechanism is supposed to protect us
from anything. Currently we have an `airflow.*` exclusion, and it is quite
a terrible hole (if I understand what the protection is about). There is
nothing preventing a user from declaring their own class as
`airflow.providers.my_provider_package.my_class` in their DAG. There is no
mechanism preventing it. I think we need a serious discussion about the
actual security of this mechanism, because I think it gives a false sense
of security and introduces unnecessary friction without providing any
security benefits.
IMHO the only "real" mechanism to prevent something is the entrypoint
mechanism we use for plugins and providers, because (unlike a package-name
check) it requires installing the package in the environment by the
Deployment Manager, while a package-name limitation can be easily bypassed
by simply having code on the PYTHONPATH (which our DAGs already do).
But maybe I do not understand what it should protect us against. I think
this one needs a separate discussion.
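To make the bypass concrete, a hypothetical illustration (the module and
class names are made up; this assumes the allow list matches on the
qualified class name):

```python
import sys
import types

# DAG code on the PYTHONPATH can fabricate a module whose name matches
# an allow-listed "airflow.*" pattern - nothing verifies it was installed.
fake = types.ModuleType("airflow.providers.my_provider_package")


class MyClass:
    def serialize(self):
        return {"x": 1}


MyClass.__module__ = "airflow.providers.my_provider_package"
fake.MyClass = MyClass
sys.modules["airflow.providers.my_provider_package"] = fake

# A name-based allow list now sees an "airflow.providers.*" class:
print(f"{MyClass.__module__}.{MyClass.__qualname__}")
```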
> *Encryption and Decryption*
>
> When displaying information in the UI, and also during retrieval from the
> database, it is important to secure sensitive information as we do for
> Connections. Efforts are underway to introduce encryption and decryption
> methods in the XCom serializer. The motivation behind this initiative stems
> from the need to protect potentially confidential data, such as credentials
> for cloud access, which may be required by underlying integrations. The
> proposed pattern draws inspiration from established practices seen in the
> Connection module. Despite the acknowledgment that certain fields may be
> sensitive, the community has traditionally left the encryption of such
> fields to the serializer at hand. The introduction of encryption and
> decryption methods in the XCom serializer seeks to address this gap,
> providing a standardized approach to enhance the overall security posture,
> with the pending integration awaiting community discussion (this thread)
> and potential follow-up modifications.
>
I think this one requires a separate discussion - especially on the
expected security properties. I am not at all sure whether we really want
to encrypt individual fields (down to primitives like int) in the
serialized data. In order to do that, we would have to find ways to define
which of the fields of an arbitrary object are sensitive. This does not
come out-of-the-box with all the various ways you can make an object
serializable (@dataclass? serialize/deserialize methods?).
It also has pretty unknown security properties (what happens when you
encrypt an int with a Fernet key - isn't it easily recoverable then?).
While for Connections we could define a set of heuristics (names of fields
and extras that we deem sensitive), generalizing them to "arbitrary"
objects seems rather difficult.
Also, I think we should understand why we want to encrypt the data, and
why individual fields. Are we going to communicate to our users that some
of the data sent there is sensitive? I understand that the need comes from
the fact that SOME sources of the XCom data already contain sensitive
information (say, passwords and URLs). But would it not be better to
simply encrypt the whole blob we are sending as XCom rather than
individual fields? That might of course impact performance, so maybe it is
not the best idea. Another option is to `expect` the serializer to encrypt
sensitive data on its own and give it an API so that it can do so manually
while serializing the field (or maybe that is what is really proposed and
I misunderstand it). So I think we need a bit more of a "design decision"
here on how we are going to approach it.
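For the sake of discussion, a rough sketch of the "simple primitives"
option (the `encrypt_value`/`decrypt_value` helpers are hypothetical -
they do not exist in Airflow today - and the class is made up; it assumes
serde's serialize()/deserialize(data, version) convention):

```python
from cryptography.fernet import Fernet

# In Airflow this key would come from the configured Fernet key
# (airflow.models.crypto.get_fernet()); generated here just for the sketch.
_fernet = Fernet(Fernet.generate_key())


def encrypt_value(value: str) -> str:
    return _fernet.encrypt(value.encode()).decode()


def decrypt_value(token: str) -> str:
    return _fernet.decrypt(token.encode()).decode()


class CloudCredentials:
    def __init__(self, user: str, secret: str):
        self.user = user
        self.secret = secret

    def serialize(self) -> dict:
        # The class itself decides what is sensitive; serde only ever
        # sees primitives and stays encryption-agnostic.
        return {"user": self.user, "secret": encrypt_value(self.secret)}

    @staticmethod
    def deserialize(data: dict, version: int) -> "CloudCredentials":
        return CloudCredentials(data["user"], decrypt_value(data["secret"]))
```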
> *Coupling*
>
> Both primary serializers function transparently to the user and developer,
> except when explicit serialization or possibly encryption is required. This
> means that a provider or a task can just return an arbitrary object and
> that is going to be handled by core Airflow without any adjustments.
> Except, of course, when it doesn't. The typical stack trace for this is an
> error that says "cannot serialize an object of type XX".
>
> This error comes from the XCom serializer and can thus be solved by
> providing the serialize/deserialize methods, providing a custom serializer,
> OR converting the object into something the serializer does understand.
> However, this will result in a loss of semantic information: the
> deserializer will not know how to re-create the original object.
>
Yeah. I have a few problems with coupling some of the providers' code too
much with serde:
In a number of cases we do not care about restoring the original object -
the case in question being the common.sql "tuples" returned by some
DBApiHooks. Common.sql and DBApiHook expect a "serializable tuple" as
output (basically - with a few caveats), but once it is returned we do not
care whether the original object was a DatabricksNamedTuple or a
SnowflakeDictionary (both can be returned by the DB API). We care that the
DBApiHook returns a serializable tuple, and this is the "common" format we
want to potentially serialize from (and back to). We never ever want to
deserialize it back into a DatabricksNamedTuple. It is merely an
implementation detail of the DB API that we got it while reading from the
DB.
I fail to see the benefits of coupling the provider's code with serde in
this case. Also (see security above), IF we seriously think about
security, then some mechanism of registering serializers from providers
will be needed as soon as we solve the question of "how secure is limiting
the package name for serialized objects" and whether we need it at all.
For me this calls for an extension of our plugin mechanisms and
"registering" serializable data by providers.
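A hypothetical sketch of what such entrypoint-based registration could
look like (the `airflow.serializers` entry-point group is made up for
illustration - it mirrors how plugins and providers are discovered today;
Python 3.10+ importlib.metadata API):

```python
from importlib.metadata import entry_points


def load_registered_serializers() -> dict:
    """Discover serializers from installed distributions only."""
    registered = {}
    # Unlike a package-name allow list, entry points can only come from
    # distributions the Deployment Manager actually installed - DAG files
    # on the PYTHONPATH cannot inject entries here.
    for ep in entry_points(group="airflow.serializers"):
        registered[ep.name] = ep.load()
    return registered
```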
> *Considerations / Questions:*
>
> · Do we want to move to a single serializer? What would that look
> like?
>
I am not sure. The case in question above - whatever is returned by the
DB API - need not be deserialized or versioned. We need a tuple. That's it.
Also, I think there are several cases where different serializations might
make more sense. Take Pydantic, for one. Their serialization is very
different, based on different assumptions, and in AIP-44 we are using, for
example, their ability to serialize connected ORM classes. They simply
take a TaskInstance row and serialize it into a Pydantic object that is
not "ORM" bound, but has all the foreign-key-linked classes available
(DagRun and Dag) with all the "read-only" properties that we want to see.
And it is basically one-way serialization (we have no intention of
deserializing it back into an ORM object).
So while "serialization" as a name is common, these two serve quite
different purposes. There is no point in using serde for the Internal API,
having versioning and deserialization there; more importantly, it is going
to be way slower with JSON and Python code. Pydantic v2 uses Rust heavily,
pretty much all serialization happens in Rust, and it is up to 16x faster
than any Python code we could write for it. And for the Internal API we
need it as fast as possible, because we are basically going to replace
multiple DB queries that happen internally in our components.
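A minimal sketch of that one-way, ORM-detached pattern (the model names
are illustrative, not Airflow's actual AIP-44 classes; assumes Pydantic
v2):

```python
from pydantic import BaseModel, ConfigDict


class DagRunPydantic(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    run_id: str
    state: str


class TaskInstancePydantic(BaseModel):
    # from_attributes lets Pydantic read straight off the ORM row,
    # including relationships such as task_instance.dag_run.
    model_config = ConfigDict(from_attributes=True)
    task_id: str
    dag_id: str
    dag_run: DagRunPydantic


# Given a SQLAlchemy TaskInstance row `ti`, this yields a detached,
# JSON-ready object - never meant to be turned back into an ORM row:
# payload = TaskInstancePydantic.model_validate(ti).model_dump_json()
```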
I think that in this context we do not need "one serialization to rule
them all". I'd even argue that maybe we do not need it for the DAG
serializer. Maybe it is "good enough", and maybe it is OK to keep a
separate DAG-focused serializer for the purpose of serializing DAGs?
Again, in this case we do not care about deserializing back to the
original form. When we serialize a DAG to the DB, what we really care
about is seeing the serializable form of it, not the original one.
Actually, even more - we DO NOT want to see the original form. Having a
serialized form that you cannot turn back into the original (DAG Python
code) is a security property of DAG serialization.
> · Do we want to move serializers to their respective providers,
> effectively making “serde” (or any other serializer) into a Public API?
>
This is a good question; I think yes. Take deltalake and iceberg - I'd say
it makes sense to move them to the "Databricks" and "Tabular" providers
respectively, mainly because of dependencies. We do not want Airflow core
to depend on the client libraries, and we have already learned the
benefits of moving code out to providers: if there is a bugfix or an
upgrade to a 3rd-party library, we are not forced to release Airflow to
fix and adapt to it. We have already had many cases like this - it has
even happened for the k8s executor, where we could fix bugs without having
to release a new Airflow version.
Maybe that's a bit of a different question for Pandas, as we have no
separate Pandas provider and we already use and link to it from Airflow
core. But... maybe that is also a sign we need a pandas provider in this
case? This question is open for me.
> · Do we want encryption of values? Where should that take place?
>
I think we need to discuss the use case for it and what the "API" to do
that would look like. IMHO we should either encrypt everything, or each
serializer should "hard-code" encrypting specific data before generic
serialization starts and decrypting after deserialization ends. So rather
than building encryption into serialization, we should provide simple
primitives ("encrypt", "decrypt") that a serializer should use to encrypt
its data, while serialization/deserialization should not care about
whether a field is encrypted or not. Maybe this is what you also thought
about - if so, I am good with it.
> · What do we define as a best practice for interacting with the
> serializers? Are we okay with losing semantic information if using an
> intermediate format? Or do we find it the best practice to provide a
> serializer?
>
I think it should be case-by-case. And we should clearly define where, and
in what context, the XCom serializer should be used - and where other
serializers might and should be used. I see no particular value in
applying the same rules and assumptions to the different serialization
cases we might have.
> *Other links:*
>
> · Docs on serde / Xcom serializer:
> https://github.com/apache/airflow/pull/35885
>
> · Encryption: https://github.com/apache/airflow/pull/35867
>
> · Discussion on PyODBC: https://github.com/apache/airflow/pull/32319
>
>
>
> Kind regards,
>
>
>
> Bolke
>
>
>
> P.S. I tried to be as inclusive as possible but there is really a lot of
> history to cover here – so if I missed anything of importance please add it
> to the thread
>