Hi All!

I am reaching out to initiate a discussion on a topic that has been causing
some confusion within the community: serialization in Apache Airflow. The
purpose of this email is to shed light on the current state of
serialization and, if necessary, open the floor for discussions on the way
forward.


*Background*


Apache Airflow employs two primary mechanisms for serialization, each
serving distinct purposes:

   1. *DAG Serializer
   (airflow.serialization.serialized_objects.BaseSerialization)* – since
   2.0 (afaik, ~2019):
      - Used for DAG serialization, encompassing Tasks, DagRuns,
      TaskInstances, etc.
      - The serialized DAG forms part of the public API but operates
      invisibly to users and developers in most cases.
      - About an order of magnitude slower (~10x) than the XCom
      serializer.
      - Hard to extend.
   2. *XCom Serializer (airflow.serialization.serde)* – since 2.5:
      - Handles serialization of arbitrary objects for XCom.
      - Not directly visible to users in typical scenarios.
      - Significantly faster than the DAG serializer.
      - Supports extending its functionality through 'serialize' and
      'deserialize' methods on a class, or through custom serializers in
      airflow.serialization.serializers (a sketch follows below this
      list).
      - Automatically handles attrs classes, dataclasses, and Pydantic
      models.
      - Preserves semantic information.
      - Incorporates versioning for forward and backward compatibility.
      - Lacks explicit schema validation.
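
To illustrate the extension hook mentioned above, here is a minimal,
hedged sketch (not taken from the linked PRs) of a class opting into XCom
serialization. It assumes the convention of a 'serialize' instance method,
a 'deserialize(data, version)' staticmethod and a '__version__' attribute;
please check the serde source/docs for the exact contract:

    class GridCell:
        # purely illustrative class, not part of Airflow
        __version__ = 1

        def __init__(self, x: int, y: int):
            self.x = x
            self.y = y

        def serialize(self) -> dict:
            # return primitives only, so any encoder (JSON or other)
            # can handle the result
            return {"x": self.x, "y": self.y}

        @staticmethod
        def deserialize(data: dict, version: int) -> "GridCell":
            return GridCell(x=data["x"], y=data["y"])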

Other serializers, like pickle and dill, are employed selectively in
certain areas, probably because serialization for a particular object was
lacking at the time. While both primary serializers support encoding into
JSON, a common misconception is that we serialize directly into JSON. In
fact, we serialize into a dict of primitives, which is then encoded into
JSON if needed, or into something else if we would like to.
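
As a hedged illustration of that two-step process (using serialize() and
deserialize() from airflow.serialization.serde, available since 2.5; the
dataclass is made up for the example):

    import json
    from dataclasses import dataclass

    from airflow.serialization.serde import serialize, deserialize

    @dataclass
    class Coordinates:
        x: int
        y: int

    # step 1: a dict of primitives that keeps classname and version
    encoded = serialize(Coordinates(x=1, y=2))
    # step 2: JSON is just one possible encoding of that dict
    as_json = json.dumps(encoded)
    # round-trips if the class is on the allow list (see Security below)
    restored = deserialize(json.loads(as_json))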



*History*

The DAG serializer was created to make the webserver and scheduler
stateless. It was not created to serialize arbitrary objects, but has
occasionally been extended to be a bit more flexible. It has strong
guarantees around the integrity of what it puts out, as the webserver and
scheduler depend on it. Its form, however, made it less suitable for what
was needed for XCom.



So, let's rewind a bit and talk about how the XCom serializer came onto the
Apache Airflow scene. It wasn't just a random addition; the intention was
to have the superhero version of serialization. Unlike its sibling, the DAG
serializer, the XCom serializer was born to handle tricky stuff that the
JSONEncoder struggled with. It wasn't just about being fast and slick; the
XCom serializer wanted to be the cool kid on the block, offering better
versioning and an easier way to add new features. The goal was to get out
of the way of users as much as possible, so you could just use TaskFlow and
not think about how to share the results of one task with the next. For
example, you can just share DataFrames and it handles them like a champ.



*Security*

In addressing security concerns, there are inherent risks associated with
serialization libraries such as pickle, dill, and cloudpickle due to their
potential for executing arbitrary code during deserialization. This is the
main raison d'être for both the DAG serializer and the XCom serializer. To
mitigate the risks stemming from deserialization of arbitrary objects, the
XCom serializer employs an allow list. This list specifies the classes
permitted during deserialization, minimizing the threat of potential
malicious exploits. The DAG serializer has a fixed scheme of inputs and is
thus limited during deserialization to those inputs.
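
For reference, the allow list is configurable. A hedged example, assuming
the [core] allowed_deserialization_classes option, which takes patterns
matched against fully qualified class names:

    [core]
    allowed_deserialization_classes = airflow\..* my_company\.xcom\..*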



Note that pickle, dill, and cloudpickle still have challenges serializing
arbitrary objects. Any scheme will require some work.



*Encryption and Decryption*



When displaying information in the UI, and also during retrieval from the
database, it is important to secure sensitive information, as we do for
Connections. Efforts are underway to introduce encryption and decryption
methods in the XCom serializer. The motivation behind this initiative stems
from the need to protect potentially confidential data, such as credentials
for cloud access, which may be required by underlying integrations. The
proposed pattern draws inspiration from established practices seen in the
Connection module. Despite the acknowledgment that certain fields may be
sensitive, the community has traditionally left the encryption of such
fields to the serializer at hand. The introduction of encryption and
decryption methods in the XCom serializer seeks to close this gap and
provide a standardized approach that strengthens the overall security
posture; the integration is pending community discussion (this thread) and
potential follow-up modifications.
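
To make the proposed pattern concrete, here is a minimal, hedged sketch of
what such helpers could look like, reusing the Fernet key handling that
Connection already relies on (airflow.models.crypto.get_fernet). The helper
names are hypothetical; the actual design is in the encryption PR linked
below and is exactly what is up for discussion:

    from airflow.models.crypto import get_fernet

    def encrypt_value(plaintext: str) -> str:
        # encrypt an already-serialized (e.g. JSON) payload before storage
        return get_fernet().encrypt(plaintext.encode("utf-8")).decode("utf-8")

    def decrypt_value(ciphertext: str) -> str:
        # decrypt the stored payload before handing it to the deserializer
        return get_fernet().decrypt(ciphertext.encode("utf-8")).decode("utf-8")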



*Coupling*

Both primary serializers are meant to be transparent to the user and
developer, except when explicit serialization or possibly encryption is
required. This means that a provider or a task can just return an arbitrary
object and it will be handled by core Airflow without any adjustments.
Except, of course, when it isn't. The typical symptom is an error that says
"cannot serialize an object of type XX".



This error comes from the XCom serializer and can be solved by providing
the serialize/deserialize methods on the class, by providing a custom
serializer, or by converting the object into something the serializer does
understand. The last option, however, results in a loss of semantic
information: the deserializer will not know how to re-create the original
object. A sketch of a custom serializer follows below.
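
A hedged sketch of such a custom serializer module, modeled loosely on the
built-in ones in airflow.serialization.serializers (check those for the
exact contract before relying on this; 'my_library.Point' is a made-up
class used purely for illustration):

    from __future__ import annotations

    from typing import Any

    __version__ = 1

    # fully qualified class names this module can (de)serialize
    serializers = ["my_library.Point"]
    deserializers = serializers

    def serialize(o: Any) -> tuple[Any, str, int, bool]:
        # return (primitive data, qualified classname, version, handled?)
        return {"x": o.x, "y": o.y}, "my_library.Point", __version__, True

    def deserialize(classname: str, version: int, data: dict) -> Any:
        from my_library import Point  # illustrative import

        return Point(x=data["x"], y=data["y"])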



*Considerations / Questions:*

·      Do we want to move to a single serializer? What would that look like?

·      Do we want to move serializers to their respective providers,
effectively making “serde” (or any other serializer) into a Public API?

·      Do we want encryption of values? Where should that take place?

·      What do we define as a best practice for interacting with the
serializers? Are we okay with losing semantic information if using an
intermediate format? Or do we find it the best practice to provide a
serializer?



*Other links:*

·      Docs on serde / XCom serializer:
https://github.com/apache/airflow/pull/35885

·      Encryption: https://github.com/apache/airflow/pull/35867

·      Discussion on PyODBC: https://github.com/apache/airflow/pull/32319



Kind regards,



Bolke



P.S. I tried to be as inclusive as possible, but there is really a lot of
history to cover here – so if I missed anything of importance, please add
it to the thread.
