On Mon, 4 Dec 2023 at 20:09, Jarek Potiuk <[email protected]> wrote:

>
> >
> > It is a choice that needs to be made at the API level - here Common.sql.
> > In this case the particular form of the tuple was, somewhat implicitly,
> > part of the spec.
> >
>
> Yep. I rectified the implicitness by storing this as an explicit decision
> record in the ADR proposal. I think it's been - similar to the lack of
> Serialization decision records - an omission that resulted in this,
> somewhat heated, debate :D
>
>
<snip>


> >
> > So to provide an example in the context of our discussion of PyODBC.Row
> > (see the discussion of the linked PR): only common.sql can now deal with
> > PyODBC.Row. If for some reason another provider returns a PyODBC.Row,
> > that object remains unserializable.
> >
>
> Yes. That's fine for me. I think we do not always have to serialize things
> the same way. Additionally, serializing something with serde produces a
> custom format that requires a deserialization step before it is usable.
> A DBApiHook response can be used "as is" as a tuple by any other operator
> in Common SQL.
>
> > I have explained what the downside of the approach you took is, although
> > its impact in this case is probably low (not many other places will
> > return PyODBC.Row, but consider DataFrame hypothetically).
> >
>
> Yes. For DataFrame I am fine. That is a "common" format that is worth
> having a serializer for, because it can be produced and consumed by many.
> That is quite unlikely to happen for ODBC, and DBApiHook simply has its
> own "standard" - the tuple - which is the language DBApiHook speaks to its
> users. In that sense, returning PyODBC.Row violates assumptions made a
> year ago (now documented in the ADR proposal) - that the tuple is the
> "API" of the DB Hook. So while "make_serializable" was likely a wrong
> name, its goal was to "produce a tuple that is directly serializable".
>
>
Yep, understood.
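
In code terms that goal reduces to roughly the sketch below (hedged:
`_make_serializable` is the name from this thread; the actual hook in the
odbc provider may differ in name and signature):

    # Minimal sketch of "produce a tuple that is directly serializable".
    from typing import Any

    def _make_serializable(row: Any) -> tuple:
        """Coerce a driver-specific row (e.g. pyodbc.Row) into a plain tuple.

        A plain tuple is the de-facto "API" of DBApiHook results and is
        directly XCom-serializable.
        """
        return tuple(row)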


>
>
> > I think you needed a serializable tuple so that it could go into XCom.
> > This makes the need for an explicit choice imho.
> >
> Yep. That's an explicit choice we made.
>
>
>
> > >
> > >
> > On speed. AFAIK AIP-44 still makes use of the DAG serialization code
> > apart from Pydantic's Rust implementation. The DAG serialization code is
> > seriously slow. So, before claiming speed improvements here, show the
> > numbers please :-)
>
>
> Yep. When we first added it we used Pydantic V1, and then it did not
> matter. But we have it going through our serialization code - because at
> some point you insisted on making it part of "a serialization" - while we
> have absolutely no need for that in AIP-44, and I think when we go to
> prime time we will just skip it entirely.
>

It was added to the DAG serializer at the time. I mentioned that this
looked weird, given that it did not seem connected to DAG serialization (to
me), and that serde should be considered. So no, I wasn't insisting on
making it part of serialization imho; it de facto already was.


>
> Going through either of our serializations here is entirely unneeded
> overhead - we can rely purely on Rust and get the ORM object on one side
> and the Pydantic object on the other without touching any of our
> serialization code.
>

That's fine to me.
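
For reference, with Pydantic v2 that direct path looks roughly like this
sketch; the fields here are made up, and the real AIP-44 models are richer:

    # Pydantic's Rust-backed core reads the ORM object directly, so no
    # custom (de)serialization code is involved. Illustrative fields only.
    from pydantic import BaseModel, ConfigDict

    class TaskInstancePydantic(BaseModel):
        model_config = ConfigDict(from_attributes=True)  # populate from ORM attributes
        task_id: str
        dag_id: str
        state: str | None = None

    # ORM -> Pydantic:
    #   ti = TaskInstancePydantic.model_validate(orm_task_instance)
    # and onto the wire as JSON:
    #   payload = ti.model_dump_json()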


>
> > We can of course keep both. However, if you ever looked at the DAG
> > serialization code you might want to reconsider ;-).
>
>
> Yep. I do not feel strongly about it at all. I do not know that code
> intimately either :).
>
>
I don't think you want to ;-).


>
> > Alternatively, you could build on serde and provide the guarantees that
> > you require there. It can be done. I have been able to serialize a DAG
> > with serde (hence knowing the 1:10 ratio), but until now haven't added
> > it, because its schema looks different.
> >
>
> Yeah. I am good with that if we can pull it off with backwards
> compatibility in place.
>
>
We could. It's probably just a three-pass serializer then.


>
> >
> > BTW: there is no requirement for the XCom serializer / serde to
> > deserialize into the original object. By convention we like to, of
> > course, but it is not required.
> >
>
> I understand, but the versioning feature, for example, assumes so, and it
> adds complexity if the serialized form is to be used. From what I
> understand (and what some of the serde tests verify), the API contract of
> serde is that you do in fact get the same, comparable object when you do
> the serialize/deserialize roundtrip. And the only way to make use of the
> serialized object is to deserialize it first. Because serde hides the
> details and design of its serialized form, you should not need to care
> about or understand that form, so you effectively MUST deserialize to use
> it.
>
> So far - also looking at the tests - I was under the impression that it
> was an expected property of serde that the roundtrip returns an equivalent
> object. But maybe I was mistaken here?
>

By convention, it is not required. It is just good practice, because you
expect what you return from a TaskFlow task to have the same properties in
a downstream task. Hence, we go to some lengths to serialize `builtin.set`,
`builtin.tuple` and `builtin.frozenset`, as the JSONEncoder loses type
information when encoding them.

Nevertheless, this is python so duck typing...
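
A quick illustration of that type loss with the standard library (plain
json, so not serde's actual code path):

    # Plain-JSON round trip: every container comes back as a list, which is
    # exactly the loss serde works around by recording the original type.
    import json

    original = {"a_tuple": (1, 2), "a_set": {1, 2}, "a_frozenset": frozenset({3})}
    # json.dumps needs default=list for the set types; tuples encode natively
    encoded = json.dumps(original, default=list)
    print(json.loads(encoded))
    # {'a_tuple': [1, 2], 'a_set': [1, 2], 'a_frozenset': [3]}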

> > For the record: the XCom serializer lazy-loads its dependencies. So,
> > arguably, it does not make core Airflow dependent on client libraries.
> >
>
> But it means that if the same libraries are used in providers and in core,
> we have to declare the dependencies in two places - once for the provider
> extra and once for the serde "extension" extra. You cannot reuse the
> provider dependencies if you want core to use the same client library, for
> two reasons:
>
> 1) they might depend on different versions
> 2) you would have to specify the provider extra and install the provider
>    (which you might not need in this case)
>
> And case 1) might lead to the situation that if core and a provider depend
> on different, conflicting versions of the libraries, we are f*d - because
> you might not be able to install a certain version of the provider with a
> certain version of core. If we move the serialization code to the
> provider, we have no such problem. It always depends on the version that
> the provider requires.
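
To make case 1) concrete - hypothetical pins, with made-up names and
versions, purely for illustration:

    # Hypothetical version constraints illustrating case 1).
    core_serde_extra = {"pyodbc": ">=5.0"}    # pinned by core's serde extra
    odbc_provider = {"pyodbc": ">=4.0,<5.0"}  # pinned by the provider
    # pip must satisfy both ranges at once; the intersection is empty, so
    # that provider version cannot be installed alongside that core version.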
>
>
> > The con is that the API becomes strict. We cannot change it easily
> > anymore.
> >
>
> Of course we can extend that. But I believe you wanted to make the ODBC
> provider depend on the very same API already, didn't you? I see a
> contradiction there. It looks like it's ok for the ODBC provider to depend
> on serde and implement its serialization code there, but on the other hand
> it's not ok for (say) DeltaLake to do the same and become part of the
> Databricks provider?
>
> Or maybe I missed something? Maybe your idea was to make the ODBC
> serializer part of Airflow 2.8 and add >= airflow 2.8.0 for the ODBC
> provider? That's what I call unnecessary coupling.
>
>
Exactly not. Adding it to Airflow 2.8.0 would have made every previous
version of this provider suddenly work against Airflow 2.8.0 without any
adjustments. Remember that the original bug report said "cannot serialize
value of type PyODBC.Row".

So no coupling.

You can argue both ways here architecturally.

1) Airflow 2.8 ships a new feature that allows PyODBC.Row to be serialized,
affecting past and future providers at large as well as (TaskFlow) return
values.
2) Update the provider so that the provider does the serialization itself.

1) requires an update to Airflow.
2) requires an update of the provider.

The bug report was used as a trigger for changing / improving / applying
the intended architecture of common.sql, but for the bug report 1) was
sufficient.
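
For completeness, option 1) is roughly a new module in the style of the
existing ones under airflow.serialization.serializers. The sketch below
assumes the (data, qualname, version, handled) return shape those modules
use, which may differ between Airflow versions - treat it as an
illustration, not the real implementation:

    # Hedged sketch of option 1): a core serializer module following the
    # conventions of airflow.serialization.serializers.
    from __future__ import annotations
    from typing import Any

    serializers = ["pyodbc.Row"]   # qualified names serde dispatches on
    deserializers: list[str] = []  # one-way: a plain tuple is the useful form
    __version__ = 1

    def serialize(o: Any) -> tuple[Any, str, int, bool]:
        # (data, qualname, version, handled): collapse the row to a tuple
        return tuple(o), "pyodbc.Row", __version__, True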


>
> >
> > > Maybe that's a bit of a different question for Pandas - as we have no
> > > separate Pandas provider and we use and link to it already from
> > > Airflow core. But... maybe it's also a sign we need a pandas provider
> > > in this case? This question for me is open.
> > >
> > >
> > To me this depends on what we want to do with data awareness (separate
> > discussion). If you think:
> >
> > ObjectStoragePath : Table : Dataframe
> >
> > it is useful to have a dataframe type as part of core. This could be
> > pandas or maybe polars, which is so much faster :-).
> >
>
> Yeah - I do not feel too strongly about that, and I believe it has
> benefits to have both Pandas and Polars. Especially if we can make sure
> that it's very loosely coupled (use public APIs only and get very clear
> about SemVer/versioning for those, to stay as open as possible), with
> Pandas and Polars as optional extras, I am quite fine with it. I think we
> do not need a provider in this case.
>

This is exactly the intention with the serializers in
airflow.serialization.serializers. However, it can be tough, as not all
libraries (e.g. deltalake) expose the information in public APIs that is
needed to reinstate the object.

But yes: loose, and pragmatic public API / SemVer.
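
For pandas this is workable today; a sketch of staying strictly on public
APIs (function names here are illustrative, not the actual serializer
code):

    # Round-tripping a DataFrame via public pandas APIs only, so the
    # serializer never reaches into library internals.
    from io import StringIO
    import pandas as pd

    def serialize_df(df: pd.DataFrame) -> str:
        # "table" orient embeds a schema, so dtypes and index survive
        return df.to_json(orient="table")

    def deserialize_df(data: str) -> pd.DataFrame:
        return pd.read_json(StringIO(data), orient="table")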


>
> > Case-by-case is fine to me as long as the choice is explicit and thus
> > options are considered.
> >
> >
> Yep. I think this discussion is mostly about that :). And hopefully it
> will result in some decisions being recorded :)
>
>
Another one for the record :-).

Bolke


--
Bolke de Bruin
[email protected]