It might help to start off this separate discussion thread with a simple /
concise problem statement. (that might sound snarky but it ain't :) ) And I
say *might help* cus you still might get crickets even if you clarify... But
as for me, I didn't really follow the main thread, and when I looked at this
one, which was a continuation of a convo, the context wasn't super clear, so
I moved on.

On Mon, Dec 11, 2023 at 11:23 AM Jarek Potiuk <[email protected]> wrote:

> Tried to get others some opportunity to comment, but I see it's mostly me
> <> Bolke.
>
> > Yep. I rectified the implicitness by storing this as an explicit
> > decision record in the ADR proposal. I think it's been - similarly as
> > lack of
>
> <snip>
>
> Happy to help here - since there is a good "vibe" on the ADR kind of
> approach, I am happy to "lead by example" and propose a PR with ADRs for
> serialization (learning from the whole discussions - and that's a good
> opportunity for me to learn some of the nitty-gritty details).
>
> Would that work for you, and can I count on some reviews?
>
> > It was added to the DAG serializer at the time; I mentioned that that
> > looked weird given the fact that it did not seem connected to DAG
> > serialization (to me) and serde should be considered. So no, I wasn't
> > insisting on making it part of serialization imho - it de facto was
> > already.
>
> Let's straighten it out after writing the ADRs and rounding up the
> serialization case. We are aiming to enable AIP-44 and Pydantic in 2.9, so
> that's a good time to see and make some measurements there.
>
> > > Going through either of the serializations of ours here is not needed
> > > overhead at all - we can just purely rely on Rust and get ORM on one
> > > side and Pydantic object on the other without touching any of our
> > > serialization code.
> >
> > That's fine to me.
>
> Yeah. We'll sort it out.
> Luckily it's a very localized place, and we can simply open a draft PR
> showing how it will look and see if it looks simpler / whether we lose
> any of the properties that
>
> > I don't think you want to ;-).
>
> Happy also to take a look at helping to straighten up DAG serialization,
> possibly once I get a bit more familiar with the context.
>
> > We could. It's probably just a three-pass serializer then.
>
> Interesting :).
>
> > Exactly not. By adding it to Airflow 2.8.0 it would have made every
> > previous version of this provider suddenly work without any adjustments
> > against Airflow 2.8.0. Remember that the original bug report said
> > "cannot serialize value of type PyODBC.Row".
>
> I was more talking about compatibility of the ODBC provider. If we want
> to directly return ODBC.Row, it will only be serializable in Airflow
> 2.8.0 (currently the ODBC provider has >= Airflow 2.6.0). What I want to
> achieve is to implement a fix where you can upgrade just the provider and
> get serializable output even if the user decides to stay with Airflow
> 2.8.0. This w
>
> > You can argue both ways here architecturally.
> >
> > 1) Airflow 2.8 with a new feature that allows PyODBC.Row to be
> > serialized, affecting past and future providers at large and (TaskFlow)
> > returned variables.
>
> This requires users to upgrade to Airflow 2.8.0. Many of our users delay
> such upgrades for months or years. So we are talking about making it
> possible for them to upgrade "easily", where selective upgrades to
> providers are the easy way. This also works better for some
> Airflow-as-a-service offerings where 2.8 might only be available months
> from now.
>
> Migration to a new Airflow version is way more involved, and uptake of
> new versions (especially .0) is slow.
>
> We are basically talking about having something available now - for
> Airflow 2.6, 2.7, (and 2.8) - vs. many users having it months or years
> from now.
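The provider-side fix under discussion boils down to converting driver-specific
row objects into plain Python structures before they leave the hook, so the
result is serializable on any Airflow version. A minimal sketch of that idea -
not the actual provider code; the function name and shape here are
illustrative, and it works from the generic DB-API ``cursor.description``
rather than anything pyodbc-specific:

```python
def rows_to_serializable(description, rows):
    """Convert DB-API rows into plain dicts that any serializer can handle.

    ``description`` is the DB-API 2.0 ``cursor.description`` sequence;
    the first element of each entry is the column name. ``rows`` is any
    iterable of row-like sequences (tuples, pyodbc Rows, ...).
    """
    columns = [col[0] for col in description]
    return [dict(zip(columns, row)) for row in rows]


# Example with fake cursor data (no database driver needed for the sketch):
description = [("id", int), ("name", str)]
rows = [(1, "alpha"), (2, "beta")]
print(rows_to_serializable(description, rows))
# [{'id': 1, 'name': 'alpha'}, {'id': 2, 'name': 'beta'}]
```

Because the conversion happens inside the provider, a user who upgrades only
the provider package gets serializable output without touching Airflow itself.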
> And releasing it now fixes an actual issue people have and raised to us.
> And implementing it in the provider has an added benefit: IF we also
> implement a generic serializer as part of ODBC and have a way for serde
> to discover serializers provided by providers, then we could implement it
> in the provider in a way that either uses internal "serialization" or
> allows serde to discover it.
>
> I see really no benefit of having Airflow know anything about ODBC. Are
> we going to implement all 80 providers' specific code, put it in airflow,
> and add all their dependencies (even optional) as airflow dependencies,
> now that we made an enormous effort in separating these?
>
> Imagine ODBC code depends on odbc_library. What happens if Airflow has an
> optional dependency on odbc_library <3 and the provider has a dependency
> on odbc_library >=3? Imagine that the provider returns ODBCRow from
> version 3, which that serializer won't be able to handle.
>
> How do you install these two together? How do you make them work
> together?
>
> It really makes very little sense to have airflow depend on the library
> and code that the provider already depends on, because this will create a
> whole new class of potential conflicts.
>
> I think the *only* possible way to make ODBC Row handled by serde is for
> serde to discover serializers from installed providers and use them from
> there.
>
> This solves all the "dependency conflict" cases nicely and is long-term
> sustainable and maintainable.
>
> > This is exactly the intention with the serializers in
> > airflow.serialization.serializers; however, it can be tough, as not all
> > expose the information in public APIs (e.g. deltalake) that allow you
> > to reinstate the object.
> >
> > But yes: loose, and pragmatic public API / SemVer.
>
> Yep.
