potiuk commented on PR #51059:
URL: https://github.com/apache/airflow/pull/51059#issuecomment-2979827579

    >  I feel pandas, numpy, datetime, are similar to this case, which are 
common objects maybe used by multiple providers, or by tasks to pass in XComs. 
This approach may help avoid implementing similar things in different providers.
   
   I am not quite sure. I'd say it might also end up as a "common.dataframe" 
provider (alongside polars and likely other things - see a discussion we had 
some time ago (nothing materialized from it, but it could): 
https://lists.apache.org/thread/qx3yh6h0l6jb0kh3fz9q95b3x5b4001l) - then any 
provider that needs it could simply depend on it.
   
   There is conceptually no difference between "common.sql" and 
"common.dataframe" - and yet somehow we seem to want the "dataframe"-related 
code in Airflow core, while we keep the "sql"-related code in a provider.
   
   > However, i think pydantic is more of a core thing, it not essentially 
belongs to a provider, it can be consumed and used in core without additional 
dependencies. So there's no stopping anybody from returning a pydantic 
dataclass object as xcom.
   
   I think there are **two** different layers to Pydantic support:
   
   * the generic serialization code that handles whatever Pydantic models are 
registered with it - this can indeed live in "airflow-core", especially since 
we already use Pydantic for other things (namely FastAPI)
   
   * but serialization/deserialization of concrete Pydantic classes should be 
"injected" by the respective provider. If Cohere defines its own Pydantic 
model, and another provider wants to use that Pydantic model "deserialized" - 
it is perfectly fine for that other provider to depend on Cohere to be able to 
instantiate the concrete Cohere model. Basically, you need to have the Cohere 
provider installed in order to use its model. And installing the Cohere 
provider should "register" the Cohere model in the common Pydantic serializer.
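
   To make the two layers concrete, here is a minimal, dependency-free sketch 
(all names are hypothetical): a generic registry that could live in 
airflow-core, and a concrete model class that a provider registers when it is 
installed. With real Pydantic models, serialize/deserialize would call 
`model_dump()` / `model_validate()` instead of `asdict()` / `cls(**data)`.

```python
# Hypothetical sketch - not actual Airflow code. Plain dataclasses stand in
# for Pydantic models so the example has no dependencies.
from dataclasses import dataclass, asdict

# Layer 1: generic registry + serializer, would live in airflow-core.
_MODEL_REGISTRY: dict[str, type] = {}

def register_model(cls: type) -> type:
    """Called from provider code when the provider is installed/loaded."""
    _MODEL_REGISTRY[f"{cls.__module__}.{cls.__qualname__}"] = cls
    return cls

def serialize(obj) -> dict:
    return {
        "__classname__": f"{type(obj).__module__}.{type(obj).__qualname__}",
        "data": asdict(obj),
    }

def deserialize(payload: dict):
    # Fails loudly if the owning provider is not installed - by design.
    cls = _MODEL_REGISTRY[payload["__classname__"]]
    return cls(**payload["data"])

# Layer 2: a provider-defined model (stand-in for e.g. a Cohere model),
# registered with the common serializer at import time.
@register_model
@dataclass
class EmbeddingResult:
    text: str
    vector: list[float]

restored = deserialize(serialize(EmbeddingResult(text="hello", vector=[0.1, 0.2])))
```

   Core only ever touches the registry; the concrete class comes back out 
because the provider put it in, not because core imported it.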
   
   This is the "classic" dependency injection that we need to implement (and 
we already do a lot of it in common.sql, for example) - but also in 
`common.messaging`. You implement the "common" code in a common place, and 
then the specific code that implements the "external entity variant" of it 
lives in a provider, which registers that "specific code" when it gets 
installed.
   
   See https://github.com/apache/airflow/pull/50057 which "fixed" the way it 
is done for "common.messaging" -> the providers manager loads common.messaging 
and the "common" code that you can then use to listen for incoming messages. 
Common.messaging provides that "common implementation". Then the "amazon" 
provider provides an SQS implementation of it and registers it (the providers 
manager takes care of discovering the "queue classes" and injects them into 
common.messaging - so that the common.messaging class can "know" about the 
"sqs" class).
   
   No core class needs to know about or import "amazon" classes in this case. 
Core does not know about them at all. It's the amazon provider that "knows" 
about common.messaging, exposes its specific "amazon" messaging 
implementation, and allows the providers manager to register it when the 
provider is installed (via entrypoints).
   
   This is IMHO how the whole of our serialization should be implemented. 
Airflow core should not care about pandas, polars, iceberg, etc. It should be 
"pure, no dependencies" code. Providers - when installed - should provide all 
the code that is specific to the "external entity" (amazon, iceberg, whatever) 
and make it usable by the core. But importing "iceberg" anywhere in 
"airflow-core" is just **wrong**.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
