My 2 cents on this: what is wrong with having some code that can be used by multiple users? What is the point of tens or hundreds of companies each maintaining their own implementation of the same thing? Isn't that why we separated providers from the core, so that the priority of maintaining them wouldn't have to be the same? Also, by this logic, why are we maintaining hundreds of operators and hooks which aren't core to Airflow?
This feature may have come out 6 months ago, but I could not find any example of how to use it (I wasn't aware of the blog). I like the PR you've raised, and I feel it should be part of the GCP provider package itself, because anyone who wants to use it as-is has to figure out the logistics of packaging it and making it available to Airflow. I agree that the draft PR I raised isn't perfect, but instead of making it perfect on the first try, we can open it up to the community and let them contribute to it, rather than everyone maintaining their own version of the code.

On Thu, Dec 3, 2020 at 5:42 AM Tomasz Urbaszek <turbas...@apache.org> wrote:

> I created a PR: https://github.com/apache/airflow/pull/12768
>
> On Wed, Dec 2, 2020 at 8:03 PM Gerard Casas Saez
> <gcasass...@twitter.com.invalid> wrote:
>
>> Fair enough.
>>
>> I would argue then to have a small example (like the pandas-to-csv-and-GCS
>> one) added to core as an example, maybe? It's fine to leave it in the
>> examples folder otherwise.
>>
>> Gerard Casas Saez
>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>
>> On Wed, Dec 2, 2020 at 11:51 AM Tomasz Urbaszek <turbas...@apache.org>
>> wrote:
>>
>>> I think it's rather hard to decouple the serialization and the persisting.
>>> First of all, each data type (the input) may use a different serializer.
>>> Then (assuming a 1-1 relation between data type and serializer) each
>>> serializer may require different persisting logic (dataframes to buckets,
>>> but JSONs to Redis). This multiplies the number of possible combinations
>>> and problems. Also, each serializer has to have a deserializer, so users
>>> would have to create custom serialize and deserialize methods, which
>>> brings us exactly back to what we already have... :D
>>>
>>> Really, having tools like this one, which just yields a file handle and
>>> then stores its content somewhere, makes it really simple to customize
>>> XComs (
>>> https://github.com/apache/airflow/blob/67acdbdf92a039ba424704e2f9b4dc67a914f3bb/airflow/providers/google/cloud/hooks/gcs.py#L321-L331
>>> ).
>>>
>>> So, I'm still leaning towards good docs, examples and how-tos instead of
>>> code that "works out of the box but you need to adjust it".
>>>
>>> On Wed, Dec 2, 2020 at 7:12 PM Daniel Standish <dpstand...@gmail.com>
>>> wrote:
>>>
>>>> Just a thought...
>>>>
>>>> If flexibility with storage / serialization is desired, perhaps this
>>>> would make sense to be accomplished with methods on the backend class.
>>>>
>>>> So you could have an XCom backend class that has methods like
>>>> `push_dataframe` or `push_json`, or something like that.
>>>>
>>>> And if you need flexibility, you can add and use these kinds of methods.
>>>>
>>>> But the only issue is we would need to make the XCom "handler"
>>>> available to the task.
>>>>
>>>> So instead of (*or in addition to, for backward compatibility*) calling
>>>> `self.xcom_pull` from the operator, you would call `self.xcom.pull`
>>>> --- so the handler class, with all its methods, is available.
>>>>
>>>> On Wed, Dec 2, 2020 at 9:55 AM Gerard Casas Saez
>>>> <gcasass...@twitter.com.invalid> wrote:
>>>>
>>>>> Hi folks!
>>>>>
>>>>> Reading the conversation, I agree with Tomek.
>>>>> At the same time, I see value in adding some options out of the box
>>>>> for serialization and storage.
>>>>>
>>>>> I see there's a pattern here where we can decouple the storage service
>>>>> (Redis, S3, GCS, Airflow DB...) and the serialization format (pandas
>>>>> to csv, pickling, json...).
>>>>> If we decouple them, then we can provide some options in core for
>>>>> each, and provide options to configure them.
>>>>>
>>>>> Something like:
>>>>>
>>>>> [xcom]
>>>>> storage_layer = [airflow.xcom.GCS]
>>>>> serialization = [airflow.xcom.pandas2csv, airflow.xcom.jsondump]
>>>>>
>>>>> The XCom layer would use GCS for storage
>>>>> (gs://bucket-name/{dag_id}/{dagrun_id}/{ti_id}/{key}), and to serialize
>>>>> it would try pandas2csv first (if the value is a pandas object) and fall
>>>>> back to json dump otherwise. This could be extended to serialize as
>>>>> preferred (even using GCS/S3 and loading again) and would allow adding
>>>>> some options to core while still providing extensibility.
>>>>>
>>>>> What do you all think?
>>>>>
>>>>> Gerard Casas Saez
>>>>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>>>>
>>>>> On Wed, Dec 2, 2020 at 10:25 AM Tomasz Urbaszek <turbas...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> > Then you could have XComBackendSerializationBackend
>>>>>>
>>>>>> That's definitely something we should avoid... :D
>>>>>>
>>>>>> On Wed, Dec 2, 2020 at 6:18 PM Daniel Standish <dpstand...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > You could add xcom serialization utils in airflow.utils
>>>>>> >
>>>>>> > Then you could have XComBackendSerializationBackend ;)
>>>>>
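
P.S. For anyone who wants to experiment in the meantime, here is roughly the
kind of backend the thread is circling around: a minimal sketch of a
GCS-backed XCom backend that spills pandas DataFrames to a bucket and keeps
everything else in the metadata DB. The bucket name, the object naming scheme
and the DataFrame-to-CSV choice are my own assumptions for illustration, not
what the PR above implements verbatim.

    import uuid

    import pandas as pd

    from airflow.models.xcom import BaseXCom
    from airflow.providers.google.cloud.hooks.gcs import GCSHook


    class GCSXComBackend(BaseXCom):
        # Assumptions for the sketch: a fixed bucket, and a string prefix used
        # to mark values that were offloaded to GCS instead of the Airflow DB.
        BUCKET_NAME = "my-xcom-bucket"
        PREFIX = "gcs_xcom://"

        @staticmethod
        def serialize_value(value):
            if isinstance(value, pd.DataFrame):
                object_name = f"xcom/{uuid.uuid4()}.csv"
                hook = GCSHook()
                # provide_file_and_upload yields a temp file and uploads it to
                # GCS when the context manager exits (the helper linked above).
                with hook.provide_file_and_upload(
                    bucket_name=GCSXComBackend.BUCKET_NAME,
                    object_name=object_name,
                ) as tmp_file:
                    value.to_csv(tmp_file.name, index=False)
                # Only a small reference string goes into the Airflow DB.
                value = GCSXComBackend.PREFIX + object_name
            return BaseXCom.serialize_value(value)

        @staticmethod
        def deserialize_value(result):
            value = BaseXCom.deserialize_value(result)
            if isinstance(value, str) and value.startswith(GCSXComBackend.PREFIX):
                object_name = value[len(GCSXComBackend.PREFIX):]
                hook = GCSHook()
                # provide_file downloads the object to a temp file and yields it.
                with hook.provide_file(
                    bucket_name=GCSXComBackend.BUCKET_NAME,
                    object_name=object_name,
                ) as tmp_file:
                    value = pd.read_csv(tmp_file.name)
            return value

Enabling it should then just be a matter of pointing xcom_backend in the
[core] section of airflow.cfg at the class, e.g.
xcom_backend = my_company.xcoms.GCSXComBackend (module path hypothetical).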
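And just to make Gerard's storage/serialization split concrete (all names
below are hypothetical, nothing like this exists in Airflow today): the
serializer side could be as small as a type-dispatched table. Tomek's point is
that the hard part is the reverse mapping and the per-format storage
combinations, not this bit.

    import json

    import pandas as pd

    # Hypothetical sketch: pick a serializer by value type, first match wins.
    # Each entry is (predicate, serialize, file extension); the json entry
    # acts as a catch-all.
    SERIALIZERS = [
        (lambda v: isinstance(v, pd.DataFrame),
         lambda v: v.to_csv(index=False).encode("utf-8"), "csv"),
        (lambda v: True,
         lambda v: json.dumps(v).encode("utf-8"), "json"),
    ]


    def serialize(value):
        """Return (payload_bytes, extension) for whatever storage layer is configured."""
        for matches, dump, extension in SERIALIZERS:
            if matches(value):
                return dump(value), extension
        raise TypeError(f"No serializer registered for {type(value)}")

Deserialization needs the same table in reverse, plus a storage layer per
format, which is exactly where the combinations start to multiply.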