I created a PR: https://github.com/apache/airflow/pull/12768
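(For anyone skimming the thread below: the kind of custom XCom backend under discussion can be sketched in a few lines. This is a toy illustration only, not taken from the PR: an in-memory dict stands in for GCS, plain JSON stands in for the pandas-to-CSV case, and the class simply mirrors the shape of Airflow's `BaseXCom.serialize_value`/`deserialize_value` hooks without importing Airflow.)

```python
import json
import uuid

# Toy stand-in for a blob store such as GCS; a real backend would use
# e.g. the Google provider's GCSHook to upload/download instead.
FAKE_BUCKET = {}


class JsonOnBlobXComBackend:
    """Sketch of a custom XCom backend: store the payload in a blob
    store and keep only a small reference string in the metadata DB."""

    PREFIX = "xcom_blob://"

    @staticmethod
    def serialize_value(value):
        key = str(uuid.uuid4())
        # A pandas-to-CSV serializer would plug in here instead of json.
        FAKE_BUCKET[key] = json.dumps(value)
        return JsonOnBlobXComBackend.PREFIX + key

    @staticmethod
    def deserialize_value(reference):
        key = reference[len(JsonOnBlobXComBackend.PREFIX):]
        return json.loads(FAKE_BUCKET[key])


ref = JsonOnBlobXComBackend.serialize_value({"rows": [1, 2, 3]})
print(ref.startswith("xcom_blob://"))                # True
print(JsonOnBlobXComBackend.deserialize_value(ref))  # {'rows': [1, 2, 3]}
```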
On Wed, Dec 2, 2020 at 8:03 PM Gerard Casas Saez <gcasass...@twitter.com.invalid> wrote:

> Fair enough.
>
> I would argue then for adding a small example (like the pandas-to-CSV-on-GCS
> one) to core. Otherwise it's fine to leave it in the examples folder.
>
> Gerard Casas Saez
> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>
> On Wed, Dec 2, 2020 at 11:51 AM Tomasz Urbaszek <turbas...@apache.org> wrote:
>
>> I think it's rather hard to decouple serialization and persisting.
>> First of all, each data type (the input) may use a different serializer.
>> Then (assuming a 1-1 relation between data type and serializer) each
>> serializer may require different persisting logic (dataframes to buckets,
>> but JSONs to Redis). This multiplies the number of possible combinations
>> and problems. Also, each serializer has to have a deserializer, so users
>> would have to create custom serialize and deserialize methods, which
>> brings us back to exactly what we already have... :D
>>
>> Really, having tools like this one, which just yield a file handle and
>> then store its content somewhere, makes it really simple to customize
>> XComs:
>> https://github.com/apache/airflow/blob/67acdbdf92a039ba424704e2f9b4dc67a914f3bb/airflow/providers/google/cloud/hooks/gcs.py#L321-L331
>>
>> So I'm still leaning towards good docs, examples, and how-tos instead of
>> code that "works out of the box but you need to adjust it".
>>
>> On Wed, Dec 2, 2020 at 7:12 PM Daniel Standish <dpstand...@gmail.com> wrote:
>>
>>> Just a thought...
>>>
>>> If flexibility with storage/serialization is desired, perhaps this
>>> could be accomplished with methods on the backend class.
>>>
>>> So you could have an XCom backend class that has methods like
>>> `push_dataframe` or `push_json`, or something like that.
>>>
>>> And if you need flexibility, you can add and use these kinds of methods.
>>>
>>> But the only issue is we would need to make the XCom "handler"
>>> available to the task.
>>>
>>> So instead of (*or in addition to, for backward compatibility*) calling
>>> `self.xcom_pull` from the operator, you would call `self.xcom.pull`, so
>>> that the handler class, with all its methods, is available.
>>>
>>> On Wed, Dec 2, 2020 at 9:55 AM Gerard Casas Saez <gcasass...@twitter.com.invalid> wrote:
>>>
>>>> Hi folks!
>>>>
>>>> Reading the conversation, I agree with Tomek.
>>>> At the same time, I see value in adding some options out of the box
>>>> for serialization and storage.
>>>>
>>>> I see a pattern here where we can decouple the storage service (Redis,
>>>> S3, GCS, Airflow DB...) from the serialization format (pandas to CSV,
>>>> pickling, JSON...). If we decouple them, we can provide some options
>>>> in core for each, with configuration to choose between them.
>>>>
>>>> Something like:
>>>>
>>>> [xcom]
>>>> storage_layer = [airflow.xcom.GCS]
>>>> serialization = [airflow.xcom.pandas2csv, airflow.xcom.jsondump]
>>>>
>>>> Here the XCom layer would use GCS to store
>>>> (gs://bucket-name/{dag_id}/{dagrun_id}/{ti_id}/{key}), and to
>>>> serialize it would try pandas2csv first (if the value is a pandas
>>>> object) and fall back to JSON dump otherwise. This could be extended
>>>> to serialize as preferred (even using GCS/S3 and loading again) and
>>>> would allow adding some options to core while providing extensibility.
>>>>
>>>> What do you all think?
>>>>
>>>> Gerard Casas Saez
>>>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>>>
>>>> On Wed, Dec 2, 2020 at 10:25 AM Tomasz Urbaszek <turbas...@apache.org> wrote:
>>>>
>>>>> > Then you could have XComBackendSerializationBackend
>>>>>
>>>>> That's definitely something we should avoid...
:D >>>>> >>>>> On Wed, Dec 2, 2020 at 6:18 PM Daniel Standish <dpstand...@gmail.com> >>>>> wrote: >>>>> > >>>>> > You could add xcom serialization utils in airflow.utils >>>>> > >>>>> > Then you could have XComBackendSerializationBackend ;) >>>>> >>>>
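(As a footnote to the thread: Gerard's "try pandas2csv first, fall back to JSON dump" dispatch could be sketched roughly as below. The serializer names and the registry list are made up for illustration; only the try-in-configured-order idea comes from the proposal.)

```python
import json

# Hypothetical serializer chain, mirroring the proposed config
#   serialization = [airflow.xcom.pandas2csv, airflow.xcom.jsondump]
# Each entry returns a (format_tag, payload) pair, or raises TypeError
# to hand the value on to the next serializer in the list.

def pandas2csv(value):
    # Real version: if isinstance(value, pd.DataFrame): return ("csv", value.to_csv())
    to_csv = getattr(value, "to_csv", None)
    if to_csv is None:
        raise TypeError("not a dataframe-like object")
    return ("csv", to_csv())

def jsondump(value):
    return ("json", json.dumps(value))

SERIALIZERS = [pandas2csv, jsondump]

def serialize_xcom(value):
    """Try each configured serializer in order; first success wins."""
    for serializer in SERIALIZERS:
        try:
            return serializer(value)
        except TypeError:
            continue
    raise TypeError(f"no serializer for {type(value)!r}")

print(serialize_xcom({"a": 1}))  # ('json', '{"a": 1}')
```

Each serializer would need a matching deserializer keyed on the format tag, which is exactly the doubling of custom code Tomek points out upthread.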