Fair enough.

I would argue then for having a small example (like the pandas-to-CSV-on-GCS
one) added to core as a reference implementation, maybe? It's fine to leave it
in the examples folder otherwise.
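For illustration, here's a rough, self-contained sketch of the kind of example I mean. All names here are made up: a real backend would subclass Airflow's `BaseXCom` and override `serialize_value`/`deserialize_value`, and the `LocalBucket` below is just a stand-in for GCS so the sketch runs on its own (with list-of-dict "tables" standing in for pandas DataFrames):

```python
import csv
import io
import json
import os
import tempfile


class LocalBucket:
    """Stand-in for a GCS bucket: upload/download named blobs as strings."""

    def __init__(self, root):
        self.root = root

    def upload(self, key, data):
        with open(os.path.join(self.root, key), "w") as f:
            f.write(data)

    def download(self, key):
        with open(os.path.join(self.root, key)) as f:
            return f.read()


class CsvXComBackend:
    """Serialize tabular values (list of dicts) to CSV and everything else
    to JSON, persist the payload in the bucket, and return a small
    reference dict -- which is what would actually land in the Airflow DB."""

    def __init__(self, bucket):
        self.bucket = bucket

    def serialize_value(self, key, value):
        if isinstance(value, list) and value and all(
            isinstance(row, dict) for row in value
        ):
            # "pandas2csv"-style path: write the table as CSV.
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=list(value[0]))
            writer.writeheader()
            writer.writerows(value)
            self.bucket.upload(key + ".csv", buf.getvalue())
            return {"format": "csv", "key": key + ".csv"}
        # Fallback path: plain JSON dump.
        self.bucket.upload(key + ".json", json.dumps(value))
        return {"format": "json", "key": key + ".json"}

    def deserialize_value(self, ref):
        data = self.bucket.download(ref["key"])
        if ref["format"] == "csv":
            return list(csv.DictReader(io.StringIO(data)))
        return json.loads(data)


if __name__ == "__main__":
    backend = CsvXComBackend(LocalBucket(tempfile.mkdtemp()))
    ref = backend.serialize_value("my_task", [{"a": "1", "b": "2"}])
    print(ref["format"], backend.deserialize_value(ref))
```

Obviously the real thing would swap `LocalBucket` for a GCS hook and handle actual DataFrames, but that's roughly the shape I'd want an example in core to have.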
Gerard Casas Saez
Twitter | Cortex | @casassaez <http://twitter.com/casassaez>


On Wed, Dec 2, 2020 at 11:51 AM Tomasz Urbaszek <[email protected]>
wrote:

> I think it's rather hard to decouple the serialization and persisting.
> First of all, each data type (the input) may use a different serializer.
> Then (assuming a 1-1 relation between data type and serializer) each
> serializer may require different persisting logic (dataframes to buckets,
> but JSONs to Redis). This multiplies the number of possible combinations
> and problems. Also, each serializer has to have a deserializer, so users
> would have to create custom serialize and deserialize methods, which
> brings us exactly to what we already have... :D
>
> Really, having tools like this one, which just take a file handle and then
> store its contents somewhere, makes it really simple to customize XComs (
> https://github.com/apache/airflow/blob/67acdbdf92a039ba424704e2f9b4dc67a914f3bb/airflow/providers/google/cloud/hooks/gcs.py#L321-L331
> ).
>
> So, I'm still leaning towards good docs, examples, and how-tos instead of
> code that "works out of the box but you need to adjust it".
>
> On Wed, Dec 2, 2020 at 7:12 PM Daniel Standish <[email protected]>
> wrote:
>
>> just a thought....
>>
>> If flexibility with storage/serialization is desired, perhaps this could
>> be accomplished with methods on the backend class.
>>
>> So you could have an XCom backend class that has methods like
>> `push_dataframe` or `push_json`, or something like that.
>>
>> And if you need flexibility, you can add and use these kinds of methods.
>>
>> But the only issue is we would need to make the xcom "handler" available
>> to the task.
>>
>> So instead of (*or in addition to, for backward compatibility*) calling
>> `self.xcom_pull` from the operator, you would call `self.xcom.pull` --- so
>> the handler class, with all its methods, is available.
>>
>>
>> On Wed, Dec 2, 2020 at 9:55 AM Gerard Casas Saez
>> <[email protected]> wrote:
>>
>>> Hi folks!
>>>
>>> Reading the conversation, I agree with Tomek.
>>> At the same time I see value in adding some options out of the box for
>>> serialization and storage.
>>>
>>> I see there's a pattern here where we can decouple the storage service
>>> (Redis, S3, GCS, Airflow DB...) from the serialization format (pandas to
>>> CSV, pickling, JSON...). If we decouple them, then we can provide some
>>> options in core for each, and options to configure them.
>>>
>>> something like:
>>>
>>> [xcom]
>>> storage_layer = [airflow.xcom.GCS]
>>> serialization = [airflow.xcom.pandas2csv, airflow.xcom.jsondump]
>>>
>>> Where the XCom layer would use GCS for storage
>>> (gs://bucket-name/{dag_id}/{dagrun_id}/{ti_id}/{key}) and, to serialize,
>>> would try pandas2csv first (if the value is a pandas DataFrame), falling
>>> back to JSON dump otherwise. This could be extended to serialize as
>>> preferred (even using GCS/S3 and loading again) and would allow adding
>>> some options to core while providing extensibility.
>>>
>>> What do you all think?
>>>
>>>
>>> Gerard Casas Saez
>>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>>
>>>
>>> On Wed, Dec 2, 2020 at 10:25 AM Tomasz Urbaszek <[email protected]>
>>> wrote:
>>>
>>>> > Then you could have XComBackendSerializationBackend
>>>>
>>>> That's definitely something we should avoid... :D
>>>>
>>>> On Wed, Dec 2, 2020 at 6:18 PM Daniel Standish <[email protected]>
>>>> wrote:
>>>> >
>>>> > You could add xcom serialization utils in airflow.utils
>>>> >
>>>> > Then you could have XComBackendSerializationBackend ;)
>>>>
>>>
