I created a PR: https://github.com/apache/airflow/pull/12768

On Wed, Dec 2, 2020 at 8:03 PM Gerard Casas Saez
<gcasass...@twitter.com.invalid> wrote:

> Fair enough.
>
> I would argue then to add a small example to core (like the pandas-to-CSV
> on GCS one) as a reference maybe? It's fine to leave it in the
> examples folder otherwise.
> Gerard Casas Saez
> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>
>
> On Wed, Dec 2, 2020 at 11:51 AM Tomasz Urbaszek <turbas...@apache.org>
> wrote:
>
>> I think it's rather hard to decouple the serialization and persisting.
>> First of all, each data type (the input) may use a different serializer.
>> Then (assuming a 1-1 relation between data type and serializer) each
>> serializer may require different persisting logic (dataframes to buckets,
>> but JSONs to Redis). This multiplies the number of possible combinations
>> and problems. Also, each serializer has to have a deserializer, so users
>> would have to create custom serialize and deserialize methods, which
>> brings us exactly to what we already have... :D
>>
>> Really, having tools like this one that just take a file handle and then
>> store its content somewhere makes it really simple to customize XComs (
>> https://github.com/apache/airflow/blob/67acdbdf92a039ba424704e2f9b4dc67a914f3bb/airflow/providers/google/cloud/hooks/gcs.py#L321-L331
>> ).
>>
>> So, I'm still leaning towards good docs, examples and how-tos instead of
>> code that "works out of the box but you need to adjust it".
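The custom-backend pattern being referred to here comes down to overriding a serialize and a deserialize hook. A minimal self-contained sketch of that shape (all names hypothetical, with an in-memory dict standing in for a remote store like GCS):

```python
import json

# Stand-in for a remote object store such as GCS; a real backend would
# upload via a hook like the one linked above.
FAKE_BUCKET = {}


class JsonObjectStoreXCom:
    """Sketch of a custom XCom backend: serialize to JSON, persist the
    payload externally, and keep only a reference in the metadata DB."""

    @staticmethod
    def serialize_value(value, key="example_key"):
        path = f"xcom/{key}.json"
        FAKE_BUCKET[path] = json.dumps(value)  # "upload" the payload
        return path  # only this reference would live in the Airflow DB

    @staticmethod
    def deserialize_value(reference):
        # "download" the payload and decode it back to a Python object
        return json.loads(FAKE_BUCKET[reference])


ref = JsonObjectStoreXCom.serialize_value({"rows": 3}, key="stats")
print(JsonObjectStoreXCom.deserialize_value(ref))
```

The point of the thread stands: each data type you want to support needs its own pair of these methods, which is exactly the customization surface that already exists.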
>>
>> On Wed, Dec 2, 2020 at 7:12 PM Daniel Standish <dpstand...@gmail.com>
>> wrote:
>>
>>> just a thought....
>>>
>>> If flexibility with storage / serialization is desired, perhaps this
>>> could be accomplished with methods on the backend class.
>>>
>>> So you could have an XCom backend class that has methods like
>>> `push_dataframe` or `push_json` or something like that.
>>>
>>> And if you need flexibility, you can add and use these kinds of methods.
>>>
>>> But the only issue is we would need to make the xcom "handler" available
>>> to the task.
>>>
>>> So instead of (*or in addition to, for backward compatibility*) calling
>>> `self.xcom_pull` from the operator, you would call `self.xcom.pull` --- so
>>> the handler class, with all its methods, is available.
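One way to read this proposal (every name below is hypothetical, not an existing Airflow API) is a handler object with typed push methods, reachable from the task as `self.xcom`:

```python
import json
import pickle


class XComHandler:
    """Hypothetical XCom 'handler' with typed push methods, per the
    proposal. Storage is a plain dict here; a real backend would persist
    each payload remotely."""

    def __init__(self):
        self._store = {}

    def push_json(self, key, value):
        # JSON for plain dicts/lists/scalars
        self._store[key] = ("json", json.dumps(value))

    def push_dataframe(self, key, df):
        # Assumption: dataframes get pickled; pandas-to-CSV would also work
        self._store[key] = ("pickle", pickle.dumps(df))

    def pull(self, key):
        kind, payload = self._store[key]
        return json.loads(payload) if kind == "json" else pickle.loads(payload)


# Inside an operator this would be reached as self.xcom.push_json(...)
xcom = XComHandler()
xcom.push_json("totals", {"clicks": 42})
print(xcom.pull("totals"))
```

This is the flexibility/complexity trade-off discussed above: each typed method bakes one serializer-plus-storage combination into the handler.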
>>>
>>>
>>> On Wed, Dec 2, 2020 at 9:55 AM Gerard Casas Saez
>>> <gcasass...@twitter.com.invalid> wrote:
>>>
>>>> Hi folks!
>>>>
>>>> Reading the conversation, I agree with Tomek.
>>>> At the same time, I see value in adding some options out of the box for
>>>> serialization and storage.
>>>>
>>>> I see there's a pattern here where we can decouple the storage service
>>>> (Redis, S3, GCS, Airflow DB...) from the serialization format (pandas to
>>>> CSV, pickling, JSON...). If we decouple them, then we can provide some
>>>> options in core for each, and provide options to configure them.
>>>>
>>>> something like:
>>>>
>>>> [xcom]
>>>> storage_layer = [airflow.xcom.GCS]
>>>> serialization = [airflow.xcom.pandas2csv, airflow.xcom.jsondump]
>>>>
>>>> Where the XCom layer would use GCS to store values
>>>> (gs://bucket-name/{dag_id}/{dagrun_id}/{ti_id}/{key}) and, to serialize,
>>>> would try pandas2csv first (if the value is a pandas DataFrame) and fall
>>>> back to JSON dump otherwise. This could be extended to serialize as
>>>> preferred (even using GCS/S3 and loading again) and would allow adding
>>>> some options to core while providing extensibility.
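The decoupling proposed here amounts to a first-match serializer chain in front of a pluggable storage layer. A runnable sketch under those assumptions (`pandas2csv`, `jsondump`, and `DictStorage` are illustrative names, and a dict stands in for GCS):

```python
import json


def pandas2csv(value):
    # Assumption: anything DataFrame-like exposes to_csv; return None to
    # pass the value on to the next serializer in the chain.
    if hasattr(value, "to_csv"):
        return "csv", value.to_csv(index=False)
    return None


def jsondump(value):
    try:
        return "json", json.dumps(value)
    except TypeError:
        return None


SERIALIZERS = [pandas2csv, jsondump]  # tried in configured order


class DictStorage:
    """Stand-in for a configured storage layer such as GCS."""

    def __init__(self):
        self.blobs = {}

    def write(self, path, fmt, payload):
        self.blobs[path] = (fmt, payload)


def xcom_push(storage, dag_id, run_id, ti_id, key, value):
    # First serializer that accepts the value wins; the storage layer
    # then persists it under the path scheme from the email above.
    for serializer in SERIALIZERS:
        result = serializer(value)
        if result is not None:
            fmt, payload = result
            storage.write(f"{dag_id}/{run_id}/{ti_id}/{key}", fmt, payload)
            return fmt
    raise TypeError(f"no serializer accepted {type(value)!r}")


storage = DictStorage()
print(xcom_push(storage, "my_dag", "run1", "task1", "stats", {"n": 3}))
```

A dict here falls through pandas2csv and is stored as JSON; swapping the storage class or reordering the serializer list is the configurability the config snippet above describes.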
>>>>
>>>> What do you all think?
>>>>
>>>>
>>>> Gerard Casas Saez
>>>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>>>
>>>>
>>>> On Wed, Dec 2, 2020 at 10:25 AM Tomasz Urbaszek <turbas...@apache.org>
>>>> wrote:
>>>>
>>>>> > Then you could have XComBackendSerializationBackend
>>>>>
>>>>> That's definitely something we should avoid... :D
>>>>>
>>>>> On Wed, Dec 2, 2020 at 6:18 PM Daniel Standish <dpstand...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > You could add xcom serialization utils in airflow.utils
>>>>> >
>>>>> > Then you could have XComBackendSerializationBackend ;)
>>>>>
>>>>
