> What is wrong with having some code which can be used by multiple users?
There's nothing wrong with it. My main point about XCom backends is that they
are not simply "other storage" than the database.

> I think instead of making it perfect on the first try, we can open it up
> for the community and let them contribute on the same

I think the PR is really good and it was necessary, because we need to think
about this subject. As far as I'm aware, that's what we usually do before we
accept a new piece of code: we discuss what is best for the whole community
(users and maintainers).

In my opinion, the whole point of XCom backends is that users can pass
between tasks data which is not:
- json serializable (as required by the base XCom and your example)
- limited by size

While the second point is rather easy to adjust, the first one is rather
problematic. As mentioned in previous mails - if we decide to have custom
XComs, then we have to extend the existing mechanism with something that will
allow users to configure serializers/deserializers (which raises another
question: should we maintain them all?). Otherwise, the code we will have
will be hard to use imho.

Eventually, we can be opinionated and implement something that we think will
work for 95% of use cases. For example, consider the pandas data frame
example. I decided to use csv as the format, but users may want to use avro,
pickle, tsv or whatever else they want. Taking this into account, a custom
XCom backend for pandas (or the respective serializer/deserializer) would
have to allow configuration of all of that.

On another note, I'm afraid that users will want to use pickle in custom
XComs. We removed it from Airflow, but we left an option to configure it
back :D

Tomek
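To make the discussion above concrete, a minimal sketch of the kind of
custom XCom backend being debated: pandas DataFrames go to GCS as csv,
everything else falls back to the standard JSON path. The bucket name is a
placeholder, csv is just one possible format, and the sketch is an
illustration rather than an agreed implementation (it leans on the GCSHook
helpers linked further down the thread):

    import uuid

    import pandas as pd

    from airflow.models.xcom import BaseXCom
    from airflow.providers.google.cloud.hooks.gcs import GCSHook


    class GCSXComBackend(BaseXCom):
        PREFIX = "xcom_gcs://"
        BUCKET_NAME = "my-xcom-bucket"  # placeholder - use your own bucket

        @staticmethod
        def serialize_value(value):
            if isinstance(value, pd.DataFrame):
                object_name = f"data_{uuid.uuid4()}.csv"
                # provide_file_and_upload yields a temporary file and uploads
                # its content to GCS when the context manager exits
                with GCSHook().provide_file_and_upload(
                    bucket_name=GCSXComBackend.BUCKET_NAME,
                    object_name=object_name,
                ) as tmp_file:
                    value.to_csv(tmp_file.name, index=False)
                # only the reference is stored in the metadata database
                value = GCSXComBackend.PREFIX + object_name
            return BaseXCom.serialize_value(value)

        @staticmethod
        def deserialize_value(result):
            result = BaseXCom.deserialize_value(result)
            if isinstance(result, str) and result.startswith(GCSXComBackend.PREFIX):
                object_name = result[len(GCSXComBackend.PREFIX):]
                # provide_file downloads the object to a temporary local file
                with GCSHook().provide_file(
                    bucket_name=GCSXComBackend.BUCKET_NAME,
                    object_name=object_name,
                ) as tmp_file:
                    result = pd.read_csv(tmp_file.name)
            return result

Such a class is enabled via the xcom_backend option in the [core] section of
airflow.cfg. Note how the format (csv) and the storage (GCS) are hard-wired
here - which is exactly the configurability problem raised above.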
On Thu, Dec 3, 2020 at 8:45 AM Sumit Maheshwari <sumeet.ma...@gmail.com> wrote:

> My 2 cents on this:
>
> What is wrong with having some code which can be used by multiple users?
> What is the point in 10s or 100s of companies maintaining their own
> implementations of things? Isn't that why we've separated providers from
> the core, so that the priority of maintaining them wouldn't be the same?
> Also, by this logic, why are we maintaining 100s of operators and hooks
> which aren't core to Airflow?
>
> This feature may have come 6 months ago, but I could not find any example
> of how to use it (I wasn't aware of the blog). I like the PR you've raised
> and I feel that it should be part of the GCP provider package itself, as
> anyone who wants to use it as-is has to figure out the logistics around
> packaging it and making it available to Airflow.
>
> I agree that the draft PR I raised isn't perfect, but I think instead of
> making it perfect on the first try, we can open it up for the community
> and let them contribute on the same, instead of everyone having their own
> code versions.
>
>
> On Thu, Dec 3, 2020 at 5:42 AM Tomasz Urbaszek <turbas...@apache.org>
> wrote:
>
>> I created a PR: https://github.com/apache/airflow/pull/12768
>>
>> On Wed, Dec 2, 2020 at 8:03 PM Gerard Casas Saez
>> <gcasass...@twitter.com.invalid> wrote:
>>
>>> Fair enough.
>>>
>>> I would argue then for having a small example (like the pandas-to-csv
>>> and GCS one) added to core as an example, maybe? It's fine to leave it
>>> in the examples folder otherwise.
>>>
>>> Gerard Casas Saez
>>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>>
>>>
>>> On Wed, Dec 2, 2020 at 11:51 AM Tomasz Urbaszek <turbas...@apache.org>
>>> wrote:
>>>
>>>> I think it's rather hard to decouple the serialization and persisting.
>>>> First of all, each data type (the input) may use a different
>>>> serializer. Then (assuming a 1-1 relation between data type and
>>>> serializer) each serializer may require different persisting logic
>>>> (dataframes to buckets, but jsons to redis). This multiplies the
>>>> number of possible combinations and problems. Also, each serializer
>>>> has to have a deserializer, so users would have to create custom
>>>> serialize and deserialize methods, which brings us exactly to what we
>>>> already have... :D
>>>>
>>>> Really, having tools like this one, which just take a file handle and
>>>> then store its content somewhere, makes it really simple to customize
>>>> XComs (
>>>> https://github.com/apache/airflow/blob/67acdbdf92a039ba424704e2f9b4dc67a914f3bb/airflow/providers/google/cloud/hooks/gcs.py#L321-L331
>>>> ).
>>>>
>>>> So, I'm still leaning towards good docs, examples and how-tos instead
>>>> of code that "works out of the box but you need to adjust it".
>>>>
>>>> On Wed, Dec 2, 2020 at 7:12 PM Daniel Standish <dpstand...@gmail.com>
>>>> wrote:
>>>>
>>>>> Just a thought...
>>>>>
>>>>> If flexibility with storage / serialization is desired, perhaps this
>>>>> could be accomplished with methods on the backend class.
>>>>>
>>>>> So you could have an XCom backend class that has methods like
>>>>> `push_dataframe` or `push_json` or something like that.
>>>>>
>>>>> And if you need flexibility, you can add and use these kinds of
>>>>> methods.
>>>>>
>>>>> But the only issue is we would need to make the XCom "handler"
>>>>> available to the task.
>>>>>
>>>>> So instead of (*or in addition to, for backward compatibility*)
>>>>> calling `self.xcom_pull` from the operator, you would call
>>>>> `self.xcom.pull`
>>>>> --- so the handler class, with all its methods, is available.
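Daniel's handler idea, sketched out - none of this is an existing Airflow
API; the class, the `self.xcom` attribute and the typed methods are all
hypothetical:

    import io
    import json

    import pandas as pd


    class XComHandler:
        """Hypothetical handler exposed to tasks as `self.xcom`."""

        def push(self, key, value):
            # default path: JSON-serialize, as base XCom does today
            self._store(key, json.dumps(value).encode("utf-8"))

        def push_dataframe(self, key, df: pd.DataFrame):
            # a typed method: the backend, not the user, picks the format
            buf = io.StringIO()
            df.to_csv(buf, index=False)
            self._store(key, buf.getvalue().encode("utf-8"))

        def pull(self, key):
            raise NotImplementedError  # symmetric typed pulls would live here

        def _store(self, key, data: bytes):
            raise NotImplementedError  # DB row, GCS object, redis key, ...

In an operator one would then write self.xcom.push_dataframe("df", df)
instead of self.xcom_push(...) - which is the backward-compatibility
question Daniel raises.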
>>>>>
>>>>> On Wed, Dec 2, 2020 at 9:55 AM Gerard Casas Saez
>>>>> <gcasass...@twitter.com.invalid> wrote:
>>>>>
>>>>>> Hi folks!
>>>>>>
>>>>>> Reading the conversation, I agree with Tomek.
>>>>>> At the same time, I see value in adding some options out of the box
>>>>>> for serialization and storage.
>>>>>>
>>>>>> I see there's a pattern here where we can decouple the storage
>>>>>> service (Redis, S3, GCS, Airflow DB...) and the serialization format
>>>>>> (pandas to csv, pickling, json...). If we decouple them, then we can
>>>>>> provide some options in core for each, and provide options to
>>>>>> configure them.
>>>>>>
>>>>>> Something like:
>>>>>>
>>>>>> [xcom]
>>>>>> storage_layer = [airflow.xcom.GCS]
>>>>>> serialization = [airflow.xcom.pandas2csv, airflow.xcom.jsondump]
>>>>>>
>>>>>> Where the XCom layer would use GCS to store
>>>>>> (gs://bucket-name/{dag_id}/{dagrun_id}/{ti_id}/{key}) and then to
>>>>>> serialize it would try to use pandas2csv first (if the class is
>>>>>> pandas) and fall back to json dump otherwise. This could be extended
>>>>>> to serialize as preferred (even using GCS/S3 and loading again) and
>>>>>> would allow adding some options to core while providing
>>>>>> extensibility.
>>>>>>
>>>>>> What do you all think?
>>>>>>
>>>>>>
>>>>>> Gerard Casas Saez
>>>>>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 2, 2020 at 10:25 AM Tomasz Urbaszek <turbas...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> > Then you could have XComBackendSerializationBackend
>>>>>>>
>>>>>>> That's definitely something we should avoid... :D
>>>>>>>
>>>>>>> On Wed, Dec 2, 2020 at 6:18 PM Daniel Standish <dpstand...@gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > You could add xcom serialization utils in airflow.utils
>>>>>>> >
>>>>>>> > Then you could have XComBackendSerializationBackend ;)
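Gerard's decoupling proposal above, sketched as code. The config keys,
module names and classes are hypothetical; the sketch only illustrates the
first-match dispatch he describes (pandas2csv if the value is a DataFrame,
json dump otherwise):

    import io
    import json

    import pandas as pd


    class PandasToCsv:
        def can_handle(self, value):
            return isinstance(value, pd.DataFrame)

        def serialize(self, value):
            buf = io.StringIO()
            value.to_csv(buf, index=False)
            return buf.getvalue().encode("utf-8")


    class JsonDump:
        def can_handle(self, value):
            return True  # catch-all fallback

        def serialize(self, value):
            return json.dumps(value).encode("utf-8")


    # order mirrors the hypothetical `serialization = [...]` config entry:
    # serializers are tried in order and the first match wins
    SERIALIZERS = [PandasToCsv(), JsonDump()]


    def serialize_for_xcom(value):
        for serializer in SERIALIZERS:
            if serializer.can_handle(value):
                return serializer.serialize(value)
        raise TypeError(f"No XCom serializer registered for {type(value)}")

The resulting bytes would then go to the configured storage layer under a
key like gs://bucket-name/{dag_id}/{dagrun_id}/{ti_id}/{key}. Deserialization
needs a symmetric registry, and each format may still want different
persisting logic - the combinatorial problem Tomek points out above.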