Happy to kill pickling for 2.0. While starting to review the HA change, I
see that we rely more and more on serialization, and there are some rather
weird-looking leftovers in Ash's change that are only there because of
pickling.

I think we already know that serialization becomes a first-class citizen in
Airflow 2.0. The first versions of serialization had some teething
problems, but most of them have already been addressed (the most
interesting one was an increase of a few orders of magnitude in outbound
traffic from Airflow to the DB, but I believe that is already fixed).
If we think that what pickling was used for can be handled entirely by
serialization, I am all for killing pickling and focusing 100% on
serialization improvements, testing, and making it rock solid.
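
To make the trade-off concrete, here is a minimal sketch (not Airflow's
actual serialization code). It shows that unpickling calls whatever callable
the payload names, while a JSON round trip only ever yields plain data. The
names Payload, ALLOWED_FIELDS, to_json and from_json are hypothetical and
exist only for this illustration.

```python
import json
import pickle


class Payload:
    """Stand-in for untrusted data; a real attack would name a harmful callable."""

    def __reduce__(self):
        # pickle stores "call len('pickle ran this')" instead of object state.
        return (len, ("pickle ran this",))


# Unpickling invokes the callable chosen by whoever produced the bytes,
# so we get back the return value of len(), not a Payload instance.
result = pickle.loads(pickle.dumps(Payload()))
assert not isinstance(result, Payload)

# Hypothetical allowlist: only these named, JSON-safe fields are serialized.
ALLOWED_FIELDS = ("task_id", "retries")


def to_json(task_fields: dict) -> str:
    """Serialize only the allowlisted fields to a JSON string."""
    kept = {k: task_fields[k] for k in ALLOWED_FIELDS if k in task_fields}
    return json.dumps(kept, sort_keys=True)


def from_json(blob: str) -> dict:
    """Load JSON; the result contains only dict/list/str/number/bool/None."""
    data = json.loads(blob)
    return {k: data[k] for k in ALLOWED_FIELDS if k in data}
```

With an explicit allowlist we always know what is in the blob and roughly
how big it can get, which is exactly what we cannot say about a pickle.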

J.

On Fri, Sep 18, 2020 at 11:59 PM Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> Are there any use-cases that REQUIRE pickle? Do we have any sense of what
> % of the Airflow community depends on Pickle? I’m all for killing it if
> possible but I want to make sure we’re not setting up a major hurdle for
> migration.
>
> On Fri, Sep 18, 2020 at 2:50 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> I'm getting bad flashbacks of fighting with pickles early on in the history
> of the project. I've learned since then to stay away. Almost all solutions
> that involve pickles are bad solutions. Beyond but related to the security
> implication are the issues of pickle entanglement, not really knowing
> what's in the pickle and how big it might get, and how it may affect the
> environment it's deserialized into.
>
> 2.0 is a great time to kill pickles with fire.
>
> On Fri, Sep 18, 2020 at 5:01 AM Kaxil Naik <kaxiln...@apache.org> wrote:
>
> > Hi all,
> >
> > We briefly discussed how pickling is currently used in Airflow codebase
> > and whether or not we should remove it for 2.0 in the Airflow 2.0 Dev
> > call this Monday.
> >
> > Currently, AFAIK only *CeleryExecutor* supports pickling (code
> > <https://github.com/apache/airflow/blob/master/airflow/executors/executor_loader.py#L122-L126>).
> > We also have a flag on *airflow scheduler*
> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#scheduler> CLI
> > command (*--do-pickle*) and "*--ship-dag*" on *airflow tasks run*
> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#run> command.
> >
> > If we want to remove pickling, I think Airflow 2.0 is the right time.
> >
> > We have also deprecated the use of pickling in XComs.
> >
> > https://docs.python.org/3/library/pickle.html -- lists some items on the
> > security implications of pickle and comparisons with JSON.
> >
> > Another alternative is using *cloudpickle*
> > <https://github.com/cloudpipe/cloudpickle> (used by PySpark) instead
> > of *pickle*. It suffers from the same security issues as *pickle* but
> > does have some more features compared to pickle.
> >
> > What do you all think?
> >
> > Regards,
> > Kaxil
> >



-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
