Happy to kill pickling for 2.0. While starting to review the HA change, I see
that we rely more and more on serialization, and there are some rather
weird-looking leftovers in Ash's change that are only there because of
pickling.
I think we already know that serialization becomes a first-class citizen in
Airflow 2.0. We also know that the first versions of serialization had some
teething problems, most of which have already been addressed (the most
interesting one was a few orders of magnitude increase in outbound traffic
from Airflow to the DB, but I believe it is already fixed). If we think that
what pickling was used for can be handled entirely by serialization, I am all
for killing pickling and instead focusing 100% on serialization improvements,
testing, and making it rock solid.

J.

On Fri, Sep 18, 2020 at 11:59 PM Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> Are there any use-cases that REQUIRE pickle? Do we have any sense of what
> % of the Airflow community depends on Pickle? I’m all for killing it if
> possible but I want to make sure we’re not setting up a major hurdle for
> migration.
>
> On Fri, Sep 18, 2020 at 2:50 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> I'm getting bad flashbacks of fighting with pickles early on in the history
> of the project. I've learned since then to stay away. Almost all solutions
> that involve pickles are bad solutions. Beyond but related to the security
> implication are the issues of pickle entanglement, not really knowing
> what's in the pickle and how big it might get, and how it may affect the
> environment it's deserialized into.
>
> 2.0 is a great time to kill pickles with fire.
>
> On Fri, Sep 18, 2020 at 5:01 AM Kaxil Naik <kaxiln...@apache.org> wrote:
>
> > Hi all,
> >
> > We briefly discussed how pickling is currently used in Airflow codebase and
> > whether or not we should remove it for 2.0 in the Airflow 2.0 Dev call this
> > Monday.
> >
> > Currently, AFAIK only *CeleryExecutor* supports pickling (code
> > <https://github.com/apache/airflow/blob/master/airflow/executors/executor_loader.py#L122-L126>).
> > We also have a flag on *airflow scheduler
> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#scheduler>* CLI
> > command (*--do-pickle*) and "*--ship-dag*" on *airflow tasks run
> > <https://airflow.readthedocs.io/en/latest/cli-ref.html#run>* command.
> >
> > If we want to remove pickling, I think Airflow 2.0 is the right time.
> >
> > We have also deprecated the use of pickling in XComs.
> >
> > https://docs.python.org/3/library/pickle.html -- lists some items on the
> > security implications of pickle and comparisons with JSON.
> >
> > Another alternative is using *cloudpickle
> > <https://github.com/cloudpipe/cloudpickle>* (used by PySpark) instead
> > of *pickle*; it suffers from the same security issues as *pickle* but does
> > have some more features compared to pickle.
> >
> > What do you all think?
> >
> > Regards,
> > Kaxil

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129