> We do not use pickling by default as we roll our own serialization format. We probably just need the key (zoneinfo.key).
Yeah, for sure we don't need to rely on pickling, if I don't miss anything than in Airflow 3 pickling serialisation should be removed. I just use as reference info which might be useful in our serialisation. There are two things that can become a nuisance in our serialisation mechanism: 1. Same as pickling if initially object created from ZoneInfo.from_file(...) than key would be None. How to fix? Depends on the case, but in general we could mention that we expected regular one 2. We would have fully qualified name backports.zoneinfo.ZoneInfo in Python 3.8 and zoneinfo.ZoneInfo in Python 3.9+. I think with some simple workaround it also could be solved ---- Best Wishes *Andrey Anshin* On Thu, 28 Sept 2023 at 17:12, Bolke de Bruin <[email protected]> wrote: > for serialization I am not too worried about ZoneInfo. We do not use > pickling by default as we roll our own serialization format. We probably > just need the key (zoneinfo.key). > > I'm not sure what happened about this: > > https://github.com/sdispater/pendulum/issues/590 > > Bolke > > On Thu, 28 Sept 2023 at 14:59, Andrey Anshin <[email protected]> > wrote: > > > I agree with all problems that you mention about datetime tz-aware data. > > I lived for almost 30 years in a country which had in different periods > of > > time up to 10 time zones, and on a regular basis changed it > (merge/unmerge) > > , disable DST, temporarily enable DST. In addition I also worked in a > > different bank for about 10 years (legacy systems which don't update > tzdata > > for ages) . I think I had most of the bad cases with time zones. And I > > think everyone somehow has a problem with different time zones: > Calendars + > > events, flight booking systems which don't know about timezones and you > > might find that your connecting flight flew away an hour ago, etc. > > > > In addition the error might happen in different places, databases (not > > updated tzdata, or db doesn't work correctly), client libraries, OS, etc. > > The person who finally solves tz-aware data should be granted all awards > in > > the World. > > > > > For example, we got recently bitten by datetime.tzname() (which is > > supposed > > to 'time zone name') returning short-hand notation timezones (e.g. PST) > > > instead of full timezone names (e.g. "Europe/Amsterdam") which makes > > deserialization non deterministic. > > > > Yeah, and even ZoneInfo doesn't solve the problem with `datetime.tzname` > > because final implementation depends on different factors, tzinfo > > implementation and internals of datetime. > > > > > moving to zoneinfo seems to make sense though and will also be in > > Pendulum 3 > > > > I've have a look couple days ago about zoneinfo, it also have some > > "pitfalls", e.g. if timezone created from file it can't be easily > > serialized > > https://docs.python.org/3.9/library/zoneinfo.html#the-zoneinfo-class > > > > > Pendulum has proven us in the past, maybe we indeed should help the > > project if possible and if that isn't possible verify formal correctness > of > > any other library > > > > I guess all other libraries might have a different kind of issue > including > > compatibility with databases. > > More close replacement it is dateutil, but it also maintained by one > person > > last release was 2 years ago and contains quite a few issues with > > timezones/DTS (no blame, that is just a fact) > > > > > > On Thu, 28 Sept 2023 at 15:39, Bolke de Bruin <[email protected]> wrote: > > > > > Thanks for starting the discussion Andrey. > > > > > > Some background on the choice for Pendulum at the time. In the early > days > > > of Airflow it wasn't timezone aware. Originating from Airbnb which had > a > > > reasonable mature data organization the view was everything needs to be > > in > > > UTC. According to Maxime the engineers would dream in UTC ;-). However, > > in > > > the real world which also needs to deal with legacy that didn't hold. > > Often > > > systems of record did not store timezone information but were localized > > > nevertheless. Cutoff times in banks happen in localized time and if you > > > want to meet those, Airflow needed to do better. > > > > > > Doing timezones and being timezone aware proved to be exceptionally > hard. > > > Many libraries get it wrong [1] and fail silently (i.e. Arrow) or apply > > DST > > > transitions wrongly (pytz). When dealing with payments that stuff > cannot > > > happen. To make things worse, in Python timezone support is pretty > > > convoluted, while some standardization happened in 3.9 by using IANA > > > provided timezone information from the local system, its API is messy. > > For > > > example, we got recently bitten by datetime.tzname() (which is > > > supposed to 'time > > > zone name') returning short-hand notation timezones (e.g. PST) instead > of > > > full timezone names (e.g. "Europe/Amsterdam") which makes > deserialization > > > non deterministic. > > > > > > So, what I am trying to say, is tread carefully when doing changes as > > > proposed in [2] (moving to zoneinfo seems to make sense though and will > > > also be in Pendulum 3). Make sure those changes are formally correct > and > > > don't assume because they are now part of python itself (pytz was the > > > defacto standard for a long time). Pendulum has proven us in the past, > > > maybe we indeed should help the project if possible and if that isn't > > > possible verify formal correctness of any other library. > > > > > > Bolke > > > > > > [1] https://pendulum.eustace.io/faq/ > > > [2] https://github.com/apache/airflow/issues/19450 > > > > > > On Thu, 28 Sept 2023 at 11:03, Andrey Anshin <[email protected] > > > > > wrote: > > > > > > > This discussion is more about the known problem of pendulum and how > we > > > > could deal with it and maybe how we (as Community) might help autor. > > > > > > > > The library is mostly supported by a single author Sébastien Eustace > ( > > > > https://github.com/sdispater) and it seems like we bump into the > > > situation > > > > which is described in xkcd #2347 ( > > > > https://imgs.xkcd.com/comics/dependency.png). To be honest it is not > > > > something new when library mainly supported by one author so there is > > > > always a risk that the library will no longer be supported / > abandoned > > > > And if takes in account that pendulum provides core functionality in > > > > Airflow it could have dramatical impact in the future. > > > > > > > > Pendulum is a really nice library which helps a lot of developers to > > work > > > > with dates/datetimes. However there is one major problem, the last > > > release > > > > of this library happened more than 3 years ago ( > > > > https://pypi.org/project/pendulum/#history) in the time when Airflow > > > > 1.10.11 was released > > > > > > > > Fortunately, the project is not abandoned and on a regular basis > > commits > > > > add into the master branch. However these commits are not included > into > > > any > > > > final release and that's why some things related to datetime don't > work > > > as > > > > expected in Airflow. There are list of known (for me) issues which > are > > > > affect Airflow > > > > > > > > *Memory Leak on parse*: > > > > - https://github.com/sdispater/pendulum/issues/720, this one fixed > 2 > > > > years > > > > ago but not available yet ( > > > https://github.com/sdispater/pendulum/pull/563 > > > > ). > > > > Since we use parse dates in airflow codebase: datetime parameters and > > > > datetime in logs this one could be a reason for memory leakage in > > > Airflow: > > > > - https://github.com/apache/airflow/discussions/24694 > > > > - https://github.com/apache/airflow/discussions/28597 > > > > > > > > *Incorrect time zones*, known issues and should be already fixed in > > > master > > > > branch > > > > - https://github.com/sdispater/pendulum/issues/700, Mexico do not > use > > > DST > > > > anymore > > > > - https://github.com/sdispater/pendulum/issues/706, Egypt reinstate > > DST > > > > > > > > We add clarification in https://github.com/apache/airflow/pull/30467 > , > > > > however it seems like there is no other way rather than patching > > Pendulum > > > > right now. > > > > > > > > All these issues should be solved as soon as pendulum 3 is released. > > The > > > > current announced estimation is end of september/ beginning of > October: > > > > > > https://github.com/sdispater/pendulum/issues/600#issuecomment-1711299677 > > > > > > > > So in theory we would have a fixed version of pendulum soon, and it > > might > > > > break something in Airflow but from my point of view it is better > than > > > > current status. > > > > > > > > However there might be a situation where the release of the pendulum > > > would > > > > be postponed, so maybe better to have a backup plan. What could we do > > in > > > > this case? > > > > > > > > Maybe we should start to use zoneinfo.ZoneInfo instead of pendulum > > > > datetime? https://github.com/apache/airflow/issues/19450 > > > > Pros: > > > > - stdlib (python 3.9+) > > > > - In pendulum 3.0 Timezone based on zoneinfo.Zoneinfo > > > > > > > > Cons: > > > > - Current serialization model can't deal with backport packages. E.g. > > > > timezone which are serialized in backport_zoneinfo can't be > > deserialized > > > in > > > > zoneinfo > > > > > > > > Maybe we should replace parse datetime with another solution. Does > > anyone > > > > know a good replacement? > > > > > > > > Maybe someone from Airflow Community could propose their help with > > > > maintenance of library: > > > > - https://github.com/sdispater/pendulum/issues/590 > > > > > > > > Maybe we should get rid of the pendulum at all, as a last resort > > > solution. > > > > I can't imagine how we could do that, because a lot of stuff depends > on > > > the > > > > pendulum and removing it would be a breaking change. > > > > > > > > ---- > > > > Best Wishes > > > > *Andrey Anshin* > > > > > > > > > > > > > -- > > > > > > -- > > > Bolke de Bruin > > > [email protected] > > > > > > > > -- > > -- > Bolke de Bruin > [email protected] >
