I've been working with a few data types that are in practice
unpicklable and I've run into a couple issues stemming from the `Any` type
hint, which when used, will result in the PickleCoder getting used even if
there's a coder in the coder registry that matches the data element.

This was pretty unexpected to me and can result in pretty cryptic
downstream issues. In the best case, you get an error at pickling time [1],
and in the worse case, the pickling "succeeds" (since many objects can get
(de)pickeld without obvious error) but then results in downstream issues
(e.g. some data doesn't survive depickling). One example case of the latter
is if you flatten a few pcollections including a pcollection generated by
`beam.Create([])` (the inferred output type an empty create becomes Any)

Would it make sense to introduce a new fallback coder that takes precedence
over the `PickleCoder` that encodes both the data type (by just pickling
it) and the data encoded using the registry-found coder? This would have
some space ramifications for storing the data type for every element. Of
course this coder would only kick in _if_ the data type was found in the
registry, otherwise we'd proceed to the picklecoder like we do currently

[1] https://github.com/apache/beam/issues/29908 (Issue arises from
ReshuffleFromKey using `Any` as a pcollection type

Reply via email to