imo supporting stripping trailing 0 bytes from the bitset encoding is the way to go. It's both backwards compatible with existing serialized rows, as well as more space efficient.
On Tue, Oct 12, 2021 at 4:52 PM Reuven Lax <re...@google.com> wrote: > Of course. But if we write a spec that doesn't agree with the existing > Java coders, that is also futile. We can't easily change Java coders due to > update compatibility concerns, and we definitely need portability to work > with Java! > > On Tue, Oct 12, 2021 at 1:47 PM Robert Burke <rob...@frantil.com> wrote: > >> If it's not in the spec it's not Beam, because any alternative is Anti >> Portability ;) >> >> On Tue, Oct 12, 2021, 1:45 PM Reuven Lax <re...@google.com> wrote: >> >>> Row just uses the existing Java BitSetCoder, which predates the writing >>> of that spec :) >>> >>> On Tue, Oct 12, 2021 at 1:42 PM Robert Burke <rob...@frantil.com> wrote: >>> >>>> The null fields bitset encoder is defines in the pipeline runner proto >>>> here: >>>> https://github.com/apache/beam/blob/4b11efdf96ea4a471e078ec49906c40ef033aafb/model/pipeline/src/main/proto/beam_runner_api.proto#L976 >>>> >>>> Per my reading of the spec, the bit set must include the ceiling of >>>> num_fields/8 bytes, as it doesn't say "trailing bytes for non-nil in fields >>>> may be dropped". However it might be interpreted as that by the that an >>>> empty byte array indicating no nils. This is what go implements in the >>>> coder.WriteRowHeader and coder.ReadRowHeader functions. >>>> >>>> But that strikes me as a special case for fully populated rows, not a >>>> natural extension of a poorly phrased general rule. >>>> >>>> On Tue, Oct 12, 2021, 1:31 PM Reuven Lax <re...@google.com> wrote: >>>> >>>>> Do you think that BitSetCoder is incorrect? >>>>> >>>>> On Tue, Oct 12, 2021 at 1:27 PM Steve Niemitz <sniem...@apache.org> >>>>> wrote: >>>>> >>>>>> Yeah I believe they're all bugs/missing features in the python >>>>>> implementation. The nullable BitSet one is arguably a bug in the java >>>>>> implementation, but since there's no low-level spec on how Rows are >>>>>> actually encoded it's hard to say who's right. I think Go might have the >>>>>> same bug there, in which case that's two languages doing it "wrong" and >>>>>> one >>>>>> doing it "right". :P >>>>>> >>>>>> On Tue, Oct 12, 2021 at 4:20 PM Reuven Lax <re...@google.com> wrote: >>>>>> >>>>>>> These are bugs in Python, correct? >>>>>>> >>>>>>> On Tue, Oct 12, 2021 at 1:18 PM Steve Niemitz <sniem...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>>> It seems like there's a good amount of incompatibility between java >>>>>>>> and python wrt beam Rows. For example the following are unsupported in >>>>>>>> python (that I've noticed so far) >>>>>>>> - BYTE >>>>>>>> - INT16 >>>>>>>> - OneOf >>>>>>>> >>>>>>>> Additionally, it seems like nullable fields don't really work >>>>>>>> correctly, the java BitSetCoder won't encoding trailing empty bytes in >>>>>>>> the >>>>>>>> BitSet, but the python side is expecting every num_fields / 8 bytes to >>>>>>>> be >>>>>>>> present. [1] >>>>>>>> >>>>>>>> Certainly these are bugs, but in general it seems to point to a >>>>>>>> lack of integration testing for xlang interop in general. I plan on >>>>>>>> submitting PRs to fix the bugs above (or at least some of them), are >>>>>>>> there >>>>>>>> tests I can change to better exercise these paths? >>>>>>>> >>>>>>>> [1] >>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/row_coder.py#L198 >>>>>>>> >>>>>>>