imo supporting stripping trailing 0 bytes from the bitset encoding is the
way to go.  It's both backwards compatible with existing serialized rows,
as well as more space efficient.

On Tue, Oct 12, 2021 at 4:52 PM Reuven Lax <re...@google.com> wrote:

> Of course. But if we write a spec that doesn't agree with the existing
> Java coders, that is also futile. We can't easily change Java coders due to
> update compatibility concerns, and we definitely need portability to work
> with Java!
>
> On Tue, Oct 12, 2021 at 1:47 PM Robert Burke <rob...@frantil.com> wrote:
>
>> If it's not in the spec it's not Beam,  because any alternative is Anti
>> Portability ;)
>>
>> On Tue, Oct 12, 2021, 1:45 PM Reuven Lax <re...@google.com> wrote:
>>
>>> Row just uses the existing Java BitSetCoder, which predates the writing
>>> of that spec :)
>>>
>>> On Tue, Oct 12, 2021 at 1:42 PM Robert Burke <rob...@frantil.com> wrote:
>>>
>>>> The null fields bitset encoder is defines in the pipeline runner proto
>>>> here:
>>>> https://github.com/apache/beam/blob/4b11efdf96ea4a471e078ec49906c40ef033aafb/model/pipeline/src/main/proto/beam_runner_api.proto#L976
>>>>
>>>> Per my reading of the spec, the bit set must include the ceiling of
>>>> num_fields/8 bytes, as it doesn't say "trailing bytes for non-nil in fields
>>>> may be dropped". However it might be interpreted as that by the that an
>>>> empty byte array indicating no nils.  This is what go implements in the
>>>> coder.WriteRowHeader and coder.ReadRowHeader functions.
>>>>
>>>> But that strikes me as a special case for fully populated rows, not a
>>>> natural extension of a poorly phrased general rule.
>>>>
>>>> On Tue, Oct 12, 2021, 1:31 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> Do you think that BitSetCoder is incorrect?
>>>>>
>>>>> On Tue, Oct 12, 2021 at 1:27 PM Steve Niemitz <sniem...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Yeah I believe they're all bugs/missing features in the python
>>>>>> implementation.  The nullable BitSet one is arguably a bug in the java
>>>>>> implementation, but since there's no low-level spec on how Rows are
>>>>>> actually encoded it's hard to say who's right.  I think Go might have the
>>>>>> same bug there, in which case that's two languages doing it "wrong" and 
>>>>>> one
>>>>>> doing it "right". :P
>>>>>>
>>>>>> On Tue, Oct 12, 2021 at 4:20 PM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> These are bugs in Python, correct?
>>>>>>>
>>>>>>> On Tue, Oct 12, 2021 at 1:18 PM Steve Niemitz <sniem...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> It seems like there's a good amount of incompatibility between java
>>>>>>>> and python wrt beam Rows.  For example the following are unsupported in
>>>>>>>> python (that I've noticed so far)
>>>>>>>> - BYTE
>>>>>>>> - INT16
>>>>>>>> - OneOf
>>>>>>>>
>>>>>>>> Additionally, it seems like nullable fields don't really work
>>>>>>>> correctly, the java BitSetCoder won't encoding trailing empty bytes in 
>>>>>>>> the
>>>>>>>> BitSet, but the python side is expecting every num_fields / 8 bytes to 
>>>>>>>> be
>>>>>>>> present. [1]
>>>>>>>>
>>>>>>>> Certainly these are bugs, but in general it seems to point to a
>>>>>>>> lack of integration testing for xlang interop in general.  I plan on
>>>>>>>> submitting PRs to fix the bugs above (or at least some of them), are 
>>>>>>>> there
>>>>>>>> tests I can change to better exercise these paths?
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/row_coder.py#L198
>>>>>>>>
>>>>>>>

Reply via email to