Seems to me that Only Map<bytes[], ?>'s quality check cannot be solved by
deepEquals because Keys cannot be looked up correctly in Map<bytes[], ?>.
If we cannot have a useful use case for Map<bytes[], ?>,  we could reject
it in Schema and still keep byte[].

The option3 needs to find a wrapper of byte[] that is language-independent
& encoding-independent for portability, which seems a hard searching (and
not possible?) process.

-Rui

On Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov <[email protected]> wrote:

> There is an indirect connection to RowCoder because `MapCoder` isn't
> deterministic, therefore, this doesn't hold:
>
> >  - also each type (hence Row type) should have portable encoding(s) that
> respect this equality so shuffling is consistent
>
> I think it's a requirement only for rows we want to shuffle by.
>
> > About these specific use cases, how useful is it to support Map<byte[],
> ?> and List<byte[]>?
>
> Not sure about Map, but in BigQuery it's possible to define `ARRAY<BYTES>`
> type. It can group by BYTES, but not by ARRAYS.
>
>
> On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin <[email protected]> wrote:
>
>> About these specific use cases, how useful is it to support Map<byte[],
>> ?> and List<byte[]>? These seem pretty exotic (maybe they aren't) and I
>> wonder whether it would make sense to just reject them until we have a
>> solid design.
>>
>> And wouldn't the same problems arise even without RowCoder? Is the path
>> in that case to implement a custom coder?
>>
>> Regards,
>> Anton
>>
>>
>> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> I'll summarize my input to the discussion. It is rather high level. But
>>> IMO:
>>>
>>>  - even though schemas are part of Beam Java today, I think they should
>>> become part of portability when ready
>>>  - so each type in a schema needs a language-independent &
>>> encoding-independent notion of domain of values and equality (so obviously
>>> equal bytes are equal)
>>>  - embedding in any language (hence Row in Java) must have a schema
>>> type-driven equality that matches this spec
>>>  - also each type (hence Row type) should have portable encoding(s) that
>>> respect this equality so shuffling is consistent
>>>  - Row in Java should be able to decode these encodings to different
>>> underlying representations and change its strategy over time
>>>
>>> Kenn
>>>
>>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov <[email protected]> wrote:
>>>
>>>> With adding BYTES type, we broke equality.
>>>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>>>> hold for exotic types such as `Map<BYTES, ?>`, `List<BYTES>`. The root
>>>> cause is `byte[]`, where `equals` is implemented as reference equality
>>>> instead of structural.
>>>>
>>>> Before we jump into solution mode, let's state what we want to have:
>>>> - *API* have stable API and be able to evolve efficient and use-case
>>>> specific implementations without breaking it
>>>> - *Correctness *we can't trade off correctness, a trivial
>>>> implementation should work
>>>> - *Performance *comparing equality is a fundamental operation, and we
>>>> want to make it cheap
>>>>
>>>> 1. set `consistentWithEquals` if there is BYTES field
>>>> Pros: almost no pros
>>>> Cons: It would introduce a significant number of allocations when
>>>> comparing rows, so we reject this option.
>>>>
>>>> 2. implement custom deep equals in `Row#equals`
>>>> Pros: good performance, doesn't change API, `Row#equals` is correct
>>>> Cons: doesn't work for `Map<byte[], ?>`, unless we roll own
>>>> implementation
>>>> Cons: it's possible to obtain `List<byte[]>` from `getValue()` that has
>>>> broken equality, contains, etc, unless we roll own implementation
>>>> Cons: non-trivial and requires ~200LOC to implement
>>>>
>>>> 3. wrapping byte[] into Java object with fixed equality (e.g.,
>>>> StructuralByteArray)
>>>> Pros: good performance and flexible to change how Java wrapper is
>>>> implemented
>>>> Pros: simple, doesn't require any specialized collections, no
>>>> surprises, `Map<byte[], ?>` and `List<byte[]>` work.
>>>> Cons: will change the return type of `Row#getValue`
>>>>
>>>> I want to suggest going with option #3. However, it isn't completely
>>>> clear what wrapper we want to use, either it could be StructuralByteArray,
>>>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
>>>> additional integer fields. StructuralByteArray doesn't have anything not
>>>> necessary. One option would be adding `Row#getByteBuffer` that would be
>>>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
>>>> override it for better performance, but `getValue(i)` must return
>>>> StructuralByteArray.
>>>>
>>>> References:
>>>> - [BEAM-5866] Fix `Row#equals`,
>>>> https://github.com/apache/beam/pull/6845
>>>> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
>>>> https://github.com/apache/beam/pull/6765
>>>>
>>>> Gleb
>>>>
>>>
>
> --
> Cheers,
> Gleb
>

Reply via email to