Seems to me that Only Map<bytes[], ?>'s quality check cannot be solved by deepEquals because Keys cannot be looked up correctly in Map<bytes[], ?>. If we cannot have a useful use case for Map<bytes[], ?>, we could reject it in Schema and still keep byte[].
The option3 needs to find a wrapper of byte[] that is language-independent & encoding-independent for portability, which seems a hard searching (and not possible?) process. -Rui On Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov <[email protected]> wrote: > There is an indirect connection to RowCoder because `MapCoder` isn't > deterministic, therefore, this doesn't hold: > > > - also each type (hence Row type) should have portable encoding(s) that > respect this equality so shuffling is consistent > > I think it's a requirement only for rows we want to shuffle by. > > > About these specific use cases, how useful is it to support Map<byte[], > ?> and List<byte[]>? > > Not sure about Map, but in BigQuery it's possible to define `ARRAY<BYTES>` > type. It can group by BYTES, but not by ARRAYS. > > > On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin <[email protected]> wrote: > >> About these specific use cases, how useful is it to support Map<byte[], >> ?> and List<byte[]>? These seem pretty exotic (maybe they aren't) and I >> wonder whether it would make sense to just reject them until we have a >> solid design. >> >> And wouldn't the same problems arise even without RowCoder? Is the path >> in that case to implement a custom coder? >> >> Regards, >> Anton >> >> >> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles <[email protected]> wrote: >> >>> I'll summarize my input to the discussion. It is rather high level. But >>> IMO: >>> >>> - even though schemas are part of Beam Java today, I think they should >>> become part of portability when ready >>> - so each type in a schema needs a language-independent & >>> encoding-independent notion of domain of values and equality (so obviously >>> equal bytes are equal) >>> - embedding in any language (hence Row in Java) must have a schema >>> type-driven equality that matches this spec >>> - also each type (hence Row type) should have portable encoding(s) that >>> respect this equality so shuffling is consistent >>> - Row in Java should be able to decode these encodings to different >>> underlying representations and change its strategy over time >>> >>> Kenn >>> >>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov <[email protected]> wrote: >>> >>>> With adding BYTES type, we broke equality. >>>> `RowCoder#consistentWithEquals` is always true, but this property doesn't >>>> hold for exotic types such as `Map<BYTES, ?>`, `List<BYTES>`. The root >>>> cause is `byte[]`, where `equals` is implemented as reference equality >>>> instead of structural. >>>> >>>> Before we jump into solution mode, let's state what we want to have: >>>> - *API* have stable API and be able to evolve efficient and use-case >>>> specific implementations without breaking it >>>> - *Correctness *we can't trade off correctness, a trivial >>>> implementation should work >>>> - *Performance *comparing equality is a fundamental operation, and we >>>> want to make it cheap >>>> >>>> 1. set `consistentWithEquals` if there is BYTES field >>>> Pros: almost no pros >>>> Cons: It would introduce a significant number of allocations when >>>> comparing rows, so we reject this option. >>>> >>>> 2. implement custom deep equals in `Row#equals` >>>> Pros: good performance, doesn't change API, `Row#equals` is correct >>>> Cons: doesn't work for `Map<byte[], ?>`, unless we roll own >>>> implementation >>>> Cons: it's possible to obtain `List<byte[]>` from `getValue()` that has >>>> broken equality, contains, etc, unless we roll own implementation >>>> Cons: non-trivial and requires ~200LOC to implement >>>> >>>> 3. wrapping byte[] into Java object with fixed equality (e.g., >>>> StructuralByteArray) >>>> Pros: good performance and flexible to change how Java wrapper is >>>> implemented >>>> Pros: simple, doesn't require any specialized collections, no >>>> surprises, `Map<byte[], ?>` and `List<byte[]>` work. >>>> Cons: will change the return type of `Row#getValue` >>>> >>>> I want to suggest going with option #3. However, it isn't completely >>>> clear what wrapper we want to use, either it could be StructuralByteArray, >>>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4 >>>> additional integer fields. StructuralByteArray doesn't have anything not >>>> necessary. One option would be adding `Row#getByteBuffer` that would be >>>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can >>>> override it for better performance, but `getValue(i)` must return >>>> StructuralByteArray. >>>> >>>> References: >>>> - [BEAM-5866] Fix `Row#equals`, >>>> https://github.com/apache/beam/pull/6845 >>>> - [BEAM-5646] Fix quality and hashcode for bytes in Row, >>>> https://github.com/apache/beam/pull/6765 >>>> >>>> Gleb >>>> >>> > > -- > Cheers, > Gleb >
