On Thu, Aug 11, 2022 at 8:50 AM Yi Hu via dev <dev@beam.apache.org> wrote:
> Hi, > > tl;dr There are two (or three) different representations used for the URN > "beam:logical_type:datetime:v1" in Beam Java SDK. Clarification or clean up > is needed. > > I recently try to resolve a long-time issue > https://github.com/apache/beam/issues/19817 for the datetime logical type > cross-language support. However I notice that currently in Java SDK this > URN is referred in two places: > > (1) > https://github.com/apache/beam/blob/cf9ea1f442636f781b9f449e953016bb39622781/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/logicaltypes/DateTime.java#L49 > where it has a representation of Row {Date: INT64, Time: INT64} > > (2) > https://github.com/apache/beam/blob/cf9ea1f442636f781b9f449e953016bb39622781/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaTranslation.java#L67 > where it is represented by a single INT64 > > Moreover, there is a third, "actual" representation that is used when one > encodes a joda.DateTime or joda.Instant object in a Row, which is an int64 > type encoded with fixed-size big endian. Note that this is different from > (2) because INT64 will be decoded using a VarInt coder which causes data > corruption when other sdk (e.g. python) pulls rows from java ptransform. > > To resolve this I try to implement the "actual" representation of logical > type "beam:logical_type:datetime:v1" ( > https://github.com/apache/beam/pull/22561 for contexts). It then becomes > necessary to resolve this inconsistency because we have reached a point of > adding a known logical type in schemas.proto. I am consider some solutions: > > i. Rename the logical type of case (2) as "beam:logical_type:instant:v1" > as suggested by Brian in an earlier review. > +1 > > ii. "beam:logical_type:instant:v1" is still backed by INT64, but in > implementation it will use BigEndianLongCoder to encode/decode the stream. > > Is this to be compatible with the current Java implementation ? And we have to update other SDKs to use big endian coders when encoding/decoding the "beam:logical_type:instant:v1" logical type ? > For the second step ii, the problem is that there is a primitive type > backed by a fixed length integer coder. Currently INT8, INT16, INT32, > INT64... are all backed by VarInt (and there is ongoing work to use fixed > size big endian to encode INT8, INT16 ( > https://github.com/apache/beam/issues/19815)). Ideally I would think > (INT8, INT16, INT32, INT64) are all fixed and having a generic (INT) > primitive type is backed by VarInt. But this may be a more substantial > change for the current code base. > I'm a bit confused by this. Did you mean that there's *no* primitive type backed by a fixed length integer coder ? Also, by primitive, I'm assuming you mean Beam Schema types here. Thanks, Cham > > I would like to have opinions from the community. Thanks for your > attention! > > Regards, > Yi > > -- > > Yi Hu, (he/him/his) > > Software Engineer > > >