Hi,

tl;dr There are two (or three) different representations used for the URN
"beam:logical_type:datetime:v1" in Beam Java SDK. Clarification or clean up
is needed.

I recently try to resolve a long-time issue
https://github.com/apache/beam/issues/19817 for the datetime logical type
cross-language support. However I notice that currently in Java SDK this
URN is referred in two places:

(1)
https://github.com/apache/beam/blob/cf9ea1f442636f781b9f449e953016bb39622781/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/logicaltypes/DateTime.java#L49
where it has a representation of Row {Date: INT64, Time: INT64}

(2)
https://github.com/apache/beam/blob/cf9ea1f442636f781b9f449e953016bb39622781/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaTranslation.java#L67
where it is represented by a single INT64

Moreover, there is a third, "actual" representation that is used when one
encodes a joda.DateTime or joda.Instant object in a Row, which is an int64
type encoded with fixed-size big endian. Note that this is different from
(2) because INT64 will be decoded using a VarInt coder which causes data
corruption when other sdk (e.g. python) pulls rows from java ptransform.

To resolve this I try to implement the "actual" representation of logical
type "beam:logical_type:datetime:v1" (
https://github.com/apache/beam/pull/22561 for contexts). It then becomes
necessary to resolve this inconsistency because we have reached a point of
adding a known logical type in schemas.proto. I am consider some solutions:

i. Rename the logical type of case (2) as "beam:logical_type:instant:v1" as
suggested by Brian in an earlier review.

ii. "beam:logical_type:instant:v1" is still backed by INT64, but in
implementation it will use BigEndianLongCoder to encode/decode the stream.

For the second step ii, the problem is that there is a primitive type
backed by a fixed length integer coder. Currently INT8, INT16, INT32,
INT64... are all backed by VarInt (and there is ongoing work to use fixed
size big endian to encode INT8, INT16 (
https://github.com/apache/beam/issues/19815)). Ideally I would think (INT8,
INT16, INT32, INT64) are all fixed and having a generic (INT) primitive
type is backed by VarInt. But this may be a more substantial change for the
current code base.

I would like to have opinions from the community. Thanks for your attention!

Regards,
Yi

-- 

Yi Hu, (he/him/his)

Software Engineer

Reply via email to