lazarillo commented on issue #32781:
URL: https://github.com/apache/beam/issues/32781#issuecomment-2414124334
I have the following notes in my internal repo, which might give some idea of why this sort of feature is useful.
As anyone reading this thread is surely aware, the way that Google handles timestamps across its products is all over the place.
I keep these notes in our repo so that I (or anyone else) won't forget all of these idiosyncrasies:
---
## Working with timestamps in Dataflow (and BigQuery and Pub/Sub and
protobufs)
Timestamp defaults are *all over the place* within the Googleverse.
- When working with protobufs, the most common representation is a protobuf `Timestamp` message, which has the fields `seconds` and `nanos`. (Note: it is `nanos`, not `nanoseconds`.)
- When working with the Pub/Sub-to-BigQuery direct subscription connector,
you _cannot use_ a protobuf `Timestamp` message, and you must instead provide
an integer which is the _number of microseconds_ since epoch (1970-01-01). Yes,
**microseconds**.
- When working within Dataflow itself (in Python), since there is no direct means to `WriteToBigQuery` with a protobuf message, you first have to convert to JSON or a `dict` or something similar. In this case (when working with JSON or a `dict`, which is pushed to a JSON representation behind the scenes as best I can tell), the timestamp must be either an integer which is the _number of seconds_ since epoch **or a float which is the _number of milliseconds_ since epoch**. So the expected resolution depends upon the primitive type! Best to avoid using a numeric timestamp in this case.
**Yes, you read that correctly: a timestamp in the Googleverse must be represented as nanoseconds, microseconds, milliseconds, or seconds, depending upon the resource you're talking to and the data type of the timestamp itself (and maybe the language you're using).**
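To make the mismatch concrete, here is a small stdlib-only sketch (the instant chosen is illustrative) that derives the same moment in each unit a given Google service might expect:

```python
from datetime import datetime, timezone

# One instant, 2021-01-01T00:00:00Z, expressed in each unit mentioned above.
dt = datetime(2021, 1, 1, tzinfo=timezone.utc)

epoch_seconds = int(dt.timestamp())       # Dataflow/BigQuery JSON path (int -> seconds)
epoch_millis = epoch_seconds * 1_000      # Dataflow/BigQuery JSON path (float -> milliseconds)
epoch_micros = epoch_seconds * 1_000_000  # Pub/Sub -> BigQuery direct subscription
pb_seconds, pb_nanos = epoch_seconds, 0   # protobuf Timestamp fields (`seconds`, `nanos`)

print(epoch_seconds, epoch_millis, epoch_micros)
```

Same instant, four different numbers, and nothing in the value itself tells you which unit it is in.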
Best practices for dealing with all of this:
- If the item will ever be written directly into BigQuery from Pub/Sub via
the direct subscription, **you have no choice, it must be an integer
representing the number of _microseconds_ since epoch**.
- If the item is not written directly to BigQuery (i.e., it is processed in Dataflow), represent it as a proper protobuf `Timestamp`, because when it is converted to JSON or a `dict`, the protobuf conversion code handles the scale by producing a string representation.
- This makes for a larger message over the wire, but it prevents accidental errors that depend upon the source data type being `int` or `float`.
---
So the addition of this feature would _drastically_ simplify our type-checking workflow.