Kenn, this discussion is about the precision of the timestamp in the user
data. As you mentioned, runners need not have the same granularity as the
user data as long as they correctly round the timestamp to guarantee that
triggers fire correctly, but the user data should have the same precision
across SDKs; otherwise user data timestamps will be truncated in
cross-language scenarios.
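
As a concrete illustration (hypothetical values, not a Beam API), this is
the truncation that happens when a microsecond event time crosses a
boundary that only carries milliseconds:

    # Hypothetical values only: a microsecond event time encoded at
    # millisecond precision on the wire loses its sub-millisecond digits.
    event_time_us = 1_555_500_000_123_456   # microseconds since epoch
    encoded_ms = event_time_us // 1_000     # truncated to millis for encoding
    decoded_us = encoded_ms * 1_000         # back to micros: ...123_000
    assert decoded_us != event_time_us      # the 456 microseconds are gone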

Based on the systems that were listed, either microsecond or nanosecond
precision would make sense. The issue with changing the precision is that
all Beam runners, except possibly Beam Python on Dataflow, use millisecond
precision, since they all share the same Java runner windowing/triggering
logic.

Plan A: Swap precision to nanosecond
1) Change the Python SDK to only expose millisecond precision timestamps
(do now)
2) Change the user data encoding to support nanosecond precision (do now;
see the encoding sketch after this list)
3) Swap runner libraries to be nanosecond-precision aware, updating all
window/triggering logic (do later)
4) Swap SDKs to expose nanosecond precision (do later)
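
A rough sketch of what a nanosecond-capable user data encoding could look
like (the function names are hypothetical, not an existing Beam coder),
splitting the event time into whole seconds plus a nanosecond remainder,
similar to the proto Timestamp:

    def encode_event_time_ns(event_time_ns):
        # Split into whole seconds and a 0..999_999_999 nanos remainder.
        return divmod(event_time_ns, 1_000_000_000)

    def decode_event_time_ns(seconds, nanos):
        return seconds * 1_000_000_000 + nanos

    t = 1_555_500_000_123_456_789
    assert decode_event_time_ns(*encode_event_time_ns(t)) == t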

Plan B:
1) Change the Python SDK to only expose millisecond precision timestamps
and keep the data encoding as is (do now; see the sketch after this list)
(We could add greater precision later to Plan B by creating a new coder
version, beam:coder:windowed_value:v2, which would be nanosecond precision
and would require runners to correctly perform internal conversions for
windowing/triggering.)
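
A minimal sketch, assuming the Python SDK keeps an internal microsecond
representation, of what "only expose millisecond precision" could mean in
practice (the class and method names are hypothetical, not the real
apache_beam.utils.timestamp API):

    class MillisTimestamp(object):
        """Hypothetical timestamp that truncates finer input to millis."""

        def __init__(self, micros):
            # Drop sub-millisecond digits so Python matches the Java SDK.
            self.micros = (micros // 1000) * 1000

        @classmethod
        def of_millis(cls, millis):
            # The "milliseconds" constructor Max suggests below.
            return cls(millis * 1000)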

I think we should go with Plan B, and when users request greater precision
we can make that an explicit effort. What do people think?



On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <[email protected]> wrote:

> Hi,
>
> Thanks for taking care of this issue in the Python SDK, Thomas!
>
> It would be nice to have a uniform precision for timestamps but, as Kenn
> pointed out, timestamps are extracted from systems that have different
> precision.
>
> To add to the list: Flink - milliseconds
>
> After all, it doesn't matter as long as there is sufficient precision
> and conversions are done correctly.
>
> I think we could improve the situation by at least adding a
> "milliseconds" constructor to the Python SDK's Timestamp.
>
> Cheers,
> Max
>
> On 17.04.19 04:13, Kenneth Knowles wrote:
> > I am not so sure this is a good idea. Here are some systems and their
> > precision:
> >
> > Arrow - microseconds
> > BigQuery - microseconds
> > New Java instant - nanoseconds
> > Firestore - microseconds
> > Protobuf - nanoseconds
> > Dataflow backend - microseconds
> > Postgresql - microseconds
> > Pubsub publish time - nanoseconds
> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
> > Cassandra - milliseconds
> >
> > IMO it is important to be able to treat any of these as a Beam
> > timestamp, even though they aren't all streaming. Who knows when we
> > might be ingesting a streamed changelog, or using them for reprocessing
> > an archived stream. I think for this purpose we either should
> > standardize on nanoseconds or make the runner's resolution independent
> > of the data representation.
> >
> > I've had some offline conversations about this. I think we can have
> > higher-than-runner precision in the user data, and allow WindowFns and
> > DoFns to operate on this higher-than-runner precision data, and still
> > have consistent watermark treatment. Watermarks are just bounds, after
> all.
> >
> > Kenn
> >
> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     The Python SDK currently uses timestamps in microsecond resolution
> >     while Java SDK, as most would probably expect, uses milliseconds.
> >
> >     This causes a few difficulties with portability (Python coders need
> >     to convert to millis for WindowedValue and Timers), which is related
> >     to a bug I'm looking into:
> >
> >     https://issues.apache.org/jira/browse/BEAM-7035
> >
> >     As Luke pointed out, the issue was previously discussed:
> >
> >     https://issues.apache.org/jira/browse/BEAM-1524
> >
> >     I'm not privy to the reasons why we decided to go with micros in the
> >     first place, but would it be too big of a change or impractical for
> >     other reasons to switch the Python SDK to millis before it gets more
> >     users?
> >
> >     Thanks,
> >     Thomas
> >
>
