On Tue, Apr 23, 2019 at 5:48 AM Robert Bradshaw <rober...@google.com> wrote:

> On Thu, Apr 18, 2019 at 12:23 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > For Robert's benefit, I want to point out that my proposal is to support
> femtosecond data, with femtosecond-scale windows, even if watermarks/event
> timestamps/holds are only millisecond precision.
> >
> > So the workaround once I have time, for SQL and schema-based transforms,
> will be to have a logical type that matches the Java and protobuf
> definition of nanos (seconds-since-epoch + nanos-in-second) to preserve the
> user's data. And then, when doing windowing, inserting the necessary rounding
> somewhere in the SQL or schema layers.
>
> It seems to me that the underlying granularity of element timestamps
> and window boundaries, as seen and operated on by the runner (and
> transmitted over the FnAPI boundary), is not something we can make
> invisible to the user (and consequently we cannot just insert rounding
> on higher precision data and get the right results). However, I would
> be very interested in seeing proposals that could get around this.
> Watermarks, of course, can be as approximate (in one direction) as one
> likes.
>

I outlined a way... or perhaps I retracted it to ponder and sent the rest
of my email. Sorry! Something like this; TL;DR: store the original data, but
do runner ops on rounded data.

 - The WindowFn must receive exactly the data that came from the user's data
source, so that data cannot be rounded.
 - The user's WindowFn assigns elements to a window, so the window can contain
arbitrary precision, since windows are grouped by their encoded bytes.
 - End of window, timers, watermark holds, etc., are all treated only as
bounds, so they can all be rounded based on their use as an upper or lower
bound (see the sketch below).

We already do a lot of this - Pubsub publish timestamps are microsecond
precision (you could say our current connector constitutes data loss), as
are Windmill timestamps (since these are only combines of Beam timestamps,
there is no data loss there). There are undoubtedly some corner cases I've
missed, and naively this might look like duplicating timestamps, which could
be an unacceptable performance concern.

> As for choice of granularity, it would be ideal if any time-like field
> could be used as the timestamp (for subsequent windowing). On the
> other hand, nanoseconds (or smaller) complicates the arithmetic and
> encoding as a 64-bit int has a time range of only a couple hundred
> years without overflow (which is an argument for microseconds, as they
> are a nice balance between sub-second granularity and multi-millennia
> span). Standardizing on milliseconds is more restrictive but has the
> advantage that it's what Java and Joda Time use now (though it's
> always easier to pad precision than round it away).
>

A correction: Java *now* uses nanoseconds [1]. It uses the same breakdown
as proto (int64 seconds since epoch + int32 nanos within second). It has
legacy classes that use milliseconds, and Joda itself now encourages moving
to Java's new Instant type. Nanoseconds should complicate the arithmetic
only for the one person authoring the date library, and that work has
already been done.
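
For reference, the java.time breakdown I mean (this is the real Instant
API; the snippet itself is just an illustration):

  import java.time.Instant;

  public class InstantBreakdown {
    public static void main(String[] args) {
      Instant now = Instant.now();
      // Same shape as protobuf's Timestamp: int64 seconds + int32 nanos.
      long seconds = now.getEpochSecond(); // seconds since 1970-01-01T00:00:00Z
      int nanos = now.getNano();           // 0..999,999,999 within that second
      System.out.println(seconds + "s + " + nanos + "ns");
    }
  }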

> It would also be really nice to clean up the infinite-future being the
> somewhat arbitrary max micros rounded to millis, and
> end-of-global-window being infinite-future minus 1 hour (IIRC), etc.
> as well as the ugly logic in Python to cope with millis-micros
> conversion.
>

I actually don't have a problem with this. If you are trying to keep the
representation compact and not add bytes on top of instants, then you just
have to choose magic numbers, right?
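
Just to illustrate the kind of magic numbers I mean (these are made-up
constants for the sketch, not the actual Beam definitions): if a timestamp
is only a long count of millis, the representable range itself has to supply
the sentinels.

  import java.util.concurrent.TimeUnit;

  // Illustrative constants only, not the real Beam values.
  final class TimestampSentinels {
    // Largest microsecond value expressed in millis, so a runner working in
    // micros and one working in millis agree on "end of time".
    static final long MAX_TIMESTAMP_MILLIS =
        TimeUnit.MICROSECONDS.toMillis(Long.MAX_VALUE);
    static final long MIN_TIMESTAMP_MILLIS =
        TimeUnit.MICROSECONDS.toMillis(Long.MIN_VALUE);

    // The global window has to end strictly before "end of time", so some
    // fixed slack is chosen.
    static final long END_OF_GLOBAL_WINDOW_MILLIS =
        MAX_TIMESTAMP_MILLIS - TimeUnit.HOURS.toMillis(1);
  }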

Kenn

[1] https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html


> > On Wed, Apr 17, 2019 at 3:13 PM Robert Burke <rob...@frantil.com> wrote:
> >>
> >> +1 for plan B. Nanosecond precision on windowing seems... a little
> much for a system that's aggregating data over time. Even for processing,
> say, particle supercollider data, they'd get away with artificially
> increasing the granularity in batch settings.
> >>
> >> Now if they were streaming... they'd probably want femtoseconds anyway.
> >> The point is, we should see if users demand it before putting in the
> necessary work.
> >>
> >> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath <chamik...@google.com>
> wrote:
> >>>
> >>> +1 for plan B as well. I think it's important to make timestamp
> precision consistent now without introducing surprising behaviors for
> existing users. But we should move towards a higher granularity timestamp
> precision in the long run to support use-cases that Beam users might
> otherwise miss out on (given a runner that supports such precision).
> >>>
> >>> - Cham
> >>>
> >>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <lc...@google.com> wrote:
> >>>>
> >>>> I also like Plan B because, in the cross-language case, the pipeline
> simply would not work unless every party (Runners & SDKs) was aware of the
> new beam:coder:windowed_value:v2 coder. Plan A has the property that if the
> SDK/Runner wasn't updated, it may start truncating the timestamps
> unexpectedly.
> >>>>
> >>>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <lc...@google.com> wrote:
> >>>>>
> >>>>> Kenn, this discussion is about the precision of the timestamp in the
> user data. As you had mentioned, Runners need not have the same granularity
> as the user data as long as they correctly round the timestamp to guarantee
> that triggers are executed correctly, but the user data should have the same
> precision across SDKs; otherwise user data timestamps will be truncated in
> cross-language scenarios.
> >>>>>
> >>>>> Based on the systems that were listed, either microsecond or
> nanosecond would make sense. The issue with changing the precision is that
> all Beam runners except for possibly Beam Python on Dataflow are using
> millisecond precision since they are all using the same Java Runner
> windowing/trigger logic.
> >>>>>
> >>>>> Plan A: Swap precision to nanosecond
> >>>>> 1) Change the Python SDK to only expose millisecond precision
> timestamps (do now)
> >>>>> 2) Change the user data encoding to support nanosecond precision (do
> now)
> >>>>> 3) Swap runner libraries to be nanosecond precision aware updating
> all window/triggering logic (do later)
> >>>>> 4) Swap SDKs to expose nanosecond precision (do later)
> >>>>>
> >>>>> Plan B:
> >>>>> 1) Change the Python SDK to only expose millisecond precision
> timestamps and keep the data encoding as is (do now)
> >>>>> (We could add greater precision later to plan B by creating a new
> version beam:coder:windowed_value:v2 which would be nanosecond and would
> require runners to correctly perform internal conversions for
> windowing/triggering.)
> >>>>>
> >>>>> I think we should go with Plan B and when users request greater
> precision we can make that an explicit effort. What do people think?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <m...@apache.org>
> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Thanks for taking care of this issue in the Python SDK, Thomas!
> >>>>>>
> >>>>>> It would be nice to have a uniform precision for timestamps but, as
> Kenn
> >>>>>> pointed out, timestamps are extracted from systems that have
> different
> >>>>>> precision.
> >>>>>>
> >>>>>> To add to the list: Flink - milliseconds
> >>>>>>
> >>>>>> After all, it doesn't matter as long as there is sufficient
> precision
> >>>>>> and conversions are done correctly.
> >>>>>>
> >>>>>> I think we could improve the situation by at least adding a
> >>>>>> "milliseconds" constructor to the Python SDK's Timestamp.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Max
> >>>>>>
> >>>>>> On 17.04.19 04:13, Kenneth Knowles wrote:
> >>>>>> > I am not so sure this is a good idea. Here are some systems and
> their
> >>>>>> > precision:
> >>>>>> >
> >>>>>> > Arrow - microseconds
> >>>>>> > BigQuery - microseconds
> >>>>>> > New Java instant - nanoseconds
> >>>>>> > Firestore - microseconds
> >>>>>> > Protobuf - nanoseconds
> >>>>>> > Dataflow backend - microseconds
> >>>>>> > Postgresql - microseconds
> >>>>>> > Pubsub publish time - nanoseconds
> >>>>>> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3
> millis)
> >>>>>> > Cassandra - milliseconds
> >>>>>> >
> >>>>>> > IMO it is important to be able to treat any of these as a Beam
> >>>>>> > timestamp, even though they aren't all streaming. Who knows when
> we
> >>>>>> > might be ingesting a streamed changelog, or using them for
> reprocessing
> >>>>>> > an archived stream. I think for this purpose we either should
> >>>>>> > standardize on nanoseconds or make the runner's resolution
> independent
> >>>>>> > of the data representation.
> >>>>>> >
> >>>>>> > I've had some offline conversations about this. I think we can
> have
> >>>>>> > higher-than-runner precision in the user data, and allow
> WindowFns and
> >>>>>> > DoFns to operate on this higher-than-runner precision data, and
> still
> >>>>>> > have consistent watermark treatment. Watermarks are just bounds,
> after all.
> >>>>>> >
> >>>>>> > Kenn
> >>>>>> >
> >>>>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <t...@apache.org
> >>>>>> > <mailto:t...@apache.org>> wrote:
> >>>>>> >
> >>>>>> >     The Python SDK currently uses timestamps in microsecond
> resolution
> >>>>>> >     while Java SDK, as most would probably expect, uses
> milliseconds.
> >>>>>> >
> >>>>>> >     This causes a few difficulties with portability (Python
> coders need
> >>>>>> >     to convert to millis for WindowedValue and Timers, which is
> related
> >>>>>> >     to a bug I'm looking into):
> >>>>>> >
> >>>>>> >     https://issues.apache.org/jira/browse/BEAM-7035
> >>>>>> >
> >>>>>> >     As Luke pointed out, the issue was previously discussed:
> >>>>>> >
> >>>>>> >     https://issues.apache.org/jira/browse/BEAM-1524
> >>>>>> >
> >>>>>> >     I'm not privy to the reasons why we decided to go with micros
> in the
> >>>>>> >     first place, but would it be too big of a change or
> impractical for
> >>>>>> >     other reasons to switch the Python SDK to millis before it gets
> more users?
> >>>>>> >
> >>>>>> >     Thanks,
> >>>>>> >     Thomas
> >>>>>> >
>
