I also like Plan B because, in the cross-language case, the pipeline would simply not work unless every party (runners & SDKs) was aware of the new beam:coder:windowed_value:v2 coder. Plan A has the property that if the SDK/runner wasn't updated, it might start truncating the timestamps unexpectedly.
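To make the trade-off concrete, here is a small sketch of the two failure modes (hypothetical names, not the actual Beam coder API): a non-upgraded party that only tracks milliseconds silently discards the sub-millisecond part of a timestamp (the Plan A hazard), while an unrecognized coder URN fails loudly up front (the Plan B behavior).

```python
# Hypothetical sketch; function names and the URN check are illustrative,
# not the real Beam coder implementation.

NANOS_PER_MILLI = 1_000_000

def millis_only_decode(nanos: int) -> int:
    """A non-upgraded party that only tracks milliseconds: the
    sub-millisecond part is silently discarded (Plan A hazard)."""
    return (nanos // NANOS_PER_MILLI) * NANOS_PER_MILLI

def decode_with_urn(urn: str, nanos: int) -> int:
    """A party that checks the coder URN first: an unknown coder is a
    loud, immediate error (Plan B behavior)."""
    known = {"beam:coder:windowed_value:v1"}
    if urn not in known:
        raise ValueError(f"unknown coder: {urn}")
    return nanos

ts = 1_555_533_840_123_456_789  # a nanosecond-precision timestamp

# Plan A: silent data loss for a non-upgraded party.
assert millis_only_decode(ts) == 1_555_533_840_123_000_000

# Plan B: the pipeline fails loudly instead of corrupting data.
try:
    decode_with_urn("beam:coder:windowed_value:v2", ts)
except ValueError as e:
    print(e)  # unknown coder: beam:coder:windowed_value:v2
```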
On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <lc...@google.com> wrote:

> Kenn, this discussion is about the precision of the timestamp in the user
> data. As you mentioned, runners need not have the same granularity as the
> user data as long as they correctly round the timestamp to guarantee that
> triggers are executed correctly, but the user data should have the same
> precision across SDKs; otherwise user data timestamps will be truncated in
> cross-language scenarios.
>
> Based on the systems that were listed, either microsecond or nanosecond
> precision would make sense. The issue with changing the precision is that
> all Beam runners except possibly Beam Python on Dataflow are using
> millisecond precision, since they all share the same Java runner
> windowing/triggering logic.
>
> Plan A: Swap precision to nanosecond
> 1) Change the Python SDK to only expose millisecond precision timestamps
>    (do now)
> 2) Change the user data encoding to support nanosecond precision (do now)
> 3) Swap runner libraries to be nanosecond-precision aware, updating all
>    window/triggering logic (do later)
> 4) Swap SDKs to expose nanosecond precision (do later)
>
> Plan B:
> 1) Change the Python SDK to only expose millisecond precision timestamps
>    and keep the data encoding as is (do now)
> (We could add greater precision to Plan B later by creating a new coder
> version, beam:coder:windowed_value:v2, which would be nanosecond-based and
> would require runners to correctly perform internal conversions for
> windowing/triggering.)
>
> I think we should go with Plan B, and when users request greater precision
> we can make that an explicit effort. What do people think?
>
> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <m...@apache.org> wrote:
>
>> Hi,
>>
>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>
>> It would be nice to have a uniform precision for timestamps but, as Kenn
>> pointed out, timestamps are extracted from systems that have different
>> precision.
>>
>> To add to the list: Flink - milliseconds
>>
>> After all, it doesn't matter as long as there is sufficient precision
>> and conversions are done correctly.
>>
>> I think we could improve the situation by at least adding a
>> "milliseconds" constructor to the Python SDK's Timestamp.
>>
>> Cheers,
>> Max
>>
>> On 17.04.19 04:13, Kenneth Knowles wrote:
>> > I am not so sure this is a good idea. Here are some systems and their
>> > precision:
>> >
>> > Arrow - microseconds
>> > BigQuery - microseconds
>> > New Java Instant - nanoseconds
>> > Firestore - microseconds
>> > Protobuf - nanoseconds
>> > Dataflow backend - microseconds
>> > PostgreSQL - microseconds
>> > Pubsub publish time - nanoseconds
>> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
>> > Cassandra - milliseconds
>> >
>> > IMO it is important to be able to treat any of these as a Beam
>> > timestamp, even though they aren't all streaming. Who knows when we
>> > might be ingesting a streamed changelog, or using them for
>> > reprocessing an archived stream. I think for this purpose we should
>> > either standardize on nanoseconds or make the runner's resolution
>> > independent of the data representation.
>> >
>> > I've had some offline conversations about this. I think we can have
>> > higher-than-runner precision in the user data, and allow WindowFns
>> > and DoFns to operate on this higher-than-runner precision data, and
>> > still have consistent watermark treatment. Watermarks are just
>> > bounds, after all.
>> >
>> > Kenn
>> >
>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <t...@apache.org> wrote:
>> >
>> >     The Python SDK currently uses timestamps in microsecond
>> >     resolution while the Java SDK, as most would probably expect,
>> >     uses milliseconds.
>> >
>> >     This causes a few difficulties with portability (Python coders
>> >     need to convert to millis for WindowedValue and Timers), which is
>> >     related to a bug I'm looking into:
>> >
>> >     https://issues.apache.org/jira/browse/BEAM-7035
>> >
>> >     As Luke pointed out, the issue was previously discussed:
>> >
>> >     https://issues.apache.org/jira/browse/BEAM-1524
>> >
>> >     I'm not privy to the reasons why we decided to go with micros in
>> >     the first place, but would it be too big of a change or
>> >     impractical for other reasons to switch the Python SDK to millis
>> >     before it gets more users?
>> >
>> >     Thanks,
>> >     Thomas
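The micros-to-millis conversion Thomas describes can be sketched as follows (illustrative only; the real coder logic lives in the Beam Python SDK). Truncating with floor division means a microsecond timestamp does not survive a round trip through a millisecond encoding, which is exactly the kind of silent precision loss the thread is worried about:

```python
# Illustrative sketch of the precision loss when a microsecond timestamp
# is encoded at millisecond granularity (not the actual Beam coder code).

MICROS_PER_MILLI = 1_000

def micros_to_millis(micros: int) -> int:
    # Floor division so timestamps before the epoch (negative values)
    # still round toward negative infinity.
    return micros // MICROS_PER_MILLI

def millis_to_micros(millis: int) -> int:
    return millis * MICROS_PER_MILLI

ts_micros = 1_555_533_840_123_456

# The round trip through the millisecond encoding loses the last 456 us.
round_tripped = millis_to_micros(micros_to_millis(ts_micros))
assert round_tripped == 1_555_533_840_123_000
assert ts_micros - round_tripped == 456
```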