I also like Plan B because, in the cross-language case, the pipeline would simply not work unless every party (runners & SDKs) was aware of the new beam:coder:windowed_value:v2 coder. Plan A has the property that if the SDK/runner wasn't updated, it might start truncating the timestamps unexpectedly.
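To make the trade-off concrete, here is a small sketch of the two failure modes (hypothetical names, not the actual Beam coder API): a non-upgraded party that only tracks milliseconds silently discards the sub-millisecond part of a timestamp (the Plan A hazard), while an unrecognized coder URN fails loudly up front (the Plan B behavior).

```python
# Hypothetical sketch; function names and the URN check are illustrative,
# not the real Beam coder implementation.

NANOS_PER_MILLI = 1_000_000

def millis_only_decode(nanos: int) -> int:
    """A non-upgraded party that only tracks milliseconds: the
    sub-millisecond part is silently discarded (Plan A hazard)."""
    return (nanos // NANOS_PER_MILLI) * NANOS_PER_MILLI

def decode_with_urn(urn: str, nanos: int) -> int:
    """A party that checks the coder URN first: an unknown coder is a
    loud, immediate error (Plan B behavior)."""
    known = {"beam:coder:windowed_value:v1"}
    if urn not in known:
        raise ValueError(f"unknown coder: {urn}")
    return nanos

ts = 1_555_533_840_123_456_789  # a nanosecond-precision timestamp

# Plan A: silent data loss for a non-upgraded party.
assert millis_only_decode(ts) == 1_555_533_840_123_000_000

# Plan B: the pipeline fails loudly instead of corrupting data.
try:
    decode_with_urn("beam:coder:windowed_value:v2", ts)
except ValueError as e:
    print(e)  # unknown coder: beam:coder:windowed_value:v2
```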
On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <lc...@google.com> wrote:

> Kenn, this discussion is about the precision of the timestamp in the user
> data. As you mentioned, runners need not have the same granularity as the
> user data as long as they correctly round the timestamp to guarantee that
> triggers are executed correctly, but the user data should have the same
> precision across SDKs; otherwise user data timestamps will be truncated in
> cross-language scenarios.
>
> Based on the systems that were listed, either microsecond or nanosecond
> precision would make sense. The issue with changing the precision is that
> all Beam runners except possibly Beam Python on Dataflow are using
> millisecond precision, since they all share the same Java runner
> windowing/triggering logic.
>
> Plan A: Swap precision to nanosecond
> 1) Change the Python SDK to only expose millisecond precision timestamps
>    (do now)
> 2) Change the user data encoding to support nanosecond precision (do now)
> 3) Swap runner libraries to be nanosecond-precision aware, updating all
>    window/triggering logic (do later)
> 4) Swap SDKs to expose nanosecond precision (do later)
>
> Plan B:
> 1) Change the Python SDK to only expose millisecond precision timestamps
>    and keep the data encoding as is (do now)
> (We could add greater precision to Plan B later by creating a new coder
> version, beam:coder:windowed_value:v2, which would be nanosecond-based and
> would require runners to correctly perform internal conversions for
> windowing/triggering.)
>
> I think we should go with Plan B, and when users request greater precision
> we can make that an explicit effort. What do people think?
>
> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <m...@apache.org> wrote:
>
>> Hi,
>>
>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>
>> It would be nice to have a uniform precision for timestamps but, as Kenn
>> pointed out, timestamps are extracted from systems that have different
>> precision.
>>
>> To add to the list: Flink - milliseconds
>>
>> After all, it doesn't matter as long as there is sufficient precision
>> and conversions are done correctly.
>>
>> I think we could improve the situation by at least adding a
>> "milliseconds" constructor to the Python SDK's Timestamp.
>>
>> Cheers,
>> Max
>>
>> On 17.04.19 04:13, Kenneth Knowles wrote:
>> > I am not so sure this is a good idea. Here are some systems and their
>> > precision:
>> >
>> > Arrow - microseconds
>> > BigQuery - microseconds
>> > New Java Instant - nanoseconds
>> > Firestore - microseconds
>> > Protobuf - nanoseconds
>> > Dataflow backend - microseconds
>> > PostgreSQL - microseconds
>> > Pubsub publish time - nanoseconds
>> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
>> > Cassandra - milliseconds
>> >
>> > IMO it is important to be able to treat any of these as a Beam
>> > timestamp, even though they aren't all streaming. Who knows when we
>> > might be ingesting a streamed changelog, or using them for
>> > reprocessing an archived stream. I think for this purpose we should
>> > either standardize on nanoseconds or make the runner's resolution
>> > independent of the data representation.
>> >
>> > I've had some offline conversations about this. I think we can have
>> > higher-than-runner precision in the user data, and allow WindowFns
>> > and DoFns to operate on this higher-than-runner precision data, and
>> > still have consistent watermark treatment. Watermarks are just
>> > bounds, after all.
>> >
>> > Kenn
>> >
>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <t...@apache.org> wrote:
>> >
>> >     The Python SDK currently uses timestamps in microsecond
>> >     resolution while the Java SDK, as most would probably expect,
>> >     uses milliseconds.
>> >
>> >     This causes a few difficulties with portability (Python coders
>> >     need to convert to millis for WindowedValue and Timers), which is
>> >     related to a bug I'm looking into:
>> >
>> >     https://issues.apache.org/jira/browse/BEAM-7035
>> >
>> >     As Luke pointed out, the issue was previously discussed:
>> >
>> >     https://issues.apache.org/jira/browse/BEAM-1524
>> >
>> >     I'm not privy to the reasons why we decided to go with micros in
>> >     the first place, but would it be too big of a change or
>> >     impractical for other reasons to switch the Python SDK to millis
>> >     before it gets more users?
>> >
>> >     Thanks,
>> >     Thomas
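The micros-to-millis conversion Thomas describes can be sketched as follows (illustrative only; the real coder logic lives in the Beam Python SDK). Truncating with floor division means a microsecond timestamp does not survive a round trip through a millisecond encoding, which is exactly the kind of silent precision loss the thread is worried about:

```python
# Illustrative sketch of the precision loss when a microsecond timestamp
# is encoded at millisecond granularity (not the actual Beam coder code).

MICROS_PER_MILLI = 1_000

def micros_to_millis(micros: int) -> int:
    # Floor division so timestamps before the epoch (negative values)
    # still round toward negative infinity.
    return micros // MICROS_PER_MILLI

def millis_to_micros(millis: int) -> int:
    return millis * MICROS_PER_MILLI

ts_micros = 1_555_533_840_123_456

# The round trip through the millisecond encoding loses the last 456 us.
round_tripped = millis_to_micros(micros_to_millis(ts_micros))
assert round_tripped == 1_555_533_840_123_000
assert ts_micros - round_tripped == 456
```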