Rui Fan created FLINK-36071:
-------------------------------

             Summary: Using System.nanoTime to measure the elapsed time instead 
of System.currentTimeMillis
                 Key: FLINK-36071
                 URL: https://issues.apache.org/jira/browse/FLINK-36071
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Metrics
            Reporter: Rui Fan
            Assignee: Rui Fan


A number of Flink metrics use System.currentTimeMillis[1] to measure elapsed 
time. I propose refactoring them to use System.nanoTime[2] instead.
h1. Why do we need to refactor it?

Note: High precision *{color:#de350b}is not{color}* the reason for the refactor.

Actually, System.currentTimeMillis() and System.nanoTime() have completely 
different semantics.

System.currentTimeMillis() *{color:#de350b}!={color}* System.nanoTime() / 
1_000_000
 * System.currentTimeMillis() returns the current system (wall-clock) time of 
the server.
 ** This time can be updated by NTP[3], or it can be adjusted manually.
 * System.nanoTime() returns the time elapsed since a fixed but arbitrary 
origin; in practice this is usually the time since the operating system was 
booted.
 ** So System.nanoTime() is not system time, and it is not affected by changes 
to the system time.
 ** Within a single process, System.nanoTime() is monotonically increasing and 
never goes backwards.
 ** As the Javadoc[2] states: this method can only be used to measure elapsed 
time and is not related to any other notion of system or wall-clock time.

A blog post[4] explains the difference in detail.
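
To make the difference concrete, here is a minimal sketch of the two ways to measure a duration. The class and method names (ElapsedTimeDemo, measureMillis, measureMillisWallClock) are made up for illustration and do not exist in Flink. Only the nanoTime-based variant is immune to NTP or manual clock adjustments that happen while the task runs.

```java
import java.util.concurrent.TimeUnit;

// Illustrative helper, not Flink code.
final class ElapsedTimeDemo {

    /** Measures the duration of a task with System.nanoTime() and returns milliseconds. */
    static long measureMillis(Runnable task) {
        long startNanos = System.nanoTime();
        task.run();
        // Always work with the difference of two nanoTime() values; the raw
        // values have an arbitrary origin and may even be negative.
        long elapsedNanos = System.nanoTime() - startNanos;
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
    }

    /** Fragile variant: the result is wrong if the wall clock is adjusted while the task runs. */
    static long measureMillisWallClock(Runnable task) {
        long startMillis = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - startMillis;
    }

    public static void main(String[] args) {
        Runnable sleep50 = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("nanoTime based:   " + measureMillis(sleep50) + " ms");
        System.out.println("wall-clock based: " + measureMillisWallClock(sleep50) + " ms");
    }
}
```

On a machine whose clock is not adjusted during the run, both variants report similar values; they only diverge when the system time changes in between.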
h1. Current use cases:

Based on the previous section, System.nanoTime is the recommended way to 
measure durations.

Most tracing systems use it, and Flink already uses it to measure the duration 
for some metrics, such as:
 * all latency tracking of the state backends
 * SubtaskCheckpointCoordinatorImpl#takeSnapshotSync, which measures the 
checkpoint sync duration
 * etc

In addition, Flink's Clock[5] already distinguishes between absoluteTimeMillis, 
relativeTimeMillis and relativeTimeNanos. However, I suspect most developers 
are not aware of these details.
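
As a rough illustration of why that separation matters, the sketch below mirrors the three method names of Clock[5] in simplified stand-in classes. DemoClock, DemoSystemClock and DemoManualClock are hypothetical names and are not Flink's implementation. The manual clock simulates the wall clock being stepped backwards while the relative time keeps advancing.

```java
// Simplified stand-ins that only mirror the method names of Flink's Clock; not Flink code.
abstract class DemoClock {
    abstract long absoluteTimeMillis();  // wall-clock time, may jump
    abstract long relativeTimeMillis();  // only meaningful as a difference
    abstract long relativeTimeNanos();   // only meaningful as a difference
}

final class DemoSystemClock extends DemoClock {
    @Override long absoluteTimeMillis() { return System.currentTimeMillis(); }
    @Override long relativeTimeMillis() { return System.nanoTime() / 1_000_000; }
    @Override long relativeTimeNanos() { return System.nanoTime(); }
}

/** Manually driven clock (assumption for this sketch) to simulate a wall-clock adjustment. */
final class DemoManualClock extends DemoClock {
    private long wallMillis;
    private long monotonicNanos;

    DemoManualClock(long wallMillis, long monotonicNanos) {
        this.wallMillis = wallMillis;
        this.monotonicNanos = monotonicNanos;
    }

    /** Real time passes: both clocks move forward. */
    void advance(long millis) {
        wallMillis += millis;
        monotonicNanos += millis * 1_000_000L;
    }

    /** NTP or an operator changes the system time: only the wall clock moves. */
    void adjustWallClock(long deltaMillis) {
        wallMillis += deltaMillis;
    }

    @Override long absoluteTimeMillis() { return wallMillis; }
    @Override long relativeTimeMillis() { return monotonicNanos / 1_000_000; }
    @Override long relativeTimeNanos() { return monotonicNanos; }
}

final class ClockDemo {
    public static void main(String[] args) {
        DemoManualClock clock = new DemoManualClock(1_700_000_000_000L, 0L);
        long wallStart = clock.absoluteTimeMillis();
        long relativeStart = clock.relativeTimeMillis();

        clock.advance(100);            // 100 ms of real time pass
        clock.adjustWallClock(-5_000); // wall clock is stepped back by 5 s

        // prints "wall-clock elapsed: -4900 ms"
        System.out.println("wall-clock elapsed: " + (clock.absoluteTimeMillis() - wallStart) + " ms");
        // prints "relative elapsed: 100 ms"
        System.out.println("relative elapsed: " + (clock.relativeTimeMillis() - relativeStart) + " ms");
    }
}
```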
h1. Proposed changes:

This Jira proposes that Flink use System.nanoTime uniformly for duration 
calculation.

Currently, many components still use System.currentTimeMillis to calculate 
durations, including:
 * TimerGauge
 * TaskIOMetricGroup
 * A lot of methods of StreamTask
 * etc
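
The kind of change proposed for these components could look roughly like the sketch below. SimpleTimerGauge is a hypothetical, heavily simplified class written for this example; it is not Flink's TimerGauge and ignores thread safety. The point is only that start and end are taken from a nanosecond time source (System.nanoTime by default) and converted to milliseconds at the very end. The injectable LongSupplier is an assumption added here to make the behaviour testable.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical, simplified gauge for illustration only; not Flink's TimerGauge.
final class SimpleTimerGauge {
    private final LongSupplier nanoClock;
    private long startNanos;
    private boolean measuring;
    private long accumulatedNanos;

    SimpleTimerGauge() {
        this(System::nanoTime); // monotonic source instead of System.currentTimeMillis
    }

    SimpleTimerGauge(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    void markStart() {
        if (!measuring) {
            startNanos = nanoClock.getAsLong();
            measuring = true;
        }
    }

    void markEnd() {
        if (measuring) {
            // Only the difference of two readings is meaningful.
            accumulatedNanos += nanoClock.getAsLong() - startNanos;
            measuring = false;
        }
    }

    long getAccumulatedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(accumulatedNanos);
    }
}

final class SimpleTimerGaugeDemo {
    public static void main(String[] args) {
        long[] fakeNanos = {0L}; // manually driven time source for the demo
        SimpleTimerGauge gauge = new SimpleTimerGauge(() -> fakeNanos[0]);

        gauge.markStart();
        fakeNanos[0] += TimeUnit.MILLISECONDS.toNanos(250);
        gauge.markEnd();

        System.out.println(gauge.getAccumulatedMillis() + " ms"); // prints "250 ms"
    }
}
```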

[1] [https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#currentTimeMillis--]

[2] [https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--]

[3] [https://en.wikipedia.org/wiki/Network_Time_Protocol]

[4] [https://www.javaadvent.com/2019/12/measuring-time-from-java-to-kernel-and-back.html]

[5] [https://github.com/apache/flink/blob/729b8b81a77ba6c32711216b88a1bf57ccddfadc/flink-core/src/main/java/org/apache/flink/util/clock/Clock.java#L40]

 



