[ https://issues.apache.org/jira/browse/BEAM-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on BEAM-7812 started by Ning Kang.
---------------------------------------
> Work around Stackdriver error reporting double counting worker errors
> ---------------------------------------------------------------------
>
>                 Key: BEAM-7812
>                 URL: https://issues.apache.org/jira/browse/BEAM-7812
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>            Reporter: Ning Kang
>            Assignee: Ning Kang
>            Priority: Minor
>
> h1. *Objective*
> Work around Stackdriver Error Reporting so that worker errors are counted
> only once when they are logged twice.
> {color:#d04437}*Only applicable to dataflow runner workers in SDK*{color}.
> h1. *Background*
> Stackdriver Error Reporting double counts worker errors logged to
> Stackdriver, because:
> # workers log errors to Stackdriver;
> # workers report the same errors to dfe, and dfe logs them again to
> Stackdriver.
> The double counting is blocking us from sending job message logs from dfe to
> Stackdriver, because we don't want to change the behavior of any existing log
> or feature.
> There happens to be an inconsistency in how the Java batch worker
> ([DataflowWorkerLoggingHandler|https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/DataflowWorkerLoggingHandler.java#L82])
> and the Java streaming worker
> ([StreamingDataflowWorker|https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java#L1747])
> report errors to dfe. As a result, errors reported by the streaming Java
> worker are eventually ignored by Stackdriver Error Reporting.
> h1. *Details*
> Inspired by this inconsistency, we decided to apply the streaming Java
> worker's error reporting logic to batch as well, which both fixes the
> inconsistency and works around the double counting issue in Stackdriver
> Error Reporting.
> The change applies when workers report errors to dfe:
> * For Java, construct the stack trace from the StackTrace object instead of
> using printStackTrace;
> * For Python, report the complete error message details, exactly as they are
> sent to worker logging, instead of reporting only the traceback obtained
> through the traceback module.
> Users will not experience any change, since job message logging to
> Stackdriver hasn’t been launched yet.
> h1. *Test Plan*
> We'll add unit tests for the public methods changed in the process.
> Google has internal integration tests where we can push worker harness images
> and set the worker harness container image to test in a sandbox.
> When releasing, we also have integration tests at different release stages.
> The workaround needs to be released completely before we can enable job
> message logging.
> We can verify the format of stack traces in the sandbox and release stages by
> executing example pipelines in our projects and directly browsing the prod
> Stackdriver logging and error reporting consoles. This should be done both
> before and after enabling job message logging.
> Run any other existing and required tests before sending the PR.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
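
A minimal Python sketch of the idea in the Python bullet under Details, not the actual Beam change: the error text is produced by the same kind of `logging.Formatter` that formats a log line, so the text reported to dfe can match the logged text instead of carrying only the traceback. The helper name `format_error_like_logger`, the `%(message)s` format, and the logger name `worker` are assumptions made for illustration.

```python
import logging
import sys


def format_error_like_logger(message, exc_info, formatter=None):
    """Return the text a logging.Formatter would emit for this error.

    Hypothetical helper for illustration; the worker's real formatter and
    the call that reports to dfe are not shown here.
    """
    formatter = formatter or logging.Formatter("%(message)s")
    record = logging.LogRecord(
        name="worker",  # assumed logger name
        level=logging.ERROR,
        pathname="",
        lineno=0,
        msg=message,
        args=None,
        exc_info=exc_info,
    )
    # Formatter.format appends the formatted exception after the message,
    # which is the same text a handler using this formatter would log.
    return formatter.format(record)


try:
    raise ValueError("boom")
except ValueError:
    report = format_error_like_logger("step failed", sys.exc_info())
```

Here `report` holds the message followed by the full formatted exception. Using that one string for both the log entry and the error report is one way to keep the two texts identical.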