Sahil Takiar created HIVE-18684:
-----------------------------------

             Summary: Race condition in RemoteSparkJobMonitor
                 Key: HIVE-18684
                 URL: https://issues.apache.org/jira/browse/HIVE-18684
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar

There is a race condition in {{RemoteSparkJobMonitor}}. Sometimes the info in {{RemoteSparkJobMonitor#startMonitor.STARTED}} gets printed, and sometimes it doesn't. This can be easily verified by running a qtest on {{TestMiniSparkOnYarnCliDriver}} and comparing the number of times {{Query Hive on Spark job}} is printed with the number of times {{Finished successfully in}} is printed.

The issue is that {{RemoteSparkJobMonitor}} polls once per second and checks the state of the {{JobHandle}}. Depending on the state it observes, it prints some logging info. The log output implicitly assumes that the logs for the {{STARTED}} state are printed before the logs for the {{SUCCEEDED}} state. However, this isn't always the case. The state transitions are driven by how long the remote Spark job takes to run, and if the job finishes within one second, the monitor can go straight from a pre-start state to {{SUCCEEDED}} without ever observing {{STARTED}}, so the logs for the {{STARTED}} state are never printed.

This can be confusing to users, and key debugging information is only printed in the {{STARTED}} state.
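
To make the race concrete, here is a minimal, self-contained sketch of a monitor that samples a job state once per polling tick. It is not the Hive implementation: the class {{PollingRaceSketch}}, the methods {{naiveMonitor}} and {{guardedMonitor}}, and the {{QUEUED}} placeholder state are hypothetical names chosen for illustration, and each list element stands in for the state seen at one poll. Only the two log strings and the {{STARTED}} and {{SUCCEEDED}} state names come from the description above. The guarded variant shows one possible mitigation, not necessarily the fix that should be adopted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of a polling job monitor; not Hive code. */
final class PollingRaceSketch {

    /** Simplified states. QUEUED is an assumed placeholder for any pre-start state. */
    enum State { QUEUED, STARTED, SUCCEEDED }

    static final String STARTED_MSG = "Query Hive on Spark job";
    static final String FINISHED_MSG = "Finished successfully in";

    /**
     * Naive monitor: logs purely from the state seen at each poll.
     * If no poll ever observes STARTED, the STARTED log is silently skipped.
     */
    static List<String> naiveMonitor(List<State> observedStates) {
        List<String> log = new ArrayList<>();
        boolean startedLogged = false;
        for (State state : observedStates) {
            switch (state) {
                case STARTED:
                    if (!startedLogged) {
                        log.add(STARTED_MSG);
                        startedLogged = true;
                    }
                    break;
                case SUCCEEDED:
                    log.add(FINISHED_MSG);
                    return log;
                default:
                    break; // nothing to log before the job starts
            }
        }
        return log;
    }

    /**
     * Possible mitigation (an assumption, for illustration only): if the job is
     * already SUCCEEDED and the STARTED info was never logged, log it first.
     */
    static List<String> guardedMonitor(List<State> observedStates) {
        List<String> log = new ArrayList<>();
        boolean startedLogged = false;
        for (State state : observedStates) {
            boolean startedOrLater = state == State.STARTED || state == State.SUCCEEDED;
            if (startedOrLater && !startedLogged) {
                log.add(STARTED_MSG);
                startedLogged = true;
            }
            if (state == State.SUCCEEDED) {
                log.add(FINISHED_MSG);
                return log;
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // Slow job: one poll lands while the job is in STARTED.
        List<State> slow = Arrays.asList(State.QUEUED, State.STARTED, State.SUCCEEDED);
        // Fast job: it starts and finishes between two polls.
        List<State> fast = Arrays.asList(State.QUEUED, State.SUCCEEDED);

        System.out.println("naive, slow job:   " + naiveMonitor(slow));
        System.out.println("naive, fast job:   " + naiveMonitor(fast));
        System.out.println("guarded, fast job: " + guardedMonitor(fast));
    }
}
```

In this sketch the naive monitor logs both messages for the slow job but only the completion message for the fast job, which matches the mismatch in counts seen in the qtest output. The guarded variant logs both messages in either case.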