Sahil Takiar created HIVE-18684:
-----------------------------------

             Summary: Race condition in RemoteSparkJobMonitor
                 Key: HIVE-18684
                 URL: https://issues.apache.org/jira/browse/HIVE-18684
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar


There is a race condition in {{RemoteSparkJobMonitor}}. Sometimes the info in 
{{RemoteSparkJobMonitor#startMonitor.STARTED}} gets printed out, sometimes it 
doesn't. This can be easily verified by running a qtest on 
{{TestMiniSparkOnYarnCliDriver}} and counting the number of times {{Query Hive 
on Spark job}} is printed vs. the number of times {{Finished successfully in}} 
gets printed.

The issue is that {{RemoteSparkJobMonitor}} polls the state of {{JobHandle}} 
once per second and, depending on the state, prints out some logging info. 
The log output carries an implicit assumption that logs in the {{STARTED}} 
state are printed before the logs in the {{SUCCEEDED}} state. However, this 
isn't always the case. The state transitions are driven by how long the remote 
Spark job takes to run, and if it finishes within one second, the monitor can 
miss the {{STARTED}} state entirely and its logs are never printed.
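
The missed state can be illustrated with a small, deterministic simulation. 
This is only a sketch and not Hive code: the class {{PollingRaceSketch}}, the 
{{QUEUED}} state, and the methods {{stateAt}} and {{observe}} are hypothetical 
names made up for the illustration. It assumes a job that enters {{STARTED}} 
and {{SUCCEEDED}} at fixed times and a monitor that samples the state at a 
fixed interval.

```java
import java.util.ArrayList;
import java.util.List;

public class PollingRaceSketch {
    // Hypothetical, simplified job states; not the real JobHandle.State enum.
    enum State { QUEUED, STARTED, SUCCEEDED }

    // State of a simulated job at time tMs: it starts at startMs, ends at endMs.
    static State stateAt(long tMs, long startMs, long endMs) {
        if (tMs >= endMs) return State.SUCCEEDED;
        if (tMs >= startMs) return State.STARTED;
        return State.QUEUED;
    }

    // Samples the state every pollIntervalMs and records each distinct state
    // the monitor actually observes, stopping at the terminal state.
    static List<State> observe(long startMs, long endMs, long pollIntervalMs) {
        List<State> seen = new ArrayList<>();
        for (long t = 0; ; t += pollIntervalMs) {
            State s = stateAt(t, startMs, endMs);
            if (seen.isEmpty() || seen.get(seen.size() - 1) != s) {
                seen.add(s);
            }
            if (s == State.SUCCEEDED) {
                return seen;
            }
        }
    }

    public static void main(String[] args) {
        // Long job: the monitor sees STARTED on one of its polls.
        System.out.println(observe(100, 2500, 1000)); // [QUEUED, STARTED, SUCCEEDED]
        // Short job: it starts and finishes between two polls.
        System.out.println(observe(100, 900, 1000));  // [QUEUED, SUCCEEDED]
    }
}
```

In the second call the job is in {{STARTED}} only between 100 ms and 900 ms, 
so a monitor sampling at 0 ms and 1000 ms never sees that state, and any 
logging tied to it is skipped.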

This can be confusing to users, and the key debugging information that is 
printed in the {{STARTED}} state is lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
