t oo created AIRFLOW-6994:
-----------------------------

             Summary: SparkSubmitOperator re-launches Spark driver even when original driver is still running
                 Key: AIRFLOW-6994
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6994
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.10.6
            Reporter: t oo
            Assignee: t oo
             Fix For: 1.10.8


You click ‘release’ on a new Spark cluster while the prior Spark cluster is still processing spark-submits from Airflow. Airflow is then never able to finish the SparkSubmit task: it polls for the driver status against the new Spark cluster build, which cannot report a status because the submit happened on the earlier cluster build, so the status loop runs forever.

 

[https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/hooks/spark_submit_hook.py#L446]

[https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/hooks/spark_submit_hook.py#L489]

It loops forever if it cannot find the driverState tag in the JSON response. The new build (pointed to by the released DNS name) knows nothing about the driver submitted on the previously released build, so the second response below does not contain the driverState tag.
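
For context, a simplified sketch of the status-tracking loop behind the two linked lines (not the verbatim 1.10.6 source; the polling helper shown is illustrative): the loop only exits once self._driver_status reaches a terminal value, so a response without the driverState tag leaves the previously stored, non-terminal status in place and the loop never ends.

import time

def _start_driver_status_tracking(self):
    # Keep polling as long as the driver reports a non-terminal state.
    while self._driver_status not in ["FINISHED", "UNKNOWN",
                                      "KILLED", "FAILED", "ERROR"]:
        # Sleep briefly so the cluster is not spammed with status calls.
        time.sleep(1)

        # Illustrative helper: issues the equivalent of
        #   curl http://<master>:6066/v1/submissions/status/<driver_id>
        # and yields the response lines.
        response_lines = self._poll_driver_status()

        # Updates self._driver_status only when a line containing
        # "driverState" is found in the response.
        self._process_spark_status_log(response_lines)

        # If the response has no driverState tag (see the second curl
        # response below), self._driver_status keeps its old value
        # (e.g. RUNNING) and this loop never exits.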

  

# response before clicking release on the new build

[ec2-user@reda ~]$ curl http://dns:6066/v1/submissions/status/driver-20191202142207-0000
{
  "action" : "SubmissionStatusResponse",
  "driverState" : "RUNNING",
  "serverSparkVersion" : "2.3.4",
  "submissionId" : "driver-20191202142207-0000",
  "success" : true,
  "workerHostPort" : "reda:31489",
  "workerId" : "worker-20191202133526-reda-31489"
}

 

# response after clicking release on the new build

[ec2-user@reda ~]$ curl http://dns:6066/v1/submissions/status/driver-20191202142207-0000
{
  "action" : "SubmissionStatusResponse",
  "serverSparkVersion" : "2.3.4",
  "submissionId" : "driver-20191202142207-0000",
  "success" : false
}

               

 

This is definitely a defect in the current code. It can be fixed by modifying the _process_spark_status_log function to set the driver status to UNKNOWN if driverState is not found in the response after iterating over all of the lines.
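
A minimal sketch of that change, following the line-by-line parsing structure of the linked _process_spark_status_log (approximate, not the verbatim 1.10.6 source):

def _process_spark_status_log(self, itr):
    """Parse the output of the driver-status poll (see curl examples above)."""
    driver_found = False

    for line in itr:
        line = line.strip()

        # Extract the status from a line such as
        #   "driverState" : "RUNNING",
        if "driverState" in line:
            self._driver_status = line.split(' : ')[1] \
                .replace(',', '').replace('"', '').strip()
            driver_found = True

        self.log.debug("spark driver status log: %s", line)

    # Proposed fix: if no line carried the driverState tag (e.g. the status
    # call hit a freshly released cluster that does not know this driver),
    # report UNKNOWN so the tracking loop can terminate instead of spinning.
    if not driver_found:
        self._driver_status = "UNKNOWN"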

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
