[ https://issues.apache.org/jira/browse/AIRFLOW-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003406#comment-17003406 ]
t oo edited comment on AIRFLOW-5385 at 12/26/19 4:55 PM:
---------------------------------------------------------
[~sergio.soto] [~Diego García] do you have a PR?

BEFORE
{code:python}
connection_cmd = self._get_spark_binary_path()

# The url of the spark master
connection_cmd += ["--master", self._connection['master']]

# The driver id so we can poll for its status
if self._driver_id:
    connection_cmd += ["--status", self._driver_id]
else:
    raise AirflowException(
        "Invalid status: attempted to poll driver "
        "status but no driver id is known. Giving up.")
{code}

AFTER
{code:python}
# SPARK-27491 - spark 2.3.x --status does not work, so query the master's
# REST API directly instead of shelling out to spark-submit.
# https://jira.apache.org/jira/browse/AIRFLOW-5385

# The driver id so we can poll for its status
if not self._driver_id:
    raise AirflowException(
        "Invalid status: attempted to poll driver "
        "status but no driver id is known. Giving up.")

curl_max_wait_time = 30
spark_host = self._connection['master'].replace("spark://", "http://")
connection_cmd = [
    "/usr/bin/curl",
    "--max-time", str(curl_max_wait_time),
    "{host}/v1/submissions/status/{submission_id}".format(
        host=spark_host, submission_id=self._driver_id)]
self.log.info(connection_cmd)
{code}

Another thing I noticed is that polling every second is too frequent, so in contrib/hooks/spark_submit_hook.py, poll the spark server at a custom interval instead of every second:

*BEFORE*
{code:python}
# Sleep for 1 second as we do not want to spam the cluster
time.sleep(1)
{code}

*AFTER*
{code:python}
from airflow import configuration as conf

# Sleep for n seconds as we do not want to spam the cluster
_poll_interval = conf.getint('sparksubmit', 'poll_interval')
time.sleep(_poll_interval)
{code}

was (Author: toopt4):
[~sergio.soto] [~Diego García] do you have a PR?
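The curl subprocess in the AFTER patch could equally be replaced by an in-process request with Python's standard library, which avoids depending on /usr/bin/curl being installed. A minimal sketch, not the actual hook code: the function names are hypothetical, and the endpoint path and spark://-to-http:// rewrite are taken from the patch above.

```python
import json
import time
from urllib.request import urlopen


def build_status_url(master, driver_id):
    """Build the standalone master's REST status URL for a driver.

    Mirrors the patch above: rewrite spark:// to http:// and append
    the /v1/submissions/status/<driver_id> path.
    """
    host = master.replace("spark://", "http://")
    return "{host}/v1/submissions/status/{submission_id}".format(
        host=host, submission_id=driver_id)


def poll_driver_state(master, driver_id, poll_interval=5, timeout=30):
    """Poll the REST API until the driver leaves the RUNNING state."""
    url = build_status_url(master, driver_id)
    while True:
        with urlopen(url, timeout=timeout) as resp:
            state = json.load(resp)["driverState"]
        if state != "RUNNING":
            return state
        # Sleep for a configurable interval instead of spamming the master.
        time.sleep(poll_interval)
```

In the hook itself the interval would come from configuration rather than a keyword argument, along the lines of the conf.getint('sparksubmit', 'poll_interval') change suggested above.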
> SparkSubmit status spend lot of time
> ------------------------------------
>
>                 Key: AIRFLOW-5385
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5385
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.2
>            Reporter: Sergio Soto
>            Priority: Blocker
>
> Hello,
> we have an issue with SparkSubmitOperator. Airflow DAGs show that some streaming applications break out. I analyzed this behaviour. The SparkSubmitHook is responsible for checking the driver status.
> We discovered some timeouts and tried to reproduce the status check command. This is an execution with `time`:
> {code:java}
> time /opt/java/jdk1.8.0_181/jre/bin/java -cp /opt/shared/spark/client/conf/:/opt/shared/spark/client/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://spark-master.corp.com:6066 --status driver-20190901180337-2749
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 19/09/02 17:05:53 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20190901180337-2749 in spark://lgmadbdtpspk01v.corp.logitravelgroup.com:6066.
> 19/09/02 17:05:59 INFO RestSubmissionClient: Server responded with SubmissionStatusResponse:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "RUNNING",
>   "serverSparkVersion" : "2.2.1",
>   "submissionId" : "driver-20190901180337-2749",
>   "success" : true,
>   "workerHostPort" : "172.25.10.194:45441",
>   "workerId" : "worker-20190821201014-172.25.10.194-45441"
> }
> real 0m11.598s
> user 0m2.092s
> sys 0m0.222s{code}
> We analyzed the Scala code and the Spark API. The spark-submit status command ends with an HTTP GET request to a URL. Using curl, this is the time the spark master takes to return the status:
> {code:java}
> time curl "http://spark-master.corp.com:6066/v1/submissions/status/driver-20190901180337-2749"
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "RUNNING",
>   "serverSparkVersion" : "2.2.1",
>   "submissionId" : "driver-20190901180337-2749",
>   "success" : true,
>   "workerHostPort" : "172.25.10.194:45441",
>   "workerId" : "worker-20190821201014-172.25.10.194-45441"
> }
> real 0m0.011s
> user 0m0.000s
> sys 0m0.006s
> {code}
> The task spends 11.59 seconds with spark-submit versus 0.011 seconds with curl.
> How can this behaviour be explained?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
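The SubmissionStatusResponse shown in the issue is plain JSON, so the expensive JVM round trip is only needed for submission, not for status checks: the hook can parse the response body directly. A small sketch using the payload copied from the issue above (the only field the hook needs is driverState):

```python
import json

# SubmissionStatusResponse body, copied verbatim from the issue above.
body = """
{
  "action" : "SubmissionStatusResponse",
  "driverState" : "RUNNING",
  "serverSparkVersion" : "2.2.1",
  "submissionId" : "driver-20190901180337-2749",
  "success" : true,
  "workerHostPort" : "172.25.10.194:45441",
  "workerId" : "worker-20190821201014-172.25.10.194-45441"
}
"""

response = json.loads(body)
# The polling loop only needs the driver state to decide whether to continue.
print(response["driverState"])  # RUNNING
```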