zhifanggao opened a new issue, #5169: URL: https://github.com/apache/kyuubi/issues/5169
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

Test steps:

1. Submit a batch job using the REST API:

   ```
   curl -u "ocdp:112345" --location --request POST 'http://10.19.29.167:30099/api/v1/batches' \
     --header 'Content-Type: application/json' \
     --data '{
       "batchType": "Spark",
       "resource": "hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar",
       "name": "kyuubi_batch_demo",
       "className": "org.apache.spark.examples.SparkPi",
       "conf": {"hive.server2.proxy.user":"ocdp"}}'
   ```

2. Check the Kyuubi server pod:

   ```
   kyuubi@kyuubi-deployment-example-7c7774d465-9f9xh:/opt/kyuubi$ ps -efl|grep driver
   0 S kyuubi 584 1 87 80 0 - 755904 futex_ 19:03 ? 00:00:04 /opt/java/openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED org.apache.spark.deploy.SparkSubmit --conf spark.kyuubi.client.ipAddress=10.19.29.167 --conf spark.kyuubi.batch.resource.uploaded=false --conf spark.kubernetes.driverEnv.SPARK_USER_NAME=ocdp --conf spark.executorEnv.SPARK_USER_NAME=ocdp --conf spark.hive.server2.proxy.user=ocdp --conf spark.kubernetes.driver.label.kyuubi-unique-tag=1342671d-ac56-44cb-862f-7cc4aa9b6656 --conf spark.app.name=kyuubi_batch_demo --conf spark.kyuubi.session.real.user=ocdp --conf spark.kyuubi.server.ipAddress=0.0.0.0 --conf spark.kyuubi.session.connection.url=0.0.0.0:10099 --conf spark.kyuubi.batch.id=1342671d-ac56-44cb-862f-7cc4aa9b6656 --class org.apache.spark.examples.SparkPi --proxy-user ocdp hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar
   ```

3. Once the batch has completed, check its status:

   ```
   [root@host-10-19-37-166 ~]# curl -u "ocdp:112345" --location --request GET 'http://10.19.29.167:30099/api/v1/batches/1342671d-ac56-44cb-862f-7cc4aa9b6656'
   {"id":"1342671d-ac56-44cb-862f-7cc4aa9b6656","user":"ocdp","batchType":"SPARK","name":"kyuubi_batch_demo","appStartTime":0,"appId":null,"appUrl":null,"appState":"NOT_FOUND","appDiagnostic":null,"kyuubiInstance":"0.0.0.0:10099","state":"ERROR","createTime":1692097410346,"endTime":1692097443926,"batchInfo":{}}
   ```

The batch job's state is reported as ERROR, but the batch job in fact executed successfully.

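
A client-side check can make this mismatch easier to spot. The Python sketch below parses the response body from step 3 and flags the combination seen here: `state` is `ERROR` while `appState` is `NOT_FOUND` and `appId` is null. `looks_like_lost_driver` is a hypothetical helper name, not part of Kyuubi or its clients, and the rule it encodes is only an assumption drawn from this single response.

```python
import json

# Response body copied verbatim from step 3 of this report.
RESPONSE = (
    '{"id":"1342671d-ac56-44cb-862f-7cc4aa9b6656","user":"ocdp",'
    '"batchType":"SPARK","name":"kyuubi_batch_demo","appStartTime":0,'
    '"appId":null,"appUrl":null,"appState":"NOT_FOUND","appDiagnostic":null,'
    '"kyuubiInstance":"0.0.0.0:10099","state":"ERROR",'
    '"createTime":1692097410346,"endTime":1692097443926,"batchInfo":{}}'
)


def looks_like_lost_driver(batch: dict) -> bool:
    """Hypothetical helper: True when the batch is marked ERROR although
    the server never identified an application for it."""
    return (
        batch.get("state") == "ERROR"
        and batch.get("appState") == "NOT_FOUND"
        and batch.get("appId") is None
    )


batch = json.loads(RESPONSE)
print(looks_like_lost_driver(batch))  # prints True for the response above
```

A `True` result only says that the server could not locate the application; whether the job itself succeeded still has to be confirmed from the submit log or the job output.
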
Checked the Kyuubi logs:

```
2023-08-15 11:03:30.358 INFO org.apache.kyuubi.session.KyuubiSessionManager: ocdp's session with SessionHandle [1342671d-ac56-44cb-862f-7cc4aa9b6656]/kyuubi_batch_demo is opened, current opening sessions 4
2023-08-15 11:03:30.359 INFO org.apache.kyuubi.operation.BatchJobSubmission: Submitting SPARK batch[1342671d-ac56-44cb-862f-7cc4aa9b6656] job:
/opt/spark/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.hive.server2.proxy.user=ocdp \
    --conf spark.kyuubi.batch.id=1342671d-ac56-44cb-862f-7cc4aa9b6656 \
    --conf spark.kyuubi.batch.resource.uploaded=false \
    --conf spark.kyuubi.client.ipAddress=10.19.29.167 \
    --conf spark.kyuubi.server.ipAddress=0.0.0.0 \
    --conf spark.kyuubi.session.connection.url=0.0.0.0:10099 \
    --conf spark.kyuubi.session.real.user=ocdp \
    --conf spark.app.name=kyuubi_batch_demo \
    --conf spark.kubernetes.driver.label.kyuubi-unique-tag=1342671d-ac56-44cb-862f-7cc4aa9b6656 \
    --conf spark.kubernetes.driverEnv.SPARK_USER_NAME=ocdp \
    --conf spark.executorEnv.SPARK_USER_NAME=ocdp \
    --proxy-user ocdp hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar
2023-08-15 11:03:30.361 INFO org.apache.kyuubi.server.http.authentication.AuthenticationAuditLogger: user=ocdp(auth:BASIC) ip=10.19.29.167 proxyIp=null method=POST uri=/api/v1/batches params=null protocol=HTTP/1.1 status=200
2023-08-15 11:03:30.364 INFO org.apache.kyuubi.engine.ProcBuilder: Logging to /opt/kyuubi/work/ocdp/kyuubi-spark-batch-submit.log.3
2023-08-15 11:03:30.372 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 0ms, return UNKNOWN status
2023-08-15 11:03:30.374 INFO org.apache.kyuubi.operation.BatchJobSubmission: Batch report for 1342671d-ac56-44cb-862f-7cc4aa9b6656, Some(ApplicationInfo(null,null,UNKNOWN,None,None))
2023-08-15 11:03:35.390 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 5018ms, return UNKNOWN status
2023-08-15 11:03:40.391 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 10019ms, return UNKNOWN status
2023-08-15 11:03:45.393 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 15021ms, return UNKNOWN status
2023-08-15 11:03:50.394 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 20022ms, return UNKNOWN status
2023-08-15 11:03:58.917 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 28545ms, return UNKNOWN status
2023-08-15 11:04:03.920 ERROR org.apache.kyuubi.engine.KubernetesApplicationOperation: Can't find target driver pod by tag: 1342671d-ac56-44cb-862f-7cc4aa9b6656, elapsed time: 33548ms exceeds 30000ms.
2023-08-15 11:04:03.923 INFO org.apache.kyuubi.operation.BatchJobSubmission: Batch report for 1342671d-ac56-44cb-862f-7cc4aa9b6656, Some(ApplicationInfo(null,null,NOT_FOUND,None,None))
2023-08-15 11:04:03.926 INFO org.apache.kyuubi.operation.BatchJobSubmission: Processing ocdp's query[4adbdd40-5cec-42ca-b670-c25bc0d8bd19]: PENDING_STATE -> ERROR_STATE, time taken: 1.692097443926E9 seconds
2023-08-15 11:08:18.960 INFO org.apache.kyuubi.server.http.authentication.AuthenticationAuditLogger: user=ocdp(auth:BASIC) ip=10.19.29.167 proxyIp=null method=GET uri=/api/v1/batches/1342671d-ac56-44cb-862f-7cc4aa9b6656 params=null protocol=HTTP/1.1 status=200
```

The Kyuubi server looks for the driver pod tagged with the batch id; when that pod is not found, it marks the batch job with ERROR status. But in fact, no driver pod is created.

### Affects Version(s)

1.7.1

### Kyuubi Server Log Output

_No response_

### Kyuubi Engine Log Output

_No response_

### Kyuubi Server Configurations

```yaml
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

##################################################################################
# Kyuubi Configurations
##################################################################################
# kyuubi.frontend.protocols THRIFT_BINARY,REST
#kyuubi.frontend.rest.bind.host localhost
kyuubi.frontend.rest.bind.port 10099
kyuubi.authentication NONE
#kyuubi.authentication KERBEROS
#kyuubi.kinit.principal hive/[email protected]
#kyuubi.kinit.keytab /opt/kyuubi/conf/hive.service.keytab
# kyuubi.frontend.bind.host localhost
# kyuubi.frontend.bind.port 10009
kyuubi.session.engine.initialize.timeout 3000000000
# Set the engine share level to USER
kyuubi.engine.share.level USER
kyuubi.session.engine.idle.timeout PT10H
# Enable HA; this uses a ZooKeeper cluster outside of k8s
kyuubi.ha.enabled true
kyuubi.ha.zookeeper.quorum 10.19.37.28:2181
kyuubi.ha.zookeeper.client.port 2181
kyuubi.ha.zookeeper.namespace kyuubi
# Set the location of the engine jar, accessed from shared storage (S3)
kyuubi.session.engine.spark.main.resource local:///opt/spark/work-dir/kyuubi-spark-sql-engine_2.12-1.7.1.jar
# Disable hostname; in 1.5.1, not disabling it causes a problem where the nameservice cannot be resolved. The exact cause is unknown; investigate it yourself if interested
kyuubi.engine.connection.url.use.hostname=false

##################################################################################
# Spark Configurations
##################################################################################
spark.shuffle.file.buffer 2097151
spark.shuffle.io.backLog 8192
spark.shuffle.io.serverThreads 128
spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf
spark.shuffle.service.enabled false
spark.shuffle.unsafe.file.output.buffer 5m
spark.sql.autoBroadcastJoinThreshold 10214400000
spark.sql.hive.convertMetastoreOrc true
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.statistics.fallBackToHdfs true
spark.unsafe.sorter.spill.reader.buffer.size 1m
# must use kryo serializer because java serializer do not support relocation
spark.serializer org.apache.spark.serializer.KryoSerializer
# celeborn master
# options: hash, sort
# Hash shuffle writer use (partition count) * (celeborn.push.buffer.size) * (spark.executor.cores) memory.
# Sort shuffle writer use less memory than hash shuffle writer, if your shuffle partition count is large, try to use sort hash writer.
# we recommend set spark.celeborn.push.replicate.enabled to true to enable server-side data replication
# If you have only one worker, this setting must be false
# Support for Spark AQE only tested under Spark 3
# we recommend set localShuffleReader to false to get better performance of Celeborn
spark.sql.adaptive.localShuffleReader.enabled true
# we recommend enabling aqe support to gain better performance
spark.sql.adaptive.enabled true
spark.sql.adaptive.skewJoin.enabled true
# Hive Metastore configuration
spark.sql.hive.metastore.version 2.3.9
#spark.sql.hive.metastore.jars path
spark.sql.warehouse.dir /warehouse/tablespace/managed/hive
# Spark native k8s configuration
# Specify the master
spark.master=k8s://https://10.19.29.167:6443
# Set cluster deploy mode
spark.submit.deployMode=cluster
# Specify volcano scheduler and PodGroup template
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/opt/kyuubi/conf/podgrp.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.podNamePrefix=kyuubi-ssql
spark.kubernetes.driver.podTemplateFile=/opt/kyuubi/conf/hostalias.yaml
spark.kubernetes.executor.podTemplateFile=/opt/kyuubi/conf/hostalias.yaml
# Specify the k8s namespace
spark.kubernetes.namespace=bigdata
# Specify the serviceAccount to use
spark.kubernetes.authenticate.driver.serviceAccountName=kyuubi
# Set the Spark image, pulled automatically from Harbor
spark.kubernetes.container.image=10.19.37.28:8033/bigdata/sparkkyuubi171:v3.3
spark.kubernetes.container.image.pullPolicy=IfNotPresent
```

### Kyuubi Engine Configurations

_No response_

### Additional context

_No response_

### Are you willing to submit PR?

- [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- [ ] No. I cannot submit a PR at this time.
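
The lookup behaviour described under "Describe the bug" can be modelled with a short Python sketch. It is reconstructed only from the WARN and ERROR log lines in this report and is not Kyuubi's actual implementation: `report_state`, `find_driver_pod` and `SUBMIT_TIMEOUT_MS` are hypothetical names, and the 30000 ms limit is taken from the "exceeds 30000ms" message.

```python
from typing import Callable, Optional

# Limit quoted in the ERROR log line ("exceeds 30000ms"); assumed to be a timeout.
SUBMIT_TIMEOUT_MS = 30_000


def report_state(
    find_driver_pod: Callable[[str], Optional[str]],
    tag: str,
    elapsed_ms: int,
) -> str:
    """Model of the reported application state for one polling round.

    find_driver_pod is a hypothetical lookup that returns the pod's state
    for the given kyuubi-unique-tag label, or None when no pod carries it.
    """
    pod_state = find_driver_pod(tag)
    if pod_state is not None:
        return pod_state
    # No pod yet: keep waiting until the timeout, then give up.
    if elapsed_ms <= SUBMIT_TIMEOUT_MS:
        return "UNKNOWN"
    return "NOT_FOUND"


def no_pod(tag: str) -> Optional[str]:
    # Mirrors this report: no driver pod with the batch tag ever appears.
    return None


TAG = "1342671d-ac56-44cb-862f-7cc4aa9b6656"
for elapsed in (0, 5018, 28545, 33548):  # elapsed times taken from the log
    print(elapsed, report_state(no_pod, TAG, elapsed))
```

Under this model every round up to 28545 ms yields `UNKNOWN` and the round at 33548 ms yields `NOT_FOUND`, which matches the sequence in the server log and leads to the `PENDING_STATE -> ERROR_STATE` transition regardless of how the submitted job itself ended.
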
