zhifanggao opened a new issue, #5169:
URL: https://github.com/apache/kyuubi/issues/5169

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the 
[issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Describe the bug
   
   Test steps:
   
   1. Submit a batch job using the REST API
   ```
   curl -u "ocdp:112345" --location --request POST 
'http://10.19.29.167:30099/api/v1/batches' --header 'Content-Type: 
application/json' --data '{ "batchType": "Spark", "resource": 
"hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar", "name": 
"kyuubi_batch_demo", "className": "org.apache.spark.examples.SparkPi", "conf": 
{"hive.server2.proxy.user":"ocdp"}}'
   ```
   2. Check the Kyuubi server pod
   ```
   kyuubi@kyuubi-deployment-example-7c7774d465-9f9xh:/opt/kyuubi$ ps -efl|grep 
driver
   0 S kyuubi     584     1 87  80   0 - 755904 futex_ 19:03 ?       00:00:04 
/opt/java/openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g 
-XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
--add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
org.apache.spark.deploy.SparkSubmit --conf 
spark.kyuubi.client.ipAddress=10.19.29.167 --conf spark.kyuubi.batch.resource.uploaded=false --conf spark.kubernetes.driverEnv.SPARK_USER_NAME=ocdp --conf 
spark.executorEnv.SPARK_USER_NAME=ocdp --conf 
spark.hive.server2.proxy.user=ocdp --conf 
spark.kubernetes.driver.label.kyuubi-unique-tag=1342671d-ac56-44cb-862f-7cc4aa9b6656
 --conf spark.app.name=kyuubi_batch_demo --conf 
spark.kyuubi.session.real.user=ocdp --conf 
spark.kyuubi.server.ipAddress=0.0.0.0 --conf 
spark.kyuubi.session.connection.url=0.0.0.0:10099 --conf 
spark.kyuubi.batch.id=1342671d-ac56-44cb-862f-7cc4aa9b6656 --class 
org.apache.spark.examples.SparkPi --proxy-user ocdp 
hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar
   ```
   3. Once the batch has completed, check its status
   ```
   [root@host-10-19-37-166 ~]# curl -u "ocdp:112345" --location --request GET  
'http://10.19.29.167:30099/api/v1/batches/1342671d-ac56-44cb-862f-7cc4aa9b6656'
   
{"id":"1342671d-ac56-44cb-862f-7cc4aa9b6656","user":"ocdp","batchType":"SPARK","name":"kyuubi_batch_demo","appStartTime":0,"appId":null,"appUrl":null,"appState":"NOT_FOUND","appDiagnostic":null,"kyuubiInstance":"0.0.0.0:10099","state":"ERROR","createTime":1692097410346,"endTime":1692097443926,"batchInfo":{}}
   ```
   
   The status of the batch job is ERROR; in fact, the batch job executed 
successfully. 
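
   The contradiction is visible directly in the fields of the JSON response above. A minimal sketch (plain Python, using only the response already shown) pulls the relevant fields apart:

```python
import json

# Batch status JSON as returned by GET /api/v1/batches/<batch id> above.
resp = json.loads(
    '{"id":"1342671d-ac56-44cb-862f-7cc4aa9b6656","user":"ocdp",'
    '"batchType":"SPARK","name":"kyuubi_batch_demo","appStartTime":0,'
    '"appId":null,"appUrl":null,"appState":"NOT_FOUND","appDiagnostic":null,'
    '"kyuubiInstance":"0.0.0.0:10099","state":"ERROR",'
    '"createTime":1692097410346,"endTime":1692097443926,"batchInfo":{}}'
)

# No application id was ever resolved: the server gave up looking for the
# driver pod (appState NOT_FOUND), so the batch state became ERROR even
# though the Spark job itself finished.
print(resp["state"], resp["appState"], resp["appId"])
```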
   
   Checking the Kyuubi logs:
   ```
   2023-08-15 11:03:30.358 INFO org.apache.kyuubi.session.KyuubiSessionManager: 
ocdp's session with SessionHandle 
[1342671d-ac56-44cb-862f-7cc4aa9b6656]/kyuubi_batch_demo is opened, current 
opening sessions 4
   2023-08-15 11:03:30.359 INFO org.apache.kyuubi.operation.BatchJobSubmission: 
Submitting SPARK batch[1342671d-ac56-44cb-862f-7cc4aa9b6656] job:
   /opt/spark/bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.hive.server2.proxy.user=ocdp \
        --conf spark.kyuubi.batch.id=1342671d-ac56-44cb-862f-7cc4aa9b6656 \
        --conf spark.kyuubi.batch.resource.uploaded=false \
        --conf spark.kyuubi.client.ipAddress=10.19.29.167 \
        --conf spark.kyuubi.server.ipAddress=0.0.0.0 \
        --conf spark.kyuubi.session.connection.url=0.0.0.0:10099 \
        --conf spark.kyuubi.session.real.user=ocdp \
        --conf spark.app.name=kyuubi_batch_demo \
        --conf 
spark.kubernetes.driver.label.kyuubi-unique-tag=1342671d-ac56-44cb-862f-7cc4aa9b6656
 \
        --conf spark.kubernetes.driverEnv.SPARK_USER_NAME=ocdp \
        --conf spark.executorEnv.SPARK_USER_NAME=ocdp \
        --proxy-user ocdp 
hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar
   2023-08-15 11:03:30.361 INFO 
org.apache.kyuubi.server.http.authentication.AuthenticationAuditLogger: 
user=ocdp(auth:BASIC)   ip=10.19.29.167 proxyIp=null    method=POST     
uri=/api/v1/batches     params=null     protocol=HTTP/1.1       status=200
   2023-08-15 11:03:30.364 INFO org.apache.kyuubi.engine.ProcBuilder: Logging 
to /opt/kyuubi/work/ocdp/kyuubi-spark-batch-submit.log.3
   2023-08-15 11:03:30.372 WARN 
org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to 
be created, elapsed time: 0ms, return UNKNOWN status
   2023-08-15 11:03:30.374 INFO org.apache.kyuubi.operation.BatchJobSubmission: 
Batch report for 1342671d-ac56-44cb-862f-7cc4aa9b6656, 
Some(ApplicationInfo(null,null,UNKNOWN,None,None))
   2023-08-15 11:03:35.390 WARN 
org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to 
be created, elapsed time: 5018ms, return UNKNOWN status
   2023-08-15 11:03:40.391 WARN 
org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to 
be created, elapsed time: 10019ms, return UNKNOWN status
   2023-08-15 11:03:45.393 WARN 
org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to 
be created, elapsed time: 15021ms, return UNKNOWN status
   2023-08-15 11:03:50.394 WARN 
org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to 
be created, elapsed time: 20022ms, return UNKNOWN status
   2023-08-15 11:03:58.917 WARN 
org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to 
be created, elapsed time: 28545ms, return UNKNOWN status
   2023-08-15 11:04:03.920 ERROR 
org.apache.kyuubi.engine.KubernetesApplicationOperation: Can't find target 
driver pod by tag: 1342671d-ac56-44cb-862f-7cc4aa9b6656, elapsed time: 33548ms 
exceeds 30000ms.
   2023-08-15 11:04:03.923 INFO org.apache.kyuubi.operation.BatchJobSubmission: 
Batch report for 1342671d-ac56-44cb-862f-7cc4aa9b6656, 
Some(ApplicationInfo(null,null,NOT_FOUND,None,None))
   2023-08-15 11:04:03.926 INFO org.apache.kyuubi.operation.BatchJobSubmission: 
Processing ocdp's query[4adbdd40-5cec-42ca-b670-c25bc0d8bd19]: PENDING_STATE -> 
ERROR_STATE, time taken: 1.692097443926E9 seconds
   2023-08-15 11:08:18.960 INFO 
org.apache.kyuubi.server.http.authentication.AuthenticationAuditLogger: 
user=ocdp(auth:BASIC)   ip=10.19.29.167 proxyIp=null    method=GET      
uri=/api/v1/batches/1342671d-ac56-44cb-862f-7cc4aa9b6656        params=null     
protocol=HTTP/1.1       status=200
   ```
   The Kyuubi server checks for a driver pod tagged with the batch id; when 
none is found within the timeout, it marks the batch as ERROR. 
   
   But in fact, no driver pod is ever created. 
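
   The timeout behaviour visible in the log can be sketched roughly as follows. This is an illustrative sketch only, not the actual KubernetesApplicationOperation code; the 30000 ms limit and the UNKNOWN/NOT_FOUND values are taken from the log messages above, everything else is assumed:

```python
# Illustrative sketch of the observed state transitions; not Kyuubi's real code.
SUBMIT_TIMEOUT_MS = 30_000  # matches the "exceeds 30000ms" message in the log

def find_driver_pod(tag):
    """Stand-in for the label lookup (kyuubi-unique-tag=<batch id>).
    In this bug report no driver pod is ever created, so it returns None."""
    return None

def application_state(tag, elapsed_ms):
    """UNKNOWN while still waiting for the driver pod; NOT_FOUND once the
    wait exceeds the timeout, which is what flips the batch to ERROR."""
    if find_driver_pod(tag) is not None:
        return "RUNNING"  # hypothetical happy path, pod was found
    if elapsed_ms <= SUBMIT_TIMEOUT_MS:
        return "UNKNOWN"   # "Wait for driver pod to be created ..."
    return "NOT_FOUND"     # "Can't find target driver pod by tag ..."

tag = "1342671d-ac56-44cb-862f-7cc4aa9b6656"
print(application_state(tag, 5_018))    # UNKNOWN, as at 11:03:35
print(application_state(tag, 33_548))   # NOT_FOUND, as at 11:04:03
```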
   
   ### Affects Version(s)
   
   1.7.1
   
   ### Kyuubi Server Log Output
   
   _No response_
   
   ### Kyuubi Engine Log Output
   
   _No response_
   
   ### Kyuubi Server Configurations
   
   ```yaml
   #
   # Licensed to the Apache Software Foundation (ASF) under one or more
   # contributor license agreements.  See the NOTICE file distributed with
   # this work for additional information regarding copyright ownership.
   # The ASF licenses this file to You under the Apache License, Version 2.0
   # (the "License"); you may not use this file except in compliance with
   # the License.  You may obtain a copy of the License at
   #
   #    http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing, software
   # distributed under the License is distributed on an "AS IS" BASIS,
   # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   # See the License for the specific language governing permissions and
   # limitations under the License.
   #
   
   
##################################################################################
   #  Kyuubi Configurations
   
##################################################################################
   #
   kyuubi.frontend.protocols THRIFT_BINARY,REST
   #kyuubi.frontend.rest.bind.host localhost
   kyuubi.frontend.rest.bind.port 10099
   kyuubi.authentication           NONE
   #kyuubi.authentication KERBEROS
   #kyuubi.kinit.principal hive/[email protected]
   #kyuubi.kinit.keytab /opt/kyuubi/conf/hive.service.keytab
   # kyuubi.frontend.bind.host       localhost
   # kyuubi.frontend.bind.port       10009
   kyuubi.session.engine.initialize.timeout 3000000000
   # Set the engine share level to USER
   kyuubi.engine.share.level USER
   kyuubi.session.engine.idle.timeout PT10H
   # Enable HA; this uses a ZooKeeper cluster outside of k8s
   kyuubi.ha.enabled true
   kyuubi.ha.zookeeper.quorum 10.19.37.28:2181
   kyuubi.ha.zookeeper.client.port 2181
   kyuubi.ha.zookeeper.namespace kyuubi
   # Set the engine jar location, accessed from shared S3 storage
   kyuubi.session.engine.spark.main.resource 
local:///opt/spark/work-dir/kyuubi-spark-sql-engine_2.12-1.7.1.jar
   # Disable hostname; in 1.5.1, leaving it enabled causes failures resolving the nameservice. The exact cause is unknown; investigate further if interested.
   kyuubi.engine.connection.url.use.hostname=false
   
   
##################################################################################
   #  Spark Configurations
   
##################################################################################
   spark.shuffle.file.buffer 2097151
   spark.shuffle.io.backLog 8192
   spark.shuffle.io.serverThreads 128
   spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf
   spark.shuffle.service.enabled false
   spark.shuffle.unsafe.file.output.buffer 5m
   spark.sql.autoBroadcastJoinThreshold 10214400000
   spark.sql.hive.convertMetastoreOrc true
   spark.sql.orc.filterPushdown true
   spark.sql.orc.impl native
   spark.sql.statistics.fallBackToHdfs true
   spark.unsafe.sorter.spill.reader.buffer.size 1m
   # must use the Kryo serializer because the Java serializer does not support relocation
   spark.serializer org.apache.spark.serializer.KryoSerializer
   
   # celeborn master
   
   # options: hash, sort
   # The hash shuffle writer uses (partition count) * (celeborn.push.buffer.size) * (spark.executor.cores) memory.
   # The sort shuffle writer uses less memory than the hash shuffle writer; if your shuffle partition count is large, try the sort shuffle writer.
   
   # We recommend setting spark.celeborn.push.replicate.enabled to true to enable server-side data replication.
   # If you have only one worker, this setting must be false.
   
   # Spark AQE support is only tested under Spark 3.
   # We recommend setting localShuffleReader to false for better Celeborn performance.
   spark.sql.adaptive.localShuffleReader.enabled true
   
   # We recommend enabling AQE support for better performance
   spark.sql.adaptive.enabled true
   spark.sql.adaptive.skewJoin.enabled true
   # Hive Metastore configuration
   spark.sql.hive.metastore.version 2.3.9
   #spark.sql.hive.metastore.jars path
   spark.sql.warehouse.dir /warehouse/tablespace/managed/hive
   # Spark native k8s configuration
   # Specify the master
   spark.master=k8s://https://10.19.29.167:6443
   # Use cluster deploy mode
   spark.submit.deployMode=cluster
   # Specify volcano scheduler and PodGroup template
   spark.kubernetes.scheduler.name=volcano
   
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/opt/kyuubi/conf/podgrp.yaml
   
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
   
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
   spark.kubernetes.executor.podNamePrefix=kyuubi-ssql
   spark.kubernetes.driver.podTemplateFile=/opt/kyuubi/conf/hostalias.yaml
   spark.kubernetes.executor.podTemplateFile=/opt/kyuubi/conf/hostalias.yaml
   # Specify the k8s namespace
   spark.kubernetes.namespace=bigdata
   # Specify the serviceAccount to use
   spark.kubernetes.authenticate.driver.serviceAccountName=kyuubi
   # Set the Spark image, pulled automatically from Harbor
   spark.kubernetes.container.image=10.19.37.28:8033/bigdata/sparkkyuubi171:v3.3
   spark.kubernetes.container.image.pullPolicy=IfNotPresent
   ```
   
   
   ### Kyuubi Engine Configurations
   
   _No response_
   
   ### Additional context
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi 
community to fix.
   - [ ] No. I cannot submit a PR at this time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

