Season's Greetings,

We are running TPC-DS query tests with Spark 3.0.2 on Kubernetes, with data on HDFS, and we are observing longer query execution times compared to Spark 3.0.1 in the same environment. We have observed that some stages fail, but it appears to take some time for the failure to be detected and for those stages to be re-triggered. The configuration we used for the Spark driver is included below. We observe the same behaviour with Spark 3.0.3 as well. Please let us know if anyone has observed similar issues.
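For reference, the gap between a stage attempt failing and its retry being submitted can be read off the stage list returned by Spark's monitoring REST API (/api/v1/applications/&lt;app-id&gt;/stages on the driver UI port). A minimal parsing sketch; the field names follow the documented StageData response, while the helper name and sample payload below are ours:

```python
# Sketch: summarise failed stage attempts from the JSON list returned by
# Spark's monitoring REST API (/api/v1/applications/<app-id>/stages).
# Field names follow the documented StageData schema; the helper name
# "failed_attempts" is our own.

def failed_attempts(stages):
    """Return (stageId, attemptId, submissionTime) for each FAILED attempt."""
    return [
        (s["stageId"], s["attemptId"], s.get("submissionTime"))
        for s in stages
        if s.get("status") == "FAILED"
    ]

# Example with a hand-made payload in the same shape as the API response:
sample = [
    {"stageId": 3, "attemptId": 0, "status": "FAILED",
     "submissionTime": "2021-12-01T10:00:05.000GMT"},
    {"stageId": 3, "attemptId": 1, "status": "COMPLETE",
     "submissionTime": "2021-12-01T10:04:40.000GMT"},
]
print(failed_attempts(sample))
# → [(3, 0, '2021-12-01T10:00:05.000GMT')]
```

Comparing the submissionTime of a failed attempt with that of the next attempt of the same stage gives a rough measure of how long the resubmission took.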
Configuration used for the Spark driver:

spark.io.compression.codec=snappy
spark.sql.parquet.filterPushdown=true
spark.sql.inMemoryColumnarStorage.batchSize=15000
spark.shuffle.file.buffer=1024k
spark.ui.retainedStages=10000
spark.kerberos.keytab=<keytab location>
spark.speculation=false
spark.submit.deployMode=cluster
spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true
spark.sql.orc.filterPushdown=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.crossJoin.enabled=true
spark.kubernetes.kerberos.keytab=<keytab location>
spark.sql.adaptive.enabled=true
spark.kryo.unsafe=true
spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=<operator label>
spark.executor.cores=2
spark.ui.retainedTasks=200000
spark.network.timeout=2400
spark.rdd.compress=true
spark.executor.memoryOverhead=3G
spark.master=k8s\:<master ip>
spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=<label app name>
spark.kubernetes.driver.limit.cores=6144m
spark.kubernetes.submission.waitAppCompletion=false
spark.kerberos.principal=<principal>
spark.kubernetes.kerberos.enabled=true
spark.kubernetes.allocation.batch.size=5
spark.kubernetes.authenticate.driver.serviceAccountName=<serviceAccount name>
spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true
spark.reducer.maxSizeInFlight=1024m
spark.storage.memoryFraction=0.25
spark.kubernetes.namespace=<namespace name>
spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=<executor label>
spark.rpc.numRetries=5
spark.shuffle.consolidateFiles=true
spark.sql.shuffle.partitions=400
spark.kubernetes.kerberos.krb5.path=/<file path>
spark.sql.codegen=true
spark.ui.strictTransportSecurity=max-age\=31557600
spark.ui.retainedJobs=10000
spark.driver.port=7078
spark.shuffle.io.backLog=256
spark.ssl.ui.enabled=true
spark.kubernetes.memoryOverheadFactor=0.1
spark.driver.blockManager.port=7079
spark.kubernetes.executor.limit.cores=4096m
spark.submit.pyFiles=
spark.kubernetes.container.image=<image name>
spark.shuffle.io.numConnectionsPerPeer=10
spark.sql.broadcastTimeout=7200
spark.driver.cores=3
spark.executor.memory=9g
spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=dfbd9c75-3771-4392-928e-10bf28d94099
spark.driver.maxResultSize=4g
spark.sql.parquet.mergeSchema=false
spark.sql.inMemoryColumnarStorage.compressed=true
spark.rpc.retry.wait=5
spark.hadoop.parquet.enable.summary-metadata=false
spark.kubernetes.allocation.batch.delay=9
spark.driver.memory=16g
spark.sql.starJoinOptimization=true
spark.kubernetes.submitInDriver=true
spark.shuffle.compress=true
spark.memory.useLegacyMode=true
spark.jars=
spark.kubernetes.resource.type=java
spark.locality.wait=0s
spark.kubernetes.driver.ui.svc.port=4040
spark.sql.orc.splits.include.file.footer=true
spark.kubernetes.kerberos.principal=<principal>
spark.sql.orc.cache.stripe.details.size=10000
spark.executor.instances=22
spark.hadoop.fs.hdfs.impl.disable.cache=true
spark.sql.hive.metastorePartitionPruning=true

Thanks and Regards,
Prakash