The problem with these tickets is that they tend to generalise the
performance problem rather than state specifics.

The latter ticket states, and I quote:

 "Spark 3.1.1 is slower than 3.0.2 by 4-5 times".

This is not what we have observed migrating from 3.0.1 to 3.1.1. Unless it
impacts your area of interest specifically, I would not worry too much
about it.

Anyway, back to your point: as I understand it, you are using Spark 3.0.2
on Kubernetes, launched with spark-submit 3.0.2, right? Your data is on
HDFS, so how exactly is Spark accessing it? Your Spark-on-k8s setup gives
me the impression that you may in fact be accessing cloud buckets.
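
For reference, a minimal Scala sketch of the two access patterns I have in
mind; the namenode host, port, bucket and paths below are made-up
placeholders, and spark is assumed to be an active SparkSession:

  // Reading directly from HDFS through the Hadoop client
  val dfHdfs = spark.read.parquet("hdfs://namenode:8020/data/tpcds/store_sales")

  // Reading from a cloud bucket instead, e.g. GCS via its connector
  val dfGcs = spark.read.parquet("gs://some-bucket/data/tpcds/store_sales")

If your input paths look like the first form you are on HDFS proper; if
they look like the second you are going through a cloud storage connector,
which has quite different performance characteristics.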

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 30 Aug 2021 at 11:53, Sharma, Prakash (Nokia - IN/Bangalore) <
prakash.sha...@nokia.com> wrote:

> Hi ,
>
> We are not moving to 3.1.1 because of the open tickets I have mentioned
> below:
> https://issues.apache.org/jira/browse/SPARK-30536
>
> https://issues.apache.org/jira/browse/SPARK-35066
>
>
> Please refer to the attached mail for SPARK-35066.
>
> Thanks.
>
> ------------------------------
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Monday, August 30, 2021 1:15:07 PM
> *To:* Sharma, Prakash (Nokia - IN/Bangalore) <prakash.sha...@nokia.com>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: Performance Degradation in Spark 3.0.2 compared to Spark
> 3.0.1
>
> Hi,
>
> Any particular reason why you are not using 3.1.1 on Kubernetes?
>
>
> On Mon, 30 Aug 2021 at 06:10, Sharma, Prakash (Nokia - IN/Bangalore) <
> prakash.sha...@nokia.com> wrote:
>
> Season's Greetings,
>      We are running TPC-DS query tests with Spark 3.0.2 on Kubernetes,
> with data on HDFS, and we are observing longer query execution times
> compared to Spark 3.0.1 in the same environment. We have observed that
> some stages fail, but it appears to take some time for the failure to be
> detected and for those stages to be re-triggered. I am attaching the
> configuration we used for the Spark driver. We observe the same behaviour
> with Spark 3.0.3 as well.
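>
> For illustration, this is roughly how we time a single query (a hedged
> sketch rather than our exact harness; queryText is a placeholder for the
> text of one TPC-DS query, and spark is an active SparkSession):
>
>   val t0 = System.nanoTime()
>   spark.sql(queryText)
>     .write.format("noop").mode("overwrite").save()  // force full execution
>   println(s"Elapsed: ${(System.nanoTime() - t0) / 1e9} s")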
>
> *Please let us know if anyone has observed similar issues.*
>
> Configuration we use for the Spark driver:
>
> spark.io.compression.codec=snappy
>
> spark.sql.parquet.filterPushdown=true
>
> spark.sql.inMemoryColumnarStorage.batchSize=15000
>
> spark.shuffle.file.buffer=1024k
>
> spark.ui.retainedStages=10000
>
> spark.kerberos.keytab=<keytab location>
>
> spark.speculation=false
>
> spark.submit.deployMode=cluster
>
> spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true
>
> spark.sql.orc.filterPushdown=true
>
> spark.serializer=org.apache.spark.serializer.KryoSerializer
>
> spark.sql.crossJoin.enabled=true
>
> spark.kubernetes.kerberos.keytab=<keytab location>
>
> spark.sql.adaptive.enabled=true
>
> spark.kryo.unsafe=true
>
> spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=<operator label>
>
> spark.executor.cores=2
>
> spark.ui.retainedTasks=200000
>
> spark.network.timeout=2400
>
> spark.rdd.compress=true
>
> spark.executor.memoryOverhead=3G
>
> spark.master=k8s\://<master ip>
>
> spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=<label app name>
>
> spark.kubernetes.driver.limit.cores=6144m
>
> spark.kubernetes.submission.waitAppCompletion=false
>
> spark.kerberos.principal=<principal>
>
> spark.kubernetes.kerberos.enabled=true
>
> spark.kubernetes.allocation.batch.size=5
>
> spark.kubernetes.authenticate.driver.serviceAccountName=<serviceAccount name>
>
> spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true
>
> spark.reducer.maxSizeInFlight=1024m
>
> spark.storage.memoryFraction=0.25
>
> spark.kubernetes.namespace=<namespace name>
>
> spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=<executor label>
>
> spark.rpc.numRetries=5
>
> spark.shuffle.consolidateFiles=true
>
> spark.sql.shuffle.partitions=400
>
> spark.kubernetes.kerberos.krb5.path=/<file path>
>
> spark.sql.codegen=true
>
> spark.ui.strictTransportSecurity=max-age\=31557600
>
> spark.ui.retainedJobs=10000
>
> spark.driver.port=7078
>
> spark.shuffle.io.backLog=256
>
> spark.ssl.ui.enabled=true
>
> spark.kubernetes.memoryOverheadFactor=0.1
>
> spark.driver.blockManager.port=7079
>
> spark.kubernetes.executor.limit.cores=4096m
>
> spark.submit.pyFiles=
>
> spark.kubernetes.container.image=<image name>
>
> spark.shuffle.io.numConnectionsPerPeer=10
>
> spark.sql.broadcastTimeout=7200
>
> spark.driver.cores=3
>
> spark.executor.memory=9g
>
> spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=dfbd9c75-3771-4392-928e-10bf28d94099
>
> spark.driver.maxResultSize=4g
>
> spark.sql.parquet.mergeSchema=false
>
> spark.sql.inMemoryColumnarStorage.compressed=true
>
> spark.rpc.retry.wait=5
>
> spark.hadoop.parquet.enable.summary-metadata=false
>
> spark.kubernetes.allocation.batch.delay=9
>
> spark.driver.memory=16g
>
> spark.sql.starJoinOptimization=true
>
> spark.kubernetes.submitInDriver=true
>
> spark.shuffle.compress=true
>
> spark.memory.useLegacyMode=true
>
> spark.jars=
>
> spark.kubernetes.resource.type=java
>
> spark.locality.wait=0s
>
> spark.kubernetes.driver.ui.svc.port=4040
>
> spark.sql.orc.splits.include.file.footer=true
>
> spark.kubernetes.kerberos.principal=<principal>
>
> spark.sql.orc.cache.stripe.details.size=10000
>
> spark.executor.instances=22
>
> spark.hadoop.fs.hdfs.impl.disable.cache=true
>
> spark.sql.hive.metastorePartitionPruning=true
>
>
> Thanks and Regards
> Prakash
>
>
>
>
