So, from memory: if you disable shuffle tracking but enable shuffle block decommissioning, it should work.
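In config terms, that combination would read something like the fragment below. This is only a sketch of my reading of the suggestion, built from flag names that appear elsewhere in this thread; I have not verified it, and note the quoted reply below reports that Spark 3.4.1 on k8s rejects `spark.dynamicAllocation.shuffleTracking.enabled=false` unless the external shuffle service (`spark.shuffle.service.enabled`) is available.

```
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=false
spark.decommission.enabled=true
spark.storage.decommission.enabled=true
spark.storage.decommission.shuffleBlocks.enabled=true
```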
On Tue, Aug 8, 2023 at 4:13 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hm. I don't think it will work.
>
>   --conf spark.dynamicAllocation.shuffleTracking.enabled=false
>
> In Spark 3.4.1, running Spark on k8s, you get:
>
>   org.apache.spark.SparkException: Dynamic allocation of executors
>   requires the external shuffle service. You may enable this through
>   spark.shuffle.service.enabled.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London, United Kingdom
>
> LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Mon, 7 Aug 2023 at 21:24, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> I think you need to set
>> "spark.dynamicAllocation.shuffleTracking.enabled"
>> to false.
>>
>> On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Yes, I have seen cases where the driver is gone but a couple of
>>> executors are hanging on. Sounds like a code issue.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
>>>
>>> On Thu, 27 Jul 2023 at 15:01, Sergei Zhgirovski <ixane...@gmail.com>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm trying to use PySpark 3.3.2. I have these relevant options set:
>>>>
>>>> ----
>>>> spark.dynamicAllocation.enabled=true
>>>> spark.dynamicAllocation.shuffleTracking.enabled=true
>>>> spark.dynamicAllocation.shuffleTracking.timeout=20s
>>>> spark.dynamicAllocation.executorIdleTimeout=30s
>>>> spark.dynamicAllocation.cachedExecutorIdleTimeout=40s
>>>> spark.executor.instances=0
>>>> spark.dynamicAllocation.minExecutors=0
>>>> spark.dynamicAllocation.maxExecutors=20
>>>> spark.master=k8s://https://k8s-api.<....>:6443
>>>> ----
>>>>
>>>> So I'm using Kubernetes to deploy up to 20 executors.
>>>>
>>>> Then I run this piece of code:
>>>>
>>>> ----
>>>> import time
>>>>
>>>> df = spark.read.parquet("s3a://<directory with ~1.6TB of parquet files>")
>>>> print(df.count())
>>>> time.sleep(999)
>>>> ----
>>>>
>>>> This works fine and as expected: during execution ~1600 tasks are
>>>> completed, 20 executors get deployed, and they are quickly removed
>>>> after the calculation is complete.
>>>>
>>>> Next, I add these to the config:
>>>>
>>>> ----
>>>> spark.decommission.enabled=true
>>>> spark.storage.decommission.shuffleBlocks.enabled=true
>>>> spark.storage.decommission.enabled=true
>>>> spark.storage.decommission.rddBlocks.enabled=true
>>>> ----
>>>>
>>>> I repeat the experiment on an empty Kubernetes cluster, so that no
>>>> actual pod eviction is occurring.
>>>>
>>>> This time executor deallocation does not work as expected: depending
>>>> on the run, after the job is complete, 0-3 executors out of 20 remain
>>>> present forever and never seem to get removed.
>>>>
>>>> I tried to debug the code and found that, inside the
>>>> 'ExecutorMonitor.timedOutExecutors' function, the executors that never
>>>> get removed do not make it into the 'timedOutExecs' variable, because
>>>> their 'hasActiveShuffle' property remains 'true'.
>>>>
>>>> I'm a little stuck here trying to understand how pod management,
>>>> shuffle tracking and decommissioning are supposed to work together,
>>>> how to debug this, and whether this is expected behavior at all (to
>>>> me it is not).
>>>>
>>>> Thank you for any hints!
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
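For what it's worth, the gating behaviour Sergei describes can be modelled in a few lines of plain Python. This is only an illustrative sketch of my reading of the report, not Spark's actual ExecutorMonitor implementation; apart from the names `timedOutExecutors` and `hasActiveShuffle` mentioned in the thread, every identifier below is invented.

```python
from dataclasses import dataclass

# Illustrative model only: NOT Spark's ExecutorMonitor code.
# Only 'timed_out_executors' / 'has_active_shuffle' echo names from
# the thread; the rest is invented for this sketch.

@dataclass
class TrackedExecutor:
    idle_since: float          # timestamp when the executor became idle
    has_active_shuffle: bool   # the flag reported as stuck at True

def timed_out_executors(executors, idle_timeout, now):
    """Return ids of executors considered safe to deallocate.

    An executor whose has_active_shuffle flag never flips back to
    False is never returned, which matches the observed leak: it
    stays allocated indefinitely even though it is idle.
    """
    return [
        exec_id
        for exec_id, state in executors.items()
        if not state.has_active_shuffle
        and now - state.idle_since >= idle_timeout
    ]

executors = {
    "exec-1": TrackedExecutor(idle_since=0.0, has_active_shuffle=False),
    "exec-2": TrackedExecutor(idle_since=0.0, has_active_shuffle=True),
}

# exec-2 is just as idle as exec-1, but the stuck shuffle flag keeps it alive
print(timed_out_executors(executors, idle_timeout=30.0, now=100.0))
```

Under this toy model, only "exec-1" is ever reported as timed out, mirroring the 0-3 leftover executors whose `hasActiveShuffle` never clears.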