So if you disable shuffle tracking but enable shuffle block decommissioning,
it should work, from memory.
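
Something like this, from memory (a sketch only, so please double-check the
exact config names against the docs for your Spark version):

----
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=false
spark.decommission.enabled=true
spark.storage.decommission.enabled=true
spark.storage.decommission.shuffleBlocks.enabled=true
----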

On Tue, Aug 8, 2023 at 4:13 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hm, I don't think it will work.
>
> --conf spark.dynamicAllocation.shuffleTracking.enabled=false
>
> In Spark 3.4.1, running Spark on k8s, you get:
>
> : org.apache.spark.SparkException: Dynamic allocation of executors
> requires the external shuffle service. You may enable this through
> spark.shuffle.service.enabled.
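>
> For reference, a sketch of the combination that raises this (assuming no
> external shuffle service and no decommission settings, which is the default
> on k8s):
>
> ----
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.shuffleTracking.enabled=false
> spark.shuffle.service.enabled=false
> ----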
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
> On Mon, 7 Aug 2023 at 21:24, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> I think you need to set spark.dynamicAllocation.shuffleTracking.enabled
>> to false.
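>>
>> e.g., in the PySpark session, something like (a sketch, not tested; keep
>> your other settings and just flip the tracking flag, with the k8s master,
>> image, etc. assumed to be supplied elsewhere in your conf):
>>
>> ----
>> from pyspark.sql import SparkSession
>>
>> # only the relevant toggles shown
>> spark = (
>>     SparkSession.builder
>>     .config("spark.dynamicAllocation.enabled", "true")
>>     .config("spark.dynamicAllocation.shuffleTracking.enabled", "false")
>>     .getOrCreate()
>> )
>> ----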
>>
>> On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Yes, I have seen cases where the driver is gone but a couple of executors
>>> hang on. Sounds like a code issue.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>> On Thu, 27 Jul 2023 at 15:01, Sergei Zhgirovski <ixane...@gmail.com>
>>> wrote:
>>>
>>>> Hi everyone
>>>>
>>>> I'm trying to use pyspark 3.3.2.
>>>> I have these relevant options set:
>>>>
>>>> ----
>>>> spark.dynamicAllocation.enabled=true
>>>> spark.dynamicAllocation.shuffleTracking.enabled=true
>>>> spark.dynamicAllocation.shuffleTracking.timeout=20s
>>>> spark.dynamicAllocation.executorIdleTimeout=30s
>>>> spark.dynamicAllocation.cachedExecutorIdleTimeout=40s
>>>> spark.executor.instances=0
>>>> spark.dynamicAllocation.minExecutors=0
>>>> spark.dynamicAllocation.maxExecutors=20
>>>> spark.master=k8s://https://k8s-api.<....>:6443
>>>> ----
>>>>
>>>> So I'm using Kubernetes to deploy up to 20 executors.
>>>>
>>>> Then I run this piece of code:
>>>> ----
>>>> import time
>>>>
>>>> df = spark.read.parquet("s3a://<directory with ~1.6TB of parquet files>")
>>>> print(df.count())
>>>> time.sleep(999)
>>>> ----
>>>>
>>>> This works fine and as expected: during the execution ~1600 tasks are
>>>> completed, 20 executors get deployed, and they are quickly removed after
>>>> the calculation is complete.
>>>>
>>>> Next, I add these to the config:
>>>> ----
>>>> spark.decommission.enabled=true
>>>> spark.storage.decommission.shuffleBlocks.enabled=true
>>>> spark.storage.decommission.enabled=true
>>>> spark.storage.decommission.rddBlocks.enabled=true
>>>> ----
>>>>
>>>> I repeat the experiment on an empty Kubernetes cluster, so that no
>>>> actual pod eviction is occurring.
>>>>
>>>> This time executor deallocation is not working as expected: depending
>>>> on the run, after the job is complete, 0-3 executors out of 20 remain
>>>> present forever and never seem to get removed.
>>>>
>>>> I tried to debug the code and found that inside the
>>>> 'ExecutorMonitor.timedOutExecutors' function, the executors that never
>>>> get removed do not make it into the 'timedOutExecs' variable, because
>>>> the property 'hasActiveShuffle' remains 'true' for them.
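>>>>
>>>> (If anyone wants to watch those decisions, one way is to turn on debug
>>>> logging for the monitor class; a sketch of the extra log4j2.properties
>>>> lines, assuming the stock Spark 3.3 logging config:)
>>>>
>>>> ----
>>>> logger.execmon.name = org.apache.spark.scheduler.dynalloc.ExecutorMonitor
>>>> logger.execmon.level = debug
>>>> ----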
>>>>
>>>> I'm a little stuck trying to understand how pod management, shuffle
>>>> tracking and decommissioning are supposed to work together, how to
>>>> debug this, and whether this is expected behavior at all (to me it is
>>>> not).
>>>>
>>>> Thank you for any hints!
>>>>
>>>
>>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
