Thanks for these suggestions. Regarding hot nodes, are you referring to the same concept as in this article? https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x

I am also curious where the 10MB heuristic came from, though I have heard a similar heuristic with respect to the size of a partition. I suspect the best way to see the size of a partition is simply to write the dataframe to Parquet and observe the sizes of the written Parquet files?
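Something like this quick check is what I have in mind (a rough sketch; df is the dataframe in question, and it assumes the output goes to a local path -- for S3 or HDFS the listing would go through the Hadoop FileSystem API instead):

    import os

    out_dir = "/tmp/partition_size_check"  # hypothetical scratch path

    # Write the dataframe as-is; each non-empty partition becomes one
    # part-*.parquet file, so the file sizes approximate partition sizes
    # (post-compression, so in-memory sizes will be noticeably larger).
    df.write.mode("overwrite").parquet(out_dir)

    for name in sorted(os.listdir(out_dir)):
        if name.startswith("part-"):
            size_mb = os.path.getsize(os.path.join(out_dir, name)) / (1024 * 1024)
            print(f"{name}: {size_mb:.1f} MB")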
Thanks

On Fri, Feb 11, 2022 at 12:48 PM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

> You can probably tune writing to Elasticsearch by:
>
> 1. Increasing the number of partitions, so you are writing smaller batches of rows to Elasticsearch
> 2. Using Elasticsearch's bulk API
> 3. Scaling up the number of hot nodes on the Elasticsearch cluster to support writing in parallel
>
> You want to minimize long-running tasks, and not just to avoid the "thread dump": a large number of shorter-running tasks is better than a small number of long-running tasks, because you can scale up your processing by throwing hardware at it. This is subject to the law of diminishing returns, i.e., at some point making your tasks smaller will start slowing you down. You need to find the sweet spot.
>
> Generally, for Elasticsearch, the sweet spot is for each task to write around 10MB of data using the bulk API. Writing 10MB of data per task should take on the order of a few seconds, and you won't get the dreaded thread dump if your tasks only take a few seconds.
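To check my understanding of points (1) and (2) and the ~10MB target together: is the following roughly what you mean? (An untested sketch using the elasticsearch-py client; the hosts, index name, and partition count are placeholders.)

    from elasticsearch import Elasticsearch, helpers

    def write_partition(rows):
        # One client and one stream of bulk requests per task, not per row.
        es = Elasticsearch(hosts=["http://elastic-node:9200"])  # placeholder
        actions = ({"_index": "my-index", "_source": row} for row in rows)
        # helpers.bulk groups actions into bulk API calls; chunk_size is in
        # documents, so choose it so that one chunk is roughly 10MB.
        helpers.bulk(es, actions, chunk_size=1000)

    # Repartition first so each task carries ~10MB of rows before writing.
    (df.repartition(400)  # placeholder partition count
       .rdd
       .map(lambda r: r.asDict(True))
       .foreachPartition(write_partition))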
From: Maksim Grinman <m...@resolute.ai>
Date: Thursday, February 10, 2022 at 7:21 PM
To: "Lalwani, Jayesh" <jlalw...@amazon.com>
Cc: Mich Talebzadeh <mich.talebza...@gmail.com>, Holden Karau <hol...@pigscanfly.ca>, Sean Owen <sro...@gmail.com>, "user @spark" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Spark 3.1.2 full thread dumps

> That's fair, but I do get the same thread dump at the last step of the Spark job, where we write the final dataframe into an Elasticsearch index. It's a df.rdd.map(lambda r: r.asDict(True)).foreachPartition operation, which takes a while, and we usually get a thread dump during that as well.

On Mon, Feb 7, 2022 at 11:24 AM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

> Probably not the answer you are looking for, but the best thing to do is to avoid making Spark code sleep. Is there a way you can predict how big your autoscaling group needs to be without looking at all the data? Are you using a fixed number of Spark executors, or do you have some way of scaling your executors? I am guessing that the size of your autoscaling group is proportional to the number of Spark executors. You can probably measure how many executors each node can support, and then tie the size of your autoscaling group to the number of executors.
>
> Alternatively, you can build your service so that a) it autoscales as load increases, and b) it throttles requests when the load is higher than it can currently manage. This means that when Spark executors start hitting your nodes, your service will throttle many of the requests and start autoscaling up. Note that this is an established pattern in the cloud; it is how most services on AWS work. The end result is that there will initially be higher latency due to the cold start, but the system will catch up eventually.

From: Maksim Grinman <m...@resolute.ai>
Date: Friday, February 4, 2022 at 9:35 PM
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, Sean Owen <sro...@gmail.com>, "user @spark" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Spark 3.1.2 full thread dumps

> Not that this discussion is not interesting (it is), but it has strayed pretty far from my original question, which was: how do I prevent Spark from dumping huge Java full thread dumps when an executor appears not to be doing anything (in my case, there's a loop where it sleeps while waiting for a service to come up)? The service happens to be set up using an autoscaling group, a coincidental and unimportant detail that seems to have derailed the conversation.

On Fri, Feb 4, 2022 at 7:18 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK, basically: do we have a scenario where Spark, or for that matter any cluster manager, can deploy a new node (after the loss of an existing node) with a view to running the failed tasks on the new executor(s) deployed on that newly spun-up node?

On Sat, 5 Feb 2022 at 00:00, Holden Karau <hol...@pigscanfly.ca> wrote:

> We don't block scaling up after node failure in classic Spark, if that's the question.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> From what I can see in the autoscaling setup, you will always need a minimum of two worker nodes as primary workers. It also states, and I quote, "Scaling primary workers is not recommended due to HDFS limitations which result in instability while scaling. These limitations do not exist for secondary workers", so the scaling comes from the secondary workers, with min and max instance counts. It also defaults to 2 minutes for the so-called autoscaling cooldown duration, hence the delay observed. I presume task allocation to the new executors is FIFO for new tasks. This link <https://docs.qubole.com/en/latest/admin-guide/engine-admin/spark-admin/autoscale-spark.html#:~:text=dynamic%20allocation%20configurations.-,Autoscaling%20in%20Spark%20Clusters,scales%20down%20towards%20the%20minimum.&text=By%20default%2C%20Spark%20uses%20a%20static%20allocation%20of%20resources.> gives some explanation of autoscaling.
>
> Handling Spot Node Loss and Spot Blocks in Spark Clusters:
>
> "When the Spark AM receives the spot loss (Spot Node Loss or Spot Blocks) notification from the RM, it notifies the Spark driver. The driver then performs the following actions:
>
> 1. Identifies all the executors affected by the upcoming node loss.
> 2. Moves all of the affected executors to a decommissioning state, so that no new tasks are scheduled on these executors.
> 3. Kills all the executors after reaching 50% of the termination time.
> 4. Starts the failed tasks (if any) on other executors.
> 5. For these nodes, removes all the entries of the shuffle data from the map output tracker on the driver after reaching 90% of the termination time. This helps prevent shuffle-fetch failures due to spot loss.
> 6. Recomputes the shuffle data from the lost node by stage resubmission, and at that time reshuffles the data of the spot node if required."
>
> So basically, when a node fails, classic Spark comes into play: no new nodes are added (no rescaling), and tasks are redistributed among the existing executors, as I read it?
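(A side note on the decommissioning behaviour quoted above: open-source Spark 3.1 has a similar graceful-decommissioning mode behind feature flags. A minimal sketch of the relevant spark-submit settings, assuming Spark >= 3.1; I have not verified which of these, if any, the managed autoscalers discussed here actually enable:)

    spark-submit \
      --conf spark.decommission.enabled=true \
      --conf spark.storage.decommission.enabled=true \
      --conf spark.storage.decommission.shuffleBlocks.enabled=true \
      --conf spark.storage.decommission.rddBlocks.enabled=true \
      ...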
On Fri, 4 Feb 2022 at 13:55, Sean Owen <sro...@gmail.com> wrote:

> I have not seen stack traces under autoscaling, so I am not even sure what the error in question is. There is always a delay in acquiring a whole new executor in the cloud, as it usually means a new VM is provisioned. Spark treats the new executor like any other, available for executing tasks.

On Fri, Feb 4, 2022 at 4:28 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks for the info.
>
> My concern has always been with how Spark handles autoscaling (adding new executors) when the load pattern changes. I have tried to test this by setting the following parameters (Spark 3.1.2 on GCP):
>
>     spark-submit --verbose \
>       ....... \
>       --conf spark.dynamicAllocation.enabled="true" \
>       --conf spark.shuffle.service.enabled="true" \
>       --conf spark.dynamicAllocation.minExecutors=2 \
>       --conf spark.dynamicAllocation.maxExecutors=10 \
>       --conf spark.dynamicAllocation.initialExecutors=4 \
>
> It is not very clear to me how Spark distributes tasks to the added executors, or what the source of the delay is. As you have observed, there is a delay between adding new resources and allocating tasks. Is that process efficient?

On Fri, Feb 4, 2022 at 03:04, Maksim Grinman <m...@resolute.ai> wrote:

> It's actually on AWS EMR. The job bootstraps and runs fine -- the autoscaling group is there to bring up a service that Spark will be calling. Some code waits for the autoscaling group to come up before continuing processing in Spark, since the Spark cluster will need to make requests to the service in the autoscaling group. It takes several minutes for the service to come up, and during the wait Spark starts to show these thread dumps, presumably because it thinks something is wrong since the executor is busy waiting and not doing anything. The previous version of Spark (2.4.4) did not do this.
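(For context, the kind of wait described above is essentially this shape; a simplified sketch with made-up group name and timing constants, using boto3's autoscaling client:)

    import time
    import boto3

    def wait_for_asg(name="my-service-asg", timeout_s=900):
        """Block until every instance in the autoscaling group is InService."""
        asg = boto3.client("autoscaling")  # region/credentials from environment
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            group = asg.describe_auto_scaling_groups(
                AutoScalingGroupNames=[name]
            )["AutoScalingGroups"][0]
            instances = group["Instances"]
            if instances and all(
                i["LifecycleState"] == "InService" for i in instances
            ):
                return
            # If this loop runs inside a task, this sleep is the "idle"
            # period during which the executor produces thread dumps.
            time.sleep(15)
        raise TimeoutError(f"{name} did not come up within {timeout_s}s")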
On Thu, Feb 3, 2022 at 6:59 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Sounds like you are running this on a Google Dataproc cluster (Spark 3.1.2) with an autoscaling policy?
>
> Can you describe whether this happens before Spark starts a new job on the cluster, or somehow halfway through processing an existing job?
>
> Also, does the job involve Spark Structured Streaming?
>
> HTH

On Thu, 3 Feb 2022 at 21:29, Maksim Grinman <m...@resolute.ai> wrote:

> We've got a Spark task that, after some processing, starts an autoscaling group and waits for it to be up before continuing processing. While waiting for the autoscaling group, Spark starts throwing full thread dumps, presumably at the spark.executor.heartbeatInterval. Is there a way to prevent the thread dumps?
>
> --
> Maksim Grinman
> VP Engineering
> Resolute AI

--
Maksim Grinman
VP Engineering
Resolute AI