Hi, I hope this email finds you well.
Currently, I'm working on Spark SQL, and I have two main questions that I've been struggling with for two weeks now. I'm running Spark on AWS EMR:

1. I'm running 30 Spark applications on the same cluster. The applications are essentially SQL transformations computed over data stored in S3. Monitoring the cluster with Ganglia, I noticed that most of the time only about 2% of the total CPU is used. How can I optimize/maximize resource usage?

2. I have a Spark application that processes approximately 2 TB of data on 50 r6g.2xlarge EC2 instances, using about 90% of the total CPU capacity. The SQL job joins a time-series table with another table that I broadcast to the different nodes. Is it possible to control which data is loaded onto each node? For example, I would like each node to hold the rows that share the same join key (I've included a rough sketch of the job at the end of this email).

Thank you for your time.

Kind regards,
Aissam CHIA
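
P.S. To make question 2 more concrete, here is a minimal sketch of what the job looks like. The S3 paths, table names, and the join_key column are placeholders, not the real names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

object TimeseriesJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("timeseries-join-sketch")
      .getOrCreate()

    // Placeholder inputs: a large time-series table and a much smaller table.
    val timeseries = spark.read.parquet("s3://my-bucket/timeseries/")
    val small      = spark.read.parquet("s3://my-bucket/small-table/")

    // What the job does today: broadcast the small table to every node
    // and join on a shared key.
    val joined = timeseries.join(broadcast(small), Seq("join_key"))
    joined.write.parquet("s3://my-bucket/output/")

    // What question 2 is about: co-locating the rows that share the same
    // join key on the same node, e.g. something like this before the join?
    val colocated = timeseries.repartition(col("join_key"))

    spark.stop()
  }
}

The last step (repartitioning by the join key) is my guess at what "grouping rows with the same key on each node" would look like; I'd appreciate confirmation on whether that is the right direction.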