Hi, I hope this email finds you well.
Currently, I'm working on Spark SQL, and I have two main questions that I've been struggling with for two weeks now. I'm running Spark on AWS EMR:

1. I'm running 30 Spark applications on the same cluster. The applications are essentially SQL transformations computed over data stored in S3. Monitoring the cluster with Ganglia, I noticed that most of the time only about 2% of the total CPU is used. How can I optimize/maximize resource usage?

2. I have a Spark application that processes approximately 2 TB of data on 50 r6g.2xlarge EC2 instances, using about 90% of the total CPU capacity. The SQL job joins a time-series table with another table that I broadcast to the different nodes. Is it possible to control which data is loaded onto each node? For example, I would like each node to hold the rows that share the same join key (I've included a rough sketch of the job at the end of this email).

Thank you for your time.

Kind regards,
Aissam CHIA
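
P.S. To make question 2 more concrete, here is a minimal sketch of what the job looks like. The S3 paths, table names, and the join_key column are placeholders, not the real names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

object TimeseriesJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("timeseries-join-sketch")
      .getOrCreate()

    // Placeholder inputs: a large time-series table and a much smaller table.
    val timeseries = spark.read.parquet("s3://my-bucket/timeseries/")
    val small      = spark.read.parquet("s3://my-bucket/small-table/")

    // What the job does today: broadcast the small table to every node
    // and join on a shared key.
    val joined = timeseries.join(broadcast(small), Seq("join_key"))
    joined.write.parquet("s3://my-bucket/output/")

    // What question 2 is about: co-locating the rows that share the same
    // join key on the same node, e.g. something like this before the join?
    val colocated = timeseries.repartition(col("join_key"))

    spark.stop()
  }
}

The last step (repartitioning by the join key) is my guess at what "grouping rows with the same key on each node" would look like; I'd appreciate confirmation on whether that is the right direction.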