Hi,

I hope this email finds you well.

Currently, I'm working on Spark SQL and I have two main questions that I've
been struggling with for two weeks now. I'm running Spark on AWS EMR:

   1. I'm running 30 Spark applications on the same cluster. The
   applications are basically SQL transformations computed over data stored
   on S3. Monitoring the cluster with Ganglia, I noticed that most of the
   time only roughly 2% of the total CPU is in use. How can I
   optimize/maximize resource usage? (See the first sketch after this
   list.)
   2. I have a Spark application that processes approximately 2 TB of
   data. I'm using 50 r6g.2xlarge EC2 instances, and the job uses about 90%
   of the total CPU capacity. My SQL job joins a time-series table with
   another table that I broadcast to the different nodes. Is it possible to
   control which data is loaded onto each node? For example, I want each
   node to group the rows that share the same join key. (See the second
   sketch after this list.)
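
To illustrate question 1: part of what I'm wondering is whether dynamic
allocation would help, so that executors are released when an application
is idle and the 30 applications can share the cluster. This is only a
sketch with placeholder values, not my actual configuration:

   import org.apache.spark.sql.SparkSession

   // Sketch only: enable dynamic allocation so idle executors are given
   // back to the cluster and can be reused by the other applications.
   val spark = SparkSession.builder()
     .appName("sql-transformations")                       // placeholder name
     .config("spark.dynamicAllocation.enabled", "true")
     .config("spark.dynamicAllocation.minExecutors", "1")  // placeholder bounds
     .config("spark.dynamicAllocation.maxExecutors", "20")
     .config("spark.shuffle.service.enabled", "true")      // required for dynamic allocation on YARN
     .getOrCreate()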
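
For question 2, a sketch of what I mean (the table names, S3 paths, and
join key are made up):

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{broadcast, col}

   val spark = SparkSession.builder().appName("join-sketch").getOrCreate()

   // Hypothetical inputs; the real paths and schemas are different.
   val timeseries = spark.read.parquet("s3://bucket/timeseries/")
   val lookup     = spark.read.parquet("s3://bucket/lookup/")

   // What I do today: broadcast the small table to every node.
   val joined = timeseries.join(broadcast(lookup), Seq("id"))

   // What I'm asking about: pre-grouping the time-series rows so that all
   // rows sharing a join key land on the same node, e.g. an explicit
   // hash repartition on the key before joining.
   val colocated = timeseries
     .repartition(col("id"))
     .join(broadcast(lookup), Seq("id"))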

Thank you for your time.

Kind regards,
Aissam CHIA
