Hi Supun,

A couple of things with regard to your question.

--executor-cores sets the number of worker threads per executor JVM. According to your requirement of 8 threads per node, this should be set to 8.
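For instance, here is a sketch of what that could look like on your standalone cluster, with one worker per node instead of eight (the 128G figures are only illustrative, chosen to preserve your original 8 x 16G per node):

    SPARK_WORKER_CORES=8
    SPARK_WORKER_MEMORY=128G
    SPARK_WORKER_INSTANCES=1

    --executor-memory 128G --executor-cores 8

With one worker per node, standalone mode starts one 8-core executor per node by default, i.e. 16 executors and the same 128 task slots in total, but with tasks in the same JVM sharing executor memory during the shuffle.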
*repartitionAndSortWithinPartitions* is an RDD operation, and RDD operations in Spark are not performant compared to their DataFrame counterparts, in terms of both execution time and memory. I would rather use the DataFrame sort operation if performance is key; there is a minimal sketch after your quoted message below.

Regards,
Keith.

http://keith-chapman.com

On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <supun.kamburugam...@gmail.com> wrote:

> Hi all,
>
> We are trying to measure the sorting performance of Spark. We have a
> 16-node cluster with 48 cores and 256GB of RAM in each machine, and a
> 10Gbps network.
>
> Let's say we are running with 128 parallel tasks and each partition
> generates about 1GB of data (128GB in total).
>
> We are using the method *repartitionAndSortWithinPartitions*.
>
> A standalone cluster is used with the following configuration:
>
> SPARK_WORKER_CORES=1
> SPARK_WORKER_MEMORY=16G
> SPARK_WORKER_INSTANCES=8
>
> --executor-memory 16G --executor-cores 1 --num-executors 128
>
> I believe this sets up 128 executors to run the job, each having 16GB of
> memory, spread across 16 nodes with 8 threads on each node. This
> configuration runs very slowly. The program doesn't use disks to read or
> write data (the data is generated in-memory, and we don't write to a file
> after sorting).
>
> It seems that even though the data size is small, it uses the disk for
> the shuffle. We are not sure our configuration is optimal for achieving
> the best performance.
>
> Best,
> Supun.
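Here is the promised minimal sketch of the DataFrame version (the app name, row count, and column names are assumptions for illustration only, not taken from your benchmark):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("SortBenchmark").getOrCreate()
    import spark.implicits._

    // Generate (key, value) rows in memory, mirroring the in-memory benchmark.
    val df = spark
      .range(0L, 128L * 1000 * 1000)
      .map(i => (scala.util.Random.nextLong(), i.toLong))
      .toDF("key", "value")

    // Closest analogue of repartitionAndSortWithinPartitions:
    // hash-partition by key, then sort within each partition.
    val partitionSorted = df.repartition(128, $"key").sortWithinPartitions($"key")

    // Or a global sort, if a total order is what you want to measure.
    val globallySorted = df.sort($"key")

    // Force evaluation without writing any output.
    partitionSorted.foreach(_ => ())

Unlike the RDD path, the DataFrame operators sort Tungsten's compact binary rows rather than deserialized Java objects, which is where most of the execution-time and memory win comes from.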