Hi Supun,

A couple of things with regard to your question.

--executor-cores sets the number of worker threads per executor JVM. For
your setup this should be set to 8.
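As a rough sketch of what that could look like on a standalone cluster
(the master URL, class, and jar names below are placeholders, and
--total-executor-cores is the standalone-mode way to cap the total at
128 cores, i.e. 128 / 8 = 16 executors, one per node):

  spark-submit \
    --master spark://<master-host>:7077 \
    --executor-cores 8 \
    --executor-memory 16G \
    --total-executor-cores 128 \
    --class com.example.SortJob sort-job.jar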

*repartitionAndSortWithinPartitions* is an RDD operation, and RDD operations
in Spark are less performant than their DataFrame counterparts in both
execution time and memory use. I would rather use the DataFrame sort
operations if performance is key.
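As a minimal sketch of the two paths (assuming pairRdd is an RDD of
key/value pairs with an ordered key type and df is a DataFrame with a
"key" column; repartition plus sortWithinPartitions is roughly the
DataFrame counterpart of the RDD call):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.sql.functions.col

  // RDD path: shuffles by key and sorts records within each partition
  val sortedRdd = pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(128))

  // DataFrame path: same shape of work, but goes through the
  // Catalyst/Tungsten execution path and its more memory-efficient sort
  val sortedDf = df.repartition(128, col("key")).sortWithinPartitions("key")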

Regards,
Keith.

http://keith-chapman.com


On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <
supun.kamburugam...@gmail.com> wrote:

> Hi all,
>
> We are trying to measure the sorting performance of Spark. We have a 16
> node cluster with 48 cores and 256GB of RAM in each machine, and a 10Gbps
> network.
>
> Let's say we are running with 128 parallel tasks and each partition
> generates about 1GB of data (total 128GB).
>
> We are using the method *repartitionAndSortWithinPartitions*
>
> A standalone cluster is used with the following configuration.
>
> SPARK_WORKER_CORES=1
> SPARK_WORKER_MEMORY=16G
> SPARK_WORKER_INSTANCES=8
>
> --executor-memory 16G --executor-cores 1 --num-executors 128
>
> I believe this sets 128 executors to run the job each having 16GB of
> memory and spread across 16 nodes with 8 threads in each node. This
> configuration runs very slowly. The program doesn't use disks to read or
> write data (the data is generated in memory and we don't write to a file
> after sorting).
>
> It seems that even though the data size is small, Spark uses the disk for
> the shuffle.
> We are not sure our configurations are optimal to achieve the best
> performance.
>
> Best,
> Supun
>
>
