Re: Sorting tuples with byte key and byte value
Thanks, Keith. We have set SPARK_WORKER_INSTANCES=8, so we are running 8 workers on a single machine, each with 1 thread, which gives us the 8 threads. Is there a preference for running 1 worker with 8 threads inside it? These are dual-CPU machines, so I believe we need at least 2 worker instances per machine. If that is the case, I can use 2 worker instances, each with 4 threads.

Another question is how to avoid the disk for the shuffle operation?

Best,
Supun..

On Mon, Jul 15, 2019 at 8:49 PM Keith Chapman wrote:

> Hi Supun,
>
> A couple of things with regard to your question.
>
> --executor-cores means the number of worker threads per VM. According to
> your requirement this should be set to 8.
>
> repartitionAndSortWithinPartitions is an RDD operation; RDD operations in
> Spark are less performant in terms of both execution and memory. I would
> rather use the Dataframe sort operation if performance is key.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <
> supun.kamburugam...@gmail.com> wrote:
>
>> Hi all,
>>
>> We are trying to measure the sorting performance of Spark. We have a 16
>> node cluster with 48 cores and 256GB of RAM in each machine and a 10Gbps
>> network.
>>
>> Let's say we are running with 128 parallel tasks and each partition
>> generates about 1GB of data (128GB total).
>>
>> We are using the method repartitionAndSortWithinPartitions.
>>
>> A standalone cluster is used with the following configuration:
>>
>> SPARK_WORKER_CORES=1
>> SPARK_WORKER_MEMORY=16G
>> SPARK_WORKER_INSTANCES=8
>>
>> --executor-memory 16G --executor-cores 1 --num-executors 128
>>
>> I believe this sets up 128 executors to run the job, each having 16GB of
>> memory, spread across 16 nodes with 8 threads on each node. This
>> configuration runs very slowly. The program doesn't use disks to read or
>> write data (the data is generated in-memory and we don't write to file
>> after sorting).
>>
>> It seems that even though the data size is small, it uses the disk for
>> the shuffle. We are not sure our configurations are optimal to achieve
>> the best performance.
>>
>> Best,
>> Supun..
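[For reference, a minimal sketch of the 2-workers-with-4-threads layout asked about above, in spark-env.sh form. The memory figure and the tmpfs path are assumptions, not values from the thread; Spark always writes shuffle map output to files under its local directories, so pointing them at a RAM-backed tmpfs is one common way to keep that traffic off physical disks.]

```shell
# spark-env.sh sketch (values are illustrative assumptions):
# 2 worker JVMs per machine x 4 threads each = 8 threads per node.
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=64G

# Shuffle map output always goes through files under the local dirs;
# a RAM-backed tmpfs such as /dev/shm keeps those writes off physical disk.
export SPARK_LOCAL_DIRS=/dev/shm/spark-local
```

Note that a tmpfs draws on the same physical RAM as the executors, so this only helps when the shuffle data comfortably fits in memory alongside the job's heap.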
Re: Sorting tuples with byte key and byte value
Hi Supun,

A couple of things with regard to your question.

--executor-cores means the number of worker threads per VM. According to your requirement this should be set to 8.

repartitionAndSortWithinPartitions is an RDD operation; RDD operations in Spark are less performant in terms of both execution and memory. I would rather use the Dataframe sort operation if performance is key.

Regards,
Keith.

http://keith-chapman.com

On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <supun.kamburugam...@gmail.com> wrote:

> Hi all,
>
> We are trying to measure the sorting performance of Spark. We have a 16
> node cluster with 48 cores and 256GB of RAM in each machine and a 10Gbps
> network.
>
> Let's say we are running with 128 parallel tasks and each partition
> generates about 1GB of data (128GB total).
>
> We are using the method repartitionAndSortWithinPartitions.
>
> A standalone cluster is used with the following configuration:
>
> SPARK_WORKER_CORES=1
> SPARK_WORKER_MEMORY=16G
> SPARK_WORKER_INSTANCES=8
>
> --executor-memory 16G --executor-cores 1 --num-executors 128
>
> I believe this sets up 128 executors to run the job, each having 16GB of
> memory, spread across 16 nodes with 8 threads on each node. This
> configuration runs very slowly. The program doesn't use disks to read or
> write data (the data is generated in-memory and we don't write to file
> after sorting).
>
> It seems that even though the data size is small, it uses the disk for
> the shuffle. We are not sure our configurations are optimal to achieve
> the best performance.
>
> Best,
> Supun..
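[A rough sketch of what Keith's suggestion (8 worker threads per VM rather than 8 single-threaded executors) could look like at submit time. The memory figure is an assumption scaled from the original 8 x 16G, and the jar name is hypothetical.]

```shell
# Sketch of the fatter-executor layout (figures are assumptions, not
# from the thread): one executor per node, 8 cores each, 16 nodes total.
# sort-benchmark.jar is a hypothetical application jar.
spark-submit \
  --executor-memory 128G \
  --executor-cores 8 \
  --num-executors 16 \
  sort-benchmark.jar
```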
Sorting tuples with byte key and byte value
Hi all,

We are trying to measure the sorting performance of Spark. We have a 16 node cluster with 48 cores and 256GB of RAM in each machine and a 10Gbps network.

Let's say we are running with 128 parallel tasks and each partition generates about 1GB of data (128GB total).

We are using the method repartitionAndSortWithinPartitions.

A standalone cluster is used with the following configuration:

SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=16G
SPARK_WORKER_INSTANCES=8

--executor-memory 16G --executor-cores 1 --num-executors 128

I believe this sets up 128 executors to run the job, each having 16GB of memory, spread across 16 nodes with 8 threads on each node. This configuration runs very slowly. The program doesn't use disks to read or write data (the data is generated in-memory and we don't write to file after sorting).

It seems that even though the data size is small, it uses the disk for the shuffle. We are not sure our configurations are optimal to achieve the best performance.

Best,
Supun..
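[For readers unfamiliar with the operation being benchmarked, here is a pure-Python sketch, not Spark code, of what repartitionAndSortWithinPartitions does semantically: each tuple is routed to a partition by key hash, then each partition is sorted by key locally, with no global ordering across partitions. Real Spark moves these tuples across the network during the shuffle.]

```python
# Pure-Python sketch of repartitionAndSortWithinPartitions semantics
# (illustrative only -- real Spark shuffles tuples between executors).
def repartition_and_sort(pairs, num_partitions):
    # Route each (key, value) tuple to a partition by key hash,
    # mirroring Spark's HashPartitioner.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    # Sort within each partition by key; there is no global sort
    # across partitions.
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions

# Byte keys and byte values, as in the benchmark described above.
data = [(b"d", b"1"), (b"a", b"2"), (b"c", b"3"), (b"b", b"4")]
parts = repartition_and_sort(data, 2)
```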