Hi,

We hit an issue while testing the scalability of a parallel merge sort that uses a reduction tree, on an array of size 1024^3. Currently, only the master opens the input file, parses it into an array using fscanf, and then distributes the array to the other processors. With 32 processors, reading the file took ~109 seconds; with 64 processors, it took ~216 seconds, even though only one processor (the master) reads the file regardless of the processor count. The input file is stored on a tmpfs and consists of 1024^3 + 1 numbers, where the first number is the array size.
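For clarity, the read-and-distribute pattern looks roughly like this (a minimal sketch, not the actual code from parallel_ms.c; the tmpfs path, variable names, and the even-chunk point-to-point distribution are illustrative assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, n = 0;
        int *data = NULL;
        FILE *fp = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            fp = fopen("/dev/shm/input.txt", "r"); /* illustrative tmpfs path */
            if (fp == NULL || fscanf(fp, "%d", &n) != 1)
                MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* every rank needs the size before posting its receive */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        int chunk = n / nprocs;            /* assumes nprocs divides n */
        int *local = malloc((size_t)chunk * sizeof(int));

        if (rank == 0) {
            data = malloc((size_t)n * sizeof(int));
            double t0 = MPI_Wtime();
            for (int i = 0; i < n; i++)
                if (fscanf(fp, "%d", &data[i]) != 1)
                    MPI_Abort(MPI_COMM_WORLD, 1);
            printf("read from file: %.2f s\n", MPI_Wtime() - t0);
            fclose(fp);
            /* while the fscanf loop above runs, every other rank is
               already blocked in the MPI_Recv below */
            for (int r = 1; r < nprocs; r++)
                MPI_Send(data + (size_t)r * chunk, chunk, MPI_INT, r, 0,
                         MPI_COMM_WORLD);
            memcpy(local, data, (size_t)chunk * sizeof(int));
        } else {
            MPI_Recv(local, chunk, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        /* local sort and reduction-tree merge would follow here */

        free(local);
        if (rank == 0)
            free(data);
        MPI_Finalize();
        return 0;
    }

The point is that the workers post their MPI_Recv almost immediately, so they sit in a blocking receive for the entire duration of the master's fscanf loop.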
For comparison, I also ran a plain C program that only reads the file; it took ~104 seconds. An MPI program that only reads the file took ~116 and ~118 seconds on 32 and 64 processors, respectively. The code is at https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf:

parallel_ms.c: https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-parallel_ms-c
mpi_just_read.c: https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-mpi_just_read-c
just_read.c: https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-just_read-c

Clearly, increasing the number of processors did not severely affect the elapsed time of mpi_just_read.c. For parallel_ms.c, is it possible that the 63 other processors, blocked in a receive from processor 0 (as in the sketch above), are somehow affecting the elapsed time of the file read?

Any assistance or clarification would be appreciated.

Ali
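P.S. For reference, mpi_just_read.c is essentially the following (again a minimal sketch rather than the exact gist contents); the relevant difference from parallel_ms.c is that no rank is blocked in MPI_Recv while rank 0 reads:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, n = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            FILE *fp = fopen("/dev/shm/input.txt", "r"); /* illustrative tmpfs path */
            if (fp == NULL || fscanf(fp, "%d", &n) != 1)
                MPI_Abort(MPI_COMM_WORLD, 1);
            int *data = malloc((size_t)n * sizeof(int));
            double t0 = MPI_Wtime();
            for (int i = 0; i < n; i++)
                if (fscanf(fp, "%d", &data[i]) != 1)
                    MPI_Abort(MPI_COMM_WORLD, 1);
            printf("read from file: %.2f s\n", MPI_Wtime() - t0);
            fclose(fp);
            free(data);
        }

        /* the non-root ranks post no receives; they go straight to
           MPI_Finalize while rank 0 reads */
        MPI_Finalize();
        return 0;
    }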
Attachments:
mpijr-vader-32.log
mpijr-vader-64.log
pms-tcp-32.log
pms-tcp-64.log
pms-vader-32.log
pms-vader-64.log