I made a plot that compares the time it takes to move an array with the one-sided methods we are using now versus with MPI.jl. It is here:
https://github.com/JuliaLang/julia/issues/9167#issuecomment-64721543

2015-01-07 1:54 GMT-05:00 Amuthan <apar...@gmail.com>:

> Amit: Thanks for the suggestion. I gave it a quick try, but wasn't successful. It appears to me that communication between the processors (to obtain the boundary data) would require reconstructing the DArray from the localparts at the end of each iteration. I guess I'll have to take a deeper look into the implementation of DArrays to understand how best to implement this.
>
> In the meantime, I got a reasonable speedup using the Julia wrapper for MPI (https://github.com/JuliaParallel/MPI.jl). Has anyone tried comparing the performance of the one-sided message-passing model of DArray and the standard (2-sided) MPI model?
>
> Amuthan
>
> On Mon, Jan 5, 2015 at 12:53 AM, Amit Murthy <amit.mur...@gmail.com> wrote:
>
>> You can have only two DArrays and use localpart() to get the local parts of the arrays on each worker and work off that.
>>
>> With a single iteration the network overhead will be much more than any gains from distributed computation - it depends on the computation, of course.
>>
>> Currently, DArrays work best if the distributed computation can work solely off localparts. An efficient means of setindex! on DArrays is a TODO at this time.
>>
>> On Mon, Jan 5, 2015 at 12:34 PM, Amuthan <apar...@gmail.com> wrote:
>>
>>> Hi Amit: yes, the idea is to have just two DArrays, one each for the previous and current iterations. I had some trouble assigning values directly to a DArray (a setindex! error) and so had to write it like this. Do you know of any way around this?
>>>
>>> Btw, the parallel code runs slower than the serial version even for just one iteration.
>>>
>>> On Sun, Jan 4, 2015 at 10:27 PM, Amit Murthy <amit.mur...@gmail.com> wrote:
>>>
>>>> As written, this is creating 1000 DArrays. I think you intended to have only 2 of them and swap values in each iteration?
>>>>
>>>> On Sunday, 4 January 2015 11:07:47 UTC+5:30, Amuthan A. Ramabathiran wrote:
>>>>>
>>>>> Hello: I recently started exploring the parallel capabilities of Julia and I need some help in understanding and improving the performance of a very elementary parallel code using DArrays (I use Julia version 0.4.0-dev+2431). The code pasted below (based essentially on plife.jl) solves u''(x) = 0, x \in [0, 1] with u(0) and u(1) specified, using the second-order central difference approximation. The parallel version of the code runs significantly slower than the serial version. It would be nice if someone could point out ways to improve this and/or suggest a more efficient alternative.
>>>>>
>>>>> function laplace_1D_serial(u::Array{Float64})
>>>>>     N = length(u) - 2
>>>>>     u_new = zeros(N)
>>>>>
>>>>>     for i = 1:N
>>>>>         u_new[i] = 0.5(u[i] + u[i + 2])
>>>>>     end
>>>>>
>>>>>     u_new
>>>>> end
>>>>>
>>>>> function serial_iterate(u::Array{Float64})
>>>>>     u_new = laplace_1D_serial(u)
>>>>>
>>>>>     for i = 1:length(u_new)
>>>>>         u[i + 1] = u_new[i]
>>>>>     end
>>>>> end
>>>>>
>>>>> function parallel_iterate(u::DArray)
>>>>>     DArray(size(u), procs(u)) do I
>>>>>         J = I[1]
>>>>>
>>>>>         if myid() == 2
>>>>>             local_array = zeros(length(J) + 1)
>>>>>             for i = J[1] : J[end] + 1
>>>>>                 local_array[i - J[1] + 1] = u[i]
>>>>>             end
>>>>>             append!([float(u[1])], laplace_1D_serial(local_array))
>>>>>
>>>>>         elseif myid() == length(procs(u)) + 1
>>>>>             local_array = zeros(length(J) + 1)
>>>>>             for i = J[1] - 1 : J[end]
>>>>>                 local_array[i - J[1] + 2] = u[i]
>>>>>             end
>>>>>             append!(laplace_1D_serial(local_array), [float(u[end])])
>>>>>
>>>>>         else
>>>>>             local_array = zeros(length(J) + 2)
>>>>>             for i = J[1] - 1 : J[end] + 1
>>>>>                 local_array[i - J[1] + 2] = u[i]
>>>>>             end
>>>>>             laplace_1D_serial(local_array)
>>>>>
>>>>>         end
>>>>>     end
>>>>> end
>>>>>
>>>>> A sample run on my laptop with 4 processors:
>>>>>
>>>>> julia> u = zeros(1000); u[end] = 1.0; u_distributed = distribute(u);
>>>>>
>>>>> julia> @time for i = 1:1000
>>>>>            serial_iterate(u)
>>>>>        end
>>>>> elapsed time: 0.011452192 seconds (8300112 bytes allocated)
>>>>>
>>>>> julia> @time for i = 1:1000
>>>>>            u_distributed = parallel_iterate(u_distributed)
>>>>>        end
>>>>> elapsed time: 4.461922218 seconds (190565036 bytes allocated, 10.17% gc time)
>>>>>
>>>>> Thanks for your help!
>>>>>
>>>>> Cheers,
>>>>> Amuthan
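For concreteness, here is a minimal sketch (not code from the thread) of what Amit's suggestion might look like: keep two DArrays and have each worker operate only on its localpart(), fetching one ghost value from each neighbour per iteration. It assumes the Base DArray of the 0.4-dev builds discussed above (later moved to DistributedArrays.jl), the 0.4-era remotecall signatures that take the worker id as the first argument, and that both DArrays share the same distribution; the names local_step!, chunk_step!, and parallel_iterate! are illustrative, not existing API.

    @everywhere function local_step!(u_new, u_old, left, right)
        # One Jacobi sweep over this worker's chunk. `left`/`right` are the ghost
        # values just outside the chunk; `nothing` marks a fixed global boundary.
        n = length(u_old)
        for i = 2:n-1
            u_new[i] = 0.5 * (u_old[i-1] + u_old[i+1])
        end
        u_new[1] = left  === nothing ? u_old[1] : 0.5 * (left + u_old[2])
        u_new[n] = right === nothing ? u_old[n] : 0.5 * (u_old[n-1] + right)
        return u_new
    end

    @everywhere function chunk_step!(u_new::DArray, u_old::DArray, k::Int, np::Int)
        # Runs on worker k: fetch one ghost value from each neighbouring chunk,
        # then update the localpart in place.
        ps = procs(u_old)
        gl = k == 1  ? nothing : remotecall_fetch(ps[k-1], d -> localpart(d)[end], u_old)
        gr = k == np ? nothing : remotecall_fetch(ps[k+1], d -> localpart(d)[1],  u_old)
        local_step!(localpart(u_new), localpart(u_old), gl, gr)
    end

    function parallel_iterate!(u_new::DArray, u_old::DArray)
        ps = procs(u_old)
        @sync for k = 1:length(ps)
            @async remotecall_wait(ps[k], chunk_step!, u_new, u_old, k, length(ps))
        end
        return u_new
    end

    # Usage: keep two distributed arrays and swap them between iterations.
    u  = zeros(1000); u[end] = 1.0
    u1 = distribute(u)
    u2 = distribute(copy(u))
    for i = 1:1000
        parallel_iterate!(u2, u1)
        u1, u2 = u2, u1
    end

In this sketch each iteration moves only a couple of Float64 ghost values per worker and constructs no new DArray, which is the point of Amit's comments above.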
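For the two-sided comparison Amuthan mentions, one iteration written against MPI.jl might look roughly like the following sketch. It assumes the positional MPI.Isend/MPI.Irecv!/MPI.Waitall! signatures of the MPI.jl releases of that era and a plain 1-D block decomposition with one chunk per rank; mpi_iterate! and the tag choices are illustrative, not part of MPI.jl.

    using MPI

    function mpi_iterate!(u_new::Vector{Float64}, u_old::Vector{Float64}, comm)
        rank  = MPI.Comm_rank(comm)
        nproc = MPI.Comm_size(comm)
        n     = length(u_old)

        # One-element buffers for the ghost values; explicit two-sided exchange.
        send_left, send_right = [u_old[1]], [u_old[n]]
        recv_left, recv_right = [0.0], [0.0]
        reqs = MPI.Request[]
        if rank > 0
            push!(reqs, MPI.Irecv!(recv_left, rank - 1, 0, comm))
            push!(reqs, MPI.Isend(send_left, rank - 1, 1, comm))
        end
        if rank < nproc - 1
            push!(reqs, MPI.Irecv!(recv_right, rank + 1, 1, comm))
            push!(reqs, MPI.Isend(send_right, rank + 1, 0, comm))
        end
        MPI.Waitall!(reqs)

        # Local Jacobi sweep; the global boundary points stay fixed.
        for i = 2:n-1
            u_new[i] = 0.5 * (u_old[i-1] + u_old[i+1])
        end
        u_new[1] = rank > 0         ? 0.5 * (recv_left[1] + u_old[2])    : u_old[1]
        u_new[n] = rank < nproc - 1 ? 0.5 * (u_old[n-1] + recv_right[1]) : u_old[n]
        return u_new
    end

    # Usage (run under mpirun): each rank owns one chunk of the 1000-point grid.
    MPI.Init()
    comm = MPI.COMM_WORLD
    # ... set up this rank's local chunks u1, u2, then:
    # for i = 1:1000
    #     mpi_iterate!(u2, u1, comm)
    #     u1, u2 = u2, u1
    # end
    MPI.Finalize()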