[julia-users] Re: Tips on parallel performance?
In case it helps someone else, the issue was just cache contention and not an issue with Julia. See https://github.com/JuliaLang/julia/issues/11354

On Tuesday, May 19, 2015 at 12:34:51 PM UTC-4, Jason Morton wrote:
> Done

On Tuesday, May 19, 2015 at 6:45:25 AM UTC-4, Viral Shah wrote:
> Please file an issue for this one.
> -viral

On Monday, May 18, 2015 at 11:58:06 PM UTC+5:30, Jason Morton wrote:
> [snip]
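Given that the culprit was cache contention, one mitigation is to shrink each task's cache footprint. A minimal sketch (not from the thread, and written against current Julia, where pmap lives in the Distributed stdlib): compute only the singular values with svdvals, skipping the U and V factors entirely; the worker count and matrix size here are illustrative.

```julia
using Distributed
addprocs(2)                      # worker count is an example value
@everywhere using LinearAlgebra

# Each task builds its matrix locally on the worker, so no large
# arrays are shipped between processes; svdvals avoids forming U and V.
vals = pmap(_ -> svdvals(rand(500, 500))[1], 1:2)   # largest singular values
```

Smaller working sets per task mean more simultaneous tasks fit in the shared last-level cache before the contention knee is hit.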
[julia-users] Tips on parallel performance?
Working with a 16-core / 32-thread machine with 32 GB RAM that presents to Ubuntu as 32 cores. I'm trying to understand how to get the best performance for embarrassingly parallel tasks, taking a bunch of SVDs in parallel as an example. The scaling seems to be perfect (6.6 seconds regardless of the number of SVDs) until about 7 or 8 simultaneous SVDs, at which point the time starts to creep up, scaling roughly linearly although with high variance, up to 22 seconds for 16 and 47 seconds for 31.

By watching htop I can confirm that the number of processors in use equals the number being pmapped over, so I don't think OpenBLAS multithreading is the issue. Memory usage stays low. Any guess as to what is going on? I'm using the generic Linux binary julia-79599ada44. I don't think there should be any sending of the matrices, but perhaps that is the issue.

Probably I am missing something obvious.

With nprocs = 16:

@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 22.350466328 seconds (12292776 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 91.135322511 seconds (10269056672 bytes allocated, 2.57% gc time)

With nprocs = 31:

# perfect scaling until here (at 6x speedup)
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:6])
elapsed time: 6.720786336 seconds (159168 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:6])
elapsed time: 34.146665292 seconds (3847940044 bytes allocated, 2.46% gc time)

# 4.5x speedup
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 19.819358972 seconds (391056 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 90.688842475 seconds (10260844684 bytes allocated, 2.36% gc time)

# 3.69x speedup
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:nprocs()])
elapsed time: 47.411315342 seconds (738616 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:nprocs()])
elapsed time: 175.308752879 seconds (19880206220 bytes allocated, 2.34% gc time)
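The benchmark above can be sketched in current Julia, where pmap now lives in the Distributed stdlib and svd in LinearAlgebra. The matrix size and worker count below are scaled down so it runs quickly, and svdvals stands in for the thread's svd(...)[2][1] (both yield the largest singular value, since 0.3's svd returned the (U, S, V) tuple).

```julia
using Distributed
addprocs(4)                      # worker count is an example value
@everywhere using LinearAlgebra

# Time n simultaneous SVD tasks; each worker generates its own matrix,
# so nothing large is sent between processes.
function timed_svds(n)
    @elapsed pmap(_ -> svdvals(rand(200, 200))[1], 1:n)
end

timed_svds(1)                    # warm-up call (absorbs compilation)
t1 = timed_svds(1)
t4 = timed_svds(4)
```

If per-core cache is to spare, t4 stays close to t1; once the combined working sets exceed the shared cache, the times start to grow roughly linearly, as observed in the thread.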
Re: [julia-users] Typeclass implementation
Also check out https://github.com/jasonmorton/Typeclass.jl, available from Pkg, which tries to do this.
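For readers unfamiliar with the idea, the typeclass pattern can also be emulated by hand with plain multiple dispatch. The sketch below is a generic illustration of that pattern, not Typeclass.jl's actual API (the package aims to automate something like this):

```julia
# The "class" is a pair of generic functions with no default instance:
mempty(::Type{T}) where {T} = error("no Monoid instance for $T")
mappend(x, y) = error("no Monoid instance for $(typeof(x))")

# "instance Monoid Int" (additive) as ordinary method definitions:
mempty(::Type{Int}) = 0
mappend(x::Int, y::Int) = x + y

# "instance Monoid String" (concatenation):
mempty(::Type{String}) = ""
mappend(x::String, y::String) = x * y

# A function written against the class interface works for any instance:
mconcat(::Type{T}, xs) where {T} = foldl(mappend, xs; init=mempty(T))
```

With these definitions, mconcat(Int, [1, 2, 3]) folds with + from 0, and mconcat(String, ["a", "b"]) folds with concatenation from the empty string.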