[julia-users] Re: Tips on parallel performance?

2015-05-20 Thread Jason Morton
In case it helps someone else: the problem turned out to be cache contention, not an issue 
with Julia.  See https://github.com/JuliaLang/julia/issues/11354
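
A quick way to check for that kind of contention, as a rough sketch rather than anything 
run in this thread: compare how small-matrix and large-matrix jobs scale as more workers 
are used.  The snippet below uses the Julia 0.3-era syntax of the code later in the thread; 
the 100x100 size and the job counts are illustrative assumptions.

# Hypothetical check: compare 6-way and 16-way runs for a cache-resident
# matrix size (100x100) versus the 1000x1000 case.  If only the large case
# stops scaling, the slowdown is data-size dependent (cache / memory-bandwidth
# contention) rather than pmap scheduling overhead.
for n in (6, 16), sz in (100, 1000)
    t = @elapsed pmap(x->[svd(rand(sz,sz))[2][1] for i in 1:10], [i for i in 1:n])
    println("jobs = $n, size = $sz: $t seconds")
end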

On Tuesday, May 19, 2015 at 12:34:51 PM UTC-4, Jason Morton wrote:
>
> Done
>
> On Tuesday, May 19, 2015 at 6:45:25 AM UTC-4, Viral Shah wrote:
>>
>> Please file an issue for this one.
>>
>> -viral
>>

[julia-users] Re: Tips on parallel performance?

2015-05-19 Thread Jason Morton
Done

On Tuesday, May 19, 2015 at 6:45:25 AM UTC-4, Viral Shah wrote:
>
> Please file an issue for this one.
>
> -viral
>

[julia-users] Tips on parallel performance?

2015-05-18 Thread Jason Morton
Working with a 16-core / 32-thread machine with 32 GB of RAM that presents to Ubuntu as 
32 cores.  I'm trying to understand how to get the best performance for embarrassingly 
parallel tasks, taking a batch of SVDs in parallel as an example.  The scaling seems 
perfect (6.6 seconds regardless of the number of SVDs) until about 7 or 8 simultaneous 
SVDs, at which point the time starts to creep up, growing roughly linearly (though with 
high variance) to 22 seconds for 16 and 47 seconds for 31.

Watching htop, I can confirm that the number of processors in use seems to equal the 
number being pmapped over, so I don't think OpenBLAS multithreading is the issue.  Memory 
usage stays low.  Any guess as to what is going on?  I'm using the generic Linux binary 
julia-79599ada44.  I don't think there should be any sending of the matrices, but perhaps 
that is the issue.
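
For what it's worth, one way to rule out OpenBLAS multithreading explicitly, rather than 
inferring it from htop, is to force every process down to a single BLAS thread before 
timing.  A minimal sketch, assuming the 0.3-era API (blas_set_num_threads; later Julia 
versions spell this LinearAlgebra.BLAS.set_num_threads) and an illustrative worker count:

addprocs(15)                         # 1 master + 15 workers, so nprocs() == 16
@everywhere blas_set_num_threads(1)  # one BLAS thread per process
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
# unchanged timings here would mean BLAS threading is not the bottleneck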

Probably I am missing something obvious.

# with nprocs = 16
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 22.350466328 seconds (12292776 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 91.135322511 seconds (10269056672 bytes allocated, 2.57% gc time)

# with nprocs = 31
# perfect scaling until here (at 6x speedup)
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:6])
elapsed time: 6.720786336 seconds (159168 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:6])
elapsed time: 34.146665292 seconds (3847940044 bytes allocated, 2.46% gc time)

# 4.5x speedup
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 19.819358972 seconds (391056 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
elapsed time: 90.688842475 seconds (10260844684 bytes allocated, 2.36% gc time)

# 3.69x speedup
@time pmap(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:nprocs()])
elapsed time: 47.411315342 seconds (738616 bytes allocated)
@time map(x->[svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:nprocs()])
elapsed time: 175.308752879 seconds (19880206220 bytes allocated, 2.34% gc time)
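
For reference, the speedup labels above appear to be the serial map time divided by the 
pmap time for the same workload:

90.688842475 / 19.819358972    # ≈ 4.58, the "# 4.5x speedup" case (16 jobs)
175.308752879 / 47.411315342   # ≈ 3.70, the "# 3.69x speedup" case (31 jobs)

That is well short of ideal one-job-per-worker scaling, which fits the cache contention 
reported in the follow-up rather than pmap overhead.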






Re: [julia-users] Typeclass implementation

2014-11-21 Thread Jason Morton
Also check out https://github.com/jasonmorton/Typeclass.jl (available from Pkg), which 
tries to do this.