On 04/02/2015 14:29, Michael Black wrote:
Hi Mike,
> I don't think you'll find any gain using FFTW openmp.  WSJT-X does not do
> big enough FFTs to overtake the thread create/delete overhead.
That's not so, Mike. Joe has already determined that FFTW3, given roughly
2 threads, shows a small performance gain on the larger FFTs in the decoders.

There may be some confusion here. The OpenMP version of FFTW3 is being
discussed as an alternative to the native/pthreads version. Both are
multi-threaded and have similar performance. The OpenMP version has the
benefit that it is aware of the threads being used elsewhere in the
application and therefore plays well with OpenMP's dynamic thread count
algorithm. This is currently not relevant to us, as we are simply
dividing the work in half and running the two halves (JT65 decode and
JT9 decode) in parallel. The thread allocation for that is trivial,
i.e. 2 if there are at least 2 CPUs available. We also have direct
control of the number of threads FFTW3 uses, so we can allocate any
spare CPUs, above the two used for the decoder threads, to the larger
FFT plans.
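
As a rough sketch of that allocation rule (the function name and the
single-CPU fallback are assumptions for illustration, not code from
WSJT-X):

```python
def allocate_threads(cpus):
    """Split the available CPUs between decoder threads and FFTW3 threads.

    Returns (decoder_threads, fft_threads).
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    # Two decoder threads (JT65 and JT9) if at least 2 CPUs are available,
    # otherwise fall back to one (assumed fallback).
    decoder_threads = 2 if cpus >= 2 else 1
    # Any spare CPUs go to the larger FFT plans; FFTW3 always needs at
    # least one thread.
    fft_threads = max(1, cpus - decoder_threads)
    return decoder_threads, fft_threads


for cpus in (1, 2, 4):
    print(cpus, allocate_threads(cpus))
```

In a real build the second number would be what gets handed to FFTW3
before the large plans are created.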
>
> I worked a job a few years ago on a 512-core machine doing FFTs on synthetic
> aperture radar systems.  Using FFTW with OpenMP did very little good.  Using
> OpenMP at the layer above did...which is the same thing I think we'll find
> here.
> OpenMP inside FFTW for small FFTs will have the overhead dominate and defeat
> it.
wsjtx/jt9 use a number of different FFT sizes, and currently the '-m #'
argument is being used as the thread count for all of them. We probably
need to use more than one thread only for the large FFTs. You are
correct that thread synchronization overhead is a high proportion of
the cost for small FFTs, but FFTW3 does address this internally by only
using multiple threads on FFTs larger than ~2^11.
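
An application-side version of the same idea might look like the sketch
below. The function name, its parameters and the exact cut-off are
assumptions for illustration; 2^11 is only the approximate figure
mentioned above, and FFTW3 makes its own decision internally.

```python
# Approximate size above which multiple threads start to pay off
# (assumed value, taken from the ~2^11 figure quoted above).
LARGE_FFT_THRESHOLD = 2 ** 11


def threads_for_fft(fft_size, requested_threads,
                    threshold=LARGE_FFT_THRESHOLD):
    """Return the thread count to use when planning an FFT of fft_size."""
    if requested_threads < 1:
        raise ValueError("requested_threads must be at least 1")
    # Small FFTs: synchronization overhead dominates, so stay single-threaded.
    if fft_size <= threshold:
        return 1
    # Large FFTs: use whatever the user asked for (e.g. via '-m #').
    return requested_threads


for size in (512, 2 ** 11, 2 ** 16):
    print(size, threads_for_fft(size, 4))
```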
>
> We're already seeing only a 20-25% improvement in openmp at this level which
> is a clear indication to me that we're not getting anywhere near 100% gain
> for threading so doing it at a lower level isn't worth it.
That is comparing Apples and Screwdrivers ;) The threading strategy for
the decoders is one task per thread, whereas the FFT strategy is a true
divide and conquer algorithm with a recursive distribution to threads.
They are both able to deliver performance improvements in the same
application given enough CPUs to run on (2 for decoders + ~2 for FFTs
has been shown to be optimal). Note that there is absolutely no
threading contention or overhead between the FFTs and the decoders, even
though the latter use the former. So, given that the average low end PC
these days is usually at least a dual core hyper-threaded Intel
processor or equivalent, we can assume that 4 CPUs are available.
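
To make the "one task per thread" shape concrete, here is a minimal
sketch. The two decode functions are hypothetical stand-ins, not the
real decoders, and in CPython pure Python code will not actually run on
two CPUs at once, so this only shows the structure rather than the
speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_jt65(samples):
    # Hypothetical stand-in for the JT65 decoder.
    return ("JT65", len(samples))


def decode_jt9(samples):
    # Hypothetical stand-in for the JT9 decoder.
    return ("JT9", len(samples))


def decode_both(samples):
    """Run the two decoders side by side, one task per thread."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        jt65 = pool.submit(decode_jt65, samples)
        jt9 = pool.submit(decode_jt9, samples)
        # Wait for both halves of the work to finish.
        return jt65.result(), jt9.result()


print(decode_both([0.0] * 8))
```

Each decoder would then make its own FFT calls, and it is inside those
calls that the divide and conquer threading happens.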

Not achieving 100% improvement at this stage from parallel decoding is
likely to be due to overheads that we can and should address, like not
having the correct granularity on locks and being too pessimistic about
data sharing controls. The FFTW3 concurrency is in and working, but the
direct use of OpenMP for parallel decoding is yet to be fully implemented.
> When I did my 512-core system I was getting over 90% gain for each thread I
> added.
> Much like "don't sweat the small stuff" I think you'll find "don't
> multi-thread the small FFTs" is a good paradigm...
> When you got a ~50% gain then it's time to look at multi-threading below
> that level.
OK, but the FFTW3 threading is almost free in terms of complexity. The
FFTW developers have done all the hard stuff; we just need to turn it
on. That means even quite small gains are cost effective. OTOH the
direct use of OpenMP in the decoder is adding a lot of complexity, since
we have to design and implement or eliminate the data sharing controls.
The potential gain is large, so it is probably worth the cost in
development effort and complexity.
>
> Mike W9MDB
73
Bill
G4WJS.
>
>
> -----Original Message-----
> From: Bill Somerville [mailto:[email protected]]
> Sent: Wednesday, February 04, 2015 8:21 AM
> To: [email protected]
> Subject: Re: [wsjt-devel] v4926 OpenMP
>
> On 04/02/2015 14:05, John Nelson wrote:
>> Hi Bill and Joe,
> Hi John,
>> With regard to Mac builds, your [Bill] code test with workspace and
>> workspace_mt executes correctly with my gfortran compiler.   However, as you
>> point out the current clang/clang++ do not [yet] have OpenMP support.
>> So when I compile fftw_3.3.4 with --enable-threads, I cannot also use
>> --with-openmp.  I also get:
>> -- Try OpenMP C flag = [ ]
>> -- Performing Test OpenMP_FLAG_DETECTED
>> -- Performing Test OpenMP_FLAG_DETECTED - Failed
> I am experimenting with the MacPorts gcc 4.9 suite with building WSJT-X.
> That needs changes to the CMake script which I have not committed yet.
> So far it doesn't seem to be necessary to build or use the OpenMP version of
> FFTW3, the native/pthreads version is working well and seems to be
> compatible with an OpenMP program. I believe the only issue is that we need
> to control the number of threads used by FFTW3 and OpenMP manually to a
> certain extent. If it does become necessary to use the OpenMP version of
> FFTW3, that can be built on Mac, again I have the MacPorts version
> available.
>
> There also appears to be a bug in CMake that is causing it not to pass on
> the portability options to the gcc compilers/linker (CMAKE_OSX_SYSROOT and
> CMAKE_OSX_DEPLOYMENT_TARGET). This is not serious and can be worked around if
> necessary but I want to get it sorted out properly if possible.
>
> My current focus apart from v1.4 issues is to help Joe with multi-threading
> hazards in jt9 but I am working on the Mac builds with OpenMP as well.
>> when building WSJT-X r4928 which is currently executing successfully - and
>> certainly decodes rapidly.
> You are getting the latest performance increases which are significant.
> The OpenMP jt9, which is not in WSJT-X yet, has the potential to almost halve
> decoding times in dual JT65+JT9 mode when there is equivalent work to be
> done in each mode.
>> --- John G4KLA
> 73
> Bill
> G4WJS.
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming. The Go Parallel Website,
> sponsored by Intel and developed in partnership with Slashdot Media, is your
> hub for all things parallel software development, from weekly thought
> leadership blogs to news, videos, case studies, tutorials and more. Take a
> look and join the conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> wsjt-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/wsjt-devel

