Re: [wsjt-devel] WSJT-X Decoder Performance
I sent this in via HTML and it got blocked...so here it is plain text...
Mike W9MDB

Did a 20-pass run on the last two versions of interest - I have a dual 4-core CPU so apparently would have 8 threads available on 4928, cut to 2 threads on 4930.

So an ever so slight improvement with 1 thread. 2 threads got worse, though they were already worse to start with.

Col1 = TimeMem-1.0.exe jt9_omp -p 1 -d 3 -w 2 -m 1 130610_2343.wav | grep Elapsed | cut -f2 -d:
Col2 = TimeMem-1.0.exe jt9_omp -p 1 -d 3 -w 2 -m 2 130610_2343.wav | grep Elapsed | cut -f2 -d:

           4928                        4930
Thread     1       2       %Diff       1       2       %Diff
           1.13    1.14    -0.88%      1.1     1.28    -16.36%
           1.08    1.13    -4.63%      1.08    1.21    -12.04%
           1.1     1.2     -9.09%      1.08    1.2     -11.11%
           1.1     1.13    -2.73%      1.1     1.22    -10.91%
           1.19    1.19    0.00%       1.08    1.23    -13.89%
           1.12    1.1     1.79%       1.08    1.22    -12.96%
           1.1     1.14    -3.64%      1.09    1.22    -11.93%
           1.08    1.13    -4.63%      1.07    1.23    -14.95%
           1.11    1.18    -6.31%      1.09    1.23    -12.84%
           1.1     1.12    -1.82%      1.13    1.22    -7.96%
           1.09    1.12    -2.75%      1.09    1.22    -11.93%
           1.1     1.13    -2.73%      1.08    1.22    -12.96%
           1.09    1.18    -8.26%      1.08    1.25    -15.74%
           1.1     1.12    -1.82%      1.08    1.22    -12.96%
           1.09    1.17    -7.34%      1.11    1.24    -11.71%
           1.08    1.2     -11.11%     1.09    1.22    -11.93%
           1.09    1.31    -20.18%     1.11    1.24    -11.71%
           1.11    1.23    -10.81%     1.1     1.22    -10.91%
           1.1     1.24    -12.73%     1.08    1.2     -11.11%
           1.09    1.19    -9.17%      1.09    1.2     -10.09%
Average    1.1025  1.1675  -5.94%      1.0905  1.2245  -12.30%

-----Original Message-----
From: Bill Somerville [mailto:g4...@classdesign.com]
Sent: Wednesday, February 04, 2015 9:31 AM
To: wsjt-devel@lists.sourceforge.net
Subject: Re: [wsjt-devel] WSJT-X Decoder Performance

On 04/02/2015 15:27, Joe Taylor wrote:
> Hi Bill,

Hi Joe,

> OK, by all means go ahead.
>
> BTW: I notice that jt9_omp.exe r4929 always runs with 4 threads on my 4-core machine. Since we have only two tasks running in parallel, I can see little reason to use more than 2 threads. Should we specify two threads explicitly?

Yes, I have addressed that as well.

> -- Joe

73
Bill
G4WJS.

--
Dive into the World of Parallel Programming. The Go Parallel Website, sponsored by Intel and developed in partnership with Slashdot Media, is your hub for all things parallel software development, from weekly thought leadership blogs to news, videos, case studies, tutorials and more. Take a look and join the conversation now. http://goparallel.sourceforge.net/
___
wsjt-devel mailing list
wsjt-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/wsjt-devel
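The %Diff column in the timing tables above appears to be the 2-thread time measured against the 1-thread time, with a negative value meaning the 2-thread run was slower. That formula is inferred from the rows (for example 1.13 and 1.14 give -0.88%), so treat it as an assumption rather than Mike's actual spreadsheet formula. A minimal C sketch, with hypothetical function names:

```c
#include <assert.h>
#include <stdio.h>

/* Percentage difference of the 2-thread time relative to the 1-thread time.
 * Negative means the 2-thread run was slower. The formula is inferred from
 * the table rows (an assumption), not taken from the original spreadsheet. */
double pct_diff(double t1, double t2)
{
    assert(t1 > 0.0);                 /* elapsed times must be positive */
    return (t1 - t2) / t1 * 100.0;
}

/* Print one row in the same column order as the tables: 1, 2, %Diff. */
void print_row(double t1, double t2)
{
    printf("%.2f  %.2f  %.2f%%\n", t1, t2, pct_diff(t1, t2));
}
```

Note that the "Average" %Diff values in the tables do not match this formula applied to the averaged times exactly, so they were probably computed as the mean of the per-row percentages.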
Re: [wsjt-devel] WSJT-X Decoder Performance
Doing the same 20-pass run on my Windows 10 HP Envy with an i7-4702MQ @ 2.2 GHz (compare to the dual X5450 4-core CPU at 3 GHz) -- you can see GHz doesn't tell the whole story...

With 2 threads on Windows 10 I see a long run once in a great while.

           4930
Thread     1       2       %Diff
           0.79    0.82    -3.80%
           0.78    0.79    -1.28%
           0.78    0.79    -1.28%
           0.75    0.78    -4.00%
           0.78    0.78    0.00%
           0.77    0.8     -3.90%
           0.78    0.78    0.00%
           0.77    0.79    -2.60%
           0.78    0.79    -1.28%
           0.79    0.82    -3.80%
           0.8     0.78    2.50%
           0.76    0.77    -1.32%
           0.77    0.8     -3.90%
           0.78    0.83    -6.41%
           0.8     0.82    -2.50%
           0.78    0.79    -1.28%
           0.78    0.78    0.00%
           0.78    0.78    0.00%
           0.77    0.78    -1.30%
           0.77    0.78    -1.30%
Average    0.778   0.7925  -1.87%
Re: [wsjt-devel] WSJT-X Decoder Performance
I removed all the flush(6) calls except the one in decoder.f90. There was an unprotected one in jt9c.f90, which may explain the long runtimes I see once in a great while on my Windows 10 system. The last long runtime was 7 seconds using 2 threads, before I removed the flushes.

I am now running a loop test to see if any long run times are seen on both my computers.

Mike W9MDB

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        /* Time one 2-thread decode and save the elapsed seconds to doit.txt */
        char *cmd = "TimeMem-1.0.exe jt9_omp -p 1 -d 3 -w 2 -m 2 130610_2343.wav | grep Elapsed | cut -f2 -d: > doit.txt";
        double total = 0;
        int n = 0;
        char buf[4096];
        while (1) {
            system(cmd);
            FILE *fp = fopen("doit.txt", "r");
            if (fp == NULL || fgets(buf, sizeof(buf), fp) == NULL) {
                if (fp) fclose(fp);
                continue;               /* no timing captured; try again */
            }
            fclose(fp);
            double sec = atof(buf);
            ++n;
            total += sec;
            double avg = total / n;
            if (sec > avg * 1.5) {      /* flag runs 50% above the running average */
                printf("long run %.2f avg=%.2f\n", sec, avg);
            }
            printf("%d\r", n);          /* progress counter */
            fflush(stdout);
        }
    }

Looking at how the output comes out of jt9_omp, it would appear to me these flushes are not necessary, as each line seems to be flushed anyway.

Not really any change in the timing.

Mike W9MDB

           4930                        4930 !flush
Thread     1       2       %Diff       1       2       %Diff
           1.1     1.28    -16.36%     1.11    1.21    -9.01%
           1.08    1.21    -12.04%     1.11    1.22    -9.91%
           1.08    1.2     -11.11%     1.09    1.22    -11.93%
           1.1     1.22    -10.91%     1.07    1.22    -14.02%
           1.08    1.23    -13.89%     1.07    1.23    -14.95%
           1.08    1.22    -12.96%     1.14    1.22    -7.02%
           1.09    1.22    -11.93%     1.08    1.23    -13.89%
           1.07    1.23    -14.95%     1.08    1.24    -14.81%
           1.09    1.23    -12.84%     1.09    1.25    -14.68%
           1.13    1.22    -7.96%      1.1     1.26    -14.55%
           1.09    1.22    -11.93%     1.09    1.26    -15.60%
           1.08    1.22    -12.96%     1.08    1.26    -16.67%
           1.08    1.25    -15.74%     1.09    1.21    -11.01%
           1.08    1.22    -12.96%     1.09    1.23    -12.84%
           1.11    1.24    -11.71%     1.06    1.26    -18.87%
           1.09    1.22    -11.93%     1.09    1.24    -13.76%
           1.11    1.24    -11.71%     1.07    1.23    -14.95%
           1.1     1.22    -10.91%     1.08    1.24    -14.81%
           1.08    1.2     -11.11%     1.07    1.22    -14.02%
           1.09    1.2     -10.09%     1.08    1.21    -12.04%
Avg        1.0905  1.2245  -12.30%     1.087   1.233   -13.47%
Re: [wsjt-devel] WSJT-X Decoder Performance
Doing some testing on the 4928 jt9_omp on my Windows 10 box using a command-line test. I'm getting periodic long runs of 40-100 seconds...kind of like it's running wisdom again or some such. There are a lot more page faults when that happens too. I haven't seen this behavior on Windows 7 yet.

Mike W9MDB
Re: [wsjt-devel] WSJT-X Decoder Performance
Did a 20-pass run on the last two versions of interest - I have a dual 4-core CPU so apparently would have 8 threads available on 4928, cut to 2 threads on 4930.

So an ever so slight improvement with 1 thread. 2 threads got worse, though they were already worse to start with.

Col1 = TimeMem-1.0.exe jt9_omp -p 1 -d 3 -w 2 -m 1 130610_2343.wav | grep Elapsed | cut -f2 -d:
Col2 = TimeMem-1.0.exe jt9_omp -p 1 -d 3 -w 2 -m 2 130610_2343.wav | grep Elapsed | cut -f2 -d:

           4928                        4930
Threads    1       2       %Diff       1       2       %Diff
           1.13    1.14    -0.88%      1.1     1.28    -16.36%
           1.08    1.13    -4.63%      1.08    1.21    -12.04%
           1.1     1.2     -9.09%      1.08    1.2     -11.11%
           1.1     1.13    -2.73%      1.1     1.22    -10.91%
           1.19    1.19    0.00%       1.08    1.23    -13.89%
           1.12    1.1     1.79%       1.08    1.22    -12.96%
           1.1     1.14    -3.64%      1.09    1.22    -11.93%
           1.08    1.13    -4.63%      1.07    1.23    -14.95%
           1.11    1.18    -6.31%      1.09    1.23    -12.84%
           1.1     1.12    -1.82%      1.13    1.22    -7.96%
           1.09    1.12    -2.75%      1.09    1.22    -11.93%
           1.1     1.13    -2.73%      1.08    1.22    -12.96%
           1.09    1.18    -8.26%      1.08    1.25    -15.74%
           1.1     1.12    -1.82%      1.08    1.22    -12.96%
           1.09    1.17    -7.34%      1.11    1.24    -11.71%
           1.08    1.2     -11.11%     1.09    1.22    -11.93%
           1.09    1.31    -20.18%     1.11    1.24    -11.71%
           1.11    1.23    -10.81%     1.1     1.22    -10.91%
           1.1     1.24    -12.73%     1.08    1.2     -11.11%
           1.09    1.19    -9.17%      1.09    1.2     -10.09%
Average    1.1025  1.1675  -5.94%      1.0905  1.2245  -12.30%
Re: [wsjt-devel] WSJT-X Decoder Performance
On 02/04/2015 08:42 AM, Claude Frantz wrote:
> Please see here the result I have got with SVNVERSION 4928.

I'm very sorry, I have used the wrong executables. Here is the right output now:

$ time ./jt9 -p 1 -d 3 -w 2 -m 1 /home/claude/.wsjtx/bin/save/samples/130610_2343.wav
2343  -9  0.3 3196 @ WB8QPG IZ0MIT -11
2343 -18  1.0 3372 @ KK4HEG KE0CO CN87
2343  14  0.1 3490 @ CQ AG4M EM75
2343 -20 -1.3 3567 @ CQ TA4A KM37
2343 -15  0.1 3627 @ CT1FBK IK5YZT R+02
2343 -23  0.3 3721 @ KF5SLN KB1SUA FN42
2343 -16  0.2 3774 @ CQ M0ABA JO01
2343  -2  0.2 3843 @ EI3HGB DD2EE JO31
2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
2343  -7  0.3  815 # KK4DSD W7VP -16
2343 -10  0.5  975 # CQ DL7ACA JO40
2343  -9  0.8 1089 # N2SU W0JMW R-14
2343 -11  0.8 1259 # YV6BFE F6GUU R-08
2343  -9  1.7 1471 # VA3UG F1HMR 73
2343  -1  0.6 1718 # BG THX JOE 73
2343 -15  1.3 1951 # RA3Y VE3NLS 73
2343 -20  0.4 2065 # K2OI AJ4UU R-20
DecodeFinished 0 1

real    0m2.407s
user    0m2.324s
sys     0m0.073s

$ time ./jt9_omp -p 1 -d 3 -w 2 -m 1 /home/claude/.wsjtx/bin/save/samples/130610_2343.wav
2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
2343  -9  0.3 3196 @ WB8QPG IZ0MIT -11
2343  -7  0.3  815 # KK4DSD W7VP -16
2343 -18  1.0 3372 @ KK4HEG KE0CO CN87
2343 -10  0.5  975 # CQ DL7ACA JO40
2343  -9  0.8 1089 # N2SU W0JMW R-14
2343 -11  0.8 1259 # YV6BFE F6GUU R-08
2343  14  0.1 3490 @ CQ AG4M EM75
2343 -20 -1.3 3567 @ CQ TA4A KM37
2343  -9  1.7 1471 # VA3UG F1HMR 73
2343 -15  0.1 3627 @ CT1FBK IK5YZT R+02
2343 -23  0.3 3721 @ KF5SLN KB1SUA FN42
2343 -16  0.2 3774 @ CQ M0ABA JO01
2343  -2  0.2 3843 @ EI3HGB DD2EE JO31
2343  -1  0.6 1718 # BG THX JOE 73
2343 -15  1.3 1951 # RA3Y VE3NLS 73
2343 -20  0.4 2065 # K2OI AJ4UU R-20
DecodeFinished 0 1

real    0m1.663s
user    0m2.502s
sys     0m0.090s
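One way to read the `time` figures in Claude's report: dividing CPU time (user + sys) by wall-clock time (real) gives the average number of cores kept busy. With the numbers above that is roughly 1.00 for jt9 and roughly 1.56 for jt9_omp, while the wall-clock ratio 2.407 / 1.663 is about 1.45. A small C sketch of that arithmetic follows; the function name is hypothetical and this is only an illustration, not part of WSJT-X:

```c
#include <assert.h>

/* Average number of busy cores during a run: CPU time (user + sys)
 * divided by wall-clock time (real), all in seconds. A value near 1
 * suggests serial execution; a value above 1 suggests that some work
 * ran concurrently. */
double busy_cores(double real, double user, double sys)
{
    assert(real > 0.0);               /* wall-clock time must be positive */
    return (user + sys) / real;
}
```

For example, `busy_cores(1.663, 2.502, 0.090)` corresponds to the jt9_omp run and `busy_cores(2.407, 2.324, 0.073)` to the jt9 run.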
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Claude,

Thanks for your timing report. Your first test may have used the correct executables, too.

To get a good test you must run a configuration at least twice. In the first run, the program accumulates wisdom about the best way to configure the FFT calculations. This wisdom is saved and used for subsequent runs. If you change the -w # or -m # parameters, new wisdom will need to be accumulated.

-- 73, Joe, K1JT
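Joe's advice to run each configuration at least twice amounts to discarding warm-up runs before averaging. A minimal C sketch of that bookkeeping is below; the function name and the idea of a configurable warm-up count are assumptions for illustration, and the timings would come from whatever measurement tool is in use:

```c
#include <assert.h>
#include <stddef.h>

/* Mean of the timings in secs[0..n-1] after skipping the first `warmup`
 * entries, which are assumed to include one-time setup cost such as
 * FFT wisdom accumulation. */
double mean_after_warmup(const double *secs, size_t n, size_t warmup)
{
    assert(n > warmup);               /* need at least one measured run */
    double total = 0.0;
    for (size_t i = warmup; i < n; ++i) {
        total += secs[i];
    }
    return total / (double)(n - warmup);
}
```

With a single warm-up run, a slow first pass no longer skews the reported average. The warm-up has to be repeated whenever the -w or -m settings change.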
Re: [wsjt-devel] WSJT-X Decoder Performance
On 04/02/2015 13:37, Joe Taylor wrote:
> Hi Claude,

Hi Claude & Joe,

> Thanks for your timing report. Your first test may have used the correct executables, too.
>
> To get a good test you must run a configuration at least twice. In the first run, the program accumulates wisdom about the best way to configure the FFT calculations. This wisdom is saved and used for subsequent runs. If you change the -w # or -m # parameters, new wisdom will need to be accumulated.

That is also the case if the number of threads (the '-m #' option) is changed.

> -- 73, Joe, K1JT

73
Bill
G4WJS.
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Bill and all,

Tests here suggest that r4929 produces a Windows jt9_omp.exe that runs correctly. At least, it runs to completion on my sequence of 25 test files -- which r4928 does not.

Timing results on a 4-core Win7 machine:

Params        jt9       jt9_omp
--------------------------------
-w 2 -m 1     25.5 s    21.1 s
-w 2 -m 2     24.9      21.0

When using OpenMP to run JT9 and JT65 decoders in parallel, we gain almost nothing by using multi-threading for the FFTW plans.

Note that decoder.f90 now decodes the two modes in parallel sections *ONLY* if txmode is JT9. I will fix this.

I may also look for additional places where concurrent processing could help performance... but I don't consider this a very high priority.

-- Joe
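For readers unfamiliar with the construct Joe mentions, here is a rough C analogue of running two independent decoders as OpenMP parallel sections. The actual code is Fortran in decoder.f90 and is not reproduced here; the two stub functions below are hypothetical stand-ins that just return a number, so this is a sketch of the pattern only:

```c
#include <assert.h>

/* Hypothetical stand-ins for the two decoders. The real JT9 and JT65
 * decoders are Fortran routines; these stubs only return a fake count. */
static int decode_jt9_stub(int nsamples)
{
    assert(nsamples >= 0);
    return nsamples / 2;
}

static int decode_jt65_stub(int nsamples)
{
    assert(nsamples >= 0);
    return nsamples / 3;
}

/* Run both stubs as independent OpenMP sections. Each section is executed
 * once by one thread of the team. When compiled without -fopenmp the
 * pragmas are ignored and the two calls simply run one after the other. */
void decode_both(int nsamples, int *n9, int *n65)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        *n9 = decode_jt9_stub(nsamples);

        #pragma omp section
        *n65 = decode_jt65_stub(nsamples);
    }
}
```

Because the two sections write to different locations, no locking is needed in this sketch. Shared output, such as printing decoded lines, would need protection, which is presumably why the order of decoded lines differs between jt9 and jt9_omp.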
Re: [wsjt-devel] WSJT-X Decoder Performance
On 04/02/2015 15:21, Joe Taylor wrote:
> Hi Bill and all,

Hi Joe,

<snip>

> Note that decoder.f90 now decodes the two modes in parallel sections *ONLY* if txmode is JT9. I will fix this.

Joe, I already have this in hand, I can check it in if you wish.

<snip>

> -- Joe

73
Bill
G4WJS.
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Bill,

OK, by all means go ahead.

BTW: I notice that jt9_omp.exe r4929 always runs with 4 threads on my 4-core machine. Since we have only two tasks running in parallel, I can see little reason to use more than 2 threads. Should we specify two threads explicitly?

-- Joe
Re: [wsjt-devel] WSJT-X Decoder Performance
On 04/02/2015 15:27, Joe Taylor wrote:
> Hi Bill,

Hi Joe,

> OK, by all means go ahead.
>
> BTW: I notice that jt9_omp.exe r4929 always runs with 4 threads on my 4-core machine. Since we have only two tasks running in parallel, I can see little reason to use more than 2 threads. Should we specify two threads explicitly?

Yes, I have addressed that as well.

> -- Joe

73
Bill
G4WJS.
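The thread does not show how the thread count was limited in the Fortran source. As one possible illustration in C, OpenMP's `num_threads` clause caps the team size for a single parallel region. The helper below is hypothetical; it counts how many threads actually took part by summing one per thread:

```c
#include <assert.h>

/* Count how many threads execute a parallel region when at most `limit`
 * threads are requested via the num_threads clause. Each participating
 * thread adds one to a reduction variable. Without -fopenmp the pragma
 * is ignored and the block runs once, so the result is 1. */
int threads_used(int limit)
{
    assert(limit >= 1);
    int count = 0;
    #pragma omp parallel num_threads(limit) reduction(+:count)
    {
        count += 1;                   /* one contribution per thread */
    }
    return count;
}
```

The runtime may still provide fewer threads than requested, so `threads_used(2)` should be read as "at most 2".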
Re: [wsjt-devel] WSJT-X Decoder Performance
Please see here the results I got with SVNVERSION 4928. I have omitted the -m flag because the software rejects it.

Best 88 de Claude

$ time ./jt9 -p 1 -d 3 -w 2 -e . /home/claude/.wsjtx/bin/save/samples/130610_2343.wav
2343  -7  0.3 3196 @ WB8QPG IZ0MIT -11
2343 -16  1.0 3372 @ KK4HEG KE0CO CN87
2343  16  0.1 3490 @ CQ AG4M EM75
2343 -18 -1.3 3567 @ CQ TA4A KM37
2343 -14  0.1 3627 @ CT1FBK IK5YZT R+02
2343 -22  0.3 3721 @ KF5SLN KB1SUA FN42
2343 -15  0.2 3774 @ CQ M0ABA JO01
2343  -1  0.2 3843 @ EI3HGB DD2EE JO31
2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
2343  -7  0.3  815 # KK4DSD W7VP -16
2343 -10  0.5  975 # CQ DL7ACA JO40
2343  -9  0.8 1089 # N2SU W0JMW R-14
2343 -11  0.8 1259 # YV6BFE F6GUU R-08
2343  -9  1.7 1471 # VA3UG F1HMR 73
2343  -1  0.6 1718 # BG THX JOE 73
2343 -15  1.3 1951 # RA3Y VE3NLS 73
2343 -20  0.4 2065 # K2OI AJ4UU R-20
DecodeFinished 1 1

real    0m30.057s
user    0m27.397s
sys     0m1.847s

$ time /home/claude/ham/JoeTaylor/wsjtx/build/jt9_omp -p 1 -d 3 -w 2 -e . /home/claude/.wsjtx/bin/save/samples/130610_2343.wav
2343  -9  0.3 3196 @ WB8QPG IZ0MIT -11
2343 -18  1.0 3372 @ KK4HEG KE0CO CN87
2343  14  0.1 3490 @ CQ AG4M EM75
2343 -20 -1.3 3567 @ CQ TA4A KM37
2343 -15  0.1 3627 @ CT1FBK IK5YZT R+02
2343 -23  0.3 3721 @ KF5SLN KB1SUA FN42
2343 -16  0.2 3774 @ CQ M0ABA JO01
2343  -2  0.2 3843 @ EI3HGB DD2EE JO31
2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
2343  -7  0.3  815 # KK4DSD W7VP -16
2343 -10  0.5  975 # CQ DL7ACA JO40
2343  -9  0.8 1089 # N2SU W0JMW R-14
2343 -11  0.8 1259 # YV6BFE F6GUU R-08
2343  -9  1.7 1471 # VA3UG F1HMR 73
2343  -1  0.6 1718 # BG THX JOE 73
2343 -15  1.3 1951 # RA3Y VE3NLS 73
2343 -20  0.4 2065 # K2OI AJ4UU R-20
DecodeFinished 0 1

real    1m57.819s
user    1m52.374s
sys     0m4.787s

$ uname -a
Linux defi 3.18.3-201.fc21.i686+PAE #1 SMP Mon Jan 19 16:09:58 UTC 2015 i686 i686 i386 GNU/Linux

# lshw
  description: Notebook
  product: P50IJ
  vendor: ASUSTeK Computer Inc.
  version: 1.0
  serial: 103144040038
  width: 32 bits
  capabilities: smbios-2.5 dmi-2.5 smp-1.4 smp
  configuration: chassis=notebook cpus=2
*-cpu:0
  description: CPU
  product: Core 2 Duo (PPN12345678901234567)
  vendor: Intel Corp.
  physical id: 4
  bus info: cpu@0
  version: 6.7.10
  serial: 0001-067A----
  slot: Socket 478
  size: 2101MHz
  capacity: 2101MHz
  width: 64 bits
  clock: 200MHz
  capabilities: x86-64 boot fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm ida dtherm tpr_shadow vnmi flexpriority cpufreq
  configuration: cores=2 enabledcores=2 id=0 threads=2
*-cache:0
  description: L1 cache
  physical id: 5
  slot: L1-Cache
  size: 64KiB
  capacity: 64KiB
  capabilities: internal write-back data
*-cache:1
  description: L2 cache
  physical id: 7
  slot: L2-Cache
  size: 2MiB
  capacity: 2MiB
  capabilities: internal write-back unified
*-logicalcpu:0
  description: Logical CPU
  physical id: 0.1
  width: 64 bits
  capabilities: logical
*-logicalcpu:1
  description: Logical CPU
  physical id: 0.2
  width: 64 bits
  capabilities: logical
*-cache
  description: L1 cache
  physical id: 6
  slot: L1-Cache
  size: 64KiB
  capacity: 64KiB
  capabilities: internal write-back instruction
*-memory
  description: System Memory
  physical id: 1d
  slot: System board or motherboard
  size: 2GiB
*-bank:0
  description: SODIMM DDR2 Synchronous 667 MHz (1.5 ns)
  product: N/A
  vendor: N/A
  physical id: 0
  serial: N/A
  slot: SODIMM0
  size: 2GiB
  width: 64 bits
  clock: 667MHz (1.5ns)
*-bank:1
  description: SODIMM [empty]
  product: N/A
  vendor: N/A
  physical id: 1
  serial: N/A
  slot: SODIMM1
*-cpu:1
  physical id: 1
  bus info: cpu@1
  version: 6.7.10
  serial: 0001-067A----
  size: 2101MHz
  capacity: 2101MHz
  capabilities: vmx ht cpufreq
  configuration: id=1
*-logicalcpu:0
  description: Logical CPU
  physical id: 1.1
  capabilities: logical
Re: [wsjt-devel] WSJT-X Decoder Performance
Dear Colleagues,

I have made some further performance tests of the decoders in WSJT-X.

I copied a collection of 25 *.wav files into a clean directory. The files were recorded in *JT9+JT65* mode during a busy period of activity on 20 meters. On average, around a dozen decodable signals are present in each file -- typically 7 or 8 JT65 signals and 4 or 5 JT9 signals.

My procedure was as follows:

1. Start the program.
2. Activate File | Erase ALL.TXT.
3. Activate File | Open and select the first file in the test directory, clicking the Open button exactly at the top of a UTC minute.
4. Allow decoding of the first file to finish, then hit Shift+F6 as soon as the blue background has cleared from the *Decode* button.
5. The program then proceeds to decode the remaining 24 files.
6. Manually record the UTC when the last decode has finished, thereby producing the total wall clock time to decode the 25 files.
7. Record the number of decoded lines in the file ALL.TXT. (Don't count the date line, at the top.)
8. Record the larger of the two bottom-line numbers from the file timer.out. This is the time that would be spent in the decoders at the end of an Rx minute -- in this case, it is essentially the wall-clock time minus the time spent reading files, producing the waterfall, etc.

Here's a summary of my results:

Program Version      Wall Clock   Time      Decode    #
---------------------------------------------------------
v1.3 r3673           90 s         62.14 s   Deepest   290
v1.4.0-rc2, r4400    76           53.98     Deepest   302
v1.5, r4926          46           24.27     Deepest   309
v1.5, r4926          42           21.47     Normal    307
v1.5, r4926          40           20.14     Fast      305

Bottom line: The decoder in v1.5 r4926 is 2.2 to 3 times faster than the ones in v1.3 r3673 and v1.4.0-rc2, and it also decodes more signals.

Note that in revision v1.5 r4926 we are not yet taking advantage of concurrent processing in the decoder, on computers with more than one CPU. Further gains can probably be achieved, if we put the effort into it.

-- 73, Joe, K1JT
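The "2.2 to 3 times faster" figure can be checked against the Time column of Joe's summary: the smallest ratio is the faster old version over the slowest v1.5 setting (53.98 / 24.27, about 2.22) and the largest is the slower old version over the fastest v1.5 setting (62.14 / 20.14, about 3.09). A short C sketch of that range calculation, with hypothetical helper names, assuming the Time column is the right basis for the comparison:

```c
#include <assert.h>
#include <stddef.h>

/* Smallest value in a non-empty array. */
static double min_of(const double *v, size_t n)
{
    double m = v[0];
    for (size_t i = 1; i < n; ++i) if (v[i] < m) m = v[i];
    return m;
}

/* Largest value in a non-empty array. */
static double max_of(const double *v, size_t n)
{
    double m = v[0];
    for (size_t i = 1; i < n; ++i) if (v[i] > m) m = v[i];
    return m;
}

/* Range of speedups of the new timings over the old ones:
 * *lo = fastest old run / slowest new run,
 * *hi = slowest old run / fastest new run. */
void speedup_range(const double *old_secs, size_t n_old,
                   const double *new_secs, size_t n_new,
                   double *lo, double *hi)
{
    assert(n_old > 0 && n_new > 0);
    *lo = min_of(old_secs, n_old) / max_of(new_secs, n_new);
    *hi = max_of(old_secs, n_old) / min_of(new_secs, n_new);
}
```

Feeding in the old times {62.14, 53.98} and the v1.5 times {24.27, 21.47, 20.14} reproduces the quoted range.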
Re: [wsjt-devel] WSJT-X Decoder Performance
On 02/02/2015 08:45 PM, Joe Taylor wrote:
> The following table presents measurements of decoding speed for a number of tests using WSJT-X versions 1.3, 1.4.0-rc2, 1.5r4925, and 1.5r4926. "Time" gives the time in seconds to decode the sample file 130610_2343.wav, which has 8 decodable JT9 signals and 9 decodable JT65 signals. "Decode" is the setting on the WSJT-X *Decode* menu. The column labeled "#" gives the number of decoded signals. (Note that selecting Deepest is required in order to decode one of the JT9 signals.)

Hi Joe,

I think that many of us are interested in supplying test results from our own environments. Please give us a recommendation on how to run the test, so that the results are comparable. Perhaps such a test could be incorporated in the Makefile.

Best 88 de Claude
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Bill,

Thanks for the explanation. I was looking for the CMake module OpenMP, didn't realize it was the different find module. It seems to be working here ( JTSDK v2 Cmake 3.0.2 ):

snip
--
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
--
snip

I'm not sure why I didn't see it initially, but, in any case, all seems to be working properly.

73's
Greg, KI7MT

On 2/3/2015 2:16 PM, Bill Somerville wrote:
> On 03/02/2015 21:04, Joe Taylor wrote:
>> Hi Greg,
>
> Hi Greg & Joe,
>
>> Yes, we need the -fopenmp flag to be set. I think that's being done appropriately now in CMakeLists.txt, although I confess I'm not always confident that I've understood its syntax fully. Bill should be able to give the definitive answer.
>
> Yes that is correct. The CMake script uses the package finder for OpenMP (part of the CMake distribution) to find the OpenMP capabilities of the compilers on the build platform. That sets the variable OPENMP_FOUND (or OPENMP-NOTFOUND if it's not available) along with the result variables OpenMP_C_FLAGS, OpenMP_CXX_FLAGS and, OpenMP_Fortran_FLAGS (actually this last one is only set by CMake v3.1 and later so I have substituted the C compiler flag as it is the same for the compilers we use at present). These flags are added to '*_omp' source compiles.
>
> The CMake script builds two versions of the internal static library target 'wsjt', the second target is 'wsjt_omp'. It also builds two versions of the 'jt9' target, the second being 'jt9_omp' which itself depends on the 'wsjt_omp' library. The 'wsjt' library is basically all the Fortran and C modules that are used by jt9, jt9code, jt9sim, jt65code and wsjt-x.
>
>> -- Joe
>
> 73
> Bill
> G4WJS.
Re: [wsjt-devel] WSJT-X Decoder Performance
On 03/02/2015 21:48, Michael Black wrote:

Hi Mike,

> I ran a test using a sample program. Compiled with
>
> gfortran -g -fopenmp -o omp1 omp1.f
>
> Every few runs it hangs after all threads have completed.

That program is not thread safe. All I/O statements need serializing to make it safe. Try adding:

!$omp critical(io)

immediately before each print or write and:

!$omp end critical(io)

immediately after each print or write.

It also is not a good example for another reason as it doesn't conditionally compile some OpenMP code so it will not compile without OpenMP support. This is bad practice because one usually wants a program to give identical results (as possible) when run with and without multi-threading.

> I haven't let them run to completion...I tried the jt9_omp and let it run and eventually it dies without an intelligent message after quite a few minutes.

jt9 is not yet suitable for multi-threaded decoding, I will be posting some amendments to deal with this when another unrelated issue has been tracked down.

> Running this program under gdb never hangs.

GDB will interfere with the thread scheduling, that is one of the gotchas of multi-threaded testing in that programs often run a different path when debugged or even when output statements are added to try and locate issues.

73
Bill
G4WJS.

> C******************************************************************************
> C FILE: omp_workshare1.f
> C DESCRIPTION:
> C   OpenMP Example - Loop Work-sharing - Fortran Version
> C   In this example, the iterations of a loop are scheduled dynamically
> C   across the team of threads. A thread will perform CHUNK iterations
> C   at a time before being scheduled for the next CHUNK of work.
> C AUTHOR: Blaise Barney 5/99
> C LAST REVISED: 01/09/04
> C******************************************************************************
>
>       PROGRAM WORKSHARE1
>
>       INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
>      +  OMP_GET_THREAD_NUM, N, CHUNKSIZE, CHUNK, I
>       PARAMETER (N=100)
>       PARAMETER (CHUNKSIZE=10)
>       REAL A(N), B(N), C(N)
>
> !     Some initializations
>       DO I = 1, N
>         A(I) = I * 1.0
>         B(I) = A(I)
>       ENDDO
>       CHUNK = CHUNKSIZE
>
> !$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(I,TID)
>
>       TID = OMP_GET_THREAD_NUM()
>       IF (TID .EQ. 0) THEN
>         NTHREADS = OMP_GET_NUM_THREADS()
>         PRINT *, 'Number of threads =', NTHREADS
>       END IF
>       PRINT *, 'Thread',TID,' starting...'
>
> !$OMP DO SCHEDULE(DYNAMIC,CHUNK)
>       DO I = 1, N
>         C(I) = A(I) + B(I)
>         WRITE(*,100) TID,I,C(I)
>  100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
>       ENDDO
> !$OMP END DO NOWAIT
>
>       PRINT *, 'Thread',TID,' done.'
>
> !$OMP END PARALLEL
>
>       END
>
> Mike W9MDB
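Bill's two points (serialize every output statement, and guard OpenMP-only code so the program still builds without OpenMP) can be sketched as follows. This is a C analogue written for illustration, not the Fortran fix itself and not code from WSJT-X; the names `current_thread` and `add_and_report` are hypothetical. Note also that C stdio is not guaranteed to misbehave in the same way as Fortran I/O, so the sketch shows the pattern rather than reproducing the hang.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only pulled in when the compiler has OpenMP enabled */
#endif

/* Thread id of the caller; always 0 in a build without OpenMP. */
int current_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Compute c[i] = a[i] + b[i] and report each element on 'out'.
   Returns the number of lines written. */
int add_and_report(const float *a, const float *b, float *c, int n, FILE *out)
{
    int lines = 0;
    int i;

    assert(a != NULL && b != NULL && c != NULL && out != NULL);

    /* Without -fopenmp the pragmas are ignored and this is a plain loop. */
#pragma omp parallel for schedule(dynamic, 10)
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];

        /* Every output statement goes through the same named critical
           section, so only one thread at a time performs I/O. */
#pragma omp critical(io)
        {
            fprintf(out, "Thread %d: c[%d] = %.2f\n", current_thread(), i, c[i]);
            lines++;   /* shared counter, safe only because it is inside the critical section */
        }
    }
    return lines;
}
```

Built with `gcc -fopenmp` the loop is shared among threads; built without the flag the same source compiles and runs serially and produces the same values in c, which is the "identical results with and without multi-threading" property Bill asks for. Only the order of the output lines may differ between the two builds.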
Re: [wsjt-devel] WSJT-X Decoder Performance
On 03/02/2015 21:52, Michael Black wrote:

Hi Mike,

> Also -- the equivalent C program doesn't hang either...so it's a problem

It is not an equivalent program. First off it doesn't print a termination message at the end of each thread run. Secondly there is no certainty that C I/O will fail in the same way as Fortran I/O when it is used in a thread-unsafe way.

> with FORTRAN in 4.8.0 or so it would appear.

73
Bill
G4WJS.

> gcc -g -fopenmp -o omp1c omp1c.c
>
> /******************************************************************************
> * FILE: omp_workshare1.c
> * DESCRIPTION:
> *   OpenMP Example - Loop Work-sharing - C/C++ Version
> *   In this example, the iterations of a loop are scheduled dynamically
> *   across the team of threads. A thread will perform CHUNK iterations
> *   at a time before being scheduled for the next CHUNK of work.
> * AUTHOR: Blaise Barney 5/99
> * LAST REVISED: 04/06/05
> ******************************************************************************/
> #include <omp.h>
> #include <stdio.h>
> #include <stdlib.h>
> #define CHUNKSIZE 10
> #define N 100
>
> int main (int argc, char *argv[])
> {
> int nthreads, tid, i, chunk;
> float a[N], b[N], c[N];
>
> /* Some initializations */
> for (i=0; i < N; i++)
>   a[i] = b[i] = i * 1.0;
> chunk = CHUNKSIZE;
>
> #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
>   {
>   tid = omp_get_thread_num();
>   if (tid == 0)
>     {
>     nthreads = omp_get_num_threads();
>     printf("Number of threads = %d\n", nthreads);
>     }
>   printf("Thread %d starting...\n",tid);
>
>   #pragma omp for schedule(dynamic,chunk)
>   for (i=0; i<N; i++)
>     {
>     c[i] = a[i] + b[i];
>     printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
>     }
>
>   } /* end of parallel section */
> }
Re: [wsjt-devel] WSJT-X Decoder Performance
On 03/02/2015 22:15, Michael Black wrote:

Hi Mike,

read my suggestion again, particularly the word each! See below.

73
Bill
G4WJS.

> Nope...still hangs...

C******************************************************************************
C FILE: omp_workshare1.f
C DESCRIPTION:
C   OpenMP Example - Loop Work-sharing - Fortran Version
C   In this example, the iterations of a loop are scheduled dynamically
C   across the team of threads. A thread will perform CHUNK iterations
C   at a time before being scheduled for the next CHUNK of work.
C AUTHOR: Blaise Barney 5/99
C LAST REVISED: 01/09/04
C******************************************************************************

      PROGRAM WORKSHARE1

      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +  OMP_GET_THREAD_NUM, N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), C(N)

!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE

!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(I,TID)

      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
!$omp critical(io)
        PRINT *, 'Number of threads =', NTHREADS
!$omp end critical(io)
      END IF
!$omp critical(io)
      PRINT *, 'Thread',TID,' starting...'
!$omp end critical(io)

!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
!$OMP CRITICAL(io)
        WRITE(*,100) TID,I,C(I)
!$OMP END CRITICAL(io)
 100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      ENDDO
!$OMP END DO NOWAIT

!$omp critical(io)
      PRINT *, 'Thread',TID,' done.'
!$omp end critical(io)

!$OMP END PARALLEL

      END
Re: [wsjt-devel] WSJT-X Decoder Performance
Sorry, I wrote free form so you will need spaces in front of those extra directives to make it fixed form compatible.

73
Bill
G4WJS.
Re: [wsjt-devel] WSJT-X Decoder Performance
On 02/02/2015 19:45, Joe Taylor wrote:
> Hi all,
>
> I have made further improvements to the speed of the decoders in WSJT-X, independently of any recourse to concurrent processing in machines with multiple CPUs.

snip

> These measurements were made on a Windows 7 machine with 4-core i5-2500 CPU.
>
> Program Version          Time     Decode    #
> ----------------------------------------------
> v1.3 r3673               2.48 s   Deepest   17
> v1.4.0-rc2, r4400        2.28     Deepest   17
> v1.5, r4925              1.01     Deepest   17
> v1.5, r4926              0.83     Deepest   17
> v1.5, r4926 -w 2 -m 2    0.80     Deepest   17
> v1.5, r4926              0.75     Normal    16
> v1.5, r4926              0.69     Fast      16
>
> The bottom line: At this stage, much has been gained by some careful algorithmic tuning. The decoder in r4926 is 3 times faster than the one in r3673, and 2.7 times faster than the one in r4400. In r4926 a small further improvement (about 4%) is obtained by using patience level -w 2 and two threads (-m 2) for the FFTs. Similar speed improvements were measured on a linux machine (Core 2 Duo, E6750 CPU).

This is an excellent performance improvement and, in the context of the ~10 second window where successful decodes are most desirable, it is a significant improvement to the operating experience.

snip

> -- Joe, K1JT

73
Bill
G4WJS.
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Bill and all,

Perhaps you already tried jt9_omp in Linux, but I had not. I tried it today, and it seems to work OK, as is.

Here are some timing tests made on my rather elderly 2-core Linux machine. This time all tests were made with the Deepest setting, ndepth=3, and all resulted in 17 good decodes of the sample file 130610_2343.wav. To get the times I measured real time to execute jt9 or jt9_omp from the command-prompt.

Program   Version              Params       Time (s)
-----------------------------------------------------
jt9       v1.3 r3673                        2.467
jt9       v1.4.0-rc2, r4400                 2.658
jt9       v1.5 r4926           -w 1 -m 1    1.243
jt9       v1.5 r4926           -w 2 -m 1    1.202
jt9       v1.5 r4926           -w 2 -m 2    1.140
jt9_omp   v1.5 r4926           -w 2 -m 1    0.834
jt9_omp   v1.5 r4926           -w 2 -m 2    0.843

When jt9_omp is used it's better *not* to use the multi-threaded FFTW plans, at least on this 2-core machine. The two cores are already being used effectively by running the two big FFTs concurrently.

For interest, here are the actual outputs of a pair of timing runs with jt9 and jt9_omp. Note that the decoded lines are the same, but JT65 lines are intermingled with JT9 lines. (I like the original ordering better -- first the one at the decode frequency; then others in the same mode in order of increasing frequency; then those in the other mode, again in order of increasing frequency. With effort, I guess we could have it both ways by letting the GUI insert decodes (after the first one) in the proper place in the sequence.)

$ time jt9 -p 1 -d 3 -w 2 -m 1 130610_2343.wav junk
2343  -9  0.3 3196 @ WB8QPG IZ0MIT -11
2343 -18  1.0 3372 @ KK4HEG KE0CO CN87
2343  14  0.1 3490 @ CQ AG4M EM75
2343 -20 -1.3 3567 @ CQ TA4A KM37
2343 -15  0.1 3627 @ CT1FBK IK5YZT R+02
2343 -23  0.3 3721 @ KF5SLN KB1SUA FN42
2343 -16  0.2 3774 @ CQ M0ABA JO01
2343  -2  0.2 3843 @ EI3HGB DD2EE JO31
2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
2343  -7  0.3  815 # KK4DSD W7VP -16
2343 -10  0.5  975 # CQ DL7ACA JO40
2343  -9  0.8 1089 # N2SU W0JMW R-14
2343 -11  0.8 1259 # YV6BFE F6GUU R-08
2343  -9  1.7 1471 # VA3UG F1HMR 73
2343  -1  0.6 1718 # BG THX JOE 73
2343 -15  1.3 1951 # RA3Y VE3NLS 73
2343 -20  0.4 2065 # K2OI AJ4UU R-20
DecodeFinished 0 1

real 0m1.196s
user 0m1.157s
sys  0m0.037s

$ time jt9_omp -p 1 -d 3 -w 2 -m 1 130610_2343.wav junk
2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
2343  -9  0.3 3196 @ WB8QPG IZ0MIT -11
2343  -7  0.3  815 # KK4DSD W7VP -16
2343 -18  1.0 3372 @ KK4HEG KE0CO CN87
2343 -10  0.5  975 # CQ DL7ACA JO40
2343  -9  0.8 1089 # N2SU W0JMW R-14
2343 -11  0.8 1259 # YV6BFE F6GUU R-08
2343  -9  1.7 1471 # VA3UG F1HMR 73
2343  14  0.1 3490 @ CQ AG4M EM75
2343 -20 -1.3 3567 @ CQ TA4A KM37
2343 -15  0.1 3627 @ CT1FBK IK5YZT R+02
2343 -23  0.3 3721 @ KF5SLN KB1SUA FN42
2343 -16  0.2 3774 @ CQ M0ABA JO01
2343  -1  0.6 1718 # BG THX JOE 73
2343 -15  1.3 1951 # RA3Y VE3NLS 73
2343  -2  0.2 3843 @ EI3HGB DD2EE JO31
2343 -20  0.4 2065 # K2OI AJ4UU R-20
DecodeFinished 0 1

real 0m0.806s
user 0m1.260s
sys  0m0.055s

In its present state the jt9_omp code does not run in Windows. I haven't yet determined why.

-- Joe, K1JT
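The real/user/sys figures in the two timing runs give a rough measure of how well jt9_omp uses the two cores. The sketch below works that out; the helper names `cpu_utilization` and `parallel_efficiency` are illustrative only, and a single timing run per program is of course a small sample.

```c
#include <assert.h>

/* Average number of busy cores: CPU time consumed (user + sys)
   divided by elapsed wall-clock time (real). */
double cpu_utilization(double real_s, double user_s, double sys_s)
{
    assert(real_s > 0.0);
    return (user_s + sys_s) / real_s;
}

/* Parallel efficiency: wall-clock speedup over the serial run,
   divided by the number of cores available. 1.0 would be ideal. */
double parallel_efficiency(double serial_real_s, double parallel_real_s, int ncores)
{
    assert(parallel_real_s > 0.0 && ncores > 0);
    return (serial_real_s / parallel_real_s) / ncores;
}
```

With the numbers above, `cpu_utilization(1.196, 1.157, 0.037)` is very close to 1.0 for jt9, while `cpu_utilization(0.806, 1.260, 0.055)` is about 1.6 for jt9_omp, so the OpenMP build keeps both cores busy for most of the run. `parallel_efficiency(1.196, 0.806, 2)` comes out at roughly 0.74: a wall-clock speedup of about 1.5 on two cores.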
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Greg,

Yes, we need the -fopenmp flag to be set. I think that's being done appropriately now in CMakeLists.txt, although I confess I'm not always confident that I've understood its syntax fully. Bill should be able to give the definitive answer.

-- Joe

On 2/3/2015 3:36 PM, ki...@yahoo.com wrote:
> Hi Joe,
>
> I'm not 100% certain on this, but for openmp on Windows, don't you have to enable that as a C flag with something like:
>
> -fopenmp
>
> I would assume that's to be done in the CMakeLists.txt file. I'm not sure about linking the libraries. The Qt5 Tool chain has winpthreads and gcc -v has --enable-libgomp so the tool chain looks to be OpenMP capable.
>
> 73's
> Greg, KI7MT
Re: [wsjt-devel] WSJT-X Decoder Performance
On 03/02/2015 20:06, Joe Taylor wrote:
> Hi Bill and all,

Hi Joe,

> Perhaps you already tried jt9_omp in Linux, but I had not. I tried it today, and it seems to work OK, as is.

There are definitely thread safety issues but I think I have most of them dealt with. I will commit them soon after I have done some basic validation checks.

> Here are some timing tests made on my rather elderly 2-core Linux machine. This time all tests were made with the Deepest setting, ndepth=3, and all resulted in 17 good decodes of the sample file 130610_2343.wav. To get the times I measured real time to execute jt9 or jt9_omp from the command-prompt.
>
>   Program   Version             Params       Time (s)
>   jt9       v1.3 r3673                       2.467
>   jt9       v1.4.0-rc2, r4400                2.658
>   jt9       v1.5 r4926          -w 1 -m 1    1.243
>   jt9       v1.5 r4926          -w 2 -m 1    1.202
>   jt9       v1.5 r4926          -w 2 -m 2    1.140
>   jt9_omp   v1.5 r4926          -w 2 -m 1    0.834
>   jt9_omp   v1.5 r4926          -w 2 -m 2    0.843
>
> When jt9_omp is used it's better *not* to use the multi-threaded FFTW plans, at least on this 2-core machine. The two cores are already being used effectively by running the two big FFTs concurrently.

Yes, that will need a basic algorithm to avoid using more threads than CPUs at any time.

> For interest, here are the actual outputs of a pair of timing runs with jt9 and jt9_omp. Note that the decoded lines are the same, but JT65 lines are intermingled with JT9 lines. (I like the original ordering better -- first the one at the decode frequency; then others in the same mode in order of increasing frequency; then those in the other mode, again in order of increasing frequency. With effort, I guess we could have it both ways by letting the GUI insert decodes (after the first one) in the proper place in the sequence.)
>
>   $ time jt9 -p 1 -d 3 -w 2 -m 1 130610_2343.wav junk
>   2343  -9  0.3 3196 @ WB8QPG IZ0MIT -11
>   2343 -18  1.0 3372 @ KK4HEG KE0CO CN87
>   2343  14  0.1 3490 @ CQ AG4M EM75
>   2343 -20 -1.3 3567 @ CQ TA4A KM37
>   2343 -15  0.1 3627 @ CT1FBK IK5YZT R+02
>   2343 -23  0.3 3721 @ KF5SLN KB1SUA FN42
>   2343 -16  0.2 3774 @ CQ M0ABA JO01
>   2343  -2  0.2 3843 @ EI3HGB DD2EE JO31
>   2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
>   2343  -7  0.3  815 # KK4DSD W7VP -16
>   2343 -10  0.5  975 # CQ DL7ACA JO40
>   2343  -9  0.8 1089 # N2SU W0JMW R-14
>   2343 -11  0.8 1259 # YV6BFE F6GUU R-08
>   2343  -9  1.7 1471 # VA3UG F1HMR 73
>   2343  -1  0.6 1718 # BG THX JOE 73
>   2343 -15  1.3 1951 # RA3Y VE3NLS 73
>   2343 -20  0.4 2065 # K2OI AJ4UU R-20
>   DecodeFinished 0 1
>
>   real 0m1.196s
>   user 0m1.157s
>   sys  0m0.037s
>
>   $ time jt9_omp -p 1 -d 3 -w 2 -m 1 130610_2343.wav junk
>   2343 -20  0.3  718 # VE6WQ SQ2NIJ -14
>   2343  -9  0.3 3196 @ WB8QPG IZ0MIT -11
>   2343  -7  0.3  815 # KK4DSD W7VP -16
>   2343 -18  1.0 3372 @ KK4HEG KE0CO CN87
>   2343 -10  0.5  975 # CQ DL7ACA JO40
>   2343  -9  0.8 1089 # N2SU W0JMW R-14
>   2343 -11  0.8 1259 # YV6BFE F6GUU R-08
>   2343  -9  1.7 1471 # VA3UG F1HMR 73
>   2343  14  0.1 3490 @ CQ AG4M EM75
>   2343 -20 -1.3 3567 @ CQ TA4A KM37
>   2343 -15  0.1 3627 @ CT1FBK IK5YZT R+02
>   2343 -23  0.3 3721 @ KF5SLN KB1SUA FN42
>   2343 -16  0.2 3774 @ CQ M0ABA JO01
>   2343  -1  0.6 1718 # BG THX JOE 73
>   2343 -15  1.3 1951 # RA3Y VE3NLS 73
>   2343  -2  0.2 3843 @ EI3HGB DD2EE JO31
>   2343 -20  0.4 2065 # K2OI AJ4UU R-20
>   DecodeFinished 0 1
>
>   real 0m0.806s
>   user 0m1.260s
>   sys  0m0.055s
>
> In its present state the jt9_omp code does not run in Windows. I haven't yet determined why.

I am seeing that too, but not on every run; I am looking for the cause(s).

> -- Joe, K1JT

73
Bill
G4WJS.
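The ordering Joe says he prefers (the decode at the QSO frequency first, then the rest of that mode by increasing frequency, then the other mode by increasing frequency) can be restored after the fact with a single sort. The Python sketch below is only an illustration of that rule, not WSJT-X code: `Decode`, `parse_decode` and `order_decodes` are hypothetical names, the field layout is read off the output lines quoted above, and the reading of '@' as JT9 and '#' as JT65 is an inference from the 8/9 signal counts reported for this file. It also assumes the QSO frequency matches a decode frequency exactly, which a real GUI would likely relax to a tolerance.

```python
from typing import List, NamedTuple


class Decode(NamedTuple):
    utc: str      # e.g. "2343"
    snr: int      # signal report in dB
    dt: float     # time offset in seconds
    freq: int     # audio frequency in Hz
    mode: str     # assumed: '@' = JT9, '#' = JT65
    message: str


def parse_decode(line: str) -> Decode:
    """Split one decoder output line into its six fields."""
    utc, snr, dt, freq, mode, message = line.split(None, 5)
    return Decode(utc, int(snr), float(dt), int(freq), mode, message)


def order_decodes(decodes: List[Decode], qso_freq: int, qso_mode: str) -> List[Decode]:
    """QSO-frequency decode first, then same mode by frequency, then other mode."""
    def key(d: Decode):
        at_qso = d.mode == qso_mode and d.freq == qso_freq
        return (0 if at_qso else 1, 0 if d.mode == qso_mode else 1, d.freq)
    return sorted(decodes, key=key)


# First four lines of the jt9_omp run above, in the order they were printed.
sample = [
    "2343 -20 0.3 718 # VE6WQ SQ2NIJ -14",
    "2343 -9 0.3 3196 @ WB8QPG IZ0MIT -11",
    "2343 -7 0.3 815 # KK4DSD W7VP -16",
    "2343 -18 1.0 3372 @ KK4HEG KE0CO CN87",
]

for d in order_decodes([parse_decode(s) for s in sample], qso_freq=3196, qso_mode="@"):
    print(d.freq, d.mode, d.message)
```

Because the sort key is a tuple, the same function would work whether the decodes arrive all at once or are re-sorted each time a new one is appended.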
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Joe,

I'm not 100% certain on this, but for OpenMP on Windows, don't you have to enable that as a C flag with something like:

  -fopenmp

I would assume that's to be done in the CMakeLists.txt file. I'm not sure about linking the libraries. The Qt5 tool chain has winpthreads, and gcc -v shows --enable-libgomp, so the tool chain looks to be OpenMP capable.

73's
Greg, KI7MT
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Joe,

Bill may still be working on this. I couldn't find several things, including the -fopenmp flag, so I'll just wait and see what shakes out.

73's
Greg, KI7MT

On 2/3/2015 2:04 PM, Joe Taylor wrote:
> Hi Greg,
>
> Yes, we need the -fopenmp flag to be set. I think that's being done appropriately now in CMakeLists.txt, although I confess I'm not always confident that I've understood its syntax fully. Bill should be able to give the definitive answer.
>
> -- Joe
Re: [wsjt-devel] WSJT-X Decoder Performance
I ran a test using a sample program, compiled with

  gfortran -g -fopenmp -o omp1 omp1.f

Every few runs it hangs after all threads have completed. I haven't let them run to completion... I tried the jt9_omp and let it run and eventually it dies without an intelligent message after quite a few minutes. Running this program under gdb never hangs.

C******************************************************************************
C FILE: omp_workshare1.f
C DESCRIPTION:
C   OpenMP Example - Loop Work-sharing - Fortran Version
C   In this example, the iterations of a loop are scheduled dynamically
C   across the team of threads. A thread will perform CHUNK iterations
C   at a time before being scheduled for the next CHUNK of work.
C AUTHOR: Blaise Barney 5/99
C LAST REVISED: 01/09/04
C******************************************************************************

      PROGRAM WORKSHARE1

      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +  OMP_GET_THREAD_NUM, N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), C(N)

!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE

!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(I,TID)

      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads =', NTHREADS
      END IF
      PRINT *, 'Thread',TID,' starting...'

!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
        WRITE(*,100) TID,I,C(I)
 100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      ENDDO
!$OMP END DO NOWAIT

      PRINT *, 'Thread',TID,' done.'

!$OMP END PARALLEL

      END

Mike W9MDB

-----Original Message-----
From: Bill Somerville [mailto:g4...@classdesign.com]
Sent: Tuesday, February 03, 2015 3:17 PM
To: wsjt-devel@lists.sourceforge.net
Subject: Re: [wsjt-devel] WSJT-X Decoder Performance

On 03/02/2015 21:04, Joe Taylor wrote:
> Hi Greg,

Hi Greg and Joe,

> Yes, we need the -fopenmp flag to be set. I think that's being done appropriately now in CMakeLists.txt, although I confess I'm not always confident that I've understood its syntax fully. Bill should be able to give the definitive answer.

Yes that is correct. The CMake script uses the package finder for OpenMP (part of the CMake distribution) to find the OpenMP capabilities of the compilers on the build platform. That sets the variable OPENMP_FOUND (or OPENMP-NOTFOUND if it's not available) along with the result variables OpenMP_C_FLAGS, OpenMP_CXX_FLAGS and OpenMP_Fortran_FLAGS (actually this last one is only set by CMake v3.1 and later so I have substituted the C compiler flag as it is the same for the compilers we use at present). These flags are added to '*_omp' source compiles.

The CMake script builds two versions of the internal static library target 'wsjt', the second target is 'wsjt_omp'. It also builds two versions of the 'jt9' target, the second being 'jt9_omp' which itself depends on the 'wsjt_omp' library. The 'wsjt' library is basically all the Fortran and C modules that are used by jt9, jt9code, jt9sim, jt65code and wsjt-x.

> -- Joe

73
Bill
G4WJS.
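Because the hang Mike describes shows up only every few runs, one way to put a number on it is to run the test program repeatedly under a timeout and count how often the timeout fires. The Python sketch below is an assumption-laden helper, not something posted to the list: `count_hangs` is a hypothetical name, and the run count and timeout are arbitrary values to be adjusted for the machine at hand.

```python
import subprocess
import sys
from typing import Sequence


def count_hangs(cmd: Sequence[str], runs: int = 20, timeout_s: float = 10.0) -> int:
    """Run cmd `runs` times; return how many runs exceeded timeout_s seconds.

    subprocess.run() kills the child when the timeout expires, so a hung
    process does not linger between iterations.
    """
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,  # the per-thread output is not needed here
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


# Self-check with a command that always exits promptly.
print("hangs:", count_hangs([sys.executable, "-c", "pass"], runs=2, timeout_s=30.0))
```

For the case above the call would look like `count_hangs(["./omp1"], runs=50)`, with `./omp1` being the binary produced by the gfortran command in Mike's message. A nonzero count confirms the hang is reproducible outside the debugger.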
Re: [wsjt-devel] WSJT-X Decoder Performance
Can one download an executable with these changes?

--------------------------------------------
On Mon, 2/2/15, Joe Taylor j...@princeton.edu wrote:

Subject: [wsjt-devel] WSJT-X Decoder Performance
To: wsjt-devel@lists.sourceforge.net
Date: Monday, February 2, 2015, 11:45 AM

> Hi all,
>
> I have made further improvements to the speed of the decoders in WSJT-X, independently of any recourse to concurrent processing in machines with multiple CPUs. The changes involve
>
> 1. Making better choices for NFFT1 and NFFT2 (the lengths of forward and inverse FFTs in the JT9 downsampler).
>
> 2. Adjusting values of limit (the Fano timeout parameter) and ccflim (JT9 synchronizing threshold) under specified conditions.
>
> 3. Using -O3 for the gfortran optimizer level.
>
> The following table presents measurements of decoding speed for a number of tests using WSJT-X versions 1.3, 1.4.0-rc2, 1.5r4925, and 1.5r4926. "Time" gives the time in seconds to decode the sample file 130610_2343.wav, which has 8 decodable JT9 signals and 9 decodable JT65 signals. "Decode" is the setting on the WSJT-X *Decode* menu. The column labeled "#" gives the number of decoded signals. (Note that selecting Deepest is required in order to decode one of the JT9 signals.) These measurements were made on a Windows 7 machine with 4-core i5-2500 CPU.
>
>   Program Version          Time     Decode    #
>   ----------------------------------------------
>   v1.3 r3673               2.48 s   Deepest   17
>   v1.4.0-rc2, r4400        2.28     Deepest   17
>   v1.5, r4925              1.01     Deepest   17
>   v1.5, r4926              0.83     Deepest   17
>   v1.5, r4926 -w 2 -m 2    0.80     Deepest   17
>   v1.5, r4926              0.75     Normal    16
>   v1.5, r4926              0.69     Fast      16
>
> The bottom line: At this stage, much has been gained by some careful algorithmic tuning. The decoder in r4926 is 3 times faster than the one in r3673, and 2.7 times faster than the one in r4400. In r4926 a small further improvement (about 4%) is obtained by using patience level -w 2 and two threads (-m 2) for the FFTs. Similar speed improvements were measured on a Linux machine (Core 2 Duo, E6750 CPU).
>
> A further speed improvement around 10% should be obtainable by computing the JT65 symbol spectra (subroutine symspec65) on the fly, during the Rx minute, rather than as part of the end-of-minute *Decode* procedure. (This is already done for the JT9 symbol spectra.) My current view is that beyond that step, further speed improvement on single-core machines (or single-core processing on multi-core machines, as in all of the tabulated tests except one) will be difficult.
>
> Further improvements can probably be made by using more than one core concurrently, e.g., by using OpenMP. As I mentioned before, the biggest (or at least easiest) gain may come from running the JT9 and JT65 decoders concurrently. It's hard to know whether the gains will be worthwhile, without trying. The programming effort may not be trivial.
>
> -- Joe, K1JT
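The "3 times", "2.7 times" and "about 4%" figures in Joe's bottom line follow directly from the Deepest rows of his table. The short Python check below just redoes that arithmetic; the dictionary keys are labels chosen for this sketch, and the times are copied from the table.

```python
# Decode times in seconds, Deepest setting, from the Windows 7 / i5-2500 table.
times = {
    "r3673": 2.48,             # v1.3
    "r4400": 2.28,             # v1.4.0-rc2
    "r4925": 1.01,             # v1.5
    "r4926": 0.83,             # v1.5
    "r4926 -w 2 -m 2": 0.80,   # v1.5 with patience 2 and two FFT threads
}


def speedup(old: str, new: str) -> float:
    """How many times faster `new` is than `old`."""
    return times[old] / times[new]


def percent_saved(old: str, new: str) -> float:
    """Run-time reduction of `new` relative to `old`, in percent."""
    return 100.0 * (times[old] - times[new]) / times[old]


print(f"r3673 -> r4926: {speedup('r3673', 'r4926'):.2f}x")
print(f"r4400 -> r4926: {speedup('r4400', 'r4926'):.2f}x")
print(f"FFT threads   : {percent_saved('r4926', 'r4926 -w 2 -m 2'):.1f}% less time")
```

The ratios come out at roughly 2.99 and 2.75, and the multi-threaded FFT saving at about 3.6%, which matches the rounded figures quoted in the message.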
Re: [wsjt-devel] WSJT-X Decoder Performance
I should have mentioned that if you're especially interested in snappy performance of the JT9 decoder, you may consider the penalty for using menu setting Decode | Fast to be unimportant. It will cost you nothing at all at the QSO frequency: that first decoding attempt is always done at Deepest level.

-- Joe, K1JT
Re: [wsjt-devel] WSJT-X Decoder Performance
Hi Jim,

On 2/2/2015 5:11 PM, Jim Pennino wrote:
> Can one download an executable with these changes?

No, not yet. Like everything else in the development (aka v1.5) branch, these changes are available only on a compile-it-yourself basis.

-- 73, Joe, K1JT