I ran some tests of OpenMP with Guile (Guile 3.0.8.99-f3ea8) on macOS M1 and on Linux Intel, because I was not sure about the performance. I found a problem on Linux: the code is slower with OpenMP (possibly by a factor of 5), whereas on macOS the gain is either 100% (run time divided by 2) or 15%, depending on the complexity of the computation. I cannot explain why it works on macOS and not on Linux. The only difference in compilation is that on macOS I had to force this option for the build to succeed:

configure --enable-mini-gmp

Anyway, this is not good performance for OpenMP with Scheme: with OpenMP on n CPUs I get a speedup of close to a factor of n in C or Fortran, when it is used for astronomical numerical simulation.

In the // region I have only this code on macOS:

scm_init_guile();

#pragma omp parallel for
for (i=start; i<=stop; i++) { /* i is private by default */
  scm_call_1( func , scm_from_int(i) );
}

On Linux this causes a segmentation fault unless I move the line scm_init_guile(); inside the for loop, like this:

#pragma omp parallel for
for (i=start; i<=stop; i++) { /* i is private by default */
  scm_init_guile();
  scm_call_1( func , scm_from_int(i) );
}

https://github.com/damien-mattei/library-FunctProg/blob/master/guile-openMP.c#L91

The Scheme+ code for the speed test looks like this (I use the Collatz function to make the computation unpredictable for any C compiler optimisations when I compare with pure C code):

;; only for speed tests
{vtstlen <+ 2642245}
{vtst <+ (make-vector vtstlen 0)}
{fct <+ (lambda (x) {x * x * x})}

(define (fctapply i) {vtst[i] <- fct(vtst[i])}) ;; neoteric expression of {vtst[i] <- (fct vtst[i])}

(define (fctpluscollatzapply i) {vtst[i] <- fctpluscollatz(vtst[i])})

(define (speed-test)

  ;; init data
  (display-nl "speed-test : Initialising data.")
  (for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
       {vtst[i] <- i})

  ;; compute
  (display-nl "speed-test : testing Scheme alone : start")
  (for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
       (fctpluscollatzapply i));;(fctapply i))
  (display-nl "speed-test : testing Scheme alone : end")
  (newline)

  ;; display a few results
  (for ({i <+ 0} {i < 10} {i <- {i + 1}})
       (display-nl {vtst[i]}))
  (display-nl ".....")
  (for ({i <+ {vtstlen - 10}} {i < vtstlen} {i <- {i + 1}})
       (display-nl {vtst[i]}))

  ;; init data
  (display-nl "speed-test : Initialising data.")
  (for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
       {vtst[i] <- i})

  ;; compute
  (display-nl "speed-test : testing Scheme with OpenMP : start")
  (openmp 0 {vtstlen - 1} (string->pointer
                            "fctpluscollatzapply"));;"fctapply"))
  (display-nl "speed-test : testing Scheme with OpenMP : end")
  (newline)

  ;; display a few results
  (for ({i <+ 0} {i < 10} {i <- {i + 1}})
       (display-nl {vtst[i]}))
  (display-nl ".....")
  (for ({i <+ {vtstlen - 10}} {i < vtstlen} {i <- {i + 1}})
       (display-nl {vtst[i]}))
  )

(define (collatz n)
  (cond ({n = 1} 1)
        ({(modulo n 2) = 0} {n / 2})
        (else {{3 * n} + 1})))

(define (fctpluscollatz x)
  (declare c)
  (if {x = 0}
      {c <- 0}
      {c <- collatz(x)})
  {{x * x * x} + c})

(define openmp (foreign-library-function "./libguile-openMP" "openmp"
                                         #:return-type int
                                         #:arg-types (list int int '*)))

(define libomp (dynamic-link "libomp"))
;; note: require a link : ln -s /opt/homebrew/opt/libomp/lib/libomp.dylib libomp.dylib
;; export LTDL_LIBRARY_PATH=. under linux with a link as above
;; or better solution: export LTDL_LIBRARY_PATH=/usr/lib/llvm-14/lib

(define omp-get-max-threads
  (pointer->procedure int
                      (dynamic-func "omp_get_max_threads" libomp)
                      '()))

https://github.com/damien-mattei/library-FunctProg/blob/master/guile/logiki%2B.scm#L3581

output:

scheme@(guile-user)> (speed-test )
speed-test : Initialising data.
speed-test : testing Scheme alone : start
speed-test : testing Scheme alone : end

0
2
9
37
66
141
219
365
516
757
.....
18446514741354254581
18446535685572961374
18446556629820732765
18446577574071146391
18446598518350624637
18446619462632745120
18446640406943930245
18446661351257757609
18446682295600649637
18446703239946183906
speed-test : Initialising data.
speed-test : testing Scheme with OpenMP : start
speed-test : testing Scheme with OpenMP : end

0
2
9
37
66
141
219
365
516
757
.....
18446514741354254581
18446535685572961374
18446556629820732765
18446577574071146391
18446598518350624637
18446619462632745120
18446640406943930245
18446661351257757609
18446682295600649637
18446703239946183906

The sequential region takes 4 s and the // region 2 s (twice as fast).

Of course, if I run equivalent pure C code it is instantaneous:

// openMP cube - collatz test

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

// OpenMP on macOS with Xcode tools:
// https://mac.r-project.org/openmp/

// export OMP_NUM_THREADS=8

// this main() in a library was only for testing openMP with Mac OS Xcode and Linux
// for use uncomment main() and comment openmp() functions

// mac os :
// clang -I/opt/homebrew/opt/libomp/include -L/opt/homebrew/opt/libomp/lib -Xclang -fopenmp -o collatz -lomp collatz.c

// gcc -L/usr/lib/llvm-14/lib/ -fopenmp -o collatz -lomp collatz.c

unsigned long long *vtst;

unsigned long long collatz(unsigned long long n) {
  if (n == 1)
    return 1;
  if ((n % 2) == 0)
    return n / 2;
  else
    return 3*n + 1;
}

unsigned long long fct(unsigned long long x) {
  unsigned long long c;
  if (x == 0)
    c = 0;
  else
    c = collatz(x);
  return (x * x * x) + c;
}

unsigned long long fctapply(unsigned long long i) {
  return vtst[i] = fct(vtst[i]);
}

int main() {

  int vtstlen = 2642245; // cubic root of 18,446,744,073,709,551,615 https://en.wikipedia.org/wiki/C_data_types

  vtst = calloc(vtstlen, sizeof(unsigned long long));

  int ncpus = omp_get_max_threads();
  printf("Found a maximum of %i cores.\n",ncpus);
  printf("Program compute cube of numbers and add collatz result (1) with and without parallelisation with OpenMP library.\n\n");
  printf("Initialising data.\n\n");

  //int iam,nthr;

  // init data sequential
  for (int i=0; i<vtstlen; i++) { /* i is private by default because it is the loop index */
    //iam = omp_get_thread_num();
    //printf("iam=%i\n",iam);
    //nthr = omp_get_num_threads() ;
    //printf("total number of threads=%i\n",nthr);
    vtst[i]=i;
  }

  printf("STARTING computation without //.\n");

  for (int i=0; i<vtstlen; i++) {
    fctapply(i);
  }

  printf("ENDING computation without //.\n\n");

  // display a few results
  for (int i=0;i < 10; i++) {
    printf("%llu\n",vtst[i]);
  }
  printf( ".....\n");
  for (int i=vtstlen - 10; i < vtstlen; i++) {
    printf("%llu\n",vtst[i]);
  }

  printf("Initialising data in //.\n\n");

  //int iam,nthr;
  // firstprivate instead of private: a private copy of vtstlen would be
  // uninitialised inside the region, and it is used as the loop bound
#pragma omp parallel for firstprivate(vtstlen) shared(vtst)
  for (int i=0; i<vtstlen; i++) { /* i is private by default because it is the loop index */
    vtst[i]=i;
  }

  printf("STARTING computation in //.\n");

  // making variables private avoids unnecessary
  // synchronisation overhead on them (mutex...)
#pragma omp parallel for firstprivate(vtstlen) shared(vtst)
  for (int i=0; i<vtstlen; i++) { /* i is private by default */
    fctapply(i);
  }

  printf("ENDING computation in //.\n\n");

  // display a few results
  for (int i=0;i < 10; i++) {
    printf("%llu\n",vtst[i]);
  }
  printf( ".....\n");
  for (int i=vtstlen - 10; i < vtstlen; i++) {
    printf("%llu\n",vtst[i]);
  }

}

https://github.com/damien-mattei/library-FunctProg/blob/master/collatz.c

In conclusion, OpenMP with Guile gives a modest speedup, a factor between 1.15 (with the logic algorithm) and 2 (the cube and Collatz benchmark), and only on macOS; on Linux it either fails with a segfault or is slower.

There must be some difference in the implementation of Guile between macOS and Linux, but I do not know the inner mechanism and algorithm used to run Guile in a C environment. What is scm_init_guile() doing? Why must it be placed inside the // region on Linux (with a slower result), while it can be placed anywhere on macOS (with faster code)?

Possibly this could be improved.

It is already a good result to see that OpenMP works with Scheme.

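One restructuring that might be worth trying, as a sketch only and not something measured here: the Guile manual says that each thread must put itself into Guile mode (with scm_init_guile or scm_with_guile) before calling libguile functions, so the initialisation can be done once per OpenMP thread at the top of the parallel region instead of once per iteration. In the block below runtime_init_thread() and apply_one() are hypothetical stand-ins for scm_init_guile() and scm_call_1(func, scm_from_int(i)), so that the pattern can be compiled without libguile; parallel_apply and runtime_init_count are also names made up for this illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for scm_init_guile(): it only counts how many
   times a thread registered itself. */
static int init_calls = 0;

static void runtime_init_thread(void)
{
    #pragma omp atomic
    init_calls++;
}

/* Accessor for the counter, only there to inspect the sketch. */
int runtime_init_count(void)
{
    return init_calls;
}

/* Hypothetical stand-in for scm_call_1(func, scm_from_int(i)):
   here it just cubes v[i] in place. */
static void apply_one(unsigned long long *v, int i)
{
    v[i] = v[i] * v[i] * v[i];
}

/* Apply the work to indices start..stop (inclusive), initialising the
   runtime once per thread rather than once per iteration. */
void parallel_apply(unsigned long long *v, int start, int stop)
{
    #pragma omp parallel
    {
        runtime_init_thread();          /* runs once in each thread */

        #pragma omp for
        for (int i = start; i <= stop; i++) {   /* i is private */
            apply_one(v, i);
        }
    }
}
```

With the real library, the two stand-ins would be replaced by the Guile calls. Whether this removes the slowdown seen on Linux is an open question; it only avoids repeating the per-thread initialisation on every iteration.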
Best wishes,
Damien

On Fri, Jan 6, 2023 at 6:06 PM Maxime Devos <maximede...@telenet.be> wrote:

> > no it returns something based on address:
> > scheme@(guile-user)> (current-thread)
> > $1 = #<thread 8814535936 (102a61d80)>
> > the good thing it is that it is different for each address, the bad is
> > that i do not know how to extract it from the result and anyway i need a
> > number : 0,1,2,3... ordered and being a partition to make scheduling that
> > each thread deal with a part of the array (vector) the way it is in OpenMP
> > like in the FOR example i posted a week ago
>
> You could define a (weak key) hash table from threads to numbers, and
> whenever a thread is encountered that isn't yet in the table, assign it
> an unused number and insert it in the table. Requires locking (or an
> atomics equivalent) though, so not ideal.
>
> (Maybe there's a method to get a number, directly, but I don't know any.)
>
> > just do a 'for like in openMP (mentioned above)
>
> In that case, when implementing slicing the array between different new
> fibers, you can give each of the fibers you spawn (one fiber per slice,
> if I understand the terminology correctly) an entry in the vector, and
> after all the fibers complete do the usual 'sum/multiply/... all
> entries' trick.
>
> As each fiber has its own (independent) storage, not touched by the
> other fibers, that should be safe.
>
> I suppose this might take more memory storage than with openMP.
>
> > i undertand fibers is better for scheduling web server request but not
> > for parallelizing like openMP - it is two differents world.
>
> You can do parallelisation with fibers (see ‘In that case, when
> implementing slicing ...’), but from what I'm reading, it will be
> somewhat unlike openMP.
>
> On 06-01-2023 16:06, Damien Mattei wrote:
> >
> > (define omp-get-max-threads
> >   (pointer->procedure int
> >                       (dynamic-func "omp_get_max_threads" libomp)
> >                       (list void)))
> >
> > but i get this error:
> > ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> > In procedure pointer->procedure: Wrong type argument in position 3: 0
> >
> > i do not understand why.
>
> ‘int omp_get_max_threads(void);’ is C's way to declare that
> omp_get_max_threads has no arguments -- there is no 'void'-typed argument.
>
> Try (untested):
>
> (define omp-get-max-threads
>   (pointer->procedure int
>                       (dynamic-func "omp_get_max_threads" libomp)
>                       (list)))
>
> Greetings,
> Maxime.
>
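
The slicing idea from the quoted message (each worker gets a numbered slice of the array and its own result slot, and the slots are combined once all workers are done) can be sketched in C with OpenMP as follows. This only illustrates the partitioning arithmetic; slice_bounds and sliced_sum are hypothetical names, not functions from Guile, fibers or the library discussed here, and nworkers is assumed to be greater than 0.

```c
#include <stddef.h>

/* Compute the half-open slice [*lo, *hi) of a range of len items that
   belongs to worker k (0 <= k < nworkers). The first len % nworkers
   workers get one extra item, so the slices cover the range exactly. */
void slice_bounds(size_t len, int nworkers, int k, size_t *lo, size_t *hi)
{
    size_t base  = len / (size_t)nworkers;
    size_t extra = len % (size_t)nworkers;
    size_t uk    = (size_t)k;

    *lo = uk * base + (uk < extra ? uk : extra);
    *hi = *lo + base + (uk < extra ? 1 : 0);
}

/* Sum v[0..len-1] slice by slice. Worker k writes only partial[k], so
   no locking is needed; the slots are combined sequentially at the end.
   partial must have room for nworkers entries. */
unsigned long long sliced_sum(const unsigned long long *v, size_t len,
                              unsigned long long *partial, int nworkers)
{
    #pragma omp parallel for
    for (int k = 0; k < nworkers; k++) {
        size_t lo, hi;
        unsigned long long acc = 0;

        slice_bounds(len, nworkers, k, &lo, &hi);
        for (size_t i = lo; i < hi; i++) {
            acc += v[i];
        }
        partial[k] = acc;               /* private slot of worker k */
    }

    unsigned long long total = 0;
    for (int k = 0; k < nworkers; k++) {
        total += partial[k];
    }
    return total;
}
```

The worker number k plays the role of the ordered 0,1,2,3... index asked for in the quoted discussion: it is handed out by the loop itself, so nothing has to be extracted from a thread object.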