Re: fibers,questions about thread id and mutation of vectors

Damien Mattei Fri, 13 Jan 2023 03:12:13 -0800

I ran some tests of OpenMP with Guile (Guile 3.0.8.99-f3ea8) on macOS M1
and on Linux Intel, because I was not sure about the performance. I found a
problem on Linux: the code is slower with OpenMP (possibly by a factor of 5),
whereas on macOS the gain is 100% (run time divided by 2) or 15%, depending
on the complexity of the computation.
I cannot explain why it works on macOS and not on Linux. The only
difference in compilation is that on macOS I had to force this option
to get the build to succeed:
configure --enable-mini-gmp


Anyway, this is not good performance for OpenMP with Scheme: with OpenMP on
n CPUs I get a gain of almost n x 100% in C or Fortran when it is used for
astronomical numerical simulations.
In the parallel (//) region I have only this code on macOS:

  scm_init_guile();

#pragma omp parallel for

  for (i=start; i<=stop; i++)  { /* i is private by default */

    scm_call_1( func , scm_from_int(i) );

On Linux this causes a segmentation fault unless I move the call to
scm_init_guile() inside the for loop, like this:

#pragma omp parallel for

  for (i=start; i<=stop; i++)  { /* i is private by default */

    scm_init_guile();
    scm_call_1( func , scm_from_int(i) );

https://github.com/damien-mattei/library-FunctProg/blob/master/guile-openMP.c#L91
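
The Guile manual states that every thread calling libguile functions must
first be put in Guile mode, which is what scm_init_guile() does for the
calling thread; OpenMP worker threads are separate threads from the one that
entered the library. A possible middle ground between the two variants above,
not tested here, is to initialise once per worker thread rather than once per
iteration, by splitting "parallel for" into "parallel" plus "for". The sketch
below only shows the shape of that loop: init_thread_runtime() and apply() are
hypothetical stand-ins for scm_init_guile() and
scm_call_1(func, scm_from_int(i)), so that it compiles without libguile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for scm_init_guile(): counts how often it ran. */
static int init_calls = 0;

static void init_thread_runtime(void) {
#pragma omp atomic
  init_calls++;
}

/* Hypothetical stand-in for scm_call_1(func, scm_from_int(i)). */
static void apply(long *v, int i) {
  v[i] = (long)i * i;
}

/* Applies apply() to every index in [start, stop].
   The per-thread initialisation runs once per worker thread,
   not once per loop iteration. */
static void openmp_range(long *v, int start, int stop) {
  assert(v != NULL);
#pragma omp parallel
  {
    init_thread_runtime();
#pragma omp for
    for (int i = start; i <= stop; i++) {
      apply(v, i);
    }
  }
}
```

Compiled without -fopenmp the pragmas are ignored and the loop simply runs on
one thread, with a single initialisation call.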

The Scheme+ code for the speed test looks like this (I use the Collatz
function to make the computation unpredictable for any C compiler
optimisation when I compare with pure C code):

;; only for speed tests
{vtstlen <+ 2642245}
{vtst <+ (make-vector vtstlen 0)}

{fct <+ (lambda (x) {x * x * x})}

(define (fctapply i) {vtst[i] <- fct(vtst[i])}) ;; neoteric expression of {vtst[i] <- (fct vtst[i])}

(define (fctpluscollatzapply i) {vtst[i] <- fctpluscollatz(vtst[i])})

(define (speed-test)

  ;; init data
  (display-nl "speed-test : Initialising data.")
  (for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
       {vtst[i] <- i})

  ;; compute
  (display-nl "speed-test : testing Scheme alone : start")
  (for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
       (fctpluscollatzapply i));;(fctapply i))
  (display-nl "speed-test : testing Scheme alone : end")

  (newline)

  ;; display a few results
  (for ({i <+ 0} {i < 10} {i <- {i + 1}})
       (display-nl {vtst[i]}))
  (display-nl ".....")
  (for ({i <+ {vtstlen - 10}} {i < vtstlen} {i <- {i + 1}})
       (display-nl {vtst[i]}))

  ;; init data
  (display-nl "speed-test : Initialising data.")
  (for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
       {vtst[i] <- i})

  ;; compute
  (display-nl "speed-test : testing Scheme with OpenMP : start")
  (openmp 0 {vtstlen - 1} (string->pointer "fctpluscollatzapply")) ;;"fctapply"))
  (display-nl "speed-test : testing Scheme with OpenMP : end")

  (newline)

  ;; display a few results
  (for ({i <+ 0} {i < 10} {i <- {i + 1}})
       (display-nl {vtst[i]}))
  (display-nl ".....")
  (for ({i <+ {vtstlen - 10}} {i < vtstlen} {i <- {i + 1}})
       (display-nl {vtst[i]}))

  )


(define (collatz n)
  (cond ({n = 1} 1)
        ({(modulo n 2) = 0} {n / 2})
        (else {{3 * n} + 1})))


(define (fctpluscollatz x)
  (declare c)
  (if {x = 0}
      {c <- 0}
      {c <- collatz(x)})
  {{x * x * x} + c})


(define openmp (foreign-library-function "./libguile-openMP" "openmp"
#:return-type int #:arg-types (list int int '*)))


(define libomp (dynamic-link "libomp"))
;; note: requires a link: ln -s /opt/homebrew/opt/libomp/lib/libomp.dylib libomp.dylib
;; export LTDL_LIBRARY_PATH=. under linux with a link as above
;; or better solution: export LTDL_LIBRARY_PATH=/usr/lib/llvm-14/lib

(define omp-get-max-threads
  (pointer->procedure int
                      (dynamic-func "omp_get_max_threads" libomp)
                      '()))

https://github.com/damien-mattei/library-FunctProg/blob/master/guile/logiki%2B.scm#L3581

output:

scheme@(guile-user)> (speed-test )
speed-test : Initialising data.
speed-test : testing Scheme alone : start
speed-test : testing Scheme alone : end

0
2
9
37
66
141
219
365
516
757
.....
18446514741354254581
18446535685572961374
18446556629820732765
18446577574071146391
18446598518350624637
18446619462632745120
18446640406943930245
18446661351257757609
18446682295600649637
18446703239946183906
speed-test : Initialising data.
speed-test : testing Scheme with OpenMP : start
speed-test : testing Scheme with OpenMP : end

0
2
9
37
66
141
219
365
516
757
.....
18446514741354254581
18446535685572961374
18446556629820732765
18446577574071146391
18446598518350624637
18446619462632745120
18446640406943930245
18446661351257757609
18446682295600649637
18446703239946183906

the sequential region: 4 seconds
the // region: 2 seconds (twice as fast)
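
How the two timings were taken is not shown. A minimal sketch of measuring a
region from C and deriving the speedup, assuming a C11 library that provides
timespec_get(); the helper names now_seconds, time_region and speedup are made
up for illustration:

```c
#include <assert.h>
#include <time.h>

/* Current wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void) {
  struct timespec ts;
  timespec_get(&ts, TIME_UTC);
  return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Wall-clock seconds spent in work(arg). */
static double time_region(void (*work)(void *), void *arg) {
  double t0 = now_seconds();
  work(arg);
  return now_seconds() - t0;
}

/* Speedup of the parallel run relative to the sequential one. */
static double speedup(double t_seq, double t_par) {
  assert(t_par > 0.0);
  return t_seq / t_par;
}
```

With the figures above, speedup(4.0, 2.0) gives 2.0; dividing that by the
number of threads gives the parallel efficiency.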

Of course, if I run equivalent pure C code it is instantaneous:

// openMP cube - collatz test

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>



// OpenMP on macOS with Xcode tools:
// https://mac.r-project.org/openmp/

// export OMP_NUM_THREADS=8

// this main() in a library was only for testing OpenMP with macOS Xcode
// and Linux; to use it, uncomment main() and comment out the openmp() functions


// mac os :
// clang -I/opt/homebrew/opt/libomp/include -L/opt/homebrew/opt/libomp/lib -Xclang -fopenmp -o collatz -lomp collatz.c

// gcc -L/usr/lib/llvm-14/lib/ -fopenmp  -o collatz  -lomp collatz.c


unsigned long long *vtst;



unsigned long long collatz(unsigned long long n) {

  if (n == 1) return 1;

  if ((n % 2) == 0)
    return n / 2;
  else
    return 3*n + 1;

}

unsigned long long fct(unsigned long long x) {

  unsigned long long c;
  if (x == 0)
    c = 0;
  else
    c = collatz(x);

  return (x * x * x) + c;
}


unsigned long long fctapply(unsigned long long i) {
  return vtst[i] = fct(vtst[i]);
}




int main() {
  int vtstlen = 2642245; // cubic root of 18,446,744,073,709,551,615, see https://en.wikipedia.org/wiki/C_data_types
  vtst = calloc(vtstlen, sizeof(unsigned long long));

  int ncpus = omp_get_max_threads();
  printf("Found a maximum of %i cores.\n",ncpus);
  printf("Program compute cube of numbers and add collatz result (1) with "
         "and without parallelisation with OpenMP library.\n\n");
  printf("Initialising data.\n\n");
  //int iam,nthr;

  // init data sequential
  for (int i=0; i<vtstlen; i++) { /* i is private by default because it is the for index */
    //iam = omp_get_thread_num();
    //printf("iam=%i\n",iam);
    //nthr = omp_get_num_threads() ;
    //printf("total number of threads=%i\n",nthr);
    vtst[i]=i;

  }


  printf("STARTING computation without //.\n");


  for (int i=0; i<vtstlen; i++) {

    fctapply(i);

  }

  printf("ENDING computation without //.\n\n");

  // display a few results
  for (int i=0;i < 10; i++) {
    printf("%llu\n",vtst[i]);
  }
  printf( ".....\n");
  for (int i=vtstlen - 10; i < vtstlen; i++) {
    printf("%llu\n",vtst[i]);
  }


  printf("Initialising data in //.\n\n");
  //int iam,nthr;

  // firstprivate (not private) so that each thread's copy of vtstlen is initialised
#pragma omp parallel for firstprivate(vtstlen) shared(vtst)


  for (int i=0; i<vtstlen; i++) { /* i is private by default because it is the for index */

    vtst[i]=i;

  }

  printf("STARTING computation in //.\n");


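  // making vtstlen thread-private avoids unnecessary // overhead on this
  // variable (mutex...); firstprivate keeps its value, plain private would
  // leave each thread's copy uninitialised
#pragma omp parallel for firstprivate(vtstlen) shared(vtst)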


  for (int i=0; i<vtstlen; i++) { /* i is private by default */

    fctapply(i);

  }

  printf("ENDING computation in //.\n\n");


  // display a few results
  for (int i=0;i < 10; i++) {
    printf("%llu\n",vtst[i]);
  }
  printf( ".....\n");
  for (int i=vtstlen - 10; i < vtstlen; i++) {
    printf("%llu\n",vtst[i]);
  }


}

https://github.com/damien-mattei/library-FunctProg/blob/master/collatz.c
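
Printing the first and last ten values only checks the two runs by eye. A
sketch of comparing the whole vector, reusing collatz() and fct() from the
program above; the names run_sequential, run_parallel and results_match are
made up for this example:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long long ull;

static ull collatz(ull n) {
  if (n == 1) return 1;
  return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

static ull fct(ull x) {
  ull c = (x == 0) ? 0 : collatz(x);
  return x * x * x + c;
}

/* v[i] = fct(i) for all i, on one thread. */
static void run_sequential(ull *v, int len) {
  for (int i = 0; i < len; i++) v[i] = fct((ull)i);
}

/* Same computation; iterations are shared between OpenMP threads. */
static void run_parallel(ull *v, int len) {
#pragma omp parallel for
  for (int i = 0; i < len; i++) v[i] = fct((ull)i);
}

/* Returns 1 if both runs give identical vectors, 0 if they differ,
   -1 if memory could not be allocated. */
static int results_match(int len) {
  assert(len > 0);
  ull *a = malloc((size_t)len * sizeof *a);
  ull *b = malloc((size_t)len * sizeof *b);
  if (!a || !b) { free(a); free(b); return -1; }
  run_sequential(a, len);
  run_parallel(b, len);
  int same = memcmp(a, b, (size_t)len * sizeof *a) == 0;
  free(a);
  free(b);
  return same;
}
```

Calling results_match(2642245) would cover the same range as the test program.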

In conclusion, OpenMP with Guile gives a small improvement in speed, a
factor between 1.15 (with the logic algorithm) and 2 (benchmark with cube
and Collatz), but only on macOS; on Linux it either fails with a segfault or
is slower.

There must be a difference in the implementation of Guile between macOS and
Linux, but I do not know the inner mechanism and algorithm used to run Guile
in a C environment. What is scm_init_guile() doing?
Why must it be placed inside the // region on Linux (with a slower result),
while on macOS it can be placed anywhere (and the code speeds up)?
Possibly this could be improved. It is already a good result to see that
OpenMP works with Scheme.
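
One thing that might be worth trying, untested here: instead of one
scm_call_1 per index, give each thread one contiguous slice [lo, hi] of the
index range and let the Scheme procedure loop over it, so that the
C-to-Scheme boundary is crossed once per thread. A sketch of the slice
computation only, similar in spirit to OpenMP's static schedule; chunk_bounds
is a hypothetical helper, not part of OpenMP or Guile:

```c
#include <assert.h>

/* Splits the inclusive range [start, stop] into nthreads contiguous slices
   and stores slice number tid in [*lo, *hi]. The first (count % nthreads)
   slices get one extra element. An empty slice has *lo > *hi. */
static void chunk_bounds(int start, int stop, int nthreads, int tid,
                         int *lo, int *hi) {
  assert(nthreads > 0 && tid >= 0 && tid < nthreads && start <= stop);
  int count = stop - start + 1;
  int base = count / nthreads;
  int extra = count % nthreads;
  int offset = tid * base + (tid < extra ? tid : extra);
  int size = base + (tid < extra ? 1 : 0);
  *lo = start + offset;
  *hi = *lo + size - 1;
}
```

For the range 0..9 and 3 threads this gives [0,3], [4,6] and [7,9]. Inside a
parallel region, tid and nthreads would come from omp_get_thread_num() and
omp_get_num_threads().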

Best wishes,

Damien


On Fri, Jan 6, 2023 at 6:06 PM Maxime Devos <maximede...@telenet.be> wrote:

> > no it returns something based on address:
> > scheme@(guile-user)> (current-thread)
> > $1 = #<thread 8814535936 (102a61d80)>
> > the good thing it is that it is different for each address, the bad is
> > that i do not know how to extract it from the result and anyway i need a
> > number : 0,1,2,3... ordered and  being a partition to make scheduling that
> > each thread deal with a part of the array (vector) the way it is in OpenMP
> > like in the FOR example i posted a week ago
>
> You could define a (weak key) hash table from threads to numbers, and
> whenever a thread is encountered that isn't yet in the table, assign it
> an unused number and insert it in the table.  Requires locking (or an
> atomics equivalent) though, so not ideal.
>
> (Maybe there's a method to get a number, directly, but I don't know any.)
>
> > just do a 'for like in openMP (mentioned above)
>
> In that case, when implementing slicing the array between different new
> fibers, you can give each of the fibers you spawn (one fiber per slice,
> if I understand the terminology correctly) an entry in the vector, and
> after all the fibers complete do the usual 'sum/multiply/... all
> entries' trick.
>
> As each fiber has its own (independent) storage, not touched by the
> other fibers, that should be safe.
>
> I suppose this might take more memory storage than with openMP.
>
> > i undertand fibers is better for scheduling web server request but not
> > for parallelizing like openMP - it is two differents world.
>
> You can do parallelisation with fibers (see ‘In that case, when
> implementing slicing ...’), but from what I'm reading, it will be
> somewhat unlike openMP.
>
> On 06-01-2023 16:06, Damien Mattei wrote:
> >
> > (define omp-get-max-threads
> >    (pointer->procedure int
> >                        (dynamic-func "omp_get_max_threads" libomp)
> >                        (list void)))
> >
> > but i get this error:
> > ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> > In procedure pointer->procedure: Wrong type argument in position 3: 0
> >
> > i do not understand why.
>
>
> ‘int omp_get_max_threads(void);’ is C's way to declare that
> omp_get_max_threads has no arguments -- there is no 'void'-typed argument.
>
> Try (untested):
>
> (define omp-get-max-threads
>    (pointer->procedure int
>                        (dynamic-func "omp_get_max_threads" libomp)
>                        (list)))
>
> Greetings,
> Maxime.
>
