I was trying to create a much more elaborate example when Matthew sent
his tiny one, which is enough to show the problem.

I started a 64-core machine on AWS to show the issue.

I see a massive degradation as the number of places increases.

I use this slightly modified code:
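#lang racket

;; Start one place that runs a tight floating-point loop, then reports.
(define (go n)
  (place/context p
         (let ([v (vector 0.0)])
           (let loop ([i 3000000000])
             (unless (zero? i)
               (vector-set! v 0 (+ (vector-ref v 0) 1.0))
               (loop (sub1 i)))))
         (printf "Place ~a done~n" n)))

(module+ main
  ;; Number of places to start, taken from the command line.
  (define cores
    (command-line
     #:args (cores)
     (string->number cores)))

  ;; Start all places first, then wait for each one to finish.
  (map place-wait
       (for/list ([i (in-range cores)])
         (printf "Starting core ~a~n" i)
         (go i))))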

Here are the results in the video (it might take a few minutes until it is live):

The guide says about places:
"The place form creates a place, which is effectively a new Racket
instance that can run in parallel to other places, including the initial
place."

I think this is misleading at the moment. If this behaviour can be
'fixed' then great; if not, I will have to redesign my system to use
'subprocess' to start separate racket processes, and a footnote should be
added to the places documentation to alert users to this behaviour.
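
If it comes to that, the 'subprocess' route could look roughly like the
sketch below. It is only an illustration under assumptions: it expects a
`racket` executable on PATH, the helper names (start-worker, finish-worker,
run-workers) are made up for this example, and the expression passed with
-e stands in for the real worker code.

```racket
#lang racket/base
(require racket/port)

;; Assumption: a `racket` executable can be found on PATH.
(define racket-exe (find-executable-path "racket"))

;; start-worker (hypothetical helper): run `expr` in a fresh racket process.
;; Passing #f three times asks subprocess to create pipes for the child's
;; stdout, stdin, and stderr.
(define (start-worker expr)
  (define-values (proc out in err)
    (subprocess #f #f #f racket-exe "-e" expr))
  (close-output-port in)               ; the worker reads nothing from stdin
  (list proc out err))

;; finish-worker (hypothetical helper): collect the output, wait for exit,
;; and return (cons exit-code output-string).
(define (finish-worker w)
  (define proc (car w))
  (define out (cadr w))
  (define err (caddr w))
  (define output (port->string out))   ; reads until the child closes stdout
  (subprocess-wait proc)
  (close-input-port out)
  (close-input-port err)
  (cons (subprocess-status proc) output))

;; Start every worker before waiting on any, so they run concurrently.
(define (run-workers exprs)
  (map finish-worker (map start-worker exprs)))

(module+ main
  (for ([r (in-list (run-workers '("(display (* 6 7))" "(display (+ 1 2))")))])
    (printf "exit ~a, output ~s\n" (car r) (cdr r))))
```

Each worker is then a separate OS process with its own allocator and address
space, at the cost of passing data through pipes or files instead of place
channels.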

Matthew, Sam, do you understand why this is happening?

On 05/10/2018 16:51, Sam Tobin-Hochstadt wrote:
> I tried this same program on my desktop, which also has 4 (i7-4770)
> cores with hyperthreading. Here's what I see:
>
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 1
> N: 1, cpu: 5808/5808.0, real: 5804
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 2
> N: 2, cpu: 12057/6028.5, real: 6063
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 3
> N: 3, cpu: 23377/7792.333333333333, real: 7914
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 4
> N: 4, cpu: 41155/10288.75, real: 10357
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 6
> N: 6, cpu: 89932/14988.666666666666, real: 15687
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 8
> N: 8, cpu: 165152/20644.0, real: 21104
>
> Real time goes up about 80% from 1-4 places, and then doubles again
> from 4 to 8. System time for 8 places is also about 10x what it is for
> 2 places, but only gets up to 2 seconds.
>
> On Fri, Oct 5, 2018 at 10:32 AM Matthew Flatt <mfl...@cs.utah.edu> wrote:
>> At Fri, 5 Oct 2018 15:36:04 +0200, Paulo Matos wrote:
>>> Again, I am really surprised that you mention that places are not
>>> separate processes. Documentation does say they are separate racket
>>> virtual machines, how is this accomplished if not by using separate
>>> processes?
>>
>> Each place is an OS thread within the Racket process. The virtual
>> machine is essentially instantiated once in each thread, where things
>> that look like global variables at the C level are actually
>> thread-local variables to make them place-specific. Still, there is
>> some sharing among the threads.
>>
>>> My workers are really doing Z3 style work - number crunching and lots of
>>> searching. No IO (writing to disk) or communication so I would expect
>>> them to really max out all CPUs.
>>
>> My best guess is that it's memory-allocation bottlenecks, probably at
>> the point of using mmap() and mprotect(). Maybe things don't scale well
>> beyond the 4-core machines that I use.
>>
>> On my machines, the enclosed program can max out CPU use with system
>> time being a small fraction. It scales ok from 1 to 4 places (i.e.,
>> real time increased only some). The machine's cores are hyperthreaded,
>> and the example maxes out CPU utilization at 8 --- but it takes twice
>> as long in real time, so the hardware threads don't help much in this
>> case. Running two processes with 4 places takes about the same real
>> time as running one process with 8 places, as does 2 processes with 2
>> places.
>>
>> Do you see similar effects, or does this little example stop scaling
>> before the number of processes matches the number of cores?
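
To put numbers on the scaling in Sam's runs quoted above, the real times can
be turned into slowdown factors. This is a small sketch, not part of anyone's
benchmark: the data is copied from the quoted output, and `slowdown` is a
helper name invented here.

```racket
#lang racket/base

;; Real times in milliseconds, copied from the quoted runs: (places . real-ms).
(define real-ms
  '((1 . 5804) (2 . 6063) (3 . 7914) (4 . 10357) (6 . 15687) (8 . 21104)))

;; slowdown (hypothetical helper): real time for `n` places divided by the
;; real time for `base` places.
(define (slowdown n [base 1])
  (exact->inexact (/ (cdr (assv n real-ms)) (cdr (assv base real-ms)))))

(module+ main
  (for ([entry (in-list real-ms)])
    (printf "~a places: ~ax the 1-place real time\n"
            (car entry) (slowdown (car entry)))))
```

Assuming every place does the same fixed amount of work, as in the program
above, perfect scaling would keep this factor near 1 for as long as there is
a free core per place.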
