On Thu, 2 Dec 2010 13:57:23 +0100
Damien Doligez <damien.doli...@inria.fr> wrote:

> 
> On 2010-11-29, at 23:27, Török Edwin wrote:
> 
> > This seems to be in concordance with the "smaller minor heap => more
> > minor collections => slower program" observation, since it is:
> > smaller minor heap => more minor collections => more major slices =>
> > major slices can't collect long-lived objects => slower program
> > (long-lived objects are swept too many times).
> > So more minor collections => higher cost for a major slice
> 
> This is a bit naive, because the size of a major slice depends on the
> amount of data promoted by the corresponding minor GC.  A smaller
> minor heap promotes less data each time, so you get more major
> slices, but they are smaller.  The total amount of work done by the
> major GC doesn't change because of that.  It only changes because a
> bigger minor heap gives more time for data to die before being
> promoted to the major heap.

Thanks for the explanation. 

Is there a way I could use more than 2 bits for the color?
I thought I would try excluding some objects from collection if they fail
to get collected for a few cycles (i.e. just mark them black from the
start). I tried modifying Wosize_hd and Make_header, but that is not
enough; is there anything else I missed?

> 
> > I think OCaml's GC should take into account how successful the last
> > major GC was (how many words it freed), and adjust speed
> > accordingly: if we know that the major slice will collect next to
> > nothing there is no point in wasting time and running it.
> > 
> > So this formula should probably be changed:
> >  p = (double) caml_allocated_words * 3.0 * (100 + caml_percent_free)
> >      / Wsize_bsize (caml_stat_heap_size) / caml_percent_free / 2.0;
> > 
> > Probably to something that also does:
> >  p =  p * major_slice_successrate
> 
> 
> The success rate is already taken into account by the major GC.  In
> fact a target success rate is set by the user: it is
>    caml_percent_free / (100 + caml_percent_free)

Thanks, that is probably what I had in mind, but adjusting it doesn't
work as I initially expected.

> and the major GC adjusts its speed according to this target.  If the
> target is not met, the free list is emptied faster than the GC can
> replenish it, so it gets depleted at some point and the major heap
> is then extended to satisfy allocation requests.  The bigger major
> heap then helps the major GC meet its target, because the success rate
> is simply (heap_size - live_data) / heap_size, and that gets higher
> as heap_size increases.
> 
> I didn't do the math, but I think your modification would make the
> major heap size increase without bound.

Yeah, I've been experimenting, and unless I put a lower bound on
'p' (like half of the initial value) it used way too much memory (it was
also faster, but then not running the GC at all is faster too, and that
is not good).

Which parameter do you think could be automatically tuned to accommodate
bursts of allocations? Maybe major_heap_increment (based on the recent
history of heap expansions and promoted words)?

BTW, I was also looking at ways to speed up mark_slice and sweep_slice
(which is not easy, since they are mostly optimized already).
I did find a place where GCC is not smart enough, though. The attached
patch improves sweep_slice by 3% to 10% in ocamlopt, depending on the
CPU (3% on Intel, 10% on an AMD Phenom II X6). Should I open a bug and
attach it?

Using gcc -O3 would sometimes give the same benefits, but I don't
really trust -O3; -O2 is as far as I would go. See here for all the
details: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763
[Speaking of which, why is the GC compiled with -O? Is -O2 too risky
too?]

P.S.: minor_heap_size got bumped in the 3.12 branch, so it is now
greater than major_heap_increment. Is that intended?

Best regards,
--Edwin
diff -ru /home/edwin/ocaml-3.12.0+rc1/byterun//major_gc.c ./major_gc.c
--- /home/edwin/ocaml-3.12.0+rc1/byterun//major_gc.c	2009-11-04 14:25:47.000000000 +0200
+++ ./major_gc.c	2010-12-02 17:02:38.000000000 +0200
@@ -286,21 +286,25 @@
 {
   char *hp;
   header_t hd;
+  /* speed opt: keep global in local var */
+  char *gc_sweep_hp = caml_gc_sweep_hp;
 
   caml_gc_message (0x40, "Sweeping %ld words\n", work);
   while (work > 0){
-    if (caml_gc_sweep_hp < limit){
-      hp = caml_gc_sweep_hp;
+    if (gc_sweep_hp < limit){
+      hp = gc_sweep_hp;
       hd = Hd_hp (hp);
       work -= Whsize_hd (hd);
-      caml_gc_sweep_hp += Bhsize_hd (hd);
+      gc_sweep_hp += Bhsize_hd (hd);
+      PREFETCH_READ_NT(gc_sweep_hp);
       switch (Color_hd (hd)){
       case Caml_white:
+	caml_gc_sweep_hp = gc_sweep_hp;
         if (Tag_hd (hd) == Custom_tag){
           void (*final_fun)(value) = Custom_ops_val(Val_hp(hp))->finalize;
           if (final_fun != NULL) final_fun(Val_hp(hp));
         }
-        caml_gc_sweep_hp = caml_fl_merge_block (Bp_hp (hp));
+        gc_sweep_hp = caml_fl_merge_block (Bp_hp (hp));
         break;
       case Caml_blue:
         /* Only the blocks of the free-list are blue.  See [freelist.c]. */
@@ -311,7 +315,7 @@
         Hd_hp (hp) = Whitehd_hd (hd);
         break;
       }
-      Assert (caml_gc_sweep_hp <= limit);
+      Assert (gc_sweep_hp <= limit);
     }else{
       chunk = Chunk_next (chunk);
       if (chunk == NULL){
@@ -320,11 +324,12 @@
         work = 0;
         caml_gc_phase = Phase_idle;
       }else{
-        caml_gc_sweep_hp = chunk;
+        gc_sweep_hp = chunk;
         limit = chunk + Chunk_size (chunk);
       }
     }
   }
+  caml_gc_sweep_hp = gc_sweep_hp;
 }
 
diff -ru /home/edwin/ocaml-3.12.0+rc1/byterun//memory.h ./memory.h
--- /home/edwin/ocaml-3.12.0+rc1/byterun//memory.h	2008-12-03 20:09:09.000000000 +0200
+++ ./memory.h	2010-12-02 17:06:12.000000000 +0200
@@ -215,6 +215,16 @@
   #define CAMLunused
 #endif
 
+/* non-temporal prefetch for read (the line need not stay in the cache after the access) */
+#if defined (__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ > 1))
+  #define PREFETCH_READ_NT(addr) __builtin_prefetch((addr), 0, 0)
+  #define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
+#else
+  #define PREFETCH_READ_NT(addr)
+  #define PREFETCH_READ(addr)
+#endif
+
+
 #define CAMLxparam1(x) \
   struct caml__roots_block caml__roots_##x; \
   CAMLunused int caml__dummy_##x = ( \
_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
