RE: [Caml-list] Re: multicore wish

2009-12-21 Thread Fischbacher T.

Mihamina,

> Ok, so for the beginner I am (must I ask on the beginners ML?): is
> multicore support just useless or not?

That *entirely* depends on what you want to do. If, for example, you 
have to do a large calculation that is limited by memory and not by CPU,
or, if you have an application that is trivially parallelized anyway, 
multicore support won't make much of a difference. There are 
(many) other applications, however, where it does matter quite a lot.

Actually, the biggest effect of multicore architectures I see is to shift
the emphasis from raw CPU power to memory bandwidth.

-- 
Thomas

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-21 Thread Jon Harrop
On Monday 21 December 2009 14:19:36 Mihamina Rakotomandimby wrote:
> Ok, so for the beginner I am (must I ask on the beginners ML?): is
> multicore support just useless or not?

I have found a great many uses for multicores but you need a decent foundation 
to make effective use of it.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-21 Thread Dario Teixeira
Hi,

> I am beginning using Ocsigen, for a growing web project:
> Is multicore support useless for scaling on Ocsigen?

Categorically, yes.  In fact, I would say that the model used by Ocsigen
is close to being optimal performance-wise as far as web applications are
concerned.  The Ocsigen server and Eliom applications use Lwt for concurrency,
ensuring that the CPU is always busy and will not idle waiting for I/O.
Moreover, the green threads offered by Lwt are much lighter than system
threads, and avoid the context switching penalty incurred by the latter.

And what about those multiple cores?  Simple, if you have n cores, then
simply fire up n instances of the Ocsigen server, and put a dispatching
server like HAProxy or Ocsigen itself as frontend (there are some simple
tricks to ensure that the same client is always directed to same server).
This solution takes advantage of the fact that serving web requests from
multiple clients is embarrasingly parallel from the web application
standpoint. (Sure, you'll have contention on the database side, but
most DBMSs handle that reasonably well).

(And yes, I am aware that Lwt's performance could be improved further
by using syscalls like epoll/kqeueue/etc instead of select.  That is
however an implementation issue, not an architectural flaw).

Best regards,
Dario Teixeira





___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-22 Thread Jon Harrop
On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
> Jon Harrop  writes:
> > 1. The array "a" is just an ordinary array of any type of values on the
> > shared heap in F# but, for generality in OCaml, this must be both the
> > underlying ordinary data and a manually-managed shared big array of
> > indices into the ordinary data where the indices get sorted while the
> > original data remain in place until they are permuted at the end.
>
> Unless you have a primitive type that isn't a pointer.

In OCaml, you would need to write a custom quicksort optimized for that 
particular type. In F#, the generic version just works and works efficiently.

> The advantage with ocaml though is that you never have pointers into a
> structure. Makes thinks a lot simpler for the GC and avoids large
> overheads in memory.

I don't understand what you mean by OCaml "never has pointers into a 
structure". Half the problem with OCaml is that OCaml almost always uses 
pointers and the programmer has no choice, e.g. for complex numbers.

> > 2. The sorted subarrays are contiguous in memory and, at some
> > subdivision, will fit into L2 cache. So F# offers optimal locality. In
> > contrast, there is no locality whatsoever in the OCaml code and most
> > accesses into the unsorted original array will incur cache misses right
> > up to main memory. So the OCaml approach does not scale as well and will
> > never see superlinear speedup because it cannot be cache friendly.
>
> On the other hand swapping two elements in the array has a constant
> cost no matter what size they have. At some size there will be a break
> even point where copying the data costs more than the cache misses and
> with increasing size the cache won't help F# so much either.

In theory, yes. In practice, that threshold is far larger than any value type 
that a real program would use so it is of no practical concern. Moreover, F# 
gives the programmer control over whether data are unboxed (value types) or 
boxed (reference types) anyway. In contrast, OCaml is tied to a few value 
types that are hard-coded into the GC.

> > 3. Child tasks are likely to be executed on the same core as their parent
> > and use a subset of their parent's data in F#, so they offer the best
> > possible locality. In contrast, child processes are likely to be executed
> > on another core in OCaml and offer the worst possible locality.
>
> But, if I understood you right, you first fork one process per
> core.

No, in OCaml I fork every child. That is the only transparent way to give the 
child a coherent view of the heap but it is extremely slow (~1ms):

  "F# can do 60MFLOPS of computation in the time it takes OCaml 
to fork a single process" -
http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824de2da5b7f714.en.html

> Those should then also each pin themself to one core.

You have no idea which core a forked child process will run on.

> Each process then has a work queue which is works through. So they will 
> always use the local data. Only when a queue runs dry they steal from
> another process and ruin locality.

You are correctly describing the efficient solution based upon work-stealing 
queues that F# uses but OCaml cannot express it.

> So I don't see where your argument fits. You are not creating childs
> on the fly. Only at the start and they run till all the work is done.
> At least in this example.

No, every recursive invocation of the parallel quicksort spawns another child 
on-the-fly. That's precisely why it parallelizes so efficiently when you have 
wait-free work-stealing task deques and a shared heap. In general, you 
rewrite algorithms into this recursive divide and conquer form and 
parallelize when possible. You can parallelize a *lot* of problems 
efficiently that way.

> > 4. Task deques can handle an arbitrary number of tasks limited only by
> > memory whereas processes are a scarce resource and forking is likely to
> > fail, whereupon the "invoke" combinator will simply execute sequentially.
> > So it is much easier to write reliable and performant code in F# than
> > OCaml.
>
> Why would you fork in invoke?

Fork is currently the only transparent way to implement "invoke" but it is 
extremely slow and unreliable.

> > 5. OCaml's fork-based "invoke" combinator is many orders of magnitude
> > slower than pushing a closure onto a concurrent task deque in F#.
> >
> > 6. The "threshold" value is a magic number derived from measurements on a
> > given machine in my OCaml code but is dynamically adjusted in a
> > machine-independent way by the "invoke" combinator in my F# code using
> > atomic operations and real time profiling of the proportion of time spent
> > spawning tasks vs doing actual work.
>
> 5+6 seem to be an implementation detail of some specific
> implementation you are talking about.

Yes. I'm talking about today's OCaml and F# implementations.

> I don't see anything in the theory that would require that.

If by "

Re: [Caml-list] Re: multicore wish

2009-12-22 Thread Edgar Friendly

On 12/22/2009 01:12 PM, Jon Harrop wrote:

On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
   

The advantage with ocaml though is that you never have pointers into a
structure. Makes thinks a lot simpler for the GC and avoids large
overheads in memory.
 

I don't understand what you mean by OCaml "never has pointers into a
structure". Half the problem with OCaml is that OCaml almost always uses
pointers and the programmer has no choice, e.g. for complex numbers.

   
I think he means that ocaml structs (records, tuples) will only ever 
have pointers pointing to their beginning - you can't have a pointer to 
somewhere in the middle of a structure.


E

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-22 Thread Jon Harrop
On Tuesday 22 December 2009 18:02:32 Edgar Friendly wrote:
> On 12/22/2009 01:12 PM, Jon Harrop wrote:
> > On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
> >> The advantage with ocaml though is that you never have pointers into a
> >> structure. Makes thinks a lot simpler for the GC and avoids large
> >> overheads in memory.
> >
> > I don't understand what you mean by OCaml "never has pointers into a
> > structure". Half the problem with OCaml is that OCaml almost always uses
> > pointers and the programmer has no choice, e.g. for complex numbers.
>
> I think he means that ocaml structs (records, tuples) will only ever
> have pointers pointing to their beginning - you can't have a pointer to
> somewhere in the middle of a structure.

If so then I do not understand the relevance. You cannot have pointers into a 
structure in F# or HLVM either...

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-24 Thread Goswin von Brederlow
Jon Harrop  writes:

> On Tuesday 22 December 2009 18:02:32 Edgar Friendly wrote:
>> On 12/22/2009 01:12 PM, Jon Harrop wrote:
>> > On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
>> >> The advantage with ocaml though is that you never have pointers into a
>> >> structure. Makes thinks a lot simpler for the GC and avoids large
>> >> overheads in memory.
>> >
>> > I don't understand what you mean by OCaml "never has pointers into a
>> > structure". Half the problem with OCaml is that OCaml almost always uses
>> > pointers and the programmer has no choice, e.g. for complex numbers.
>>
>> I think he means that ocaml structs (records, tuples) will only ever
>> have pointers pointing to their beginning - you can't have a pointer to
>> somewhere in the middle of a structure.

Exactly.

> If so then I do not understand the relevance. You cannot have pointers into a 
> structure in F# or HLVM either...

If you have an array a of (int * int) then in ocaml a is an array of
pointers to tuples. a.(5) is a pointer to the 6th tuple.

You said that in F# the array will be really an array of tuples. Then
a.(5) will be a pointer into the array at the position where the 6th
tuple is. It does not point to the begining of an allocated block but
to the middle of one.

That means as long as a.(5) is reachable the full array has to remain
allocated or the GC has to recognise that only one (a few) items of an
array are reachable and copy them before freeing the array. The GC
also needs a way to find the begining of an allocated block from a
pointer into the block. Which means extra overhead in both memory and
time.


Another think is that tuples are immutable but arrays are mutable. In
ocaml you get this nice behaviour:

# let a = Array.init 5 (fun x -> (x,x));;
val a : (int * int) array = [|(0, 0); (1, 1); (2, 2); (3, 3); (4, 4)|]
# let x = a.(1);;
val x : int * int = (1, 1)
# a.(1) <- (2,2);;
- : unit = ()
# x;;
- : int * int = (1, 1)

In F# you would get (2,2) for your fast sortable array. A different
semantic.

MfG
Goswin

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-24 Thread Goswin von Brederlow
Jon Harrop  writes:

> On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
>> Jon Harrop  writes:
>> > 1. The array "a" is just an ordinary array of any type of values on the
>> > shared heap in F# but, for generality in OCaml, this must be both the
>> > underlying ordinary data and a manually-managed shared big array of
>> > indices into the ordinary data where the indices get sorted while the
>> > original data remain in place until they are permuted at the end.
>>
>> Unless you have a primitive type that isn't a pointer.
>
> In OCaml, you would need to write a custom quicksort optimized for that 
> particular type. In F#, the generic version just works and works efficiently.
>
>> The advantage with ocaml though is that you never have pointers into a
>> structure. Makes thinks a lot simpler for the GC and avoids large
>> overheads in memory.
>
> I don't understand what you mean by OCaml "never has pointers into a 
> structure". Half the problem with OCaml is that OCaml almost always uses 
> pointers and the programmer has no choice, e.g. for complex numbers.
>
>> > 2. The sorted subarrays are contiguous in memory and, at some
>> > subdivision, will fit into L2 cache. So F# offers optimal locality. In
>> > contrast, there is no locality whatsoever in the OCaml code and most
>> > accesses into the unsorted original array will incur cache misses right
>> > up to main memory. So the OCaml approach does not scale as well and will
>> > never see superlinear speedup because it cannot be cache friendly.
>>
>> On the other hand swapping two elements in the array has a constant
>> cost no matter what size they have. At some size there will be a break
>> even point where copying the data costs more than the cache misses and
>> with increasing size the cache won't help F# so much either.
>
> In theory, yes. In practice, that threshold is far larger than any value type 
> that a real program would use so it is of no practical concern. Moreover, F# 
> gives the programmer control over whether data are unboxed (value types) or 
> boxed (reference types) anyway. In contrast, OCaml is tied to a few value 
> types that are hard-coded into the GC.
>
>> > 3. Child tasks are likely to be executed on the same core as their parent
>> > and use a subset of their parent's data in F#, so they offer the best
>> > possible locality. In contrast, child processes are likely to be executed
>> > on another core in OCaml and offer the worst possible locality.
>>
>> But, if I understood you right, you first fork one process per
>> core.
>
> No, in OCaml I fork every child. That is the only transparent way to give the 
> child a coherent view of the heap but it is extremely slow (~1ms):

So if you add a (sleep 60) to the ocaml code then ocaml gets even
more worse than F#? You are comparing an implementation that is
specifically optimized to use things F# is good at and ocaml is
bad. To give a fair comparison you need to optimize the implementation
to the language.

>   "F# can do 60MFLOPS of computation in the time it takes OCaml 
> to fork a single process" -
> http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824de2da5b7f714.en.html
>
>> Those should then also each pin themself to one core.
>
> You have no idea which core a forked child process will run on.

I can pin each child to one core and the scheduler will move it there.
I would expect a multi-core ocaml to actualy do this internaly already
so there is on GC threads per core ocaml runs on. But that depends on
the GC implementation that will be used.

>> Each process then has a work queue which is works through. So they will 
>> always use the local data. Only when a queue runs dry they steal from
>> another process and ruin locality.
>
> You are correctly describing the efficient solution based upon work-stealing 
> queues that F# uses but OCaml cannot express it.

You mean that you didn't implement that way. I can easily express that
with closures and pre-forked worker threads.

>> So I don't see where your argument fits. You are not creating childs
>> on the fly. Only at the start and they run till all the work is done.
>> At least in this example.
>
> No, every recursive invocation of the parallel quicksort spawns another child 
> on-the-fly. That's precisely why it parallelizes so efficiently when you have 
> wait-free work-stealing task deques and a shared heap. In general, you 
> rewrite algorithms into this recursive divide and conquer form and 
> parallelize when possible. You can parallelize a *lot* of problems 
> efficiently that way.

That seems awfully ineficient. Then you have a million children
running on maybe 4 cores and each job queue holds no job.

I think we mean different things by children. By children I mean what
the kernel sees as children. Newly forked threads. I think you mean
jobs that get put into the queues. There is no reason to fork a new
system thread every time the parallel quicksort splits an array in
two.

>> > 

Re: [Caml-list] Re: multicore wish

2009-12-24 Thread Jon Harrop
On Thursday 24 December 2009 12:58:18 Goswin von Brederlow wrote:
> Jon Harrop  writes:
> > On Tuesday 22 December 2009 18:02:32 Edgar Friendly wrote:
> >> On 12/22/2009 01:12 PM, Jon Harrop wrote:
> >> > On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
> >> >> The advantage with ocaml though is that you never have pointers into
> >> >> a structure. Makes thinks a lot simpler for the GC and avoids large
> >> >> overheads in memory.
> >> >
> >> > I don't understand what you mean by OCaml "never has pointers into a
> >> > structure". Half the problem with OCaml is that OCaml almost always
> >> > uses pointers and the programmer has no choice, e.g. for complex
> >> > numbers.
> >>
> >> I think he means that ocaml structs (records, tuples) will only ever
> >> have pointers pointing to their beginning - you can't have a pointer to
> >> somewhere in the middle of a structure.
>
> Exactly.
>
> > If so then I do not understand the relevance. You cannot have pointers
> > into a structure in F# or HLVM either...
>
> If you have an array a of (int * int) then in ocaml a is an array of
> pointers to tuples. a.(5) is a pointer to the 6th tuple.

Yes.

> You said that in F# the array will be really an array of tuples. Then
> a.(5) will be a pointer into the array at the position where the 6th
> tuple is.

No. The expression a.(5) loads all fields of the value type. For example, if 
it were an array of complex number then a.(5) would load the real and 
imaginary parts.

> That means as long as a.(5) is reachable the full array has to remain
> allocated or the GC has to recognise that only one (a few) items of an
> array are reachable and copy them before freeing the array. The GC
> also needs a way to find the begining of an allocated block from a
> pointer into the block. Which means extra overhead in both memory and
> time.
>
> Another think is that tuples are immutable but arrays are mutable. In
> ocaml you get this nice behaviour:
>
> # let a = Array.init 5 (fun x -> (x,x));;
> val a : (int * int) array = [|(0, 0); (1, 1); (2, 2); (3, 3); (4, 4)|]
> # let x = a.(1);;
> val x : int * int = (1, 1)
> # a.(1) <- (2,2);;
> - : unit = ()
> # x;;
> - : int * int = (1, 1)
>
> In F# you would get (2,2) for your fast sortable array. A different
> semantic.

No. Even if F# represented tuples as structs (which it does not, 
unfortunately) you would still get the same result as you do in OCaml.

HLVM already implements tuples as structs and your example works exactly as 
the OCaml does:

# let xs = create(6, (1, 1)) in let x = xs.(1) in xs.(1) <- (2, 2); x;;
- : `Struct[`Int; `Int] = (1, 1)

The only difference is that HLVM unboxes the pairs in the array so the heap 
contains only a single pointer and accessing a pair requires only one 
indirection.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-24 Thread Jon Harrop
On Thursday 24 December 2009 13:19:52 Goswin von Brederlow wrote:
> Jon Harrop  writes:
> > No, in OCaml I fork every child. That is the only transparent way to give
> > the child a coherent view of the heap but it is extremely slow (~1ms):
>
> So if you add a (sleep 60) to the ocaml code then ocaml gets even
> more worse than F#? You are comparing an implementation that is
> specifically optimized to use things F# is good at and ocaml is
> bad. To give a fair comparison you need to optimize the implementation 
> to the language.

Post a better OCaml solution if you can.

> >> Each process then has a work queue which is works through. So they will
> >> always use the local data. Only when a queue runs dry they steal from
> >> another process and ruin locality.
> >
> > You are correctly describing the efficient solution based upon
> > work-stealing queues that F# uses but OCaml cannot express it.
>
> You mean that you didn't implement that way.

No, I mean OCaml cannot express it.

> I can easily express that with closures and pre-forked worker threads.

OCaml's threads do not run in parallel so that will not work.

> >> So I don't see where your argument fits. You are not creating childs
> >> on the fly. Only at the start and they run till all the work is done.
> >> At least in this example.
> >
> > No, every recursive invocation of the parallel quicksort spawns another
> > child on-the-fly. That's precisely why it parallelizes so efficiently
> > when you have wait-free work-stealing task deques and a shared heap. In
> > general, you rewrite algorithms into this recursive divide and conquer
> > form and parallelize when possible. You can parallelize a *lot* of
> > problems efficiently that way.
>
> That seems awfully ineficient. Then you have a million children
> running on maybe 4 cores and each job queue holds no job.
>
> I think we mean different things by children. By children I mean what
> the kernel sees as children. Newly forked threads. I think you mean 
> jobs that get put into the queues. There is no reason to fork a new
> system thread every time the parallel quicksort splits an array in
> two.

The children are lightweight tasks, not threads or processes.

> >> Why would you fork in invoke?
> >
> > Fork is currently the only transparent way to implement "invoke" but it
> > is extremely slow and unreliable.
>
> No, it is the insane way. Use closures.

Please demonstrate.

> > parallel programs that leverage multicores and run a *lot* faster than
> > anything that can be written in OCaml. You can even make this accessible
> > to OCaml programmers as a DSL with automatic interop to make high
> > performance parallel programming as easy as possible from OCaml.
>
> Or just implement it properly in ocaml. The problem part is the
> GC. The rest is easy.

No kidding.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-27 Thread Goswin von Brederlow
Jon Harrop  writes:

> On Thursday 24 December 2009 13:19:52 Goswin von Brederlow wrote:
>> Jon Harrop  writes:
>> > No, in OCaml I fork every child. That is the only transparent way to give
>> > the child a coherent view of the heap but it is extremely slow (~1ms):
>>
>> So if you add a (sleep 60) to the ocaml code then ocaml gets even
>> more worse than F#? You are comparing an implementation that is
>> specifically optimized to use things F# is good at and ocaml is
>> bad. To give a fair comparison you need to optimize the implementation 
>> to the language.
>
> Post a better OCaml solution if you can.
>
>> >> Each process then has a work queue which is works through. So they will
>> >> always use the local data. Only when a queue runs dry they steal from
>> >> another process and ruin locality.
>> >
>> > You are correctly describing the efficient solution based upon
>> > work-stealing queues that F# uses but OCaml cannot express it.
>>
>> You mean that you didn't implement that way.
>
> No, I mean OCaml cannot express it.
>
>> I can easily express that with closures and pre-forked worker threads.
>
> OCaml's threads do not run in parallel so that will not work.

>> Or just implement it properly in ocaml. The problem part is the
>> GC. The rest is easy.
>
> No kidding.

There is one implementation: http://www.algo-prog.info/ocmc/web/
But as said maybe not a verry good one.

I tried implementing parallel threads under the original GC by forking
multiple instances of the same program and using a Bigarray to mmap
/dev/null for shared memory between the instances. That works for
sharing primitive types, flat records (records of primitive types) and
even fixed (or bound) sized structures but it gets more and more
complex to share each and needs some Obj magic, marshaling or C stubs
(except for primitive types). It works but is a real hack.

I hope someone will pick up the pices and update OCaml4Multicore to
the latest ocaml or maybe for ocaml to add it directly. If not then I
fear ocaml will be left behind soon.

MfG
Goswin

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-27 Thread Jon Harrop
On Sunday 27 December 2009 12:45:53 Goswin von Brederlow wrote:
> There is one implementation: http://www.algo-prog.info/ocmc/web/
> But as said maybe not a very good one.
>
> I tried implementing parallel threads under the original GC by forking
> multiple instances of the same program and using a Bigarray to mmap
> /dev/null for shared memory between the instances. That works for
> sharing primitive types, flat records (records of primitive types) and
> even fixed (or bound) sized structures but it gets more and more
> complex to share each and needs some Obj magic, marshaling or C stubs
> (except for primitive types). It works but is a real hack.

Once you've conceded to manual memory management (mmap of shared bigarrays) 
and low-level programming (no polymorphism, Obj.magic) you've lost the main 
advantages of OCaml and you still cannot get great performance.

> I hope someone will pick up the pices and update OCaml4Multicore to
> the latest ocaml or maybe for ocaml to add it directly. If not then I
> fear ocaml will be left behind soon.

Building upon OCaml rather than starting from scratch makes it vastly more 
difficult to implement a useful parallel GC not just because the GC and code 
gen must work in harmony together but because OCaml has been so heavily 
optimized in the wrong direction for this (e.g. a bit twiddled uniform 
representation that burdens the GC with everything from complex numbers to 
pairs).

That's why I think the best solution is to start from scratch and build a 
completely separate VM bred for shared-memory parallelism. Indeed, HLVM 
already outperforms OC4MC even though I have put a fraction of the effort 
into it and built a lot more surrounding infrastructure. For example, my 
latest GC only took 5 days to write.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-28 Thread Gerd Stolpmann

Am Sonntag, den 27.12.2009, 13:45 +0100 schrieb Goswin von Brederlow:
> Jon Harrop  writes:
> 
> > On Thursday 24 December 2009 13:19:52 Goswin von Brederlow wrote:
> >> Jon Harrop  writes:
> >> > No, in OCaml I fork every child. That is the only transparent way to give
> >> > the child a coherent view of the heap but it is extremely slow (~1ms):
> >>
> >> So if you add a (sleep 60) to the ocaml code then ocaml gets even
> >> more worse than F#? You are comparing an implementation that is
> >> specifically optimized to use things F# is good at and ocaml is
> >> bad. To give a fair comparison you need to optimize the implementation 
> >> to the language.
> >
> > Post a better OCaml solution if you can.
> >
> >> >> Each process then has a work queue which is works through. So they will
> >> >> always use the local data. Only when a queue runs dry they steal from
> >> >> another process and ruin locality.
> >> >
> >> > You are correctly describing the efficient solution based upon
> >> > work-stealing queues that F# uses but OCaml cannot express it.
> >>
> >> You mean that you didn't implement that way.
> >
> > No, I mean OCaml cannot express it.
> >
> >> I can easily express that with closures and pre-forked worker threads.
> >
> > OCaml's threads do not run in parallel so that will not work.
> 
> >> Or just implement it properly in ocaml. The problem part is the
> >> GC. The rest is easy.
> >
> > No kidding.
> 
> There is one implementation: http://www.algo-prog.info/ocmc/web/
> But as said maybe not a verry good one.
> 
> I tried implementing parallel threads under the original GC by forking
> multiple instances of the same program and using a Bigarray to mmap
> /dev/null for shared memory between the instances. That works for
> sharing primitive types, flat records (records of primitive types) and
> even fixed (or bound) sized structures but it gets more and more
> complex to share each and needs some Obj magic, marshaling or C stubs
> (except for primitive types). It works but is a real hack.

It works with all types:

https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/src/netsys/netsys_mem.mli

look for init_value. It's non-released code yet.

However, there are some problems: Values outside the heap do not support
the polymorphic comparison and hash functions. That's a hard limitation,
e.g. you cannot even compare two strings, or build a hash table with
strings as keys. That limits the usefulness of shared memory.

Also, as we are striving for safe programs, there is the question when
shared memory can be safely released. The GC could help here, but there
are no ways to hook in, e.g. to get notified when there are no pointers
anymore to a value living in shared memory.

Gerd

> 
> I hope someone will pick up the pices and update OCaml4Multicore to
> the latest ocaml or maybe for ocaml to add it directly. If not then I
> fear ocaml will be left behind soon.
> 
> MfG
> Goswin
> 
> ___
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
> 
-- 

Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
g...@gerd-stolpmann.de  http://www.gerd-stolpmann.de
Phone: +49-6151-153855  Fax: +49-6151-997714


___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-28 Thread Anil Madhavapeddy
On 28 Dec 2009, at 12:28, Gerd Stolpmann wrote:

> However, there are some problems: Values outside the heap do not support
> the polymorphic comparison and hash functions. That's a hard limitation,
> e.g. you cannot even compare two strings, or build a hash table with
> strings as keys. That limits the usefulness of shared memory.

Camlp4 may help here; Thomas Gazagnaire and I have been working on a 
language-integrated ORM, and it has a reliable type-conv hash generator library 
[1].

It works with mutable records as well as immutable ones (the main limitation of 
the built-in polymorphic hash function) but it could be used to generate 
explicit comparison functions for shared memory as well.  We need to split out 
the various support libraries in the ORM separately at some point anyway, so if 
the hash generator is of any use I'll cut a release.

[1] http://github.com/avsm/ocaml-orm-sqlite/tree/master/hash/

-anill
___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-28 Thread Xavier Leroy
Gerd Stolpmann wrote:

> It works with all types:
> 
> https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/src/netsys/netsys_mem.mli
> 
> look for init_value. It's non-released code yet.
> 
> However, there are some problems: Values outside the heap do not support
> the polymorphic comparison and hash functions. That's a hard limitation,
> e.g. you cannot even compare two strings, or build a hash table with
> strings as keys. That limits the usefulness of shared memory.

In OCaml 3.11 and later, you can call

   caml_page_table_add(In_static_area, start, end)

to inform the run-time system that the address range [start, end)
contains well-formed Caml data that polymorphic primitives can safely
work on.  This should solve your problem.

- Xavier Leroy

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Re: multicore wish

2009-12-29 Thread Gerd Stolpmann

Am Montag, den 28.12.2009, 19:05 +0100 schrieb Xavier Leroy:
> Gerd Stolpmann wrote:
> 
> > It works with all types:
> > 
> > https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/src/netsys/netsys_mem.mli
> > 
> > look for init_value. It's non-released code yet.
> > 
> > However, there are some problems: Values outside the heap do not support
> > the polymorphic comparison and hash functions. That's a hard limitation,
> > e.g. you cannot even compare two strings, or build a hash table with
> > strings as keys. That limits the usefulness of shared memory.
> 
> In OCaml 3.11 and later, you can call
> 
>caml_page_table_add(In_static_area, start, end)
> 
> to inform the run-time system that the address range [start, end)
> contains well-formed Caml data that polymorphic primitives can safely
> work on.  This should solve your problem.

Yes, in deed! Very nice.

As there are still unused bits in the page table... I'm thinking about
the following. One bit could be used by the GC to indicate that values
somewhere in the page are still referenced from heap memory. At the
beginning of the mark phase the bit is cleared. Whenever a pointer is
seen that goes to a page marked as In_static_area, the bit for the area
is set. As before, the mark phase ignores these pointers otherwise (they
are not followed). After the mark phase, one could check which of the
pages are still used, and this would make it possible to call finalisers
for whole pages. As we only add an "else case" to the mark loop, the
costs for this are negligible.

In the shared memory scenario, each process mapping share memory could
define such a finaliser, and only when it is called the memory block is
unmapped. (And when a global counter is decreased to 0, the shared
memory object as such can be deleted. But that's outside the ocaml
runtime.)

With that help, shared memory would be relatively safe to use. The only
remaining trap would be value pointers that go from shared memory to
heap memory - but ocaml has already lots of ways to control mutability
of values, so I think it is ok to let the programmer do this.

Gerd
-- 

Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
g...@gerd-stolpmann.de  http://www.gerd-stolpmann.de
Phone: +49-6151-153855  Fax: +49-6151-997714


___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs