Re: Performance of small allocations via prim ops

2023-04-06 Thread Harendra Kumar
Ah, some other optimization seems to be kicking in here. When I increase
the size of the array to > 128 then I see a call to stg_newByteArray# being
emitted:

 {offset
   c1kb: // global
   if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd;
   c1kc: // global
   R1 = Main.main1_closure;
   call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
   c1kd: // global
   I64[Sp - 8] = c1k9;
   R1 = 129;
   Sp = Sp - 8;
   call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8,
upd: 8;

-harendra

On Fri, 7 Apr 2023 at 10:49, Harendra Kumar 
wrote:

> Thanks Ben and Carter.
>
> I compiled the following to Cmm:
>
> {-# LANGUAGE MagicHash #-}
> {-# LANGUAGE UnboxedTuples #-}
>
> import GHC.IO
> import GHC.Exts
>
> data M = M (MutableByteArray# RealWorld)
>
> main = do
>  _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M
> arr #))
>  return ()
>
> It produced the following Cmm:
>
>  {offset
>c1k3: // global
>Hp = Hp + 24;
>if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
>c1k7: // global
>HpAlloc = 24;
>R1 = Main.main1_closure;
>call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
>c1k6: // global
>I64[Hp - 16] = stg_ARR_WORDS_info;
>I64[Hp - 8] = 1;
>R1 = GHC.Tuple.()_closure+1;
>call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
>  }
>
> It seems to be as good as it gets. There is absolutely no scope for
> improvement in this.
>
> -harendra
>
> On Fri, 7 Apr 2023 at 03:32, Ben Gamari  wrote:
>
>> Harendra Kumar  writes:
>>
>> > I was looking at the RTS code for allocating small objects via prim ops
>> > e.g. newByteArray# . The code looks like:
>> >
>> > stg_newByteArrayzh ( W_ n )
>> > {
>> > MAYBE_GC_N(stg_newByteArrayzh, n);
>> >
>> > payload_words = ROUNDUP_BYTES_TO_WDS(n);
>> > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
>> > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>> >
>> > We are making a foreign call here (ccall). I am wondering how much
>> overhead
>> > a ccall adds? I guess it may have to save and restore registers. Would
>> it
>> > be better to do the fast path case of allocating small objects from the
>> > nursery using cmm code like in stg_gc_noregs?
>> >
>> GHC's operational model is designed in such a way that foreign calls are
>> fairly cheap (e.g. we don't need to switch stacks, which can be quite
>> costly). Judging by the assembler produced for newByteArray# in one
>> random x86-64 tree that I have lying around, it's only a couple of
>> data-movement instructions, an %eax clear, and a stack pop:
>>
>>   36:   48 89 cemov%rcx,%rsi
>>   39:   48 89 c7mov%rax,%rdi
>>   3c:   31 c0   xor%eax,%eax
>>   3e:   e8 00 00 00 00  call   43
>> 
>>   43:   48 83 c4 08 add$0x8,%rsp
>>
>> The data movement operations in particular are quite cheap on most
>> microarchitectures where GHC would run due to register renaming. I doubt
>> that this overhead would be noticable in anything but a synthetic
>> benchmark. However, it never hurts to measure.
>>
>> Cheers,
>>
>> - Ben
>>
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Performance of small allocations via prim ops

2023-04-06 Thread Harendra Kumar
Thanks Ben and Carter.

I compiled the following to Cmm:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO
import GHC.Exts

data M = M (MutableByteArray# RealWorld)

main = do
 _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M
arr #))
 return ()

It produced the following Cmm:

 {offset
   c1k3: // global
   Hp = Hp + 24;
   if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
   c1k7: // global
   HpAlloc = 24;
   R1 = Main.main1_closure;
   call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
   c1k6: // global
   I64[Hp - 16] = stg_ARR_WORDS_info;
   I64[Hp - 8] = 1;
   R1 = GHC.Tuple.()_closure+1;
   call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
 }

It seems to be as good as it gets. There is absolutely no scope for
improvement in this.

-harendra

On Fri, 7 Apr 2023 at 03:32, Ben Gamari  wrote:

> Harendra Kumar  writes:
>
> > I was looking at the RTS code for allocating small objects via prim ops
> > e.g. newByteArray# . The code looks like:
> >
> > stg_newByteArrayzh ( W_ n )
> > {
> > MAYBE_GC_N(stg_newByteArrayzh, n);
> >
> > payload_words = ROUNDUP_BYTES_TO_WDS(n);
> > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
> > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
> >
> > We are making a foreign call here (ccall). I am wondering how much
> overhead
> > a ccall adds? I guess it may have to save and restore registers. Would it
> > be better to do the fast path case of allocating small objects from the
> > nursery using cmm code like in stg_gc_noregs?
> >
> GHC's operational model is designed in such a way that foreign calls are
> fairly cheap (e.g. we don't need to switch stacks, which can be quite
> costly). Judging by the assembler produced for newByteArray# in one
> random x86-64 tree that I have lying around, it's only a couple of
> data-movement instructions, an %eax clear, and a stack pop:
>
>   36:   48 89 cemov%rcx,%rsi
>   39:   48 89 c7mov%rax,%rdi
>   3c:   31 c0   xor%eax,%eax
>   3e:   e8 00 00 00 00  call   43 
>   43:   48 83 c4 08 add$0x8,%rsp
>
> The data movement operations in particular are quite cheap on most
> microarchitectures where GHC would run due to register renaming. I doubt
> that this overhead would be noticable in anything but a synthetic
> benchmark. However, it never hurts to measure.
>
> Cheers,
>
> - Ben
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Performance of small allocations via prim ops

2023-04-06 Thread Ben Gamari
Harendra Kumar  writes:

> I was looking at the RTS code for allocating small objects via prim ops
> e.g. newByteArray# . The code looks like:
>
> stg_newByteArrayzh ( W_ n )
> {
> MAYBE_GC_N(stg_newByteArrayzh, n);
>
> payload_words = ROUNDUP_BYTES_TO_WDS(n);
> words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
> ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>
> We are making a foreign call here (ccall). I am wondering how much overhead
> a ccall adds? I guess it may have to save and restore registers. Would it
> be better to do the fast path case of allocating small objects from the
> nursery using cmm code like in stg_gc_noregs?
>
GHC's operational model is designed in such a way that foreign calls are
fairly cheap (e.g. we don't need to switch stacks, which can be quite
costly). Judging by the assembler produced for newByteArray# in one
random x86-64 tree that I have lying around, it's only a couple of
data-movement instructions, an %eax clear, and a stack pop:

  36:   48 89 cemov%rcx,%rsi
  39:   48 89 c7mov%rax,%rdi
  3c:   31 c0   xor%eax,%eax
  3e:   e8 00 00 00 00  call   43 
  43:   48 83 c4 08 add$0x8,%rsp

The data movement operations in particular are quite cheap on most
microarchitectures where GHC would run due to register renaming. I doubt
that this overhead would be noticable in anything but a synthetic
benchmark. However, it never hurts to measure.

Cheers,

- Ben


signature.asc
Description: PGP signature
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Performance of small allocations via prim ops

2023-04-06 Thread Carter Schonwald
That sounds like a worthy experiment!

I  guess that would look like having an inline macro’d up path that checks
if it can get the job done that falls back to the general code?

Last I checked, the overhead for this sort of c call was on the order of
10nanoseconds or less which seems like it’d be very unlikely to be a
bottleneck, but do you have any natural or artificial benchmark programs
that would show case this?

For this sortah code, extra branching for that optimization could easily
have a larger performance impact than the known function call on modern
hardware.  (Though take my intuitions about these things with a grain of
salt. )

On Tue, Apr 4, 2023 at 9:50 PM Harendra Kumar 
wrote:

> I was looking at the RTS code for allocating small objects via prim ops
> e.g. newByteArray# . The code looks like:
>
> stg_newByteArrayzh ( W_ n )
> {
> MAYBE_GC_N(stg_newByteArrayzh, n);
>
> payload_words = ROUNDUP_BYTES_TO_WDS(n);
> words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
> ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>
> We are making a foreign call here (ccall). I am wondering how much
> overhead a ccall adds? I guess it may have to save and restore registers.
> Would it be better to do the fast path case of allocating small objects
> from the nursery using cmm code like in stg_gc_noregs?
>
> -harendra
> ___
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs