Re: Performance of small allocations via prim ops

2023-04-12 Thread Sylvain Henry



One complication is that currently GHC has no way to know the value of
LARGE_OBJECT_THRESHOLD (which is a runtime system macro). Typically to
handle this sort of thing we use utils/deriveConstants to generate a
Haskell binding mirroring the value of the C declaration. However,
as GHC becomes runtime-retargetable we may need to revisit this design.


Since
https://gitlab.haskell.org/ghc/ghc/-/commit/085983e63bfe6af23f8b85fbfcca8db4872d2f60
(2021-03) we no longer do this: we only read constants from the header file
provided by the RTS unit. Adding one more constant for
LARGE_OBJECT_THRESHOLD shouldn't be an issue.


Cheers

Sylvain

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Performance of small allocations via prim ops

2023-04-09 Thread Ben Gamari
Harendra Kumar  writes:

> I am confused by this flag. It allows statically known array sizes of
> <= n to be allocated from the current nursery block.
> But looking at the code in allocateMightFail, as I interpret it, any size
> array up to LARGE_OBJECT_THRESHOLD is anyway allocated from the current
> nursery block. So why have this option? Why not fix this to
> LARGE_OBJECT_THRESHOLD? Maybe I am missing something.
>
In principle we could do so. The motivation for making this a flag isn't
immediately clear from the commit implementing this optimisation
(1eece45692fb5d1a5f4ec60c1537f8068237e9c1).

One complication is that currently GHC has no way to know the value of
LARGE_OBJECT_THRESHOLD (which is a runtime system macro). Typically to
handle this sort of thing we use utils/deriveConstants to generate a
Haskell binding mirroring the value of the C declaration. However,
as GHC becomes runtime-retargetable we may need to revisit this design.

Cheers,

- Ben




Re: Performance of small allocations via prim ops

2023-04-07 Thread Carter Schonwald
Great, fast experimentation!

I will admit I’m pleased that my dated intuition is still correct, but more
importantly we have more current data!

Thanks for the exploration and sharing what you found!



Re: Performance of small allocations via prim ops

2023-04-07 Thread Harendra Kumar

I converted my example code into a loop and ran it a million times with a
1-byte array size (8 bytes after alignment), so roughly 3 words are
allocated per array, including the header and the length. It took 5 ms
using the statically-known-size optimization, which inlines the allocation
completely, and 10 ms using an unknown size (from a program argument), which
makes a call to newByteArray#. That works out to roughly 5 ns extra per
allocation, which does not sound like a big deal.
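For concreteness, the experiment might be sketched like this (a
reconstruction under my own assumptions, not the original benchmark code;
`allocBytes` is a name I made up):

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
-- Sketch of the experiment described above: allocate a 1-byte
-- MutableByteArray# a million times. With a literal size (and -O, so the
-- literal reaches the primop), GHC can inline the allocation; a size read
-- at runtime goes through stg_newByteArray# instead.
import Control.Monad (replicateM_)
import GHC.Exts
import GHC.IO (IO (..))
import System.CPUTime (getCPUTime)

allocBytes :: Int -> IO ()
allocBytes (I# n) = IO $ \s ->
  case newByteArray# n s of
    (# s1, _ #) -> (# s1, () #)
{-# INLINE allocBytes #-}

main :: IO ()
main = do
  t0 <- getCPUTime
  replicateM_ 1000000 (allocBytes 1) -- literal size: the inline path at -O
  t1 <- getCPUTime
  -- getCPUTime reports picoseconds; print microseconds for the whole loop
  putStrLn ("total: " ++ show ((t1 - t0) `div` 1000000) ++ " us")
```

Comparing this against a variant whose size comes from `getArgs` should show
the same kind of gap reported above, modulo hardware.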

-harendra


Re: Performance of small allocations via prim ops

2023-04-07 Thread Harendra Kumar
I am confused by this flag. It allows statically known array sizes of
<= n to be allocated from the current nursery block.
But looking at the code in allocateMightFail, as I interpret it, any size
array up to LARGE_OBJECT_THRESHOLD is anyway allocated from the current
nursery block. So why have this option? Why not fix this to
LARGE_OBJECT_THRESHOLD? Maybe I am missing something.

-harendra



Re: Performance of small allocations via prim ops

2023-04-07 Thread Harendra Kumar
On Fri, 7 Apr 2023 at 12:57, Simon Peyton Jones 
wrote:

> > We are emitting more efficient code when the size of the array is
> > smaller. And the threshold is governed by a compiler flag:
>
> It would be good if this was documented. Perhaps in the Haddock for
> `newByteArray#`? Or where?
>

The flag is documented in the GHC user guide, but the behavior would be
easier to discover if the Haddock for `newByteArray#` mentioned it.

-harendra


Re: Performance of small allocations via prim ops

2023-04-07 Thread Simon Peyton Jones
> We are emitting more efficient code when the size of the array is
> smaller. And the threshold is governed by a compiler flag:

It would be good if this was documented. Perhaps in the Haddock for
`newByteArray#`? Or where?

S


Re: Performance of small allocations via prim ops

2023-04-07 Thread Harendra Kumar
Little bit of grepping in the code gave me this:

emitPrimOp cfg primop =
  let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg)
  in case primop of
  NewByteArrayOp_Char -> \case
    [(CmmLit (CmmInt n w))]
      | asUnsigned w n <= max_inl_alloc_size  -- <--- see this line
      -> opIntoRegs $ \ [res] -> doNewByteArrayOp res (fromInteger n)
    _ -> PrimopCmmEmit_External

We are emitting more efficient code when the size of the array is
smaller. And the threshold is governed by a compiler flag:

  , make_ord_flag defGhcFlag "fmax-inline-alloc-size"
  (intSuffix (\n d -> d { maxInlineAllocSize = n }))

This means allocation of smaller arrays is extremely efficient, and we can
control the threshold using `-fmax-inline-alloc-size` (the default is 128).
That's a new thing I learnt today.

Given this new finding, my original question now applies only to the case
when the array size is bigger than this configurable threshold, which is a
little less motivating. And Ben says that the call is not expensive, so we
can leave it there.
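To make the threshold concrete, here is a hypothetical example (the helper
name is mine): both calls behave identically at runtime; with the default
`-fmax-inline-alloc-size=128` the first should compile to inline Cmm and the
second to a `stg_newByteArray#` call, which `-ddump-cmm` can confirm.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
-- Hypothetical illustration of -fmax-inline-alloc-size: the flag only
-- changes the generated code path, never the result. Sizes are in bytes.
import GHC.Exts
import GHC.IO (IO (..))

-- allocate n bytes and read the stored size back
allocAndSize :: Int -> IO Int
allocAndSize (I# n) = IO $ \s ->
  case newByteArray# n s of
    (# s1, arr #) ->
      case getSizeofMutableByteArray# arr s1 of
        (# s2, sz #) -> (# s2, I# sz #)
{-# INLINE allocAndSize #-}

main :: IO ()
main = do
  small <- allocAndSize 64  -- 64 <= 128: inline allocation (at -O)
  large <- allocAndSize 129 -- 129 > 128: out-of-line stg_newByteArray#
  print (small, large)      -- (64,129)
```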

-harendra



Re: Performance of small allocations via prim ops

2023-04-06 Thread Harendra Kumar
Ah, some other optimization seems to be kicking in here. When I increase
the size of the array beyond 128, I see a call to stg_newByteArray# being
emitted:

 {offset
   c1kb: // global
     if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd;
   c1kc: // global
     R1 = Main.main1_closure;
     call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
   c1kd: // global
     I64[Sp - 8] = c1k9;
     R1 = 129;
     Sp = Sp - 8;
     call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, upd: 8;
 }

-harendra



Re: Performance of small allocations via prim ops

2023-04-06 Thread Harendra Kumar
Thanks Ben and Carter.

I compiled the following to Cmm:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO
import GHC.Exts

data M = M (MutableByteArray# RealWorld)

main = do
  _ <- IO (\s -> case newByteArray# 1# s of
             (# s1, arr #) -> (# s1, M arr #))
  return ()

It produced the following Cmm:

 {offset
   c1k3: // global
     Hp = Hp + 24;
     if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
   c1k7: // global
     HpAlloc = 24;
     R1 = Main.main1_closure;
     call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
   c1k6: // global
     I64[Hp - 16] = stg_ARR_WORDS_info;
     I64[Hp - 8] = 1;
     R1 = GHC.Tuple.()_closure+1;
     call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
 }

It seems to be as good as it gets. There is absolutely no scope for
improvement in this.

-harendra



Re: Performance of small allocations via prim ops

2023-04-06 Thread Ben Gamari
Harendra Kumar  writes:

> I was looking at the RTS code for allocating small objects via prim ops
> e.g. newByteArray# . The code looks like:
>
> stg_newByteArrayzh ( W_ n )
> {
> MAYBE_GC_N(stg_newByteArrayzh, n);
>
> payload_words = ROUNDUP_BYTES_TO_WDS(n);
> words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
> ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>
> We are making a foreign call here (ccall). I am wondering how much overhead
> a ccall adds? I guess it may have to save and restore registers. Would it
> be better to do the fast path case of allocating small objects from the
> nursery using cmm code like in stg_gc_noregs?
>
GHC's operational model is designed in such a way that foreign calls are
fairly cheap (e.g. we don't need to switch stacks, which can be quite
costly). Judging by the assembler produced for newByteArray# in one
random x86-64 tree that I have lying around, it's only a couple of
data-movement instructions, an %eax clear, and a stack pop:

  36:   48 89 ce              mov    %rcx,%rsi
  39:   48 89 c7              mov    %rax,%rdi
  3c:   31 c0                 xor    %eax,%eax
  3e:   e8 00 00 00 00        call   43
  43:   48 83 c4 08           add    $0x8,%rsp

The data-movement operations in particular are quite cheap on most
microarchitectures where GHC would run, thanks to register renaming. I doubt
that this overhead would be noticeable in anything but a synthetic
benchmark. However, it never hurts to measure.

Cheers,

- Ben




Re: Performance of small allocations via prim ops

2023-04-06 Thread Carter Schonwald
That sounds like a worthy experiment!

I guess that would look like having an inline, macro'd-up path that checks
whether it can get the job done, falling back to the general code?

Last I checked, the overhead for this sort of C call was on the order of
10 nanoseconds or less, which seems very unlikely to be a
bottleneck, but do you have any natural or artificial benchmark programs
that would showcase this?

For this sort of code, extra branching for that optimization could easily
have a larger performance impact than the known function call on modern
hardware. (Though take my intuitions about these things with a grain of
salt.)



Performance of small allocations via prim ops

2023-04-04 Thread Harendra Kumar
I was looking at the RTS code for allocating small objects via prim ops
e.g. newByteArray# . The code looks like:

stg_newByteArrayzh ( W_ n )
{
MAYBE_GC_N(stg_newByteArrayzh, n);

payload_words = ROUNDUP_BYTES_TO_WDS(n);
words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);

We are making a foreign call here (ccall). I am wondering how much overhead
a ccall adds? I guess it may have to save and restore registers. Would it
be better to do the fast path case of allocating small objects from the
nursery using cmm code like in stg_gc_noregs?
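One crude, user-level way to get a feel for foreign-call cost (an
assumption-laden sketch: a Haskell `foreign import` is not the same
mechanism as the RTS-internal `ccall` in Cmm, and `labs` is just an
arbitrary cheap libc function):

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ForeignFunctionInterface #-}
-- Time a loop of unsafe calls into C and divide by the iteration count
-- for a ballpark per-call cost.
import Foreign.C.Types (CLong (..))
import System.CPUTime (getCPUTime)

foreign import ccall unsafe "stdlib.h labs"
  c_labs :: CLong -> CLong

-- the data dependency on acc keeps the calls sequential
loop :: Int -> CLong -> CLong
loop 0 !acc = acc
loop n !acc = loop (n - 1) (c_labs acc + 1)

main :: IO ()
main = do
  t0 <- getCPUTime
  let !r = loop 1000000 (-1)
  t1 <- getCPUTime
  print r -- 1000001
  -- getCPUTime is in picoseconds; 10^6 iterations, so divide by 10^9 for ns
  putStrLn (show ((t1 - t0) `div` (1000000 * 1000)) ++ " ns/call")
```

On typical x86-64 hardware this should print a handful of nanoseconds per
call, consistent with the ~10 ns estimate mentioned elsewhere in the thread.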

-harendra