Re: Performance of small allocations via prim ops
One complication is that currently GHC has no way to know the value of LARGE_OBJECT_THRESHOLD (which is a runtime system macro). Typically to handle this sort of thing we use utils/deriveConstants to generate a Haskell binding mirroring the value of the C declaration. However, as GHC becomes runtime-retargetable we may need to revisit this design. Since https://gitlab.haskell.org/ghc/ghc/-/commit/085983e63bfe6af23f8b85fbfcca8db4872d2f60 (2021-03) we don't do this. We only read constants from the header file provided by the RTS unit. Adding one more constant for LARGE_OBJECT_THRESHOLD shouldn't be an issue. Cheers Sylvain ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
Harendra Kumar writes: > I am confused by this flag. This flag allows us to allocate statically > known arrays sizes of <= n to be allocated from the current nursery block. > But looking at the code in allocateMightFail, as I interpret it, any size > array up to LARGE_OBJECT_THRESHOLD is anyway allocated from the current > nursery block. So why have this option? Why not fix this to > LARGE_OBJECT_THRESHOLD? Maybe I am missing something. > In principle we could do so. The motivation for making this a flag isn't immediately clear from the commit implementing this optimisation (1eece45692fb5d1a5f4ec60c1537f8068237e9c1). One complication is that currently GHC has no way to know the value of LARGE_OBJECT_THRESHOLD (which is a runtime system macro). Typically to handle this sort of thing we use utils/deriveConstants to generate a Haskell binding mirroring the value of the C declaration. However, as GHC becomes runtime-retargetable we may need to revisit this design. Cheers, - Ben signature.asc Description: PGP signature ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
Great /fast experimentation! I will admit I’m pleased that my dated intuition is still correct, but more importantly we have more current data! Thanks for the exploration and sharing what you found! On Fri, Apr 7, 2023 at 7:35 AM Harendra Kumar wrote: > > > On Fri, 7 Apr 2023 at 02:18, Carter Schonwald > wrote: > >> That sounds like a worthy experiment! >> >> I guess that would look like having an inline macro’d up path that >> checks if it can get the job done that falls back to the general code? >> >> Last I checked, the overhead for this sort of c call was on the order of >> 10nanoseconds or less which seems like it’d be very unlikely to be a >> bottleneck, but do you have any natural or artificial benchmark programs >> that would show case this? >> > > I converted my example code into a loop and ran it a million times with a > 1 byte array size (would be 8 bytes after alignment). So roughly 3 words > would be allocated per array, including the header and length. It took 5 ms > using the statically known size optimization which inlines the alloc > completely, and 10 ms using an unknown size (from program arg) which makes > a call to newByteArray# . That turns out to be of the order of 5ns more per > allocation. It does not sound like a big deal. > > -harendra > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
On Fri, 7 Apr 2023 at 02:18, Carter Schonwald wrote: > That sounds like a worthy experiment! > > I guess that would look like having an inline macro’d up path that checks > if it can get the job done that falls back to the general code? > > Last I checked, the overhead for this sort of c call was on the order of > 10nanoseconds or less which seems like it’d be very unlikely to be a > bottleneck, but do you have any natural or artificial benchmark programs > that would show case this? > I converted my example code into a loop and ran it a million times with a 1 byte array size (would be 8 bytes after alignment). So roughly 3 words would be allocated per array, including the header and length. It took 5 ms using the statically known size optimization which inlines the alloc completely, and 10 ms using an unknown size (from program arg) which makes a call to newByteArray# . That turns out to be of the order of 5ns more per allocation. It does not sound like a big deal. -harendra ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
I am confused by this flag. This flag allows us to allocate statically known arrays sizes of <= n to be allocated from the current nursery block. But looking at the code in allocateMightFail, as I interpret it, any size array up to LARGE_OBJECT_THRESHOLD is anyway allocated from the current nursery block. So why have this option? Why not fix this to LARGE_OBJECT_THRESHOLD? Maybe I am missing something. -harendra On Fri, 7 Apr 2023 at 15:45, Harendra Kumar wrote: > > > On Fri, 7 Apr 2023 at 12:57, Simon Peyton Jones < > simon.peytonjo...@gmail.com> wrote: > >> > We are emitting a more efficient code when the size of the array is >> smaller. And the threshold is governed by a compiler flag: >> >> It would be good if this was documented. Perhaps in the Haddock for >> `newByteArray#`? Or where? >> > > The flag is documented in the GHC user guide but the behavior would be > better discoverable if `newByteArray#` mentions it. > > -harendra > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
On Fri, 7 Apr 2023 at 12:57, Simon Peyton Jones wrote: > > We are emitting a more efficient code when the size of the array is > smaller. And the threshold is governed by a compiler flag: > > It would be good if this was documented. Perhaps in the Haddock for > `newByteArray#`? Or where? > The flag is documented in the GHC user guide but the behavior would be better discoverable if `newByteArray#` mentions it. -harendra ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
> We are emitting a more efficient code when the size of the array is smaller. And the threshold is governed by a compiler flag: It would be good if this was documented. Perhaps in the Haddock for `newByteArray#`? Or where? S On Fri, 7 Apr 2023 at 07:07, Harendra Kumar wrote: > Little bit of grepping in the code gave me this: > > emitPrimOp cfg primop = > let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg) > in case primop of > NewByteArrayOp_Char -> \case > [(CmmLit (CmmInt n w))] > | asUnsigned w n <= max_inl_alloc_size -- > <--- see this line > -> opIntoRegs $ \ [res] -> doNewByteArrayOp res (fromInteger n) > _ -> PrimopCmmEmit_External > > We are emitting a more efficient code when the size of the array is > smaller. And the threshold is governed by a compiler flag: > > , make_ord_flag defGhcFlag "fmax-inline-alloc-size" > (intSuffix (\n d -> d { maxInlineAllocSize = n })) > > This means allocation of smaller arrays is extremely efficient and we can > control it using `-fmax-inline-alloc-size`, the default is 128. That's a > new thing I learnt today. > > Given this new finding, my original question now applies only to the case > when the array size is bigger than this configurable threshold, which is a > little less motivating. And Ben says that the call is not expensive, so we > can leave it there. > > -harendra > > On Fri, 7 Apr 2023 at 11:08, Harendra Kumar > wrote: > >> Ah, some other optimization seems to be kicking in here. When I increase >> the size of the array to > 128 then I see a call to stg_newByteArray# being >> emitted: >> >> {offset >>c1kb: // global >>if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto >> c1kd; >>c1kc: // global >>R1 = Main.main1_closure; >>call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >>c1kd: // global >>I64[Sp - 8] = c1k9; >>R1 = 129; >>Sp = Sp - 8; >>call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, >> upd: 8; >> >> -harendra >> >> On Fri, 7 Apr 2023 at 10:49, Harendra Kumar >> wrote: >> >>> Thanks Ben and Carter. >>> >>> I compiled the following to Cmm: >>> >>> {-# LANGUAGE MagicHash #-} >>> {-# LANGUAGE UnboxedTuples #-} >>> >>> import GHC.IO >>> import GHC.Exts >>> >>> data M = M (MutableByteArray# RealWorld) >>> >>> main = do >>> _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M >>> arr #)) >>> return () >>> >>> It produced the following Cmm: >>> >>> {offset >>>c1k3: // global >>>Hp = Hp + 24; >>>if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; >>>c1k7: // global >>>HpAlloc = 24; >>>R1 = Main.main1_closure; >>>call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >>>c1k6: // global >>>I64[Hp - 16] = stg_ARR_WORDS_info; >>>I64[Hp - 8] = 1; >>>R1 = GHC.Tuple.()_closure+1; >>>call (P64[Sp])(R1) args: 8, res: 0, upd: 8; >>> } >>> >>> It seems to be as good as it gets. There is absolutely no scope for >>> improvement in this. >>> >>> -harendra >>> >>> On Fri, 7 Apr 2023 at 03:32, Ben Gamari wrote: >>> Harendra Kumar writes: > I was looking at the RTS code for allocating small objects via prim ops > e.g. newByteArray# . The code looks like: > > stg_newByteArrayzh ( W_ n ) > { > MAYBE_GC_N(stg_newByteArrayzh, n); > > payload_words = ROUNDUP_BYTES_TO_WDS(n); > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); > > We are making a foreign call here (ccall). I am wondering how much overhead > a ccall adds? I guess it may have to save and restore registers. Would it > be better to do the fast path case of allocating small objects from the > nursery using cmm code like in stg_gc_noregs? > GHC's operational model is designed in such a way that foreign calls are fairly cheap (e.g. we don't need to switch stacks, which can be quite costly). Judging by the assembler produced for newByteArray# in one random x86-64 tree that I have lying around, it's only a couple of data-movement instructions, an %eax clear, and a stack pop: 36: 48 89 cemov%rcx,%rsi 39: 48 89 c7mov%rax,%rdi 3c: 31 c0 xor%eax,%eax 3e: e8 00 00 00 00 call 43 43: 48 83 c4 08 add$0x8,%rsp The data movement operations in particular are quite cheap on most microarchitectures where GHC would run due to register renaming. I doubt that this overhead would be noticable in anything but a synthetic
Re: Performance of small allocations via prim ops
Little bit of grepping in the code gave me this: emitPrimOp cfg primop = let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg) in case primop of NewByteArrayOp_Char -> \case [(CmmLit (CmmInt n w))] | asUnsigned w n <= max_inl_alloc_size -- <--- see this line -> opIntoRegs $ \ [res] -> doNewByteArrayOp res (fromInteger n) _ -> PrimopCmmEmit_External We are emitting a more efficient code when the size of the array is smaller. And the threshold is governed by a compiler flag: , make_ord_flag defGhcFlag "fmax-inline-alloc-size" (intSuffix (\n d -> d { maxInlineAllocSize = n })) This means allocation of smaller arrays is extremely efficient and we can control it using `-fmax-inline-alloc-size`, the default is 128. That's a new thing I learnt today. Given this new finding, my original question now applies only to the case when the array size is bigger than this configurable threshold, which is a little less motivating. And Ben says that the call is not expensive, so we can leave it there. -harendra On Fri, 7 Apr 2023 at 11:08, Harendra Kumar wrote: > Ah, some other optimization seems to be kicking in here. When I increase > the size of the array to > 128 then I see a call to stg_newByteArray# being > emitted: > > {offset >c1kb: // global >if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto > c1kd; >c1kc: // global >R1 = Main.main1_closure; >call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >c1kd: // global >I64[Sp - 8] = c1k9; >R1 = 129; >Sp = Sp - 8; >call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, > upd: 8; > > -harendra > > On Fri, 7 Apr 2023 at 10:49, Harendra Kumar > wrote: > >> Thanks Ben and Carter. >> >> I compiled the following to Cmm: >> >> {-# LANGUAGE MagicHash #-} >> {-# LANGUAGE UnboxedTuples #-} >> >> import GHC.IO >> import GHC.Exts >> >> data M = M (MutableByteArray# RealWorld) >> >> main = do >> _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M >> arr #)) >> return () >> >> It produced the following Cmm: >> >> {offset >>c1k3: // global >>Hp = Hp + 24; >>if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; >>c1k7: // global >>HpAlloc = 24; >>R1 = Main.main1_closure; >>call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >>c1k6: // global >>I64[Hp - 16] = stg_ARR_WORDS_info; >>I64[Hp - 8] = 1; >>R1 = GHC.Tuple.()_closure+1; >>call (P64[Sp])(R1) args: 8, res: 0, upd: 8; >> } >> >> It seems to be as good as it gets. There is absolutely no scope for >> improvement in this. >> >> -harendra >> >> On Fri, 7 Apr 2023 at 03:32, Ben Gamari wrote: >> >>> Harendra Kumar writes: >>> >>> > I was looking at the RTS code for allocating small objects via prim ops >>> > e.g. newByteArray# . The code looks like: >>> > >>> > stg_newByteArrayzh ( W_ n ) >>> > { >>> > MAYBE_GC_N(stg_newByteArrayzh, n); >>> > >>> > payload_words = ROUNDUP_BYTES_TO_WDS(n); >>> > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; >>> > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); >>> > >>> > We are making a foreign call here (ccall). I am wondering how much >>> overhead >>> > a ccall adds? I guess it may have to save and restore registers. Would >>> it >>> > be better to do the fast path case of allocating small objects from the >>> > nursery using cmm code like in stg_gc_noregs? >>> > >>> GHC's operational model is designed in such a way that foreign calls are >>> fairly cheap (e.g. we don't need to switch stacks, which can be quite >>> costly). Judging by the assembler produced for newByteArray# in one >>> random x86-64 tree that I have lying around, it's only a couple of >>> data-movement instructions, an %eax clear, and a stack pop: >>> >>> 36: 48 89 cemov%rcx,%rsi >>> 39: 48 89 c7mov%rax,%rdi >>> 3c: 31 c0 xor%eax,%eax >>> 3e: e8 00 00 00 00 call 43 >>> >>> 43: 48 83 c4 08 add$0x8,%rsp >>> >>> The data movement operations in particular are quite cheap on most >>> microarchitectures where GHC would run due to register renaming. I doubt >>> that this overhead would be noticable in anything but a synthetic >>> benchmark. However, it never hurts to measure. >>> >>> Cheers, >>> >>> - Ben >>> >> ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
Ah, some other optimization seems to be kicking in here. When I increase the size of the array to > 128 then I see a call to stg_newByteArray# being emitted: {offset c1kb: // global if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd; c1kc: // global R1 = Main.main1_closure; call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; c1kd: // global I64[Sp - 8] = c1k9; R1 = 129; Sp = Sp - 8; call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, upd: 8; -harendra On Fri, 7 Apr 2023 at 10:49, Harendra Kumar wrote: > Thanks Ben and Carter. > > I compiled the following to Cmm: > > {-# LANGUAGE MagicHash #-} > {-# LANGUAGE UnboxedTuples #-} > > import GHC.IO > import GHC.Exts > > data M = M (MutableByteArray# RealWorld) > > main = do > _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M > arr #)) > return () > > It produced the following Cmm: > > {offset >c1k3: // global >Hp = Hp + 24; >if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; >c1k7: // global >HpAlloc = 24; >R1 = Main.main1_closure; >call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >c1k6: // global >I64[Hp - 16] = stg_ARR_WORDS_info; >I64[Hp - 8] = 1; >R1 = GHC.Tuple.()_closure+1; >call (P64[Sp])(R1) args: 8, res: 0, upd: 8; > } > > It seems to be as good as it gets. There is absolutely no scope for > improvement in this. > > -harendra > > On Fri, 7 Apr 2023 at 03:32, Ben Gamari wrote: > >> Harendra Kumar writes: >> >> > I was looking at the RTS code for allocating small objects via prim ops >> > e.g. newByteArray# . The code looks like: >> > >> > stg_newByteArrayzh ( W_ n ) >> > { >> > MAYBE_GC_N(stg_newByteArrayzh, n); >> > >> > payload_words = ROUNDUP_BYTES_TO_WDS(n); >> > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; >> > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); >> > >> > We are making a foreign call here (ccall). I am wondering how much >> overhead >> > a ccall adds? I guess it may have to save and restore registers. Would >> it >> > be better to do the fast path case of allocating small objects from the >> > nursery using cmm code like in stg_gc_noregs? >> > >> GHC's operational model is designed in such a way that foreign calls are >> fairly cheap (e.g. we don't need to switch stacks, which can be quite >> costly). Judging by the assembler produced for newByteArray# in one >> random x86-64 tree that I have lying around, it's only a couple of >> data-movement instructions, an %eax clear, and a stack pop: >> >> 36: 48 89 cemov%rcx,%rsi >> 39: 48 89 c7mov%rax,%rdi >> 3c: 31 c0 xor%eax,%eax >> 3e: e8 00 00 00 00 call 43 >> >> 43: 48 83 c4 08 add$0x8,%rsp >> >> The data movement operations in particular are quite cheap on most >> microarchitectures where GHC would run due to register renaming. I doubt >> that this overhead would be noticable in anything but a synthetic >> benchmark. However, it never hurts to measure. >> >> Cheers, >> >> - Ben >> > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
Thanks Ben and Carter. I compiled the following to Cmm: {-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-} import GHC.IO import GHC.Exts data M = M (MutableByteArray# RealWorld) main = do _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M arr #)) return () It produced the following Cmm: {offset c1k3: // global Hp = Hp + 24; if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; c1k7: // global HpAlloc = 24; R1 = Main.main1_closure; call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; c1k6: // global I64[Hp - 16] = stg_ARR_WORDS_info; I64[Hp - 8] = 1; R1 = GHC.Tuple.()_closure+1; call (P64[Sp])(R1) args: 8, res: 0, upd: 8; } It seems to be as good as it gets. There is absolutely no scope for improvement in this. -harendra On Fri, 7 Apr 2023 at 03:32, Ben Gamari wrote: > Harendra Kumar writes: > > > I was looking at the RTS code for allocating small objects via prim ops > > e.g. newByteArray# . The code looks like: > > > > stg_newByteArrayzh ( W_ n ) > > { > > MAYBE_GC_N(stg_newByteArrayzh, n); > > > > payload_words = ROUNDUP_BYTES_TO_WDS(n); > > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; > > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); > > > > We are making a foreign call here (ccall). I am wondering how much > overhead > > a ccall adds? I guess it may have to save and restore registers. Would it > > be better to do the fast path case of allocating small objects from the > > nursery using cmm code like in stg_gc_noregs? > > > GHC's operational model is designed in such a way that foreign calls are > fairly cheap (e.g. we don't need to switch stacks, which can be quite > costly). Judging by the assembler produced for newByteArray# in one > random x86-64 tree that I have lying around, it's only a couple of > data-movement instructions, an %eax clear, and a stack pop: > > 36: 48 89 cemov%rcx,%rsi > 39: 48 89 c7mov%rax,%rdi > 3c: 31 c0 xor%eax,%eax > 3e: e8 00 00 00 00 call 43 > 43: 48 83 c4 08 add$0x8,%rsp > > The data movement operations in particular are quite cheap on most > microarchitectures where GHC would run due to register renaming. I doubt > that this overhead would be noticable in anything but a synthetic > benchmark. However, it never hurts to measure. > > Cheers, > > - Ben > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
Harendra Kumar writes: > I was looking at the RTS code for allocating small objects via prim ops > e.g. newByteArray# . The code looks like: > > stg_newByteArrayzh ( W_ n ) > { > MAYBE_GC_N(stg_newByteArrayzh, n); > > payload_words = ROUNDUP_BYTES_TO_WDS(n); > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); > > We are making a foreign call here (ccall). I am wondering how much overhead > a ccall adds? I guess it may have to save and restore registers. Would it > be better to do the fast path case of allocating small objects from the > nursery using cmm code like in stg_gc_noregs? > GHC's operational model is designed in such a way that foreign calls are fairly cheap (e.g. we don't need to switch stacks, which can be quite costly). Judging by the assembler produced for newByteArray# in one random x86-64 tree that I have lying around, it's only a couple of data-movement instructions, an %eax clear, and a stack pop: 36: 48 89 cemov%rcx,%rsi 39: 48 89 c7mov%rax,%rdi 3c: 31 c0 xor%eax,%eax 3e: e8 00 00 00 00 call 43 43: 48 83 c4 08 add$0x8,%rsp The data movement operations in particular are quite cheap on most microarchitectures where GHC would run due to register renaming. I doubt that this overhead would be noticable in anything but a synthetic benchmark. However, it never hurts to measure. Cheers, - Ben signature.asc Description: PGP signature ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
That sounds like a worthy experiment! I guess that would look like having an inline macro’d up path that checks if it can get the job done that falls back to the general code? Last I checked, the overhead for this sort of c call was on the order of 10nanoseconds or less which seems like it’d be very unlikely to be a bottleneck, but do you have any natural or artificial benchmark programs that would show case this? For this sortah code, extra branching for that optimization could easily have a larger performance impact than the known function call on modern hardware. (Though take my intuitions about these things with a grain of salt. ) On Tue, Apr 4, 2023 at 9:50 PM Harendra Kumar wrote: > I was looking at the RTS code for allocating small objects via prim ops > e.g. newByteArray# . The code looks like: > > stg_newByteArrayzh ( W_ n ) > { > MAYBE_GC_N(stg_newByteArrayzh, n); > > payload_words = ROUNDUP_BYTES_TO_WDS(n); > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); > > We are making a foreign call here (ccall). I am wondering how much > overhead a ccall adds? I guess it may have to save and restore registers. > Would it be better to do the fast path case of allocating small objects > from the nursery using cmm code like in stg_gc_noregs? > > -harendra > ___ > ghc-devs mailing list > ghc-devs@haskell.org > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Performance of small allocations via prim ops
I was looking at the RTS code for allocating small objects via prim ops e.g. newByteArray# . The code looks like: stg_newByteArrayzh ( W_ n ) { MAYBE_GC_N(stg_newByteArrayzh, n); payload_words = ROUNDUP_BYTES_TO_WDS(n); words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); We are making a foreign call here (ccall). I am wondering how much overhead a ccall adds? I guess it may have to save and restore registers. Would it be better to do the fast path case of allocating small objects from the nursery using cmm code like in stg_gc_noregs? -harendra ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs