Re: Performance of small allocations via prim ops
Ah, some other optimization seems to be kicking in here. When I increase the size of the array to > 128 then I see a call to stg_newByteArray# being emitted: {offset c1kb: // global if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd; c1kc: // global R1 = Main.main1_closure; call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; c1kd: // global I64[Sp - 8] = c1k9; R1 = 129; Sp = Sp - 8; call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, upd: 8; -harendra On Fri, 7 Apr 2023 at 10:49, Harendra Kumar wrote: > Thanks Ben and Carter. > > I compiled the following to Cmm: > > {-# LANGUAGE MagicHash #-} > {-# LANGUAGE UnboxedTuples #-} > > import GHC.IO > import GHC.Exts > > data M = M (MutableByteArray# RealWorld) > > main = do > _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M > arr #)) > return () > > It produced the following Cmm: > > {offset >c1k3: // global >Hp = Hp + 24; >if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; >c1k7: // global >HpAlloc = 24; >R1 = Main.main1_closure; >call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >c1k6: // global >I64[Hp - 16] = stg_ARR_WORDS_info; >I64[Hp - 8] = 1; >R1 = GHC.Tuple.()_closure+1; >call (P64[Sp])(R1) args: 8, res: 0, upd: 8; > } > > It seems to be as good as it gets. There is absolutely no scope for > improvement in this. > > -harendra > > On Fri, 7 Apr 2023 at 03:32, Ben Gamari wrote: > >> Harendra Kumar writes: >> >> > I was looking at the RTS code for allocating small objects via prim ops >> > e.g. newByteArray# . The code looks like: >> > >> > stg_newByteArrayzh ( W_ n ) >> > { >> > MAYBE_GC_N(stg_newByteArrayzh, n); >> > >> > payload_words = ROUNDUP_BYTES_TO_WDS(n); >> > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; >> > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); >> > >> > We are making a foreign call here (ccall). I am wondering how much >> overhead >> > a ccall adds? I guess it may have to save and restore registers. Would >> it >> > be better to do the fast path case of allocating small objects from the >> > nursery using cmm code like in stg_gc_noregs? >> > >> GHC's operational model is designed in such a way that foreign calls are >> fairly cheap (e.g. we don't need to switch stacks, which can be quite >> costly). Judging by the assembler produced for newByteArray# in one >> random x86-64 tree that I have lying around, it's only a couple of >> data-movement instructions, an %eax clear, and a stack pop: >> >> 36: 48 89 cemov%rcx,%rsi >> 39: 48 89 c7mov%rax,%rdi >> 3c: 31 c0 xor%eax,%eax >> 3e: e8 00 00 00 00 call 43 >> >> 43: 48 83 c4 08 add$0x8,%rsp >> >> The data movement operations in particular are quite cheap on most >> microarchitectures where GHC would run due to register renaming. I doubt >> that this overhead would be noticable in anything but a synthetic >> benchmark. However, it never hurts to measure. >> >> Cheers, >> >> - Ben >> > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
Thanks Ben and Carter. I compiled the following to Cmm: {-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-} import GHC.IO import GHC.Exts data M = M (MutableByteArray# RealWorld) main = do _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M arr #)) return () It produced the following Cmm: {offset c1k3: // global Hp = Hp + 24; if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; c1k7: // global HpAlloc = 24; R1 = Main.main1_closure; call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; c1k6: // global I64[Hp - 16] = stg_ARR_WORDS_info; I64[Hp - 8] = 1; R1 = GHC.Tuple.()_closure+1; call (P64[Sp])(R1) args: 8, res: 0, upd: 8; } It seems to be as good as it gets. There is absolutely no scope for improvement in this. -harendra On Fri, 7 Apr 2023 at 03:32, Ben Gamari wrote: > Harendra Kumar writes: > > > I was looking at the RTS code for allocating small objects via prim ops > > e.g. newByteArray# . The code looks like: > > > > stg_newByteArrayzh ( W_ n ) > > { > > MAYBE_GC_N(stg_newByteArrayzh, n); > > > > payload_words = ROUNDUP_BYTES_TO_WDS(n); > > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; > > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); > > > > We are making a foreign call here (ccall). I am wondering how much > overhead > > a ccall adds? I guess it may have to save and restore registers. Would it > > be better to do the fast path case of allocating small objects from the > > nursery using cmm code like in stg_gc_noregs? > > > GHC's operational model is designed in such a way that foreign calls are > fairly cheap (e.g. we don't need to switch stacks, which can be quite > costly). Judging by the assembler produced for newByteArray# in one > random x86-64 tree that I have lying around, it's only a couple of > data-movement instructions, an %eax clear, and a stack pop: > > 36: 48 89 cemov%rcx,%rsi > 39: 48 89 c7mov%rax,%rdi > 3c: 31 c0 xor%eax,%eax > 3e: e8 00 00 00 00 call 43 > 43: 48 83 c4 08 add$0x8,%rsp > > The data movement operations in particular are quite cheap on most > microarchitectures where GHC would run due to register renaming. I doubt > that this overhead would be noticable in anything but a synthetic > benchmark. However, it never hurts to measure. > > Cheers, > > - Ben > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
Harendra Kumar writes: > I was looking at the RTS code for allocating small objects via prim ops > e.g. newByteArray# . The code looks like: > > stg_newByteArrayzh ( W_ n ) > { > MAYBE_GC_N(stg_newByteArrayzh, n); > > payload_words = ROUNDUP_BYTES_TO_WDS(n); > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); > > We are making a foreign call here (ccall). I am wondering how much overhead > a ccall adds? I guess it may have to save and restore registers. Would it > be better to do the fast path case of allocating small objects from the > nursery using cmm code like in stg_gc_noregs? > GHC's operational model is designed in such a way that foreign calls are fairly cheap (e.g. we don't need to switch stacks, which can be quite costly). Judging by the assembler produced for newByteArray# in one random x86-64 tree that I have lying around, it's only a couple of data-movement instructions, an %eax clear, and a stack pop: 36: 48 89 cemov%rcx,%rsi 39: 48 89 c7mov%rax,%rdi 3c: 31 c0 xor%eax,%eax 3e: e8 00 00 00 00 call 43 43: 48 83 c4 08 add$0x8,%rsp The data movement operations in particular are quite cheap on most microarchitectures where GHC would run due to register renaming. I doubt that this overhead would be noticable in anything but a synthetic benchmark. However, it never hurts to measure. Cheers, - Ben signature.asc Description: PGP signature ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Performance of small allocations via prim ops
That sounds like a worthy experiment! I guess that would look like having an inline macro’d up path that checks if it can get the job done that falls back to the general code? Last I checked, the overhead for this sort of c call was on the order of 10nanoseconds or less which seems like it’d be very unlikely to be a bottleneck, but do you have any natural or artificial benchmark programs that would show case this? For this sortah code, extra branching for that optimization could easily have a larger performance impact than the known function call on modern hardware. (Though take my intuitions about these things with a grain of salt. ) On Tue, Apr 4, 2023 at 9:50 PM Harendra Kumar wrote: > I was looking at the RTS code for allocating small objects via prim ops > e.g. newByteArray# . The code looks like: > > stg_newByteArrayzh ( W_ n ) > { > MAYBE_GC_N(stg_newByteArrayzh, n); > > payload_words = ROUNDUP_BYTES_TO_WDS(n); > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); > > We are making a foreign call here (ccall). I am wondering how much > overhead a ccall adds? I guess it may have to save and restore registers. > Would it be better to do the fast path case of allocating small objects > from the nursery using cmm code like in stg_gc_noregs? > > -harendra > ___ > ghc-devs mailing list > ghc-devs@haskell.org > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs > ___ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs