That sounds like a worthy experiment! I guess that would look like an inlined, macro'd-up fast path that checks whether it can get the job done on its own, and falls back to the general code otherwise?
Last I checked, the overhead for this sort of C call was on the order of 10 nanoseconds or less, which seems very unlikely to be a bottleneck, but do you have any natural or artificial benchmark programs that would showcase this? For this sort of code, the extra branching for that optimization could easily have a larger performance impact than the known function call on modern hardware. (Though take my intuitions about these things with a grain of salt.)

On Tue, Apr 4, 2023 at 9:50 PM Harendra Kumar <harendra.ku...@gmail.com> wrote:
> I was looking at the RTS code for allocating small objects via prim ops,
> e.g. newByteArray#. The code looks like:
>
> stg_newByteArrayzh ( W_ n )
> {
>     MAYBE_GC_N(stg_newByteArrayzh, n);
>
>     payload_words = ROUNDUP_BYTES_TO_WDS(n);
>     words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
>     ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>
> We are making a foreign call here (ccall). I am wondering how much
> overhead a ccall adds? I guess it may have to save and restore registers.
> Would it be better to do the fast-path case of allocating small objects
> from the nursery using Cmm code, like in stg_gc_noregs?
>
> -harendra
> _______________________________________________
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
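For what it's worth, an artificial benchmark along the lines being asked about could be as simple as the following sketch: a tight loop of tiny newByteArray# allocations, so that the stg_newByteArrayzh path (and hence the allocateMightFail ccall) dominates the runtime. The loop count and allocation size here are arbitrary choices of mine, not something from the thread; time it with +RTS -s or wrap it in criterion to compare before/after a fast-path change.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts (Int (I#), getSizeofMutableByteArray#, newByteArray#)
import GHC.IO (IO (..))

-- Allocate one small MutableByteArray# and return its size in bytes.
-- Each call goes through stg_newByteArrayzh, i.e. the ccall under discussion.
allocSmall :: Int -> IO Int
allocSmall (I# n) = IO $ \s0 ->
  case newByteArray# n s0 of
    (# s1, mba #) ->
      case getSizeofMutableByteArray# mba s1 of
        (# s2, sz #) -> (# s2, I# sz #)

main :: IO ()
main = do
  -- One million 16-byte allocations; the sum is only there to keep the
  -- allocations from being optimized away.
  let loop :: Int -> Int -> IO Int
      loop 0 acc = pure acc
      loop k acc = do
        sz <- allocSmall 16
        loop (k - 1) (acc + sz)
  total <- loop 1000000 0
  print total
```

Running this with +RTS -s should show the allocation volume clearly; the per-allocation cost can then be compared across RTS variants.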