> We are emitting a more efficient code when the size of the array is smaller. And the threshold is governed by a compiler flag:
It would be good if this was documented. Perhaps in the Haddock for `newByteArray#`? Or where? S On Fri, 7 Apr 2023 at 07:07, Harendra Kumar <harendra.ku...@gmail.com> wrote: > Little bit of grepping in the code gave me this: > > emitPrimOp cfg primop = > let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg) > in case primop of > NewByteArrayOp_Char -> \case > [(CmmLit (CmmInt n w))] > | asUnsigned w n <= max_inl_alloc_size -- > <------------------------------- see this line > -> opIntoRegs $ \ [res] -> doNewByteArrayOp res (fromInteger n) > _ -> PrimopCmmEmit_External > > We are emitting a more efficient code when the size of the array is > smaller. And the threshold is governed by a compiler flag: > > , make_ord_flag defGhcFlag "fmax-inline-alloc-size" > (intSuffix (\n d -> d { maxInlineAllocSize = n })) > > This means allocation of smaller arrays is extremely efficient and we can > control it using `-fmax-inline-alloc-size`, the default is 128. That's a > new thing I learnt today. > > Given this new finding, my original question now applies only to the case > when the array size is bigger than this configurable threshold, which is a > little less motivating. And Ben says that the call is not expensive, so we > can leave it there. > > -harendra > > On Fri, 7 Apr 2023 at 11:08, Harendra Kumar <harendra.ku...@gmail.com> > wrote: > >> Ah, some other optimization seems to be kicking in here. When I increase >> the size of the array to > 128 then I see a call to stg_newByteArray# being >> emitted: >> >> {offset >> c1kb: // global >> if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto >> c1kd; >> c1kc: // global >> R1 = Main.main1_closure; >> call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >> c1kd: // global >> I64[Sp - 8] = c1k9; >> R1 = 129; >> Sp = Sp - 8; >> call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, >> upd: 8; >> >> -harendra >> >> On Fri, 7 Apr 2023 at 10:49, Harendra Kumar <harendra.ku...@gmail.com> >> wrote: >> >>> Thanks Ben and Carter. >>> >>> I compiled the following to Cmm: >>> >>> {-# LANGUAGE MagicHash #-} >>> {-# LANGUAGE UnboxedTuples #-} >>> >>> import GHC.IO >>> import GHC.Exts >>> >>> data M = M (MutableByteArray# RealWorld) >>> >>> main = do >>> _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M >>> arr #)) >>> return () >>> >>> It produced the following Cmm: >>> >>> {offset >>> c1k3: // global >>> Hp = Hp + 24; >>> if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; >>> c1k7: // global >>> HpAlloc = 24; >>> R1 = Main.main1_closure; >>> call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; >>> c1k6: // global >>> I64[Hp - 16] = stg_ARR_WORDS_info; >>> I64[Hp - 8] = 1; >>> R1 = GHC.Tuple.()_closure+1; >>> call (P64[Sp])(R1) args: 8, res: 0, upd: 8; >>> } >>> >>> It seems to be as good as it gets. There is absolutely no scope for >>> improvement in this. >>> >>> -harendra >>> >>> On Fri, 7 Apr 2023 at 03:32, Ben Gamari <b...@smart-cactus.org> wrote: >>> >>>> Harendra Kumar <harendra.ku...@gmail.com> writes: >>>> >>>> > I was looking at the RTS code for allocating small objects via prim >>>> ops >>>> > e.g. newByteArray# . The code looks like: >>>> > >>>> > stg_newByteArrayzh ( W_ n ) >>>> > { >>>> > MAYBE_GC_N(stg_newByteArrayzh, n); >>>> > >>>> > payload_words = ROUNDUP_BYTES_TO_WDS(n); >>>> > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; >>>> > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words); >>>> > >>>> > We are making a foreign call here (ccall). I am wondering how much >>>> overhead >>>> > a ccall adds? I guess it may have to save and restore registers. >>>> Would it >>>> > be better to do the fast path case of allocating small objects from >>>> the >>>> > nursery using cmm code like in stg_gc_noregs? >>>> > >>>> GHC's operational model is designed in such a way that foreign calls are >>>> fairly cheap (e.g. we don't need to switch stacks, which can be quite >>>> costly). Judging by the assembler produced for newByteArray# in one >>>> random x86-64 tree that I have lying around, it's only a couple of >>>> data-movement instructions, an %eax clear, and a stack pop: >>>> >>>> 36: 48 89 ce mov %rcx,%rsi >>>> 39: 48 89 c7 mov %rax,%rdi >>>> 3c: 31 c0 xor %eax,%eax >>>> 3e: e8 00 00 00 00 call 43 >>>> <stg_newByteArrayzh+0x43> >>>> 43: 48 83 c4 08 add $0x8,%rsp >>>> >>>> The data movement operations in particular are quite cheap on most >>>> microarchitectures where GHC would run due to register renaming. I doubt >>>> that this overhead would be noticable in anything but a synthetic >>>> benchmark. However, it never hurts to measure. >>>> >>>> Cheers, >>>> >>>> - Ben >>>> >>> _______________________________________________ > ghc-devs mailing list > ghc-devs@haskell.org > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs >
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs