Definitely... I'm pretty much fully occupied for the next two weeks, but I should be able to dig it out then and organize/share it.

On Oct 1, 2013 3:50 PM, "Carter Schonwald" <carter.schonw...@gmail.com> wrote:
> awesome!
>
> please let us know when some of the info is available publicly, perhaps so
> other folks can help out with experimentation
>
>
> On Tue, Oct 1, 2013 at 4:30 PM, Andrew Farmer <afar...@ittc.ku.edu> wrote:
>
>> I did indeed implement dynamic nursery sizing and did some preliminary
>> benchmarking. The headline figure: a 15% speedup on the nofib/gc
>> benchmarks, though the variance was pretty large and there were some
>> slowdowns.
>>
>> My scheme was very simple... I kept track of the size and rough
>> collection time of the previous three collections and did a sort of
>> crude binary search to find a minimum in the search space. I did it this
>> way because it was simple and required constant time and memory to make
>> a decision. One of the conclusions, though, was that collection time is
>> a bad metric, because of the way the RTS re-uses blocks. As Simon
>> pointed out, tracking retention or some other metric would probably be
>> better, but I need to explore that. Another result: the default size is
>> almost always too small (at least for the nofib programs). CPUs come
>> with huge caches, and using the RTS flag -A to set the allocation area
>> to roughly the size of the L3 cache usually gave pretty decent speedups.
>>
>> I did this for a class project and had to put it down to focus on other
>> things, and I just haven't picked it back up. I still have a patch lying
>> around, and several pages of notes with ideas for improving both the
>> metric and the search. I'm hoping to pick it back up in a couple of
>> months, with an eye on a workshop paper and a real patch for 7.10.
>>
>>
>> On Tue, Oct 1, 2013 at 3:36 AM, Simon Marlow <marlo...@gmail.com> wrote:
>>
>>> It's typical for benchmarks that allocate a large data structure to
>>> spend a lot of time in the GC. The data gets copied twice - once in the
>>> young generation and then again when promoted to the old generation.
>>> You can make this kind of benchmark much faster just by using a bigger
>>> allocation area.
>>>
>>> There's nothing inherently costly about StgMutArrPtrs compared to other
>>> objects, except that they are variable size and therefore we can't
>>> unroll the copy loop, but I don't think that's a big effect. The actual
>>> copying is the major cost.
>>>
>>> The way to improve this kind of benchmark would be to add some
>>> heuristics for varying the nursery size based on, for example, the
>>> quantity of data retained. I think there's a lot of room for
>>> improvement here, but someone needs to do some careful benchmarking and
>>> experimentation. Andrew Farmer did some work on this and allegedly got
>>> good results, but we never saw the code (hint hint!).
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>> On 1 October 2013 06:43, Johan Tibell <johan.tib...@gmail.com> wrote:
>>>
>>>> The code for 'allocate' in rts/sm/Storage.c doesn't seem that
>>>> expensive: an extra branch compared to inline allocation, and the
>>>> allocation is done in the next nursery block (risking fragmentation?).
>>>>
>>>> -- Johan
>>>>
>>>> On Mon, Sep 30, 2013 at 9:50 PM, Johan Tibell <johan.tib...@gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > When I benchmark Data.HashMap.insert from unordered-containers
>>>> > (inserting the keys [0..10000]), the runtime is dominated by GC:
>>>> >
>>>> > $ cat Test.hs
>>>> > module Main where
>>>> >
>>>> > import Control.DeepSeq
>>>> > import Control.Exception
>>>> > import Control.Monad
>>>> > import qualified Data.HashMap.Strict as HM
>>>> > import Data.List (foldl')
>>>> >
>>>> > main = do
>>>> >     let ks = [0..10000] :: [Int]
>>>> >     evaluate (rnf ks)
>>>> >     forM_ ([0..1000] :: [Int]) $ \ x -> do
>>>> >         evaluate $ HM.null $ foldl' (\ m k -> HM.insert k x m) HM.empty ks
>>>> >
>>>> > $ perf record -g ./Test +RTS -s
>>>> >    6,187,678,112 bytes allocated in the heap
>>>> >    3,309,887,128 bytes copied during GC
>>>> >        1,299,200 bytes maximum residency (1002 sample(s))
>>>> >          118,816 bytes maximum slop
>>>> >                5 MB total memory in use (0 MB lost due to fragmentation)
>>>> >
>>>> >                                     Tot time (elapsed)  Avg pause  Max pause
>>>> >   Gen  0     11089 colls,     0 par    1.31s    1.30s     0.0001s    0.0005s
>>>> >   Gen  1      1002 colls,     0 par    0.49s    0.51s     0.0005s    0.0022s
>>>> >
>>>> >   INIT    time    0.00s  (  0.00s elapsed)
>>>> >   MUT     time    1.02s  (  1.03s elapsed)
>>>> >   GC      time    1.80s  (  1.80s elapsed)
>>>> >   EXIT    time    0.00s  (  0.00s elapsed)
>>>> >   Total   time    2.82s  (  2.84s elapsed)
>>>> >
>>>> >   %GC     time      63.7%  (63.5% elapsed)
>>>> >
>>>> >   Alloc rate    6,042,264,963 bytes per MUT second
>>>> >
>>>> >   Productivity  36.3% of total user, 36.1% of total elapsed
>>>> >
>>>> > $ perf report
>>>> >  41.46%  Test  Test          [.] evacuate
>>>> >  15.47%  Test  Test          [.] scavenge_block
>>>> >  11.04%  Test  Test          [.] s3cN_info
>>>> >   8.74%  Test  Test          [.] s3aZ_info
>>>> >   3.59%  Test  Test          [.] 0x7ff5
>>>> >   2.83%  Test  Test          [.] scavenge_mut_arr_ptrs
>>>> >   2.69%  Test  libc-2.15.so  [.] 0x147fd9
>>>> >   2.51%  Test  Test          [.] allocate
>>>> >   2.00%  Test  Test          [.] s3oo_info
>>>> >   0.91%  Test  Test          [.] todo_block_full
>>>> >   0.87%  Test  Test          [.] hs_popcnt64
>>>> >   0.80%  Test  Test          [.] s3en_info
>>>> >   0.62%  Test  Test          [.] s3el_info
>>>> >
>>>> > Is GC:ing StgMutArrPtrs and StgArrPtrs, which I create a lot of, more
>>>> > expensive than GC:ing normal heap objects (i.e. for standard data
>>>> > types)?
>>>> >
>>>> > -- Johan
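
For anyone who wants to experiment along the lines Andrew describes, here is a minimal, purely illustrative sketch of a "remember the last three collections and search for a minimum" policy, written in Haskell for readability. Every name in it (NurserySearch, Sample, suggestSize, the bounds) is hypothetical and made up for this note; it is not Andrew's unpublished patch, and a real implementation would live in the RTS's C code and, as discussed above, would probably key on bytes retained or copied rather than wall-clock GC time.

module NurserySearch where

-- One observation: the nursery size used for a collection and the GC
-- cost it produced (time here, though bytes copied may be a better metric).
data Sample = Sample
  { sampleSize :: !Int     -- nursery size in bytes
  , sampleCost :: !Double  -- observed GC cost for that collection
  } deriving Show

-- Given the three most recent samples, suggest the next nursery size to
-- try: step from the cheapest size seen away from the most expensive one,
-- halving the step as the samples converge, and clamp to sane bounds.
-- This is the "crude binary search" flavour of policy: constant space,
-- constant time per decision.
suggestSize :: (Int, Int)                  -- (minimum, maximum) nursery size
            -> (Sample, Sample, Sample)    -- last three collections
            -> Int
suggestSize (lo, hi) (s1, s2, s3) = max lo (min hi next)
  where
    samples = [s1, s2, s3]
    best    = foldr1 (\a b -> if sampleCost a <= sampleCost b then a else b) samples
    worst   = foldr1 (\a b -> if sampleCost a >= sampleCost b then a else b) samples
    step    = max 1 (abs (sampleSize worst - sampleSize best) `div` 2)
    next
      | sampleSize best >= sampleSize worst = sampleSize best + step
      | otherwise                           = sampleSize best - step

For example, with samples of (1 MB, 0.9 s), (2 MB, 0.7 s) and (4 MB, 0.8 s) and bounds of 512k to 64 MB, suggestSize proposes roughly 2.5 MB next, stepping past the cheapest size seen and away from the expensive 1 MB point.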
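To reproduce the -A effect Andrew and Simon mention without touching the RTS, the allocation area can simply be sized by hand; the default nursery in the GHC 7.x series was only 512k. The cache size below (8 MB) is just a placeholder for whatever the target machine's L3 cache happens to be, using Johan's Test.hs from the quoted message:

$ ghc -O2 -rtsopts Test.hs
$ ./Test +RTS -A8m -s    # allocation area roughly the size of the L3 cache

The trade-off is memory use and cache behaviour: a bigger nursery means fewer, longer minor collections, and how much that helps depends on how much of each nursery survives, which is exactly the retention metric discussed above.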