Perf: Table.del(key) is taking 85% of my code's time
I've got some networking code running now (an implementation of a protocol based on WebSockets), and I have a little client and server that send each other messages as fast as they can. Now I get to profile it! (With `-d:release` of course.) Disappointingly, **about 85% of the total time is being spent in a single call to `del(table, key)`**, where the Table's keys are `uint64` and the values are a `ref object` type. Here's a snippet of what the macOS `sample` tool is showing me:

```
1614 pendingResponse__9bNzW9abL3s854sblP5selTw_2 (in client) + 434 [0x10d342b72]
  1096 del__F9aUzT5FfSHbDaq0vvIxNng (in client) + 1935,1769,... [0x10d2e292f,0x10d2e2889,...]
  518 del__F9aUzT5FfSHbDaq0vvIxNng (in client) + 1419 [0x10d2e272b]
    277 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 48,23,... [0x10d2e2cc0,0x10d2e2ca7,...]
    180 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 57 [0x10d2e2cc9]
    180 nimFrame (in client) + 1,15,... [0x10d2e04d1,0x10d2e04df,...]
    61 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 102 [0x10d2e2cf6]
    61 popFrame (in client) + 1,14,... [0x10d2e0651,0x10d2e065e,...]
```

_(Indentation shows the call nesting. The numbers on the left are the number of times the given stack frame was observed; there were 1886 samples total over 2 seconds.)_

I haven't dug into the Table source code yet, but it looks like there may be something really suboptimal there. The name mangling is weird and very random-looking: are those base64 digests being appended to function names? **Is there a tool to unmangle names** so I can find out exactly what a specific function is?

PS: From my past experience with Go, I was expecting GC to be the culprit; there are a few stack frames that look like GC, but they're way down low, with only a handful of samples. Nice! Is Nim using ARC by default now?
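For context, the hot path boils down to roughly this pattern. This is a minimal sketch, not my actual code; the `Response` type and its field are made up stand-ins for the real `ref object` value type:

```nim
import tables

type Response = ref object  # stand-in for the real ref object value type
  body: string

var pending: Table[uint64, Response]

# Responses get registered by request ID, then deleted once they complete.
for id in 0'u64 ..< 1000'u64:
  pending[id] = Response(body: "reply")
for id in 0'u64 ..< 1000'u64:
  pending.del id

doAssert pending.len == 0
```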
Re: Perf: Table.del(key) is taking 85% of my code's time
Sorry, false alarm! I was misled by the `nim` command's help text:

```
--opt:none|speed|size     optimize not at all or for speed|size
                          Note: use -d:release for a release build!
```

**I took this to mean that `-d:release` was *sufficient* to get a fully optimized release build. Nope!** Once I started to look at the x86 disassembly, it was painfully non-optimal. 🤮 I changed the flags to `-d:release --opt:size --stackTrace:off --lineTrace:off --assertions:off --checks:off`, and everything got about 20 times faster... Now the code is spending almost all its time in `kevent` waiting for I/O, as it should be.

**Question:** Is this the optimal set of flags for a release build? Is there an easy way to toggle between debug and release builds without having to edit a bunch of lines in my `nim.cfg` file?
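(One thing I noticed in the docs: the config file supports conditional sections, so maybe something like this untested sketch could group the release-only flags? I haven't verified where the `release` define itself has to come from.)

```
@if release:
  --opt:size
  --stackTrace:off
  --lineTrace:off
  --assertions:off
  --checks:off
@end
```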
Re: Perf: Table.del(key) is taking 85% of my code's time
> and everything got about 20 times faster.

We have never seen such behaviour before, so it would be great if you could investigate it in more detail. Maybe a strange bug? Did you use gcc 10 or clang, or maybe the Microsoft or Intel compiler as backend? `-d:release` is fine; `-d:danger` can be a bit faster, as it turns off all runtime checks. Both use -O3 for gcc by default. But specifying `--opt:size` at the same time may not be a good idea: do you use -O3 and -Os for gcc at the same time? Most people do not.

Try `--passC:-flto` for LTO; that gives you inlining across all procs. I generally enable `-march=native` as well. And you may try PGO; there was a post about it recently on the forum.
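As a sketch, a full invocation with those suggestions might look like this (the module name is a placeholder, and depending on your C toolchain, the LTO flag may also need to be passed at link time via `--passL`):

```
nim c -d:danger --passC:-flto --passC:-march=native main.nim
```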
Re: Perf: Table.del(key) is taking 85% of my code's time
-d:danger -d:lto
Re: Perf: Table.del(key) is taking 85% of my code's time
I also experienced really bad Table performance, which was related to really bad hash generation. Print out `hash(key)` and see if your hashes are poorly distributed.
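For instance, a quick sketch like this (the key values here are arbitrary):

```nim
import hashes

# Print the hash of a few representative keys. If the hashes cluster
# (e.g. sequential keys produce sequential hashes), an open-addressing
# table can degrade badly on insert/delete patterns.
for key in [1'u64, 2'u64, 3'u64, 1_000_000'u64]:
  echo key, " -> ", hash(key)
```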
Re: Perf: Table.del(key) is taking 85% of my code's time
`--opt:size` can make stuff slower; it trades speed for smaller file size.
Re: Perf: Table.del(key) is taking 85% of my code's time
I looked at the Compiler User Guide, and it also implies that `-d:release` is sufficient to turn on compiler optimizations. As do your comments above. Weird... Here's my environment:

* MacBook Pro, macOS 10.15.5
* Xcode 11.5; `clang --version` reports "Apple clang version 11.0.3 (clang-1103.0.32.62)"
* I used to have Nim 1.2 installed from HomeBrew; this morning I uninstalled that, installed choosenim, and installed 1.2.2.

I'm building a Nimble package that I created a few days ago, which imports asynchttpserver and the 'news' WebSocket library. I'm using `nimble build` to compile. My `nim.cfg` file contains:

```
--path:src
--outdir:bin
--hints:off
```

I'll use binary size as a rough indicator of optimization, for now.

* Default (debug): binary size is 752K
* With `-d:release`: 701K
* With `-d:release --opt:speed`: 608K
* With `-d:release --opt:size`: 528K
* With `-d:release --opt:size --stackTrace:off --lineTrace:off --assertions:off --checks:off`: 273K
* With `-d:danger`: 752K
* With `-d:danger --opt:speed`: 647K
* With `-d:danger --opt:size`: 570K
* With `-d:danger --opt:size --stackTrace:off --lineTrace:off --assertions:off --checks:off`: 307K

Conclusions: It definitely appears `-d:release` is not turning on optimizations. And it's weird that `-d:danger` results in a larger binary; it should be the other way around, since it disables generating runtime checks, right?
Re: Perf: Table.del(key) is taking 85% of my code's time
> But specifying --opt:size at the same time may be not that good, do you use -O3 and -Os for gcc at the same time? Most people do not.

They're mutually exclusive, so the compiler's going to pick one if both are given. It looks as though -Os wins, since I do see a large reduction in code size. I'm of the belief that -Os is the best setting in most cases. (That was definitely the rule when I worked at Apple in the '00s.) Smaller code is more cache-friendly, launches faster, and contributes less to memory issues. If necessary, you can use Clang or GCC function attributes to force individual functions to optimize for speed.
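As an aside, Nim exposes similar per-section control via its `optimization` pragma; a sketch (the proc body here is a hypothetical hot path, not real code from my project):

```nim
# Force speed optimization for one hot proc even in an -Os / --opt:size build.
{.push optimization: speed.}
proc hotSum(xs: seq[int]): int =
  for x in xs:
    result += x
{.pop.}

doAssert hotSum(@[1, 2, 3]) == 6
```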
Re: Perf: Table.del(key) is taking 85% of my code's time
> use binary size as a rough indicator of optimization

That doesn't make much sense. For gcc, -O3 generally gives large but fast executables: large because gcc unrolls loops, applies SIMD instructions, and much more. Have you tested `-d:danger`? That option should really give you the fastest code and turn off all checks. If not, then there is something wrong. If `-d:danger` is fine, then you may try without danger and turn off checks one by one to find out which check makes it slow. Or create a minimal example for Table that we can investigate.
Re: Perf: Table.del(key) is taking 85% of my code's time
@snej

* what is `nim -v`? (I'm assuming it's 1.2.2, but please double-check)
* please post a minimum reproducible example, otherwise it's hard to do anything about your problem
Re: Perf: Table.del(key) is taking 85% of my code's time
Are you putting -d:release in the config file by any chance? If you do, that won't work; you must specify it in the call to the compiler / `nimble build`.
Re: Perf: Table.del(key) is taking 85% of my code's time
Just wanted to add to what @treeform mentioned: key distribution matters a lot when deleting from tables. Check out the timings of sequential keys vs. a more distributed table access.

```nim
import tables, times, random, sugar

template bench(code: untyped) =
  let t1 = cpuTime()
  code
  let t2 = cpuTime()
  echo "Line ", instantiationInfo().line, " took ", t2 - t1

var t: Table[int, int]
const tests = 100_000

bench:
  for i in 0 ..< tests:
    t.add i, 1

bench:
  for i in 0 ..< tests:
    t.del i

let randomKeys = collect(newSeqOfCap(tests)):
  for i in 0 ..< tests:
    rand int.high

bench:
  for i in 0 ..< tests:
    t[randomKeys[i]] = 1

bench:
  for i in 0 ..< tests:
    t.del randomKeys[i]
```

I get the following output with `-d:danger -d:release`:

* Line 12 took 0.003
* Line 16 took 7.209
* Line 24 took 0.00178
* Line 28 took 0.003114

In this example, deleting with sequential keys is about 2400 times slower than with keys that have a more random distribution.
Re: Perf: Table.del(key) is taking 85% of my code's time
> key distribution matters a lot

That is true, but it is obviously not the problem for Mr. snej. He gets good performance when he disables all checks one by one, but not with -d:danger. So his configuration is somewhat broken. And note, `-d:danger` is enough; there is no need for `-d:danger -d:release`. We should not confuse people too much. Same for `-d:release --opt:speed`: `-d:release` already includes `--opt:speed`.
Re: Perf: Table.del(key) is taking 85% of my code's time
> Are you putting -d:release in the config file by any chance? If you do - that wouldn't work, you must specify it in the call to the compiler/nimble build

That's it! I put it in `nim.cfg` like all the other compile options. If instead I run `nimble build -d:release`, I get an optimized build. The help text for `nimble` doesn't describe the options to `build`; it just says `[opts, ...] [bin]`. And I already knew I could put `nim c` option flags in the config file, so it seemed obvious to put `-d:release` there. Help me understand the build process better: why is it different putting it on the command line vs. in the config file?
Re: Perf: Table.del(key) is taking 85% of my code's time
That's not a nimble bug; see [https://github.com/nim-lang/Nim/issues/14272](https://github.com/nim-lang/Nim/issues/14272)
Re: Perf: Table.del(key) is taking 85% of my code's time
By "main config" I mean [https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg](https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg)