> use binary size as a rough indicator of optimization

That doesn't make much sense. With GCC, -O3 generally gives large but fast executables: large because GCC unrolls loops, applies SIMD instructions, and much more.
Have you tested -d:danger? That option should really give you the fastest code and turn off all checks. If it doesn't, then something is wrong. If -d:danger is fine, you can try compiling without danger and turning off individual checks to find out which check makes it slow. Or create a minimal example for the table that we can investigate.
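For reference, bisecting the checks could look roughly like this (flag names are from the Nim compiler; the file name is just a placeholder):

```shell
# Fastest: release mode with all runtime checks disabled
nim c -d:danger tables_bench.nim

# Release mode, disabling one class of checks at a time
# to see which one costs the speed:
nim c -d:release --boundChecks:off tables_bench.nim
nim c -d:release --overflowChecks:off tables_bench.nim
nim c -d:release --rangeChecks:off tables_bench.nim
nim c -d:release --fieldChecks:off tables_bench.nim
```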