Right, when doing 64-bit operations on an 8-bit MCU with limited registers, a hand-written assembler routine that you call as needed will beat anything gcc spits out, at least in code size per call. I also had a lot of trouble getting gcc to deal with the rl78's limited register set and addressing modes; compiling libgcc/libstdc++ is a torture test for register allocation and reload on small MCUs. That forced me to be conservative and add the virtual ISA that gcc can work with.
So I'm OK with your approach, and if you come up with something even better, I'm OK with that too :-) (pending reviews, of course; that's next ;-)