Re: ld.so speedup (part 2)
On Tue, 7 May 2019, Jeremie Courreges-Anglas wrote:
> On Sat, Apr 27 2019, Nathanael Rensen wrote:
> > The diff below speeds up ld.so library initialisation where the
> > dependency tree is broad and deep, such as samba's smbd which links
> > over 100 libraries.
...
> As I told mpi@ earlier today, I think your changes are correct as is,
> and are good to be committed.  So this counts as an ok jca@.  But I'd
> expect other developers to chime in soon, maybe they'll spot something
> that I didn't.

drahn@ and I pulled on our ld.so waders and agreed it's good, so I've
committed it with some tweaking to the #defines to make them
self-explanatory and have contiguous bit assignments.

Thank you for identifying this badly inefficient algorithm and spotting
how easy it was to fix!

Philip Guenther
Re: ld.so speedup (part 2)
On Sat, Apr 27 2019, Nathanael Rensen wrote:
> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd which links over 100 libraries.
>
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
>
> The timings below are for /usr/local/sbin/smbd --version:
>
> Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system

First off, thanks a lot for solving this long outstanding issue.  The use
of ld --as-needed hides the problem, but it looks like ld.lld isn't as
good as ld.bfd at eliminating extra inter-library references.

As I told mpi@ earlier today, I think your changes are correct as is,
and are good to be committed.  So this counts as an ok jca@.  But I'd
expect other developers to chime in soon, maybe they'll spot something
that I didn't.

--
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
Re: ld.so speedup (part 2)
On Sun, 5 May 2019 at 06:26, Martin Pieuchot wrote:
> On 27/04/19(Sat) 21:55, Nathanael Rensen wrote:
> > The diff below speeds up ld.so library initialisation where the dependency
> > tree is broad and deep, such as samba's smbd which links over 100 libraries.
> >
> > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> >
> > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > that speeds up library loading.
> >
> > The timings below are for /usr/local/sbin/smbd --version:
> >
> > Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> > Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> > Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> > Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system
> >
> > Note that these timings are for a build of a recent samba master tree
> > (linked with kerberos) which is probably slower than the OpenBSD port.
>
> Nice numbers.  Could you explain in words what your diff is doing?  Why
> does splitting the flag help?  Is it because some ctors/initarray are
> being initialized multiple times currently?

No, the STAT_INIT_DONE flag prevents that.

> Or is it just to prevent some traversal?

Yes.

> In that case does that mean the `STAT_VISITED' flag is removed too
> early?

Yes, STAT_VISITED is removed too early.  The visited flag is set on a
node while traversing the child nodes of that node and then removed.  It
serves to protect against circular dependencies, but it does not prevent
repeatedly traversing a node that appears on separate branches.  The
entire tree must be traversed twice: first to initialise the
DF_1_INITFIRST libraries, and then to initialise the others.  This is
presumably why this diff contributes roughly twice as much speedup as
the part 1 diff.
To be effective in avoiding repeated traversals, the visited flag must
persist throughout an entire tree traversal, but it must either be
cleared between the first and second traversals or a different flag used
for the second traversal.  My approach was to add a second visited flag
and make them both persistent.

My rationale for why I believe the flags may be persisted is as follows.
dlopen() calls _dl_call_init() with the newly loaded object, and neither
the newly loaded object nor any newly loaded children of that object
will have either visited flag set.  Already loaded children will have
those flags set, but they won't have gained any new children as a result
of the dlopen().  If this reasoning is wrong then the diff is wrong and
could lead to uninitialised libraries (and an ld.so regress test should
probably be created to catch that situation).

It occurs to me as I'm writing this that perhaps it's possible to avoid
a tree traversal entirely by walking the linearised grpsym_list in
reverse and relying only on the STAT_INIT_DONE flag.

	/*
	 * grpsym_list is an ordered list of all child libs of the
	 * _dl_loading_object with no dups. The order is equivalent
	 * to a breadth-first traversal of the child list without dups.
	 */

I don't think it is a true breadth-first traversal, not in the way I
understand breadth-first, but it does ensure that parent nodes appear
before child nodes.  So in reverse, child nodes will appear before
parent nodes.  While this is not the same as a depth-first traversal it
may be OK.  There may be some specific requirements of DF_1_INITFIRST
that need to be taken into account.
Nathanael

> > Index: libexec/ld.so/loader.c
> > ===
> > RCS file: /cvs/src/libexec/ld.so/loader.c,v
> > retrieving revision 1.177
> > diff -u -p -p -u -r1.177 loader.c
> > --- libexec/ld.so/loader.c    3 Dec 2018 05:29:56 -    1.177
> > +++ libexec/ld.so/loader.c    27 Apr 2019 13:24:02 -
> > @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
> >  {
> >  	struct dep_node *n;
> >
> > -	object->status |= STAT_VISITED;
> > +	int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> > +
> > +	object->status |= visited_flag;
> >
> >  	TAILQ_FOREACH(n, &object->child_list, next_sib) {
> > -		if (n->data->status & STAT_VISITED)
> > +		if (n->data->status & visited_flag)
> >  			continue;
> >  		_dl_call_init_recurse(n->data, initfirst);
> >  	}
> > -
> > -	object->status &= ~STAT_VISITED;
> >
> >  	if (object->status & STAT_INIT_DONE)
> >  		return;
> > Index: libexec/ld.so/resolve.h
> > ===
> > RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> > retrieving revision 1.90
> > diff -u -p -p -u -r1.90 resolve.h
> > --- libexec/ld.so/resolve.h    21 Apr 2019 04:11:42 -    1.90
> > +++ libexec/ld.so/resolve.h    27 Apr 2019 13:24:02 -
> > @@ -125,8 +125,9 @@ struct elf_object {
> > #defin
Re: ld.so speedup (part 2)
On 27/04/19(Sat) 21:55, Nathanael Rensen wrote:
> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd which links over 100 libraries.
>
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
>
> The timings below are for /usr/local/sbin/smbd --version:
>
> Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system
>
> Note that these timings are for a build of a recent samba master tree
> (linked with kerberos) which is probably slower than the OpenBSD port.

Nice numbers.  Could you explain in words what your diff is doing?  Why
does splitting the flag help?  Is it because some ctors/initarray are
being initialized multiple times currently?  Or is it just to prevent
some traversal?  In that case does that mean the `STAT_VISITED' flag is
removed too early?

> Index: libexec/ld.so/loader.c
> ===
> RCS file: /cvs/src/libexec/ld.so/loader.c,v
> retrieving revision 1.177
> diff -u -p -p -u -r1.177 loader.c
> --- libexec/ld.so/loader.c    3 Dec 2018 05:29:56 -    1.177
> +++ libexec/ld.so/loader.c    27 Apr 2019 13:24:02 -
> @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
>  {
>  	struct dep_node *n;
>
> -	object->status |= STAT_VISITED;
> +	int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> +
> +	object->status |= visited_flag;
>
>  	TAILQ_FOREACH(n, &object->child_list, next_sib) {
> -		if (n->data->status & STAT_VISITED)
> +		if (n->data->status & visited_flag)
>  			continue;
>  		_dl_call_init_recurse(n->data, initfirst);
>  	}
> -
> -	object->status &= ~STAT_VISITED;
>
>  	if (object->status & STAT_INIT_DONE)
>  		return;
> Index: libexec/ld.so/resolve.h
> ===
> RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> retrieving revision 1.90
> diff -u -p -p -u -r1.90 resolve.h
> --- libexec/ld.so/resolve.h    21 Apr 2019 04:11:42 -    1.90
> +++ libexec/ld.so/resolve.h    27 Apr 2019 13:24:02 -
> @@ -125,8 +125,9 @@ struct elf_object {
>  #define	STAT_FINI_READY	0x10
>  #define	STAT_UNLOADED	0x20
>  #define	STAT_NODELETE	0x40
> -#define	STAT_VISITED	0x80
> +#define	STAT_VISITED_1	0x80
>  #define	STAT_GNU_HASH	0x100
> +#define	STAT_VISITED_2	0x200
>
>  	Elf_Phdr	*phdrp;
>  	int		phdrc;
>
Re: ld.so speedup (part 2)
On Mon, Apr 29 2019, Stuart Henderson wrote:
> On 2019/04/28 09:45, Brian Callahan wrote:
> > On 4/28/19 6:01 AM, Matthieu Herrb wrote:
> > > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > > The diff below speeds up ld.so library initialisation where the
> > > > > > dependency tree is broad and deep, such as samba's smbd which
> > > > > > links over 100 libraries.
> > > >
> > > > Past experience with ld.so changes suggests it would be good to have
> > > > test reports from multiple arches, *especially* hppa.
> > >
> > > The regress tests seem to pass here on hppa.
> >
> > Pass here too on hppa and macppc and armv7.
> >
> > ~Brian
>
> Regress is clean for me on i386 and I am using it on my current ports bulk
> build there (halfway done, no issues seen yet).

Using this in current ports bulk on sparc64, no fallout.

> Regress is also clean on arm64.

and on sparc64.

--
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
Re: ld.so speedup (part 2)
On 2019/04/29 09:47, Chris Cappuccio wrote:
> Stuart Henderson [s...@spacehopper.org] wrote:
> >
> > This doesn't match my experience:
> >
> > $ time sudo rcctl start samba
> > smbd(ok)
> > nmbd(ok)
> >     0m00.81s real     0m00.31s user     0m00.31s system
>
> He was linking Samba with Kerberos libs too.

OP was, but I don't think Ian was.  That is with the ld.so diffs of
course.  Startup takes getting on for a minute for me without them.
Re: ld.so speedup (part 2)
Stuart Henderson [s...@spacehopper.org] wrote:
>
> This doesn't match my experience:
>
> $ time sudo rcctl start samba
> smbd(ok)
> nmbd(ok)
>     0m00.81s real     0m00.31s user     0m00.31s system

He was linking Samba with Kerberos libs too.
Re: ld.so speedup (part 2)
On 2019/04/28 09:45, Brian Callahan wrote:
> On 4/28/19 6:01 AM, Matthieu Herrb wrote:
> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > The diff below speeds up ld.so library initialisation where the
> > > > > dependency tree is broad and deep, such as samba's smbd which
> > > > > links over 100 libraries.
> > >
> > > Past experience with ld.so changes suggests it would be good to have
> > > test reports from multiple arches, *especially* hppa.
> >
> > The regress tests seem to pass here on hppa.
>
> Pass here too on hppa and macppc and armv7.
>
> ~Brian

Regress is clean for me on i386 and I am using it on my current ports bulk
build there (halfway done, no issues seen yet).

Regress is also clean on arm64.
Re: ld.so speedup (part 2)
On 4/28/19 6:01 AM, Matthieu Herrb wrote:
> On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > The diff below speeds up ld.so library initialisation where the
> > > > > dependency tree is broad and deep, such as samba's smbd which
> > > > > links over 100 libraries.
> >
> > Past experience with ld.so changes suggests it would be good to have
> > test reports from multiple arches, *especially* hppa.
>
> The regress tests seem to pass here on hppa.

Pass here too on hppa and macppc and armv7.

~Brian
Re: ld.so speedup (part 2)
On Sun, 28 Apr 2019 13:04:22 +0200 Robert Nagy wrote:
> On 28/04/19 12:01 +0200, Matthieu Herrb wrote:
> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > > The diff below speeds up ld.so library initialisation where the
> > > > > > dependency tree is broad and deep, such as samba's smbd which
> > > > > > links over 100 libraries.
> > >
> > > Past experience with ld.so changes suggests it would be good to have
> > > test reports from multiple arches, *especially* hppa.
> >
> > The regress tests seem to pass here on hppa.

It seems good on macppc as well, here is the log [0].  Startup time for
clang has been reduced from 3.2s to 0.11s with the two diffs applied!

> > --
> > Matthieu Herrb
>
> This also fixes the component FLAVOR of chromium which uses a
> gazillion shared objects.  Awesome work!

Charlène.

[0] http://0x0.st/zbUa.txt
Re: ld.so speedup (part 2)
On 28/04/19 12:01 +0200, Matthieu Herrb wrote:
> On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > The diff below speeds up ld.so library initialisation where the
> > > > > dependency tree is broad and deep, such as samba's smbd which
> > > > > links over 100 libraries.
> >
> > Past experience with ld.so changes suggests it would be good to have
> > test reports from multiple arches, *especially* hppa.
>
> The regress tests seem to pass here on hppa.
>
> --
> Matthieu Herrb

This also fixes the component FLAVOR of chromium which uses a gazillion
shared objects.  Awesome work!
Re: ld.so speedup (part 2)
On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > The diff below speeds up ld.so library initialisation where the
> > > > dependency tree is broad and deep, such as samba's smbd which
> > > > links over 100 libraries.
>
> Past experience with ld.so changes suggests it would be good to have
> test reports from multiple arches, *especially* hppa.

The regress tests seem to pass here on hppa.

--
Matthieu Herrb
Re: ld.so speedup (part 2)
> > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > The diff below speeds up ld.so library initialisation where the
> > > > dependency tree is broad and deep, such as samba's smbd which
> > > > links over 100 libraries.

Past experience with ld.so changes suggests it would be good to have
test reports from multiple arches, *especially* hppa.

On 2019/04/28 01:57, Ian McWilliam wrote:
> Using both patches on old hardware helps speed up the process but I still
> see the rc script timeout before smbd is loaded causing the rest of the
> samba processes to fail to load.  This did not happen under 6.4 (amd64) so
> the change of linker / compiler update is still potentially where the
> problem may lie.
>
> Starting smbd with both patches
> 0m46.55s real 0m46.47s user 0m00.07s system

This doesn't match my experience:

$ time sudo rcctl start samba
smbd(ok)
nmbd(ok)
    0m00.81s real     0m00.31s user     0m00.31s system
Re: ld.so speedup (part 2)
On Sun, Apr 28, 2019 at 01:57:46AM +, Ian McWilliam wrote:
>
> On 28/4/19, 12:56 am, "owner-t...@openbsd.org on behalf of Otto Moerbeek"
> wrote:
>
> > On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:
> >
> > > On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
> > >
> > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > The diff below speeds up ld.so library initialisation where the
> > > > > dependency tree is broad and deep, such as samba's smbd which links
> > > > > over 100 libraries.
> > > > >
> > > > > See for example
> > > > > https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> > > > >
> > > > > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for
> > > > > part 1 that speeds up library loading.
> > > > >
> > > > > The timings below are for /usr/local/sbin/smbd --version:
> > > > >
> > > > > Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> > > > > Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> > > > > Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> > > > > Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system
> > > > >
> > > > > Note that these timings are for a build of a recent samba master
> > > > > tree (linked with kerberos) which is probably slower than the
> > > > > OpenBSD port.
> > > > >
> > > > > Nathanael
> > > >
> > > > Wow. Tried your part1 and part2 diffs and the difference is indeed
> > > > insane!  mail/evolution always took 10+ seconds to start for me and
> > > > now it's almost instant...
> > > > Crazy... But this sounds too good to be true ;-)
> > > > What are the potential regressions?
> > >
> > > Speaking of regression tests, we have quite an extensive collection.
> > > The tests in libexec/ld.so should all pass.
> >
> > And they do on amd64
> >
> > > -Otto
>
> The results look good but it still doesn't resolve the root cause of the
> issue.
Speedup of ld.so is nice in any circumstance and samba issues should be
viewed separately.  In other words, please don't hijack the thread.

	-Otto

> Using both patches on old hardware helps speed up the process but I still
> see the rc script timeout before smbd is loaded causing the rest of the
> samba processes to fail to load.  This did not happen under 6.4 (amd64) so
> the change of linker / compiler update is still potentially where the
> problem may lie.
>
> Starting smbd with both patches
> 0m46.55s real 0m46.47s user 0m00.07s system
>
> Would still be good to see this work committed though.
>
> Ian McWilliam
>
> OpenBSD 6.5 (GENERIC.MP) #0: Mon Apr 15 16:28:00 AEST 2019
> [...]
Re: ld.so speedup (part 2)
On 28/4/19, 12:56 am, "owner-t...@openbsd.org on behalf of Otto Moerbeek"
wrote:

> On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:
>
> > On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
> >
> > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > The diff below speeds up ld.so library initialisation where the
> > > > dependency tree is broad and deep, such as samba's smbd which links
> > > > over 100 libraries.
> > > >
> > > > See for example
> > > > https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> > > >
> > > > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for
> > > > part 1 that speeds up library loading.
> > > >
> > > > The timings below are for /usr/local/sbin/smbd --version:
> > > >
> > > > Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> > > > Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> > > > Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> > > > Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system
> > > >
> > > > Note that these timings are for a build of a recent samba master
> > > > tree (linked with kerberos) which is probably slower than the
> > > > OpenBSD port.
> > > >
> > > > Nathanael
> > >
> > > Wow. Tried your part1 and part2 diffs and the difference is indeed
> > > insane!  mail/evolution always took 10+ seconds to start for me and
> > > now it's almost instant...
> > > Crazy... But this sounds too good to be true ;-)
> > > What are the potential regressions?
> >
> > Speaking of regression tests, we have quite an extensive collection.
> > The tests in libexec/ld.so should all pass.
>
> And they do on amd64
>
> > -Otto

The results look good but it still doesn't resolve the root cause of the
issue.  Using both patches on old hardware helps speed up the process but
I still see the rc script timeout before smbd is loaded causing the rest
of the samba processes to fail to load.
This did not happen under 6.4 (amd64) so the change of linker / compiler
update is still potentially where the problem may lie.

Starting smbd with both patches
0m46.55s real 0m46.47s user 0m00.07s system

Would still be good to see this work committed though.

Ian McWilliam

OpenBSD 6.5 (GENERIC.MP) #0: Mon Apr 15 16:28:00 AEST 2019
ianm@ianm-openbsd65.localdomain:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 6424494080 (6126MB)
avail mem = 6220148736 (5931MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xf0100 (55 entries)
bios0: vendor Award Software International, Inc. version "F10d" date 07/22/2010
bios0: Gigabyte Technology Co., Ltd. GA-MA790X-DS4
acpi0 at bios0: rev 0
acpi0: sleep states S0 S1 S4 S5
acpi0: tables DSDT FACP SSDT HPET MCFG APIC
acpi0: wakeup devices USB0(S3) USB1(S3) USB2(S3) USB3(S3) USB4(S3) USB5(S3) SBAZ(S4) P2P_(S5) PCE2(S4) PCE3(S4) PCE4(S4) PCE5(S4) PCE6(S4) PCE7(S4) PCE8(S4) PCE9(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimcfg0 at acpi0
acpimcfg0: addr 0xe000, bus 0-255
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Phenom(tm) 9750 Quad-Core Processor, 2411.28 MHz, 10-02-03
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,ITSC
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu0: ITLB 32 4KB entries fully associative, 16 4MB entries fully associative
cpu0: DTLB 48 4KB entries fully associative, 48 4MB entries fully associative
cpu0: AMD erratum 721 detected and fixed
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 200MHz
cpu0: mwait min=64, max=64, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD Phenom(tm) 9750 Quad-Core Processor, 2410.99 MHz, 10-02-03
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,ITSC
cpu1: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu1: ITLB 32 4KB entries fully associative, 16 4MB entries fully associative
cpu1: DTLB 48 4KB entries fully associative, 48 4MB entries fully associative
cpu1: AMD erratum 721 detected and fixed
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD Phenom(tm) 9750 Quad-Core Processor, 2410.99 MHz, 10-02-03
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,C
Re: ld.so speedup (part 2)
On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:
> On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
>
> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > The diff below speeds up ld.so library initialisation where the dependency
> > > tree is broad and deep, such as samba's smbd which links over 100
> > > libraries.
> > >
> > > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> > >
> > > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > > that speeds up library loading.
> > >
> > > The timings below are for /usr/local/sbin/smbd --version:
> > >
> > > Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> > > Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> > > Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> > > Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system
> > >
> > > Note that these timings are for a build of a recent samba master tree
> > > (linked with kerberos) which is probably slower than the OpenBSD port.
> > >
> > > Nathanael
> >
> > Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
> > mail/evolution always took 10+ seconds to start for me and now it's almost
> > instant...
> > Crazy... But this sounds too good to be true ;-)
> > What are the potential regressions?
>
> Speaking of regression tests, we have quite an extensive collection.
> The tests in libexec/ld.so should all pass.

And they do on amd64

>
> -Otto
>
> > > Index: libexec/ld.so/loader.c
> > > ===
> > > RCS file: /cvs/src/libexec/ld.so/loader.c,v
> > > retrieving revision 1.177
> > > diff -u -p -p -u -r1.177 loader.c
> > > --- libexec/ld.so/loader.c    3 Dec 2018 05:29:56 -    1.177
> > > +++ libexec/ld.so/loader.c    27 Apr 2019 13:24:02 -
> > > @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
> > >  {
> > >  	struct dep_node *n;
> > >
> > > -	object->status |= STAT_VISITED;
> > > +	int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> > > +
> > > +	object->status |= visited_flag;
> > >
> > >  	TAILQ_FOREACH(n, &object->child_list, next_sib) {
> > > -		if (n->data->status & STAT_VISITED)
> > > +		if (n->data->status & visited_flag)
> > >  			continue;
> > >  		_dl_call_init_recurse(n->data, initfirst);
> > >  	}
> > > -
> > > -	object->status &= ~STAT_VISITED;
> > >
> > >  	if (object->status & STAT_INIT_DONE)
> > >  		return;
> > > Index: libexec/ld.so/resolve.h
> > > ===
> > > RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> > > retrieving revision 1.90
> > > diff -u -p -p -u -r1.90 resolve.h
> > > --- libexec/ld.so/resolve.h    21 Apr 2019 04:11:42 -    1.90
> > > +++ libexec/ld.so/resolve.h    27 Apr 2019 13:24:02 -
> > > @@ -125,8 +125,9 @@ struct elf_object {
> > >  #define	STAT_FINI_READY	0x10
> > >  #define	STAT_UNLOADED	0x20
> > >  #define	STAT_NODELETE	0x40
> > > -#define	STAT_VISITED	0x80
> > > +#define	STAT_VISITED_1	0x80
> > >  #define	STAT_GNU_HASH	0x100
> > > +#define	STAT_VISITED_2	0x200
> > >
> > >  	Elf_Phdr	*phdrp;
> > >  	int		phdrc;
> > >
> >
> > --
> > Antoine
> >
>
Re: ld.so speedup (part 2)
On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
> On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > The diff below speeds up ld.so library initialisation where the dependency
> > tree is broad and deep, such as samba's smbd which links over 100 libraries.
> >
> > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> >
> > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > that speeds up library loading.
> >
> > The timings below are for /usr/local/sbin/smbd --version:
> >
> > Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> > Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> > Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> > Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system
> >
> > Note that these timings are for a build of a recent samba master tree
> > (linked with kerberos) which is probably slower than the OpenBSD port.
> >
> > Nathanael
>
> Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
> mail/evolution always took 10+ seconds to start for me and now it's almost
> instant...
> Crazy... But this sounds too good to be true ;-)
> What are the potential regressions?

Speaking of regression tests, we have quite an extensive collection.
The tests in libexec/ld.so should all pass.

	-Otto

> > Index: libexec/ld.so/loader.c
> > ===
> > RCS file: /cvs/src/libexec/ld.so/loader.c,v
> > retrieving revision 1.177
> > diff -u -p -p -u -r1.177 loader.c
> > --- libexec/ld.so/loader.c    3 Dec 2018 05:29:56 -    1.177
> > +++ libexec/ld.so/loader.c    27 Apr 2019 13:24:02 -
> > @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
> >  {
> >  	struct dep_node *n;
> >
> > -	object->status |= STAT_VISITED;
> > +	int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> > +
> > +	object->status |= visited_flag;
> >
> >  	TAILQ_FOREACH(n, &object->child_list, next_sib) {
> > -		if (n->data->status & STAT_VISITED)
> > +		if (n->data->status & visited_flag)
> >  			continue;
> >  		_dl_call_init_recurse(n->data, initfirst);
> >  	}
> > -
> > -	object->status &= ~STAT_VISITED;
> >
> >  	if (object->status & STAT_INIT_DONE)
> >  		return;
> > Index: libexec/ld.so/resolve.h
> > ===
> > RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> > retrieving revision 1.90
> > diff -u -p -p -u -r1.90 resolve.h
> > --- libexec/ld.so/resolve.h    21 Apr 2019 04:11:42 -    1.90
> > +++ libexec/ld.so/resolve.h    27 Apr 2019 13:24:02 -
> > @@ -125,8 +125,9 @@ struct elf_object {
> >  #define	STAT_FINI_READY	0x10
> >  #define	STAT_UNLOADED	0x20
> >  #define	STAT_NODELETE	0x40
> > -#define	STAT_VISITED	0x80
> > +#define	STAT_VISITED_1	0x80
> >  #define	STAT_GNU_HASH	0x100
> > +#define	STAT_VISITED_2	0x200
> >
> >  	Elf_Phdr	*phdrp;
> >  	int		phdrc;
> >
>
> --
> Antoine
>
Re: ld.so speedup (part 2)
On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd which links over 100 libraries.
>
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
>
> The timings below are for /usr/local/sbin/smbd --version:
>
> Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
> Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
> Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
> Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system
>
> Note that these timings are for a build of a recent samba master tree
> (linked with kerberos) which is probably slower than the OpenBSD port.
>
> Nathanael

Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
mail/evolution always took 10+ seconds to start for me and now it's almost
instant...
Crazy... But this sounds too good to be true ;-)
What are the potential regressions?

> Index: libexec/ld.so/loader.c
> ===
> RCS file: /cvs/src/libexec/ld.so/loader.c,v
> retrieving revision 1.177
> diff -u -p -p -u -r1.177 loader.c
> --- libexec/ld.so/loader.c    3 Dec 2018 05:29:56 -    1.177
> +++ libexec/ld.so/loader.c    27 Apr 2019 13:24:02 -
> @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
>  {
>  	struct dep_node *n;
>
> -	object->status |= STAT_VISITED;
> +	int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> +
> +	object->status |= visited_flag;
>
>  	TAILQ_FOREACH(n, &object->child_list, next_sib) {
> -		if (n->data->status & STAT_VISITED)
> +		if (n->data->status & visited_flag)
>  			continue;
>  		_dl_call_init_recurse(n->data, initfirst);
>  	}
> -
> -	object->status &= ~STAT_VISITED;
>
>  	if (object->status & STAT_INIT_DONE)
>  		return;
> Index: libexec/ld.so/resolve.h
> ===
> RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> retrieving revision 1.90
> diff -u -p -p -u -r1.90 resolve.h
> --- libexec/ld.so/resolve.h    21 Apr 2019 04:11:42 -    1.90
> +++ libexec/ld.so/resolve.h    27 Apr 2019 13:24:02 -
> @@ -125,8 +125,9 @@ struct elf_object {
>  #define	STAT_FINI_READY	0x10
>  #define	STAT_UNLOADED	0x20
>  #define	STAT_NODELETE	0x40
> -#define	STAT_VISITED	0x80
> +#define	STAT_VISITED_1	0x80
>  #define	STAT_GNU_HASH	0x100
> +#define	STAT_VISITED_2	0x200
>
>  	Elf_Phdr	*phdrp;
>  	int		phdrc;
>

--
Antoine
ld.so speedup (part 2)
The diff below speeds up ld.so library initialisation where the dependency
tree is broad and deep, such as samba's smbd which links over 100 libraries.

See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2

See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
that speeds up library loading.

The timings below are for /usr/local/sbin/smbd --version:

Timing without either diff  : 6m45.67s real 6m45.65s user 0m00.02s system
Timing with part 1 diff only: 4m42.88s real 4m42.85s user 0m00.02s system
Timing with part 2 diff only: 2m02.61s real 2m02.60s user 0m00.01s system
Timing with both diffs      : 0m00.03s real 0m00.03s user 0m00.00s system

Note that these timings are for a build of a recent samba master tree
(linked with kerberos) which is probably slower than the OpenBSD port.

Nathanael

Index: libexec/ld.so/loader.c
===
RCS file: /cvs/src/libexec/ld.so/loader.c,v
retrieving revision 1.177
diff -u -p -p -u -r1.177 loader.c
--- libexec/ld.so/loader.c	3 Dec 2018 05:29:56 -	1.177
+++ libexec/ld.so/loader.c	27 Apr 2019 13:24:02 -
@@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
 {
 	struct dep_node *n;

-	object->status |= STAT_VISITED;
+	int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
+
+	object->status |= visited_flag;

 	TAILQ_FOREACH(n, &object->child_list, next_sib) {
-		if (n->data->status & STAT_VISITED)
+		if (n->data->status & visited_flag)
 			continue;
 		_dl_call_init_recurse(n->data, initfirst);
 	}
-
-	object->status &= ~STAT_VISITED;

 	if (object->status & STAT_INIT_DONE)
 		return;
Index: libexec/ld.so/resolve.h
===
RCS file: /cvs/src/libexec/ld.so/resolve.h,v
retrieving revision 1.90
diff -u -p -p -u -r1.90 resolve.h
--- libexec/ld.so/resolve.h	21 Apr 2019 04:11:42 -	1.90
+++ libexec/ld.so/resolve.h	27 Apr 2019 13:24:02 -
@@ -125,8 +125,9 @@ struct elf_object {
 #define	STAT_FINI_READY	0x10
 #define	STAT_UNLOADED	0x20
 #define	STAT_NODELETE	0x40
-#define	STAT_VISITED	0x80
+#define	STAT_VISITED_1	0x80
 #define	STAT_GNU_HASH	0x100
+#define	STAT_VISITED_2	0x200

 	Elf_Phdr	*phdrp;
 	int		phdrc;