Re: ld.so speedup (part 2)

2019-05-11 Thread Philip Guenther
On Tue, 7 May 2019, Jeremie Courreges-Anglas wrote:
> On Sat, Apr 27 2019, Nathanael Rensen  wrote:
> > The diff below speeds up ld.so library initialisation where the 
> > dependency tree is broad and deep, such as samba's smbd which links 
> > over 100 libraries.
...
> As I told mpi@ earlier today, I think your changes are correct as is,
> and are good to be committed.  So this counts as an ok jca@.  But I'd
> expect other developers to chime in soon, maybe they'll spot something
> that I didn't.

drahn@ and I pulled on our ld.so waders and agreed it's good, so I've 
committed it with some tweaking to the #defines to make them 
self-explanatory and have contiguous bit-assignments.
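
For the curious, "contiguous" refers to the bit layout: the diff as
posted left STAT_GNU_HASH sitting between the two new visit flags
(0x80 and 0x200). Purely as an illustration, with hypothetical names
and values that need not match what was committed, a self-explanatory
contiguous layout would be something like:

#define	STAT_VISITED_1	0x080	/* hypothetical: mark for the initfirst pass */
#define	STAT_VISITED_2	0x100	/* hypothetical: mark for the ordinary pass */
#define	STAT_GNU_HASH	0x200	/* moved so the two visit flags are adjacent */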

Thank you for identifying this badly inefficient algorithm and spotting 
how easy it was to fix!


Philip Guenther



Re: ld.so speedup (part 2)

2019-05-07 Thread Jeremie Courreges-Anglas
On Sat, Apr 27 2019, Nathanael Rensen  wrote:
> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd which links over 100 libraries.
>
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
>
> The timings below are for /usr/local/sbin/smbd --version:
>
> Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system

First off, thanks a lot for solving this long-standing issue.  The
use of ld --as-needed hides the problem, but it looks like ld.lld isn't
as good as ld.bfd at eliminating extra inter-library references.

As I told mpi@ earlier today, I think your changes are correct as is,
and are good to be committed.  So this counts as an ok jca@.  But I'd
expect other developers to chime in soon, maybe they'll spot something
that I didn't.

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: ld.so speedup (part 2)

2019-05-05 Thread Nathanael Rensen
On Sun, 5 May 2019 at 06:26, Martin Pieuchot  wrote:
>
> On 27/04/19(Sat) 21:55, Nathanael Rensen wrote:
> > The diff below speeds up ld.so library initialisation where the dependency
> > tree is broad and deep, such as samba's smbd which links over 100 libraries.
> >
> > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> >
> > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > that speeds up library loading.
> >
> > The timings below are for /usr/local/sbin/smbd --version:
> >
> > Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> > Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> > Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> > Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> >
> > Note that these timings are for a build of a recent samba master tree
> > (linked with kerberos) which is probably slower than the OpenBSD port.
>
> Nice numbers.  Could you explain in words what your diff is doing?  Why
> does splitting the flag help?  Is it because some ctors/initarray are
> being initialized multiple times currently?

No, the STAT_INIT_DONE flag prevents that.

> Or is it just to prevent some traversal?

Yes.

> In that case does that mean the `STAT_VISITED' flag is removed too
> early?

Yes, STAT_VISITED is removed too early. The visited flag is set on a node
while traversing the child nodes of that node and then removed. It serves
to protect against circular dependencies, but does not prevent repeatedly
traversing through a node that appears on separate branches.
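
To see why that blows up, here is a minimal standalone sketch (my
illustration, not ld.so code): a chain of N nodes where each node has
two edges to the next one, the simplest "same library on separate
branches" shape. Clearing the visited flag on the way out makes the
walk revisit each subtree once per incoming edge, about 2^N node
visits, while a persistent flag makes it linear:

#include <stdio.h>

#define N 25			/* depth of the diamond chain */

static int visited[N + 1];
static unsigned long visits;

static void
walk(int node, int persist)
{
	visits++;		/* counts traversal work, not ctor calls */
	visited[node] = 1;
	if (node < N) {
		int edge;

		for (edge = 0; edge < 2; edge++) {	/* two edges to node+1 */
			if (visited[node + 1])
				continue;
			walk(node + 1, persist);
		}
	}
	if (!persist)
		visited[node] = 0;	/* what the old code did on the way out */
}

int
main(void)
{
	walk(0, 0);
	printf("cleared on exit: %lu visits\n", visits);	/* 2^(N+1)-1 */
	visits = 0;		/* all flags were cleared again on the way out */
	walk(0, 1);
	printf("kept set:        %lu visits\n", visits);	/* N+1 */
	return 0;
}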

The entire tree must be traversed twice - first to initialise the
DF_1_INITFIRST libraries, then to initialise the others. This is
presumably why this diff contributes roughly twice as much speedup
as the part 1 diff. To be effective in avoiding repeated traversals
the visited flag must persist throughout an entire tree traversal,
but it must then either be cleared between the first and second
traversals or a different flag must be used for the second one.
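
In code terms, the driver implied by the above is presumably shaped
like this (a sketch from the description, not the verbatim source;
types stubbed out):

typedef struct elf_object elf_object_t;
void _dl_call_init_recurse(elf_object_t *object, int initfirst);

void
_dl_call_init(elf_object_t *object)
{
	_dl_call_init_recurse(object, 1);	/* DF_1_INITFIRST ctors first */
	_dl_call_init_recurse(object, 0);	/* then everyone else */
}

A flag cleared on the way out of each recursion is useless across the
two passes, and a single persistent flag would wrongly suppress the
second pass; hence one persistent flag per pass.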

My approach was to add a second visited flag and make them both
persistent. My rationale for why I believe the flags may be
persisted is as follows. dlopen() calls _dl_call_init() with the
newly loaded object and neither the newly loaded object nor any
newly loaded children of that object will have either visited flag
set. Already loaded children will have those flags set, but they
won't have gained any new children as a result of the dlopen().
If this reasoning is wrong then the diff is wrong and could lead to
uninitialised libraries (and an ld.so regress test should probably
be created to catch that situation).
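
A sketch of the check such a regress test might make (hypothetical
library and symbol names; only standard dlfcn calls are used): build a
libfoo.so whose constructor sets an exported flag, arrange for part of
its dependency tree to be loaded already, then:

#include <dlfcn.h>
#include <err.h>
#include <stdlib.h>

int
main(void)
{
	void *h;
	int *ctor_ran;

	h = dlopen("libfoo.so", RTLD_NOW);	/* hypothetical library */
	if (h == NULL)
		errx(1, "dlopen: %s", dlerror());
	ctor_ran = dlsym(h, "foo_ctor_ran");	/* set by libfoo's ctor */
	if (ctor_ran == NULL || !*ctor_ran)
		errx(1, "constructor did not run");
	exit(0);
}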

It occurs to me as I'm writing this that perhaps it's possible to
avoid a tree traversal entirely by walking the linearised grpsym_list
in reverse and relying only on the STAT_INIT_DONE flag.
/*
 * grpsym_list is an ordered list of all child libs of the
 * _dl_loading_object with no dups. The order is equivalent
 * to a breadth-first traversal of the child list without dups.
 */
I don't think it is a true breadth-first traversal, not in the way I
understand breadth-first, but it does ensure that parent nodes appear
before child nodes. So in reverse, child nodes will appear before
parent nodes. While this is not the same as a depth-first traversal
it may be OK. There may be some specific requirements of DF_1_INITFIRST
that need to be taken into account.
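
A self-contained sketch of that alternative (hypothetical, not a diff
against ld.so: the linearised list is modelled as a plain array with
parents before children, and DF_1_INITFIRST handling is left out):

#define OBJ_INIT_DONE	0x01

struct obj {
	int	status;
	void	(*init)(void);	/* stand-in for running .init/.init_array */
};

static void
call_init_reverse(struct obj **list, int n)
{
	int i;

	for (i = n - 1; i >= 0; i--) {	/* children run before parents */
		if (list[i]->status & OBJ_INIT_DONE)
			continue;
		list[i]->status |= OBJ_INIT_DONE;
		list[i]->init();
	}
}

One linear pass, no recursion; the init-done bit alone prevents repeats.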

Nathanael

>
> > Index: libexec/ld.so/loader.c
> > ===
> > RCS file: /cvs/src/libexec/ld.so/loader.c,v
> > retrieving revision 1.177
> > diff -u -p -p -u -r1.177 loader.c
> > --- libexec/ld.so/loader.c3 Dec 2018 05:29:56 -   1.177
> > +++ libexec/ld.so/loader.c27 Apr 2019 13:24:02 -
> > @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
> >  {
> >   struct dep_node *n;
> > 
> > - object->status |= STAT_VISITED;
> > + int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> > +
> > + object->status |= visited_flag;
> > 
> >   TAILQ_FOREACH(n, &object->child_list, next_sib) {
> > - if (n->data->status & STAT_VISITED)
> > + if (n->data->status & visited_flag)
> >   continue;
> >   _dl_call_init_recurse(n->data, initfirst);
> >   }
> > -
> > - object->status &= ~STAT_VISITED;
> > 
> >   if (object->status & STAT_INIT_DONE)
> >   return;
> > Index: libexec/ld.so/resolve.h
> > ===
> > RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> > retrieving revision 1.90
> > diff -u -p -p -u -r1.90 resolve.h
> > --- libexec/ld.so/resolve.h   21 Apr 2019 04:11:42 -  1.90
> > +++ libexec/ld.so/resolve.h   27 Apr 2019 13:24:02 -
> > @@ -125,8 +125,9 @@ struct elf_object {
> >  #define  

Re: ld.so speedup (part 2)

2019-05-04 Thread Martin Pieuchot
On 27/04/19(Sat) 21:55, Nathanael Rensen wrote:
> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd which links over 100 libraries.
> 
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> 
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
> 
> The timings below are for /usr/local/sbin/smbd --version:
> 
> Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> 
> Note that these timings are for a build of a recent samba master tree
> (linked with kerberos) which is probably slower than the OpenBSD port.

Nice numbers.  Could you explain in words what your diff is doing?  Why
does splitting the flag help?  Is it because some ctors/initarray are
being initialized multiple times currently?  Or is it just to prevent
some traversal?  In that case does that mean the `STAT_VISITED' flag
is removed too early?

> Index: libexec/ld.so/loader.c
> ===
> RCS file: /cvs/src/libexec/ld.so/loader.c,v
> retrieving revision 1.177
> diff -u -p -p -u -r1.177 loader.c
> --- libexec/ld.so/loader.c3 Dec 2018 05:29:56 -   1.177
> +++ libexec/ld.so/loader.c27 Apr 2019 13:24:02 -
> @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
>  {
>   struct dep_node *n;
>  
> - object->status |= STAT_VISITED;
> + int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> +
> + object->status |= visited_flag;
>  
>   TAILQ_FOREACH(n, &object->child_list, next_sib) {
> - if (n->data->status & STAT_VISITED)
> + if (n->data->status & visited_flag)
>   continue;
>   _dl_call_init_recurse(n->data, initfirst);
>   }
> -
> - object->status &= ~STAT_VISITED;
>  
>   if (object->status & STAT_INIT_DONE)
>   return;
> Index: libexec/ld.so/resolve.h
> ===
> RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> retrieving revision 1.90
> diff -u -p -p -u -r1.90 resolve.h
> --- libexec/ld.so/resolve.h   21 Apr 2019 04:11:42 -  1.90
> +++ libexec/ld.so/resolve.h   27 Apr 2019 13:24:02 -
> @@ -125,8 +125,9 @@ struct elf_object {
>  #define  STAT_FINI_READY 0x10
>  #define  STAT_UNLOADED   0x20
>  #define  STAT_NODELETE   0x40
> -#define  STAT_VISITED    0x80
> +#define  STAT_VISITED_1  0x80
>  #define  STAT_GNU_HASH   0x100
> +#define  STAT_VISITED_2  0x200
>  
>   Elf_Phdr *phdrp;
>   int phdrc;
> 



Re: ld.so speedup (part 2)

2019-05-01 Thread Jeremie Courreges-Anglas
On Mon, Apr 29 2019, Stuart Henderson  wrote:
> On 2019/04/28 09:45, Brian Callahan wrote:
>> 
>> 
>> On 4/28/19 6:01 AM, Matthieu Herrb wrote:
>> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
>> > > > > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
>> > > > > > > > The diff below speeds up ld.so library initialisation where the
>> > > > > > > > dependency tree is broad and deep, such as samba's smbd which
>> > > > > > > > links over 100 libraries.
>> > > Past experience with ld.so changes suggests it would be good to have
>> > > test reports from multiple arches, *especially* hppa.
>> > The regress tests seem to pass here on hppa.
>> > 
>> 
>> Pass here too on hppa and macppc and armv7.
>> 
>> ~Brian
>> 
>
> Regress is clean for me on i386 and I am using it on my current ports bulk
> build there (halfway done, no issues seen yet).

Using this in current ports bulk on sparc64, no fallout.

> Regress is also clean on arm64.

and on sparc64.

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: ld.so speedup (part 2)

2019-04-29 Thread Stuart Henderson
On 2019/04/29 09:47, Chris Cappuccio wrote:
> Stuart Henderson [s...@spacehopper.org] wrote:
> > 
> > This doesn't match my experience:
> > 
> > $ time sudo rcctl start samba
> > smbd(ok)
> > nmbd(ok)
> > 0m00.81s real 0m00.31s user 0m00.31s system
> 
> He was linking Samba with Kerberos libs too.
> 

OP was but I don't think Ian was.

That is with the ld.so diffs of course. Startup takes getting on for a
minute for me without them.



Re: ld.so speedup (part 2)

2019-04-29 Thread Chris Cappuccio
Stuart Henderson [s...@spacehopper.org] wrote:
> 
> This doesn't match my experience:
> 
> $ time sudo rcctl start samba
> smbd(ok)
> nmbd(ok)
> 0m00.81s real 0m00.31s user 0m00.31s system

He was linking Samba with Kerberos libs too.



Re: ld.so speedup (part 2)

2019-04-29 Thread Stuart Henderson
On 2019/04/28 09:45, Brian Callahan wrote:
> 
> 
> On 4/28/19 6:01 AM, Matthieu Herrb wrote:
> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > > > > The diff below speeds up ld.so library initialisation where the
> > > > > > > > dependency tree is broad and deep, such as samba's smbd which
> > > > > > > > links over 100 libraries.
> > > Past experience with ld.so changes suggests it would be good to have
> > > test reports from multiple arches, *especially* hppa.
> > The regress tests seem to pass here on hppa.
> > 
> 
> Pass here too on hppa and macppc and armv7.
> 
> ~Brian
> 

Regress is clean for me on i386 and I am using it on my current ports bulk
build there (halfway done, no issues seen yet).

Regress is also clean on arm64.



Re: ld.so speedup (part 2)

2019-04-28 Thread Brian Callahan




On 4/28/19 6:01 AM, Matthieu Herrb wrote:
> On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > > > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > > > > The diff below speeds up ld.so library initialisation where the
> > > > > > > dependency tree is broad and deep, such as samba's smbd which
> > > > > > > links over 100 libraries.
> > 
> > Past experience with ld.so changes suggests it would be good to have
> > test reports from multiple arches, *especially* hppa.
> 
> The regress tests seem to pass here on hppa.


Pass here too on hppa and macppc and armv7.

~Brian



Re: ld.so speedup (part 2)

2019-04-28 Thread Charlene Wendling
On Sun, 28 Apr 2019 13:04:22 +0200
Robert Nagy  wrote:

> On 28/04/19 12:01 +0200, Matthieu Herrb wrote:
> > On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > > >> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > > >> > > The diff below speeds up ld.so library initialisation where the
> > > > >> > > dependency tree is broad and deep, such as samba's smbd which
> > > > >> > > links over 100 libraries.
> > > 
> > > Past experience with ld.so changes suggests it would be good to
> > > have test reports from multiple arches, *especially* hppa.
> > 
> > The regress tests seem to pass here on hppa.

It seems good on macppc as well; here is the log [0]. Startup time for
clang has been reduced from 3.2s to 0.11s with the two diffs applied!

> > -- 
> > Matthieu Herrb
> > 
> 
> This also fixes the component FLAVOR of chromium, which uses a
> gazillion shared objects. Awesome work!
> 

Charlène.

[0] http://0x0.st/zbUa.txt



Re: ld.so speedup (part 2)

2019-04-28 Thread Robert Nagy
On 28/04/19 12:01 +0200, Matthieu Herrb wrote:
> On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > > >> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > >> > > The diff below speeds up ld.so library initialisation where the
> > > >> > > dependency tree is broad and deep, such as samba's smbd which
> > > >> > > links over 100 libraries.
> > 
> > Past experience with ld.so changes suggests it would be good to have
> > test reports from multiple arches, *especially* hppa.
> 
> The regress tests seem to pass here on hppa.
> 
> -- 
> Matthieu Herrb
> 

This also fixes the component FLAVOR of chromium, which uses a gazillion
shared objects. Awesome work!



Re: ld.so speedup (part 2)

2019-04-28 Thread Matthieu Herrb
On Sun, Apr 28, 2019 at 08:55:16AM +0100, Stuart Henderson wrote:
> > >> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > >> > > The diff below speeds up ld.so library initialisation where the
> > >> > > dependency tree is broad and deep, such as samba's smbd which
> > >> > > links over 100 libraries.
> 
> Past experience with ld.so changes suggests it would be good to have
> test reports from multiple arches, *especially* hppa.

The regress tests seem to pass here on hppa.

-- 
Matthieu Herrb



Re: ld.so speedup (part 2)

2019-04-28 Thread Stuart Henderson
> >> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> >> > > The diff below speeds up ld.so library initialisation where the
> >> > > dependency tree is broad and deep, such as samba's smbd which
> >> > > links over 100 libraries.

Past experience with ld.so changes suggests it would be good to have
test reports from multiple arches, *especially* hppa.

On 2019/04/28 01:57, Ian McWilliam wrote:
> Using both patches on old hardware helps speed up the process, but I still
> see the rc script timeout before smbd is loaded, causing the rest of the
> samba processes to fail to start. This did not happen under 6.4 (amd64),
> so the linker/compiler change is still potentially where the problem
> may lie.
> 
> Starting smbd with both patches
>  0m46.55s real 0m46.47s user 0m00.07s system

This doesn't match my experience:

$ time sudo rcctl start samba
smbd(ok)
nmbd(ok)
0m00.81s real 0m00.31s user 0m00.31s system



Re: ld.so speedup (part 2)

2019-04-28 Thread Otto Moerbeek
On Sun, Apr 28, 2019 at 01:57:46AM +, Ian McWilliam wrote:

> 
> 
> On 28/4/19, 12:56 am, "owner-t...@openbsd.org on behalf of Otto Moerbeek"
>  wrote:
> 
> >On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:
> >
> >> On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
> >> 
> >> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> >> > > The diff below speeds up ld.so library initialisation where the
> >> > > dependency tree is broad and deep, such as samba's smbd which
> >> > > links over 100 libraries.
> >> > > 
> >> > > See for example
> >> > > https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> >> > > 
> >> > > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for
> >> > > part 1 that speeds up library loading.
> >> > > 
> >> > > The timings below are for /usr/local/sbin/smbd --version:
> >> > > 
> >> > > Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> >> > > Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> >> > > Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> >> > > Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> >> > > 
> >> > > Note that these timings are for a build of a recent samba master
> >> > > tree (linked with kerberos) which is probably slower than the
> >> > > OpenBSD port.
> >> > > 
> >> > > Nathanael
> >> > 
> >> > Wow. Tried your part1 and part2 diffs and the difference is indeed
> >> > insane!  mail/evolution always took 10+ seconds to start for me and
> >> > now it's almost instant...
> >> > Crazy... But this sounds too good to be true ;-)
> >> > What are the potential regressions?
> >> 
> >> Speaking of regression tests, we have quite an extensive collection.
> >> The tests in libexec/ld.so should all pass.
> >
> >And they do on amd64.
> >
> >> 
> >>-Otto
> >> 
> >> 
> 
> The results look good but it still doesn't resolve the root cause of the
> issue.


Speedup of ld.so is nice in any circumstance, and samba issues should be
viewed separately.  In other words, please don't hijack the thread.

-Otto

> Using both patches on old hardware helps speed up the process, but I still
> see the rc script timeout before smbd is loaded, causing the rest of the
> samba processes to fail to start. This did not happen under 6.4 (amd64),
> so the linker/compiler change is still potentially where the problem
> may lie.
> 
> Starting smbd with both patches
>  0m46.55s real 0m46.47s user 0m00.07s system
> 
> 
> Would still be good to see this work committed though.
> 
> Ian McWilliam
> 
> [dmesg snipped]

Re: ld.so speedup (part 2)

2019-04-27 Thread Ian McWilliam



On 28/4/19, 12:56 am, "owner-t...@openbsd.org on behalf of Otto Moerbeek"
 wrote:

>On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:
>
>> On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
>> 
>> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
>> > > The diff below speeds up ld.so library initialisation where the
>> > > dependency tree is broad and deep, such as samba's smbd which
>> > > links over 100 libraries.
>> > > 
>> > > See for example
>> > > https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
>> > > 
>> > > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for
>> > > part 1 that speeds up library loading.
>> > > 
>> > > The timings below are for /usr/local/sbin/smbd --version:
>> > > 
>> > > Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
>> > > Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
>> > > Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
>> > > Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
>> > > 
>> > > Note that these timings are for a build of a recent samba master
>> > > tree (linked with kerberos) which is probably slower than the
>> > > OpenBSD port.
>> > > 
>> > > Nathanael
>> > 
>> > Wow. Tried your part1 and part2 diffs and the difference is indeed
>> > insane!  mail/evolution always took 10+ seconds to start for me and
>> > now it's almost instant...
>> > Crazy... But this sounds too good to be true ;-)
>> > What are the potential regressions?
>> 
>> Speaking of regression tests, we have quite an extensive collection.
>> The tests in libexec/ld.so should all pass.
>
>And they do on amd64.
>
>> 
>>  -Otto
>> 
>> 

The results look good but it still doesn't resolve the root cause of the
issue.
Using both patches on old hardware helps speed up the process, but I still
see the rc script timeout before smbd is loaded, causing the rest of the
samba processes to fail to start. This did not happen under 6.4 (amd64),
so the linker/compiler change is still potentially where the problem
may lie.

Starting smbd with both patches
 0m46.55s real 0m46.47s user 0m00.07s system


Would still be good to see this work committed though.

Ian McWilliam

OpenBSD 6.5 (GENERIC.MP) #0: Mon Apr 15 16:28:00 AEST 2019

ianm@ianm-openbsd65.localdomain:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 6424494080 (6126MB)
avail mem = 6220148736 (5931MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xf0100 (55 entries)
bios0: vendor Award Software International, Inc. version "F10d" date
07/22/2010
bios0: Gigabyte Technology Co., Ltd. GA-MA790X-DS4
acpi0 at bios0: rev 0
acpi0: sleep states S0 S1 S4 S5
acpi0: tables DSDT FACP SSDT HPET MCFG APIC
acpi0: wakeup devices USB0(S3) USB1(S3) USB2(S3) USB3(S3) USB4(S3)
USB5(S3) SBAZ(S4) P2P_(S5) PCE2(S4) PCE3(S4) PCE4(S4) PCE5(S4) PCE6(S4)
PCE7(S4) PCE8(S4) PCE9(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimcfg0 at acpi0
acpimcfg0: addr 0xe000, bus 0-255
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Phenom(tm) 9750 Quad-Core Processor, 2411.28 MHz, 10-02-03
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFL
USH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDT
SCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,
OSVW,IBS,ITSC
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB
64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu0: ITLB 32 4KB entries fully associative, 16 4MB entries fully
associative
cpu0: DTLB 48 4KB entries fully associative, 48 4MB entries fully
associative
cpu0: AMD erratum 721 detected and fixed
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 200MHz
cpu0: mwait min=64, max=64, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD Phenom(tm) 9750 Quad-Core Processor, 2410.99 MHz, 10-02-03
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFL
USH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB,RDT
SCP,LONG,3DNOW2,3DNOW,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,
OSVW,IBS,ITSC
cpu1: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB
64b/line 16-way L2 cache, 2MB 64b/line 32-way L3 cache
cpu1: ITLB 32 4KB entries fully associative, 16 4MB entries fully
associative
cpu1: DTLB 48 4KB entries fully associative, 48 4MB entries fully
associative
cpu1: AMD erratum 721 detected and fixed
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD Phenom(tm) 9750 Quad-Core Processor, 2410.99 MHz, 10-02-03
cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFL

Re: ld.so speedup (part 2)

2019-04-27 Thread Otto Moerbeek
On Sat, Apr 27, 2019 at 04:43:14PM +0200, Otto Moerbeek wrote:

> On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:
> 
> > On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > > The diff below speeds up ld.so library initialisation where the dependency
> > > tree is broad and deep, such as samba's smbd which links over 100
> > > libraries.
> > > 
> > > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> > > 
> > > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > > that speeds up library loading.
> > > 
> > > The timings below are for /usr/local/sbin/smbd --version:
> > > 
> > > Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> > > Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> > > Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> > > Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> > > 
> > > Note that these timings are for a build of a recent samba master tree
> > > (linked with kerberos) which is probably slower than the OpenBSD port.
> > > 
> > > Nathanael
> > 
> > Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
> > mail/evolution always took 10+ seconds to start for me and now it's almost
> > instant...
> > Crazy... But this sounds too good to be true ;-)
> > What are the potential regressions?
> 
> Speaking of regression tests, we have quite an extensive collection.
> The tests in libexec/ld.so should all pass.

And they do on amd64.

> 
>   -Otto
> 
> 
> > 
> > 
> > > Index: libexec/ld.so/loader.c
> > > ===
> > > RCS file: /cvs/src/libexec/ld.so/loader.c,v
> > > retrieving revision 1.177
> > > diff -u -p -p -u -r1.177 loader.c
> > > --- libexec/ld.so/loader.c3 Dec 2018 05:29:56 -   1.177
> > > +++ libexec/ld.so/loader.c27 Apr 2019 13:24:02 -
> > > @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
> > >  {
> > >   struct dep_node *n;
> > >  
> > > - object->status |= STAT_VISITED;
> > > + int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> > > +
> > > + object->status |= visited_flag;
> > >  
> > >   TAILQ_FOREACH(n, &object->child_list, next_sib) {
> > > - if (n->data->status & STAT_VISITED)
> > > + if (n->data->status & visited_flag)
> > >   continue;
> > >   _dl_call_init_recurse(n->data, initfirst);
> > >   }
> > > -
> > > - object->status &= ~STAT_VISITED;
> > >  
> > >   if (object->status & STAT_INIT_DONE)
> > >   return;
> > > Index: libexec/ld.so/resolve.h
> > > ===
> > > RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> > > retrieving revision 1.90
> > > diff -u -p -p -u -r1.90 resolve.h
> > > --- libexec/ld.so/resolve.h   21 Apr 2019 04:11:42 -  1.90
> > > +++ libexec/ld.so/resolve.h   27 Apr 2019 13:24:02 -
> > > @@ -125,8 +125,9 @@ struct elf_object {
> > >  #define  STAT_FINI_READY 0x10
> > >  #define  STAT_UNLOADED   0x20
> > >  #define  STAT_NODELETE   0x40
> > > -#define  STAT_VISITED    0x80
> > > +#define  STAT_VISITED_1  0x80
> > >  #define  STAT_GNU_HASH   0x100
> > > +#define  STAT_VISITED_2  0x200
> > >  
> > >   Elf_Phdr *phdrp;
> > >   int phdrc;
> > > 
> > 
> > -- 
> > Antoine
> > 
> 



Re: ld.so speedup (part 2)

2019-04-27 Thread Otto Moerbeek
On Sat, Apr 27, 2019 at 04:37:23PM +0200, Antoine Jacoutot wrote:

> On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> > The diff below speeds up ld.so library initialisation where the dependency
> > tree is broad and deep, such as samba's smbd which links over 100 libraries.
> > 
> > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> > 
> > See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> > that speeds up library loading.
> > 
> > The timings below are for /usr/local/sbin/smbd --version:
> > 
> > Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> > Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> > Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> > Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> > 
> > Note that these timings are for a build of a recent samba master tree
> > (linked with kerberos) which is probably slower than the OpenBSD port.
> > 
> > Nathanael
> 
> Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
> mail/evolution always took 10+ seconds to start for me and now it's almost
> instant...
> Crazy... But this sounds too good to be true ;-)
> What are the potential regressions?

Speaking of regression tests, we have quite an extensive collection.
The tests in libexec/ld.so should all pass.

-Otto


> 
> 
> > Index: libexec/ld.so/loader.c
> > ===
> > RCS file: /cvs/src/libexec/ld.so/loader.c,v
> > retrieving revision 1.177
> > diff -u -p -p -u -r1.177 loader.c
> > --- libexec/ld.so/loader.c  3 Dec 2018 05:29:56 -   1.177
> > +++ libexec/ld.so/loader.c  27 Apr 2019 13:24:02 -
> > @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
> >  {
> > struct dep_node *n;
> >  
> > -   object->status |= STAT_VISITED;
> > +   int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> > +
> > +   object->status |= visited_flag;
> >  
> > TAILQ_FOREACH(n, &object->child_list, next_sib) {
> > -   if (n->data->status & STAT_VISITED)
> > +   if (n->data->status & visited_flag)
> > continue;
> > _dl_call_init_recurse(n->data, initfirst);
> > }
> > -
> > -   object->status &= ~STAT_VISITED;
> >  
> > if (object->status & STAT_INIT_DONE)
> > return;
> > Index: libexec/ld.so/resolve.h
> > ===
> > RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> > retrieving revision 1.90
> > diff -u -p -p -u -r1.90 resolve.h
> > --- libexec/ld.so/resolve.h 21 Apr 2019 04:11:42 -  1.90
> > +++ libexec/ld.so/resolve.h 27 Apr 2019 13:24:02 -
> > @@ -125,8 +125,9 @@ struct elf_object {
> >  #define  STAT_FINI_READY 0x10
> >  #define  STAT_UNLOADED   0x20
> >  #define  STAT_NODELETE   0x40
> > -#define  STAT_VISITED    0x80
> > +#define  STAT_VISITED_1  0x80
> >  #define  STAT_GNU_HASH   0x100
> > +#define  STAT_VISITED_2  0x200
> >  
> >  Elf_Phdr *phdrp;
> >  int phdrc;
> > 
> 
> -- 
> Antoine
> 



Re: ld.so speedup (part 2)

2019-04-27 Thread Antoine Jacoutot
On Sat, Apr 27, 2019 at 09:55:33PM +0800, Nathanael Rensen wrote:
> The diff below speeds up ld.so library initialisation where the dependency
> tree is broad and deep, such as samba's smbd which links over 100 libraries.
> 
> See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> 
> See https://marc.info/?l=openbsd-tech&m=155637285221396&w=2 for part 1
> that speeds up library loading.
> 
> The timings below are for /usr/local/sbin/smbd --version:
> 
> Timing without either diff  : 6m45.67s real  6m45.65s user  0m00.02s system
> Timing with part 1 diff only: 4m42.88s real  4m42.85s user  0m00.02s system
> Timing with part 2 diff only: 2m02.61s real  2m02.60s user  0m00.01s system
> Timing with both diffs      : 0m00.03s real  0m00.03s user  0m00.00s system
> 
> Note that these timings are for a build of a recent samba master tree
> (linked with kerberos) which is probably slower than the OpenBSD port.
> 
> Nathanael

Wow. Tried your part1 and part2 diffs and the difference is indeed insane!
mail/evolution always took 10+ seconds to start for me and now it's almost
instant...
Crazy... But this sounds too good to be true ;-)
What are the potential regressions?


> Index: libexec/ld.so/loader.c
> ===
> RCS file: /cvs/src/libexec/ld.so/loader.c,v
> retrieving revision 1.177
> diff -u -p -p -u -r1.177 loader.c
> --- libexec/ld.so/loader.c3 Dec 2018 05:29:56 -   1.177
> +++ libexec/ld.so/loader.c27 Apr 2019 13:24:02 -
> @@ -749,15 +749,15 @@ _dl_call_init_recurse(elf_object_t *obje
>  {
>   struct dep_node *n;
>  
> - object->status |= STAT_VISITED;
> + int visited_flag = initfirst ? STAT_VISITED_1 : STAT_VISITED_2;
> +
> + object->status |= visited_flag;
>  
>   TAILQ_FOREACH(n, &object->child_list, next_sib) {
> - if (n->data->status & STAT_VISITED)
> + if (n->data->status & visited_flag)
>   continue;
>   _dl_call_init_recurse(n->data, initfirst);
>   }
> -
> - object->status &= ~STAT_VISITED;
>  
>   if (object->status & STAT_INIT_DONE)
>   return;
> Index: libexec/ld.so/resolve.h
> ===
> RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> retrieving revision 1.90
> diff -u -p -p -u -r1.90 resolve.h
> --- libexec/ld.so/resolve.h   21 Apr 2019 04:11:42 -  1.90
> +++ libexec/ld.so/resolve.h   27 Apr 2019 13:24:02 -
> @@ -125,8 +125,9 @@ struct elf_object {
>  #define  STAT_FINI_READY 0x10
>  #define  STAT_UNLOADED   0x20
>  #define  STAT_NODELETE   0x40
> -#define  STAT_VISITED    0x80
> +#define  STAT_VISITED_1  0x80
>  #define  STAT_GNU_HASH   0x100
> +#define  STAT_VISITED_2  0x200
>  
>   Elf_Phdr *phdrp;
>   int phdrc;
> 

-- 
Antoine