On 10/12/21(Fri) 09:56, Yuichiro NAITO wrote:
> Any comments about this topic?

I'm ok with this approach.  I would appreciate if somebody
else could take it over, I'm too busy with other stuff.

> On 7/12/21 18:09, Yuichiro NAITO wrote:
> > Hi, Martin
> > 
> > n 2021/07/10 16:55, Martin Pieuchot wrote:
> > > Hello Yuichiro, thanks for your work !
> > 
> > Thanks for the response.
> > 
> > > > On 2021/06/16 16:34, Yuichiro NAITO wrote:
> > > > > When I compile a multi-threaded application with '-pg' option, I 
> > > > > always get no
> > > > > results in gmon.out. With bad luck, my application gets SIGSEGV while 
> > > > > running.
> > > > > Because the buffer that holds number of caller/callee count is the 
> > > > > only one
> > > > > in the process and will be broken by multi-threaded access.
> > > > > 
> > > > > I get the idea to solve this problem from NetBSD. NetBSD has 
> > > > > individual buffers
> > > > > for each thread and merges them at the end of profiling.
> > > 
> > > Note that the kernel use a similar approach but doesn't merge the buffer,
> > > instead it generates a file for each CPU.
> > 
> > Yes, so the number of output files are limited by the number of CPUs in 
> > case of
> > the kernel profiling. I think number of application threads can be 
> > increased more
> > casually. I don't want to see dozens of 'gmon.out' files.
> > 
> > > > > NetBSD stores the reference to the individual buffer by 
> > > > > pthread_setspecific(3).
> > > > > I think it causes infinite recursive call if whole libc library 
> > > > > (except
> > > > > mcount.c) is compiled with -pg.
> > > > > 
> > > > > The compiler generates '_mcount' function call at the beginning of 
> > > > > every
> > > > > functions. If '_mcount' calls pthread_getspecific(3) to get the 
> > > > > individual
> > > > > buffer, pthread_getspecific() calls '_mcount' again and causes 
> > > > > infinite
> > > > > recursion.
> > > > > 
> > > > > NetBSD prevents from infinite recursive call by checking a global 
> > > > > variable. But
> > > > > this approach also prevents simultaneously call of '_mcount' on a 
> > > > > multi-threaded
> > > > > application. It makes a little bit wrong results of profiling.
> > > > > 
> > > > > So I added a pointer to the buffer in `struct pthread` that can be 
> > > > > accessible
> > > > > via macro call as same as pthread_self(3). This approach can prevent 
> > > > > of
> > > > > infinite recursive call of '_mcount'.
> > > 
> > > Not calling a libc function for this makes sense.  However I'm not
> > > convinced that accessing `tib_thread' before _rthread_init() has been
> > > called is safe.
> > 
> > Before '_rthread_init’ is called, '__isthreaded' global variable is kept to 
> > be 0.
> > My patch doesn't access tib_thread in this case.
> > After calling `_rthread_init`, `pthread_create()` changes `__isthreaded` to 
> > 1.
> > Tib_thread will be accessed by all threads that are newly created and the 
> > initial one.
> > 
> > I believe tib of the initial thread has been initialized in `_libc_preinit' 
> > function
> > in 'lib/libc/dlfcn/init.c'.
> > 
> > > I'm not sure where is the cleanest way to place the per-thread buffer,
> > > I'd suggest you ask guenther@ about this.
> > 
> > I added guenther@ in CC of this mail.
> > I hope I can get an advise about per-thread buffer.
> > 
> > > Maybe the initialization can be done outside of _mcount()?
> > 
> > Yes, I think tib is initialized in `pthread_create()` and `_libc_preinit()`.
> > 
> > > > > I obtained merging function from NetBSD that is called in '_mcleanup' 
> > > > > function.
> > > > > Merging function needs to walk through all the individual buffers,
> > > > > I added SLIST_ENTRY member in 'struct gmonparam' to make a list of 
> > > > > the buffers.
> > > > > And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be 
> > > > > used for
> > > > > the kernel.
> > > > > 
> > > > > But I still use pthread_getspecific(3) for that can call destructor 
> > > > > when
> > > > > a thread is terminated. The individual buffer is moved to free list 
> > > > > to reuse
> > > > > for a new thread.
> > > > 
> > > > Here is a patch for this approach.
> > > 
> > > I have various comments:
> > > 
> > > We choose not to use C++ style comment (// comment) in the tree, could you
> > > fix yours?
> > > 
> > > There's two copies of mcount.c, the other one lies in sys/lib/libkern 
> > > could
> > > you keep them in sync?
> > > 
> > > gmon.c is only compiled in userland and don't need #ifndef _KERNEL
> > > 
> > > In libc there's also the use of _MUTEX_LOCK/UNLOCK() macro instead of
> > > calling pthread_mutex* directly.  This might help reduce the differences
> > > between ST and MT paths.
> > 
> > Thanks for letting me know them.
> > I updated my patch as following according to the advice.
> > 
> > diff --git a/lib/libc/gmon/gmon.c b/lib/libc/gmon/gmon.c
> > index 1ce0a1c289e..5169fa70449 100644
> > --- a/lib/libc/gmon/gmon.c
> > +++ b/lib/libc/gmon/gmon.c
> > @@ -42,6 +42,15 @@
> >   struct gmonparam _gmonparam = { GMON_PROF_OFF };
> > +#include <pthread.h>
> > +#include <thread_private.h>
> > +
> > +SLIST_HEAD(, gmonparam) _gmonfree = SLIST_HEAD_INITIALIZER(_gmonfree);
> > +SLIST_HEAD(, gmonparam) _gmoninuse = SLIST_HEAD_INITIALIZER(_gmoninuse);
> > +void* _gmonlock = NULL;
> > +pthread_key_t _gmonkey;
> > +struct gmonparam _gmondummy;
> > +
> >   static int        s_scale;
> >   /* see profil(2) where this is describe (incorrectly) */
> >   #define           SCALE_1_TO_1    0x10000L
> > @@ -52,6 +61,11 @@ PROTO_NORMAL(moncontrol);
> >   PROTO_DEPRECATED(monstartup);
> >   static int hertz(void);
> > +static void _gmon_destructor(void *);
> > +struct gmonparam *_gmon_alloc(void);
> > +static void _gmon_merge(void);
> > +static void _gmon_merge_two(struct gmonparam *, struct gmonparam *);
> > +
> >   void
> >   monstartup(u_long lowpc, u_long highpc)
> >   {
> > @@ -114,6 +128,9 @@ monstartup(u_long lowpc, u_long highpc)
> >     } else
> >             s_scale = SCALE_1_TO_1;
> > +   _gmondummy.state = GMON_PROF_BUSY;
> > +   pthread_key_create(&_gmonkey, _gmon_destructor);
> > +
> >     moncontrol(1);
> >     return;
> > @@ -134,6 +151,192 @@ mapfailed:
> >   }
> >   __strong_alias(_monstartup,monstartup);
> > +static void
> > +_gmon_destructor(void *arg)
> > +{
> > +   struct gmonparam *p = arg, *q, **prev;
> > +
> > +   if (p == &_gmondummy)
> > +           return;
> > +
> > +   pthread_setspecific(_gmonkey, &_gmondummy);
> > +
> > +   _MUTEX_LOCK(&_gmonlock);
> > +   SLIST_REMOVE(&_gmoninuse, p, gmonparam, next);
> > +   SLIST_INSERT_HEAD(&_gmonfree, p, next);
> > +   _MUTEX_UNLOCK(&_gmonlock);
> > +
> > +   pthread_setspecific(_gmonkey, NULL);
> > +}
> > +
> > +struct gmonparam *
> > +_gmon_alloc(void)
> > +{
> > +   void *addr;
> > +   struct gmonparam *p;
> > +
> > +   _MUTEX_LOCK(&_gmonlock);
> > +   p = SLIST_FIRST(&_gmonfree);
> > +   if (p != NULL) {
> > +           SLIST_REMOVE_HEAD(&_gmonfree, next);
> > +           SLIST_INSERT_HEAD(&_gmoninuse, p ,next);
> > +   } else {
> > +           _MUTEX_UNLOCK(&_gmonlock);
> > +           p = mmap(NULL, sizeof (struct gmonparam),
> > +                    PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
> > +           if (p == MAP_FAILED)
> > +                   goto mapfailed_2;
> > +           *p = _gmonparam;
> > +           p->kcount = NULL;
> > +           p->kcountsize = 0;
> > +           p->froms = NULL;
> > +           p->tos = NULL;
> > +           addr = mmap(NULL, p->fromssize, PROT_READ|PROT_WRITE,
> > +                       MAP_ANON|MAP_PRIVATE, -1, 0);
> > +           if (addr == MAP_FAILED)
> > +                   goto mapfailed;
> > +           p->froms = addr;
> > +
> > +           addr = mmap(NULL, p->tossize, PROT_READ|PROT_WRITE,
> > +                       MAP_ANON|MAP_PRIVATE, -1, 0);
> > +           if (addr == MAP_FAILED)
> > +                   goto mapfailed;
> > +           p->tos = addr;
> > +           _MUTEX_LOCK(&_gmonlock);
> > +           SLIST_INSERT_HEAD(&_gmoninuse, p ,next);
> > +   }
> > +   _MUTEX_UNLOCK(&_gmonlock);
> > +   pthread_setspecific(_gmonkey, p);
> > +
> > +   return p;
> > +
> > +mapfailed:
> > +   if (p->froms != NULL) {
> > +           munmap(p->froms, p->fromssize);
> > +           p->froms = NULL;
> > +   }
> > +   if (p->tos != NULL) {
> > +           munmap(p->tos, p->tossize);
> > +           p->tos = NULL;
> > +   }
> > +mapfailed_2:
> > +   pthread_setspecific(_gmonkey, NULL);
> > +   ERR("_gmon_alloc: out of memory\n");
> > +   return NULL;
> > +}
> > +
> > +static void
> > +_gmon_merge_two(struct gmonparam *p, struct gmonparam *q)
> > +{
> > +   u_long fromindex;
> > +   u_short *frompcindex, qtoindex, toindex;
> > +   u_long selfpc;
> > +   u_long endfrom;
> > +   long count;
> > +   struct tostruct *top;
> > +
> > +   endfrom = (q->fromssize / sizeof(*q->froms));
> > +   for (fromindex = 0; fromindex < endfrom; fromindex++) {
> > +           if (q->froms[fromindex] == 0)
> > +                   continue;
> > +           for (qtoindex = q->froms[fromindex]; qtoindex != 0;
> > +                qtoindex = q->tos[qtoindex].link) {
> > +                   selfpc = q->tos[qtoindex].selfpc;
> > +                   count = q->tos[qtoindex].count;
> > +                   /* cribbed from mcount */
> > +                   frompcindex = &p->froms[fromindex];
> > +                   toindex = *frompcindex;
> > +                   if (toindex == 0) {
> > +                           /*
> > +                            *      first time traversing this arc
> > +                            */
> > +                           toindex = ++p->tos[0].link;
> > +                           if (toindex >= p->tolimit)
> > +                                   /* halt further profiling */
> > +                                   goto overflow;
> > +
> > +                           *frompcindex = (u_short)toindex;
> > +                           top = &p->tos[(size_t)toindex];
> > +                           top->selfpc = selfpc;
> > +                           top->count = count;
> > +                           top->link = 0;
> > +                           goto done;
> > +                   }
> > +                   top = &p->tos[(size_t)toindex];
> > +                   if (top->selfpc == selfpc) {
> > +                           /*
> > +                            * arc at front of chain; usual case.
> > +                            */
> > +                           top->count+= count;
> > +                           goto done;
> > +                   }
> > +                   /*
> > +                    * have to go looking down chain for it.
> > +                    * top points to what we are looking at,
> > +                    * we know it is not at the head of the chain.
> > +                    */
> > +                   for (; /* goto done */; ) {
> > +                           if (top->link == 0) {
> > +                                   /*
> > +                                    * top is end of the chain and
> > +                                    * none of the chain had
> > +                                    * top->selfpc == selfpc.  so
> > +                                    * we allocate a new tostruct
> > +                                    * and link it to the head of
> > +                                    * the chain.
> > +                                    */
> > +                                   toindex = ++p->tos[0].link;
> > +                                   if (toindex >= p->tolimit)
> > +                                           goto overflow;
> > +                                   top = &p->tos[(size_t)toindex];
> > +                                   top->selfpc = selfpc;
> > +                                   top->count = count;
> > +                                   top->link = *frompcindex;
> > +                                   *frompcindex = (u_short)toindex;
> > +                                   goto done;
> > +                           }
> > +                           /*
> > +                            * otherwise, check the next arc on the chain.
> > +                            */
> > +                           top = &p->tos[top->link];
> > +                           if (top->selfpc == selfpc) {
> > +                                   /*
> > +                                    * there it is.
> > +                                    * add to its count.
> > +                                    */
> > +                                   top->count += count;
> > +                                   goto done;
> > +                           }
> > +
> > +                   }
> > +
> > +           done: ;
> > +           }
> > +
> > +   }
> > +overflow: ;
> > +
> > +}
> > +
> > +static void
> > +_gmon_merge(void)
> > +{
> > +   struct gmonparam *q;
> > +
> > +   _MUTEX_LOCK(&_gmonlock);
> > +
> > +   SLIST_FOREACH(q, &_gmonfree, next)
> > +           _gmon_merge_two(&_gmonparam, q);
> > +
> > +   SLIST_FOREACH(q, &_gmoninuse, next) {
> > +           q->state = GMON_PROF_OFF;
> > +           _gmon_merge_two(&_gmonparam, q);
> > +   }
> > +
> > +   _MUTEX_UNLOCK(&_gmonlock);
> > +}
> > +
> > +
> >   void
> >   _mcleanup(void)
> >   {
> > @@ -234,6 +437,9 @@ _mcleanup(void)
> >         p->kcount, p->kcountsize);
> >     write(log, dbuf, strlen(dbuf));
> >   #endif
> > +
> > +   _gmon_merge();
> > +
> >     hdr = (struct gmonhdr *)&gmonhdr;
> >     bzero(hdr, sizeof(*hdr));
> >     hdr->lpc = p->lowpc;
> > diff --git a/lib/libc/gmon/mcount.c b/lib/libc/gmon/mcount.c
> > index f0ce70dd6ae..21ca16ab9e7 100644
> > --- a/lib/libc/gmon/mcount.c
> > +++ b/lib/libc/gmon/mcount.c
> > @@ -31,6 +31,15 @@
> >   #include <sys/types.h>
> >   #include <sys/gmon.h>
> > +#ifndef _KERNEL
> > +#include <stdio.h>         /* for the use of '__isthreaded'. */
> > +#include <pthread.h>
> > +#include <thread_private.h>
> > +#include <tib.h>
> > +extern struct gmonparam _gmondummy;
> > +struct gmonparam *_gmon_alloc(void);
> > +#endif
> > +
> >   /*
> >    * mcount is called on entry to each function compiled with the profiling
> >    * switch set.  _mcount(), which is declared in a machine-dependent way
> > @@ -65,7 +74,22 @@ _MCOUNT_DECL(u_long frompc, u_long selfpc)
> >     if ((p = curcpu()->ci_gmon) == NULL)
> >             return;
> >   #else
> > -   p = &_gmonparam;
> > +   if (__isthreaded) {
> > +           if (_gmonparam.state != GMON_PROF_ON)
> > +                   return;
> > +           pthread_t t = TIB_GET()->tib_thread;
> > +           p = t->gmonparam;
> > +           if (p == &_gmondummy)
> > +                   return;
> > +           if (p == NULL) {
> > +                   /* prevent recursive call of _gmon_alloc(). */
> > +                   t->gmonparam = &_gmondummy;
> > +                   if ((t->gmonparam = _gmon_alloc()) == NULL)
> > +                           return;
> > +                   p = t->gmonparam;
> > +           }
> > +   } else
> > +           p = &_gmonparam;
> >   #endif
> >     /*
> >      * check that we are profiling
> > diff --git a/lib/libc/include/thread_private.h 
> > b/lib/libc/include/thread_private.h
> > index 237c3fbd034..d6c89c8334b 100644
> > --- a/lib/libc/include/thread_private.h
> > +++ b/lib/libc/include/thread_private.h
> > @@ -5,6 +5,8 @@
> >   #ifndef _THREAD_PRIVATE_H_
> >   #define _THREAD_PRIVATE_H_
> > +#include <sys/types.h>
> > +#include <sys/gmon.h>
> >   #include <stdio.h>                /* for FILE and __isthreaded */
> >   #define _MALLOC_MUTEXES 32
> > @@ -389,6 +391,7 @@ struct pthread {
> >     /* cancel received in a delayed cancel block? */
> >     int delayed_cancel;
> > +   struct gmonparam *gmonparam;
> >   };
> >   /* flags in pthread->flags */
> >   #define   THREAD_DONE             0x001
> > diff --git a/sys/lib/libkern/mcount.c b/sys/lib/libkern/mcount.c
> > index 684ddadb9aa..e30c9554ff5 100644
> > --- a/sys/lib/libkern/mcount.c
> > +++ b/sys/lib/libkern/mcount.c
> > @@ -33,6 +33,15 @@
> >   #include <sys/param.h>
> >   #include <sys/gmon.h>
> > +#ifndef _KERNEL
> > +#include <stdio.h>         /* for the use of '__isthreaded'. */
> > +#include <pthread.h>
> > +#include <thread_private.h>
> > +#include <tib.h>
> > +extern struct gmonparam _gmondummy;
> > +struct gmonparam *_gmon_alloc(void);
> > +#endif
> > +
> >   /*
> >    * mcount is called on entry to each function compiled with the profiling
> >    * switch set.  _mcount(), which is declared in a machine-dependent way
> > @@ -67,7 +76,22 @@ _MCOUNT_DECL(u_long frompc, u_long selfpc)
> >     if ((p = curcpu()->ci_gmon) == NULL)
> >             return;
> >   #else
> > -   p = &_gmonparam;
> > +   if (__isthreaded) {
> > +           if (_gmonparam.state != GMON_PROF_ON)
> > +                   return;
> > +           pthread_t t = TIB_GET()->tib_thread;
> > +           p = t->gmonparam;
> > +           if (p == &_gmondummy)
> > +                   return;
> > +           if (p == NULL) {
> > +                   /* prevent recursive call of _gmon_alloc(). */
> > +                   t->gmonparam = &_gmondummy;
> > +                   if ((t->gmonparam = _gmon_alloc()) == NULL)
> > +                           return;
> > +                   p = t->gmonparam;
> > +           }
> > +   } else
> > +           p = &_gmonparam;
> >   #endif
> >     /*
> >      * check that we are profiling
> > diff --git a/sys/sys/gmon.h b/sys/sys/gmon.h
> > index 1a63cb52b1b..cce98752346 100644
> > --- a/sys/sys/gmon.h
> > +++ b/sys/sys/gmon.h
> > @@ -35,6 +35,7 @@
> >   #ifndef _SYS_GMON_H_
> >   #define _SYS_GMON_H_
> > +#include <sys/queue.h>
> >   #include <machine/profile.h>
> >   /*
> > @@ -136,6 +137,7 @@ struct gmonparam {
> >     u_long          highpc;
> >     u_long          textsize;
> >     u_long          hashfraction;
> > +   SLIST_ENTRY(gmonparam)  next;
> >   };
> >   /*
> > 
> > 
> 
> -- 
> Yuichiro NAITO (naito.yuich...@gmail.com)

Reply via email to