On Sat, Jun 27, 2020 at 12:10:24PM +0200, Martin Pieuchot wrote:
> On 27/06/20(Sat) 00:35, Vitaliy Makkoveev wrote:
> > On Fri, Jun 26, 2020 at 09:12:16PM +0200, Martin Pieuchot wrote:
> > > On 26/06/20(Fri) 16:56, Vitaliy Makkoveev wrote:
> > > > if_clone_create() has the races caused by context switch.
> > >
> > > Can you share a backtrace of such race? Where does the kernel panic?
> > >
> >
> > This diff was inspired by thread [1]. As I explained [2] here is 3
> > issues that cause panics produced by command below:
> >
> > ---- cut begin ----
> > for i in 1 2 3; do while true; do ifconfig bridge0 create& \
> > ifconfig bridge0 destroy& done& done
> > ---- cut end ----
>
> Thanks, I couldn't reproduce it on any of the machines I tried. Did you
> managed to reproduce it with other pseudo-devices or just with bridge0?
>
> > My system was stable with the last diff I did for thread [1]. But since
> > this final diff [3] which include fixes for tun(4) is quick and dirty
> > and not for commit I decided to make the diff to fix the races caused by
> > if_clone_create() at first.
> >
> > I included screenshot with panic.
>
> Thanks, interesting that the corruption happens on a list that should be
> initialized. Does that mean the context switch on Thread 1 is happening
> before if_attach_common() is called?
>
> You said your previous email that there's a context switch. Do you know
> when it happens? You could see that in ddb by looking at the backtrace
> of the other CPU.
>
> Is the context switch leading to the race common to all pseudo-drivers
> or is it in the bridge(4) driver?
>
> Regarding your solution, do I understand correctly that the goal is to
> serialize all if_clone_create()? Is it really needed to remember which
> unit is being currently created or can't we just serialize all of them?
>
> The fact that a lock is not held over the cloning operation is imho
> positive.
>
I reworked tool for reproduce. Now I avoided fork()/exec() route and it
takes couple of minutes to take panic on 4 cores. I attached some
screenshots with panics caused by various pseudo-interfaces but my
previous mail was banned. I will try to attach them with separate mails.
I hope anyone else will try it. Now switch(4) is the bast way to
reproduce.
---- cut begin ----
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <stdio.h>
#include <stdlib.h>
#include <err.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>
static struct ifreq ifr;
static void *clone_create(void *arg)
{
int s;
if((s=socket(AF_INET, SOCK_DGRAM, 0))<0)
err(1, "socket()");
while(1){
if(ioctl(s, SIOCIFCREATE, &ifr)<0)
if(errno==EINVAL)
exit(1);
}
return NULL;
}
static void *clone_destroy(void *arg)
{
int s;
if((s=socket(AF_INET, SOCK_DGRAM, 0))<0)
err(1, "socket()");
while(1){
if(ioctl(s, SIOCIFDESTROY, &ifr)<0)
if(errno==EINVAL)
exit(1);
}
return NULL;
}
int main(int argc, char *argv[])
{
pthread_t thr;
int i;
if(argc!=2){
fprintf(stderr, "usage: %s ifname\n", getprogname());
return 1;
}
if(getuid()!=0){
fprintf(stderr, "should be root\n");
return 1;
}
memset(&ifr, 0, sizeof(ifr));
strlcpy(ifr.ifr_name, argv[1], sizeof(ifr.ifr_name));
for(i=0; i<8*4; ++i){
if(pthread_create(&thr, NULL, clone_create, NULL)!=0)
errx(1, "pthread_create(clone_create)");
}
clone_destroy(NULL);
return 0;
}
---- cut end ----