Re: multicast.4: drop unneeded (void *) casts

2018-10-19 Thread Philip Guenther
On Fri, Oct 19, 2018 at 6:20 AM Scott Cheloha wrote:
...

> Sure, looks cleaner.  ok?


ok guenther@


Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne
On Sat, Oct 20, 2018 at 02:44:29AM +, Joseph Mayer wrote:
> 
> Last iteration from me on this one.
> 
> Why is this not a problem on some other architectures?

It is a problem; it's just that other archs don't have an iommu like
sparc64 does.

> I'd have thought DMA and hardware being assigned transitory addresses
> (from memory allocator or other OS subsystem or driver) mostly is a
> lower level phenomenon and memcpy normally applies on higher levels,
> isn't it so - for networking for instance, mbuf's take over soon above
> the driver level. Does OpenBSD have a pool of to-be-mbufs and it asks
> network drivers to write received ethernet frames directly to them, and
> similarly transmit ethernet frames directly from mbuf:s?

Hrm. There are three views of memory you need to keep in mind here.
Memory has a physical address which gets mapped to virtual addresses
that the kernel and programs see. Finally, there's the DMA address,
which is the address devices use to access physical memory.

On most archs the physical and dma addresses are the same thing. On
archs with an IOMMU or similar, the dma address can be virtual, just
like the kernel addresses are virtual.

When you allocate an mbuf, you're getting a chunk of physical memory
that is mapped into the kernel virtual address space. For a device
to do something with it, the kernel has the bus_dma api that figures
out the dma address of the physical memory behind the kernel virtual
address.

On sparc64, that figuring out involves finding the physical address of
the memory, then allocating and filling TTEs. On amd64, it just has to
get the physical address of the kva and the device can use it directly.

> What potentially or clearly sensitive memory would passthru expose,
> driver-owned structures only or all memory?

Passthru means a device can access all the physical memory in a
computer. So everything.



Re: bypass support for iommu on sparc64

2018-10-19 Thread Joseph Mayer
On Saturday, October 20, 2018 10:14 AM, David Gwynne  wrote:

> > On 20 Oct 2018, at 11:56 am, Joseph Mayer joseph.ma...@protonmail.com wrote:
> > ‐‐‐ Original Message ‐‐‐
> > On Friday, October 19, 2018 5:15 PM, Mark Kettenis mark.kette...@xs4all.nl 
> > wrote:
> >
> > > > Date: Fri, 19 Oct 2018 10:22:30 +1000
> > > > From: David Gwynne da...@gwynne.id.au
> > > > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> > > >
> > > > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > > > >
> > > > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > > > > setting up and tearing down the translation table entries (TTEs)
> > > > > > is very expensive. so expensive that the cost of doing it for disk
> > > > > > io has a noticeable impact on compile times.
> > > > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > > > > to bypass the iommu for devices that set that flag, therefore
> > > > > > avoiding the cost of handling the TTEs.
> >
> > Question for the unintroduced, what's the scope here, TTE is Sparc's
> > page table and reconfiguring them at (process) context switch is
> > expensive and this suggestion removes the need for TTE:s for hardware
> > device access, but those don't change at context switch?
>
> We're talking about an IOMMU here, not a traditional MMU providing virtual 
> addresses for programs. An IOMMU sits between physical memory and the devices 
> in a machine. It allows DMA addresses to be mapped to different parts of 
> physical memory. Mapping physical memory to a DMA virtual address (or dva) is 
> how a device that only understands 32bit addresses can work in a 64bit 
> machine. Memory at high addresses gets mapped to a low dva.
>
> This is done at runtime on OpenBSD when DMA mappings are loaded or unloaded 
> by populating Translation Table Entries (TTEs). A TTE is effectively a table 
> or array mapping DVA pages to physical addresses. Generally device drivers 
> load and unload dma memory for every I/O, packet, and so on.
>
> IOMMUs in sparc64s have some more features than this. Because they really are 
> between memory and the devices they can act as a gatekeeper for all memory 
> accesses. They also have a toggle that can allow a device to have direct or 
> passthru access to physical memory. If passthru is enabled, there's a special 
> address range that effectively maps all physical memory into a DVA range. 
> Devices can be pointed at it without having to manage TTEs. When passthru is 
> disabled, all accesses must go through TTEs.
>
> Currently OpenBSD disables passthru. The benefit is devices can't blindly 
> access sensitive memory unless it is explicitly shared. Note that this is how 
> it is on most architectures anyway. However, the consequence of managing the 
> TTEs is that it is expensive, and extremely so in some cases.
>
> dlg

Last iteration from me on this one.

Why is this not a problem on some other architectures?

I'd have thought DMA and hardware being assigned transitory addresses
(from memory allocator or other OS subsystem or driver) mostly is a
lower level phenomenon and memcpy normally applies on higher levels,
isn't it so - for networking for instance, mbuf's take over soon above
the driver level. Does OpenBSD have a pool of to-be-mbufs and it asks
network drivers to write received ethernet frames directly to them, and
similarly transmit ethernet frames directly from mbuf:s?

What potentially or clearly sensitive memory would passthru expose,
driver-owned structures only or all memory?



Re: vmd: servicing virtio devices from separate processes

2018-10-19 Thread David Gwynne



> On 20 Oct 2018, at 4:01 am, Sergio Lopez  wrote:
> 
> On Wed, Oct 17, 2018 at 02:04:46PM -0700, Mike Larkin wrote:
>> On Fri, Oct 05, 2018 at 01:50:10PM +0200, Sergio Lopez wrote:
>>> Right now, vmd already features an excellent privsep model to ensure
>>> the process servicing the VM requests to the outside world is running
>>> with the lowest possible privileges.
>>> 
>>> I was wondering if we could take a step further, servicing each virtio
>>> device from a different process. This design would simplify the
>>> implementation and maintenance of those devices, improve the privsep
>>> model and increase the resilience of VMs (the crash of a process
>>> servicing a device won't bring down the whole VM, and a mechanism to
>>> recover from this scenario could be explored).
>> 
>> Our model is generally to not try to recover from crashes like that. Indeed,
>> you *want* to crash so the underlying bug can be found and fixed.
> 
> With separate processes you'll still have a crash with core dump from
> the process servicing the device, with the additional advantage of being
> able to, optionally, try to recover device or debug it independently. 
> 
>>> - An in-kernel IRQ chip. This one is the most controversial, as it
>>> means emulating a device from a privileged domain, but I'm pretty sure
>>> a lapic implementation with enough functionality to serve *BSD/Linux
>>> Guests can be small and simple enough to be easily auditable.
>> 
>> This needs to be done, for a number of reasons (device emulation being just
>> one). pd@ and I are working on how to implement this using features in
>> recent CPUs, since much of the LAPIC emulation can now be handled by the
>> CPU itself. We're thinking skylake and later will be the line in the sand
>> for this. Otherwise the software emulation is more complex and more prone
>> to bugs. I've resisted the urge to put this stuff in the kernel for exactly
>> that reason, but with later model CPUs we may be in better shape. We may
>> also decide to focus solely on x2APIC. If you're interested in helping in
>> this area, I'll keep you in the loop.
> 
> Sure, I'd like to help. I'm quite familiar with KVM's in-kernel irqchip
> implementation.
> 
>>> Do you think it's worth exploring this model? What are feelings
>>> regarding the in-kernel IRQ chip?
>>> 
>>> Sergio (slp).
>>> 
>> 
>> All things considered, I'm less sold on the idea of splitting out devices
>> into their own processes. I don't see any compelling reason. But we do
>> need an IOAPIC and LAPIC implementation at some point, as you point out.
> 
> Well, vmd is still small enough (thankfully) to be able to debug it easily.
> But, eventually, it'll grow in both size and complexity (especially with
> SMP support, I/O optimizations, additional storage backends...) and
> having a high degree of modularity really helps here.
> 
> In fact, the QEMU/KVM community is starting to consider going this route
> (but QEMU is *huge*). [1]
> 
> Anyways, this is not a "now-or-never" thing. As you suggest, we can work
> now on kickfd and IOAPIC/LAPIC, which are useful by themselves. And when
> those are in place, I'll be able to write a PoC so we can evaluate its
> benefits and drawbacks.

Would sending and receiving a VM still work if I/O is run in different 
processes?

dlg



Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne



> On 20 Oct 2018, at 11:56 am, Joseph Mayer  wrote:
> 
> ‐‐‐ Original Message ‐‐‐
> On Friday, October 19, 2018 5:15 PM, Mark Kettenis  
> wrote:
> 
>>> Date: Fri, 19 Oct 2018 10:22:30 +1000
>>> From: David Gwynne da...@gwynne.id.au
>>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
>>> 
 On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
 
> on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> setting up and tearing down the translation table entries (TTEs)
> is very expensive. so expensive that the cost of doing it for disk
> io has a noticeable impact on compile times.
> now that there's a BUS_DMA_64BIT flag, we can use that to decide
> to bypass the iommu for devices that set that flag, therefore
> avoiding the cost of handling the TTEs.
> 
> Question for the unintroduced, what's the scope here, TTE is Sparc's
> page table and reconfiguring them at (process) context switch is
> expensive and this suggestion removes the need for TTE:s for hardware
> device access, but those don't change at context switch?

We're talking about an IOMMU here, not a traditional MMU providing virtual 
addresses for programs. An IOMMU sits between physical memory and the devices 
in a machine. It allows DMA addresses to mapped to different parts of physical 
memory. Mapping physical memory to a DMA virtual address (or dva) is how a 
device that only understands 32bit addresses can work in a 64bit machine. 
Memory at high addresses gets mapped to a low dva.

This is done at runtime on OpenBSD when DMA mappings are loaded or unloaded by 
populating Translation Table Entries (TTEs). A TTE is effectively a table or 
array mapping DVA pages to physical addresses. Generally device drivers load 
and unload dma memory for every I/O, packet, and so on.

IOMMUs in sparc64s have some more features than this. Because they really are 
between memory and the devices they can act as a gatekeeper for all memory 
accesses. They also have a toggle that can allow a device to have direct or 
passthru access to physical memory. If passthru is enabled, there's a special 
address range that effectively maps all physical memory into a DVA range. 
Devices can be pointed at it without having to manage TTEs. When passthru is 
disabled, all accesses must go through TTEs.
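The passthru window described above amounts to a fixed translation: addresses inside a special range skip the TTEs entirely and the low bits are the physical address. A minimal sketch, with the window base borrowed from the `ihw_bypass` value in the diff elsewhere in this thread but the function names and the exact bit split invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a passthru/bypass window: DVAs inside the window need
 * no TTE lookup; the iommu just strips the window prefix.  The base
 * matches the ihw_bypass constant in the posted diff, but the mask
 * split here is illustrative, not real register layout.
 */
#define BYPASS_BASE	(0x3fffUL << 50)	/* start of the passthru range */
#define PA_MASK		((1UL << 50) - 1)	/* low bits carry the PA */

/* what bus_dma could hand a 64bit-capable device for address 'pa' */
uint64_t
bypass_dva(uint64_t pa)
{
	return BYPASS_BASE | (pa & PA_MASK);
}

/* what the iommu does with such an address: no TTE involved */
uint64_t
bypass_translate(uint64_t dva)
{
	assert((dva & BYPASS_BASE) == BYPASS_BASE);	/* inside the window? */
	return dva & PA_MASK;
}
```

Because the translation is a constant bit operation rather than a table fill, there is nothing to set up or tear down per transfer, which is exactly the cost the diff avoids.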

Currently OpenBSD disables passthru. The benefit is devices can't blindly 
access sensitive memory unless it is explicitly shared. Note that this is how 
it is on most architectures anyway. However, the consequence of managing the 
TTEs is that it is expensive, and extremely so in some cases.

dlg



Re: bypass support for iommu on sparc64

2018-10-19 Thread Joseph Mayer
‐‐‐ Original Message ‐‐‐
On Friday, October 19, 2018 5:15 PM, Mark Kettenis  
wrote:

> > Date: Fri, 19 Oct 2018 10:22:30 +1000
> > From: David Gwynne da...@gwynne.id.au
> > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> >
> > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > >
> > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > > setting up and tearing down the translation table entries (TTEs)
> > > > is very expensive. so expensive that the cost of doing it for disk
> > > > io has a noticeable impact on compile times.
> > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > > to bypass the iommu for devices that set that flag, therefore
> > > > avoiding the cost of handling the TTEs.

Question for the unintroduced, what's the scope here, TTE is Sparc's
page table and reconfiguring them at (process) context switch is
expensive and this suggestion removes the need for TTE:s for hardware
device access, but those don't change at context switch?



Re: lib/libfuse: Handle signals that get sent to any thread

2018-10-19 Thread Rian Hunter
On Fri, Oct 19 2018, at 4:02 PM, Klemens Nanni  wrote:
> On Fri, Oct 19, 2018 at 02:03:59PM -0700, Rian Hunter wrote:
>> Gentle bump on this. Would be nice if this were fixed for the next release.
>> 
>> On 2018-10-07 10:55, Rian Hunter wrote:
>> > (re-posting because I realized the last patch didn't apply cleanly)
> Your patch does not apply due to whitespace issues, your MUA is probably
> the culprit.  I suggest you mail diffs to yourself first and try to
> apply them as reviewers would have to do.
>
> Even after fixing it with `sed "s/^  / /"', your diff does not compile:
>
>   /usr/src/lib/libfuse/fuse.c:496:20: error: use of undeclared identifier 
> 'SIG_SIGCHLD'
>   signal(SIGPIPE, SIG_SIGCHLD);
>
>> +if (sigaction(SIGCHLD, NULL, &old_sa) == 0)
>> +if (old_sa.sa_handler == SIG_IGN)
>> +signal(SIGPIPE, SIG_SIGCHLD);
> It's SIGCHLD or SIG_DFL, see signal(3).

Thanks for the feedback, in my haste to re-post my patch I posted an
old version. Here's the current version, there should be no
tab-mangling:

Index: lib/libfuse/fuse.c
===
RCS file: /cvs/src/lib/libfuse/fuse.c,v
retrieving revision 1.49
diff -u -r1.49 fuse.c
--- lib/libfuse/fuse.c  5 Jul 2018 10:57:31 -   1.49
+++ lib/libfuse/fuse.c  20 Oct 2018 01:03:23 -
@@ -32,8 +32,6 @@
 #include "fuse_private.h"
 #include "debug.h"
 
-static volatile sig_atomic_t sigraised = 0;
-static volatile sig_atomic_t signum = 0;
 static struct fuse_context *ictx = NULL;
 
 enum {
@@ -116,21 +114,10 @@
 };
 
 static void
-ifuse_sighdlr(int num)
-{
-   if (!sigraised || (num == SIGCHLD)) {
-   sigraised = 1;
-   signum = num;
-   }
-}
-
-static void
 ifuse_try_unmount(struct fuse *f)
 {
pid_t child;
 
-   signal(SIGCHLD, ifuse_sighdlr);
-
/* unmount in another thread so fuse_loop() doesn't deadlock */
child = fork();
 
@@ -152,7 +139,6 @@
 {
int status;
 
-   signal(SIGCHLD, SIG_DFL);
if (waitpid(WAIT_ANY, &status, WNOHANG) == -1)
fprintf(stderr, "fuse: %s\n", strerror(errno));
 
@@ -160,7 +146,6 @@
fprintf(stderr, "fuse: %s: %s\n",
f->fc->dir, strerror(WEXITSTATUS(status)));
 
-   sigraised = 0;
return;
 }
 
@@ -170,6 +155,7 @@
struct fusebuf fbuf;
struct fuse_context ctx;
struct fb_ioctl_xch ioexch;
+   struct kevent event[5];
struct kevent ev;
ssize_t n;
int ret;
@@ -181,28 +167,39 @@
if (fuse->fc->kq == -1)
return (-1);
 
-   EV_SET(&fuse->fc->event, fuse->fc->fd, EVFILT_READ, EV_ADD |
+   EV_SET(&event[0], fuse->fc->fd, EVFILT_READ, EV_ADD |
EV_ENABLE, 0, 0, 0);
 
+   /* signal events */
+   EV_SET(&event[1], SIGCHLD, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+   EV_SET(&event[2], SIGHUP, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+   EV_SET(&event[3], SIGINT, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+   EV_SET(&event[4], SIGTERM, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+   
while (!fuse->fc->dead) {
-   ret = kevent(fuse->fc->kq, &fuse->fc->event, 1, &ev, 1, NULL);
+   ret = kevent(fuse->fc->kq, &event[0], 5, &ev, 1, NULL);
if (ret == -1) {
-   if (errno == EINTR) {
-   switch (signum) {
-   case SIGCHLD:
-   ifuse_child_exit(fuse);
-   break;
-   case SIGHUP:
-   case SIGINT:
-   case SIGTERM:
-   ifuse_try_unmount(fuse);
-   break;
-   default:
-   fprintf(stderr, "%s: %s\n", __func__,
-   strsignal(signum));
-   }
-   } else
+   if (errno != EINTR)
DPERROR(__func__);
+   } else if (ret > 0 && ev.filter == EVFILT_SIGNAL) {
+   int signum = ev.ident;
+   switch (signum) {
+   case SIGCHLD:
+   ifuse_child_exit(fuse);
+   break;
+   case SIGHUP:
+   case SIGINT:
+   case SIGTERM:
+   ifuse_try_unmount(fuse);
+   break;
+   default:
+   fprintf(stderr, "%s: %s\n", __func__,
+   strsignal(signum));
+   }
 

Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne



> On 19 Oct 2018, at 9:59 pm, Andrew Grillet  wrote:
> 
> Is the setup and teardown per transfer or when file is opened and closed?
> Or is it set up once per context switch of task?
> 
> I am partly interested cos I would like to improve mt one day (as user of
> tape
> and Sparc64 Txxx) if I get the time.
> 
> Andrew

The overhead is per transfer. You might not get better performance out of a 
tx000 because of the PCIe bridges involved, but you may also be lucky and not 
have that bridge in the way.

> 
> 
> 
> On Fri, 19 Oct 2018 at 10:22, Mark Kettenis  wrote:
> 
>>> Date: Fri, 19 Oct 2018 10:22:30 +1000
>>> From: David Gwynne 
>>> 
>>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
 On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> setting up and tearing down the translation table entries (TTEs)
> is very expensive. so expensive that the cost of doing it for disk
> io has a noticeable impact on compile times.
> 
> now that there's a BUS_DMA_64BIT flag, we can use that to decide
> to bypass the iommu for devices that set that flag, therefore
> avoiding the cost of handling the TTEs.
> 
> the following diff adds support for bypass mappings to the iommu
> code on sparc64. it's based on a diff from kettenis@ back in 2009.
> the main changes are around coping with the differences between
> schizo/psycho and fire/oberon.
> 
> the differences between the chips are now represented by a iommu_hw
> struct. these differences include how to enable the iommu (now via
> a function pointer), and masks for bypass addresses.
> 
> ive tested this on oberon (on an m4000) and schizo (on a v880).
> however, the bypass code isnt working on fire (v245s). to cope with
> that for now, the iommu_hw struct lets drivers mask flag bits that
> are handled when creating a dmamap. this means fire boards will
> ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
 
 i figured it out. it turns out Fire was working fine. however,
 enabling 64bit dva on the onboard devices didnt work because the
 serverworks/broadcom pcie to pcix bridge can only handle dma addresses
 in the low 40 bits. because the fire bypass window is higher than
 this, the bridge would choke and things stopped working.
 
 the updated diff attempts to handle this. basically when probing
 the bridge, the platform creates a custom dma tag for it. this tag
 intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
 handing it up to the parent bridge, which is pyro in my situation.
 it looks like early sun4v boxes could make use of this too.
 
> i have not tested this on psycho yet. if anyone has such a machine
> and is willing to work with me to figure it out, please talk to me.
 
 i still dont have psycho reports.
>>> 
>>> Would anyone object if I committed this? I've been running it for the
>>> last release or two without issues, but with significant improvements in
>>> performance on the machines involved.
>> 
>> At the price of giving all PCI devices unrestricted access to memory.
>> 
>> So I'm not eager to do this, especially since on sun4v hardware bypassing
>> the iommu isn't possible as soon as multiple domains are enabled.  And
>> we lose a useful diagnostic when developing drivers.  Are you sure the
>> iommu overhead can't be reduced some other way?  At some point we
>> probably want to add iommu support on amd64 and arm64, but if that
>> comes with a similar overhead as on sparc64 that's going to be a bit
>> of an issue.
>> 
 Index: dev/iommu.c
 ===
 RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
 retrieving revision 1.74
 diff -u -p -r1.74 iommu.c
 --- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
 +++ dev/iommu.c 10 May 2017 12:00:09 -
 @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
 void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
 bus_addr_t, bus_size_t, int);
 
 +void iommu_hw_enable(struct iommu_state *);
 +
 +const struct iommu_hw iommu_hw_default = {
 +   .ihw_enable = iommu_hw_enable,
 +
 +   .ihw_dvma_pa= IOTTE_PAMASK,
 +
 +   .ihw_bypass = 0x3fffUL << 50,
 +   .ihw_bypass_nc  = 0,
 +   .ihw_bypass_ro  = 0,
 +};
 +
 +void
 +iommu_hw_enable(struct iommu_state *is)
 +{
 +   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
 +   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
 +}
 +
 /*
  * Initiate an STC entry flush.
  */
 @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
  * - create a private DVMA map.
  */
 void
 -iommu_init(char *name, struct iommu_sta

Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne



> On 19 Oct 2018, at 7:15 pm, Mark Kettenis  wrote:
> 
>> Date: Fri, 19 Oct 2018 10:22:30 +1000
>> From: David Gwynne 
>> 
>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
>>> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
 on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
 setting up and tearing down the translation table entries (TTEs)
 is very expensive. so expensive that the cost of doing it for disk
 io has a noticeable impact on compile times.
 
 now that there's a BUS_DMA_64BIT flag, we can use that to decide
 to bypass the iommu for devices that set that flag, therefore
 avoiding the cost of handling the TTEs.
 
 the following diff adds support for bypass mappings to the iommu
 code on sparc64. it's based on a diff from kettenis@ back in 2009.
 the main changes are around coping with the differences between
 schizo/psycho and fire/oberon.
 
 the differences between the chips are now represented by a iommu_hw
 struct. these differences include how to enable the iommu (now via
 a function pointer), and masks for bypass addresses.
 
 ive tested this on oberon (on an m4000) and schizo (on a v880).
 however, the bypass code isnt working on fire (v245s). to cope with
 that for now, the iommu_hw struct lets drivers mask flag bits that
 are handled when creating a dmamap. this means fire boards will
 ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
>>> 
>>> i figured it out. it turns out Fire was working fine. however,
>>> enabling 64bit dva on the onboard devices didnt work because the
>>> serverworks/broadcom pcie to pcix bridge can only handle dma addresses
>>> in the low 40 bits. because the fire bypass window is higher than
>>> this, the bridge would choke and things stopped working.
>>> 
>>> the updated diff attempts to handle this. basically when probing
>>> the bridge, the platform creates a custom dma tag for it. this tag
>>> intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
>>> handing it up to the parent bridge, which is pyro in my situation.
>>> it looks like early sun4v boxes could make use of this too.
>>> 
 i have not tested this on psycho yet. if anyone has such a machine
 and is willing to work with me to figure it out, please talk to me.
>>> 
>>> i still dont have psycho reports.
>> 
>> Would anyone object if I committed this? I've been running it for the
>> last release or two without issues, but with significant improvements in
>> performance on the machines involved.
> 
> At the price of giving all PCI devices unrestricted access to memory.
> 
> So I'm not eager to do this, especially since on sun4v hardware bypassing
> the iommu isn't possible as soon as multiple domains are enabled.  And
> we lose a useful diagnostic when developing drivers.  Are you sure the
> iommu overhead can't be reduced some other way?  At some point we
> probably want to add iommu support on amd64 and arm64, but if that
> comes with a similar overhead as on sparc64 that's going to be a bit
> of an issue.

First, note that it doesn't turn the iommu off. By default drivers still go 
through it unless they opt out with BUS_DMA_64BIT. This is because the iommu is 
still in between the device and ram, and it provides the passthru window up at 
0xfffc. 

As an aside, and as hinted at in my previous mails, it means that devices with 
ppb6 at pci6 dev 0 function 0 "ServerWorks PCIE-PCIX" rev 0xb5 in them cannot 
really use BUS_DMA_64BIT cos those bridges are buggy and don't handle DVAs 
above 48 or 56 bits or something. That bridge is used in v215s, v245s, v445s, 
t1000s, and so on.

I have a theory that because of that bridge, there was a meme going around Sun 
at the time that it was cheaper to memcpy in and out of preallocated DMA memory 
than it was to do DMA for every packet or disk I/O or whatever.

Which leads me to the conclusion that an alternative to using the passthru 
window would be to have bus_dma preallocate the dmaable memory and bounce in 
and out of it. The performance hit I'm trying to avoid is with setting up and 
tearing down the transaction table entries. If they already exist, you avoid 
that hit.

Bouncing is complicated though, both in the bus_dma layer, and especially by 
pushing it into drivers.

The amount of overhead varies between machines. It seems less of a difference 
with nvme(4) in a slot that is not behind the dodgy bridge on a v245. It was 
about 20 or 30 percent of a difference with gem(4) and tcpbench in a v880 
(schizo). It is particularly bad on the M4000 I have, this is why I looked into 
this. There are orders of magnitude of difference between tcpbench results with 
a tweaked ix(4) and this diff on or off. We've not enabled mitigations before 
because of performance hits less than this.

dlg

> 
>>> Index: dev/iommu.c
>>> ===

Re: lib/libfuse: Handle signals that get sent to any thread

2018-10-19 Thread Klemens Nanni
On Fri, Oct 19, 2018 at 02:03:59PM -0700, Rian Hunter wrote:
> Gentle bump on this. Would be nice if this were fixed for the next release.
> 
> On 2018-10-07 10:55, Rian Hunter wrote:
> > (re-posting because I realized the last patch didn't apply cleanly)
Your patch does not apply due to whitespace issues, your MUA is probably
the culprit.  I suggest you mail diffs to yourself first and try to
apply them as reviewers would have to do.

Even after fixing it with `sed "s/^  / /"', your diff does not compile:

/usr/src/lib/libfuse/fuse.c:496:20: error: use of undeclared identifier 
'SIG_SIGCHLD'
signal(SIGPIPE, SIG_SIGCHLD);

> + if (sigaction(SIGCHLD, NULL, &old_sa) == 0)
> + if (old_sa.sa_handler == SIG_IGN)
> + signal(SIGPIPE, SIG_SIGCHLD);
It's SIGCHLD or SIG_DFL, see signal(3).



Re: lib/libfuse: Handle signals that get sent to any thread

2018-10-19 Thread Rian Hunter
Gentle bump on this. Would be nice if this were fixed for the next 
release.


On 2018-10-07 10:55, Rian Hunter wrote:

(re-posting because I realized the last patch didn't apply cleanly)

lib/libfuse/fuse.c was using a EINTR return from kevent() to detect
when a signal had occurred, but in a multi-threaded process the signal
may not have been delivered to the thread blocking on kevent(). This
changes makes it so the signals are caught using EVFILT_SIGNAL filters
so they can be detected even when they happen on another thread.


Index: lib/libfuse/fuse.c
===
RCS file: /cvs/src/lib/libfuse/fuse.c,v
retrieving revision 1.49
diff -u -p -u -p -r1.49 fuse.c
--- lib/libfuse/fuse.c  5 Jul 2018 10:57:31 -   1.49
+++ lib/libfuse/fuse.c  5 Oct 2018 22:13:41 -
@@ -32,8 +32,6 @@
 #include "fuse_private.h"
 #include "debug.h"

-static volatile sig_atomic_t sigraised = 0;
-static volatile sig_atomic_t signum = 0;
 static struct fuse_context *ictx = NULL;

 enum {
@@ -116,21 +114,10 @@ static struct fuse_opt fuse_mount_opts[]
 };

 static void
-ifuse_sighdlr(int num)
-{
-   if (!sigraised || (num == SIGCHLD)) {
-   sigraised = 1;
-   signum = num;
-   }
-}
-
-static void
 ifuse_try_unmount(struct fuse *f)
 {
pid_t child;

-   signal(SIGCHLD, ifuse_sighdlr);
-
/* unmount in another thread so fuse_loop() doesn't deadlock */
child = fork();

@@ -152,7 +139,6 @@ ifuse_child_exit(const struct fuse *f)
 {
int status;

-   signal(SIGCHLD, SIG_DFL);
if (waitpid(WAIT_ANY, &status, WNOHANG) == -1)
fprintf(stderr, "fuse: %s\n", strerror(errno));

@@ -160,7 +146,6 @@ ifuse_child_exit(const struct fuse *f)
fprintf(stderr, "fuse: %s: %s\n",
f->fc->dir, strerror(WEXITSTATUS(status)));

-   sigraised = 0;
return;
 }

@@ -170,6 +155,7 @@ fuse_loop(struct fuse *fuse)
struct fusebuf fbuf;
struct fuse_context ctx;
struct fb_ioctl_xch ioexch;
+   struct kevent event[5];
struct kevent ev;
ssize_t n;
int ret;
@@ -181,28 +167,39 @@ fuse_loop(struct fuse *fuse)
if (fuse->fc->kq == -1)
return (-1);

-   EV_SET(&fuse->fc->event, fuse->fc->fd, EVFILT_READ, EV_ADD |
+   EV_SET(&event[0], fuse->fc->fd, EVFILT_READ, EV_ADD |
EV_ENABLE, 0, 0, 0);

+   /* signal events */
+   EV_SET(&event[1], SIGCHLD, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+   EV_SET(&event[2], SIGHUP, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+   EV_SET(&event[3], SIGINT, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+   EV_SET(&event[4], SIGTERM, EVFILT_SIGNAL, EV_ADD |
+   EV_ENABLE, 0, 0, 0);
+
while (!fuse->fc->dead) {
-   ret = kevent(fuse->fc->kq, &fuse->fc->event, 1, &ev, 1, NULL);
+   ret = kevent(fuse->fc->kq, &event[0], 5, &ev, 1, NULL);
if (ret == -1) {
-   if (errno == EINTR) {
-   switch (signum) {
-   case SIGCHLD:
-   ifuse_child_exit(fuse);
-   break;
-   case SIGHUP:
-   case SIGINT:
-   case SIGTERM:
-   ifuse_try_unmount(fuse);
-   break;
-   default:
-   fprintf(stderr, "%s: %s\n", __func__,
-   strsignal(signum));
-   }
-   } else
+   if (errno != EINTR)
DPERROR(__func__);
+   } else if (ret > 0 && ev.filter == EVFILT_SIGNAL) {
+   int signum = ev.ident;
+   switch (signum) {
+   case SIGCHLD:
+   ifuse_child_exit(fuse);
+   break;
+   case SIGHUP:
+   case SIGINT:
+   case SIGTERM:
+   ifuse_try_unmount(fuse);
+   break;
+   default:
+   fprintf(stderr, "%s: %s\n", __func__,
+   strsignal(signum));
+   }
} else if (ret > 0) {
n = read(fuse->fc->fd, &fbuf, sizeof(fbuf));
if (n != sizeof(fbuf)) {
@@ -479,20 +476,24 @@ fuse_remove_signal_handlers(unused struc
struct sigaction old_sa;

if (sigaction(SIGHUP, NULL, &old_sa) == 0)
-   if (old_sa.sa_handler == ifuse_sighdlr)
+   if (old_sa.sa_handl

Re: [RFC PATCH] [vmm] Lightweight mechanism to kick virtio VQs

2018-10-19 Thread Sergio Lopez
On Wed, Oct 17, 2018 at 02:22:23PM -0700, Mike Larkin wrote:
> On Fri, Oct 05, 2018 at 12:07:13PM +0200, Sergio Lopez wrote:
> > Hi,
> > 
> > This patch implements a mechanism to allow users register an I/O port
> > with a special file descriptor (kickfd) which can be monitored for
> > events using kevent. The kernel will note an event each time the Guest
> > writes to an I/O port registered with a kickfd.
> 
> > 
> > This is mainly intended for kicking virtio virtqueues, and allows the
> > VCPU to return immediately to the Guest, instead of having to exit all
> > the way to vmd to process the request. With kickfd, the actual work
> > (i.e. an I/O operation) requested by the Guest can be done in a
> > parallel thread that eventually asserts the corresponding IRQ.
> 
> The reason our virtio is so slow today is, as you point out, we process
> the I/O sequentially with the queue notify back in vmd. To this end, there
> are three "performance overhead components" involved:
> 
> 1. the amount of time it takes to return from the vmm run ioctl back to
>vmd
> 2. the time it takes to process the events in the queue
> 3. the time it takes to re-enter the vm back to the point after the OUT
>instruction that did the queue notify.
> 
> I've benchmarked this before; the biggest performance penalty here is in
> #2. The exit and re-entry overheads are minuscule compared to the overhead
> of the I/O itself.

I think that, in addition to the absolute number of cycles spent on each
approach, we should also consider where they are being spent.

The kickfd approach is probably more expensive (ignoring TLB flushes,
which may end up making it cheaper) than having the vCPU do the actual I/O,
but it allows the latter to return much earlier to the Guest, resulting
in a significantly lower submission latency and more CPU time available
for the VM.

> Optimizing #2 is therefore the most bang for the buck. We (dlg@ and I)
> had a taskq-based diff to do this previously (I'm not sure what state it
> was left in). In that diff, the I/O was processed in a set of worker threads,
> effectively eliminating the overhead for #2.

I had to partially rewrite the vioblk_notifyq function, as it didn't
play nice with concurrent access (Guest + iothread), but I haven't
implemented any optimization for #2.

There's the option of merging requests and using worker threads to
avoid having the iothread actively waiting for the I/O, but this
would significantly increase the complexity of vmd's virtio code, so
it must be done with care (it's the problem behind *many* qemu issues).

> Your diff (if I read it correctly), optimizes all 3 cases, and that's good.
> I'll admit it's not a way I thought of solving the performance issue, and
> I'm not sure if this is the right way or if the taskq way is better. The
> taskq way is a bit cleaner and easier to understand (there is no separate
> backchannel from vmm->vmd, but it doesn't look too crazy from what I
> read below).
> 
> Do you have some performance measurements with this diff applied vs without?

Yes, I ran some random read and random write tests with fio. The Guest
is Linux, mainly because its virtio implementation is more mature. Next
week I'll try with OpenBSD.

In a general sense, the kickfd approach is similar or slightly better
than the current one when generating I/O from just one process, and
significantly better (up to 2x the throughput) with 2 and 4 processes.

The test machine is an AMD Phenom II X4 (yeah, not the best thing for
this kind of tests, but the only non-virtual environment I have here)
with a Samsung SSD Evo 850. OpenBSD's kernel is running with HZ=1000 and
Linux's with HZ=100 (otherwise I couldn't get reliable numbers due to
missing ticks). The target device is backed by a raw image.

I'm pasting the raw output for fio below.

Sergio (slp).


==================
| Linux + kickfd |
==================

 - randread

[root@localhost ~]# fio --clocksource=clock_gettime --name=randread 
--rw=randread --bs=4k --filename=/dev/vdb --size=1g --numjobs=1 --ioengine=sync 
--iodepth=1 --direct=1 --group_reporting 
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=sync, iodepth=1
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [r(1)][95.7%][r=49.3MiB/s,w=0KiB/s][r=12.6k,w=0 IOPS][eta 
00m:01s]
randread: (groupid=0, jobs=1): err= 0: pid=318: Fri Oct 19 18:36:02 2018
   read: IOPS=11.5k, BW=44.8MiB/s (46.0MB/s)(1024MiB/22850msec)
clat (nsec): min=0, max=2k, avg=85372.92, stdev=924160.90
 lat (nsec): min=0, max=2k, avg=85525.51, stdev=924971.98
clat percentiles (nsec):
 |  1.00th=[   0],  5.00th=[   0], 10.00th=[   0],
 | 20.00th=[   0], 30.00th=[   0], 40.00th=[   0],
 | 50.00th=[   0], 60.00th=[   0], 70.00th=[   0],
 | 80.00th=[   0], 90.00th=[   0], 95.00th=[   0],
 | 99.00th=[   0], 99.50th=[10027008], 99.90th=[10027008],
 | 99.95th=[10027008], 99.99th=[10027008]
  

Re: bgpd throttling for peers

2018-10-19 Thread Denis Fondras
On Fri, Oct 19, 2018 at 05:51:20PM +0200, Claudio Jeker wrote:
> On Wed, Oct 17, 2018 at 02:37:48PM +0200, Claudio Jeker wrote:
> > I noticed that the throttling for peers which was added some time ago is
> > incomplete. The following diff solved these issues.
> > 
> > In rde_update_queue_runner() only process peers which are currently not
> > throttled. Additionally only run the runners if not too many imsg are
> > pending in the RDE. Both these changes should help improve responsiveness.
> > In the SE only set the throttled flag if the imsg sending was successful.
> > 
> > Does work fine on my systems with little or no effect on convergence time.
> > Please test on your systems and look for performance changes.
> 
> Updated version after recent commit.
> 

OK denis@

> -- 
> :wq Claudio
> 
> Index: rde.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
> retrieving revision 1.437
> diff -u -p -r1.437 rde.c
> --- rde.c 18 Oct 2018 12:19:09 -  1.437
> +++ rde.c 19 Oct 2018 15:49:43 -
> @@ -334,15 +334,16 @@ rde_main(int debug, int verbose)
>   mctx = LIST_NEXT(mctx, entry);
>   }
>  
> - rde_update_queue_runner();
> - for (aid = AID_INET6; aid < AID_MAX; aid++)
> - rde_update6_queue_runner(aid);
> + if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) {
> + rde_update_queue_runner();
> + for (aid = AID_INET6; aid < AID_MAX; aid++)
> + rde_update6_queue_runner(aid);
> + }
>   if (rde_dump_pending() &&
>   ibuf_se_ctl && ibuf_se_ctl->w.queued <= 10)
>   rde_dump_runner();
> - if (softreconfig) {
> + if (softreconfig)
>   rde_reload_runner();
> - }
>   }
>  
>   /* do not clean up on shutdown on production, it takes ages. */
> @@ -2664,6 +2665,8 @@ rde_update_queue_runner(void)
>   continue;
>   if (peer->state != PEER_UP)
>   continue;
> + if (peer->throttled)
> + continue;
>   eor = 0;
>   /* first withdraws */
>   wpos = 2; /* reserve space for the length field */
> @@ -2730,6 +2733,8 @@ rde_update6_queue_runner(u_int8_t aid)
>   continue;
>   if (peer->state != PEER_UP)
>   continue;
> + if (peer->throttled)
> + continue;
>   len = sizeof(queue_buf) - MSGSIZE_HEADER;
>   b = up_dump_mp_unreach(queue_buf, &len, peer, aid);
>  
> @@ -2753,6 +2758,8 @@ rde_update6_queue_runner(u_int8_t aid)
>   if (peer->conf.id == 0)
>   continue;
>   if (peer->state != PEER_UP)
> + continue;
> + if (peer->throttled)
>   continue;
>   len = sizeof(queue_buf) - MSGSIZE_HEADER;
>   r = up_dump_mp_reach(queue_buf, &len, peer, aid);
> Index: session.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpd/session.c,v
> retrieving revision 1.369
> diff -u -p -r1.369 session.c
> --- session.c 29 Sep 2018 07:58:06 -  1.369
> +++ session.c 17 Oct 2018 12:18:51 -
> @@ -1382,7 +1382,8 @@ session_sendmsg(struct bgp_msg *msg, str
>   if (!p->throttled && p->wbuf.queued > SESS_MSG_HIGH_MARK) {
>   if (imsg_rde(IMSG_XOFF, p->conf.id, NULL, 0) == -1)
>   log_peer_warn(&p->conf, "imsg_compose XOFF");
> - p->throttled = 1;
> + else
> + p->throttled = 1;
>   }
>  
>   free(msg);
> @@ -1773,7 +1774,8 @@ session_dispatch_msg(struct pollfd *pfd,
>   if (p->throttled && p->wbuf.queued < SESS_MSG_LOW_MARK) {
>   if (imsg_rde(IMSG_XON, p->conf.id, NULL, 0) == -1)
>   log_peer_warn(&p->conf, "imsg_compose XON");
> - p->throttled = 0;
> + else
> + p->throttled = 0;
>   }
>   if (!(pfd->revents & POLLIN))
>   return (1);
> 



Re: vmd: servicing virtio devices from separate processes

2018-10-19 Thread Sergio Lopez
On Wed, Oct 17, 2018 at 02:04:46PM -0700, Mike Larkin wrote:
> On Fri, Oct 05, 2018 at 01:50:10PM +0200, Sergio Lopez wrote:
> > Right now, vmd already features an excellent privsep model to ensure
> > the process servicing the VM requests to the outside world is running
> > with the lowest possible privileges.
> > 
> > I was wondering if we could take a step further, servicing each virtio
> > device from a different process. This design would simplify the
> > implementation and maintenance of those devices, improve the privsep
> > model and increase the resilience of VMs (the crash of a process
> > servicing a device won't bring down the whole VM, and a mechanism to
> > recover from this scenario could be explored).
> 
> Our model is generally to not try to recover from crashes like that. Indeed,
> you *want* to crash so the underlying bug can be found and fixed.

With separate processes you'll still have a crash with a core dump from
the process servicing the device, with the additional advantage of being
able to, optionally, try to recover the device or debug it independently.

> >  - An in-kernel IRQ chip. This one is the most controversial, as it
> > means emulating a device from a privileged domain, but I'm pretty sure
> > a lapic implementation with enough functionality to serve *BSD/Linux
> > Guests can be small and simple enough to be easily auditable.
> 
> This needs to be done, for a number of reasons (device emulation being just
> one). pd@ and I are working on how to implement this using features in
> recent CPUs, since much of the LAPIC emulation can now be handled by the
> CPU itself. We're thinking skylake and later will be the line in the sand
> for this. Otherwise the software emulation is more complex and more prone
> to bugs. I've resisted the urge to put this stuff in the kernel for exactly
> that reason, but with later model CPUs we may be in better shape. We may
> also decide to focus solely on x2APIC. If you're interested in helping in
> this area, I'll keep you in the loop.

Sure, I'd like to help. I'm quite familiar with KVM's in-kernel irqchip
implementation.

> > Do you think it's worth exploring this model? What are feelings
> > regarding the in-kernel IRQ chip?
> > 
> > Sergio (slp).
> > 
> 
> All things considered, I'm less sold on the idea of splitting out devices
> into their own processes. I don't see any compelling reason. But we do
> need an IOAPIC and LAPIC implementation at some point, as you point out.

Well, vmd is still small enough (thankfully) to be able to debug it easily.
But, eventually, it'll grow in both size and complexity (especially with
SMP support, I/O optimizations, additional storage backends...) and
having a high degree of modularity really helps here.

In fact, the QEMU/KVM community is starting to consider going this route
(but QEMU is *huge*). [1]

Anyways, this is not a "now-or-never" thing. As you suggest, we can work
now on kickfd and IOAPIC/LAPIC, which are useful by themselves. And when
those are in place, I'll be able to write a PoC so we can evaluate its
benefits and drawbacks.

Sergio (slp).


[1] https://www.linux-kvm.org/images/f/fc/KVM_FORUM_multi-process.pdf



igmp_slowtimo: drop NETLOCK

2018-10-19 Thread Scott Cheloha
Hi,

If we introduce a mutex for the igmp module we can drop the NETLOCK
from igmp_slowtimo().  The router_info list, rti_head, and its entries
are only ever accessed within the igmp module.  The mutex also needs
to protect igmp_timers_are_running.

This is largely a monkey-see-monkey-do imitation of what visa@ did for
frag6.c.  Obviously any errors are mine.

There are spots in igmp_input_if() where we could push the mutex into
smaller sections of the switch statement.  I'm under the impression that
IGMP is a cooler path, though, so that additional complexity is probably
pointless, but I figure it's worth mentioning.  Maybe on a later pass.

I must admit I only have a small test setup for multipath routing.  We
also don't have any regress tests for IGMP (no idea how tricky those
would be to write).  The introduction of the mutex was pretty
straightforward, though, given the size of igmp.c and the fact that the
mutex is module-local.

Thoughts?  ok?

-Scott

P.S. It doesn't look like we need KERNEL_LOCK in igmp_input_if() until
rip_input().  Is that fair game, or is there a larger plan to coordinate
moving KERNEL_LOCK into various *_input_if() routines later?

Index: sys/netinet/igmp.c
===
RCS file: /cvs/src/sys/netinet/igmp.c,v
retrieving revision 1.74
diff -u -p -r1.74 igmp.c
--- sys/netinet/igmp.c  18 Oct 2018 15:46:28 -  1.74
+++ sys/netinet/igmp.c  19 Oct 2018 17:07:31 -
@@ -1,4 +1,4 @@
-/* $OpenBSD: igmp.c,v 1.74 2018/10/18 15:46:28 cheloha Exp $   */
+/* $OpenBSD: igmp.c,v 1.73 2018/10/18 15:23:04 cheloha Exp $   */
 /* $NetBSD: igmp.c,v 1.15 1996/02/13 23:41:25 christos Exp $   */
 
 /*
@@ -77,6 +77,8 @@
 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -98,7 +100,13 @@
 
 int *igmpctl_vars[IGMPCTL_MAXID] = IGMPCTL_VARS;
 
-intigmp_timers_are_running;
+/*
+ * Protects igmp_timers_are_running and rti_head.  Note that rti_head needs
+ * to be protected when one of its entries is accessed via in_multi->rti.
+ */
+struct mutex igmp_mutex = MUTEX_INITIALIZER(IPL_SOFTNET);
+
+int igmp_timers_are_running;
 static LIST_HEAD(, router_info) rti_head;
 static struct mbuf *router_alert;
 struct cpumem *igmpcounters;
@@ -150,6 +158,8 @@ rti_fill(struct in_multi *inm)
 {
struct router_info *rti;
 
+   MUTEX_ASSERT_LOCKED(&igmp_mutex);
+
LIST_FOREACH(rti, &rti_head, rti_list) {
if (rti->rti_ifidx == inm->inm_ifidx) {
inm->inm_rti = rti;
@@ -176,6 +186,8 @@ rti_find(struct ifnet *ifp)
struct router_info *rti;
 
KERNEL_ASSERT_LOCKED();
+   MUTEX_ASSERT_LOCKED(&igmp_mutex);
+
LIST_FOREACH(rti, &rti_head, rti_list) {
if (rti->rti_ifidx == ifp->if_index)
return (rti);
@@ -195,13 +207,18 @@ rti_delete(struct ifnet *ifp)
 {
struct router_info *rti, *trti;
 
+   mtx_enter(&igmp_mutex);
+
LIST_FOREACH_SAFE(rti, &rti_head, rti_list, trti) {
if (rti->rti_ifidx == ifp->if_index) {
LIST_REMOVE(rti, rti_list);
+   mtx_leave(&igmp_mutex);
free(rti, M_MRTABLE, sizeof(*rti));
-   break;
+   return;
}
}
+
+   mtx_leave(&igmp_mutex);
 }
 
 int
@@ -271,6 +288,8 @@ igmp_input_if(struct ifnet *ifp, struct 
m->m_len += iphlen;
ip = mtod(m, struct ip *);
 
+   mtx_enter(&igmp_mutex);
+
switch (igmp->igmp_type) {
 
case IGMP_HOST_MEMBERSHIP_QUERY:
@@ -282,6 +301,7 @@ igmp_input_if(struct ifnet *ifp, struct 
if (igmp->igmp_code == 0) {
rti = rti_find(ifp);
if (rti == NULL) {
+   mtx_leave(&igmp_mutex);
m_freem(m);
return IPPROTO_DONE;
}
@@ -289,6 +309,7 @@ igmp_input_if(struct ifnet *ifp, struct 
rti->rti_age = 0;
 
if (ip->ip_dst.s_addr != INADDR_ALLHOSTS_GROUP) {
+   mtx_leave(&igmp_mutex);
igmpstat_inc(igps_rcv_badqueries);
m_freem(m);
return IPPROTO_DONE;
@@ -314,6 +335,7 @@ igmp_input_if(struct ifnet *ifp, struct 
}
} else {
if (!IN_MULTICAST(ip->ip_dst.s_addr)) {
+   mtx_leave(&igmp_mutex);
igmpstat_inc(igps_rcv_badqueries);
m_freem(m);
return IPPROTO_DONE;
@@ -371,6 +393,7 @@ igmp_input_if(struct ifnet *ifp, struct 
 
if (!IN_MULTICAST(igmp->igmp_group.s_addr) ||
igmp->igmp_group.s_addr != ip->i

Re: bgpd throttling for peers

2018-10-19 Thread Claudio Jeker
On Wed, Oct 17, 2018 at 02:37:48PM +0200, Claudio Jeker wrote:
> I noticed that the throttling for peers which was added some time ago is
> incomplete. The following diff solved these issues.
> 
> In rde_update_queue_runner() only process peers which are currently not
> throttled. Additionally only run the runners if not too many imsg are
> pending in the RDE. Both these changes should help improve responsiveness.
> In the SE only set the throttled flag if the imsg sending was successful.
> 
> Does work fine on my systems with little or no effect on convergence time.
> Please test on your systems and look for performance changes.

Updated version after recent commit.

-- 
:wq Claudio

Index: rde.c
===
RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
retrieving revision 1.437
diff -u -p -r1.437 rde.c
--- rde.c   18 Oct 2018 12:19:09 -  1.437
+++ rde.c   19 Oct 2018 15:49:43 -
@@ -334,15 +334,16 @@ rde_main(int debug, int verbose)
mctx = LIST_NEXT(mctx, entry);
}
 
-   rde_update_queue_runner();
-   for (aid = AID_INET6; aid < AID_MAX; aid++)
-   rde_update6_queue_runner(aid);
+   if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) {
+   rde_update_queue_runner();
+   for (aid = AID_INET6; aid < AID_MAX; aid++)
+   rde_update6_queue_runner(aid);
+   }
if (rde_dump_pending() &&
ibuf_se_ctl && ibuf_se_ctl->w.queued <= 10)
rde_dump_runner();
-   if (softreconfig) {
+   if (softreconfig)
rde_reload_runner();
-   }
}
 
/* do not clean up on shutdown on production, it takes ages. */
@@ -2664,6 +2665,8 @@ rde_update_queue_runner(void)
continue;
if (peer->state != PEER_UP)
continue;
+   if (peer->throttled)
+   continue;
eor = 0;
/* first withdraws */
wpos = 2; /* reserve space for the length field */
@@ -2730,6 +2733,8 @@ rde_update6_queue_runner(u_int8_t aid)
continue;
if (peer->state != PEER_UP)
continue;
+   if (peer->throttled)
+   continue;
len = sizeof(queue_buf) - MSGSIZE_HEADER;
b = up_dump_mp_unreach(queue_buf, &len, peer, aid);
 
@@ -2753,6 +2758,8 @@ rde_update6_queue_runner(u_int8_t aid)
if (peer->conf.id == 0)
continue;
if (peer->state != PEER_UP)
+   continue;
+   if (peer->throttled)
continue;
len = sizeof(queue_buf) - MSGSIZE_HEADER;
r = up_dump_mp_reach(queue_buf, &len, peer, aid);
Index: session.c
===
RCS file: /cvs/src/usr.sbin/bgpd/session.c,v
retrieving revision 1.369
diff -u -p -r1.369 session.c
--- session.c   29 Sep 2018 07:58:06 -  1.369
+++ session.c   17 Oct 2018 12:18:51 -
@@ -1382,7 +1382,8 @@ session_sendmsg(struct bgp_msg *msg, str
if (!p->throttled && p->wbuf.queued > SESS_MSG_HIGH_MARK) {
if (imsg_rde(IMSG_XOFF, p->conf.id, NULL, 0) == -1)
log_peer_warn(&p->conf, "imsg_compose XOFF");
-   p->throttled = 1;
+   else
+   p->throttled = 1;
}
 
free(msg);
@@ -1773,7 +1774,8 @@ session_dispatch_msg(struct pollfd *pfd,
if (p->throttled && p->wbuf.queued < SESS_MSG_LOW_MARK) {
if (imsg_rde(IMSG_XON, p->conf.id, NULL, 0) == -1)
log_peer_warn(&p->conf, "imsg_compose XON");
-   p->throttled = 0;
+   else
+   p->throttled = 0;
}
if (!(pfd->revents & POLLIN))
return (1);



Re: multicast.4: drop unneeded (void *) casts

2018-10-19 Thread Scott Cheloha
On Thu, Oct 18, 2018 at 05:38:42PM -0900, Philip Guenther wrote:
> On Thu, Oct 18, 2018 at 4:00 PM  wrote:
> 
> > getsockopt(2) and setsockopt(2) take a void pointer for the third
> > parameter so there is no need for any of these casts.
> >
> 
> Yep.
> 
> 
> > I can do other style(9)-type cleanup in a subsequent diff or I can
> > cook this diff and do it all in one.
> >
> 
> > Thoughts, preferences?
> >
> 
> I would prefer that the line-wrapping adjustments enabled by the deletion of
> the casts be done in the same commit.  The ones to change are, IMHO, called
> out below.
> 
> 
> [...]

Sure, looks cleaner.  ok?

Index: share/man/man4/multicast.4
===
RCS file: /cvs/src/share/man/man4/multicast.4,v
retrieving revision 1.12
diff -u -p -r1.12 multicast.4
--- share/man/man4/multicast.4  7 Mar 2018 09:54:23 -   1.12
+++ share/man/man4/multicast.4  19 Oct 2018 15:15:21 -
@@ -138,17 +138,17 @@ or disable multicast forwarding in the k
 .Bd -literal -offset 5n
 /* IPv4 */
 int v = 1;/* 1 to enable, or 0 to disable */
-setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
+setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, &v, sizeof(v));
 .Ed
 .Bd -literal -offset 5n
 /* IPv6 */
 int v = 1;/* 1 to enable, or 0 to disable */
-setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
+setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, &v, sizeof(v));
 \&...
 /* If necessary, filter all ICMPv6 messages */
 struct icmp6_filter filter;
 ICMP6_FILTER_SETBLOCKALL(&filter);
-setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
+setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, &filter,
sizeof(filter));
 .Ed
 .Pp
@@ -168,8 +168,7 @@ memcpy(&vc.vifc_lcl_addr, &vif_local_add
 if (vc.vifc_flags & VIFF_TUNNEL)
 memcpy(&vc.vifc_rmt_addr, &vif_remote_address,
sizeof(vc.vifc_rmt_addr));
-setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
-   sizeof(vc));
+setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
 .Ed
 .Pp
 The
@@ -205,8 +204,7 @@ memset(&mc, 0, sizeof(mc));
 mc.mif6c_mifi = mif_index;
 mc.mif6c_flags = mif_flags;
 mc.mif6c_pifi = pif_index;
-setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
-   sizeof(mc));
+setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, &mc, sizeof(mc));
 .Ed
 .Pp
 The
@@ -226,13 +224,13 @@ A multicast interface is deleted by:
 .Bd -literal -offset indent
 /* IPv4 */
 vifi_t vifi = vif_index;
-setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
+setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, &vifi,
sizeof(vifi));
 .Ed
 .Bd -literal -offset indent
 /* IPv6 */
 mifi_t mifi = mif_index;
-setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
+setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, &mifi,
sizeof(mifi));
 .Ed
 .Pp
@@ -301,8 +299,7 @@ memcpy(&mc.mfcc_mcastgrp, &group_addr, s
 mc.mfcc_parent = iif_index;
 for (i = 0; i < maxvifs; i++)
 mc.mfcc_ttls[i] = oifs_ttl[i];
-setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
-   (void *)&mc, sizeof(mc));
+setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC, &mc, sizeof(mc));
 .Ed
 .Bd -literal -offset indent
 /* IPv6 */
@@ -314,8 +311,7 @@ mc.mf6cc_parent = iif_index;
 for (i = 0; i < maxvifs; i++)
 if (oifs_ttl[i] > 0)
 IF_SET(i, &mc.mf6cc_ifset);
-setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_ADD_MFC,
-   (void *)&mc, sizeof(mc));
+setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_ADD_MFC, &mc, sizeof(mc));
 .Ed
 .Pp
 The
@@ -344,8 +340,7 @@ struct mfcctl mc;
 memset(&mc, 0, sizeof(mc));
 memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
 memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
-setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
-   (void *)&mc, sizeof(mc));
+setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC, &mc, sizeof(mc));
 .Ed
 .Bd -literal -offset indent
 /* IPv6 */
@@ -353,8 +348,7 @@ struct mf6cctl mc;
 memset(&mc, 0, sizeof(mc));
 memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
 memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mf6cc_mcastgrp));
-setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_DEL_MFC,
-   (void *)&mc, sizeof(mc));
+setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_DEL_MFC, &mc, sizeof(mc));
 .Ed
 .Pp
 The following method can be used to get various statistics per
@@ -453,7 +447,7 @@ and
 An example:
 .Bd -literal -offset 3n
 uint32_t v;
-getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, sizeof(v));
+getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, &v, sizeof(v));
 .Ed
 .Pp
 This would set
@@ -478,10 +472,8 @@ would fail.
 To modify the API, and to set some specific feature in the kernel, then:
 .Bd -literal -offset 3n
 uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
-if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
-!= 0) {
+if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, &v, sizeof(v)

Re: bypass support for iommu on sparc64

2018-10-19 Thread Andrew Grillet
Is the setup and teardown done per transfer, or when the file is opened
and closed?  Or is it set up once per context switch of the task?

I am partly interested because I would like to improve mt one day (as a
user of tape and Sparc64 Txxx) if I get the time.

Andrew



On Fri, 19 Oct 2018 at 10:22, Mark Kettenis  wrote:

> > Date: Fri, 19 Oct 2018 10:22:30 +1000
> > From: David Gwynne 
> >
> > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > > setting up and tearing down the translation table entries (TTEs)
> > > > is very expensive. so expensive that the cost of doing it for disk
> > > > io has a noticeable impact on compile times.
> > > >
> > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > > to bypass the iommu for devices that set that flag, therefore
> > > > avoiding the cost of handling the TTEs.
> > > >
> > > > the following diff adds support for bypass mappings to the iommu
> > > > code on sparc64. it's based on a diff from kettenis@ back in 2009.
> > > > the main changes are around coping with the differences between
> > > > schizo/psycho and fire/oberon.
> > > >
> > > > the differences between the chips are now represented by a iommu_hw
> > > > struct. these differences include how to enable the iommu (now via
> > > > a function pointer), and masks for bypass addresses.
> > > >
> > > > ive tested this on oberon (on an m4000) and schizo (on a v880).
> > > > however, the bypass code isnt working on fire (v245s). to cope with
> > > > that for now, the iommu_hw struct lets drivers mask flag bits that
> > > > are handled when creating a dmamap. this means fire boards will
> > > > ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
> > >
> > > i figured it out. it turns out Fire was working fine. however,
> > > enabling 64bit dva on the onboard devices didnt work because the
> > > serverworks/broadcom pcie to pcix bridge can only handle dma addresses
> > > in the low 40 bits. because the fire bypass window is higher than
> > > this, the bridge would choke and things stopped working.
> > >
> > > the updated diff attempts to handle this. basically when probing
> > > the bridge, the platform creates a custom dma tag for it. this tag
> > > intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
> > > handing it up to the parent bridge, which is pyro in my situation.
> > > it looks like early sun4v boxes could make use of this too.
> > >
> > > > i have not tested this on psycho yet. if anyone has such a machine
> > > > and is willing to work with me to figure it out, please talk to me.
> > >
> > > i still dont have psycho reports.
> >
> > Would anyone object if I committed this? I've been running it for the
> > last release or two without issues, but with significant improvements in
> > performance on the machines involved.
>
> At the price of giving all PCI devices unrestricted access to memory.
>
> So I'm not eager to do this, especially since on sun4v hardware bypassing
> the iommu isn't possible as soon as multiple domains are enabled.  And
> we lose a useful diagnostic when developing drivers.  Are you sure the
> iommu overhead can't be reduced some other way?  At some point we
> probably want to add iommu support on amd64 and arm64, but if that
> comes with a similar overhead as on sparc64 that's going to be a bit
> of an issue.
>
> > > Index: dev/iommu.c
> > > ===
> > > RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
> > > retrieving revision 1.74
> > > diff -u -p -r1.74 iommu.c
> > > --- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
> > > +++ dev/iommu.c 10 May 2017 12:00:09 -
> > > @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
> > >  void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
> > >  bus_addr_t, bus_size_t, int);
> > >
> > > +void iommu_hw_enable(struct iommu_state *);
> > > +
> > > +const struct iommu_hw iommu_hw_default = {
> > > +   .ihw_enable = iommu_hw_enable,
> > > +
> > > +   .ihw_dvma_pa= IOTTE_PAMASK,
> > > +
> > > +   .ihw_bypass = 0x3fffUL << 50,
> > > +   .ihw_bypass_nc  = 0,
> > > +   .ihw_bypass_ro  = 0,
> > > +};
> > > +
> > > +void
> > > +iommu_hw_enable(struct iommu_state *is)
> > > +{
> > > +   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
> > > +   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
> > > +}
> > > +
> > >  /*
> > >   * Initiate an STC entry flush.
> > >   */
> > > @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
> > >   * - create a private DVMA map.
> > >   */
> > >  void
> > > -iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t
> iovabase)
> > > +iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state
> *is,
> > > +int tsbsize, u_int32_t iovabase)
> > >  {
> > > psize_t si

Re: bypass support for iommu on sparc64

2018-10-19 Thread Mark Kettenis
> Date: Fri, 19 Oct 2018 10:22:30 +1000
> From: David Gwynne 
> 
> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > setting up and tearing down the translation table entries (TTEs)
> > > is very expensive. so expensive that the cost of doing it for disk
> > > io has a noticeable impact on compile times.
> > > 
> > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > to bypass the iommu for devices that set that flag, therefore
> > > avoiding the cost of handling the TTEs.
> > > 
> > > the following diff adds support for bypass mappings to the iommu
> > > code on sparc64. it's based on a diff from kettenis@ back in 2009.
> > > the main changes are around coping with the differences between
> > > schizo/psycho and fire/oberon.
> > > 
> > > the differences between the chips are now represented by a iommu_hw
> > > struct. these differences include how to enable the iommu (now via
> > > a function pointer), and masks for bypass addresses.
> > > 
> > > ive tested this on oberon (on an m4000) and schizo (on a v880).
> > > however, the bypass code isnt working on fire (v245s). to cope with
> > > that for now, the iommu_hw struct lets drivers mask flag bits that
> > > are handled when creating a dmamap. this means fire boards will
> > > ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
> > 
> > i figured it out. it turns out Fire was working fine. however,
> > enabling 64bit dva on the onboard devices didnt work because the
> > serverworks/broadcom pcie to pcix bridge can only handle dma addresses
> > in the low 40 bits. because the fire bypass window is higher than
> > this, the bridge would choke and things stopped working.
> > 
> > the updated diff attempts to handle this. basically when probing
> > the bridge, the platform creates a custom dma tag for it. this tag
> > intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
> > handing it up to the parent bridge, which is pyro in my situation.
> > it looks like early sun4v boxes could make use of this too.
> > 
> > > i have not tested this on psycho yet. if anyone has such a machine
> > > and is willing to work with me to figure it out, please talk to me.
> > 
> > i still dont have psycho reports.
> 
> Would anyone object if I committed this? I've been running it for the
> last release or two without issues, but with significant improvements in
> performance on the machines involved.

At the price of giving all PCI devices unrestricted access to memory.

So I'm not eager to do this, especially since on sun4v hardware bypassing
the iommu isn't possible as soon as multiple domains are enabled.  And
we lose a useful diagnostic when developing drivers.  Are you sure the
iommu overhead can't be reduced some other way?  At some point we
probably want to add iommu support on amd64 and arm64, but if that
comes with a similar overhead as on sparc64 that's going to be a bit
of an issue.

> > Index: dev/iommu.c
> > ===
> > RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
> > retrieving revision 1.74
> > diff -u -p -r1.74 iommu.c
> > --- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
> > +++ dev/iommu.c 10 May 2017 12:00:09 -
> > @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
> >  void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
> >  bus_addr_t, bus_size_t, int);
> >  
> > +void iommu_hw_enable(struct iommu_state *);
> > +
> > +const struct iommu_hw iommu_hw_default = {
> > +   .ihw_enable = iommu_hw_enable,
> > +
> > +   .ihw_dvma_pa= IOTTE_PAMASK,
> > +
> > +   .ihw_bypass = 0x3fffUL << 50,
> > +   .ihw_bypass_nc  = 0,
> > +   .ihw_bypass_ro  = 0,
> > +};
> > +
> > +void
> > +iommu_hw_enable(struct iommu_state *is)
> > +{
> > +   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
> > +   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
> > +}
> > +
> >  /*
> >   * Initiate an STC entry flush.
> >   */
> > @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
> >   * - create a private DVMA map.
> >   */
> >  void
> > -iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t 
> > iovabase)
> > +iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state *is,
> > +int tsbsize, u_int32_t iovabase)
> >  {
> > psize_t size;
> > vaddr_t va;
> > @@ -149,13 +169,9 @@ iommu_init(char *name, struct iommu_stat
> >  * be hard-wired, so we read the start and size from the PROM and
> >  * just use those values.
> >  */
> > -   if (strncmp(name, "pyro", 4) == 0) {
> > -   is->is_cr = IOMMUREG_READ(is, iommu_cr);
> > -   is->is_cr &= ~IOMMUCR_FIRE_BE;
> > -   is->is_cr |= (IOMMUCR_FIRE_SE | IOMMUCR_FIRE_CM_EN |
> > -   IOMMUCR_FIRE_TE);
> > -   } else 
> > - 

Re: multicast.4: drop unneeded (void *) casts

2018-10-19 Thread Klemens Nanni
OK kn with guenther's suggestions in.