> On 25 Jan 2020, at 10:57 am, Mark Kettenis <mark.kette...@xs4all.nl> wrote:
> 
> David Gwynne schreef op 2020-01-25 01:28:
>>> On 23 Jan 2020, at 10:38 pm, Mark Kettenis <mark.kette...@xs4all.nl> wrote:
>>> Martin Pieuchot schreef op 2020-01-23 11:28:
>>>> I'd like to make progress towards interrupting multiple CPUs in order to
>>>> one day make use of multiple queues in some network drivers.  The road
>>>> towards that goal is a long one and I'd like to proceed in steps to make
>>>> it easier to squash bugs.  I'm currently thinking of the following steps:
>>>> 1. Is my interrupt handler safe to be executed on CPU != CPU0?
>>> Except for things that are inherently tied to a specific CPU (clock
>>> interrupts, performance counters, etc) I think the answer here should
>>> always be "yes".
>> Agreed.
>>> It probably only makes sense for mpsafe handlers to run on secondary
>>> CPUs though.
>> Only because keeping !mpsafe handlers on one CPU means they're less
>> likely to need to spin against other !mpsafe interrupts on other CPUs
>> waiting for the kernel lock before they can execute. Otherwise this
>> shouldn't matter.
>>>> 2. Is it safe to execute this handler on two or more CPUs at the same
>>>>   time?
>>> I think that is never safe.  Unless you execute the handler on
>>> different "data".
>>> Running multiple rx interrupt handlers on different CPUs should be fine.
>> Agreed.
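
ie, "different data" in practice means each ring gets its own queue
struct, and that struct (not the shared softc) is the argument the
handler gets, so two instances running at once on different CPUs never
touch the same state. Something like this, with made-up names:

struct x_queue {			/* one per rx ring */
	struct x_softc		*q_sc;		/* shared, read-mostly */
	struct ifiqueue		*q_ifiq;	/* input queue this ring feeds */
	struct if_rxring	 q_rxr;		/* rx descriptor accounting */
	void			*q_ihc;		/* this vector's interrupt handle */
	unsigned int		 q_index;	/* ring number */
};
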
>>>> 3. How does interrupting multiple CPUs influence packet processing in
>>>>   the softnet thread?  Is any knowledge required (CPU affinity?) to
>>>>   have an optimum processing when multiple softnet threads are used?
>> I think this is my question to answer.
>> Packet sources (ie, rx rings) are supposed to be tied to a specific
>> nettq. Part of this is to avoid packet reordering where multiple
>> nettqs for one ring could overlap processing of packets for a single
>> TCP stream. The other part is so a busy nettq can apply backpressure
>> when it is overloaded to the rings that are feeding it.
>> Experience from other systems is that affinity does matter, but
>> running stuff in parallel matters more. Affinity between rings and
>> nettqs is something that can be worked on later.
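
Concretely, "tied to a specific nettq" means each ring feeds exactly one
ifiqueue, and backpressure comes back through the return value of
ifiq_input(). A rough sketch of a ring's rx path, not from any
particular driver and untested, using the q_* names from the struct
above:

	struct mbuf_list ml = MBUF_LIST_INITIALIZER();

	/* ... dequeue packets from ring q->q_index into ml ... */

	/*
	 * Every packet from this ring goes to the same input queue, so
	 * packets within a flow stay ordered.  If that nettq is
	 * overloaded ifiq_input() says so, and the driver can use that
	 * to slow the ring down.
	 */
	if (ifiq_input(q->q_ifiq, &ml))
		if_rxr_livelocked(&q->q_rxr);
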
>>>> 4. How to split traffic in one incoming NIC between multiple processing
>>>>   units?
>>> You'll need to have some sort of hardware filter that uses a hash of the
>>> packet header to assign an rx queue such that all packets from a single
>>> "flow" end up on the same queue and therefore will be processed by the
>>> same interrupt handler.
>> Yep.
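
For anyone following along at home, the property the hardware gives you
boils down to something like this. The real thing is usually a Toeplitz
hash plus an indirection table, but the idea is the same:

#include <sys/types.h>

/*
 * Illustration only: hash the flow's addresses and ports so every
 * packet of one TCP stream always maps to the same rx queue.
 */
static inline unsigned int
flow_to_queue(uint32_t saddr, uint32_t daddr,
    uint16_t sport, uint16_t dport, unsigned int nqueues)
{
	uint32_t h;

	h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);
	h ^= h >> 16;

	return (h % nqueues);
}
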
>>>> This new journey comes with the requirement of being able to interrupt
>>>> an arbitrary CPU.  For that we need a new API.  Patrick gave me the
>>>> diff below during u2k20 and I'd like to use it to start a discussion.
>>>> We currently have 6 drivers using pci_intr_map_msix().  Since we want to
>>>> be able to specify a CPU should we introduce a new function like in the
>>>> diff below or do we prefer to add a new argument (cpuid_t?) to this one?
>>>> This change in itself should already allow us to proceed with the first
>>>> item of the above list.
>>> I'm not sure you want to have the driver pick the CPU to which to assign the
>>> interrupt.  In fact I think that doesn't make sense at all.  The CPU
>>> should be picked by more generic code instead.  But perhaps we do need to
>>> pass a hint from the driver to that code.
>> Letting the driver pick the CPU is Good Enough(tm) today. It may limit
>> us to 70 or 80 percent of some theoretical maximum, but we don't have
>> the machinery to make a better decision on behalf of the driver at
>> this point. It is much better to start with something simple today
>> (ie, letting the driver pick the CPU) and improve on it after we hit
>> the limits with the simple thing.
>> I've also looked at how far dfly has got, and from what I can tell their
>> MSI-X stuff lets the driver pick the CPU. So it can't be too bad.
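
To make "the driver picks the CPU" concrete, the establish loop in a
multiqueue driver's attach could look something like the below.
pci_intr_map_msix_cpuid() is a made-up name standing in for whatever the
final API ends up being, and the x_* names are placeholders:

	for (i = 0; i < sc->sc_nqueues; i++) {
		struct x_queue *q = &sc->sc_queues[i];
		pci_intr_handle_t ih;

		/*
		 * Like pci_intr_map_msix(), but the driver also says
		 * which CPU the vector should target.  Spreading rings
		 * round-robin over the CPUs is the simple thing.
		 */
		if (pci_intr_map_msix_cpuid(pa, i, &ih, i % ncpus) != 0)
			goto fail;

		q->q_ihc = pci_intr_establish(pa->pa_pc, ih,
		    IPL_NET | IPL_MPSAFE, x_queue_intr, q, q->q_name);
		if (q->q_ihc == NULL)
			goto fail;
	}
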
>>>> Then we need a way to read the MSI-X control table size using the define
>>>> PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
>>>> want to print that information in dmesg, or maybe cache it in pci(4)?
>>> There are already defines for MSIX in pcireg.h, some of which are duplicated
>>> by the defines in this diff.  Don't think caching makes all that much sense.
>>> Don't think we need to print the table size in dmesg; pcidump(8) already
>>> prints it.  Might make sense to print the vector number though.
>> I'm ok with using pcidump(8) to see what a particular device
>> offers rather than having it in dmesg. I'd avoid putting vectors in
>> dmesg output, cos if you have a lot of rings there's going to be a lot of
>> dmesg output. Probably better to make vmstat -i more useful, or systat
>> mb.
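
As for reading the table size, it's just the message control word in the
top half of the dword at the MSI-X capability offset, and the field
encodes N-1, so an MI helper is tiny. Roughly this; the macro name is a
stand-in for whatever define we settle on:

#define PCI_MSIX_TBLSIZE(reg)	((((reg) >> 16) & 0x7ff) + 1)

int
pci_msix_table_size(pci_chipset_tag_t pc, pcitag_t tag)
{
	pcireg_t reg;

	/* no MSI-X capability means no vectors at all */
	if (pci_get_capability(pc, tag, PCI_CAP_MSIX, NULL, &reg) == 0)
		return (0);

	return (PCI_MSIX_TBLSIZE(reg));
}
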
>>>> Does somebody have a better/stronger/magic way to achieve this goal?
>>> I played a little bit with assigning interrupts to different CPUs in the
>>> past, but at that point this didn't really result in a performance boost.
>>> That was quite a while ago though.  I don't think there are fundamental
>>> problems in getting this going.
>> Well, packet processing still goes through a single nettq, and that's
>> the limit I hit on my firewalls. I have a lot of CARP, LACP and VLAN
>> stuff though, so my cost per packet is probably higher than most.
>> However, unless your workload is tpmr(4) without any filtering, I'd be
>> surprised if ISR handling was the limit you're hitting.
>>>> What do you think?
>> I think it needs to go in so we can talk about something else.
> 
> I think using a "cpu ID" is the wrong thing though.  These IDs are somewhat
> confusing and the lookup can be expensive.  I think having a pointer to
> the struct cpu_info * is better.  Unless maybe we're serious about wanting
> hotplug CPU support in OpenBSD in the near future.

Are you sure getting a struct cpu_info * is easier? I feel like the only time 
that's true is when you get to use curcpu(). Can't you go through 
cpu_cd.cd_devs to get from a number to a CPU?
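
ie, roughly this sort of thing, which is an array lookup rather than a
search. I'm hand-waving the MD bits: sc_info is how amd64's cpu_softc
spells it, other archs may differ, and I haven't compiled this:

#include <sys/device.h>
#include <machine/cpu.h>

extern struct cfdriver cpu_cd;

struct cpu_info *
cpu_find(int unit)
{
	struct cpu_softc *sc;

	if (unit < 0 || unit >= cpu_cd.cd_ndevs)
		return (NULL);

	sc = (struct cpu_softc *)cpu_cd.cd_devs[unit];
	if (sc == NULL)
		return (NULL);

	return (sc->sc_info);
}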

While we're still talking, don't CPUs attach late on some archs? Does that mean 
we should establish these interrupts when an interface comes up rather than 
when it is attached?

dlg
