Re: [PATCH net v2] failover: allow name change on IFF_UP slave interfaces

2019-03-20 Thread Samudrala, Sridhar



On 3/19/2019 10:20 PM, si-wei liu wrote:

Hi Sridhar,

Are you fine with leaving the IFF_SLAVE_RENAME_OK flag as is, and thus
can provide your Acked-by or Reviewed-by? I can change the code if you
feel strongly about it.


My preference would be not to introduce a new flag unless there is a
use case where we want an IFF_FAILOVER_SLAVE type of device to support
two different behaviors (rename_ok and rename_not_ok).


Thanks
Sridhar



Thanks,
-Siwei


On 3/6/2019 8:54 PM, si-wei liu wrote:



On 3/6/2019 8:13 PM, Samudrala, Sridhar wrote:


On 3/6/2019 7:08 PM, Si-Wei Liu wrote:

When a netdev appears through hot plug and then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there is a race where userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave before the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel-initiated auto-enslavement is
unable to, or rather, is never meant to be, synchronized with the rename
request from userspace.

The failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules for
passing network traffic and so on should all be done on the master
interface. In general, userspace apps only care about the
name of the master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For example, they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of an IFF_UP interface can't be changed because
there might be admin scripts or management software that already
rely on such behavior and assume that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC addresses. Similarly,
if users care about seeing reliable slave names, the new type of
failover slaves needs to be handled specifically in userspace
anyway.

It's less risky to lift the rename restriction on a failover slave
which is already UP. Although this change may potentially
break userspace components (most likely configuration scripts or
management software) that assume the slave name can't be changed while
UP, it's a relatively limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of failover slaves. Userspace components interacting with
slaves should be changed to operate on the failover master instead, as
failover slaves are dynamic in nature and may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
to have all userspace deal only with the master in the long run.

Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu 
Reviewed-by: Liran Alon 
Acked-by: Michael S. Tsirkin 

---
v1 -> v2:
- Drop configurable module parameter (Sridhar)


  include/linux/netdevice.h | 3 +++
  net/core/dev.c    | 3 ++-
  net/core/failover.c   | 6 +++---
  3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
   * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
   * @IFF_FAILOVER: device is a failover master device
   * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
 + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
   */
  enum netdev_priv_flags {
  IFF_802_1Q_VLAN    = 1<<0,
@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
  IFF_NO_RX_HANDLER    = 1<<26,
  IFF_FAILOVER    = 1<<27,
  IFF_FAILOVER_SLAVE    = 1<<28,
+    IFF_SLAVE_RENAME_OK    = 1<<29,
  };
    #define IFF_802_1Q_VLAN    IFF_802_1Q_VLAN
@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
  #define IFF_NO_RX_HANDLER    IFF_NO_RX_HANDLER
  #define IFF_FAILOVER    IFF_FAILOVER
  #define IFF_FAILOVER_SLAVE    IFF_FAILOVER_SLAVE
+#define IFF_SLAVE_RENAME_OK    IFF_SLAVE_RENAME_OK
    /**
   *    struct net_device - The DEVICE structure.
diff --git a/net/core/dev.c b/net/core/dev.c
index 722d50d..ae070de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char *newname)

  BUG_ON(!dev_net(dev));
    net = dev_net(dev);
-    if (dev->flags & IFF_UP)
+    if (dev->flags & 
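
(The quoted diff is cut off above. Reconstructed from the commit message and
the new flag definition, the remaining change presumably amounts to the sketch
below; it is not the literal patch, so treat the exact condition and the
failover.c hunk as assumptions.)

	/* net/core/dev.c, dev_change_name(): keep rejecting renames of a
	 * running interface unless the device opted in via the new priv
	 * flag (sketch only).
	 */
	if (dev->flags & IFF_UP &&
	    !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
		return -EBUSY;

	/* net/core/failover.c (assumed shape): set the flag when the
	 * failover core enslaves the primary, clear it again on release.
	 */
	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE | IFF_SLAVE_RENAME_OK;
	/* ... and on unregister: */
	slave_dev->priv_flags &= ~(IFF_FAILOVER_SLAVE | IFF_SLAVE_RENAME_OK);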

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-20 Thread Dongli Zhang


On 3/20/19 8:53 PM, Jason Wang wrote:
> 
> On 2019/3/19 10:22 AM, Dongli Zhang wrote:
>> Hi Jason,
>>
>> On 3/18/19 3:47 PM, Jason Wang wrote:
>>> On 2019/3/15 8:41 PM, Cornelia Huck wrote:
>>>> On Fri, 15 Mar 2019 12:50:11 +0800
>>>> Jason Wang  wrote:
>>>>
>>>>> Or something like I proposed several years ago?
>>>>> https://do-db2.lkml.org/lkml/2014/12/25/169
>>>>>
>>>>> Btw, for virtio-net, I think we actually want to go for having a maximum
>>>>> number of supported queues like what hardware did. This would be useful
>>>>> for e.g. cpu hotplug or XDP (requires per cpu TX queue). But the current
>>>>> vector allocation doesn't support this, which results in all virtqueues
>>>>> sharing a single vector. We may indeed need a more flexible policy here.
>>>> I think it should be possible for the driver to give the transport
>>>> hints how to set up their queues/interrupt structures. (The driver
>>>> probably knows best about its requirements.) Perhaps whether a queue is
>>>> high or low frequency, or whether it should be low latency, or even
>>>> whether two queues could share a notification mechanism without
>>>> drawbacks. It's up to the transport to make use of that information, if
>>>> possible.
>>>
>>> Exactly and it was what the above series tried to do by providing hints of 
>>> e.g
>>> which queues want to share a notification.
>>>
>> I read about your patch set on providing more flexibility of queue-to-vector
>> mapping.
>>
>> One use case of the patch set is we would be able to enable more queues when
>> there is limited number of vectors.
>>
>> Another use case: we may classify queues as high priority or low priority,
>> as mentioned by Cornelia.
>>
>> For virtio-blk, we may extend virtio-blk based on this patch set to enable
>> something similar to write_queues/poll_queues in nvme, when (set->nr_maps != 
>> 1).
>>
>>
>> Yet, the question I am asking in this email thread is for a different scenario.
>>
>> The issue is not that we do not have enough vectors (although this is why
>> only one vector is allocated for all virtio-blk queues). Since so far
>> virtio-blk has (set->nr_maps == 1), the block layer would limit the number
>> of hw queues by nr_cpu_ids; we indeed do not need more than nr_cpu_ids hw
>> queues in virtio-blk.
>>
>> That's why I ask why not change the flow to one of the below options when
>> the number of supported hw queues is more than nr_cpu_ids (and
>> set->nr_maps == 1; virtio-blk does not set nr_maps and the block layer
>> would set it to 1 when the driver does not specify a value):
>>
>> option 1:
>> As what nvme and xen-netfront do, limit the hw queue number by nr_cpu_ids.
> 
> 
> How do they limit the hw queue number? A command?

The max #queue is also limited by other factors, e.g., kernel param
configuration, xen dom0 configuration or nvme hardware support.

Here we would ignore those factors for simplicity and only talk about the
relation between #queue and #cpu.


About nvme pci:

Regardless of the new write_queues and poll_queues, the default number of
queues is limited by num_possible_cpus(), as shown at lines 2120 and 252.

2113 static int nvme_setup_io_queues(struct nvme_dev *dev)
2114 {
2115 struct nvme_queue *adminq = &dev->queues[0];
2116 struct pci_dev *pdev = to_pci_dev(dev->dev);
2117 int result, nr_io_queues;
2118 unsigned long size;
2119
2120 nr_io_queues = max_io_queues();
2121 result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);

 250 static unsigned int max_io_queues(void)
 251 {
 252 return num_possible_cpus() + write_queues + poll_queues;
 253 }

The con of this is that there might be many unused hw queues and vectors when
num_possible_cpus() is very large while only a small number of cpus are
online. I am looking into whether there is a way to improve this.



About xen-blkfront:

Indeed the max #queue is limited by num_online_cpus() when the xen-blkfront
module is loaded, as shown at lines 2733 and 2736.

2707 static int __init xlblk_init(void)
... ...
2710 int nr_cpus = num_online_cpus();
... ...
2733 if (xen_blkif_max_queues > nr_cpus) {
2734 pr_info("Invalid max_queues (%d), will use default max: 
%d.\n",
2735 xen_blkif_max_queues, nr_cpus);
2736 xen_blkif_max_queues = nr_cpus;
2737 }

The con of this is that the number of hw queues/hctxs is limited and cannot
increase after cpu hotplug. I am looking into whether there is a way to
improve this.


While both have cons for cpu hotplug, they are trying to make #vector
proportional to the number of cpus.


For xen-blkfront and virtio-blk, as (set->nr_maps == 1), the number of hw
queues is limited by nr_cpu_ids again at the block layer.


As virtio-blk is a PCI device, can we use the solution in nvme, that is, use
num_possible_cpus() to limit the max queues in virtio-blk?
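
To make option 1 concrete for virtio-blk, the nvme-style cap would boil down
to clamping the queue count read from the device before the virtqueues are
requested. A minimal sketch of the relevant fragment of init_vq() in
drivers/block/virtio_blk.c follows (the min_t() line is the only addition;
whether num_possible_cpus() or nr_cpu_ids is the better bound is exactly the
question above):

	unsigned short num_vqs;
	int err;

	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
				   struct virtio_blk_config, num_queues,
				   &num_vqs);
	if (err)
		num_vqs = 1;

	/* Sketch: never ask for more hw queues than the block layer will
	 * use anyway while set->nr_maps == 1.
	 */
	num_vqs = min_t(unsigned int, num_vqs, num_possible_cpus());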

Thank you very much!

Dongli Zhang

> 
> 
>>
>> option 2:
>> If the vectors are not enough, use the max vector number (indeed
>> nr_cpu_ids) as the number of hw queues.
> 

Re: [summary] virtio network device failover writeup

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> > 
> > On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>  
>  
> > On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> > 
> > On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >> On Tue, 19 Mar 2019 14:38:06 +0200
> >> Liran Alon  wrote:
> >> 
> >>> b.3) cloud-init: If configured to perform network-configuration, it 
> >>> attempts to configure all available netdevs. It should avoid however 
> >>> doing so on net-failover slaves.
> >>> (Microsoft has handled this by adding a mechanism in cloud-init to 
> >>> blacklist a netdev from being configured in case it is owned by a 
> >>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> >>> However, this technique doesn’t work for the net-failover mechanism 
> >>> because both the net-failover netdev and the virtio-net netdev are 
> >>> owned by the virtio-net PCI driver).
> >> 
> >> Cloud-init should really just ignore all devices that have a master 
> >> device.
> >> That would have been more general, and safer for other use cases.
> > 
> > Given lots of userspace doesn't do this, I wonder whether it would be
> > safer to just somehow pretend to userspace that the slave links are
> > down? And add a special attribute for the actual link state.
>  
>  I think this may be problematic as it would also break legit use case
>  of userspace attempt to set various config on VF slave.
>  In general, lying to userspace usually leads to problems.
> >>> 
> >>> I hear you on this. So how about instead of lying,
> >>> we basically just fail some accesses to slaves
> >>> unless a flag is set e.g. in ethtool.
> >>> 
> >>> Some userspace will need to change to set it but in a minor way.
> >>> Arguably/hopefully failure to set config would generally be a safer
> >>> failure.
> >> 
> >> Once userspace sets this new flag via ethtool, all operations done by 
> >> other userspace components will still work.
> > 
> > Sorry about being unclear, the idea would be to require the flag on each 
> > ethtool operation.
> 
> Oh. I have indeed misunderstood your previous email then. :)
> Thanks for clarifying.
> 
> > 
> >> E.g. Running dhclient without parameters, after this flag was set, will 
> >> still attempt to perform DHCP on it and will now succeed.
> > 
> > I think sending/receiving should probably just fail unconditionally.
> 
> You mean that you wish that somehow the kernel will prevent Tx on the
> net-failover slave netdev unless the skb is marked with some flag to
> indicate it has been sent via the net-failover master?

We can maybe avoid binding a protocol socket to the device?

> This indeed resolves the group of userspace issues around performing DHCP on 
> net-failover slaves directly (by dracut/initramfs, dhclient, etc.).
> 
> However, I see a couple of down-sides to it:
> 1) It doesn’t resolve all userspace issues listed in this email thread. For 
> example, cloud-init will still attempt to perform network config on 
> net-failover slaves.
> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev 
> rules that match only by MAC.


How about we fail to retrieve mac from the slave?

> 2) It creates a non-intuitive customer experience. For example, a customer may 
> attempt to analyse a connectivity issue by checking the connectivity
> on a net-failover slave (e.g. the VF) and will see no connectivity, when 
> in fact checking the connectivity on the net-failover master netdev shows 
> correct connectivity.
> 
> The set of changes I envision to fix our issues is:
> 1) Hide net-failover slaves in a different netns created and managed by the 
> kernel, but one that the user can enter to manage the netdevs there if they 
> wish to do so explicitly.
> (E.g. configure the net-failover VF slave in some special way).
> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC 
> (similar to what is done in NetVSC). E.g. provide a virtio-net interface to 
> get the PCI slot where the matching VF will be hot-plugged by the hypervisor.
> 3) Have an explicit virtio-net control message to command the hypervisor to 
> switch the data path from virtio-net to VF and vice versa, instead of relying 
> on intercepting the PCI master enable-bit as an indicator of when the VF is 
> about to be set up (similar to what is done in NetVSC).
> 
> Is there any clear issue we see regarding the above suggestion?
> 
> -Liran

The issue would be this: how do we avoid conflicting with namespaces
created by users?

> > 
> >> Therefore, this proposal just effectively delays when the net-failover 
> >> slave can be operated on by userspace.
> >> But what we actually want is to 

Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 01:13:41PM -0300, Thiago Jung Bauermann wrote:
> >> Another way of looking at this issue which also explains our reluctance
> >> is that the only difference between a secure guest and a regular guest
> >> (at least regarding virtio) is that the former uses swiotlb while the
> >> latter doesn't.
> >
> > But swiotlb is just one implementation. It's a guest internal thing. The
> > issue is that memory isn't host accessible.
> 
> From what I understand of the ACCESS_PLATFORM definition, the host will
> only ever try to access memory addresses that are supplied to it by the
> guest, so all of the secure guest memory that the host cares about is
> accessible:
> 
> If this feature bit is set to 0, then the device has same access to
> memory addresses supplied to it as the driver has. In particular,
> the device will always use physical addresses matching addresses
> used by the driver (typically meaning physical addresses used by the
> CPU) and not translated further, and can access any address supplied
> to it by the driver. When clear, this overrides any
> platform-specific description of whether device access is limited or
> translated in any way, e.g. whether an IOMMU may be present.
> 
> All of the above is true for POWER guests, whether they are secure
> guests or not.
> 
> Or are you saying that a virtio device may want to access memory
> addresses that weren't supplied to it by the driver?

Your logic would apply to IOMMUs as well.  For your mode, there are
specific encrypted memory regions that the driver has access to but the
device does not. That seems to violate the constraint.


> >> And from the device's point of view they're
> >> indistinguishable. It can't tell one guest that is using swiotlb from
> >> one that isn't. And that implies that secure guest vs regular guest
> >> isn't a virtio interface issue, it's "guest internal affairs". So
> >> there's no reason to reflect that in the feature flags.
> >
> > So don't. The way not to reflect that in the feature flags is
> > to set ACCESS_PLATFORM.  Then you say *I don't care let platform device*.
> >
> >
> > Without ACCESS_PLATFORM
> > virtio has a very specific opinion about the security of the
> > device, and that opinion is that device is part of the guest
> > supervisor security domain.
> 
> Sorry for being a bit dense, but not sure what "the device is part of
> the guest supervisor security domain" means. In powerpc-speak,
> "supervisor" is the operating system so perhaps that explains my
> confusion. Are you saying that without ACCESS_PLATFORM, the guest
> considers the host to be part of the guest operating system's security
> domain?

I think so. The spec says "device has same access as driver".

> If so, does that have any other implication besides "the host
> can access any address supplied to it by the driver"? If that is the
> case, perhaps the definition of ACCESS_PLATFORM needs to be amended to
> include that information because it's not part of the current
> definition.
> 
> >> That said, we still would like to arrive at a proper design for this
> >> rather than add yet another hack if we can avoid it. So here's another
> >> proposal: considering that the dma-direct code (in kernel/dma/direct.c)
> >> automatically uses swiotlb when necessary (thanks to Christoph's recent
> >> DMA work), would it be ok to replace virtio's own direct-memory code
> >> that is used in the !ACCESS_PLATFORM case with the dma-direct code? That
> >> way we'll get swiotlb even with !ACCESS_PLATFORM, and virtio will get a
> >> code cleanup (replace open-coded stuff with calls to existing
> >> infrastructure).
> >
> > Let's say I have some doubts that there's an API that
> > matches what virtio with its bag of legacy compatibility exactly.
> 
> Ok.
> 
> >> > But the name "sev_active" makes me scared because at least AMD guys who
> >> > were doing the sensible thing and setting ACCESS_PLATFORM
> >>
> >> My understanding is, AMD guest-platform knows in advance that their
> >> guest will run in secure mode and hence sets the flag at the time of VM
> >> instantiation. Unfortunately we dont have that luxury on our platforms.
> >
> > Well you do have that luxury. It looks like that there are existing
> > guests that already acknowledge ACCESS_PLATFORM and you are not happy
> > with how that path is slow. So you are trying to optimize for
> > them by clearing ACCESS_PLATFORM and then you have lost ability
> > to invoke DMA API.
> >
> > For example if there was another flag just like ACCESS_PLATFORM
> > just not yet used by anyone, you would be all fine using that right?
> 
> Yes, a new flag sounds like a great idea. What about the definition
> below?
> 
> VIRTIO_F_ACCESS_PLATFORM_NO_IOMMU This feature has the same meaning as
> VIRTIO_F_ACCESS_PLATFORM both when set and when not set, with the
> exception that the IOMMU is explicitly defined to be off or bypassed
> when accessing memory addresses supplied to 

Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-03-20 Thread Thiago Jung Bauermann


Hello Michael,

Sorry for the delay in responding. We had some internal discussions on
this.

Michael S. Tsirkin  writes:

> On Mon, Feb 04, 2019 at 04:14:20PM -0200, Thiago Jung Bauermann wrote:
>>
>> Hello Michael,
>>
>> Michael S. Tsirkin  writes:
>>
>> > On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:
>> So while ACCESS_PLATFORM solves our problems for secure guests, we can't
>> turn it on by default because we can't affect legacy systems. Doing so
>> would penalize existing systems that can access all memory. They would
>> all have to unnecessarily go through address translations, and take a
>> performance hit.
>
> So as step one, you just give hypervisor admin an option to run legacy
> systems faster by blocking secure mode. I don't see why that is
> so terrible.

There are a few reasons why:

1. It's bad user experience to require people to fiddle with knobs for
obscure reasons if it's possible to design things such that they Just
Work.

2. "User" in this case can be a human directly calling QEMU, but could
also be libvirt or one of its users, or some other framework. This means
having to adjust and/or educate an open-ended number of people and
software. It's best avoided if possible.

3. The hypervisor admin and the admin of the guest system don't
necessarily belong to the same organization (e.g., cloud provider and
cloud customer), so there may be some friction when they need to
coordinate to get this right.

4. A feature of our design is that the guest may or may not decide to
"go secure" at boot time, so it's best not to depend on flags that may
or may not have been set at the time QEMU was started.

>> The semantics of ACCESS_PLATFORM assume that the hypervisor/QEMU knows
>> in advance - right when the VM is instantiated - that it will not have
>> access to all guest memory.
>
> Not quite. It just means that hypervisor can live with not having
> access to all memory. If platform wants to give it access
> to all memory that is quite all right.

Except that on powerpc it also means "there's an IOMMU present" and
there's no way to say "bypass IOMMU translation". :-/

>> Another way of looking at this issue which also explains our reluctance
>> is that the only difference between a secure guest and a regular guest
>> (at least regarding virtio) is that the former uses swiotlb while the
>> latter doesn't.
>
> But swiotlb is just one implementation. It's a guest internal thing. The
> issue is that memory isn't host accessible.

From what I understand of the ACCESS_PLATFORM definition, the host will
only ever try to access memory addresses that are supplied to it by the
guest, so all of the secure guest memory that the host cares about is
accessible:

If this feature bit is set to 0, then the device has same access to
memory addresses supplied to it as the driver has. In particular,
the device will always use physical addresses matching addresses
used by the driver (typically meaning physical addresses used by the
CPU) and not translated further, and can access any address supplied
to it by the driver. When clear, this overrides any
platform-specific description of whether device access is limited or
translated in any way, e.g. whether an IOMMU may be present.

All of the above is true for POWER guests, whether they are secure
guests or not.

Or are you saying that a virtio device may want to access memory
addresses that weren't supplied to it by the driver?

>> And from the device's point of view they're
>> indistinguishable. It can't tell one guest that is using swiotlb from
>> one that isn't. And that implies that secure guest vs regular guest
>> isn't a virtio interface issue, it's "guest internal affairs". So
>> there's no reason to reflect that in the feature flags.
>
> So don't. The way not to reflect that in the feature flags is
> to set ACCESS_PLATFORM.  Then you say *I don't care let platform device*.
>
>
> Without ACCESS_PLATFORM
> virtio has a very specific opinion about the security of the
> device, and that opinion is that device is part of the guest
> supervisor security domain.

Sorry for being a bit dense, but not sure what "the device is part of
the guest supervisor security domain" means. In powerpc-speak,
"supervisor" is the operating system so perhaps that explains my
confusion. Are you saying that without ACCESS_PLATFORM, the guest
considers the host to be part of the guest operating system's security
domain? If so, does that have any other implication besides "the host
can access any address supplied to it by the driver"? If that is the
case, perhaps the definition of ACCESS_PLATFORM needs to be amended to
include that information because it's not part of the current
definition.

>> That said, we still would like to arrive at a proper design for this
>> rather than add yet another hack if we can avoid it. So here's another
>> proposal: considering that the dma-direct code (in kernel/dma/direct.c)
>> automatically uses 
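
(For readers who have not seen the RFC named in the subject: the change under
discussion is understood to hook into vring_use_dma_api() in
drivers/virtio/virtio_ring.c so that an encrypted guest takes the DMA API, and
hence swiotlb, path even without ACCESS_PLATFORM. A rough sketch of that shape
is below; mem_encrypt_active() is only a placeholder predicate here, the
actual RFC may use a different check.)

	/* drivers/virtio/virtio_ring.c -- sketch of the idea, not the RFC */
	static bool vring_use_dma_api(struct virtio_device *vdev)
	{
		if (!virtio_has_iommu_quirk(vdev))
			return true;	/* VIRTIO_F_IOMMU_PLATFORM negotiated */

		/*
		 * Proposed addition (placeholder predicate): encrypted guest
		 * memory is not host-accessible without bounce buffering, so
		 * route these buffers through the DMA API / swiotlb too.
		 */
		if (mem_encrypt_active())
			return true;

		/* Otherwise, we are left to guess. */
		if (xen_domain())
			return true;

		return false;
	}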

Re: [summary] virtio network device failover writeup

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> > 
> > On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>  On Tue, 19 Mar 2019 14:38:06 +0200
>  Liran Alon  wrote:
>  
> > b.3) cloud-init: If configured to perform network-configuration, it 
> > attempts to configure all available netdevs. It should avoid however 
> > doing so on net-failover slaves.
> > (Microsoft has handled this by adding a mechanism in cloud-init to 
> > blacklist a netdev from being configured in case it is owned by a 
> > specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> > However, this technique doesn’t work for the net-failover mechanism 
> > because both the net-failover netdev and the virtio-net netdev are 
> > owned by the virtio-net PCI driver).
>  
>  Cloud-init should really just ignore all devices that have a master 
>  device.
>  That would have been more general, and safer for other use cases.
> >>> 
> >>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>> safer to just somehow pretend to userspace that the slave links are
> >>> down? And add a special attribute for the actual link state.
> >> 
> >> I think this may be problematic as it would also break legit use case
> >> of userspace attempt to set various config on VF slave.
> >> In general, lying to userspace usually leads to problems.
> > 
> > I hear you on this. So how about instead of lying,
> > we basically just fail some accesses to slaves
> > unless a flag is set e.g. in ethtool.
> > 
> > Some userspace will need to change to set it but in a minor way.
> > Arguably/hopefully failure to set config would generally be a safer
> > failure.
> 
> Once userspace sets this new flag via ethtool, all operations done by 
> other userspace components will still work.

Sorry about being unclear, the idea would be to require the flag on each 
ethtool operation.

> E.g. Running dhclient without parameters, after this flag was set, will still 
> attempt to perform DHCP on it and will now succeed.

I think sending/receiving should probably just fail unconditionally.

> Therefore, this proposal just effectively delays when the net-failover slave 
> can be operated on by userspace.
> But what we actually want is to never allow a net-failover slave to be 
> operated by userspace unless it is explicitly stated
> by userspace that it wishes to perform a set of actions on the net-failover 
> slave.
> 
> Something that would be achieved if, for example, the net-failover slaves 
> were in a different netns than the default netns.
> This also aligns with expected customer experience that most customers just 
> want to see a 1:1 mapping between a vNIC and a visible netdev.
> But of course maybe there are other ideas that can achieve similar behaviour.
> 
> -Liran
> 
> > 
> > Which things to fail? Probably sending/receiving packets?  Getting MAC?
> > More?
> > 
> >> If we reach
> >> to a scenario where we try to avoid userspace issues generically and
> >> not on a userspace component basis, I believe the right path should be
> >> to hide the net-failover slaves such that explicit action is required
> >> to actually manipulate them (As described in blog-post). E.g.
> >> Automatically move net-failover slaves by kernel to a different netns.
> >> 
> >> -Liran
> >> 
> >>> 
> >>> -- 
> >>> MST

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-20 Thread Jason Wang


On 2019/3/19 10:22 AM, Dongli Zhang wrote:

Hi Jason,

On 3/18/19 3:47 PM, Jason Wang wrote:

On 2019/3/15 8:41 PM, Cornelia Huck wrote:

On Fri, 15 Mar 2019 12:50:11 +0800
Jason Wang  wrote:


Or something like I proposed several years ago?
https://do-db2.lkml.org/lkml/2014/12/25/169

Btw, for virtio-net, I think we actually want to go for having a maximum
number of supported queues like what hardware did. This would be useful
for e.g. cpu hotplug or XDP (requires per cpu TX queue). But the current
vector allocation doesn't support this, which results in all virtqueues
sharing a single vector. We may indeed need a more flexible policy here.

I think it should be possible for the driver to give the transport
hints how to set up their queues/interrupt structures. (The driver
probably knows best about its requirements.) Perhaps whether a queue is
high or low frequency, or whether it should be low latency, or even
whether two queues could share a notification mechanism without
drawbacks. It's up to the transport to make use of that information, if
possible.


Exactly and it was what the above series tried to do by providing hints of e.g
which queues want to share a notification.


I read about your patch set on providing more flexibility of queue-to-vector
mapping.

One use case of the patch set is we would be able to enable more queues when
there is limited number of vectors.

Another use case: we may classify queues as high priority or low priority, as
mentioned by Cornelia.

For virtio-blk, we may extend virtio-blk based on this patch set to enable
something similar to write_queues/poll_queues in nvme, when (set->nr_maps != 1).


Yet, the question I am asking in this email thread is for a different scenario.

The issue is not that we do not have enough vectors (although this is why only
one vector is allocated for all virtio-blk queues). Since so far virtio-blk has
(set->nr_maps == 1), the block layer would limit the number of hw queues by
nr_cpu_ids; we indeed do not need more than nr_cpu_ids hw queues in virtio-blk.

That's why I ask why not change the flow to one of the below options when the
number of supported hw queues is more than nr_cpu_ids (and set->nr_maps == 1;
virtio-blk does not set nr_maps and the block layer would set it to 1 when the
driver does not specify a value):

option 1:
As what nvme and xen-netfront do, limit the hw queue number by nr_cpu_ids.



How do they limit the hw queue number? A command?




option 2:
If the vectors are not enough, use the max vector number (indeed nr_cpu_ids) as
the number of hw queues.



We can share vectors in this case.




option 3:
We should allow more vectors even the block layer would support at most
nr_cpu_ids queues.


I understand a new policy for queue-vector mapping is very helpful. I am just
asking the question from the block layer's point of view.

Thank you very much!

Dongli Zhang



I don't know much about the block side; cc Stefan for more ideas.

Thanks
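
(For context on "we can share vectors": virtio-pci already degrades in steps
when MSI-X vectors are scarce. The top-level fallback in
drivers/virtio/virtio_pci_common.c is roughly the sketch below, written from
memory of the upstream code; helper names and argument lists are approximate,
not copied from the tree.)

	int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
			struct virtqueue *vqs[], vq_callback_t *callbacks[],
			const char * const names[], const bool *ctx,
			struct irq_affinity *desc)
	{
		int err;

		/* 1) Best case: MSI-X with one vector per virtqueue. */
		err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names,
				       true /* per_vq_vectors */, ctx, desc);
		if (!err)
			return 0;

		/* 2) Not enough vectors: MSI-X with one vector for config
		 *    changes and a single vector shared by all virtqueues.
		 */
		err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names,
				       false /* per_vq_vectors */, ctx, desc);
		if (!err)
			return 0;

		/* 3) Last resort: legacy INTx shared with everything. */
		return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, ctx);
	}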


Re: [summary] virtio network device failover writeup

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> 
> 
> > On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> > 
> > On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >> On Tue, 19 Mar 2019 14:38:06 +0200
> >> Liran Alon  wrote:
> >> 
> >>> b.3) cloud-init: If configured to perform network-configuration, it 
> >>> attempts to configure all available netdevs. It should avoid however 
> >>> doing so on net-failover slaves.
> >>> (Microsoft has handled this by adding a mechanism in cloud-init to 
> >>> blacklist a netdev from being configured in case it is owned by a 
> >>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> >>> However, this technique doesn’t work for the net-failover mechanism 
> >>> because both the net-failover netdev and the virtio-net netdev are owned 
> >>> by the virtio-net PCI driver).
> >> 
> >> Cloud-init should really just ignore all devices that have a master device.
> >> That would have been more general, and safer for other use cases.
> > 
> > Given lots of userspace doesn't do this, I wonder whether it would be
> > safer to just somehow pretend to userspace that the slave links are
> > down? And add a special attribute for the actual link state.
> 
> I think this may be problematic as it would also break legit use case
> of userspace attempt to set various config on VF slave.
> In general, lying to userspace usually leads to problems.

I hear you on this. So how about instead of lying,
we basically just fail some accesses to slaves
unless a flag is set e.g. in ethtool.

Some userspace will need to change to set it but in a minor way.
Arguably/hopefully failure to set config would generally be a safer
failure.

Which things to fail? Probably sending/receiving packets?  Getting MAC?
More?

> If we reach
> to a scenario where we try to avoid userspace issues generically and
> not on a userspace component basis, I believe the right path should be
> to hide the net-failover slaves such that explicit action is required
> to actually manipulate them (As described in blog-post). E.g.
> Automatically move net-failover slaves by kernel to a different netns.
> 
> -Liran
> 
> > 
> > -- 
> > MST
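
(A side note on Stephen's suggestion above that tooling simply skip any netdev
that already has a master: from userspace that check is just a sysfs lookup on
the standard "master" link. A minimal, hypothetical C sketch, purely for
illustration and not something proposed in this thread:)

	/* Build: cc -o has_master has_master.c ; run: ./has_master eth0 */
	#include <stdio.h>
	#include <unistd.h>

	/* Returns 1 if the interface has a "master" link in sysfs (i.e. it is
	 * enslaved by bond/team/bridge/failover), 0 otherwise.
	 */
	static int has_master(const char *ifname)
	{
		char path[256], target[256];
		ssize_t len;

		snprintf(path, sizeof(path), "/sys/class/net/%s/master", ifname);
		len = readlink(path, target, sizeof(target) - 1);
		if (len < 0)
			return 0;	/* no master link (or no such netdev) */
		target[len] = '\0';
		printf("%s is enslaved under %s\n", ifname, target);
		return 1;
	}

	int main(int argc, char **argv)
	{
		if (argc != 2) {
			fprintf(stderr, "usage: %s <ifname>\n", argv[0]);
			return 2;
		}
		return has_master(argv[1]) ? 0 : 1;
	}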

[PATCH] drm/virtio: make resource id workaround runtime switchable.

2019-03-20 Thread Gerd Hoffmann
Also update the comment with a reference to the virglrenderer fix.

Signed-off-by: Gerd Hoffmann 
---
 drivers/gpu/drm/virtio/virtgpu_object.c | 44 ++---
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_object.c b/drivers/gpu/drm/virtio/virtgpu_object.c
index e7e946035027..a73dae4e5029 100644
--- a/drivers/gpu/drm/virtio/virtgpu_object.c
+++ b/drivers/gpu/drm/virtio/virtgpu_object.c
@@ -25,34 +25,38 @@
 
 #include "virtgpu_drv.h"
 
+static int virtio_gpu_virglrenderer_workaround = 1;
+module_param_named(virglhack, virtio_gpu_virglrenderer_workaround, int, 0400);
+
 static int virtio_gpu_resource_id_get(struct virtio_gpu_device *vgdev,
   uint32_t *resid)
 {
-#if 0
-   int handle = ida_alloc(&vgdev->resource_ida, GFP_KERNEL);
-
-   if (handle < 0)
-   return handle;
-#else
-   static int handle;
-
-   /*
-* FIXME: dirty hack to avoid re-using IDs, virglrenderer
-* can't deal with that.  Needs fixing in virglrenderer, also
-* should figure a better way to handle that in the guest.
-*/
-   handle++;
-#endif
-
-   *resid = handle + 1;
+   if (virtio_gpu_virglrenderer_workaround) {
+   /*
+* Hack to avoid re-using resource IDs.
+*
+* virglrenderer versions up to (and including) 0.7.0
+* can't deal with that.  virglrenderer commit
+* "f91a9dd35715 Fix unlinking resources from hash
+* table." (Feb 2019) fixes the bug.
+*/
+   static int handle;
+   handle++;
+   *resid = handle + 1;
+   } else {
+   int handle = ida_alloc(&vgdev->resource_ida, GFP_KERNEL);
+   if (handle < 0)
+   return handle;
+   *resid = handle + 1;
+   }
return 0;
 }
 
static void virtio_gpu_resource_id_put(struct virtio_gpu_device *vgdev, uint32_t id)
 {
-#if 0
-   ida_free(&vgdev->resource_ida, id - 1);
-#endif
+   if (!virtio_gpu_virglrenderer_workaround) {
+   ida_free(&vgdev->resource_ida, id - 1);
+   }
 }
 
 static void virtio_gpu_ttm_bo_destroy(struct ttm_buffer_object *tbo)
-- 
2.18.1
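
With the parameter in place the workaround can be disabled at load time;
assuming the module keeps its current name (parameter naming follows the
module_param_named() above, and the 0400 permission means it is read-only
after load), that would look like:

	modprobe virtio-gpu virglhack=0

or virtio_gpu.virglhack=0 on the kernel command line for a built-in driver.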



[PATCH] drm/virtio: add virtio-gpu-features debugfs file.

2019-03-20 Thread Gerd Hoffmann
This file prints which features the virtio-gpu device has.

Also add "virtio-gpu-" prefix to the existing fence file,
to make clear this is a driver-specific debugfs file.

Signed-off-by: Gerd Hoffmann 
---
 drivers/gpu/drm/virtio/virtgpu_debugfs.c | 27 +++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_debugfs.c b/drivers/gpu/drm/virtio/virtgpu_debugfs.c
index 73dc99046c43..ed0fcda713c3 100644
--- a/drivers/gpu/drm/virtio/virtgpu_debugfs.c
+++ b/drivers/gpu/drm/virtio/virtgpu_debugfs.c
@@ -28,6 +28,30 @@
 
 #include "virtgpu_drv.h"
 
+static void virtio_add_bool(struct seq_file *m, const char *name,
+   bool value)
+{
+   seq_printf(m, "%-16s : %s\n", name, value ? "yes" : "no");
+}
+
+static void virtio_add_int(struct seq_file *m, const char *name,
+  int value)
+{
+   seq_printf(m, "%-16s : %d\n", name, value);
+}
+
+static int virtio_gpu_features(struct seq_file *m, void *data)
+{
+   struct drm_info_node *node = (struct drm_info_node *) m->private;
+   struct virtio_gpu_device *vgdev = node->minor->dev->dev_private;
+
+   virtio_add_bool(m, "virgl", vgdev->has_virgl_3d);
+   virtio_add_bool(m, "edid", vgdev->has_edid);
+   virtio_add_int(m, "cap sets", vgdev->num_capsets);
+   virtio_add_int(m, "scanouts", vgdev->num_scanouts);
+   return 0;
+}
+
 static int
 virtio_gpu_debugfs_irq_info(struct seq_file *m, void *data)
 {
@@ -41,7 +65,8 @@ virtio_gpu_debugfs_irq_info(struct seq_file *m, void *data)
 }
 
 static struct drm_info_list virtio_gpu_debugfs_list[] = {
-   { "irq_fence", virtio_gpu_debugfs_irq_info, 0, NULL },
+   { "virtio-gpu-features", virtio_gpu_features },
+   { "virtio-gpu-irq-fence", virtio_gpu_debugfs_irq_info, 0, NULL },
 };
 
 #define VIRTIO_GPU_DEBUGFS_ENTRIES ARRAY_SIZE(virtio_gpu_debugfs_list)
-- 
2.18.1
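
Once applied, the new file can be read from the DRM debugfs directory,
typically /sys/kernel/debug/dri/<minor>/ (the path and the sample values below
are illustrative only, following the seq_printf() format strings above):

	# cat /sys/kernel/debug/dri/0/virtio-gpu-features
	virgl            : yes
	edid             : no
	cap sets         : 1
	scanouts         : 1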
