Re: INFO: task hung in lock_sock_nested (2)

2020-02-25 Thread Stefano Garzarella
On Tue, Feb 25, 2020 at 05:44:03AM +, Dexuan Cui wrote:
> > From: Stefano Garzarella 
> > Sent: Monday, February 24, 2020 2:09 AM
> > ...
> > > > syz-executor280 D27912  9768   9766 0x
> > > > Call Trace:
> > > >  context_switch kernel/sched/core.c:3386 [inline]
> > > >  __schedule+0x934/0x1f90 kernel/sched/core.c:4082
> > > >  schedule+0xdc/0x2b0 kernel/sched/core.c:4156
> > > >  __lock_sock+0x165/0x290 net/core/sock.c:2413
> > > >  lock_sock_nested+0xfe/0x120 net/core/sock.c:2938
> > > >  virtio_transport_release+0xc4/0xd60
> > net/vmw_vsock/virtio_transport_common.c:832
> > > >  vsock_assign_transport+0xf3/0x3b0 net/vmw_vsock/af_vsock.c:454
> > > >  vsock_stream_connect+0x2b3/0xc70 net/vmw_vsock/af_vsock.c:1288
> > > >  __sys_connect_file+0x161/0x1c0 net/socket.c:1857
> > > >  __sys_connect+0x174/0x1b0 net/socket.c:1874
> > > >  __do_sys_connect net/socket.c:1885 [inline]
> > > >  __se_sys_connect net/socket.c:1882 [inline]
> > > >  __x64_sys_connect+0x73/0xb0 net/socket.c:1882
> > > >  do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
> 
> My understanding about the call trace is: in vsock_stream_connect() 
> after we call lock_sock(sk), we call vsock_assign_transport(), which may
> call vsk->transport->release(vsk), i.e. virtio_transport_release(), and in
> virtio_transport_release() we try to get the same lock and hang.

Yes, that's what I got.

> 
> > > Seems like vsock needs a word to track lock owner in an attempt to
> > > avoid trying to lock sock while the current is the lock owner.
> 
> I'm unfamilar with the g2h/h2g :-) 
> Here I'm wondering if it's acceptable to add an 'already_locked'
> parameter like this:
>   bool already_locked = true;
>   vsk->transport->release(vsk, already_locked) ?

Could be acceptable, but I prefer to avoid.

>  
> > Thanks for this possible solution.
> > What about using sock_owned_by_user()?
>  
> > We should fix also hyperv_transport, because it could suffer from the same
> > problem.
> 
> IIUC hyperv_transport doesn't supprot the h2g/g2h feature, so it should not
> suffers from the deadlock issue here?

The h2g/g2h is in the vsock core, and the hyperv_transport is one of the g2h
transports available.

If we have a L1 VM with hyperv_transport (G2H) to communicate with L0 and a
nested KVM VM (L2) we need to load also vhost_transport (H2G). If the
user creates a socket and it tries the following:
connect(fd, <2,1234>) - socket assigned to hyperv_transport (because
the user wants to reach the host using CID_HOST)
fails

connect(fd, <3,1234>) - socket must be reassigned to vhost_transport
(because the user wants to reach a nested guest)

So, I think in this case we can have the deadlock.

> 
> > At this point, it might be better to call vsk->transport->release(vsk)
> > always with the lock taken and remove it in the transports as in the
> > following patch.
> > 
> > What do you think?
> > 
> > 
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index 9c5b2a91baad..a073d8efca33 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -753,20 +753,18 @@ static void __vsock_release(struct sock *sk, int
> > level)
> > vsk = vsock_sk(sk);
> > pending = NULL; /* Compiler warning. */
> > 
> > -   /* The release call is supposed to use lock_sock_nested()
> > -* rather than lock_sock(), if a sock lock should be acquired.
> > -*/
> > -   if (vsk->transport)
> > -   vsk->transport->release(vsk);
> > -   else if (sk->sk_type == SOCK_STREAM)
> > -   vsock_remove_sock(vsk);
> > -
> > /* When "level" is SINGLE_DEPTH_NESTING, use the nested
> >  * version to avoid the warning "possible recursive locking
> >  * detected". When "level" is 0, lock_sock_nested(sk, level)
> >  * is the same as lock_sock(sk).
> >  */
> > lock_sock_nested(sk, level);
> > +
> > +   if (vsk->transport)
> > +   vsk->transport->release(vsk);
> > +   else if (sk->sk_type == SOCK_STREAM)
> > +   vsock_remove_sock(vsk);
> > +
> > sock_orphan(sk);
> > sk->sk_shutdown = SHUTDOWN_MASK;
> > 
> > diff --git a/net/vmw_vsock/hyperv_transport.c
> > b/net/vmw_vsock/hyperv_transport.c
> > index 3492c021925f..510f25f4a856 100644
> > --- a/net/vmw_vsock/hyperv_transport.c
> > +++ b/net/vmw_vsock/hyperv_transport.c
> > @@ -529,9 +529,7 @@ static void hvs_release(struct vsock_sock *vsk)
> > struct sock *sk = sk_vsock(vsk);
> > bool remove_sock;
> > 
> > -   lock_sock_nested(sk, SINGLE_DEPTH_NESTING);
> > remove_sock = hvs_close_lock_held(vsk);
> > -   release_sock(sk);
> > if (remove_sock)
> > vsock_remove_sock(vsk);
> >  }
> 
> This looks good to me, but do we know why vsk->transport->release(vsk)
> is called without holdi

Re: INFO: task hung in lock_sock_nested (2)

2020-02-25 Thread Stefano Garzarella
On Mon, Feb 24, 2020 at 09:44:28PM +0800, Hillf Danton wrote:
> 
> On Mon, 24 Feb 2020 11:08:53 +0100 Stefano Garzarella wrote:
> > On Sun, Feb 23, 2020 at 03:50:25PM +0800, Hillf Danton wrote:
> > > 
> > > Seems like vsock needs a word to track lock owner in an attempt to
> > > avoid trying to lock sock while the current is the lock owner.
> > 
> > Thanks for this possible solution.
> > What about using sock_owned_by_user()?
> > 
> No chance for vsock_locked() if it works.
> 
> > We should fix also hyperv_transport, because it could suffer from the same
> > problem.
> > 
> You're right. My diff is at most for introducing vsk's lock owner.

Sure, thanks for this!

> 
> > At this point, it might be better to call vsk->transport->release(vsk)
> > always with the lock taken and remove it in the transports as in the
> > following patch.
> > 
> > What do you think?
> > 
> Yes and ... please take a look at the output of grep
> 
>   grep -n lock_sock linux/net/vmw_vsock/af_vsock.c
> 
> as it drove me mad.

:-) I'll go in this direction and I'll check all the cases.

We should avoid to take lock_sock in the transports when it is possible.

Thanks for the help,
Stefano

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3] virtio: Work around frames incorrectly marked as gso

2020-02-25 Thread Anton Ivanov



On 25/02/2020 07:48, Anton Ivanov wrote:



On 24/02/2020 22:22, Willem de Bruijn wrote:

On Mon, Feb 24, 2020 at 4:00 PM Anton Ivanov
 wrote:


On 24/02/2020 20:20, Willem de Bruijn wrote:

On Mon, Feb 24, 2020 at 2:55 PM Anton Ivanov
 wrote:

On 24/02/2020 19:27, Willem de Bruijn wrote:

On Mon, Feb 24, 2020 at 8:26 AM  wrote:

From: Anton Ivanov 

Some of the locally generated frames marked as GSO which
arrive at virtio_net_hdr_from_skb() have no GSO_TYPE, no
fragments (data_len = 0) and length significantly shorter
than the MTU (752 in my experiments).

Do we understand how these packets are generated?

No, we have not been able to trace them.

The only thing we know is that this is specific to locally generated
packets. Something arriving from the network does not show this.


Else it seems this
might be papering over a deeper problem.

The stack should not create GSO packets less than or equal to
skb_shinfo(skb)->gso_size. See for instance the check in
tcp_gso_segment after pulling the tcp header:

   mss = skb_shinfo(skb)->gso_size;
   if (unlikely(skb->len <= mss))
   goto out;

What is the gso_type, and does it include SKB_GSO_DODGY?



0 - not set.

Thanks for the follow-up details. Is this something that you can trigger easily?


Yes, if you have a UML instance handy.

Running iperf between the host and a UML guest using raw socket
transport triggers it immediately.

This is my UML command line:

vmlinux mem=2048M umid=OPX \
  ubd0=OPX-3.0-Work.img \
vec0:transport=raw,ifname=p-veth0,depth=128,gro=1,mac=92:9b:36:5e:38:69 \
  root=/dev/ubda ro con=null con0=null,fd:2 con1=fd:0,fd:1

p-right is a part of a vEth pair:

ip link add l-veth0 type veth peer name p-veth0 && ifconfig p-veth0 up

iperf server is on host, iperf -c in the guest.



An skb_dump() + dump_stack() when the packet socket gets such a
packet may point us to the root cause and fix that.


We tried dump stack, it was not informative - it was just the recvmmsg
call stack coming from the UML until it hits the relevant recv bit in
af_packet - it does not tell us where the packet is coming from.

Quoting from the message earlier in the thread:

[ 2334.180854] Call Trace:
[ 2334.181947]  dump_stack+0x5c/0x80
[ 2334.183021]  packet_recvmsg.cold+0x23/0x49
[ 2334.184063]  ___sys_recvmsg+0xe1/0x1f0
[ 2334.185034]  ? packet_poll+0xca/0x130
[ 2334.186014]  ? sock_poll+0x77/0xb0
[ 2334.186977]  ? ep_item_poll.isra.0+0x3f/0xb0
[ 2334.187936]  ? ep_send_events_proc+0xf1/0x240
[ 2334.188901]  ? dequeue_signal+0xdb/0x180
[ 2334.189848]  do_recvmmsg+0xc8/0x2d0
[ 2334.190728]  ? ep_poll+0x8c/0x470
[ 2334.191581]  __sys_recvmmsg+0x108/0x150
[ 2334.192441]  __x64_sys_recvmmsg+0x25/0x30
[ 2334.193346]  do_syscall_64+0x53/0x140
[ 2334.194262]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


That makes sense. skb_dump might show more interesting details about
the packet.


I will add that and retest later today.



skb len=818 headroom=2 headlen=818 tailroom=908
mac=(2,14) net=(16,0) trans=16
shinfo(txflags=0 nr_frags=0 gso(size=752 type=0 segs=1))
csum(0x100024 ip_summed=3 complete_sw=0 valid=0 level=0)
hash(0x0 sw=0 l4=0) proto=0x0800 pkttype=4 iif=0
sk family=17 type=3 proto=0

Deciphering the actual packet data gives a

TCP packet, ACK and PSH set.

The PSH flag looks like the only "interesting" thing about it in first read.




From the previous thread, these are assumed to be TCP
packets?


Yes



I had missed the original thread. If the packet has

 sinfo(skb)->gso_size = 752.
 skb->len = 818

then this is a GSO packet. Even though UML will correctly process it
as a normal 818 B packet if psock_rcv pretends that it is, treating it
like that is not strictly correct. A related question is how the setup
arrived at that low MTU size, assuming that is not explicitly
configured that low.


The mtu on the interface is normal. I suspect it is one of the first packets
in the stream or something iperf uses for communication between the server and
the client which always ends up that size.



As of commit 51466a7545b7 ("tcp: fill shinfo->gso_type at last
moment") tcp unconditionally sets gso_type, even for non gso packets.
So either this is not a tcp packet or the field gets zeroed somewhere
along the way. I could not quickly find a possible path to
skb_gso_reset or a raw write.


Same. I have tried to trace a possible origin and I have not seen anything 
which may cause it.



It may be useful to insert tests for this condition (skb_is_gso(skb)
&& !skb_shinfo(skb)->gso_type) that call skb_dump at other points in
the network stack. For instance in __ip_queue_xmit and
__dev_queue_xmit.

Since skb segmentation fails in tcp_gso_segment for such packets, it
may also be informative to disable TSO on the veth device and see if
the test fails.


Ack.







--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

Re: [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory

2020-02-25 Thread David Hildenbrand
On 29.01.20 10:41, David Hildenbrand wrote:
> On 09.01.20 14:48, David Hildenbrand wrote:
>> On 12.12.19 18:11, David Hildenbrand wrote:
>>> This series is based on latest linux-next. The patches are located at:
>>> https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4
>>>
>>> The basic idea of virtio-mem is to provide a flexible,
>>> cross-architecture memory hot(un)plug solution that avoids many limitations
>>> imposed by existing technologies, architectures, and interfaces. More
>>> details can be found below and in linked material.
>>>
>>> This RFC is limited to x86-64, however, should theoretically work on any
>>> architecture that supports virtio and implements memory hot(un)plug under
>>> Linux - like s390x, powerpc64 and arm64. On x86-64, it is currently
>>> possible to add/remove memory to the system in >= 4MB granularity.
>>> Memory hotplug works very reliably. For memory unplug, there are no
>>> guarantees how much memory can actually get unplugged, it depends on the
>>> setup (especially: fragmentation of (unmovable) memory). I have plans to
>>> improve that in the future.
>>>
>>> --
>>> 1. virtio-mem
>>> --
>>>
>>> The basic idea behind virtio-mem was presented at KVM Forum 2018. The
>>> slides can be found at [1]. The previous RFC can be found at [2]. The
>>> first RFC can be found at [3]. However, the concept evolved over time. The
>>> KVM Forum slides roughly match the current design.
>>>
>>> Patch #2 ("virtio-mem: Paravirtualized memory hotplug") contains quite some
>>> information, especially in "include/uapi/linux/virtio_mem.h":
>>>
>>>   Each virtio-mem device manages a dedicated region in physical address
>>>   space. Each device can belong to a single NUMA node, multiple devices
>>>   for a single NUMA node are possible. A virtio-mem device is like a
>>>   "resizable DIMM" consisting of small memory blocks that can be plugged
>>>   or unplugged. The device driver is responsible for (un)plugging memory
>>>   blocks on demand.
>>>
>>>   Virtio-mem devices can only operate on their assigned memory region in
>>>   order to (un)plug memory. A device cannot (un)plug memory belonging to
>>>   other devices.
>>>
>>>   The "region_size" corresponds to the maximum amount of memory that can
>>>   be provided by a device. The "size" corresponds to the amount of memory
>>>   that is currently plugged. "requested_size" corresponds to a request
>>>   from the device to the device driver to (un)plug blocks. The
>>>   device driver should try to (un)plug blocks in order to reach the
>>>   "requested_size". It is impossible to plug more memory than requested.
>>>
>>>   The "usable_region_size" represents the memory region that can actually
>>>   be used to (un)plug memory. It is always at least as big as the
>>>   "requested_size" and will grow dynamically. It will only shrink when
>>>   explicitly triggered (VIRTIO_MEM_REQ_UNPLUG).
>>>
>>>   Memory in the usable region can usually be read, however, there are no
>>>   guarantees. It can happen that the device cannot process a request,
>>>   because it is busy. The device driver has to retry later.
>>>
>>>   Usually, during system resets all memory will get unplugged, so the
>>>   device driver can start with a clean state. However, in specific
>>>   scenarios (if the device is busy) it can happen that the device still
>>>   has memory plugged. The device driver can request to unplug all memory
>>>   (VIRTIO_MEM_REQ_UNPLUG) - which might take a while to succeed if the
>>>   device is busy.
>>>
>>> --
>>> 2. Linux Implementation
>>> --
>>>
>>> This RFC reuses quite some existing MM infrastructure, however, has to
>>> expose some additional functionality.
>>>
>>> Memory blocks (e.g., 128MB) are added/removed on demand. Within these
>>> memory blocks, subblocks (e.g., 4MB) are plugged/unplugged. The sizes
>>> depend on the target architecture, MAX_ORDER + pageblock_order, and
>>> the block size of a virtio-mem device.
>>>
>>> add_memory()/try_remove_memory() is used to add/remove memory blocks.
>>> virtio-mem will not online memory blocks itself. This has to be done by
>>> user space, or configured into the kernel
>>> (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE). virtio-mem will only unplug memory
>>> that was online to the ZONE_NORMAL. Memory is suggested to be onlined to
>>> the ZONE_NORMAL for now.
>>>
>>> The memory hotplug notifier is used to properly synchronize against
>>> onlining/offlining of memory blocks and to track the states of memory
>>> blocks (including the zone memory blocks are onlined to).
>>>
>>> The set_online_page() callback is used to keep unplugged subblocks
>>> of a memory block fake-offline when onlining the memory block.
>>> generic_onlin

[PATCH v3 0/4] drm: Provide a simple encoder

2020-02-25 Thread Thomas Zimmermann
Many DRM drivers implement an encoder with an empty implementation. This
patchset adds drm_simple_encoder_init(), which can be used by drivers
instead. Except for the destroy callback, the simple encoder's implementation
is empty.

The patchset also converts 4 encoder instances to use the simple-encoder
helpers. But there are at least 11 other drivers which can use the helper
and I think I did not examine all drivers yet.

The patchset was smoke-tested on mgag200 by running the fbdev console
and Gnome on X11.

v3:
* remove drm_simple_encoder_create() for lack of users (Sam, Daniel)
* provide more precise documentation (Sam)
v2:
* move simple encoder to KMS helpers (Daniel)
* remove name argument; simplifies implementation (Gerd)
* don't allocate with devm_ interfaces; unsafe with DRM (Noralf)

Thomas Zimmermann (4):
  drm/simple-kms: Add drm_simple_encoder_{init,create}()
  drm/ast: Use simple encoder
  drm/mgag200: Use simple encoder
  drm/qxl: Use simple encoder

 drivers/gpu/drm/ast/ast_drv.h   |  6 +-
 drivers/gpu/drm/ast/ast_mode.c  | 25 +++-
 drivers/gpu/drm/drm_simple_kms_helper.c | 34 +-
 drivers/gpu/drm/mgag200/mgag200_drv.h   |  9 +--
 drivers/gpu/drm/mgag200/mgag200_mode.c  | 85 +++--
 drivers/gpu/drm/qxl/qxl_display.c   | 18 +-
 include/drm/drm_simple_kms_helper.h |  4 ++
 7 files changed, 59 insertions(+), 122 deletions(-)

--
2.25.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v3 1/4] drm/simple-kms: Add drm_simple_encoder_{init, create}()

2020-02-25 Thread Thomas Zimmermann
This patch makes the internal encoder implementation of the simple
KMS helpers available to drivers.

These simple-encoder helpers initialize an encoder with an empty
implementation. This covers the requirements of most of the existing
DRM drivers. A call to drm_simple_encoder_create() allocates and
initializes an encoder instance, a call to drm_simple_encoder_init()
initializes a pre-allocated instance.

v3:
* remove drm_simple_encoder_create(); not required yet
* provide more precise documentation
v2:
* move simple encoder to KMS helpers
* remove name argument; simplifies implementation
* don't allocate with devm_ interfaces; unsafe with DRM

Signed-off-by: Thomas Zimmermann 
---
 drivers/gpu/drm/drm_simple_kms_helper.c | 34 ++---
 include/drm/drm_simple_kms_helper.h |  4 +++
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_simple_kms_helper.c 
b/drivers/gpu/drm/drm_simple_kms_helper.c
index 15fb516ae2d8..04309e4660de 100644
--- a/drivers/gpu/drm/drm_simple_kms_helper.c
+++ b/drivers/gpu/drm/drm_simple_kms_helper.c
@@ -26,12 +26,41 @@
  * entity. Some flexibility for code reuse is provided through a separately
  * allocated &drm_connector object and supporting optional &drm_bridge
  * encoder drivers.
+ *
+ * Many drivers require only a very simple encoder that fulfills the minimum
+ * requirements of the display pipeline and does not add additional
+ * functionality. The function drm_simple_encoder_init() provides an
+ * implementation of such an encoder.
  */
 
-static const struct drm_encoder_funcs drm_simple_kms_encoder_funcs = {
+static const struct drm_encoder_funcs drm_simple_encoder_funcs_cleanup = {
.destroy = drm_encoder_cleanup,
 };
 
+/**
+ * drm_simple_encoder_init - Initialize a preallocated encoder
+ * @dev: drm device
+ * @funcs: callbacks for this encoder
+ * @encoder_type: user visible type of the encoder
+ *
+ * Initialises a preallocated encoder that has no further functionality.
+ * Settings for possible CRTC and clones are left to their initial values.
+ * The encoder will be cleaned up automatically as part of the mode-setting
+ * cleanup.
+ *
+ * Returns:
+ * Zero on success, error code on failure.
+ */
+int drm_simple_encoder_init(struct drm_device *dev,
+   struct drm_encoder *encoder,
+   int encoder_type)
+{
+   return drm_encoder_init(dev, encoder,
+   &drm_simple_encoder_funcs_cleanup,
+   encoder_type, NULL);
+}
+EXPORT_SYMBOL(drm_simple_encoder_init);
+
 static enum drm_mode_status
 drm_simple_kms_crtc_mode_valid(struct drm_crtc *crtc,
   const struct drm_display_mode *mode)
@@ -288,8 +317,7 @@ int drm_simple_display_pipe_init(struct drm_device *dev,
return ret;
 
encoder->possible_crtcs = drm_crtc_mask(crtc);
-   ret = drm_encoder_init(dev, encoder, &drm_simple_kms_encoder_funcs,
-  DRM_MODE_ENCODER_NONE, NULL);
+   ret = drm_simple_encoder_init(dev, encoder, DRM_MODE_ENCODER_NONE);
if (ret || !connector)
return ret;
 
diff --git a/include/drm/drm_simple_kms_helper.h 
b/include/drm/drm_simple_kms_helper.h
index e253ba7bea9d..a026375464ff 100644
--- a/include/drm/drm_simple_kms_helper.h
+++ b/include/drm/drm_simple_kms_helper.h
@@ -181,4 +181,8 @@ int drm_simple_display_pipe_init(struct drm_device *dev,
const uint64_t *format_modifiers,
struct drm_connector *connector);
 
+int drm_simple_encoder_init(struct drm_device *dev,
+   struct drm_encoder *encoder,
+   int encoder_type);
+
 #endif /* __LINUX_DRM_SIMPLE_KMS_HELPER_H */
-- 
2.25.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v3 3/4] drm/mgag200: Use simple encoder

2020-02-25 Thread Thomas Zimmermann
The mgag200 driver uses an empty implementation for its encoder. Replace
the code with the generic simple encoder.

v3:
* init pre-allocated encoder with drm_simple_encoder_init()
v2:
* rebase onto new simple-encoder interface

Signed-off-by: Thomas Zimmermann 
---
 drivers/gpu/drm/mgag200/mgag200_drv.h  |  9 +--
 drivers/gpu/drm/mgag200/mgag200_mode.c | 85 +++---
 2 files changed, 12 insertions(+), 82 deletions(-)

diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.h 
b/drivers/gpu/drm/mgag200/mgag200_drv.h
index aa32aad222c2..9691252d6233 100644
--- a/drivers/gpu/drm/mgag200/mgag200_drv.h
+++ b/drivers/gpu/drm/mgag200/mgag200_drv.h
@@ -95,7 +95,6 @@
 #define MATROX_DPMS_CLEARED (-1)
 
 #define to_mga_crtc(x) container_of(x, struct mga_crtc, base)
-#define to_mga_encoder(x) container_of(x, struct mga_encoder, base)
 #define to_mga_connector(x) container_of(x, struct mga_connector, base)
 
 struct mga_crtc {
@@ -110,12 +109,6 @@ struct mga_mode_info {
struct mga_crtc *crtc;
 };
 
-struct mga_encoder {
-   struct drm_encoder base;
-   int last_dpms;
-};
-
-
 struct mga_i2c_chan {
struct i2c_adapter adapter;
struct drm_device *dev;
@@ -185,6 +178,8 @@ struct mga_device {
 
/* SE model number stored in reg 0x1e24 */
u32 unique_rev_id;
+
+   struct drm_encoder encoder;
 };
 
 static inline enum mga_type
diff --git a/drivers/gpu/drm/mgag200/mgag200_mode.c 
b/drivers/gpu/drm/mgag200/mgag200_mode.c
index 62a8e9ccb16d..efc9eaa6a4d4 100644
--- a/drivers/gpu/drm/mgag200/mgag200_mode.c
+++ b/drivers/gpu/drm/mgag200/mgag200_mode.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mgag200_drv.h"
 
@@ -1449,76 +1450,6 @@ static void mga_crtc_init(struct mga_device *mdev)
drm_crtc_helper_add(&mga_crtc->base, &mga_helper_funcs);
 }
 
-/*
- * The encoder comes after the CRTC in the output pipeline, but before
- * the connector. It's responsible for ensuring that the digital
- * stream is appropriately converted into the output format. Setup is
- * very simple in this case - all we have to do is inform qemu of the
- * colour depth in order to ensure that it displays appropriately
- */
-
-/*
- * These functions are analagous to those in the CRTC code, but are intended
- * to handle any encoder-specific limitations
- */
-static void mga_encoder_mode_set(struct drm_encoder *encoder,
-   struct drm_display_mode *mode,
-   struct drm_display_mode *adjusted_mode)
-{
-
-}
-
-static void mga_encoder_dpms(struct drm_encoder *encoder, int state)
-{
-   return;
-}
-
-static void mga_encoder_prepare(struct drm_encoder *encoder)
-{
-}
-
-static void mga_encoder_commit(struct drm_encoder *encoder)
-{
-}
-
-static void mga_encoder_destroy(struct drm_encoder *encoder)
-{
-   struct mga_encoder *mga_encoder = to_mga_encoder(encoder);
-   drm_encoder_cleanup(encoder);
-   kfree(mga_encoder);
-}
-
-static const struct drm_encoder_helper_funcs mga_encoder_helper_funcs = {
-   .dpms = mga_encoder_dpms,
-   .mode_set = mga_encoder_mode_set,
-   .prepare = mga_encoder_prepare,
-   .commit = mga_encoder_commit,
-};
-
-static const struct drm_encoder_funcs mga_encoder_encoder_funcs = {
-   .destroy = mga_encoder_destroy,
-};
-
-static struct drm_encoder *mga_encoder_init(struct drm_device *dev)
-{
-   struct drm_encoder *encoder;
-   struct mga_encoder *mga_encoder;
-
-   mga_encoder = kzalloc(sizeof(struct mga_encoder), GFP_KERNEL);
-   if (!mga_encoder)
-   return NULL;
-
-   encoder = &mga_encoder->base;
-   encoder->possible_crtcs = 0x1;
-
-   drm_encoder_init(dev, encoder, &mga_encoder_encoder_funcs,
-DRM_MODE_ENCODER_DAC, NULL);
-   drm_encoder_helper_add(encoder, &mga_encoder_helper_funcs);
-
-   return encoder;
-}
-
-
 static int mga_vga_get_modes(struct drm_connector *connector)
 {
struct mga_connector *mga_connector = to_mga_connector(connector);
@@ -1686,8 +1617,9 @@ static struct drm_connector *mga_vga_init(struct 
drm_device *dev)
 
 int mgag200_modeset_init(struct mga_device *mdev)
 {
-   struct drm_encoder *encoder;
+   struct drm_encoder *encoder = &mdev->encoder;
struct drm_connector *connector;
+   int ret;
 
mdev->mode_info.mode_config_initialized = true;
 
@@ -1698,11 +1630,14 @@ int mgag200_modeset_init(struct mga_device *mdev)
 
mga_crtc_init(mdev);
 
-   encoder = mga_encoder_init(mdev->dev);
-   if (!encoder) {
-   DRM_ERROR("mga_encoder_init failed\n");
-   return -1;
+   ret = drm_simple_encoder_init(mdev->dev, encoder,
+ DRM_MODE_ENCODER_DAC);
+   if (ret) {
+   DRM_ERROR("drm_simple_encoder_init() failed, error %d\n",
+ -ret);
+   return ret;
}
+   encoder->possible_

[PATCH v3 4/4] drm/qxl: Use simple encoder

2020-02-25 Thread Thomas Zimmermann
The qxl driver uses an empty implementation for its encoder. Replace
the code with the generic simple encoder.

v2:
* rebase onto new simple-encoder interface

Signed-off-by: Thomas Zimmermann 
Acked-by: Sam Ravnborg 
Acked-by: Gerd Hoffmann 
---
 drivers/gpu/drm/qxl/qxl_display.c | 18 +++---
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_display.c 
b/drivers/gpu/drm/qxl/qxl_display.c
index ab4f8dd00400..9c0e1add59fb 100644
--- a/drivers/gpu/drm/qxl/qxl_display.c
+++ b/drivers/gpu/drm/qxl/qxl_display.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "qxl_drv.h"
 #include "qxl_object.h"
@@ -1007,9 +1008,6 @@ static struct drm_encoder *qxl_best_encoder(struct 
drm_connector *connector)
return &qxl_output->enc;
 }
 
-static const struct drm_encoder_helper_funcs qxl_enc_helper_funcs = {
-};
-
 static const struct drm_connector_helper_funcs qxl_connector_helper_funcs = {
.get_modes = qxl_conn_get_modes,
.mode_valid = qxl_conn_mode_valid,
@@ -1059,15 +1057,6 @@ static const struct drm_connector_funcs 
qxl_connector_funcs = {
.atomic_destroy_state = drm_atomic_helper_connector_destroy_state,
 };
 
-static void qxl_enc_destroy(struct drm_encoder *encoder)
-{
-   drm_encoder_cleanup(encoder);
-}
-
-static const struct drm_encoder_funcs qxl_enc_funcs = {
-   .destroy = qxl_enc_destroy,
-};
-
 static int qxl_mode_create_hotplug_mode_update_property(struct qxl_device 
*qdev)
 {
if (qdev->hotplug_mode_update_property)
@@ -1098,15 +1087,14 @@ static int qdev_output_init(struct drm_device *dev, int 
num_output)
drm_connector_init(dev, &qxl_output->base,
   &qxl_connector_funcs, DRM_MODE_CONNECTOR_VIRTUAL);
 
-   drm_encoder_init(dev, &qxl_output->enc, &qxl_enc_funcs,
-DRM_MODE_ENCODER_VIRTUAL, NULL);
+   drm_simple_encoder_init(dev, &qxl_output->enc,
+   DRM_MODE_ENCODER_VIRTUAL);
 
/* we get HPD via client monitors config */
connector->polled = DRM_CONNECTOR_POLL_HPD;
encoder->possible_crtcs = 1 << num_output;
drm_connector_attach_encoder(&qxl_output->base,
  &qxl_output->enc);
-   drm_encoder_helper_add(encoder, &qxl_enc_helper_funcs);
drm_connector_helper_add(connector, &qxl_connector_helper_funcs);
 
drm_object_attach_property(&connector->base,
-- 
2.25.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v3 2/4] drm/ast: Use simple encoder

2020-02-25 Thread Thomas Zimmermann
The ast driver uses an empty implementation for its encoder. Replace
the code with the generic simple encoder.

v2:
* rebase onto new simple-encoder interface

Signed-off-by: Thomas Zimmermann 
Acked-by: Sam Ravnborg 
---
 drivers/gpu/drm/ast/ast_drv.h  |  6 +-
 drivers/gpu/drm/ast/ast_mode.c | 25 -
 2 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/ast/ast_drv.h b/drivers/gpu/drm/ast/ast_drv.h
index f5d8780776ae..656d591b154b 100644
--- a/drivers/gpu/drm/ast/ast_drv.h
+++ b/drivers/gpu/drm/ast/ast_drv.h
@@ -121,6 +121,7 @@ struct ast_private {
unsigned int next_index;
} cursor;
 
+   struct drm_encoder encoder;
struct drm_plane primary_plane;
struct drm_plane cursor_plane;
 
@@ -238,13 +239,8 @@ struct ast_crtc {
u8 offset_x, offset_y;
 };
 
-struct ast_encoder {
-   struct drm_encoder base;
-};
-
 #define to_ast_crtc(x) container_of(x, struct ast_crtc, base)
 #define to_ast_connector(x) container_of(x, struct ast_connector, base)
-#define to_ast_encoder(x) container_of(x, struct ast_encoder, base)
 
 struct ast_vbios_stdtable {
u8 misc;
diff --git a/drivers/gpu/drm/ast/ast_mode.c b/drivers/gpu/drm/ast/ast_mode.c
index 562ea6d9df13..7a9f20a2fd30 100644
--- a/drivers/gpu/drm/ast/ast_mode.c
+++ b/drivers/gpu/drm/ast/ast_mode.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ast_drv.h"
 #include "ast_tables.h"
@@ -968,28 +969,18 @@ static int ast_crtc_init(struct drm_device *dev)
  * Encoder
  */
 
-static void ast_encoder_destroy(struct drm_encoder *encoder)
-{
-   drm_encoder_cleanup(encoder);
-   kfree(encoder);
-}
-
-static const struct drm_encoder_funcs ast_enc_funcs = {
-   .destroy = ast_encoder_destroy,
-};
-
 static int ast_encoder_init(struct drm_device *dev)
 {
-   struct ast_encoder *ast_encoder;
+   struct ast_private *ast = dev->dev_private;
+   struct drm_encoder *encoder = &ast->encoder;
+   int ret;
 
-   ast_encoder = kzalloc(sizeof(struct ast_encoder), GFP_KERNEL);
-   if (!ast_encoder)
-   return -ENOMEM;
+   ret = drm_simple_encoder_init(dev, encoder, DRM_MODE_ENCODER_DAC);
+   if (ret)
+   return ret;
 
-   drm_encoder_init(dev, &ast_encoder->base, &ast_enc_funcs,
-DRM_MODE_ENCODER_DAC, NULL);
+   encoder->possible_crtcs = 1;
 
-   ast_encoder->base.possible_crtcs = 1;
return 0;
 }
 
-- 
2.25.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC v4 08/13] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-02-25 Thread Michal Hocko
On Thu 12-12-19 18:11:32, David Hildenbrand wrote:
> virtio-mem wants to offline and remove a memory block once it unplugged
> all subblocks (e.g., using alloc_contig_range()). Let's provide
> an interface to do that from a driver. virtio-mem already supports to
> offline partially unplugged memory blocks. Offlining a fully unplugged
> memory block will not require to migrate any pages. All unplugged
> subblocks are PageOffline() and have a reference count of 0 - so
> offlining code will simply skip them.
> 
> All we need an interface to trigger the "offlining" and the removing in a
> single operation - to make sure the memory block cannot get onlined by
> user space again before it gets removed.

Why does that matter? Is it really likely that the userspace would
interfere? What would be the scenario?

Or is still mostly about not requiring callers to open code this general
patter?

> To keep things simple, allow to only work on a single memory block.
> 
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Wei Yang 
> Cc: Dan Williams 
> Cc: Qian Cai 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/memory_hotplug.h |  1 +
>  mm/memory_hotplug.c| 35 ++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index ba0dca6aac6e..586f5c59c291 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -310,6 +310,7 @@ extern void try_offline_node(int nid);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern int remove_memory(int nid, u64 start, u64 size);
>  extern void __remove_memory(int nid, u64 start, u64 size);
> +extern int offline_and_remove_memory(int nid, u64 start, u64 size);
>  
>  #else
>  static inline bool is_mem_section_removable(unsigned long pfn,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index da01453a04e6..d04369e6d3cc 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1825,4 +1825,39 @@ int remove_memory(int nid, u64 start, u64 size)
>   return rc;
>  }
>  EXPORT_SYMBOL_GPL(remove_memory);
> +
> +/*
> + * Try to offline and remove a memory block. Might take a long time to
> + * finish in case memory is still in use. Primarily useful for memory devices
> + * that logically unplugged all memory (so it's no longer in use) and want to
> + * offline + remove the memory block.
> + */
> +int offline_and_remove_memory(int nid, u64 start, u64 size)
> +{
> + struct memory_block *mem;
> + int rc = -EINVAL;
> +
> + if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
> + size != memory_block_size_bytes())
> + return rc;
> +
> + lock_device_hotplug();
> + mem = find_memory_block(__pfn_to_section(PFN_DOWN(start)));
> + if (mem)
> + rc = device_offline(&mem->dev);
> + /* Ignore if the device is already offline. */
> + if (rc > 0)
> + rc = 0;
> +
> + /*
> +  * In case we succeeded to offline the memory block, remove it.
> +  * This cannot fail as it cannot get onlined in the meantime.
> +  */
> + if (!rc && try_remove_memory(nid, start, size))
> + BUG();
> + unlock_device_hotplug();
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(offline_and_remove_memory);
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC v4 11/13] mm/vmscan: Move count_vm_event(DROP_SLAB) into drop_slab()

2020-02-25 Thread Michal Hocko
On Thu 12-12-19 18:11:35, David Hildenbrand wrote:
> Let's count within the function itself, so every invocation (of future
> users) will be counted.
> 
> Cc: Alexander Viro 
> Cc: Andrew Morton 
> Cc: linux-fsde...@vger.kernel.org
> Signed-off-by: David Hildenbrand 

Slight inconsistency with the page cache droppint but nothing earth
shattering.

Acked-by: Michal Hocko 

> ---
>  fs/drop_caches.c | 4 +---
>  mm/vmscan.c  | 1 +
>  2 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index d31b6c72b476..a042da782fcd 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -61,10 +61,8 @@ int drop_caches_sysctl_handler(struct ctl_table *table, 
> int write,
>   iterate_supers(drop_pagecache_sb, NULL);
>   count_vm_event(DROP_PAGECACHE);
>   }
> - if (sysctl_drop_caches & 2) {
> + if (sysctl_drop_caches & 2)
>   drop_slab();
> - count_vm_event(DROP_SLAB);
> - }
>   if (!stfu) {
>   pr_info("%s (%d): drop_caches: %d\n",
>   current->comm, task_pid_nr(current),
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5a6445e86328..c3e53502a84a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -726,6 +726,7 @@ void drop_slab(void)
>  
>   for_each_online_node(nid)
>   drop_slab_node(nid);
> + count_vm_event(DROP_SLAB);
>  }
>  
>  static inline int is_page_cache_freeable(struct page *page)
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH][next] drm: Replace zero-length array with flexible-array member

2020-02-25 Thread Jani Nikula
On Tue, 25 Feb 2020, "Gustavo A. R. Silva"  wrote:
> The current codebase makes use of the zero-length array language
> extension to the C90 standard, but the preferred mechanism to declare
> variable-length types such as these ones is a flexible array member[1][2],
> introduced in C99:
>
> struct foo {
> int stuff;
> struct boo array[];
> };
>
> By making use of the mechanism above, we will get a compiler warning
> in case the flexible array does not occur last in the structure, which
> will help us prevent some kind of undefined behavior bugs from being
> inadvertently introduced[3] to the codebase from now on.
>
> Also, notice that, dynamic memory allocations won't be affected by
> this change:
>
> "Flexible array members have incomplete type, and so the sizeof operator
> may not be applied. As a quirk of the original implementation of
> zero-length arrays, sizeof evaluates to zero."[1]
>
> This issue was found with the help of Coccinelle.
>
> [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> [2] https://github.com/KSPP/linux/issues/21
> [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
>
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gem.h | 2 +-
>  drivers/gpu/drm/gma500/intel_bios.h   | 2 +-
>  drivers/gpu/drm/i915/display/intel_vbt_defs.h | 4 ++--
>  drivers/gpu/drm/i915/gt/intel_lrc.c   | 2 +-
>  drivers/gpu/drm/i915/i915_gpu_error.h | 2 +-

Please split out the i915 changes to a separate patch.

>  drivers/gpu/drm/msm/msm_gem.h | 2 +-
>  drivers/gpu/drm/qxl/qxl_cmd.c | 2 +-
>  drivers/gpu/drm/vboxvideo/vboxvideo.h | 2 +-
>  drivers/gpu/drm/vc4/vc4_drv.h | 2 +-
>  drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c| 2 +-
>  drivers/gpu/drm/vmwgfx/vmwgfx_surface.c   | 2 +-
>  include/drm/bridge/mhl.h  | 4 ++--
>  include/drm/drm_displayid.h   | 2 +-
>  include/uapi/drm/i915_drm.h   | 4 ++--

Not sure it's worth touching uapi headers. They're full of both [0] and
[]. Again, please at least split it to a separate patch to be decided
separately.

BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC v4 08/13] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-02-25 Thread David Hildenbrand
On 25.02.20 15:11, Michal Hocko wrote:
> On Thu 12-12-19 18:11:32, David Hildenbrand wrote:
>> virtio-mem wants to offline and remove a memory block once it unplugged
>> all subblocks (e.g., using alloc_contig_range()). Let's provide
>> an interface to do that from a driver. virtio-mem already supports to
>> offline partially unplugged memory blocks. Offlining a fully unplugged
>> memory block will not require to migrate any pages. All unplugged
>> subblocks are PageOffline() and have a reference count of 0 - so
>> offlining code will simply skip them.
>>
>> All we need an interface to trigger the "offlining" and the removing in a
>> single operation - to make sure the memory block cannot get onlined by
>> user space again before it gets removed.
> 
> Why does that matter? Is it really likely that the userspace would
> interfere? What would be the scenario?

I guess it's not that relevant after all (I think this comment dates
back to the times where we didn't have try_remove_memory() and could
actually BUG_ON() in remove_memory() if there would have been a race).
Can drop that part.

> 
> Or is still mostly about not requiring callers to open code this general
> patter?

>From kernel module context, I cannot get access to the actual memory
block device (find_memory_block()) and call the device_unregister().

Especially, also the device hotplug lock is not exported. So this is a
clean helper function to be used from kernel module context. (e.g., also
hyper-v showed interest for using that)

-- 
Thanks,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH][next] drm: Replace zero-length array with flexible-array member

2020-02-25 Thread Gustavo A. R. Silva
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/gpu/drm/etnaviv/etnaviv_gem.h | 2 +-
 drivers/gpu/drm/gma500/intel_bios.h   | 2 +-
 drivers/gpu/drm/i915/display/intel_vbt_defs.h | 4 ++--
 drivers/gpu/drm/i915/gt/intel_lrc.c   | 2 +-
 drivers/gpu/drm/i915/i915_gpu_error.h | 2 +-
 drivers/gpu/drm/msm/msm_gem.h | 2 +-
 drivers/gpu/drm/qxl/qxl_cmd.c | 2 +-
 drivers/gpu/drm/vboxvideo/vboxvideo.h | 2 +-
 drivers/gpu/drm/vc4/vc4_drv.h | 2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c| 2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c   | 2 +-
 include/drm/bridge/mhl.h  | 4 ++--
 include/drm/drm_displayid.h   | 2 +-
 include/uapi/drm/i915_drm.h   | 4 ++--
 14 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.h 
b/drivers/gpu/drm/etnaviv/etnaviv_gem.h
index 6b68fe16041b..98e60df882b6 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.h
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.h
@@ -105,7 +105,7 @@ struct etnaviv_gem_submit {
unsigned int nr_pmrs;
struct etnaviv_perfmon_request *pmrs;
unsigned int nr_bos;
-   struct etnaviv_gem_submit_bo bos[0];
+   struct etnaviv_gem_submit_bo bos[];
/* No new members here, the previous one is variable-length! */
 };
 
diff --git a/drivers/gpu/drm/gma500/intel_bios.h 
b/drivers/gpu/drm/gma500/intel_bios.h
index a1f9ce9465a5..0e6facf21e33 100644
--- a/drivers/gpu/drm/gma500/intel_bios.h
+++ b/drivers/gpu/drm/gma500/intel_bios.h
@@ -227,7 +227,7 @@ struct bdb_general_definitions {
 * number = (block_size - sizeof(bdb_general_definitions))/
 *   sizeof(child_device_config);
 */
-   struct child_device_config devices[0];
+   struct child_device_config devices[];
 };
 
 struct bdb_lvds_options {
diff --git a/drivers/gpu/drm/i915/display/intel_vbt_defs.h 
b/drivers/gpu/drm/i915/display/intel_vbt_defs.h
index 05c7cbe32eb4..aef7fe932d1a 100644
--- a/drivers/gpu/drm/i915/display/intel_vbt_defs.h
+++ b/drivers/gpu/drm/i915/display/intel_vbt_defs.h
@@ -462,7 +462,7 @@ struct bdb_general_definitions {
 * number = (block_size - sizeof(bdb_general_definitions))/
 *   defs->child_dev_size;
 */
-   u8 devices[0];
+   u8 devices[];
 } __packed;
 
 /*
@@ -839,7 +839,7 @@ struct bdb_mipi_config {
 
 struct bdb_mipi_sequence {
u8 version;
-   u8 data[0]; /* up to 6 variable length blocks */
+   u8 data[]; /* up to 6 variable length blocks */
 } __packed;
 
 /*
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 47561dc29304..5cec79152f17 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -216,7 +216,7 @@ struct virtual_engine {
 
/* And finally, which physical engines this virtual engine maps onto. */
unsigned int num_siblings;
-   struct intel_engine_cs *siblings[0];
+   struct intel_engine_cs *siblings[];
 };
 
 static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h 
b/drivers/gpu/drm/i915/i915_gpu_error.h
index 0d1f6c8ff355..5a6561f7a210 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -42,7 +42,7 @@ struct i915_vma_coredump {
int num_pages;
int page_count;
int unused;
-   u32 *pages[0];
+   u32 *pages[];
 };
 
 struct i915_request_coredump {
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 9e0953c2b7ce..37aa556c5f92 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -157,7 +157,7 @@ struct msm_gem_submit {
uint32_t handle;
};
uint64_t iova;
-   } bos[0];
+   } bo

Re: [PATCH RFC v4 12/13] mm/vmscan: Export drop_slab() and drop_slab_node()

2020-02-25 Thread Michal Hocko
On Thu 12-12-19 18:11:36, David Hildenbrand wrote:
> We already have a way to trigger reclaiming of all reclaimable slab objects
> from user space (echo 2 > /proc/sys/vm/drop_caches). Let's allow drivers
> to also trigger this when they really want to make progress and know what
> they are doing.

I cannot say I would be fan of this. This is a global action with user
visible performance impact. I am worried that we will find out that all
sorts of drivers have a very good idea that dropping slab caches is
going to help their problem whatever it is. We have seen the same patter
in the userspace already and that is the reason we are logging the usage
to the log and count invocations in the counter.

> virtio-mem wants to use these functions when it failed to unplug memory
> for quite some time (e.g., after 30 minutes). It will then try to
> free up reclaimable objects by dropping the slab caches every now and
> then (e.g., every 30 minutes) as long as necessary. There will be a way to
> disable this feature and info messages will be logged.
> 
> In the future, we want to have a drop_slab_range() functionality
> instead. Memory offlining code has similar demands and also other
> alloc_contig_range() users (e.g., gigantic pages) could make good use of
> this feature. Adding it, however, requires more work/thought.

We already do have a memory_notify(MEM_GOING_OFFLINE) for that purpose
and slab allocator implements a callback (slab_mem_going_offline_callback).
The callback is quite dumb and it doesn't really try to free objects
from the given memory range or even try to drop active objects which
might turn out to be hard but this sounds like a more robust way to
achieve what you want.
 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/mm.h | 4 ++--
>  mm/vmscan.c| 2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 64799c5cb39f..483300f58be8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2706,8 +2706,8 @@ int drop_caches_sysctl_handler(struct ctl_table *, int,
>   void __user *, size_t *, loff_t *);
>  #endif
>  
> -void drop_slab(void);
> -void drop_slab_node(int nid);
> +extern void drop_slab(void);
> +extern void drop_slab_node(int nid);
>  
>  #ifndef CONFIG_MMU
>  #define randomize_va_space 0
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c3e53502a84a..4e1cdaaec5e6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -719,6 +719,7 @@ void drop_slab_node(int nid)
>   } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>   } while (freed > 10);
>  }
> +EXPORT_SYMBOL(drop_slab_node);
>  
>  void drop_slab(void)
>  {
> @@ -728,6 +729,7 @@ void drop_slab(void)
>   drop_slab_node(nid);
>   count_vm_event(DROP_SLAB);
>  }
> +EXPORT_SYMBOL(drop_slab);
>  
>  static inline int is_page_cache_freeable(struct page *page)
>  {
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC v4 12/13] mm/vmscan: Export drop_slab() and drop_slab_node()

2020-02-25 Thread David Hildenbrand
On 25.02.20 15:58, Michal Hocko wrote:
> On Thu 12-12-19 18:11:36, David Hildenbrand wrote:
>> We already have a way to trigger reclaiming of all reclaimable slab objects
>> from user space (echo 2 > /proc/sys/vm/drop_caches). Let's allow drivers
>> to also trigger this when they really want to make progress and know what
>> they are doing.
> 
> I cannot say I would be fan of this. This is a global action with user
> visible performance impact. I am worried that we will find out that all
> sorts of drivers have a very good idea that dropping slab caches is
> going to help their problem whatever it is. We have seen the same patter
> in the userspace already and that is the reason we are logging the usage
> to the log and count invocations in the counter.

Yeah, I decided to hold back patch 11-13 for the v1 (which I am planning
to post in March after more testing). What we really want is to make
memory offlining an alloc_contig_range() work better with reclaimable
objects.

> 
>> virtio-mem wants to use these functions when it failed to unplug memory
>> for quite some time (e.g., after 30 minutes). It will then try to
>> free up reclaimable objects by dropping the slab caches every now and
>> then (e.g., every 30 minutes) as long as necessary. There will be a way to
>> disable this feature and info messages will be logged.
>>
>> In the future, we want to have a drop_slab_range() functionality
>> instead. Memory offlining code has similar demands and also other
>> alloc_contig_range() users (e.g., gigantic pages) could make good use of
>> this feature. Adding it, however, requires more work/thought.
> 
> We already do have a memory_notify(MEM_GOING_OFFLINE) for that purpose
> and slab allocator implements a callback (slab_mem_going_offline_callback).
> The callback is quite dumb and it doesn't really try to free objects
> from the given memory range or even try to drop active objects which
> might turn out to be hard but this sounds like a more robust way to
> achieve what you want.

Two things:

1. memory_notify(MEM_GOING_OFFLINE) is called after trying to isolate
the page range and checking if we only have movable pages. Won't help
much I guess.

2. alloc_contig_range() won't benefit from that.

Something like drop_slab_range() would be better, and calling it from
the right spots in the core (e.g., set_migratetype_isolate() see below).

Especially, have a look at mm/page_isolation.c:set_migratetype_isolate()

"FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. We
just check MOVABLE pages."

-- 
Thanks,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH][next] drm: Replace zero-length array with flexible-array member

2020-02-25 Thread Chris Wilson
Quoting Gustavo A. R. Silva (2020-02-25 14:03:47)
> The current codebase makes use of the zero-length array language
> extension to the C90 standard, but the preferred mechanism to declare
> variable-length types such as these ones is a flexible array member[1][2],
> introduced in C99:

I remember when gcc didn't support []. For the record, it appears
support for flexible arrays landed in gcc-3.0. So passes the minimum
compiler spec. That would be useful to mention for old farts with
forgetful memories.
-Chris
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3] virtio: Work around frames incorrectly marked as gso

2020-02-25 Thread Willem de Bruijn
>  An skb_dump() + dump_stack() when the packet socket gets such a
>  packet may point us to the root cause and fix that.
> >>>
> >>> We tried dump stack, it was not informative - it was just the recvmmsg
> >>> call stack coming from the UML until it hits the relevant recv bit in
> >>> af_packet - it does not tell us where the packet is coming from.
> >>>
> >>> Quoting from the message earlier in the thread:
> >>>
> >>> [ 2334.180854] Call Trace:
> >>> [ 2334.181947]  dump_stack+0x5c/0x80
> >>> [ 2334.183021]  packet_recvmsg.cold+0x23/0x49
> >>> [ 2334.184063]  ___sys_recvmsg+0xe1/0x1f0
> >>> [ 2334.185034]  ? packet_poll+0xca/0x130
> >>> [ 2334.186014]  ? sock_poll+0x77/0xb0
> >>> [ 2334.186977]  ? ep_item_poll.isra.0+0x3f/0xb0
> >>> [ 2334.187936]  ? ep_send_events_proc+0xf1/0x240
> >>> [ 2334.188901]  ? dequeue_signal+0xdb/0x180
> >>> [ 2334.189848]  do_recvmmsg+0xc8/0x2d0
> >>> [ 2334.190728]  ? ep_poll+0x8c/0x470
> >>> [ 2334.191581]  __sys_recvmmsg+0x108/0x150
> >>> [ 2334.192441]  __x64_sys_recvmmsg+0x25/0x30
> >>> [ 2334.193346]  do_syscall_64+0x53/0x140
> >>> [ 2334.194262]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>
> >> That makes sense. skb_dump might show more interesting details about
> >> the packet.
> >
> > I will add that and retest later today.
>
>
> skb len=818 headroom=2 headlen=818 tailroom=908
> mac=(2,14) net=(16,0) trans=16
> shinfo(txflags=0 nr_frags=0 gso(size=752 type=0 segs=1))
> csum(0x100024 ip_summed=3 complete_sw=0 valid=0 level=0)
> hash(0x0 sw=0 l4=0) proto=0x0800 pkttype=4 iif=0
> sk family=17 type=3 proto=0
>
> Deciphering the actual packet data gives a
>
> TCP packet, ACK and PSH set.
>
> The PSH flag looks like the only "interesting" thing about it in first read.

Thanks.

TCP always sets the PSH bit on a GSO packet as of commit commit
051ba67447de  ("tcp: force a PSH flag on TSO packets"), so that is
definitely informative.

The lower gso size might come from a path mtu probing depending on
tcp_base_mss, but that's definitely wild speculation. Increasing that
value to, say, 1024, could tell us.

In this case it may indeed not be a GSO packet. As 752 is the MSS + 28
B TCP header including timestamp + 20 B IPv4 header + 14B Eth header.
Which adds up to 814 already.

Not sure what those 2 B between skb->data and mac_header are. Was this
captured inside packet_rcv? network_header and transport_header both
at 16B offset is also sketchy, but again may be an artifact of where
exactly this is being read.

Perhaps this is a segment of a larger GSO packet that is retransmitted
in part. Like an mtu probe or loss probe. See for instance this in
tcp_send_loss_probe for  how a single MSS is extracted:

   if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
if (unlikely(tcp_fragment(sk, TCP_FRAG_IN_RTX_QUEUE, skb,
  (pcount - 1) * mss, mss,
  GFP_ATOMIC)))
goto rearm_timer;
skb = skb_rb_next(skb);
}

Note that I'm not implicating this specific code. I don't see anything
wrong with it. Just an indication that a trace would be very
informative, as it could tell if any of these edge cases is being hit.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio_ring: Fix mem leak with vring_new_virtqueue()

2020-02-25 Thread Suman Anna via Virtualization
Hi Jason,

On 2/24/20 11:39 PM, Jason Wang wrote:
> 
> On 2020/2/25 上午5:26, Suman Anna wrote:
>> The functions vring_new_virtqueue() and __vring_new_virtqueue() are used
>> with split rings, and any allocations within these functions are managed
>> outside of the .we_own_ring flag. The commit cbeedb72b97a ("virtio_ring:
>> allocate desc state for split ring separately") allocates the desc state
>> within the __vring_new_virtqueue() but frees it only when the
>> .we_own_ring
>> flag is set. This leads to a memory leak when freeing such allocated
>> virtqueues with the vring_del_virtqueue() function.
>>
>> Fix this by moving the desc_state free code outside the flag and only
>> for split rings. Issue was discovered during testing with remoteproc
>> and virtio_rpmsg.
>>
>> Fixes: cbeedb72b97a ("virtio_ring: allocate desc state for split ring
>> separately")
>> Signed-off-by: Suman Anna 
>> ---
>>   drivers/virtio/virtio_ring.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 867c7ebd3f10..58b96baa8d48 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -2203,10 +2203,10 @@ void vring_del_virtqueue(struct virtqueue *_vq)
>>    vq->split.queue_size_in_bytes,
>>    vq->split.vring.desc,
>>    vq->split.queue_dma_addr);
>> -
>> -    kfree(vq->split.desc_state);
>>   }
>>   }
>> +    if (!vq->packed_ring)
>> +    kfree(vq->split.desc_state);
> 
> 
> Nitpick, it looks to me it would be more clear if we just free
> desc_state unconditionally here (and remove the kfree for packed above).

OK, are you sure you want that to be folded into this patch? It looks to
me a separate cleanup/consolidation patch, and packed desc_state does
not suffer this memleak, and need not be backported into stable kernels.

regards
Suman

> Anyway desc_state will be allocated by use even if !we_own_ring.
> 
> Thanks
> 
> 
>>   list_del(&_vq->list);
>>   kfree(vq);
>>   }
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH RFC v4 12/13] mm/vmscan: Export drop_slab() and drop_slab_node()

2020-02-25 Thread Michal Hocko
On Tue 25-02-20 16:09:29, David Hildenbrand wrote:
> On 25.02.20 15:58, Michal Hocko wrote:
> > On Thu 12-12-19 18:11:36, David Hildenbrand wrote:
> >> We already have a way to trigger reclaiming of all reclaimable slab objects
> >> from user space (echo 2 > /proc/sys/vm/drop_caches). Let's allow drivers
> >> to also trigger this when they really want to make progress and know what
> >> they are doing.
> > 
> > I cannot say I would be fan of this. This is a global action with user
> > visible performance impact. I am worried that we will find out that all
> > sorts of drivers have a very good idea that dropping slab caches is
> > going to help their problem whatever it is. We have seen the same patter
> > in the userspace already and that is the reason we are logging the usage
> > to the log and count invocations in the counter.
> 
> Yeah, I decided to hold back patch 11-13 for the v1 (which I am planning
> to post in March after more testing). What we really want is to make
> memory offlining an alloc_contig_range() work better with reclaimable
> objects.
> 
> > 
> >> virtio-mem wants to use these functions when it failed to unplug memory
> >> for quite some time (e.g., after 30 minutes). It will then try to
> >> free up reclaimable objects by dropping the slab caches every now and
> >> then (e.g., every 30 minutes) as long as necessary. There will be a way to
> >> disable this feature and info messages will be logged.
> >>
> >> In the future, we want to have a drop_slab_range() functionality
> >> instead. Memory offlining code has similar demands and also other
> >> alloc_contig_range() users (e.g., gigantic pages) could make good use of
> >> this feature. Adding it, however, requires more work/thought.
> > 
> > We already do have a memory_notify(MEM_GOING_OFFLINE) for that purpose
> > and slab allocator implements a callback (slab_mem_going_offline_callback).
> > The callback is quite dumb and it doesn't really try to free objects
> > from the given memory range or even try to drop active objects which
> > might turn out to be hard but this sounds like a more robust way to
> > achieve what you want.
> 
> Two things:
> 
> 1. memory_notify(MEM_GOING_OFFLINE) is called after trying to isolate
> the page range and checking if we only have movable pages. Won't help
> much I guess.

You are right, I have missed that. Can we reorder those two calls?

> 2. alloc_contig_range() won't benefit from that.

True.

> Something like drop_slab_range() would be better, and calling it from
> the right spots in the core (e.g., set_migratetype_isolate() see below).
> 
> Especially, have a look at mm/page_isolation.c:set_migratetype_isolate()
> 
> "FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. We
> just check MOVABLE pages."

shrink_slab is really a large hammer for this purpose. The notifier
mechanism sounds more appropriate to me. If that means to move it
outside of its current position then let's try to experiment with that.
But there is a long route to have per pfn range reclaim.
-- 
Michal Hocko
SUSE Labs
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC v4 12/13] mm/vmscan: Export drop_slab() and drop_slab_node()

2020-02-25 Thread David Hildenbrand
On 25.02.20 18:06, Michal Hocko wrote:
> On Tue 25-02-20 16:09:29, David Hildenbrand wrote:
>> On 25.02.20 15:58, Michal Hocko wrote:
>>> On Thu 12-12-19 18:11:36, David Hildenbrand wrote:
 We already have a way to trigger reclaiming of all reclaimable slab objects
 from user space (echo 2 > /proc/sys/vm/drop_caches). Let's allow drivers
 to also trigger this when they really want to make progress and know what
 they are doing.
>>>
>>> I cannot say I would be fan of this. This is a global action with user
>>> visible performance impact. I am worried that we will find out that all
>>> sorts of drivers have a very good idea that dropping slab caches is
>>> going to help their problem whatever it is. We have seen the same patter
>>> in the userspace already and that is the reason we are logging the usage
>>> to the log and count invocations in the counter.
>>
>> Yeah, I decided to hold back patch 11-13 for the v1 (which I am planning
>> to post in March after more testing). What we really want is to make
>> memory offlining an alloc_contig_range() work better with reclaimable
>> objects.
>>
>>>
 virtio-mem wants to use these functions when it failed to unplug memory
 for quite some time (e.g., after 30 minutes). It will then try to
 free up reclaimable objects by dropping the slab caches every now and
 then (e.g., every 30 minutes) as long as necessary. There will be a way to
 disable this feature and info messages will be logged.

 In the future, we want to have a drop_slab_range() functionality
 instead. Memory offlining code has similar demands and also other
 alloc_contig_range() users (e.g., gigantic pages) could make good use of
 this feature. Adding it, however, requires more work/thought.
>>>
>>> We already do have a memory_notify(MEM_GOING_OFFLINE) for that purpose
>>> and slab allocator implements a callback (slab_mem_going_offline_callback).
>>> The callback is quite dumb and it doesn't really try to free objects
>>> from the given memory range or even try to drop active objects which
>>> might turn out to be hard but this sounds like a more robust way to
>>> achieve what you want.
>>
>> Two things:
>>
>> 1. memory_notify(MEM_GOING_OFFLINE) is called after trying to isolate
>> the page range and checking if we only have movable pages. Won't help
>> much I guess.
> 
> You are right, I have missed that. Can we reorder those two calls?

AFAIK no (would have to look up the details, but there was a good reason
for the order, e.g., avoid races with other users of page isolation like
alloc_contig_range()).

Especially, "[PATCH RFC v4 06/13] mm: Allow to offline unmovable
PageOffline() pages via MEM_GOING_OFFLINE" (which is still impatiently
waiting for an ACK ;) ) also works around that ordering issue in a way
we discussed back then.

-- 
Thanks,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h

2020-02-25 Thread Cornelia Huck
On Mon, 24 Feb 2020 19:49:53 +0100
Halil Pasic  wrote:

> On Mon, 24 Feb 2020 14:33:14 +1100
> David Gibson  wrote:
> 
> > On Fri, Feb 21, 2020 at 07:07:02PM +0100, Halil Pasic wrote:  
> > > On Fri, 21 Feb 2020 10:48:15 -0500
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > On Fri, Feb 21, 2020 at 02:06:39PM +0100, Halil Pasic wrote:  
> > > > > On Fri, 21 Feb 2020 14:27:27 +1100
> > > > > David Gibson  wrote:
> > > > >   
> > > > > > On Thu, Feb 20, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote:  
> > > > > > > On Thu, Feb 20, 2020 at 05:23:20PM +0100, Christian Borntraeger 
> > > > > > > wrote:  
> > > > > > > > >From a users perspective it makes absolutely perfect sense to 
> > > > > > > > >use the  
> > > > > > > > bounce buffers when they are NEEDED. 
> > > > > > > > Forcing the user to specify iommu_platform just because you 
> > > > > > > > need bounce buffers
> > > > > > > > really feels wrong. And obviously we have a severe performance 
> > > > > > > > issue
> > > > > > > > because of the indirections.  
> > > > > > > 
> > > > > > > The point is that the user should not have to specify 
> > > > > > > iommu_platform.
> > > > > > > We need to make sure any new hypervisor (especially one that 
> > > > > > > might require
> > > > > > > bounce buffering) always sets it,  
> > > > > > 
> > > > > > So, I have draft qemu patches which enable iommu_platform by 
> > > > > > default.
> > > > > > But that's really because of other problems with !iommu_platform, 
> > > > > > not
> > > > > > anything to do with bounce buffering or secure VMs.
> > > > > > 
> > > > > > The thing is that the hypervisor *doesn't* require bounce buffering.
> > > > > > In the POWER (and maybe s390 as well) models for Secure VMs, it's 
> > > > > > the
> > > > > > *guest*'s choice to enter secure mode, so the hypervisor has no 
> > > > > > reason
> > > > > > to know whether the guest needs bounce buffering.  As far as the
> > > > > > hypervisor and qemu are concerned that's a guest internal detail, it
> > > > > > just expects to get addresses it can access whether those are GPAs
> > > > > > (iommu_platform=off) or IOVAs (iommu_platform=on).  
> > > > > 
> > > > > I very much agree!
> > > > >   
> > > > > >   
> > > > > > > as was a rather bogus legacy hack  
> > > > > > 
> > > > > > It was certainly a bad idea, but it was a bad idea that went into a
> > > > > > public spec and has been widely deployed for many years.  We can't
> > > > > > just pretend it didn't happen and move on.
> > > > > > 
> > > > > > Turning iommu_platform=on by default breaks old guests, some of 
> > > > > > which
> > > > > > we still care about.  We can't (automatically) do it only for guests
> > > > > > that need bounce buffering, because the hypervisor doesn't know that
> > > > > > ahead of time.  

We could default to iommu_platform=on on s390 when the host has active
support for protected virtualization... but that's just another kind of
horrible, so let's just pretend I didn't suggest it.

> > > > > 
> > > > > Turning iommu_platform=on for virtio-ccw makes no sense whatsover,
> > > > > because for CCW I/O there is no such thing as IOMMU and the addresses
> > > > > are always physical addresses.  
> > > > 
> > > > Fix the name then. The spec calls is ACCESS_PLATFORM now, which
> > > > makes much more sense.  
> > > 
> > > I don't quite get it. Sorry. Maybe I will revisit this later.  
> > 
> > Halil, I think I can clarify this.
> > 
> > The "iommu_platform" flag doesn't necessarily have anything to do with
> > an iommu, although it often will.  Basically it means "access guest
> > memory via the bus's normal DMA mechanism" rather than "access guest
> > memory using GPA, because you're the hypervisor and you can do that".
> >   
> 
> Unfortunately, I don't think this is what is conveyed to the end users.
> Let's see what do we have documented:
> 
> Neither Qemu user documentation
> (https://www.qemu.org/docs/master/qemu-doc.html) nor online help:
> qemu-system-s390x -device virtio-net-ccw,?|grep iommu
>   iommu_platform=  - on/off (default: false)
> has any documentation on it.

Now, that's 'helpful' -- this certainly calls out for a bit of doc...

> 
> But libvirt does have have documenttion on the knob that contros
> iommu_platform for QEMU (when  QEMU is managed by libvirt):
> """
> Virtio-related options 
> 
> QEMU's virtio devices have some attributes related to the virtio
> transport under the driver element: The iommu attribute enables the use
> of emulated IOMMU by the device. The attribute ats controls the Address
> Translation Service support for PCIe devices. This is needed to make use
> of IOTLB support (see IOMMU device). Possible values are on or off.
> Since 3.5.0 
> """
> (https://libvirt.org/formatdomain.html#elementsVirtio)
> 
> Thus it seems the only available documentation says that it "enables the use
> of emulated IOMMU by the device".
> 
> And for vhost-user we have
> """
> When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not be

Re: [PATCH RFC v4 06/13] mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

2020-02-25 Thread David Hildenbrand
>>  /*
>>   * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
>> - * non-lru movable pages and hugepages). We scan pfn because it's much
>> - * easier than scanning over linked list. This function returns the pfn
>> - * of the first found movable page if it's found, otherwise 0.
>> + * non-lru movable pages and hugepages).
>> + *
>> + * Returns:
>> + *  0 in case a movable page is found and movable_pfn was updated.
>> + *  -ENOENT in case no movable page was found.
>> + *  -EBUSY in case a definetly unmovable page was found.
>>   */
>> -static unsigned long scan_movable_pages(unsigned long start, unsigned long 
>> end)
>> +static int scan_movable_pages(unsigned long start, unsigned long end,
>> +  unsigned long *movable_pfn)
>>  {
>>  unsigned long pfn;
>>  
>> @@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned 
>> long start, unsigned long end)
>>  continue;
>>  page = pfn_to_page(pfn);
>>  if (PageLRU(page))
>> -return pfn;
>> +goto found;
>>  if (__PageMovable(page))
>> -return pfn;
>> +goto found;
>> +
>> +/*
>> + * Unmovable PageOffline() pages where somebody still holds
>> + * a reference count (after MEM_GOING_OFFLINE) can definetly
>> + * not be offlined.
>> + */
>> +if (PageOffline(page) && page_count(page))
>> +return -EBUSY;
> 
> So the comment confused me a bit because technically this function isn't
> about offlining memory, it is about finding movable pages. I had to do a
> bit of digging to find the only consumer is __offline_pages, but if we are
> going to talk about "offlining" instead of "moving" in this function it
> might make sense to rename it.

Well, it's contained in memory_hotplug.c, and the only user of moving
pages around in there is offlining code :) And it's job is to locate
movable pages, skip over some (temporary? unmovable ones) and (now)
indicate definitely unmovable ones.

Any idea for a better name?
scan_movable_pages_and_stop_on_definitely_unmovable() is not so nice :)


> 
>>  
>>  if (!PageHuge(page))
>>  continue;
>>  head = compound_head(page);
>>  if (page_huge_active(head))
>> -return pfn;
>> +goto found;
>>  skip = compound_nr(head) - (page - head);
>>  pfn += skip - 1;
>>  }
>> +return -ENOENT;
>> +found:
>> +*movable_pfn = pfn;
>>  return 0;
>>  }
> 
> So I am looking at this function and it seems like your change completely
> changes the behavior. The code before would walk the entire range and if
> at least 1 page was available to move you would return the PFN of that
> page. Now what seems to happen is that you will return -EBUSY as soon as
> you encounter an offline page with a page count. I would think that would
> slow down the offlining process since you have made the Unmovable
> PageOffline() page a head of line blocker that you have to wait to get
> around.

So, the comment says "Unmovable PageOffline() pages where somebody still
holds a reference count (after MEM_GOING_OFFLINE) can definitely not be
offlined". And the doc "-EBUSY in case a definitely unmovable page was
found."

So why would this make offlining slow? Offlining will be aborted,
because offlining is not possible.

Please note that this is the exact old behavior, where isolating the
page range would have failed directly and offlining would have been
aborted early. The old offlining failure in the case in the offlining
path would have been "failure to isolate range".

Also, note that the users of PageOffline() with unmovable pages are very
rare (only balloon drivers for now).

> 
> Would it perhaps make more sense to add a return value initialized to
> ENOENT, and if you encounter one of these offline pages you change the
> return value to EBUSY, and then if you walk through the entire list
> without finding a movable page you just return the value?

Did you have a look in  which context this function is used, especially
[1] and [2]?

> 
> Otherwise you might want to add a comment explaining why the function
> should stall instead of skipping over the unmovable section that will
> hopefully become movable later.

So we have "-EBUSY in case a definitely unmovable page was found.". Do
you have a better suggestion?

> 
>> @@ -1528,7 +1543,8 @@ static int __ref __offline_pages(unsigned long 
>> start_pfn,
>>  }
>>  
>>  do {
>> -for (pfn = start_pfn; pfn;) {
>> +pfn = start_pfn;
>> +do {
>>  if (signal_pending(current)) {
>>  ret = -EINTR;
>>  reason = "signal backoff";
>> @@ -1538,14 +1554,19 @@ static int __ref __offline_pages(unsigned long 
>> 

Re: [PATCH RFC v4 06/13] mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

2020-02-25 Thread David Hildenbrand
On 25.02.20 22:46, Alexander Duyck wrote:
> On Tue, 2020-02-25 at 19:49 +0100, David Hildenbrand wrote:
  /*
   * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
 - * non-lru movable pages and hugepages). We scan pfn because it's much
 - * easier than scanning over linked list. This function returns the pfn
 - * of the first found movable page if it's found, otherwise 0.
 + * non-lru movable pages and hugepages).
 + *
 + * Returns:
 + *0 in case a movable page is found and movable_pfn was updated.
 + *-ENOENT in case no movable page was found.
 + *-EBUSY in case a definetly unmovable page was found.
   */
 -static unsigned long scan_movable_pages(unsigned long start, unsigned 
 long end)
 +static int scan_movable_pages(unsigned long start, unsigned long end,
 +unsigned long *movable_pfn)
  {
unsigned long pfn;
  
 @@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned 
 long start, unsigned long end)
continue;
page = pfn_to_page(pfn);
if (PageLRU(page))
 -  return pfn;
 +  goto found;
if (__PageMovable(page))
 -  return pfn;
 +  goto found;
 +
 +  /*
 +   * Unmovable PageOffline() pages where somebody still holds
 +   * a reference count (after MEM_GOING_OFFLINE) can definetly
 +   * not be offlined.
 +   */
 +  if (PageOffline(page) && page_count(page))
 +  return -EBUSY;
>>>
>>> So the comment confused me a bit because technically this function isn't
>>> about offlining memory, it is about finding movable pages. I had to do a
>>> bit of digging to find the only consumer is __offline_pages, but if we are
>>> going to talk about "offlining" instead of "moving" in this function it
>>> might make sense to rename it.
>>
>> Well, it's contained in memory_hotplug.c, and the only user of moving
>> pages around in there is offlining code :) And it's job is to locate
>> movable pages, skip over some (temporary? unmovable ones) and (now)
>> indicate definitely unmovable ones.
>>
>> Any idea for a better name?
>> scan_movable_pages_and_stop_on_definitely_unmovable() is not so nice :)
> 
> I dunno. What I was getting at is that the wording here would make it
> clearer if you simply stated that these pages "can definately not be
> moved". Saying you cannot offline a page that is PageOffline seems kind of
> redundant, then again calling it an Unmovable and then saying it cannot be
> moves is also redundant I suppose. In the end you don't move them, but

So, in summary, there are
- PageOffline() pages that are movable (balloon compaction).
- PageOffline() pages that cannot be moved and cannot be offlined (e.g.,
  no balloon compaction enabled, XEN, HyperV, ...) . page_count(page) >=
  0
- PageOffline() pages that cannot be moved, but can be offlined.
  page_count(page) == 0.


> they can be switched to offline if the page count hits 0. When that
> happens you simply end up skipping over them in the code for
> __test_page_isolated_in_pageblock and __offline_isolated_pages.

Yes. The thing with the wording is that pages with (PageOffline(page) &&
!page_count(page)) can also not really be moved, but they can be skipped
when offlining. If we call that "moving them to /dev/null", then yes,
they can be moved to some degree :)

I can certainly do here e.g.,

/*
 * PageOffline() pages that are not marked __PageMovable() and have a
 * reference count > 0 (after MEM_GOING_OFFLINE) are definitely
 * unmovable. If their reference count would be 0, they could be skipped
 * when offlining memory sections.
 */

And maybe I'll add to the function doc, that unmovable pages that are
skipped in this function can include pages that can be skipped when
offlining (moving them to nirvana).

Other suggestions?

[...]

>>
>> [1] we detect a definite offlining blocker and
>>
 +  } while (!ret);
 +
 +  if (ret != -ENOENT) {
 +  reason = "unmovable page";
>>
>> [2] we abort offlining
>>
 +  goto failed_removal_isolated;
}
  
/*
> 
> Yeah, this is the piece I misread.  I knew the loop this was in previously
> was looping when returning -ENOENT so for some reason I had it in my head
> that you were still looping on -EBUSY.

Ah okay, I see. Yeah, that wouldn't make sense for the use case I have :)

> 
> So the one question I would have is if at this point are we guaranteed
> that the balloon drivers have already taken care of the page count for all
> the pages they set to PageOffline? Based on the patch description I was
> thinking that this was going to be looping for a while waiting for the
> driver to clear the pages and then w

Re: [PATCH] virtio_ring: Fix mem leak with vring_new_virtqueue()

2020-02-25 Thread Jason Wang


On 2020/2/26 上午12:51, Suman Anna wrote:

Hi Jason,

On 2/24/20 11:39 PM, Jason Wang wrote:

On 2020/2/25 上午5:26, Suman Anna wrote:

The functions vring_new_virtqueue() and __vring_new_virtqueue() are used
with split rings, and any allocations within these functions are managed
outside of the .we_own_ring flag. The commit cbeedb72b97a ("virtio_ring:
allocate desc state for split ring separately") allocates the desc state
within the __vring_new_virtqueue() but frees it only when the
.we_own_ring
flag is set. This leads to a memory leak when freeing such allocated
virtqueues with the vring_del_virtqueue() function.

Fix this by moving the desc_state free code outside the flag and only
for split rings. Issue was discovered during testing with remoteproc
and virtio_rpmsg.

Fixes: cbeedb72b97a ("virtio_ring: allocate desc state for split ring
separately")
Signed-off-by: Suman Anna
---
   drivers/virtio/virtio_ring.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 867c7ebd3f10..58b96baa8d48 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -2203,10 +2203,10 @@ void vring_del_virtqueue(struct virtqueue *_vq)
    vq->split.queue_size_in_bytes,
    vq->split.vring.desc,
    vq->split.queue_dma_addr);
-
-    kfree(vq->split.desc_state);
   }
   }
+    if (!vq->packed_ring)
+    kfree(vq->split.desc_state);

Nitpick, it looks to me it would be more clear if we just free
desc_state unconditionally here (and remove the kfree for packed above).

OK, are you sure you want that to be folded into this patch? It looks to
me a separate cleanup/consolidation patch, and packed desc_state does
not suffer this memleak, and need not be backported into stable kernels.

regards
Suman



Though it's just a small tweak, I'm fine for leaving it for future.

So

Acked-by: Jason Wang 

Thanks

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH V5 0/5] vDPA support

2020-02-25 Thread Jason Wang
Hi all:

This is an update version of vDPA support in kernel.

vDPA device is a device that uses a datapath which complies with the
virtio specifications with vendor specific control path. vDPA devices
can be both physically located on the hardware or emulated by
software. vDPA hardware devices are usually implemented through PCIE
with the following types:

- PF (Physical Function) - A single Physical Function
- VF (Virtual Function) - Device that supports single root I/O
  virtualization (SR-IOV). Its Virtual Function (VF) represents a
  virtualized instance of the device that can be assigned to different
  partitions
- ADI (Assignable Device Interface) and its equivalents - With
  technologies such as Intel Scalable IOV, a virtual device (VDEV)
  composed by host OS utilizing one or more ADIs. Or its equivalent
  like SF (Sub function) from Mellanox.

>From a driver's perspective, depends on how and where the DMA
translation is done, vDPA devices are split into two types:

- Platform specific DMA translation - From the driver's perspective,
  the device can be used on a platform where device access to data in
  memory is limited and/or translated. An example is a PCIE vDPA whose
  DMA request was tagged via a bus (e.g PCIE) specific way. DMA
  translation and protection are done at PCIE bus IOMMU level.
- Device specific DMA translation - The device implements DMA
  isolation and protection through its own logic. An example is a vDPA
  device which uses on-chip IOMMU.

To hide the differences and complexity of the above types for a vDPA
device/IOMMU options and in order to present a generic virtio device
to the upper layer, a device agnostic framework is required.

This series introduces a software vDPA bus which abstracts the
common attributes of vDPA device, vDPA bus driver and the
communication method, the bus operations (vdpa_config_ops) between the
vDPA device abstraction and the vDPA bus driver. This allows multiple
types of drivers to be used for vDPA device like the virtio_vdpa and
vhost_vdpa driver to operate on the bus and allow vDPA device could be
used by either kernel virtio driver or userspace vhost drivers as:

   virtio drivers  vhost drivers
  | |
[virtio bus]   [vhost uAPI]
  | |
   virtio device   vhost device
   virtio_vdpa drv vhost_vdpa drv
 \   /
[vDPA bus]
 |
vDPA device
hardware drv
 |
[hardware bus]
 |
vDPA hardware

virtio_vdpa driver is a transport implementation for kernel virtio
drivers on top of vDPA bus operations. An alternative is to refactor
virtio bus which is sub-optimal since the bus and drivers are designed
to be use by kernel subsystem, a non-trivial major refactoring is
needed which may impact a brunches of drivers and devices
implementation inside the kernel. Using a new transport may grealy
simply both the design and changes.

vhost_vdpa driver is a new type of vhost device which allows userspace
vhost drivers to use vDPA devices via vhost uAPI (with minor
extension). This help to minimize the changes of existed vhost drivers
for using vDPA devices.

With the abstraction of vDPA bus and vDPA bus operations, the
difference and complexity of the under layer hardware is hidden from
upper layer. The vDPA bus drivers on top can use a unified
vdpa_config_ops to control different types of vDPA device.

This series contains the bus and virtio_vdpa implementation. We are
working on the vhost part and IFCVF (vDPA driver from Intel) which
will be posted in future few days.

Future work:

- direct doorbell mapping support
- direct interrupt support
- control virtqueue support
- dirty page tracking support
- management API (devlink)

Thanks

Changes from V4:

- use put_device() instead of kfree when fail to register virtio
  device (Jason)
- simplify the error handling when allocating vdpasim device (Jason)
- don't use device_for_each_child() during module exit (Jason)
- correct the error checking for vdpa_alloc_device() (Harpreet, Lingshan)

Changes from V3:

- various Kconfig fixes (Randy)

Changes from V2:

- release idr in the release function for put_device() unwind (Jason)
- don't panic when fail to register vdpa bus (Jason)
- use unsigned int instead of int for ida (Jason)
- fix the wrong commit log in virito_vdpa patches (Jason)
- make vdpa_sim depends on RUNTIME_TESTING_MENU (Michael)
- provide a bus release function for vDPA device (Jason)
- fix the wrong unwind when creating devices for vDPA simulator (Jason)
- move vDPA simulator to a dedicated directory (Lingshan)
- cancel the work before release vDPA simulator

Changes from V1:

- drop sysfs API, leave the management interface to future development
  (Michael)
- introduce incremental DMA ops (dma_map/dma_unmap) (Michael)
- introduce dma_device and use it instead of parent device for doing
  IOMMU or DMA from bus driver (Michael, Jason, Ling Shan,

[PATCH V5 1/5] vhost: factor out IOTLB

2020-02-25 Thread Jason Wang
This patch factors out IOTLB into a dedicated module in order to be
reused by other modules like vringh. User may choose to enable the
automatic retiring by specifying VHOST_IOTLB_FLAG_RETIRE flag to fit
for the case of vhost device IOTLB implementation.

Signed-off-by: Jason Wang 
---
 MAINTAINERS |   1 +
 drivers/vhost/Kconfig   |   6 +
 drivers/vhost/Makefile  |   2 +
 drivers/vhost/net.c |   2 +-
 drivers/vhost/vhost.c   | 221 +++-
 drivers/vhost/vhost.h   |  36 ++
 drivers/vhost/vhost_iotlb.c | 171 
 include/linux/vhost_iotlb.h |  45 
 8 files changed, 303 insertions(+), 181 deletions(-)
 create mode 100644 drivers/vhost/vhost_iotlb.c
 create mode 100644 include/linux/vhost_iotlb.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c74e4ea714a5..0fb645b5a7df 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17768,6 +17768,7 @@ T:  git 
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
 S: Maintained
 F: drivers/vhost/
 F: include/uapi/linux/vhost.h
+F: include/linux/vhost_iotlb.h
 
 VIRTIO INPUT DRIVER
 M: Gerd Hoffmann 
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..e76a72490563 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -36,6 +36,7 @@ config VHOST_VSOCK
 
 config VHOST
tristate
+   select VHOST_IOTLB
---help---
  This option is selected by any driver which needs to access
  the core of vhost.
@@ -54,3 +55,8 @@ config VHOST_CROSS_ENDIAN_LEGACY
  adds some overhead, it is disabled by default.
 
  If unsure, say "N".
+
+config VHOST_IOTLB
+   tristate
+   help
+ Generic IOTLB implementation for vhost and vringh.
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..df99756fbb26 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -11,3 +11,5 @@ vhost_vsock-y := vsock.o
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
 obj-$(CONFIG_VHOST)+= vhost.o
+
+obj-$(CONFIG_VHOST_IOTLB) += vhost_iotlb.o
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e158159671fa..e4a20d7a2921 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1594,7 +1594,7 @@ static long vhost_net_reset_owner(struct vhost_net *n)
struct socket *tx_sock = NULL;
struct socket *rx_sock = NULL;
long err;
-   struct vhost_umem *umem;
+   struct vhost_iotlb *umem;
 
mutex_lock(&n->dev.mutex);
err = vhost_dev_check_owner(&n->dev);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f44340b41494..9059b95cac83 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -50,10 +50,6 @@ enum {
 #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
-INTERVAL_TREE_DEFINE(struct vhost_umem_node,
-rb, __u64, __subtree_last,
-START, LAST, static inline, vhost_umem_interval_tree);
-
 #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
 {
@@ -581,21 +577,25 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
-struct vhost_umem *vhost_dev_reset_owner_prepare(void)
+static struct vhost_iotlb *iotlb_alloc(void)
+{
+   return vhost_iotlb_alloc(max_iotlb_entries,
+VHOST_IOTLB_FLAG_RETIRE);
+}
+
+struct vhost_iotlb *vhost_dev_reset_owner_prepare(void)
 {
-   return kvzalloc(sizeof(struct vhost_umem), GFP_KERNEL);
+   return iotlb_alloc();
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
 /* Caller should have device mutex */
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *umem)
 {
int i;
 
vhost_dev_cleanup(dev);
 
-   /* Restore memory to default empty mapping. */
-   INIT_LIST_HEAD(&umem->umem_list);
dev->umem = umem;
/* We don't need VQ locks below since vhost_dev_cleanup makes sure
 * VQs aren't running.
@@ -618,28 +618,6 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
-static void vhost_umem_free(struct vhost_umem *umem,
-   struct vhost_umem_node *node)
-{
-   vhost_umem_interval_tree_remove(node, &umem->umem_tree);
-   list_del(&node->link);
-   kfree(node);
-   umem->numem--;
-}
-
-static void vhost_umem_clean(struct vhost_umem *umem)
-{
-   struct vhost_umem_node *node, *tmp;
-
-   if (!umem)
-   return;
-
-   list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
-   vhost_umem_free(umem, node);
-
-   kvfree(umem);
-}
-
 static void vhost_clear_msg(struct vhost_dev *dev)
 {
struct vhost_msg_node *node, *n;
@@ -

[PATCH V5 2/5] vringh: IOTLB support

2020-02-25 Thread Jason Wang
This patch implements the third memory accessor for vringh besides
current kernel and userspace accessors. This idea is to allow vringh
to do the address translation through an IOTLB which is implemented
via vhost_map interval tree. Users should setup and IOVA to PA mapping
in this IOTLB.

This allows us to:

- Using vringh to access virtqueues with vIOMMU
- Using vringh to implement software virtqueues for vDPA devices

Signed-off-by: Jason Wang 
---
 drivers/vhost/Kconfig.vringh |   1 +
 drivers/vhost/vringh.c   | 421 +--
 include/linux/vringh.h   |  36 +++
 3 files changed, 435 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/Kconfig.vringh b/drivers/vhost/Kconfig.vringh
index c1fe36a9b8d4..a8d4dd0cb06e 100644
--- a/drivers/vhost/Kconfig.vringh
+++ b/drivers/vhost/Kconfig.vringh
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config VHOST_RING
tristate
+   select VHOST_IOTLB
---help---
  This option is selected by any driver which needs to access
  the host side of a virtio ring.
diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index a0a2d74967ef..ee0491f579ac 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 
 static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
@@ -71,9 +74,11 @@ static inline int __vringh_get_head(const struct vringh *vrh,
 }
 
 /* Copy some bytes to/from the iovec.  Returns num copied. */
-static inline ssize_t vringh_iov_xfer(struct vringh_kiov *iov,
+static inline ssize_t vringh_iov_xfer(struct vringh *vrh,
+ struct vringh_kiov *iov,
  void *ptr, size_t len,
- int (*xfer)(void *addr, void *ptr,
+ int (*xfer)(const struct vringh *vrh,
+ void *addr, void *ptr,
  size_t len))
 {
int err, done = 0;
@@ -82,7 +87,7 @@ static inline ssize_t vringh_iov_xfer(struct vringh_kiov *iov,
size_t partlen;
 
partlen = min(iov->iov[iov->i].iov_len, len);
-   err = xfer(iov->iov[iov->i].iov_base, ptr, partlen);
+   err = xfer(vrh, iov->iov[iov->i].iov_base, ptr, partlen);
if (err)
return err;
done += partlen;
@@ -96,6 +101,7 @@ static inline ssize_t vringh_iov_xfer(struct vringh_kiov 
*iov,
/* Fix up old iov element then increment. */
iov->iov[iov->i].iov_len = iov->consumed;
iov->iov[iov->i].iov_base -= iov->consumed;
+

iov->consumed = 0;
iov->i++;
@@ -227,7 +233,8 @@ static int slow_copy(struct vringh *vrh, void *dst, const 
void *src,
  u64 addr,
  struct vringh_range *r),
 struct vringh_range *range,
-int (*copy)(void *dst, const void *src, size_t len))
+int (*copy)(const struct vringh *vrh,
+void *dst, const void *src, size_t len))
 {
size_t part, len = sizeof(struct vring_desc);
 
@@ -241,7 +248,7 @@ static int slow_copy(struct vringh *vrh, void *dst, const 
void *src,
if (!rcheck(vrh, addr, &part, range, getrange))
return -EINVAL;
 
-   err = copy(dst, src, part);
+   err = copy(vrh, dst, src, part);
if (err)
return err;
 
@@ -262,7 +269,8 @@ __vringh_iov(struct vringh *vrh, u16 i,
 struct vringh_range *)),
 bool (*getrange)(struct vringh *, u64, struct vringh_range *),
 gfp_t gfp,
-int (*copy)(void *dst, const void *src, size_t len))
+int (*copy)(const struct vringh *vrh,
+void *dst, const void *src, size_t len))
 {
int err, count = 0, up_next, desc_max;
struct vring_desc desc, *descs;
@@ -291,7 +299,7 @@ __vringh_iov(struct vringh *vrh, u16 i,
err = slow_copy(vrh, &desc, &descs[i], rcheck, getrange,
&slowrange, copy);
else
-   err = copy(&desc, &descs[i], sizeof(desc));
+   err = copy(vrh, &desc, &descs[i], sizeof(desc));
if (unlikely(err))
goto fail;
 
@@ -404,7 +412,8 @@ static inline int __vringh_complete(struct vringh *vrh,
unsigned int num_used,
int (*putu16)(const struct vringh *vrh,
  _

[PATCH V5 3/5] vDPA: introduce vDPA bus

2020-02-25 Thread Jason Wang
vDPA device is a device that uses a datapath which complies with the
virtio specifications with vendor specific control path. vDPA devices
can be both physically located on the hardware or emulated by
software. vDPA hardware devices are usually implemented through PCIE
with the following types:

- PF (Physical Function) - A single Physical Function
- VF (Virtual Function) - Device that supports single root I/O
  virtualization (SR-IOV). Its Virtual Function (VF) represents a
  virtualized instance of the device that can be assigned to different
  partitions
- ADI (Assignable Device Interface) and its equivalents - With
  technologies such as Intel Scalable IOV, a virtual device (VDEV)
  composed by host OS utilizing one or more ADIs. Or its equivalent
  like SF (Sub function) from Mellanox.

>From a driver's perspective, depends on how and where the DMA
translation is done, vDPA devices are split into two types:

- Platform specific DMA translation - From the driver's perspective,
  the device can be used on a platform where device access to data in
  memory is limited and/or translated. An example is a PCIE vDPA whose
  DMA request was tagged via a bus (e.g PCIE) specific way. DMA
  translation and protection are done at PCIE bus IOMMU level.
- Device specific DMA translation - The device implements DMA
  isolation and protection through its own logic. An example is a vDPA
  device which uses on-chip IOMMU.

To hide the differences and complexity of the above types for a vDPA
device/IOMMU options and in order to present a generic virtio device
to the upper layer, a device agnostic framework is required.

This patch introduces a software vDPA bus which abstracts the
common attributes of vDPA device, vDPA bus driver and the
communication method (vdpa_config_ops) between the vDPA device
abstraction and the vDPA bus driver. This allows multiple types of
drivers to be used for vDPA device like the virtio_vdpa and vhost_vdpa
driver to operate on the bus and allow vDPA device could be used by
either kernel virtio driver or userspace vhost drivers as:

   virtio drivers  vhost drivers
  | |
[virtio bus]   [vhost uAPI]
  | |
   virtio device   vhost device
   virtio_vdpa drv vhost_vdpa drv
 \   /
[vDPA bus]
 |
vDPA device
hardware drv
 |
[hardware bus]
 |
vDPA hardware

With the abstraction of vDPA bus and vDPA bus operations, the
difference and complexity of the under layer hardware is hidden from
upper layer. The vDPA bus drivers on top can use a unified
vdpa_config_ops to control different types of vDPA device.

Signed-off-by: Jason Wang 
---
 MAINTAINERS  |   1 +
 drivers/virtio/Kconfig   |   2 +
 drivers/virtio/Makefile  |   1 +
 drivers/virtio/vdpa/Kconfig  |   8 ++
 drivers/virtio/vdpa/Makefile |   2 +
 drivers/virtio/vdpa/vdpa.c   | 167 +
 include/linux/vdpa.h | 232 +++
 7 files changed, 413 insertions(+)
 create mode 100644 drivers/virtio/vdpa/Kconfig
 create mode 100644 drivers/virtio/vdpa/Makefile
 create mode 100644 drivers/virtio/vdpa/vdpa.c
 create mode 100644 include/linux/vdpa.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 0fb645b5a7df..2b8d9fa38d9a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17701,6 +17701,7 @@ F:  tools/virtio/
 F: drivers/net/virtio_net.c
 F: drivers/block/virtio_blk.c
 F: include/linux/virtio*.h
+F: include/linux/vdpa.h
 F: include/uapi/linux/virtio_*.h
 F: drivers/crypto/virtio/
 F: mm/balloon_compaction.c
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 078615cf2afc..9c4fdb64d9ac 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -96,3 +96,5 @@ config VIRTIO_MMIO_CMDLINE_DEVICES
 If unsure, say 'N'.
 
 endif # VIRTIO_MENU
+
+source "drivers/virtio/vdpa/Kconfig"
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5dcf46..fdf5eacd0d0a 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VDPA) += vdpa/
diff --git a/drivers/virtio/vdpa/Kconfig b/drivers/virtio/vdpa/Kconfig
new file mode 100644
index ..9aac904a9515
--- /dev/null
+++ b/drivers/virtio/vdpa/Kconfig
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config VDPA
+   tristate
+   help
+ Enable this module to support vDPA device that uses a
+ datapath which complies with virtio specifications with
+ vendor specific control path.
+
diff --git a/drivers/virtio/vdpa/Makefile b/drivers/virtio/vdpa/Makefile
new file mode 100644
index ..ee6a35e8a4fb
--- /dev/n

[PATCH V5 4/5] virtio: introduce a vDPA based transport

2020-02-25 Thread Jason Wang
This patch introduces a vDPA transport for virtio. This is used to
use kernel virtio driver to drive the vDPA device that is capable
of populating virtqueue directly.

A new virtio-vdpa driver will be registered to the vDPA bus, when a
new virtio-vdpa device is probed, it will register the device with
vdpa based config ops. This means it is a software transport between
vDPA driver and vDPA device. The transport was implemented through
bus_ops of vDPA parent.

Signed-off-by: Jason Wang 
---
 drivers/virtio/Kconfig   |  13 ++
 drivers/virtio/Makefile  |   1 +
 drivers/virtio/virtio_vdpa.c | 396 +++
 3 files changed, 410 insertions(+)
 create mode 100644 drivers/virtio/virtio_vdpa.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 9c4fdb64d9ac..99e424570644 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -43,6 +43,19 @@ config VIRTIO_PCI_LEGACY
 
  If unsure, say Y.
 
+config VIRTIO_VDPA
+   tristate "vDPA driver for virtio devices"
+   select VDPA
+   select VIRTIO
+   help
+ This driver provides support for virtio based paravirtual
+ device driver over vDPA bus. For this to be useful, you need
+ an appropriate vDPA device implementation that operates on a
+ physical device to allow the datapath of virtio to be
+ offloaded to hardware.
+
+ If unsure, say M.
+
 config VIRTIO_PMEM
tristate "Support for virtio pmem driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index fdf5eacd0d0a..3407ac03fe60 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,4 +6,5 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_VDPA) += virtio_vdpa.o
 obj-$(CONFIG_VDPA) += vdpa/
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
new file mode 100644
index ..c30eb55030be
--- /dev/null
+++ b/drivers/virtio/virtio_vdpa.c
@@ -0,0 +1,396 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VIRTIO based driver for vDPA device
+ *
+ * Copyright (c) 2020, Red Hat. All rights reserved.
+ * Author: Jason Wang 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MOD_VERSION  "0.1"
+#define MOD_AUTHOR   "Jason Wang "
+#define MOD_DESC "vDPA bus driver for virtio devices"
+#define MOD_LICENSE  "GPL v2"
+
+struct virtio_vdpa_device {
+   struct virtio_device vdev;
+   struct vdpa_device *vdpa;
+   u64 features;
+
+   /* The lock to protect virtqueue list */
+   spinlock_t lock;
+   /* List of virtio_vdpa_vq_info */
+   struct list_head virtqueues;
+};
+
+struct virtio_vdpa_vq_info {
+   /* the actual virtqueue */
+   struct virtqueue *vq;
+
+   /* the list node for the virtqueues list */
+   struct list_head node;
+};
+
+static inline struct virtio_vdpa_device *
+to_virtio_vdpa_device(struct virtio_device *dev)
+{
+   return container_of(dev, struct virtio_vdpa_device, vdev);
+}
+
+static struct vdpa_device *vd_get_vdpa(struct virtio_device *vdev)
+{
+   return to_virtio_vdpa_device(vdev)->vdpa;
+}
+
+static void virtio_vdpa_get(struct virtio_device *vdev, unsigned offset,
+   void *buf, unsigned len)
+{
+   struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   ops->get_config(vdpa, offset, buf, len);
+}
+
+static void virtio_vdpa_set(struct virtio_device *vdev, unsigned offset,
+   const void *buf, unsigned len)
+{
+   struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   ops->set_config(vdpa, offset, buf, len);
+}
+
+static u32 virtio_vdpa_generation(struct virtio_device *vdev)
+{
+   struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   if (ops->get_generation)
+   return ops->get_generation(vdpa);
+
+   return 0;
+}
+
+static u8 virtio_vdpa_get_status(struct virtio_device *vdev)
+{
+   struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   return ops->get_status(vdpa);
+}
+
+static void virtio_vdpa_set_status(struct virtio_device *vdev, u8 status)
+{
+   struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   return ops->set_status(vdpa, status);
+}
+
+static void virtio_vdpa_reset(struct virtio_device *vdev)
+{
+   struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   return ops->set_status(vdpa, 0);
+}
+
+static bool virtio_vdpa

[PATCH V5 5/5] vdpasim: vDPA device simulator

2020-02-25 Thread Jason Wang
This patch implements a software vDPA networking device. The datapath
is implemented through vringh and workqueue. The device has an on-chip
IOMMU which translates IOVA to PA. For kernel virtio drivers, vDPA
simulator driver provides dma_ops. For vhost driers, set_map() methods
of vdpa_config_ops is implemented to accept mappings from vhost.

Currently, vDPA device simulator will loopback TX traffic to RX. So
the main use case for the device is vDPA feature testing, prototyping
and development.

Note, there's no management API implemented, a vDPA device will be
registered once the module is probed. We need to handle this in the
future development.

Signed-off-by: Jason Wang 
---
 drivers/virtio/vdpa/Kconfig |  18 +
 drivers/virtio/vdpa/Makefile|   1 +
 drivers/virtio/vdpa/vdpa_sim/Makefile   |   2 +
 drivers/virtio/vdpa/vdpa_sim/vdpa_sim.c | 645 
 4 files changed, 666 insertions(+)
 create mode 100644 drivers/virtio/vdpa/vdpa_sim/Makefile
 create mode 100644 drivers/virtio/vdpa/vdpa_sim/vdpa_sim.c

diff --git a/drivers/virtio/vdpa/Kconfig b/drivers/virtio/vdpa/Kconfig
index 9aac904a9515..9e7dc95e0c89 100644
--- a/drivers/virtio/vdpa/Kconfig
+++ b/drivers/virtio/vdpa/Kconfig
@@ -6,3 +6,21 @@ config VDPA
  datapath which complies with virtio specifications with
  vendor specific control path.
 
+menuconfig VDPA_MENU
+   bool "VDPA drivers"
+   default n
+
+if VDPA_MENU
+
+config VDPA_SIM
+   tristate "vDPA device simulator"
+   select VDPA
+   depends on RUNTIME_TESTING_MENU
+   default n
+   help
+ vDPA networking device simulator which loop TX traffic back
+ to RX. This device is used for testing, prototyping and
+ development of vDPA.
+
+endif # VDPA_MENU
+
diff --git a/drivers/virtio/vdpa/Makefile b/drivers/virtio/vdpa/Makefile
index ee6a35e8a4fb..3814af8e097b 100644
--- a/drivers/virtio/vdpa/Makefile
+++ b/drivers/virtio/vdpa/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VDPA) += vdpa.o
+obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
diff --git a/drivers/virtio/vdpa/vdpa_sim/Makefile 
b/drivers/virtio/vdpa/vdpa_sim/Makefile
new file mode 100644
index ..b40278f65e04
--- /dev/null
+++ b/drivers/virtio/vdpa/vdpa_sim/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_VDPA_SIM) += vdpa_sim.o
diff --git a/drivers/virtio/vdpa/vdpa_sim/vdpa_sim.c 
b/drivers/virtio/vdpa/vdpa_sim/vdpa_sim.c
new file mode 100644
index ..c728d49a88f6
--- /dev/null
+++ b/drivers/virtio/vdpa/vdpa_sim/vdpa_sim.c
@@ -0,0 +1,645 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDPA networking device simulator.
+ *
+ * Copyright (c) 2020, Red Hat Inc. All rights reserved.
+ * Author: Jason Wang 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRV_VERSION  "0.1"
+#define DRV_AUTHOR   "Jason Wang "
+#define DRV_DESC "vDPA Device Simulator"
+#define DRV_LICENSE  "GPL v2"
+
+struct vdpasim_virtqueue {
+   struct vringh vring;
+   struct vringh_kiov iov;
+   unsigned short head;
+   bool ready;
+   u64 desc_addr;
+   u64 device_addr;
+   u64 driver_addr;
+   u32 num;
+   void *private;
+   irqreturn_t (*cb)(void *data);
+};
+
+#define VDPASIM_QUEUE_ALIGN PAGE_SIZE
+#define VDPASIM_QUEUE_MAX 256
+#define VDPASIM_DEVICE_ID 0x1
+#define VDPASIM_VENDOR_ID 0
+#define VDPASIM_VQ_NUM 0x2
+#define VDPASIM_NAME "vdpasim-netdev"
+
+static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) |
+ (1ULL << VIRTIO_F_VERSION_1)  |
+ (1ULL << VIRTIO_F_IOMMU_PLATFORM);
+
+/* State of each vdpasim device */
+struct vdpasim {
+   struct vdpasim_virtqueue vqs[2];
+   struct work_struct work;
+   /* spinlock to synchronize virtqueue state */
+   spinlock_t lock;
+   struct vdpa_device *vdpa;
+   struct device dev;
+   struct virtio_net_config config;
+   struct vhost_iotlb *iommu;
+   void *buffer;
+   u32 status;
+   u32 generation;
+   u64 features;
+};
+
+struct vdpasim *vdpasim_dev;
+
+static struct vdpasim *dev_to_sim(struct device *dev)
+{
+   return container_of(dev, struct vdpasim, dev);
+}
+
+static struct vdpasim *vdpa_to_sim(struct vdpa_device *vdpa)
+{
+   struct device *d = &vdpa->dev;
+
+   return dev_to_sim(d->parent);
+}
+
+static void vdpasim_queue_ready(struct vdpasim *vdpasim, unsigned int idx)
+{
+   struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+   int ret;
+
+   ret = vringh_init_iotlb(&vq->vring, vdpasim_features,
+   VDPASIM_QUEUE_MAX, false,
+   (struct vring_desc *)(uintptr_t)vq->desc_addr,
+   (struct vring_avail *

Re: [PATCH V4 5/5] vdpasim: vDPA device simulator

2020-02-25 Thread Jason Wang


On 2020/2/21 下午3:57, Jason Wang wrote:


On 2020/2/20 下午11:12, Jason Gunthorpe wrote:

On Thu, Feb 20, 2020 at 02:11:41PM +0800, Jason Wang wrote:

+static void vdpasim_device_release(struct device *dev)
+{
+    struct vdpasim *vdpasim = dev_to_sim(dev);
+
+    cancel_work_sync(&vdpasim->work);
+    kfree(vdpasim->buffer);
+    vhost_iotlb_free(vdpasim->iommu);
+    kfree(vdpasim);
+}
+
+static struct vdpasim *vdpasim_create(void)
+{
+    struct virtio_net_config *config;
+    struct vhost_iotlb *iommu;
+    struct vdpasim *vdpasim;
+    struct device *dev;
+    void *buffer;
+    int ret = -ENOMEM;
+
+    iommu = vhost_iotlb_alloc(2048, 0);
+    if (!iommu)
+    goto err;
+
+    buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+    if (!buffer)
+    goto err_buffer;
+
+    vdpasim = kzalloc(sizeof(*vdpasim), GFP_KERNEL);
+    if (!vdpasim)
+    goto err_alloc;
+
+    vdpasim->buffer = buffer;
+    vdpasim->iommu = iommu;
+
+    config = &vdpasim->config;
+    config->mtu = 1500;
+    config->status = VIRTIO_NET_S_LINK_UP;
+    eth_random_addr(config->mac);
+
+    INIT_WORK(&vdpasim->work, vdpasim_work);
+    spin_lock_init(&vdpasim->lock);
+
+    vringh_set_iotlb(&vdpasim->vqs[0].vring, vdpasim->iommu);
+    vringh_set_iotlb(&vdpasim->vqs[1].vring, vdpasim->iommu);
+
+    dev = &vdpasim->dev;
+    dev->release = vdpasim_device_release;
+    dev->coherent_dma_mask = DMA_BIT_MASK(64);
+    set_dma_ops(dev, &vdpasim_dma_ops);
+    dev_set_name(dev, "%s", VDPASIM_NAME);
+
+    ret = device_register(&vdpasim->dev);
+    if (ret)
+    goto err_init;

It is a bit weird to be creating this dummy parent, couldn't this be
done by just passing a NULL parent to vdpa_alloc_device, doing
set_dma_ops() on the vdpasim->vdpa->dev and setting dma_device to
vdpasim->vdpa->dev ?



I think it works.



Rethink about this, since most hardware vDPA driver will have a parent 
and will use it to find the parent structure e.g


dev_get_drvdata(vdpa->dev->parent)

So I keep this dummy parent in V5.

Thanks


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v3] virtio: Work around frames incorrectly marked as gso

2020-02-25 Thread Anton Ivanov

On 25/02/2020 16:26, Willem de Bruijn wrote:

An skb_dump() + dump_stack() when the packet socket gets such a
packet may point us to the root cause and fix that.


We tried dump stack, it was not informative - it was just the recvmmsg
call stack coming from the UML until it hits the relevant recv bit in
af_packet - it does not tell us where the packet is coming from.

Quoting from the message earlier in the thread:

[ 2334.180854] Call Trace:
[ 2334.181947]  dump_stack+0x5c/0x80
[ 2334.183021]  packet_recvmsg.cold+0x23/0x49
[ 2334.184063]  ___sys_recvmsg+0xe1/0x1f0
[ 2334.185034]  ? packet_poll+0xca/0x130
[ 2334.186014]  ? sock_poll+0x77/0xb0
[ 2334.186977]  ? ep_item_poll.isra.0+0x3f/0xb0
[ 2334.187936]  ? ep_send_events_proc+0xf1/0x240
[ 2334.188901]  ? dequeue_signal+0xdb/0x180
[ 2334.189848]  do_recvmmsg+0xc8/0x2d0
[ 2334.190728]  ? ep_poll+0x8c/0x470
[ 2334.191581]  __sys_recvmmsg+0x108/0x150
[ 2334.192441]  __x64_sys_recvmmsg+0x25/0x30
[ 2334.193346]  do_syscall_64+0x53/0x140
[ 2334.194262]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


That makes sense. skb_dump might show more interesting details about
the packet.


I will add that and retest later today.



skb len=818 headroom=2 headlen=818 tailroom=908
mac=(2,14) net=(16,0) trans=16
shinfo(txflags=0 nr_frags=0 gso(size=752 type=0 segs=1))
csum(0x100024 ip_summed=3 complete_sw=0 valid=0 level=0)
hash(0x0 sw=0 l4=0) proto=0x0800 pkttype=4 iif=0
sk family=17 type=3 proto=0

Deciphering the actual packet data gives a

TCP packet, ACK and PSH set.

The PSH flag looks like the only "interesting" thing about it in first read.


Thanks.

TCP always sets the PSH bit on a GSO packet as of commit commit
051ba67447de  ("tcp: force a PSH flag on TSO packets"), so that is
definitely informative.

The lower gso size might come from a path mtu probing depending on
tcp_base_mss, but that's definitely wild speculation. Increasing that
value to, say, 1024, could tell us.

In this case it may indeed not be a GSO packet. As 752 is the MSS + 28
B TCP header including timestamp + 20 B IPv4 header + 14B Eth header.
Which adds up to 814 already.

Not sure what those 2 B between skb->data and mac_header are. Was this
captured inside packet_rcv? 


af_packet, packet_rcv

https://elixir.bootlin.com/linux/latest/source/net/packet/af_packet.c#L2026


network_header and transport_header both
at 16B offset is also sketchy, but again may be an artifact of where
exactly this is being read.

Perhaps this is a segment of a larger GSO packet that is retransmitted
in part. Like an mtu probe or loss probe. See for instance this in
tcp_send_loss_probe for  how a single MSS is extracted:

if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
 if (unlikely(tcp_fragment(sk, TCP_FRAG_IN_RTX_QUEUE, skb,
   (pcount - 1) * mss, mss,
   GFP_ATOMIC)))
 goto rearm_timer;
 skb = skb_rb_next(skb);
 }

Note that I'm not implicating this specific code. I don't see anything
wrong with it. Just an indication that a trace would be very
informative, as it could tell if any of these edge cases is being hit.


I will be honest, I have found it a bit difficult to trace.

At the point where this is detected, the packet is already in the vEth 
interface queue and is being read by recvmmsg on a raw socket.


The flags + gso size combination happened long before that - even before 
it was being placed in the queue.


What is clear so far is that while the packet has invalid 
gso_size/gso_type combination, it is an otherwise valid tcp frame.


I will stick the debug into is_gso (with a backtrace) instead and re-run 
it later today to see if this can pick it up elsewhere in the stack.




___
linux-um mailing list
linux...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um




--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization