Re: [PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling

2011-01-17 Thread Jason Wang
Michael S. Tsirkin writes:
 > On Tue, Jan 18, 2011 at 11:05:33AM +0800, Jason Wang wrote:
 > > Michael S. Tsirkin writes:
 > >  > On Mon, Jan 17, 2011 at 04:11:08PM +0800, Jason Wang wrote:
 > >  > > Code duplication was found between the handling of mergeable and big
 > >  > > buffers, so this patch tries to unify them. This can easily be done
 > >  > > by adding a quota to get_rx_bufs(), which is used to limit the
 > >  > > number of buffers it returns (for mergeable buffers the quota is
 > >  > > simply UIO_MAXIOV, for big buffers the quota is just 1), and then the
 > >  > > previous handle_rx_mergeable() can be reused for big buffers as well.
 > >  > > 
 > >  > > Signed-off-by: Jason Wang 
 > >  > 
 > >  > We actually started this way. However the code turned out
 > >  > to have measurable overhead when handle_rx_mergeable
 > >  > is called with quota 1.
 > >  > So before applying this I'd like to see some data
 > >  > to show this is not the case anymore.
 > >  > 
 > > 
 > > I've run a round of tests (Host->Guest) for these three patches on my desktop:
 > 
 > Yes but what if you only apply patch 3?
 > 

Results are below: slightly better than without it, but worse than applying all
three patches (except for big buffers at message size 2048), though the
difference is negligible.
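
(For reference, these are netperf TCP_STREAM results; each row presumably comes
from an invocation along the lines of the command below, with the send message
size (-m) varied over 64..2048. The exact command line is not shown in this
thread.)

    netperf -H <guest-ip> -t TCP_STREAM -l 60 -c -C -- -m 256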

mergeable buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.161 (10.66.91.161) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00      584.91    69.41    26.10    38.882  7.310
 87380  16384    256    60.00     1194.05    72.81    34.82    19.980  4.778
 87380  16384    512    60.00     1503.93    74.23    36.86    16.174  4.016
 87380  16384   1024    60.00     2004.57    73.53    37.59    12.019  3.073
 87380  16384   2048    60.00     3445.96    73.76    38.88     7.014  1.849

big buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.129 (10.66.91.129) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00      646.95    72.25    27.05    36.595  6.850
 87380  16384    256    60.00     1258.61    74.88    33.01    19.495  4.297
 87380  16384    512    60.00     1655.95    74.14    33.96    14.671  3.360
 87380  16384   1024    60.00     3220.32    74.24    33.31     7.555  1.695
 87380  16384   2048    60.00     4516.40    73.88    42.10     5.360  1.527

 > > Without these patches
 > > 
 > > mergeable buffers:
 > > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.42 (10.66.91.42) port 0 AF_INET
 > > Recv   Send    Send                          Utilization       Service Demand
 > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
 > > Size   Size    Size     Time     Throughput  local    remote   local   remote
 > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
 > > 
 > >  87380  16384     64    60.00      575.87    69.20    26.36    39.375  7.499
 > >  87380  16384    256    60.01     1123.57    73.16    34.73    21.335  5.064
 > >  87380  16384    512    60.00     1351.12    75.26    35.80    18.251  4.341
 > >  87380  16384   1024    60.00     1955.31    74.73    37.67    12.523  3.156
 > >  87380  16384   2048    60.01     3411.92    74.82    39.49     7.186  1.896
 > > 
 > > big buffers:
 > > Netperf test results
 > > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.109 (10.66.91.109) port 0 AF_INET
 > > Recv   Send    Send                          Utilization       Service Demand
 > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
 > > Size   Size    Size     Time     Throughput  local    remote   local   remote
 > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
 > > 
 > >  87380  16384     64    60.00      567.10    72.06    26.13    41.638  7.550
 > >  87380  16384    256    60.00     1143.69    74.66    32.58    21.392  4.667
 > >  87380  16384    512    60.00     1460.92    73.94    33.40    16.585  3.746
 > >  87380  16384   1024    60.00     3454.85    77.49    33.89     7.349  1.607
 > >  87380  16384   2048    60.00     3781.11    76.51    38.38     6.631  1.663
 > > 
 > > With these patches:
 > > 
 > > mergeable buffers:
 > > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 A

Re: [PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling

2011-01-17 Thread Michael S. Tsirkin
On Tue, Jan 18, 2011 at 11:05:33AM +0800, Jason Wang wrote:
> Michael S. Tsirkin writes:
>  > On Mon, Jan 17, 2011 at 04:11:08PM +0800, Jason Wang wrote:
>  > > Code duplication was found between the handling of mergeable and big
>  > > buffers, so this patch tries to unify them. This can easily be done
>  > > by adding a quota to get_rx_bufs(), which is used to limit the
>  > > number of buffers it returns (for mergeable buffers the quota is
>  > > simply UIO_MAXIOV, for big buffers the quota is just 1), and then the
>  > > previous handle_rx_mergeable() can be reused for big buffers as well.
>  > > 
>  > > Signed-off-by: Jason Wang 
>  > 
>  > We actually started this way. However the code turned out
>  > to have measurable overhead when handle_rx_mergeable
>  > is called with quota 1.
>  > So before applying this I'd like to see some data
>  > to show this is not the case anymore.
>  > 
> 
> I've run a round of tests (Host->Guest) for these three patches on my desktop:

Yes but what if you only apply patch 3?

> Without these patches
> 
> mergeable buffers:
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.42 (10.66.91.42) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00      575.87    69.20    26.36    39.375  7.499
>  87380  16384    256    60.01     1123.57    73.16    34.73    21.335  5.064
>  87380  16384    512    60.00     1351.12    75.26    35.80    18.251  4.341
>  87380  16384   1024    60.00     1955.31    74.73    37.67    12.523  3.156
>  87380  16384   2048    60.01     3411.92    74.82    39.49     7.186  1.896
> 
> big buffers:
> Netperf test results
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.109 (10.66.91.109) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00      567.10    72.06    26.13    41.638  7.550
>  87380  16384    256    60.00     1143.69    74.66    32.58    21.392  4.667
>  87380  16384    512    60.00     1460.92    73.94    33.40    16.585  3.746
>  87380  16384   1024    60.00     3454.85    77.49    33.89     7.349  1.607
>  87380  16384   2048    60.00     3781.11    76.51    38.38     6.631  1.663
> 
> With these patches:
> 
> mergeable buffers:
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.236 (10.66.91.236) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00      657.53    71.27    26.42    35.517  6.583
>  87380  16384    256    60.00     1217.73    74.34    34.67    20.004  4.665
>  87380  16384    512    60.00     1575.25    75.27    37.12    15.658  3.861
>  87380  16384   1024    60.00     2416.07    74.77    37.20    10.140  2.522
>  87380  16384   2048    60.00     3702.29    77.31    36.29     6.842  1.606
> 
> big buffers:
> Netperf test results
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.202 (10.66.91.202) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00      647.67    71.86    27.26    36.356  6.895
>  87380  16384    256    60.00     1265.82    76.19    36.54    19.724  4.729
>  87380  16384    512    60.00     1796.64    76.06    39.48    13.872  3.601
>  87380  16384   1024    60.00     4008.37    77.05    36.47     6.299  1.491
>  87380  16384   2048    60.00     4468.56    75.18    41.79     5.513  1.532
> 
> Looks like the unification does not hurt performance, and with these patches
> we get some improvement. BTW, the mergeable-buffer regression still exists.
> 
>  > > ---
>  > >  drivers/vhost/net.c |  128 
> +++
>  > >  1 files changed, 7 insertions(+), 121 deletions(-)
>  > > 
>  > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>  > > index 95e49de..c32a2e4 100644
>  > > --- a/drivers/v

Re: [PATCH 1/3] vhost-net: check the support of mergeable buffer outside the receive loop

2011-01-17 Thread Michael S. Tsirkin
On Tue, Jan 18, 2011 at 12:26:17PM +0800, Jason Wang wrote:
> Michael S. Tsirkin writes:
>  > On Mon, Jan 17, 2011 at 04:10:59PM +0800, Jason Wang wrote:
>  > > No need to check the support of mergeable buffers inside the receive
>  > > loop as the whole handle_rx()_xx is in the read critical region.  So
>  > > this patch moves it ahead of the receiving loop.
>  > > 
>  > > Signed-off-by: Jason Wang 
>  > 
>  > Well feature check is mostly just features & bit
>  > so why would it be slower? Because of the rcu
>  > adding memory barriers? Do you observe a
>  > measurable speedup with this patch?
>  > 
> 
> I did not measure the performance of just this patch alone; the difference may
> not be obvious. It also helps with the code unification.
> 
>  > Apropos, I noticed that the check in vhost_has_feature
>  > is wrong: it uses the same kind of RCU as the
>  > private pointer. So we'll have to fix that properly
>  > by adding more lockdep classes, but for now
>  > we'll need to make
>  > the check 1 || lockdep_is_held(&dev->mutex);
>  > and add a TODO.
>  > 
> 
> Not sure, lockdep_is_held(&dev->mutex) may not be accurate but is sufficient,
> as it is always held in the read critical region.

Not really, when we call vhost_has_feature from the vq handling thread
it's not.

>  > > ---
>  > >  drivers/vhost/net.c |5 +++--
>  > >  1 files changed, 3 insertions(+), 2 deletions(-)
>  > > 
>  > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>  > > index 14fc189..95e49de 100644
>  > > --- a/drivers/vhost/net.c
>  > > +++ b/drivers/vhost/net.c
>  > > @@ -411,7 +411,7 @@ static void handle_rx_mergeable(struct vhost_net 
> *net)
>  > >  };
>  > >  
>  > >  size_t total_len = 0;
>  > > -int err, headcount;
>  > > +int err, headcount, mergeable;
>  > >  size_t vhost_hlen, sock_hlen;
>  > >  size_t vhost_len, sock_len;
>  > >  struct socket *sock = rcu_dereference(vq->private_data);
>  > > @@ -425,6 +425,7 @@ static void handle_rx_mergeable(struct vhost_net 
> *net)
>  > >  
>  > >  vq_log = unlikely(vhost_has_feature(&net->dev, 
> VHOST_F_LOG_ALL)) ?
>  > >  vq->log : NULL;
>  > > +mergeable = vhost_has_feature(&net->dev, 
> VIRTIO_NET_F_MRG_RXBUF);
>  > >  
>  > >  while ((sock_len = peek_head_len(sock->sk))) {
>  > >  sock_len += sock_hlen;
>  > > @@ -474,7 +475,7 @@ static void handle_rx_mergeable(struct vhost_net 
> *net)
>  > >  break;
>  > >  }
>  > >  /* TODO: Should check and handle checksum. */
>  > > -if (vhost_has_feature(&net->dev, 
> VIRTIO_NET_F_MRG_RXBUF) &&
>  > > +if (likely(mergeable) &&
>  > >  memcpy_toiovecend(vq->hdr, (unsigned char 
> *)&headcount,
>  > >offsetof(typeof(hdr), 
> num_buffers),
>  > >sizeof hdr.num_buffers)) {
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] vhost-net: check the support of mergeable buffer outside the receive loop

2011-01-17 Thread Jason Wang
Michael S. Tsirkin writes:
 > On Mon, Jan 17, 2011 at 04:10:59PM +0800, Jason Wang wrote:
 > > No need to check the support of mergeable buffers inside the receive
 > > loop as the whole handle_rx()_xx is in the read critical region.  So
 > > this patch moves it ahead of the receiving loop.
 > > 
 > > Signed-off-by: Jason Wang 
 > 
 > Well feature check is mostly just features & bit
 > so why would it be slower? Because of the rcu
 > adding memory barriers? Do you observe a
 > measurable speedup with this patch?
 > 

I did not measure the performance of just this patch alone; the difference may
not be obvious. It also helps with the code unification.

 > Apropos, I noticed that the check in vhost_has_feature
 > is wrong: it uses the same kind of RCU as the
 > private pointer. So we'll have to fix that properly
 > by adding more lockdep classes, but for now
 > we'll need to make
 > the check 1 || lockdep_is_held(&dev->mutex);
 > and add a TODO.
 > 

Not sure, lockdep_is_held(&dev->mutex) may not be accurate but is sufficient,
as it is always held in the read critical region.

 > > ---
 > >  drivers/vhost/net.c |5 +++--
 > >  1 files changed, 3 insertions(+), 2 deletions(-)
 > > 
 > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 > > index 14fc189..95e49de 100644
 > > --- a/drivers/vhost/net.c
 > > +++ b/drivers/vhost/net.c
 > > @@ -411,7 +411,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >};
 > >  
 > >size_t total_len = 0;
 > > -  int err, headcount;
 > > +  int err, headcount, mergeable;
 > >size_t vhost_hlen, sock_hlen;
 > >size_t vhost_len, sock_len;
 > >struct socket *sock = rcu_dereference(vq->private_data);
 > > @@ -425,6 +425,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  
 > >vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 > >vq->log : NULL;
 > > +  mergeable = vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF);
 > >  
 > >while ((sock_len = peek_head_len(sock->sk))) {
 > >sock_len += sock_hlen;
 > > @@ -474,7 +475,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >break;
 > >}
 > >/* TODO: Should check and handle checksum. */
 > > -  if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
 > > +  if (likely(mergeable) &&
 > >memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
 > >  offsetof(typeof(hdr), num_buffers),
 > >  sizeof hdr.num_buffers)) {
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling

2011-01-17 Thread Jason Wang
Michael S. Tsirkin writes:
 > On Mon, Jan 17, 2011 at 04:11:08PM +0800, Jason Wang wrote:
 > > Code duplication was found between the handling of mergeable and big
 > > buffers, so this patch tries to unify them. This can easily be done
 > > by adding a quota to get_rx_bufs(), which is used to limit the
 > > number of buffers it returns (for mergeable buffers the quota is
 > > simply UIO_MAXIOV, for big buffers the quota is just 1), and then the
 > > previous handle_rx_mergeable() can be reused for big buffers as well.
 > > 
 > > Signed-off-by: Jason Wang 
 > 
 > We actually started this way. However the code turned out
 > to have measurable overhead when handle_rx_mergeable
 > is called with quota 1.
 > So before applying this I'd like to see some data
 > to show this is not the case anymore.
 > 

I've run a round of tests (Host->Guest) for these three patches on my desktop:

Without these patches

mergeable buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.42 (10.66.91.42) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00      575.87    69.20    26.36    39.375  7.499
 87380  16384    256    60.01     1123.57    73.16    34.73    21.335  5.064
 87380  16384    512    60.00     1351.12    75.26    35.80    18.251  4.341
 87380  16384   1024    60.00     1955.31    74.73    37.67    12.523  3.156
 87380  16384   2048    60.01     3411.92    74.82    39.49     7.186  1.896

big buffers:
Netperf test results
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.109 (10.66.91.109) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00      567.10    72.06    26.13    41.638  7.550
 87380  16384    256    60.00     1143.69    74.66    32.58    21.392  4.667
 87380  16384    512    60.00     1460.92    73.94    33.40    16.585  3.746
 87380  16384   1024    60.00     3454.85    77.49    33.89     7.349  1.607
 87380  16384   2048    60.00     3781.11    76.51    38.38     6.631  1.663

With these patches:

mergeable buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.236 (10.66.91.236) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00      657.53    71.27    26.42    35.517  6.583
 87380  16384    256    60.00     1217.73    74.34    34.67    20.004  4.665
 87380  16384    512    60.00     1575.25    75.27    37.12    15.658  3.861
 87380  16384   1024    60.00     2416.07    74.77    37.20    10.140  2.522
 87380  16384   2048    60.00     3702.29    77.31    36.29     6.842  1.606

big buffers:
Netperf test results
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.202 (10.66.91.202) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00      647.67    71.86    27.26    36.356  6.895
 87380  16384    256    60.00     1265.82    76.19    36.54    19.724  4.729
 87380  16384    512    60.00     1796.64    76.06    39.48    13.872  3.601
 87380  16384   1024    60.00     4008.37    77.05    36.47     6.299  1.491
 87380  16384   2048    60.00     4468.56    75.18    41.79     5.513  1.532

Looks like the unification does not hurt performance, and with these patches
we get some improvement. BTW, the mergeable-buffer regression still exists.
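
(A schematic sketch of the unification idea, not the actual patch; names follow
the code quoted in this thread, and elided parts are marked with /* ... */.
The only difference between the mergeable and big-buffer receive paths becomes
the quota passed to get_rx_bufs().)

    static void handle_rx(struct vhost_net *net)
    {
            /* mergeable buffers may use up to UIO_MAXIOV heads per packet,
             * big buffers exactly one */
            int quota = vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) ?
                            UIO_MAXIOV : 1;
            /* ... setup as in the existing handle_rx_mergeable() ... */
            while ((sock_len = peek_head_len(sock->sk))) {
                    headcount = get_rx_bufs(vq, vq->heads, vhost_len, &in,
                                            vq_log, &log, quota);
                    /* ... receive into the returned buffers as before ... */
            }
    }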

 > > ---
 > >  drivers/vhost/net.c |  128 
 > > +++
 > >  1 files changed, 7 insertions(+), 121 deletions(-)
 > > 
 > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 > > index 95e49de..c32a2e4 100644
 > > --- a/drivers/vhost/net.c
 > > +++ b/drivers/vhost/net.c
 > > @@ -227,6 +227,7 @@ static int peek_head_len(struct sock *sk)
 > >   * @iovcount  - returned count of io vectors we fill
 > >   * @log   - vhost log
 > >   * @log_num   - log offset
 > > + * @quota   - headcount quota,

Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-17 Thread Stefan Berger

On 01/16/2011 09:43 AM, Avi Kivity wrote:

On 01/14/2011 09:27 PM, Stefan Berger wrote:




Can you sprinkle some printfs() around kvm_run (in qemu-kvm.c) to
verify this?



Here's what I did:


interrupt exit requested


It appears from this you're using qemu.git.  Please try qemu-kvm.git, 
where the code appears to be correct.



Cc'ing qemu-devel now. For reference, here is the initial problem description:

http://www.spinics.net/lists/kvm/msg48274.html

I didn't know there was another tree...

I have now seen a couple of suspends-while-reading with the patches applied
to the qemu-kvm.git tree and indeed, when run with the same host kernel
and VM, I do not see the debugging dumps due to double reads that I would
have anticipated seeing by now. Now what? Can this be easily fixed in
the other Qemu tree as well?


One thing I'd like to mention is that I have seen what I think are
interrupt stalls when running my tests inside the qemu-kvm.git tree
version and not suspending at all. At some point the interrupt counter in
the guest kernel does not increase anymore even though I see the device
model raising the IRQ and lowering it. The same tests run literally
forever in the qemu.git tree version of Qemu.


Regards,
   Stefan

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ConVirt 2.0.1 Open Source released.

2011-01-17 Thread jd
Stefan, 

  I think you have captured it nicely. Here are a few links that give more of a
feature-level perspective.

http://www.convirture.com/products_opensource.php
http://www.convirture.com/products_enterprise.php
http://www.convirture.com/products_compare.php

/Jd
--- On Wed, 12/15/10, Stefan Hajnoczi  wrote:

> From: Stefan Hajnoczi 
> Subject: Re: ConVirt 2.0.1 Open Source released.
> To: "jd" 
> Cc: "KVM List" 
> Date: Wednesday, December 15, 2010, 1:57 PM
> On Wed, Dec 15, 2010 at 8:28 PM, jd
> 
> wrote:
> > We are pleased to announce availability of ConVirt
> 2.0.1 open
> > source. We would like to thank ConVirt user community
> for their
> > continuing participation and support. This release
> incorporates
> > feedback gathered from the community over last few
> months.
> 
> jd: A description of ConVirt would be nice.
> 
> Here's what I've figured out from the links:
> 
> It is a management tool for Xen, KVM, and others. 
> Written in Python
> under the GPLv2 but developed as "open core" software
> (there's an
> "open source" edition and an "enterprise" edition). 
> It talks to KVM
> using the QEMU (human) monitor.
> 
> Stefan
> 


  
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: fix build warning within __kvm_set_memory_region() on s390

2011-01-17 Thread Heiko Carstens
From: Heiko Carstens 

Get rid of this warning:

  CC  arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c:596:12: warning: 
'kvm_create_dirty_bitmap' defined but not used

The only caller of the function is within a !CONFIG_S390 section, so add the
same ifdef around kvm_create_dirty_bitmap() as well.

Signed-off-by: Heiko Carstens 
---

 virt/kvm/kvm_main.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f29abeb..a762720 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -588,6 +588,7 @@ static int kvm_vm_release(struct inode *inode, struct file 
*filp)
return 0;
 }
 
+#ifndef CONFIG_S390
 /*
  * Allocation size is twice as large as the actual dirty bitmap size.
  * This makes it possible to do double buffering: see x86's
@@ -608,6 +609,7 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot 
*memslot)
memslot->dirty_bitmap_head = memslot->dirty_bitmap;
return 0;
 }
+#endif /* !CONFIG_S390 */
 
 /*
  * Allocate some memory and give it an address in the guest physical address
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM call agenda for Jan 18

2011-01-17 Thread Chris Wright
Please send in any agenda items you are interested in covering.

thanks,
-chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IO_PAGE_FAULT while booting.

2011-01-17 Thread Joerg Roedel
On Mon, Jan 17, 2011 at 05:08:42PM +, Prasad Joshi wrote:
> >> [   13.132547] AMD-Vi: Event logged [IO_PAGE_FAULT device=06:00.1 
> >> domain=0x address=0xbb402000 flags=0x0050]

Ok, the problem is that the device 06:00.1 is not described in the
ACPI table for the IOMMU driver. So the driver does not feel responsible
for it and does not create mappings; it just blocks all DMA from that
device, which results in the IO_PAGE_FAULT you have seen.
You will not be able to use the IDE controller in DMA mode, so if you
attach any drives to it you are stuck with slow PIO access mode. The
problem has already been seen by another person, who reported the issue to
the board manufacturer to get a new BIOS where this is fixed.

Regards,
Joerg
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IO_PAGE_FAULT while booting.

2011-01-17 Thread Joerg Roedel
On Mon, Jan 17, 2011 at 05:08:42PM +, Prasad Joshi wrote:
> Please find the information you requested attached with this mail.
> Also attaching the output of lspci -vvv -xxx incase it is useful.

Thanks, I'll have a look at it.

> I would be interested in understanding the problem. If possible can
> you please let me know what does this error indicate?

The error indicates that the device in question attempted a DMA transfer to
a location which it is not allowed to access. The most likely reason is
that the IOMMU was set up before the BIOS handed the device over to
the OS, but other reasons are also possible. I'll let you know when I
know more.

Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] device-assignment: Properly terminate vmsd.fields

2011-01-17 Thread Alex Williamson
The vmsd code expects the fields structure to be properly terminated,
not NULL.  An assigned device should never be saved or restored, and
recent qemu fixes to the no_migrate flag should ensure this, but let's
avoid setting the wrong precedent.

Signed-off-by: Alex Williamson 
---

v2:
 - Change to commit log only, was "device-assignment: Add fields to
   VMStateDescription".  Recent qemu.git no_migrate changes avoid
   potential segfault, but this should still be applied for correctness.

 hw/device-assignment.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index e97f565..0925fa8 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1697,7 +1697,10 @@ static void 
assigned_dev_unregister_msix_mmio(AssignedDevice *dev)
 }
 
 static const VMStateDescription vmstate_assigned_device = {
-.name = "pci-assign"
+.name = "pci-assign",
+.fields = (VMStateField []) {
+VMSTATE_END_OF_LIST()
+}
 };
 
 static void reset_assigned_device(DeviceState *dev)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Avi Kivity

On 01/17/2011 05:52 PM, Marcelo Tosatti wrote:

>
>  When a value is written by the guest which kvm cannot handle itself
>  (i.e. a change to anything other than the mask bit), we exit with
>  the table and entry ids, so userspace can reread them.

OK. But regarding access to the MSI-X entry in userspace, it can
only be accessed safely wrt parallel updates by the kernel in the
exit handler.

Is the exit handler the only location where the MSI-X entry will be
read or written to, in userspace?



Yes.  This should be documented, that the kernel owns the msix area as 
far as writes are concerned.



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC -v5 PATCH 2/4] sched: Add yield_to(task, preempt) functionality.

2011-01-17 Thread Srivatsa Vaddagiri
On Fri, Jan 14, 2011 at 01:29:52PM -0500, Rik van Riel wrote:
> >I am not sure whether we are meeting that objective via this patch, as
> >lock-spinning vcpu would simply yield after setting next buddy to preferred
> >vcpu on target pcpu, thereby leaking some amount of bandwidth on the pcpu
> >where it is spinning.
> 
> Have you read the patch?

Sorry, I had misread the patch!

On reviewing it further, I am wondering if we can optimize yield_to() further
for the case when target and current are on the same pcpu, by swapping the
vruntimes of the two tasks (to let the target run in current's place, as we
do in task_fork_fair()).

- vatsa
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Marcelo Tosatti
On Mon, Jan 17, 2011 at 02:51:37PM +0200, Avi Kivity wrote:
> On 01/17/2011 02:48 PM, Marcelo Tosatti wrote:
> >On Mon, Jan 17, 2011 at 02:18:43PM +0200, Avi Kivity wrote:
> >>  On 01/17/2011 02:18 PM, Sheng Yang wrote:
> >>  >>   >   +
> >>  >>   >   +if (copy_to_user((void __user *)(entry_base + offset), 
> >> val, len))
> >>  >>   >   +goto out;
> >>  >>
> >>  >>   Instead of copying to/from userspace (which is subject to swapin,
> >>  >>   unexpected values), you could include the guest written value in a
> >>  >>   kvm_run structure, along with address. Qemu-kvm would use that to
> >>  >>   synchronize its copy of the table, on KVM_EXIT_MSIX_ROUTING_UPDATE 
> >> exit.
> >>  >
> >>  >We want to accelerate MSI-X mask bit accesses, which won't exit to
> >>  >userspace in most cases. That's the cost we want to optimize. Also it's
> >>  >possible for userspace to read the correct value of the MMIO (but mostly
> >>  >userspace can't write to it, in order to prevent synchronization issues).
> >>
> >>  It's also good to have the values in just one place; using userspace
> >>  makes it easy for both the kernel and userspace to see the values
> >>  (and set them after migration, if/when we extend this to virtio).
> >
> >Right, thats an advantage, but:
> >
> >- How can userspace ever synchronize with updates by the kernel
> >   to the MSI-X entry?
> 
> When a value is written by the guest which kvm cannot handle itself
> (i.e. a change to anything other than the mask bit), we exit with
> the table and entry ids, so userspace can reread them.

OK. But regarding access to the MSI-X entry in userspace, it can 
only be accessed safely wrt parallel updates by the kernel in the
exit handler.

Is the exit handler the only location where the MSI-X entry will be
read or written to, in userspace?

> >- Reading/writing to the userspace area must be done carefully,
> >   values must be validated before used.
> 
> True every time...
> 
> >- Swapping issue (minor?).
> 
> I don't see the issue... just like any part of qemu that may be
> swapped out, blocking the vcpu thread.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IO_PAGE_FAULT while booting.

2011-01-17 Thread Joerg Roedel
Hi Prasad,

On Mon, Jan 17, 2011 at 12:30:48PM +, Prasad Joshi wrote:
> Hello,
> 
> I saw this error logged while booting the kernel
> prasad@prasad-kvm:~$ dmesg | grep PAGE
> [   13.132547] AMD-Vi: Event logged [IO_PAGE_FAULT device=06:00.1
> domain=0x address=0xbb402000 flags=0x0050]

Can you please boot with amd_iommu_dump added on the kernel command line
and send the full dmesg, your kernel's .config, and dmidecode output? That
would help in analyzing the problem.

Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Jan Kiszka
On 2011-01-17 13:29, Avi Kivity wrote:
> On 01/06/2011 12:19 PM, Sheng Yang wrote:
>> Then we can support mask bit operation of assigned devices now.
>>
>>
>>
>> +int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
>> +int assigned_dev_id, int entry, bool mask)
>> +{
>> +int r = -EFAULT;
>> +struct kvm_assigned_dev_kernel *adev;
>> +int i;
>> +
>> +if (!irqchip_in_kernel(kvm))
>> +return r;
>> +
>> +mutex_lock(&kvm->lock);
>> +adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
>> +  assigned_dev_id);
>> +if (!adev)
>> +goto out;
>> +
>> +for (i = 0; i<  adev->entries_nr; i++)
>> +if (adev->host_msix_entries[i].entry == entry) {
>> +if (mask)
>> +disable_irq_nosync(
>> +adev->host_msix_entries[i].vector);
> 
> Is it okay to call disable_irq_nosync() here?  IIRC we don't check the
> mask bit on irq delivery, so we may forward an interrupt to the guest
> after the mask bit was set.
> 
> What does pci say about the mask bit?  when does it take effect?
> 
> Another question is whether disable_irq_nosync() actually programs the
> device mask bit, or not.  If it does, then it's slow, and it may be
> better to leave interrupts enabled but have an internal pending bit.  If
> it doesn't program the mask bit, it's fine.

disable_irq* works lazily, only disabling the line in software. If an
IRQ actually occurs, it is marked pending and the line is left masked on
handler return (that's one reason why masking via PCI config space is
slower at macro level).
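
For what it's worth, a toy userspace model of that lazy behaviour (purely
illustrative, not the kernel implementation):

    #include <stdbool.h>
    #include <stdio.h>

    struct irq_line {
            unsigned int depth;     /* software disable depth */
            bool masked;            /* hardware mask state */
            bool pending;           /* IRQ arrived while disabled */
    };

    static void toy_disable_irq_nosync(struct irq_line *l)
    {
            l->depth++;             /* software only, hardware untouched */
    }

    static void toy_hw_interrupt(struct irq_line *l)
    {
            if (l->depth) {         /* disabled: record it, mask the line */
                    l->pending = true;
                    l->masked = true;
                    return;
            }
            printf("deliver interrupt\n");
    }

    static void toy_enable_irq(struct irq_line *l)
    {
            if (--l->depth)
                    return;
            l->masked = false;
            if (l->pending) {       /* replay what arrived while disabled */
                    l->pending = false;
                    printf("deliver pending interrupt\n");
            }
    }

    int main(void)
    {
            struct irq_line l = { 0 };

            toy_disable_irq_nosync(&l);     /* cheap: no hardware access */
            toy_hw_interrupt(&l);           /* line masked, event pending */
            toy_enable_irq(&l);             /* pending event delivered now */
            return 0;
    }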

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 02:29:44PM +0200, Avi Kivity wrote:
> On 01/06/2011 12:19 PM, Sheng Yang wrote:
> >Then we can support mask bit operation of assigned devices now.
> >
> >
> >
> >+int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
> >+int assigned_dev_id, int entry, bool mask)
> >+{
> >+int r = -EFAULT;
> >+struct kvm_assigned_dev_kernel *adev;
> >+int i;
> >+
> >+if (!irqchip_in_kernel(kvm))
> >+return r;
> >+
> >+mutex_lock(&kvm->lock);
> >+adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> >+  assigned_dev_id);
> >+if (!adev)
> >+goto out;
> >+
> >+for (i = 0; i<  adev->entries_nr; i++)
> >+if (adev->host_msix_entries[i].entry == entry) {
> >+if (mask)
> >+disable_irq_nosync(
> >+adev->host_msix_entries[i].vector);
> 
> Is it okay to call disable_irq_nosync() here?  IIRC we don't check
> the mask bit on irq delivery, so we may forward an interrupt to the
> guest after the mask bit was set.
> 
> What does pci say about the mask bit?  when does it take effect?

It is only guaranteed to take effect on the next read.
In practice, it takes effect earlier and guests
rely on this as an optimization.
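
(For readers following the offsets in the patch: an MSI-X table entry is 16
bytes, laid out per the PCI spec roughly as in the sketch below; the mask bit
discussed here is bit 0 of the vector control word. Illustration only, not
kernel code.)

    #include <stdint.h>

    /* One MSI-X table entry (16 bytes, PCI spec layout) */
    struct msix_table_entry {
            uint32_t msg_addr_lo;   /* 0x0: message address, low 32 bits  */
            uint32_t msg_addr_hi;   /* 0x4: message address, high 32 bits */
            uint32_t msg_data;      /* 0x8: message data                  */
            uint32_t vector_ctrl;   /* 0xC: bit 0 is the per-vector mask  */
    };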

> Another question is whether disable_irq_nosync() actually programs
> the device mask bit, or not.  If it does, then it's slow, and it may
> be better to leave interrupts enabled but have an internal pending
> bit.  If it doesn't program the mask bit, it's fine.
> 
> >+else
> >+enable_irq(adev->host_msix_entries[i].vector);
> >+r = 0;
> >+break;
> >+}
> >+out:
> >+mutex_unlock(&kvm->lock);
> >+return r;
> >+}
> >
> >+
> >+static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr, int 
> >len,
> >+void *val)
> >+{
> >+struct kvm_msix_mmio_dev *mmio_dev =
> >+container_of(this, struct kvm_msix_mmio_dev, table_dev);
> >+struct kvm_msix_mmio *mmio;
> >+int idx, ret = 0, entry, offset, r;
> >+
> >+mutex_lock(&mmio_dev->lock);
> >+idx = get_mmio_table_index(mmio_dev, addr, len);
> >+if (idx<  0) {
> >+ret = -EOPNOTSUPP;
> >+goto out;
> >+}
> >+if ((addr&  0x3) || (len != 4&&  len != 8))
> >+goto out;
> >+
> >+offset = addr&  0xf;
> >+if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL&&  len == 8)
> >+goto out;
> >+
> >+mmio =&mmio_dev->mmio[idx];
> >+entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
> >+r = copy_from_user(val, (void __user *)(mmio->table_base_va +
> >+entry * PCI_MSIX_ENTRY_SIZE + offset), len);
> >+if (r)
> >+goto out;
> 
> and return ret == 0?
> 
> >+out:
> >+mutex_unlock(&mmio_dev->lock);
> >+return ret;
> >+}
> >+
> >+static int msix_table_mmio_write(struct kvm_io_device *this, gpa_t addr,
> >+int len, const void *val)
> >+{
> >+struct kvm_msix_mmio_dev *mmio_dev =
> >+container_of(this, struct kvm_msix_mmio_dev, table_dev);
> >+struct kvm_msix_mmio *mmio;
> >+int idx, entry, offset, ret = 0, r = 0;
> >+gpa_t entry_base;
> >+u32 old_ctrl, new_ctrl;
> >+u32 *ctrl_pos;
> >+
> >+mutex_lock(&mmio_dev->lock);
> >+idx = get_mmio_table_index(mmio_dev, addr, len);
> >+if (idx<  0) {
> >+ret = -EOPNOTSUPP;
> >+goto out;
> >+}
> >+if ((addr&  0x3) || (len != 4&&  len != 8))
> >+goto out;
> >+
> >+offset = addr&  0xF;
> >+if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL&&  len == 8)
> >+goto out;
> >+
> >+mmio =&mmio_dev->mmio[idx];
> >+entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
> >+entry_base = mmio->table_base_va + entry * PCI_MSIX_ENTRY_SIZE;
> >+ctrl_pos = (u32 *)(entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL);
> >+
> >+if (get_user(old_ctrl, ctrl_pos))
> >+goto out;
> >+
> >+/* No allow writing to other fields when entry is unmasked */
> >+if (!(old_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT)&&
> >+offset != PCI_MSIX_ENTRY_VECTOR_CTRL)
> >+goto out;
> >+
> >+if (copy_to_user((void __user *)(entry_base + offset), val, len))
> >+goto out;
> >+
> >+if (get_user(new_ctrl, ctrl_pos))
> >+goto out;
> 
> here, too.
> 
> >+
> >+if ((offset<  PCI_MSIX_ENTRY_VECTOR_CTRL&&  len == 4) ||
> >+(offset<  PCI_MSIX_ENTRY_DATA&&  len == 8))
> >+ret = -ENOTSYNC;
> 
> goto out?
> 
> >+if (old_ctrl == new_ctrl)
> >+goto out;
> >+if (!(old_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT)&&
> >+(new_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT))
> >+r = update_msix_mask_bit(mmio_dev->kvm, mmio, entry, 1);
> >+

Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Avi Kivity

On 01/17/2011 02:48 PM, Marcelo Tosatti wrote:

On Mon, Jan 17, 2011 at 02:18:43PM +0200, Avi Kivity wrote:
>  On 01/17/2011 02:18 PM, Sheng Yang wrote:
>  >>   >   +
>  >>   >   +   if (copy_to_user((void __user *)(entry_base + offset), val, 
len))
>  >>   >   +   goto out;
>  >>
>  >>   Instead of copying to/from userspace (which is subject to swapin,
>  >>   unexpected values), you could include the guest written value in a
>  >>   kvm_run structure, along with address. Qemu-kvm would use that to
>  >>   synchronize its copy of the table, on KVM_EXIT_MSIX_ROUTING_UPDATE exit.
>  >
>  >We want to accelerate MSI-X mask bit accesses, which won't exit to
>  >userspace in most cases. That's the cost we want to optimize. Also it's
>  >possible for userspace to read the correct value of the MMIO (but mostly
>  >userspace can't write to it, in order to prevent synchronization issues).
>
>  It's also good to have the values in just one place; using userspace
>  makes it easy for both the kernel and userspace to see the values
>  (and set them after migration, if/when we extend this to virtio).

Right, thats an advantage, but:

- How can userspace ever synchronize with updates by the kernel
   to the MSI-X entry?


When a value is written by the guest which kvm cannot handle itself
(i.e. a change to anything other than the mask bit), we exit with the
table and entry ids, so userspace can reread them.



- Reading/writing to the userspace area must be done carefully,
   values must be validated before used.


True every time...


- Swapping issue (minor?).


I don't see the issue... just like any part of qemu that may be swapped 
out, blocking the vcpu thread.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Marcelo Tosatti
On Mon, Jan 17, 2011 at 02:18:43PM +0200, Avi Kivity wrote:
> On 01/17/2011 02:18 PM, Sheng Yang wrote:
> >>  >  +
> >>  >  +  if (copy_to_user((void __user *)(entry_base + offset), val, 
> >> len))
> >>  >  +  goto out;
> >>
> >>  Instead of copying to/from userspace (which is subject to swapin,
> >>  unexpected values), you could include the guest written value in a
> >>  kvm_run structure, along with address. Qemu-kvm would use that to
> >>  synchronize its copy of the table, on KVM_EXIT_MSIX_ROUTING_UPDATE exit.
> >
> >We want to accelerate MSI-X mask bit accesses, which won't exit to userspace
> >in most cases. That's the cost we want to optimize. Also it's possible for
> >userspace to read the correct value of the MMIO (but mostly userspace can't
> >write to it, in order to prevent synchronization issues).
> 
> It's also good to have the values in just one place; using userspace
> makes it easy for both the kernel and userspace to see the values
> (and set them after migration, if/when we extend this to virtio).

Right, thats an advantage, but:

- How can userspace ever synchronize with updates by the kernel 
  to the MSI-X entry?
- Reading/writing to the userspace area must be done carefully,
  values must be validated before used.
- Swapping issue (minor?).


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API

2011-01-17 Thread Avi Kivity

On 01/17/2011 02:35 PM, Sheng Yang wrote:

On Monday 17 January 2011 20:21:45 Avi Kivity wrote:
>  On 01/06/2011 12:19 PM, Sheng Yang wrote:
>  >  Signed-off-by: Sheng Yang
>  >  ---
>  >
>  >Documentation/kvm/api.txt |   41
>  >+ 1 files changed, 41
>  >insertions(+), 0 deletions(-)
>  >
>  >  diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
>  >  index e1a9297..4978b94 100644
>  >  --- a/Documentation/kvm/api.txt
>  >  +++ b/Documentation/kvm/api.txt
>  >  @@ -1263,6 +1263,47 @@ struct kvm_assigned_msix_entry {
>  >
>  >  __u16 padding[3];
>  >
>  >};
>  >
>  >  +4.54 KVM_REGISTER_MSIX_MMIO
>  >  +
>  >  +Capability: KVM_CAP_MSIX_MMIO
>  >  +Architectures: x86
>  >  +Type: vm ioctl
>  >  +Parameters: struct kvm_msix_mmio_user (in)
>  >  +Returns: 0 on success, -1 on error
>  >  +
>  >  +This API indicates an MSI-X MMIO address of a guest device. Then all
>  >  MMIO +operation would be handled by kernel. When necessary(e.g. MSI
>  >  data/address +changed), KVM would exit to userspace using
>  >  KVM_EXIT_MSIX_ROUTING_UPDATE to +indicate the MMIO modification and
>  >  require userspace to update IRQ routing +table.
>  >  +
>  >  +struct kvm_msix_mmio_user {
>  >  +   __u32 dev_id;
>  >  +   __u16 type; /* Device type and MMIO address type */
>  >  +   __u16 max_entries_nr;   /* Maximum entries supported */
>  >  +   __u64 base_addr;/* Guest physical address of MMIO */
>  >  +   __u64 base_va;  /* Host virtual address of MMIO mapping */
>  >  +   __u64 flags;/* Reserved for now */
>  >  +   __u64 reserved[4];
>  >  +};
>  >  +
>  >  +Current device type can be:
>  >  +#define KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV(1<<   0)
>  >  +
>  >  +Current MMIO type can be:
>  >  +#define KVM_MSIX_MMIO_TYPE_BASE_TABLE  (1<<   8)
>  >  +
>
>  How does userspace know which entry of which table changed?  Need a
>  field in struct kvm_run for that.

We already get the guest MMIO address for that in the exit information. I've
created a chain of handlers in qemu to handle it.



But we already decoded the table and entry...

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Marcelo Tosatti
On Mon, Jan 17, 2011 at 08:18:22PM +0800, Sheng Yang wrote:
> > > + goto out;
> > > +
> > > + mmio = &mmio_dev->mmio[idx];
> > > + entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
> > > + entry_base = mmio->table_base_va + entry * PCI_MSIX_ENTRY_SIZE;
> > > + ctrl_pos = (u32 *)(entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL);
> > > +
> > > + if (get_user(old_ctrl, ctrl_pos))
> > > + goto out;
> > > +
> > > + /* No allow writing to other fields when entry is unmasked */
> > > + if (!(old_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT) &&
> > > + offset != PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > + goto out;
> > > +
> > > + if (copy_to_user((void __user *)(entry_base + offset), val, len))
> > > + goto out;
> > 
> > Instead of copying to/from userspace (which is subject to swapin,
> > unexpected values), you could include the guest written value in a
> > kvm_run structure, along with address. Qemu-kvm would use that to
> > synchronize its copy of the table, on KVM_EXIT_MSIX_ROUTING_UPDATE exit.
> 
> We want to accelerate MSI-X mask bit accesses, which won't exit to userspace
> in most cases. That's the cost we want to optimize. Also it's possible for
> userspace to read the correct value of the MMIO (but mostly userspace can't
> write to it, in order to prevent synchronization issues).

Yes, I meant exit to userspace only when necessary, but instead of
copying directly every time, record the value the guest has written in
kvm_run and exit with KVM_EXIT_MSIX_ROUTING_UPDATE.

> > > + if (get_user(new_ctrl, ctrl_pos))
> > > + goto out;
> > > +
> > > + if ((offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) ||
> > > + (offset < PCI_MSIX_ENTRY_DATA && len == 8))
> > > + ret = -ENOTSYNC;
> > > + if (old_ctrl == new_ctrl)
> > > + goto out;
> > > + if (!(old_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT) &&
> > > + (new_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT))
> > > + r = update_msix_mask_bit(mmio_dev->kvm, mmio, entry, 1);
> > > + else if ((old_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT) &&
> > > + !(new_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT))
> > > + r = update_msix_mask_bit(mmio_dev->kvm, mmio, entry, 0);
> > 
> > Then you rely on the kernel copy of the values to enable/disable irq.
> 
> Yes, they are guest owned assigned IRQs. Any potential issue?

Nothing prevents the kernel from enabling or disabling the irq multiple
times with this code (what prevents it is a qemu-kvm that behaves as
expected). This is not very good.

Perhaps the guest can only harm itself with that, but I'm not sure.

Also, if an msix table page is swapped out, the guest will hang.

> > > + return r;
> > > +}
> > 
> > This is not consistent with registration, where there are no checks
> > regarding validity of assigned device id. So why is it necessary?
> 
> I don't quite understand. We need to free the mmio anyway, otherwise it would
> result in wrong mmio interception...

If you return -EOPNOTSUPP in case kvm_find_assigned_dev fails in the
read/write paths, there is no need to free anything.

> 
> > 
> > There is a lock ordering problem BTW:
> > 
> > - read/write handlers: dev->lock -> kvm->lock
> > - vm_ioctl_deassign_device -> kvm_free_msix_mmio: kvm->lock -> dev->lock
> 
> Good catch! Would fix it(and other comments of course).
> 
> --
> regards
> Yang, Sheng
> 
> > 
> > > +
> > > +int kvm_vm_ioctl_unregister_msix_mmio(struct kvm *kvm,
> > > +   struct kvm_msix_mmio_user *mmio_user)
> > > +{
> > > + struct kvm_msix_mmio mmio;
> > > +
> > > + mmio.dev_id = mmio_user->dev_id;
> > > + mmio.type = mmio_user->type;
> > > +
> > > + return kvm_free_msix_mmio(kvm, &mmio);
> > > +}
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


IO_PAGE_FAULT while booting.

2011-01-17 Thread Prasad Joshi
Hello,

I saw this error logged while booting the kernel
prasad@prasad-kvm:~$ dmesg | grep PAGE
[   13.132547] AMD-Vi: Event logged [IO_PAGE_FAULT device=06:00.1
domain=0x address=0xbb402000 flags=0x0050]

The kernel I have is a cloned copy of the KVM repository.

menuentry 'Ubuntu, with Linux 2.6.37-rc6+' --class ubuntu --class
gnu-linux --class gnu --class os {
recordfail
insmod part_msdos
insmod ext2
set root='(hd0,msdos5)'
search --no-floppy --fs-uuid --set ab9dfe8f-8e58-43ca-b16f-a42e0f02da72
linux   /vmlinuz-2.6.37-rc6+ root=/dev/sda1 ro iommu=pt iommu=1
amd_iommu_dump
initrd  /initrd.img-2.6.37-rc6+
}

prasad@prasad-kvm:~$ uname -a
Linux prasad-kvm 2.6.37-rc6+ #19 SMP Mon Jan 10 19:03:36 GMT 2011
x86_64 GNU/Linux

Output of lspci -v

06:00.0 SATA controller: JMicron Technology Corp. JMB361 AHCI/IDE (rev
02) (prog-if 01 [AHCI 1.0])
Subsystem: ASUSTeK Computer Inc. Device 843f
Flags: bus master, fast devsel, latency 0, IRQ 44
Memory at f9ffe000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [68] Power Management version 2
Capabilities: [50] Express Legacy Endpoint, MSI 01
Kernel driver in use: ahci

06:00.1 IDE interface: JMicron Technology Corp. JMB361 AHCI/IDE (rev
02) (prog-if 85 [Master SecO PriO])
Subsystem: ASUSTeK Computer Inc. Device 843f
Flags: bus master, fast devsel, latency 0, IRQ 45
I/O ports at dc00 [size=8]
I/O ports at d880 [size=4]
I/O ports at d800 [size=8]
I/O ports at d480 [size=4]
I/O ports at d400 [size=16]
Capabilities: [68] Power Management version 2
Kernel driver in use: JMicron IDE

Please let me know if you need more information from my side.

Thanks and Regards,
Prasad
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API

2011-01-17 Thread Sheng Yang
On Monday 17 January 2011 20:21:45 Avi Kivity wrote:
> On 01/06/2011 12:19 PM, Sheng Yang wrote:
> > Signed-off-by: Sheng Yang
> > ---
> > 
> >   Documentation/kvm/api.txt |   41
> >   + 1 files changed, 41
> >   insertions(+), 0 deletions(-)
> > 
> > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > index e1a9297..4978b94 100644
> > --- a/Documentation/kvm/api.txt
> > +++ b/Documentation/kvm/api.txt
> > @@ -1263,6 +1263,47 @@ struct kvm_assigned_msix_entry {
> > 
> > __u16 padding[3];
> >   
> >   };
> > 
> > +4.54 KVM_REGISTER_MSIX_MMIO
> > +
> > +Capability: KVM_CAP_MSIX_MMIO
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_msix_mmio_user (in)
> > +Returns: 0 on success, -1 on error
> > +
> > +This API indicates an MSI-X MMIO address of a guest device. Then all
> > MMIO +operation would be handled by kernel. When necessary(e.g. MSI
> > data/address +changed), KVM would exit to userspace using
> > KVM_EXIT_MSIX_ROUTING_UPDATE to +indicate the MMIO modification and
> > require userspace to update IRQ routing +table.
> > +
> > +struct kvm_msix_mmio_user {
> > +   __u32 dev_id;
> > +   __u16 type; /* Device type and MMIO address type */
> > +   __u16 max_entries_nr;   /* Maximum entries supported */
> > +   __u64 base_addr;/* Guest physical address of MMIO */
> > +   __u64 base_va;  /* Host virtual address of MMIO mapping */
> > +   __u64 flags;/* Reserved for now */
> > +   __u64 reserved[4];
> > +};
> > +
> > +Current device type can be:
> > +#define KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV(1<<  0)
> > +
> > +Current MMIO type can be:
> > +#define KVM_MSIX_MMIO_TYPE_BASE_TABLE  (1<<  8)
> > +
> 
> How does userspace know which entry of which table changed?  Need a
> field in struct kvm_run for that.

We already get the guest MMIO address for that in the exit information. I've
created a chain of handlers in qemu to handle it.

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Avi Kivity

On 01/06/2011 12:19 PM, Sheng Yang wrote:

Then we can support mask bit operation of assigned devices now.



+int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
+   int assigned_dev_id, int entry, bool mask)
+{
+   int r = -EFAULT;
+   struct kvm_assigned_dev_kernel *adev;
+   int i;
+
+   if (!irqchip_in_kernel(kvm))
+   return r;
+
+   mutex_lock(&kvm->lock);
+   adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ assigned_dev_id);
+   if (!adev)
+   goto out;
+
+   for (i = 0; i<  adev->entries_nr; i++)
+   if (adev->host_msix_entries[i].entry == entry) {
+   if (mask)
+   disable_irq_nosync(
+   adev->host_msix_entries[i].vector);


Is it okay to call disable_irq_nosync() here?  IIRC we don't check the 
mask bit on irq delivery, so we may forward an interrupt to the guest 
after the mask bit was set.


What does pci say about the mask bit?  when does it take effect?

Another question is whether disable_irq_nosync() actually programs the 
device mask bit, or not.  If it does, then it's slow, and it may be 
better to leave interrupts enabled but have an internal pending bit.  If 
it doesn't program the mask bit, it's fine.



+   else
+   enable_irq(adev->host_msix_entries[i].vector);
+   r = 0;
+   break;
+   }
+out:
+   mutex_unlock(&kvm->lock);
+   return r;
+}

+
+static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr, int 
len,
+   void *val)
+{
+   struct kvm_msix_mmio_dev *mmio_dev =
+   container_of(this, struct kvm_msix_mmio_dev, table_dev);
+   struct kvm_msix_mmio *mmio;
+   int idx, ret = 0, entry, offset, r;
+
+   mutex_lock(&mmio_dev->lock);
+   idx = get_mmio_table_index(mmio_dev, addr, len);
+   if (idx<  0) {
+   ret = -EOPNOTSUPP;
+   goto out;
+   }
+   if ((addr&  0x3) || (len != 4&&  len != 8))
+   goto out;
+
+   offset = addr&  0xf;
+   if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL&&  len == 8)
+   goto out;
+
+   mmio =&mmio_dev->mmio[idx];
+   entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+   r = copy_from_user(val, (void __user *)(mmio->table_base_va +
+   entry * PCI_MSIX_ENTRY_SIZE + offset), len);
+   if (r)
+   goto out;


and return ret == 0?


+out:
+   mutex_unlock(&mmio_dev->lock);
+   return ret;
+}
+
+static int msix_table_mmio_write(struct kvm_io_device *this, gpa_t addr,
+   int len, const void *val)
+{
+   struct kvm_msix_mmio_dev *mmio_dev =
+   container_of(this, struct kvm_msix_mmio_dev, table_dev);
+   struct kvm_msix_mmio *mmio;
+   int idx, entry, offset, ret = 0, r = 0;
+   gpa_t entry_base;
+   u32 old_ctrl, new_ctrl;
+   u32 *ctrl_pos;
+
+   mutex_lock(&mmio_dev->lock);
+   idx = get_mmio_table_index(mmio_dev, addr, len);
+   if (idx<  0) {
+   ret = -EOPNOTSUPP;
+   goto out;
+   }
+   if ((addr&  0x3) || (len != 4&&  len != 8))
+   goto out;
+
+   offset = addr&  0xF;
+   if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL&&  len == 8)
+   goto out;
+
+   mmio =&mmio_dev->mmio[idx];
+   entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+   entry_base = mmio->table_base_va + entry * PCI_MSIX_ENTRY_SIZE;
+   ctrl_pos = (u32 *)(entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+   if (get_user(old_ctrl, ctrl_pos))
+   goto out;
+
+   /* No allow writing to other fields when entry is unmasked */
+   if (!(old_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT)&&
+   offset != PCI_MSIX_ENTRY_VECTOR_CTRL)
+   goto out;
+
+   if (copy_to_user((void __user *)(entry_base + offset), val, len))
+   goto out;
+
+   if (get_user(new_ctrl, ctrl_pos))
+   goto out;


here, too.


+
+   if ((offset<  PCI_MSIX_ENTRY_VECTOR_CTRL&&  len == 4) ||
+   (offset<  PCI_MSIX_ENTRY_DATA&&  len == 8))
+   ret = -ENOTSYNC;


goto out?


+   if (old_ctrl == new_ctrl)
+   goto out;
+   if (!(old_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT)&&
+   (new_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT))
+   r = update_msix_mask_bit(mmio_dev->kvm, mmio, entry, 1);
+   else if ((old_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT)&&
+   !(new_ctrl&  PCI_MSIX_ENTRY_CTRL_MASKBIT))
+   r = update_msix_mask_bit(mmio_dev->kvm, mmio, entry, 0);
+   if (r || ret)
+   ret = -ENOTSYNC;
+out:
+   mutex_unlock(&mmio_dev->lock);
+   return ret;
+}
+


Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API

2011-01-17 Thread Avi Kivity

On 01/06/2011 12:19 PM, Sheng Yang wrote:

Signed-off-by: Sheng Yang
---
  Documentation/kvm/api.txt |   41 +
  1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..4978b94 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1263,6 +1263,47 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
  };

+4.54 KVM_REGISTER_MSIX_MMIO
+
+Capability: KVM_CAP_MSIX_MMIO
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_msix_mmio_user (in)
+Returns: 0 on success, -1 on error
+
+This API indicates an MSI-X MMIO address of a guest device. Then all MMIO
+operation would be handled by kernel. When necessary(e.g. MSI data/address
+changed), KVM would exit to userspace using KVM_EXIT_MSIX_ROUTING_UPDATE to
+indicate the MMIO modification and require userspace to update IRQ routing
+table.
+
+struct kvm_msix_mmio_user {
+   __u32 dev_id;
+   __u16 type; /* Device type and MMIO address type */
+   __u16 max_entries_nr;   /* Maximum entries supported */
+   __u64 base_addr;/* Guest physical address of MMIO */
+   __u64 base_va;  /* Host virtual address of MMIO mapping */
+   __u64 flags;/* Reserved for now */
+   __u64 reserved[4];
+};
+
+Current device type can be:
+#define KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV    (1 << 0)
+
+Current MMIO type can be:
+#define KVM_MSIX_MMIO_TYPE_BASE_TABLE      (1 << 8)
+


How does userspace know which entry of which table changed?  Need a 
field in struct kvm_run for that.
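For illustration, a minimal sketch of how userspace might register a table via
this proposed (never merged) ioctl, assuming only the struct, type flags and
ioctl name quoted above; the helper name and its error handling are hypothetical:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper: register the MSI-X table of an assigned device.
 * vm_fd is the KVM VM file descriptor, table_gpa the guest physical address
 * of the table, table_hva a host-virtual mapping of the same table. */
static int register_msix_table(int vm_fd, __u32 dev_id, __u16 entries,
                               __u64 table_gpa, void *table_hva)
{
        struct kvm_msix_mmio_user mmio;

        memset(&mmio, 0, sizeof(mmio));
        mmio.dev_id = dev_id;
        mmio.type = KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV |
                    KVM_MSIX_MMIO_TYPE_BASE_TABLE;
        mmio.max_entries_nr = entries;
        mmio.base_addr = table_gpa;                      /* guest physical */
        mmio.base_va = (__u64)(unsigned long)table_hva;  /* host virtual   */

        /* 0 on success, -1 on error, as documented above */
        return ioctl(vm_fd, KVM_REGISTER_MSIX_MMIO, &mmio);
}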


--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Avi Kivity

On 01/17/2011 02:18 PM, Sheng Yang wrote:

>  >  +
>  >  +   if (copy_to_user((void __user *)(entry_base + offset), val, len))
>  >  +   goto out;
>
>  Instead of copying to/from userspace (which is subject to swapin,
>  unexpected values), you could include the guest written value in a
>  kvm_run structure, along with address. Qemu-kvm would use that to
>  synchronize its copy of the table, on KVM_EXIT_MSIX_ROUTING_UPDATE exit.

We want to accelerate MSI-X mask bit accesses, which in most cases won't exit to
userspace; that's the cost we want to optimize. Also, it's still possible for
userspace to read the correct value of the MMIO region (but userspace mostly
can't write to it, in order to prevent synchronization issues).


It's also good to have the values in just one place; using userspace 
makes it easy for both the kernel and userspace to see the values (and 
set them after migration, if/when we extend this to virtio).
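To make the suggestion concrete, a purely hypothetical shape for such an exit
payload (nothing like this exists in the posted patches; every field name below
is invented for illustration):

/* Hypothetical kvm_run member reporting which entry of which table the
 * guest wrote, plus the written value, on KVM_EXIT_MSIX_ROUTING_UPDATE. */
struct kvm_msix_routing_update {
        __u32 dev_id;   /* device whose MSI-X table was written     */
        __u16 type;     /* KVM_MSIX_MMIO_TYPE_* of that table       */
        __u16 entry;    /* index of the table entry that changed    */
        __u32 offset;   /* field within the entry (0x0/0x4/0x8/0xc) */
        __u32 len;      /* access size (4 or 8)                     */
        __u64 value;    /* value the guest wrote                    */
};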


--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Sheng Yang
On Monday 17 January 2011 19:54:47 Marcelo Tosatti wrote:
> On Thu, Jan 06, 2011 at 06:19:44PM +0800, Sheng Yang wrote:
> > Then we can support mask bit operation of assigned devices now.
> > 
> > Signed-off-by: Sheng Yang 
> > ---
> > 
> >  arch/x86/kvm/Makefile|2 +-
> >  arch/x86/kvm/x86.c   |8 +-
> >  include/linux/kvm.h  |   21 
> >  include/linux/kvm_host.h |   25 
> >  virt/kvm/assigned-dev.c  |   44 +++
> >  virt/kvm/kvm_main.c  |   38 ++-
> >  virt/kvm/msix_mmio.c |  284
> >  ++ virt/kvm/msix_mmio.h
> >  |   25 
> >  8 files changed, 440 insertions(+), 7 deletions(-)
> >  create mode 100644 virt/kvm/msix_mmio.c
> >  create mode 100644 virt/kvm/msix_mmio.h
> > 
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index ae72ae6..7444dcd 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -18,6 +18,7 @@
> > 
> >  #include 
> >  #include 
> >  #include "irq.h"
> > 
> > +#include "msix_mmio.h"
> > 
> >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> >  list_head *head,
> >  
> >   int assigned_dev_id)
> > 
> > @@ -191,12 +192,25 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
> > 
> > kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
> >  
> >  }
> > 
> > +static void assigned_device_free_msix_mmio(struct kvm *kvm,
> > +   struct kvm_assigned_dev_kernel *adev)
> > +{
> > +   struct kvm_msix_mmio mmio;
> > +
> > +   mmio.dev_id = adev->assigned_dev_id;
> > +   mmio.type = KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV |
> > +   KVM_MSIX_MMIO_TYPE_BASE_TABLE;
> > +   kvm_free_msix_mmio(kvm, &mmio);
> > +}
> > +
> > 
> >  static void kvm_free_assigned_device(struct kvm *kvm,
> >  
> >  struct kvm_assigned_dev_kernel
> >  *assigned_dev)
> >  
> >  {
> >  
> > kvm_free_assigned_irq(kvm, assigned_dev);
> > 
> > +   assigned_device_free_msix_mmio(kvm, assigned_dev);
> > +
> > 
> > __pci_reset_function(assigned_dev->dev);
> > pci_restore_state(assigned_dev->dev);
> > 
> > @@ -785,3 +799,33 @@ out:
> > return r;
> >  
> >  }
> > 
> > +int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
> > +   int assigned_dev_id, int entry, bool mask)
> > +{
> > +   int r = -EFAULT;
> > +   struct kvm_assigned_dev_kernel *adev;
> > +   int i;
> > +
> > +   if (!irqchip_in_kernel(kvm))
> > +   return r;
> > +
> > +   mutex_lock(&kvm->lock);
> > +   adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > + assigned_dev_id);
> > +   if (!adev)
> > +   goto out;
> > +
> > +   for (i = 0; i < adev->entries_nr; i++)
> > +   if (adev->host_msix_entries[i].entry == entry) {
> > +   if (mask)
> > +   disable_irq_nosync(
> > +   adev->host_msix_entries[i].vector);
> > +   else
> > +   enable_irq(adev->host_msix_entries[i].vector);
> > +   r = 0;
> > +   break;
> > +   }
> 
> Should check if the host irq is registered as MSIX type.
> 
> > +out:
> > +   mutex_unlock(&kvm->lock);
> > +   return r;
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b1b6cbb..b7807c8 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -56,6 +56,7 @@
> > 
> >  #include "coalesced_mmio.h"
> >  #include "async_pf.h"
> > 
> > +#include "msix_mmio.h"
> > 
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > 
> > @@ -509,6 +510,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > 
> > struct mm_struct *mm = kvm->mm;
> > 
> > kvm_arch_sync_events(kvm);
> > 
> > +   kvm_unregister_msix_mmio_dev(kvm);
> > 
> > spin_lock(&kvm_lock);
> > list_del(&kvm->vm_list);
> > spin_unlock(&kvm_lock);
> > 
> > @@ -1877,6 +1879,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > 
> > mutex_unlock(&kvm->lock);
> > break;
> >  
> >  #endif
> > 
> > +   case KVM_REGISTER_MSIX_MMIO: {
> > +   struct kvm_msix_mmio_user mmio_user;
> > +
> > +   r = -EFAULT;
> > +   if (copy_from_user(&mmio_user, argp, sizeof mmio_user))
> > +   goto out;
> > +   r = kvm_vm_ioctl_register_msix_mmio(kvm, &mmio_user);
> > +   break;
> > +   }
> > +   case KVM_UNREGISTER_MSIX_MMIO: {
> > +   struct kvm_msix_mmio_user mmio_user;
> > +
> > +   r = -EFAULT;
> > +   if (copy_from_user(&mmio_user, argp, sizeof mmio_user))
> > +   goto out;
> > +   r = kvm_vm_ioctl_unregister_msix_mmio(kvm, &mmio_user);
> > +   break;
> > +   }
> > 
> > default:
> > r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > if (r 

Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel

2011-01-17 Thread Marcelo Tosatti
On Thu, Jan 06, 2011 at 06:19:44PM +0800, Sheng Yang wrote:
> Then we can support mask bit operation of assigned devices now.
> 
> Signed-off-by: Sheng Yang 
> ---
>  arch/x86/kvm/Makefile|2 +-
>  arch/x86/kvm/x86.c   |8 +-
>  include/linux/kvm.h  |   21 
>  include/linux/kvm_host.h |   25 
>  virt/kvm/assigned-dev.c  |   44 +++
>  virt/kvm/kvm_main.c  |   38 ++-
>  virt/kvm/msix_mmio.c |  284 
> ++
>  virt/kvm/msix_mmio.h |   25 
>  8 files changed, 440 insertions(+), 7 deletions(-)
>  create mode 100644 virt/kvm/msix_mmio.c
>  create mode 100644 virt/kvm/msix_mmio.h
> 
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index ae72ae6..7444dcd 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include "irq.h"
> +#include "msix_mmio.h"
>  
>  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct 
> list_head *head,
> int assigned_dev_id)
> @@ -191,12 +192,25 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
>   kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
>  }
>  
> +static void assigned_device_free_msix_mmio(struct kvm *kvm,
> + struct kvm_assigned_dev_kernel *adev)
> +{
> + struct kvm_msix_mmio mmio;
> +
> + mmio.dev_id = adev->assigned_dev_id;
> + mmio.type = KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV |
> + KVM_MSIX_MMIO_TYPE_BASE_TABLE;
> + kvm_free_msix_mmio(kvm, &mmio);
> +}
> +
>  static void kvm_free_assigned_device(struct kvm *kvm,
>struct kvm_assigned_dev_kernel
>*assigned_dev)
>  {
>   kvm_free_assigned_irq(kvm, assigned_dev);
>  
> + assigned_device_free_msix_mmio(kvm, assigned_dev);
> +
>   __pci_reset_function(assigned_dev->dev);
>   pci_restore_state(assigned_dev->dev);
>  
> @@ -785,3 +799,33 @@ out:
>   return r;
>  }
>  
> +int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
> + int assigned_dev_id, int entry, bool mask)
> +{
> + int r = -EFAULT;
> + struct kvm_assigned_dev_kernel *adev;
> + int i;
> +
> + if (!irqchip_in_kernel(kvm))
> + return r;
> +
> + mutex_lock(&kvm->lock);
> + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> +   assigned_dev_id);
> + if (!adev)
> + goto out;
> +
> + for (i = 0; i < adev->entries_nr; i++)
> + if (adev->host_msix_entries[i].entry == entry) {
> + if (mask)
> + disable_irq_nosync(
> + adev->host_msix_entries[i].vector);
> + else
> + enable_irq(adev->host_msix_entries[i].vector);
> + r = 0;
> + break;
> + }

Should check if the host irq is registered as MSIX type.

> +out:
> + mutex_unlock(&kvm->lock);
> + return r;
> +}
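A sketch of the check being asked for might look like the following, using the
irq_requested_type field and the KVM_DEV_IRQ_HOST_MSIX flag already used
elsewhere in virt/kvm/assigned-dev.c (this is not part of the posted patch):

        /* Bail out unless the host side of this assigned device is
         * actually set up for MSI-X before touching per-vector irqs. */
        if (!(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
                goto out;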
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b1b6cbb..b7807c8 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -56,6 +56,7 @@
>  
>  #include "coalesced_mmio.h"
>  #include "async_pf.h"
> +#include "msix_mmio.h"
>  
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -509,6 +510,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>   struct mm_struct *mm = kvm->mm;
>  
>   kvm_arch_sync_events(kvm);
> + kvm_unregister_msix_mmio_dev(kvm);
>   spin_lock(&kvm_lock);
>   list_del(&kvm->vm_list);
>   spin_unlock(&kvm_lock);
> @@ -1877,6 +1879,24 @@ static long kvm_vm_ioctl(struct file *filp,
>   mutex_unlock(&kvm->lock);
>   break;
>  #endif
> + case KVM_REGISTER_MSIX_MMIO: {
> + struct kvm_msix_mmio_user mmio_user;
> +
> + r = -EFAULT;
> + if (copy_from_user(&mmio_user, argp, sizeof mmio_user))
> + goto out;
> + r = kvm_vm_ioctl_register_msix_mmio(kvm, &mmio_user);
> + break;
> + }
> + case KVM_UNREGISTER_MSIX_MMIO: {
> + struct kvm_msix_mmio_user mmio_user;
> +
> + r = -EFAULT;
> + if (copy_from_user(&mmio_user, argp, sizeof mmio_user))
> + goto out;
> + r = kvm_vm_ioctl_unregister_msix_mmio(kvm, &mmio_user);
> + break;
> + }
>   default:
>   r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>   if (r == -ENOTTY)
> @@ -1988,6 +2008,12 @@ static int kvm_dev_ioctl_create_vm(void)
>   return r;
>   }
>  #endif
> + r = kvm_register_msix_mmio_dev(kvm);
> + if (r < 0) {
> + kvm_put_kvm(kvm);
> + return r;
> + }
> +
>   r = anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_

Re: [PATCH 0/2 v2] perf-kvm support for SVM

2011-01-17 Thread Roedel, Joerg
On Sun, Jan 16, 2011 at 10:38:11AM -0500, Avi Kivity wrote:
> On 01/16/2011 05:35 PM, Joerg Roedel wrote:
> > On Sun, Jan 16, 2011 at 12:49:41PM +0200, Avi Kivity wrote:
> > >  On 01/14/2011 05:45 PM, Joerg Roedel wrote:
> >
> > >>  here is the reworked version of the patch-set. Only patch 1/2 has
> > >>  changed and now contains the real fix for the crashes that were seen and
> > >>  has an updated log message.
> > >>
> > >
> > >  Thanks, applied.  2.6.37 and earlier aren't affected, yes?  So I'm
> > >  queuing it for 2.6.38 only.
> >
> > I think the problem is there since KVM has lazy state switching. So the
> > fix in patch 1 should make it in all currently maintained stable-trees.
> >
> 
> The problem is with load_gs_index(), yes?  In 2.6.37 this is called 
> before stgi(), so it's protected from nmi.

Ok, you are right :) So the fix is only necessary for 2.6.38.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: Flow Control and Port Mirroring Revisited

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 10:26:25AM +1030, Rusty Russell wrote:
> On Mon, 17 Jan 2011 09:07:30 am Simon Horman wrote:
> 
> [snip]
> 
> I've been away, but what concerns me is that socket buffer limits are
> bypassed in various configurations, due to skb cloning.  We should probably
> drop such limits altogether, or fix them to be consistent.

Further, it looks like when the limits are not bypassed, they
easily result in deadlocks. For example, with
multiple tun devices attached to a single bridge in host,
if a number of these have their queues blocked,
others will reach the socket buffer limit and
traffic on the bridge will get blocked altogether.

It might be better to drop the limits altogether
unless we can fix them. Happily, as the limits are off by
default, doing so does not require kernel changes.

> Simple fix is as someone suggested here, to attach the clone.  That might
> seriously reduce your sk limit, though.  I haven't thought about it hard,
> but might it make sense to move ownership into skb_shared_info; ie. the
> data, rather than the skb head?
> 
> Cheers,
> Rusty.

tracking data ownership might benefit others such as various zero-copy
strategies. It might need to be done per-page, though, not per-skb.

-- 
MST


Re: Flow Control and Port Mirroring Revisited

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 07:37:30AM +0900, Simon Horman wrote:
> On Fri, Jan 14, 2011 at 08:54:15AM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
> > > On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> > > > On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > > > > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > > > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman  
> > > > > > wrote:
> > > > > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > > > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > > > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > > > > >> >
> > > > > > >> > [ snip ]
> > > > > > >> > >
> > > > > > >> > > I know that everyone likes a nice netperf result but I agree 
> > > > > > >> > > with
> > > > > > >> > > Michael that this probably isn't the right question to be 
> > > > > > >> > > asking.  I
> > > > > > >> > > don't think that socket buffers are a real solution to the 
> > > > > > >> > > flow
> > > > > > >> > > control problem: they happen to provide that functionality 
> > > > > > >> > > but it's
> > > > > > >> > > more of a side effect than anything.  It's just that the 
> > > > > > >> > > amount of
> > > > > > >> > > memory consumed by packets in the queue(s) doesn't really 
> > > > > > >> > > have any
> > > > > > >> > > implicit meaning for flow control (think multiple physical 
> > > > > > >> > > adapters,
> > > > > > >> > > all with the same speed instead of a virtual device and a 
> > > > > > >> > > physical
> > > > > > >> > > device with wildly different speeds).  The analog in the 
> > > > > > >> > > physical
> > > > > > >> > > world that you're looking for would be Ethernet flow control.
> > > > > > >> > > Obviously, if the question is limiting CPU or memory 
> > > > > > >> > > consumption then
> > > > > > >> > > that's a different story.
> > > > > > >> >
> > > > > > >> > Point taken. I will see if I can control CPU (and thus memory) 
> > > > > > >> > consumption
> > > > > > >> > using cgroups and/or tc.
> > > > > > >>
> > > > > > >> I have found that I can successfully control the throughput using
> > > > > > >> the following techniques
> > > > > > >>
> > > > > > >> 1) Place a tc egress filter on dummy0
> > > > > > >>
> > > > > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and 
> > > > > > >> then eth1,
> > > > > > >>    this is effectively the same as one of my hacks to the 
> > > > > > >> datapath
> > > > > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > > > > >>    "paces" the connection.
> > > > 
> > > > This is actually a bug. This means that one slow connection will affect
> > > > fast ones. I intend to change the default for qemu to sndbuf=0 : this
> > > > will fix it but break your "pacing". So pls do not count on this
> > > > behaviour.
> > > 
> > > Do you have a patch I could test?
> > 
> > You can (and users already can) just run qemu with sndbuf=0. But if you
> > like, below.
> 
> Thanks
> 
> > > > > > > Further to this, I wonder if there is any interest in providing
> > > > > > > a method to switch the action order - using ovs-ofctl is a hack 
> > > > > > > imho -
> > > > > > > and/or switching the default action order for mirroring.
> > > > > > 
> > > > > > I'm not sure that there is a way to do this that is correct in the
> > > > > > generic case.  It's possible that the destination could be a VM 
> > > > > > while
> > > > > > packets are being mirrored to a physical device or we could be
> > > > > > multicasting or some other arbitrarily complex scenario.  Just think
> > > > > > of what a physical switch would do if it has ports with two 
> > > > > > different
> > > > > > speeds.
> > > > > 
> > > > > Yes, I have considered that case. And I agree that perhaps there
> > > > > is no sensible default. But perhaps we could make it configurable 
> > > > > somehow?
> > > > 
> > > > The fix is at the application level. Run netperf with -b and -w flags to
> > > > limit the speed to a sensible value.
> > > 
> > > Perhaps I should have stated my goals more clearly.
> > > I'm interested in situations where I don't control the application.
> > 
> > Well an application that streams UDP without any throttling
> > at the application level will break on a physical network, right?
> > So I am not sure why should one try to make it work on the virtual one.
> > 
> > But let's assume that you do want to throttle the guest
> > for reasons such as QOS. The proper approach seems
> > to be to throttle the sender, not have a dummy throttled
> > receiver "pacing" it. Place the qemu process in the
> > correct net_cls cgroup, set the class id and apply a rate limit?
> 
> I would like to be able to use a class to rate limit egress packets.
> That much works fine for me.
> 
> What I would also like is for there to be back-pressure such that the guest
> doesn't consume lots of 

[PATCH] KVM Test: Introduce qmp_basic test for RHEL6 host

2011-01-17 Thread Qingtang Zhou
The QMP feature provided with RHEL6 differs from QMP in qemu-kvm 0.13.*, so
introduce a new test for RHEL6.

Signed-off-by: Qingtang Zhou 
---
 client/tests/kvm/tests/qmp_basic_rhel6.py |  386 +
 client/tests/kvm/tests_base.cfg.sample|3 +
 2 files changed, 389 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/qmp_basic_rhel6.py

diff --git a/client/tests/kvm/tests/qmp_basic_rhel6.py b/client/tests/kvm/tests/qmp_basic_rhel6.py
new file mode 100644
index 000..1b780f5
--- /dev/null
+++ b/client/tests/kvm/tests/qmp_basic_rhel6.py
@@ -0,0 +1,386 @@
+import logging
+import kvm_test_utils, kvm_monitor
+from autotest_lib.client.common_lib import error
+
+def run_qmp_basic(test, params, env):
+    """
+    QMP Specification test-suite: this checks if the *basic* protocol conforms
+    to its specification, which is file QMP/qmp-spec.txt in QEMU's source tree.
+
+    IMPORTANT NOTES:
+
+    o Most tests depend heavily on QMP's error information (eg. classes),
+      this might have bad implications as the error interface is going to
+      change in QMP
+
+    o Command testing is *not* covered in this suite. Each command has its
+      own specification and should be tested separately
+
+    o We use the same terminology as used by the QMP specification,
+      specially with regard to JSON types (eg. a Python dict is called
+      a json-object)
+
+    o This is divided in sub test-suites, please check the bottom of this
+      file to check the order in which they are run
+
+    TODO:
+
+    o Finding which test failed is not as easy as it should be
+
+    o Are all those check_*() functions really needed? Wouldn't a
+      specialized class (eg. a Response class) do better?
+    """
+    def fail_no_key(qmp_dict, key):
+        if not isinstance(qmp_dict, dict):
+            raise error.TestFail("qmp_dict is not a dict (it's '%s')" %
+                                 type(qmp_dict))
+        if not key in qmp_dict:
+            raise error.TestFail("'%s' key doesn't exist in dict ('%s')" %
+                                 (key, str(qmp_dict)))
+
+
+    def check_dict_key(qmp_dict, key, keytype):
+        """
+        Performs the following checks on a QMP dict key:
+
+        1. qmp_dict is a dict
+        2. key exists in qmp_dict
+        3. key is of type keytype
+
+        If any of these checks fails, error.TestFail is raised.
+        """
+        fail_no_key(qmp_dict, key)
+        if not isinstance(qmp_dict[key], keytype):
+            raise error.TestFail("'%s' key is not of type '%s', it's '%s'" %
+                                 (key, keytype, type(qmp_dict[key])))
+
+
+    def check_key_is_dict(qmp_dict, key):
+        check_dict_key(qmp_dict, key, dict)
+
+
+    def check_key_is_list(qmp_dict, key):
+        check_dict_key(qmp_dict, key, list)
+
+
+    def check_key_is_str(qmp_dict, key):
+        check_dict_key(qmp_dict, key, unicode)
+
+
+    def check_str_key(qmp_dict, keyname, value=None):
+        check_dict_key(qmp_dict, keyname, unicode)
+        if value and value != qmp_dict[keyname]:
+            raise error.TestFail("'%s' key value '%s' should be '%s'" %
+                                 (keyname, str(qmp_dict[keyname]), str(value)))
+
+
+    def check_key_is_int(qmp_dict, key):
+        fail_no_key(qmp_dict, key)
+        try:
+            int(qmp_dict[key])
+        except:
+            raise error.TestFail("'%s' key is not of type int, it's '%s'" %
+                                 (key, type(qmp_dict[key])))
+
+
+    def check_bool_key(qmp_dict, keyname, value=None):
+        check_dict_key(qmp_dict, keyname, bool)
+        if value and value != qmp_dict[keyname]:
+            raise error.TestFail("'%s' key value '%s' should be '%s'" %
+                                 (keyname, str(qmp_dict[keyname]), str(value)))
+
+
+    def check_success_resp(resp, empty=False):
+        """
+        Check QMP OK response.
+
+        @param resp: QMP response
+        @param empty: if True, response should not contain data to return
+        """
+        check_key_is_dict(resp, "return")
+        if empty and len(resp["return"]) > 0:
+            raise error.TestFail("success response is not empty ('%s')" %
+                                 str(resp))
+
+
+    def check_error_resp(resp, classname=None, datadict=None):
+        """
+        Check QMP error response.
+
+        @param resp: QMP response
+        @param classname: Expected error class name
+        @param datadict: Expected error data dictionary
+        """
+        logging.debug("resp %s", str(resp))
+        check_key_is_dict(resp, "error")
+        check_key_is_str(resp["error"], "class")
+        if classname and resp["error"]["class"] != classname:
+            raise error.TestFail("got error class '%s' expected '%s'" %
+                                 (resp["error"]["class"], classname))
+        check_key_is_dict(resp["error"]

Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 04:11:17PM +0800, Jason Wang wrote:
> We can use lock_sock_fast() instead of lock_sock() in order to get
> speedup in peek_head_len().
> 
> Signed-off-by: Jason Wang 

Queued for 2.6.39, thanks everyone.

> ---
>  drivers/vhost/net.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c32a2e4..50b622a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -211,12 +211,12 @@ static int peek_head_len(struct sock *sk)
>  {
>   struct sk_buff *head;
>   int len = 0;
> + bool slow = lock_sock_fast(sk);
>  
> - lock_sock(sk);
>   head = skb_peek(&sk->sk_receive_queue);
>   if (head)
>   len = head->len;
> - release_sock(sk);
> + unlock_sock_fast(sk, slow);
>   return len;
>  }
>  


Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-17 Thread Jan Kiszka
On 2011-01-17 03:08, Huang Ying wrote:
 As indicated, I'm sitting on lots of fixes and refactorings of the MCE
 user space code. How do you test your patches? Any suggestions how to do
 this efficiently would be warmly welcome.
>>>
>>> We use a self-made test script to test.  Repository is at:
>>>
>>> git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
>>>
>>> The kvm test script is in kvm sub-directory.
>>>
>>> The qemu patch attached is need by the test script.
>>>
>>
>> Yeah, I already found this yesterday and started reading. I was just
>> searching for p2v in qemu, but now it's clear where it comes from. Will
>> have a look (if you want to preview my changes:
>> git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream).
>>
>> I was almost about to use MADV_HWPOISON instead of the injection module.
>> Is there a way to recover the fake corruption afterward? I think that
>> would allow to move some of the test logic into qemu and avoid p2v which
>> - IIRC - was disliked upstream.
> 
> I don't know how to fully recover from  MADV_HWPOISON.  You can recover
> the virtual address space via qemu_ram_remap() introduced in 1/2 of this
> patchset.  But you will lose one or several physical pages for each
> testing.  I think that may be not a big issue for a testing machine.
> 
> Ccing Andi and Fengguang, they know more than me about MADV_HWPOISON.

"page-types -b hwpoison -x" does the trick of unpoisoning for me. It can
be found at linux/Documentation/vm/page-types.c. So it's quite easy to
set up and clean up a test case based on MADV_HWPOISON IMO. Not sure,
though, if that can simulate all of what you currently do via mce-inject.

> 
>> Also, is there a way to simulate corrected errors (BUS_MCEERR_AO)?
> 
> BUS_MCEERR_AO is recoverable uncorrected error instead of corrected
> error.
> 
> The test script is for BUS_MCEERR_AO and BUS_MCEERR_AR.  To see the
> effect of pure BUS_MCEERR_AO, just remove the memory accessing loop
> (memset) in tools/simple_process/simple_process.c.

Yeah, that question was based on lacking knowledge about the different
error types. Meanwhile, I was able to trigger BUS_MCEERR_AO via
MADV_HWPOISON - and also BUS_MCEERR_AR by accessing that page. However,
I did not succeed with using mce-inject so far, thus with mce-test. But
I need to check this again.
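A minimal sketch of that MADV_HWPOISON experiment, assuming a kernel with
CONFIG_MEMORY_FAILURE and enough privileges (whether the asynchronous
BUS_MCEERR_AO signal is delivered before the page is touched can also depend
on the vm.memory_failure_early_kill setting):

/* Poison one anonymous page and report the SIGBUS codes we get back. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
        printf("SIGBUS: si_code=%d (%s), addr=%p\n", si->si_code,
               si->si_code == BUS_MCEERR_AO ? "BUS_MCEERR_AO" :
               si->si_code == BUS_MCEERR_AR ? "BUS_MCEERR_AR" : "other",
               si->si_addr);
        if (si->si_code == BUS_MCEERR_AR)
                _exit(0);               /* don't re-fault forever */
}

int main(void)
{
        long psize = sysconf(_SC_PAGESIZE);
        struct sigaction sa;
        char *page;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);

        page = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        page[0] = 1;                              /* make sure it is backed */
        if (madvise(page, psize, MADV_HWPOISON))  /* fake the corruption    */
                perror("madvise(MADV_HWPOISON)");
        page[0] = 2;          /* touching the poisoned page -> BUS_MCEERR_AR */
        return 0;
}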

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()

2011-01-17 Thread Eric Dumazet
Le lundi 17 janvier 2011 à 16:11 +0800, Jason Wang a écrit :
> We can use lock_sock_fast() instead of lock_sock() in order to get
> speedup in peek_head_len().
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/net.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c32a2e4..50b622a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -211,12 +211,12 @@ static int peek_head_len(struct sock *sk)
>  {
>   struct sk_buff *head;
>   int len = 0;
> + bool slow = lock_sock_fast(sk);
>  
> - lock_sock(sk);
>   head = skb_peek(&sk->sk_receive_queue);
>   if (head)
>   len = head->len;
> - release_sock(sk);
> + unlock_sock_fast(sk, slow);
>   return len;
>  }
>  
Acked-by: Eric Dumazet 




Re: [PATCH 1/3] vhost-net: check the support of mergeable buffer outside the receive loop

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 04:10:59PM +0800, Jason Wang wrote:
> No need to check the support of mergeable buffers inside the receive
> loop, as the whole handle_rx_xx() runs in the read critical region.  So
> this patch moves the check ahead of the receiving loop.
> 
> Signed-off-by: Jason Wang 

Well, the feature check is mostly just features & bit,
so why would it be slower? Because of the RCU
adding memory barriers? Do you observe a
measurable speedup with this patch?

Apropos, I noticed that the check in vhost_has_feature
is wrong: it uses the same kind of RCU as the
private pointer. So we'll have to fix that properly
by adding more lockdep classes, but for now
we'll need to make
the check 1 || lockdep_is_held(&dev->mutex);
and add a TODO.
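Roughly, the interim change being described would look like this (a sketch
against the 2.6.37-era vhost_has_feature() in drivers/vhost/vhost.h, assuming
it reads acked_features through an RCU index check; not the final patch):

static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
{
        unsigned acked_features;

        /* TODO: introduce a separate lockdep class for the vq mutex so this
         * check can be made meaningful again. */
        acked_features =
                rcu_dereference_index_check(dev->acked_features,
                                            1 || lockdep_is_held(&dev->mutex));
        return acked_features & (1 << bit);
}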

> ---
>  drivers/vhost/net.c |5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 14fc189..95e49de 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -411,7 +411,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
>   };
>  
>   size_t total_len = 0;
> - int err, headcount;
> + int err, headcount, mergeable;
>   size_t vhost_hlen, sock_hlen;
>   size_t vhost_len, sock_len;
>   struct socket *sock = rcu_dereference(vq->private_data);
> @@ -425,6 +425,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
>  
>   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>   vq->log : NULL;
> + mergeable = vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF);
>  
>   while ((sock_len = peek_head_len(sock->sk))) {
>   sock_len += sock_hlen;
> @@ -474,7 +475,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
>   break;
>   }
>   /* TODO: Should check and handle checksum. */
> - if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
> + if (likely(mergeable) &&
>   memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
> offsetof(typeof(hdr), num_buffers),
> sizeof hdr.num_buffers)) {


Re: [PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 04:11:08PM +0800, Jason Wang wrote:
> Code duplication was found between the handling of mergeable and big
> buffers, so this patch tries to unify them. This could be easily done
> by adding a quota to get_rx_bufs() which is used to limit the
> number of buffers it returns (for mergeable buffers the quota is
> simply UIO_MAXIOV, for big buffers it is just 1), and then the
> previous handle_rx_mergeable() could be reused for big buffers as well.
> 
> Signed-off-by: Jason Wang 

We actually started this way. However, the code turned out
to have measurable overhead when handle_rx_mergeable
is called with quota 1.
So before applying this I'd like to see some data
showing this is not the case anymore.

> ---
>  drivers/vhost/net.c |  128 
> +++
>  1 files changed, 7 insertions(+), 121 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 95e49de..c32a2e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -227,6 +227,7 @@ static int peek_head_len(struct sock *sk)
>   * @iovcount - returned count of io vectors we fill
>   * @log  - vhost log
>   * @log_num  - log offset
> + * @quota   - headcount quota, 1 for big buffer
>   *   returns number of buffer heads allocated, negative on error
>   */
>  static int get_rx_bufs(struct vhost_virtqueue *vq,
> @@ -234,7 +235,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  int datalen,
>  unsigned *iovcount,
>  struct vhost_log *log,
> -unsigned *log_num)
> +unsigned *log_num,
> +unsigned int quota)
>  {
>   unsigned int out, in;
>   int seg = 0;
> @@ -242,7 +244,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>   unsigned d;
>   int r, nlogs = 0;
>  
> - while (datalen > 0) {
> + while (datalen > 0 && headcount < quota) {
>   if (unlikely(seg >= UIO_MAXIOV)) {
>   r = -ENOBUFS;
>   goto err;
> @@ -282,116 +284,7 @@ err:
>  
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
> -static void handle_rx_big(struct vhost_net *net)
> -{
> - struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> - unsigned out, in, log, s;
> - int head;
> - struct vhost_log *vq_log;
> - struct msghdr msg = {
> - .msg_name = NULL,
> - .msg_namelen = 0,
> - .msg_control = NULL, /* FIXME: get and handle RX aux data. */
> - .msg_controllen = 0,
> - .msg_iov = vq->iov,
> - .msg_flags = MSG_DONTWAIT,
> - };
> -
> - struct virtio_net_hdr hdr = {
> - .flags = 0,
> - .gso_type = VIRTIO_NET_HDR_GSO_NONE
> - };
> -
> - size_t len, total_len = 0;
> - int err;
> - size_t hdr_size;
> - struct socket *sock = rcu_dereference(vq->private_data);
> - if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> - return;
> -
> - mutex_lock(&vq->mutex);
> - vhost_disable_notify(vq);
> - hdr_size = vq->vhost_hlen;
> -
> - vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> - vq->log : NULL;
> -
> - for (;;) {
> - head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> -  ARRAY_SIZE(vq->iov),
> -  &out, &in,
> -  vq_log, &log);
> - /* On error, stop handling until the next kick. */
> - if (unlikely(head < 0))
> - break;
> - /* OK, now we need to know about added descriptors. */
> - if (head == vq->num) {
> - if (unlikely(vhost_enable_notify(vq))) {
> - /* They have slipped one in as we were
> -  * doing that: check again. */
> - vhost_disable_notify(vq);
> - continue;
> - }
> - /* Nothing new?  Wait for eventfd to tell us
> -  * they refilled. */
> - break;
> - }
> - /* We don't need to be notified again. */
> - if (out) {
> - vq_err(vq, "Unexpected descriptor format for RX: "
> -"out %d, int %d\n",
> -out, in);
> - break;
> - }
> - /* Skip header. TODO: support TSO/mergeable rx buffers. */
> - s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> - msg.msg_iovlen = in;
> - len = iov_length(vq->iov, in);
> - /* Sanity check */
> - if (!len) {
> - vq_err(vq, "Unexpected hea

[PATCH 1/3] vhost-net: check the support of mergeable buffer outside the receive loop

2011-01-17 Thread Jason Wang
No need to check the support of mergeable buffers inside the receive
loop, as the whole handle_rx_xx() runs in the read critical region.  So
this patch moves the check ahead of the receiving loop.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 14fc189..95e49de 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -411,7 +411,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
};
 
size_t total_len = 0;
-   int err, headcount;
+   int err, headcount, mergeable;
size_t vhost_hlen, sock_hlen;
size_t vhost_len, sock_len;
struct socket *sock = rcu_dereference(vq->private_data);
@@ -425,6 +425,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
 
vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
+   mergeable = vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF);
 
while ((sock_len = peek_head_len(sock->sk))) {
sock_len += sock_hlen;
@@ -474,7 +475,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
break;
}
/* TODO: Should check and handle checksum. */
-   if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
+   if (likely(mergeable) &&
memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
  offsetof(typeof(hdr), num_buffers),
  sizeof hdr.num_buffers)) {



[PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling

2011-01-17 Thread Jason Wang
Code duplication was found between the handling of mergeable and big
buffers, so this patch tries to unify them. This could be easily done
by adding a quota to get_rx_bufs() which is used to limit the
number of buffers it returns (for mergeable buffers the quota is
simply UIO_MAXIOV, for big buffers it is just 1), and then the
previous handle_rx_mergeable() could be reused for big buffers as well.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c |  128 +++
 1 files changed, 7 insertions(+), 121 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 95e49de..c32a2e4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -227,6 +227,7 @@ static int peek_head_len(struct sock *sk)
  * @iovcount   - returned count of io vectors we fill
  * @log- vhost log
  * @log_num- log offset
+ * @quota   - headcount quota, 1 for big buffer
  * returns number of buffer heads allocated, negative on error
  */
 static int get_rx_bufs(struct vhost_virtqueue *vq,
@@ -234,7 +235,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
   int datalen,
   unsigned *iovcount,
   struct vhost_log *log,
-  unsigned *log_num)
+  unsigned *log_num,
+  unsigned int quota)
 {
unsigned int out, in;
int seg = 0;
@@ -242,7 +244,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
unsigned d;
int r, nlogs = 0;
 
-   while (datalen > 0) {
+   while (datalen > 0 && headcount < quota) {
if (unlikely(seg >= UIO_MAXIOV)) {
r = -ENOBUFS;
goto err;
@@ -282,116 +284,7 @@ err:
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_big(struct vhost_net *net)
-{
-   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned out, in, log, s;
-   int head;
-   struct vhost_log *vq_log;
-   struct msghdr msg = {
-   .msg_name = NULL,
-   .msg_namelen = 0,
-   .msg_control = NULL, /* FIXME: get and handle RX aux data. */
-   .msg_controllen = 0,
-   .msg_iov = vq->iov,
-   .msg_flags = MSG_DONTWAIT,
-   };
-
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
-   };
-
-   size_t len, total_len = 0;
-   int err;
-   size_t hdr_size;
-   struct socket *sock = rcu_dereference(vq->private_data);
-   if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
-   return;
-
-   mutex_lock(&vq->mutex);
-   vhost_disable_notify(vq);
-   hdr_size = vq->vhost_hlen;
-
-   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
-   vq->log : NULL;
-
-   for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-vq_log, &log);
-   /* On error, stop handling until the next kick. */
-   if (unlikely(head < 0))
-   break;
-   /* OK, now we need to know about added descriptors. */
-   if (head == vq->num) {
-   if (unlikely(vhost_enable_notify(vq))) {
-   /* They have slipped one in as we were
-* doing that: check again. */
-   vhost_disable_notify(vq);
-   continue;
-   }
-   /* Nothing new?  Wait for eventfd to tell us
-* they refilled. */
-   break;
-   }
-   /* We don't need to be notified again. */
-   if (out) {
-   vq_err(vq, "Unexpected descriptor format for RX: "
-  "out %d, int %d\n",
-  out, in);
-   break;
-   }
-   /* Skip header. TODO: support TSO/mergeable rx buffers. */
-   s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
-   msg.msg_iovlen = in;
-   len = iov_length(vq->iov, in);
-   /* Sanity check */
-   if (!len) {
-   vq_err(vq, "Unexpected header len for RX: "
-  "%zd expected %zd\n",
-  iov_length(vq->hdr, s), hdr_size);
-   break;
-   }
-   err = sock->ops->recvmsg(NULL, sock, &msg,
-len, MSG_DONTWAIT | MSG_TRUNC);
-   /* TODO: Check specific e

[PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()

2011-01-17 Thread Jason Wang
We can use lock_sock_fast() instead of lock_sock() in order to get
speedup in peek_head_len().

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c32a2e4..50b622a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -211,12 +211,12 @@ static int peek_head_len(struct sock *sk)
 {
struct sk_buff *head;
int len = 0;
+   bool slow = lock_sock_fast(sk);
 
-   lock_sock(sk);
head = skb_peek(&sk->sk_receive_queue);
if (head)
len = head->len;
-   release_sock(sk);
+   unlock_sock_fast(sk, slow);
return len;
 }
 



Re: [PATCH] KVM: x86: Remove user space triggerable MCE error message

2011-01-17 Thread Jan Kiszka
On 2011-01-17 08:31, Huang Ying wrote:
> On Mon, 2011-01-17 at 15:11 +0800, Jan Kiszka wrote:
>> On 2011-01-17 01:54, Huang Ying wrote:
>>> On Sat, 2011-01-15 at 17:00 +0800, Jan Kiszka wrote:
 From: Jan Kiszka 

 This case is a pure user space error we do not need to record. Moreover,
 it can be misused to flood the kernel log. Remove it.
>>>
>>> I don't think this is a pure user space error.  This happens on real
>>> hardware too, if the Machine Check exception is raised during early boot
>>> stage or the second MC exception is raised before the first MC exception
>>> is processed/cleared.
>>>
>>> So I use printk here to help debugging these issues.
>>>
>>> To avoid flooding the kernel log, we can use ratelimit.
>>
>> With user space I meant qemu, and maybe "error" was the wrong term. This
>> code path is only triggered if qemu decides to.
> 
> It's not only decided by qemu, but also by the guest OS.  If the guest OS
> does not clear the MSR, or does not set the X86_CR4_MCE bit in
> CR4, a triple fault will be triggered.

Right, but qemu has full access to the state and can detect this.

> 
>> And there you may also
>> print this event (and you already do).
> 
> Sorry, which print do you mean?  I cannot find a similar print in user
> space.

Err, it's not yet there for the kvm case; I was staring at the tcg code
path.

> 
>> Another reason to not rely on catching this case here: KVM_X86_SET_MCE
>> is obsolete on current kernels. Qemu will use a combination of
>> KVM_SET_MSRS and KVM_SET_VCPU_EVENTS in the future, only falling back to
>> this interface on pre-vcpu-events kernels. Then you need to debug this
>> in user space anyway as the triple fault will no longer make it to the
>> kernel.
> 
> OK.  Then, I think it will be helpful for debugging if we can print
> something like this in the user space implementation.

qemu_log_mask(CPU_LOG_RESET, "Triple fault\n");

(and there will be a message on the monitor if raised manually)

Jan


