Re: [PATCH] virtio_net: large tx MTU support

2008-11-26 Thread David Miller
From: Mark McLoughlin <[EMAIL PROTECTED]>
Date: Wed, 26 Nov 2008 13:58:11 +

> We don't really have a max tx packet size limit, so allow configuring
> the device with up to 64k tx MTU.
> 
> Signed-off-by: Mark McLoughlin <[EMAIL PROTECTED]>

Rusty, ACK?

If so, I'll toss this into net-next-2.6, thanks!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 0/4] [RFC] Another proportional weight IO controller

2008-11-26 Thread Fabio Checconi
> From: Nauman Rafique <[EMAIL PROTECTED]>
> Date: Wed, Nov 26, 2008 11:41:46AM -0800
>
> On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <[EMAIL PROTECTED]> wrote:
> > Fabio and I are a little bit worried about the fact that the problem
> > of working in the time domain instead of the service domain is not
> > being properly dealt with.  Probably we did not express ourselves very
> > clearly, so we will try to put in more practical terms.  Using B-WF2Q+
> > in the time domain instead of using CFQ (Round-Robin) means introducing
> > higher complexity than CFQ to get almost the same service properties
> > of CFQ.  With regard to fairness (long term) B-WF2Q+ in the time domain
> 
> Are we talking about a case where all the contenders have equal
> weights and are continuously backlogged? That seems to be the only
> case when B-WF2Q+ would behave like Round-Robin. Am I missing
> something here?
> 

It is indeed the case of equal weights, but it is a really common one.


> I can see that the only direct advantage of using WF2Q+ scheduling is
> reduced jitter or latency in certain cases. But under heavy loads,
> that might result in request latencies seen by RT threads to be
> reduced from a few seconds to a few msec.
> 
> > has exactly the same (un)fairness problems of CFQ.  As far as bandwidth
> > differentiation is concerned, it can be obtained with CFQ by just
> > increasing the time slice (e.g., double weight => double slice).  This
> > has no impact on long term guarantees and certainly does not decrease
> > the throughput.
> >
> > With regard to short term guarantees (request completion time), one of
> > the properties of the reference ideal system of Wf2Q+ is that, assuming
> > for simplicity that all the queues have the same weight, as the ideal
> > system serves each queue at the same speed, shorter budgets are completed
> > in a shorter time intervals than longer budgets.  B-WF2Q+ guarantees
> > O(1) deviation from this ideal service.  Hence, the tight delay/jitter
> > measured in our experiments with BFQ is a consequence of the simple (and
> > probably still improvable) budget assignment mechanism of (the overall)
> > BFQ.  In contrast, if all the budgets are equal, as it happens if we use
> > time slices, the resulting scheduler is exactly a Round-Robin, again
> > as in CFQ (see [1]).
> 
> Can the budget assignment mechanism of BFQ be converted to time slice
> assignment mechanism? What I am trying to say here is that we can have
> variable time slices, just like we have variable budgets.
> 

Yes, it could be converted, and it would do in the time domain the
same differentiation it does now in the service domain.  What we would
lose in the process is the fairness in the service domain.  The service
properties/guarantees of the resulting scheduler would _not_ be the same
as the BFQ ones.  Both long-term and short-term guarantees would be
affected by the unfairness caused by the different service rates
experienced by the scheduled entities.
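
To make the difference concrete, here is a rough sketch in pseudo-C (the
names are made up, they are not taken from BFQ or CFQ code):

/* Service domain (BFQ-like): the weight scales the amount of service
 * (sectors) a queue may receive before it is descheduled. */
budget_sectors = BASE_BUDGET * weight / BASE_WEIGHT;

/* Time domain (CFQ-like): the weight scales the amount of time a queue
 * may own the disk. */
slice_ms = BASE_SLICE * weight / BASE_WEIGHT;

/* With equal weights, a seeky queue transfers far fewer sectors than a
 * sequential one during the same slice, so time-domain fairness does not
 * translate into service-domain fairness. */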


> >
> > Finally, with regard to completion time delay differentiation through
> > weight differentiation, this is probably the only case in which B-WF2Q+
> > would perform better than CFQ, because, in case of CFQ, reducing the
> > time slices may reduce the throughput, whereas increasing the time slice
> > would increase the worst-case delay/jitter.
> >
> > In the end, BFQ succeeds in guaranteeing fairness (or in general the
> > desired bandwidth distribution) because it works in the service domain
> > (and this is probably the only way to achieve this goal), not because
> > it uses WF2Q+ instead of Round-Robin.  Similarly, it provides tight
> > delay/jitter only because B-WF2Q+ is used in combination with a simple
> > budget assignment (differentiation) mechanism (again in the service
> > domain).
> >
> > [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
> >
> > --
> > ---
> > | Paolo Valente  ||
> > | Algogroup  ||
> > | Dip. Ing. Informazione | tel:   +39 059 2056318 |
> > | Via Vignolese 905/b| fax:   +39 059 2056199 |
> > | 41100 Modena   ||
> > | home:  http://algo.ing.unimo.it/people/paolo/   |
> > ---
> >
> >


Re: [SR-IOV driver example 0/3] introduction

2008-11-26 Thread Jeff Garzik
Yu Zhao wrote:
> SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> of the drivers: Physical Function driver and Virtual Function driver.
> The PF driver is based on the IGB driver and is used to control PF to
> allocate hardware specific resources and interface with the SR-IOV core.
> The VF driver is a new NIC driver that is same as the traditional PCI
> device driver. It works in both the host and the guest (Xen and KVM)
> environment.
> 
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.
> 
> Intel 82576 NIC specification can be found at:
> http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
> 
> [SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
> [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
> [SR-IOV driver example 3/3] VF driver tar ball

Please copy [EMAIL PROTECTED] on all network-related patches.  This 
is where the network developers live, and all patches on this list are 
automatically archived for review and handling at 
http://patchwork.ozlabs.org/project/netdev/list/

Jeff





Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

2008-11-26 Thread Greg KH
On Wed, Nov 26, 2008 at 11:27:10AM -0800, Nakajima, Jun wrote:
> On 11/26/2008 8:58:59 AM, Greg KH wrote:
> > On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> > > This patch integrates the IGB driver with the SR-IOV core. It shows
> > > how the SR-IOV API is used to support the capability. Obviously
> > > people does not need to put much effort to integrate the PF driver
> > > with SR-IOV core. All SR-IOV standard stuff are handled by SR-IOV
> > > core and PF driver once it gets the necessary information (i.e.
> > > number of Virtual
> > > Functions) from the callback function.
> > >
> > > ---
> > >  drivers/net/igb/igb_main.c |   30 ++
> > >  1 files changed, 30 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> > > index bc063d4..b8c7dc6 100644
> > > --- a/drivers/net/igb/igb_main.c
> > > +++ b/drivers/net/igb/igb_main.c
> > > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *,
> > > struct e1000_hw *, int, u16);  static int igb_vmm_control(struct
> > > igb_adapter *, bool);  static int igb_set_vf_mac(struct net_device
> > > *, int, u8*);  static void igb_mbox_handler(struct igb_adapter *);
> > > +static int igb_virtual(struct pci_dev *, int);
> > >  #endif
> > >
> > >  static int igb_suspend(struct pci_dev *, pm_message_t); @@ -184,6
> > > +185,9 @@ static struct pci_driver igb_driver = {  #endif
> > > .shutdown = igb_shutdown,
> > > .err_handler = &igb_err_handler,
> > > +#ifdef CONFIG_PCI_IOV
> > > +   .virtual = igb_virtual
> > > +#endif
> >
> > #ifdef should not be needed, right?
> >
> 
> Good point. I think this is because the driver is expected to build on
> older kernels also,

That should not be an issue for patches that are being submitted, right?

And if this is the case, shouldn't it be called out in the changelog
entry?

> but the problem is that the driver (and probably others) is broken
> unless the kernel is built with CONFIG_PCI_IOV because of the
> following hunk, for example.
> 
> However, we don't want to use #ifdef for the (*virtual) field in the
> header. One option would be to define a constant like the following
> along with those changes.
> #define PCI_DEV_IOV
> 
> Any better idea?

Just always declare it in your driver, and merge the driver patch _after_
this field gets added to the kernel tree.  It's not a big deal, just a
patch-ordering issue.

And remember: don't add #ifdefs to drivers, they should not be needed
at all.
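
In other words, once the (*virtual) field is unconditionally present in
struct pci_driver, the hunk above can simply become (sketch only, based
on the patch in this thread):

	.shutdown = igb_shutdown,
	.err_handler = &igb_err_handler,
	.virtual = igb_virtual,		/* always set, no #ifdef needed */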

thanks,

greg k-h


Re: [patch 0/4] [RFC] Another proportional weight IO controller

2008-11-26 Thread Nauman Rafique
On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <[EMAIL PROTECTED]> wrote:
> Fabio and I are a little bit worried about the fact that the problem
> of working in the time domain instead of the service domain is not
> being properly dealt with.  Probably we did not express ourselves very
> clearly, so we will try to put in more practical terms.  Using B-WF2Q+
> in the time domain instead of using CFQ (Round-Robin) means introducing
> higher complexity than CFQ to get almost the same service properties
> of CFQ.  With regard to fairness (long term) B-WF2Q+ in the time domain

Are we talking about a case where all the contenders have equal
weights and are continuously backlogged? That seems to be the only
case when B-WF2Q+ would behave like Round-Robin. Am I missing
something here?

I can see that the only direct advantage of using WF2Q+ scheduling is
reduced jitter or latency in certain cases. But under heavy loads,
that might result in request latencies seen by RT threads to be
reduced from a few seconds to a few msec.

> has exactly the same (un)fairness problems of CFQ.  As far as bandwidth
> differentiation is concerned, it can be obtained with CFQ by just
> increasing the time slice (e.g., double weight => double slice).  This
> has no impact on long term guarantees and certainly does not decrease
> the throughput.
>
> With regard to short term guarantees (request completion time), one of
> the properties of the reference ideal system of Wf2Q+ is that, assuming
> for simplicity that all the queues have the same weight, as the ideal
> system serves each queue at the same speed, shorter budgets are completed
> in a shorter time intervals than longer budgets.  B-WF2Q+ guarantees
> O(1) deviation from this ideal service.  Hence, the tight delay/jitter
> measured in our experiments with BFQ is a consequence of the simple (and
> probably still improvable) budget assignment mechanism of (the overall)
> BFQ.  In contrast, if all the budgets are equal, as it happens if we use
> time slices, the resulting scheduler is exactly a Round-Robin, again
> as in CFQ (see [1]).

Can the budget assignment mechanism of BFQ be converted to time slice
assignment mechanism? What I am trying to say here is that we can have
variable time slices, just like we have variable budgets.

>
> Finally, with regard to completion time delay differentiation through
> weight differentiation, this is probably the only case in which B-WF2Q+
> would perform better than CFQ, because, in case of CFQ, reducing the
> time slices may reduce the throughput, whereas increasing the time slice
> would increase the worst-case delay/jitter.
>
> In the end, BFQ succeeds in guaranteeing fairness (or in general the
> desired bandwidth distribution) because it works in the service domain
> (and this is probably the only way to achieve this goal), not because
> it uses WF2Q+ instead of Round-Robin.  Similarly, it provides tight
> delay/jitter only because B-WF2Q+ is used in combination with a simple
> budget assignment (differentiation) mechanism (again in the service
> domain).
>
> [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
>
> --
> ---
> | Paolo Valente  ||
> | Algogroup  ||
> | Dip. Ing. Informazione | tel:   +39 059 2056318 |
> | Via Vignolese 905/b| fax:   +39 059 2056199 |
> | 41100 Modena   ||
> | home:  http://algo.ing.unimo.it/people/paolo/   |
> ---
>
>


RE: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

2008-11-26 Thread Nakajima, Jun
On 11/26/2008 8:58:59 AM, Greg KH wrote:
> On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> > This patch integrates the IGB driver with the SR-IOV core. It shows
> > how the SR-IOV API is used to support the capability. Obviously
> > people does not need to put much effort to integrate the PF driver
> > with SR-IOV core. All SR-IOV standard stuff are handled by SR-IOV
> > core and PF driver once it gets the necessary information (i.e.
> > number of Virtual
> > Functions) from the callback function.
> >
> > ---
> >  drivers/net/igb/igb_main.c |   30 ++
> >  1 files changed, 30 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> > index bc063d4..b8c7dc6 100644
> > --- a/drivers/net/igb/igb_main.c
> > +++ b/drivers/net/igb/igb_main.c
> > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *,
> > struct e1000_hw *, int, u16);  static int igb_vmm_control(struct
> > igb_adapter *, bool);  static int igb_set_vf_mac(struct net_device
> > *, int, u8*);  static void igb_mbox_handler(struct igb_adapter *);
> > +static int igb_virtual(struct pci_dev *, int);
> >  #endif
> >
> >  static int igb_suspend(struct pci_dev *, pm_message_t); @@ -184,6
> > +185,9 @@ static struct pci_driver igb_driver = {  #endif
> > .shutdown = igb_shutdown,
> > .err_handler = &igb_err_handler,
> > +#ifdef CONFIG_PCI_IOV
> > +   .virtual = igb_virtual
> > +#endif
>
> #ifdef should not be needed, right?
>

Good point. I think this is because the driver is expected to build on older 
kernels as well, but the problem is that the driver (and probably others) 
breaks unless the kernel is built with CONFIG_PCI_IOV, because of the 
following hunk, for example.

However, we don't want to use #ifdef for the (*virtual) field in the header. 
One option would be to define a constant like the following along with those 
changes.
#define PCI_DEV_IOV

Any better idea?

Thanks,
Jun Nakajima | Intel Open Source Technology Center


@@ -259,6 +266,7 @@ struct pci_dev {
struct list_head msi_list;
 #endif
struct pci_vpd *vpd;
+   struct pci_iov *iov;
 };

 extern struct pci_dev *alloc_pci_dev(void);
@@ -426,6 +434,7 @@ struct pci_driver {
	int  (*resume_early) (struct pci_dev *dev);
	int  (*resume) (struct pci_dev *dev);	/* Device woken up */
	void (*shutdown) (struct pci_dev *dev);
+	int (*virtual) (struct pci_dev *dev, int nr_virtfn);
	struct pm_ext_ops *pm;
	struct pci_error_handlers *err_handler;
	struct device_driver	driver;



Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

2008-11-26 Thread Chris Wright
* Greg KH ([EMAIL PROTECTED]) wrote:
> > +static int
> > +igb_virtual(struct pci_dev *pdev, int nr_virtfn)
> > +{
> > +   unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
> > +   struct net_device *netdev = pci_get_drvdata(pdev);
> > +   struct igb_adapter *adapter = netdev_priv(netdev);
> > +   int i;
> > +
> > +   if (nr_virtfn > 7)
> > +   return -EINVAL;
> 
> Why the check for 7?  Is that the max virtual functions for this card?
> Shouldn't that be a define somewhere so it's easier to fix in future
> versions of this hardware?  :)

IIRC it's 8 for the card, 1 reserved for PF.  I think both notions
should be captured w/ commented constants.
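
Something along these lines, for example (constant names are made up,
not taken from the driver):

#define IGB_82576_NUM_FUNCS	8	/* total functions on the 82576 */
#define IGB_82576_NUM_PF	1	/* one function reserved for the PF */
#define IGB_82576_MAX_VF	(IGB_82576_NUM_FUNCS - IGB_82576_NUM_PF)

	if (nr_virtfn > IGB_82576_MAX_VF)
		return -EINVAL;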

thanks,
-chris


Re: [PATCH 1/2] virtio: block: set max_segment_size and max_sectors to infinite.

2008-11-26 Thread Chris Wright
* Rusty Russell ([EMAIL PROTECTED]) wrote:
> + /* No real sector limit. */
> + blk_queue_max_sectors(vblk->disk->queue, -1U);
> +

Is that actually legitimate?  I think it'd still work out, but seems
odd, e.g. all the spots that do:

q->max_hw_sectors << 9

will just toss the upper bits...
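
To illustrate the concern (not actual queue code):

	unsigned int max_hw_sectors = -1U;		/* 0xffffffff */
	unsigned int bytes = max_hw_sectors << 9;	/* top 9 bits lost */
	/* bytes ends up as 0xfffffe00 (~4GB - 512) rather than "unlimited",
	 * which is still huge, so in practice it probably works out. */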

thanks,
-chris


Re: [SR-IOV driver example 3/3] VF driver tar ball

2008-11-26 Thread Greg KH
On Wed, Nov 26, 2008 at 10:40:43PM +0800, Yu Zhao wrote:
> The attachment is the VF driver for Intel 82576 NIC.

Please don't attach things as tarballs, we can't review or easily read
them at all.

Care to resend it?

thanks,

greg k-h


Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

2008-11-26 Thread Greg KH
On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> This patch integrates the IGB driver with the SR-IOV core. It shows how
> the SR-IOV API is used to support the capability. Obviously people does
> not need to put much effort to integrate the PF driver with SR-IOV core.
> All SR-IOV standard stuff are handled by SR-IOV core and PF driver only
> concerns the device specific resource allocation and deallocation once it
> gets the necessary information (i.e. number of Virtual Functions) from
> the callback function.
> 
> ---
>  drivers/net/igb/igb_main.c |   30 ++
>  1 files changed, 30 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> index bc063d4..b8c7dc6 100644
> --- a/drivers/net/igb/igb_main.c
> +++ b/drivers/net/igb/igb_main.c
> @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct 
> e1000_hw *, int, u16);
>  static int igb_vmm_control(struct igb_adapter *, bool);
>  static int igb_set_vf_mac(struct net_device *, int, u8*);
>  static void igb_mbox_handler(struct igb_adapter *);
> +static int igb_virtual(struct pci_dev *, int);
>  #endif
>  
>  static int igb_suspend(struct pci_dev *, pm_message_t);
> @@ -184,6 +185,9 @@ static struct pci_driver igb_driver = {
>  #endif
>   .shutdown = igb_shutdown,
>   .err_handler = &igb_err_handler,
> +#ifdef CONFIG_PCI_IOV
> + .virtual = igb_virtual
> +#endif

#ifdef should not be needed, right?

>  };
>  
>  static int global_quad_port_a; /* global quad port a indication */
> @@ -5107,6 +5111,32 @@ void igb_set_mc_list_pools(struct igb_adapter *adapter,
>   reg_data |= (1 << 25);
>   wr32(E1000_VMOLR(pool), reg_data);
>  }
> +
> +static   int
> +igb_virtual(struct pci_dev *pdev, int nr_virtfn)
> +{
> + unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
> + struct net_device *netdev = pci_get_drvdata(pdev);
> + struct igb_adapter *adapter = netdev_priv(netdev);
> + int i;
> +
> + if (nr_virtfn > 7)
> + return -EINVAL;

Why the check for 7?  Is that the max virtual functions for this card?
Shouldn't that be a define somewhere so it's easier to fix in future
versions of this hardware?  :)

> +
> + if (nr_virtfn) {
> + for (i = 0; i < nr_virtfn; i++) {
> + printk(KERN_INFO "SR-IOV: VF %d is enabled\n", i);

Use dev_info() please, that shows the exact pci device and driver that
emitted the message.
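
i.e. something like (pdev is already available in igb_virtual()):

	dev_info(&pdev->dev, "SR-IOV: VF %d is enabled\n", i);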

> + my_mac_addr[5] = (unsigned char)i;
> + igb_set_vf_mac(netdev, i, my_mac_addr);
> + igb_set_vf_vmolr(adapter, i);
> + }
> + } else
> + printk(KERN_INFO "SR-IOV is disabled\n");

Is that really true?  (oh, use dev_info as well.)  What happens if you
had called this with "5" and then later with "0", you never destroyed
those existing virtual functions, yet the code does:

> + adapter->vfs_allocated_count = nr_virtfn;

Which makes the driver think they are not present.  What happens when
the driver later goes to shut down?  Are those resources freed up
properly?

thanks,

greg k-h


Re: [SR-IOV driver example 0/3] introduction

2008-11-26 Thread Greg KH
On Wed, Nov 26, 2008 at 10:03:03PM +0800, Yu Zhao wrote:
> SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> of the drivers: Physical Function driver and Virtual Function driver.
> The PF driver is based on the IGB driver and is used to control PF to
> allocate hardware specific resources and interface with the SR-IOV core.
> The VF driver is a new NIC driver that is same as the traditional PCI
> device driver. It works in both the host and the guest (Xen and KVM)
> environment.
> 
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.

That's funny, as some distros are already shipping this driver.  You
might want to tell them that this is an "example only" driver and not to
be used "for real"... :(

greg k-h


Re: [patch 0/4] [RFC] Another proportional weight IO controller

2008-11-26 Thread Vivek Goyal
On Wed, Nov 26, 2008 at 09:47:07PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> From: Vivek Goyal <[EMAIL PROTECTED]>
> Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller
> Date: Tue, 25 Nov 2008 11:27:20 -0500
> 
> > On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > > 
> > > > > > Ryo, do you still want to stick to two level scheduling? Given the 
> > > > > > problem
> > > > > > of it breaking down underlying scheduler's assumptions, probably it 
> > > > > > makes
> > > > > > more sense to the IO control at each individual IO scheduler.
> > > > > 
> > > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > > algorithm into the block I/O layer experimentally.
> > > > 
> > > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > > 2 level scheduling. We will still have the issue of breaking underlying
> > > > CFQ and other schedulers. How to plan to resolve that conflict.
> > > 
> > > I think there is no conflict against I/O schedulers.
> > > Could you expain to me about the conflict?
> > 
> > Because we do the buffering at higher level scheduler and mostly release
> > the buffered bios in the FIFO order, it might break the underlying IO
> > schedulers. Generally it is the decision of IO scheduler to determine in
> > what order to release buffered bios.
> > 
> > For example, If there is one task of io priority 0 in a cgroup and rest of
> > the tasks are of io prio 7. All the tasks belong to best effort class. If
> > tasks of lower priority (7) do lot of IO, then due to buffering there is
> > a chance that IO from lower prio tasks is seen by CFQ first and io from
> > higher prio task is not seen by cfq for quite some time hence that task
> > not getting it fair share with in the cgroup. Similiar situations can
> > arise with RT tasks also.
> 
> Thanks for your explanation. 
> I think that the same thing occurs without the higher level scheduler,
> because all the tasks issuing I/Os are blocked while the underlying
> device's request queue is full before those I/Os are sent to the I/O
> scheduler.
> 

True, and this issue was pointed out by Divyesh. I think we shall have to
fix this by allocating the request descriptors in proportion to their
share. One possible way is to make use of elv_may_queue() to determine
if we can allocate further request descriptors or not.
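
Roughly something like this in the elevator's may_queue hook (just a
sketch; io_group, io_group_of() and group_request_quota() are made-up
names, not existing code):

static int iocontrol_may_queue(struct request_queue *q, int rw)
{
	struct io_group *iog = io_group_of(current);	/* made-up helper */

	/* allow new request descriptors only up to this group's
	 * proportional share of q->nr_requests */
	if (iog->allocated >= group_request_quota(iog, q->nr_requests))
		return ELV_MQUEUE_NO;

	return ELV_MQUEUE_MAY;
}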

> > > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > > may be little above that where one can try some code sharing among IO
> > > > schedulers? 
> > > 
> > > I would like to support any type of block device even if I/Os issued
> > > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > > can be made use of for the devices such as loop device.
> > > 
> > 
> > What do you mean by that IO issued to underlying device does not go
> > through IO scheduler? loop device will be associated with a file and
> > IO will ultimately go to the IO scheduler which is serving those file
> > blocks?
> 
> How about if the files is on an NFS-mounted file system?
> 

Interesting. So on the surface it looks like contention for the disk, but
it is really contention for the network and for the disk on the NFS server.

It is true that leaf-node IO control will not help here, as the IO is not
going to a local leaf node at all. We can make the situation better by
doing resource control on network IO, though.

> > What's the use case scenario of doing IO control at loop device?
> > Ultimately the resource contention will take place on actual underlying
> > physical device where the file blocks are. Will doing the resource control
> > there not solve the issue for you?
> 
> I don't come up with any use case, but I would like to make the
> resource controller more flexible. Actually, a certain block device
> that I'm using does not use the I/O scheduler.

Isn't that equivalent to using noop? If so, then it should not be an
issue, right?

Thanks
Vivek


[SR-IOV driver example 1/3] PF driver: allocate hardware specific resource

2008-11-26 Thread Yu Zhao
This patch makes the IGB driver allocate hardware resources (rx/tx queues)
for Virtual Functions. All operations in this patch are hardware specific.

---
 drivers/net/igb/Makefile|2 +-
 drivers/net/igb/e1000_82575.c   |1 +
 drivers/net/igb/e1000_82575.h   |   61 
 drivers/net/igb/e1000_defines.h |7 +
 drivers/net/igb/e1000_hw.h  |2 +
 drivers/net/igb/e1000_regs.h|   13 +
 drivers/net/igb/e1000_vf.c  |  223 ++
 drivers/net/igb/igb.h   |   10 +
 drivers/net/igb/igb_main.c  |  604 ++-
 9 files changed, 910 insertions(+), 13 deletions(-)
 create mode 100644 drivers/net/igb/e1000_vf.c

diff --git a/drivers/net/igb/Makefile b/drivers/net/igb/Makefile
index 1927b3f..ab3944c 100644
--- a/drivers/net/igb/Makefile
+++ b/drivers/net/igb/Makefile
@@ -33,5 +33,5 @@
 obj-$(CONFIG_IGB) += igb.o
 
 igb-objs := igb_main.o igb_ethtool.o e1000_82575.o \
-   e1000_mac.o e1000_nvm.o e1000_phy.o
+   e1000_mac.o e1000_nvm.o e1000_phy.o e1000_vf.o
 
diff --git a/drivers/net/igb/e1000_82575.c b/drivers/net/igb/e1000_82575.c
index f5e2e72..bb823ac 100644
--- a/drivers/net/igb/e1000_82575.c
+++ b/drivers/net/igb/e1000_82575.c
@@ -87,6 +87,7 @@ static s32 igb_get_invariants_82575(struct e1000_hw *hw)
case E1000_DEV_ID_82576:
case E1000_DEV_ID_82576_FIBER:
case E1000_DEV_ID_82576_SERDES:
+   case E1000_DEV_ID_82576_QUAD_COPPER:
mac->type = e1000_82576;
break;
default:
diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..8c488ab 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -170,4 +170,65 @@ struct e1000_adv_tx_context_desc {
 #define E1000_DCA_TXCTRL_CPUID_SHIFT 24 /* Tx CPUID now in the last byte */
 #define E1000_DCA_RXCTRL_CPUID_SHIFT 24 /* Rx CPUID now in the last byte */
 
+#define MAX_NUM_VFS   8
+
+#define E1000_DTXSWC_VMDQ_LOOPBACK_EN (1 << 31)  /* global VF LB enable */
+
+/* Easy defines for setting default pool, would normally be left a zero */
+#define E1000_VT_CTL_DEFAULT_POOL_SHIFT 7
+#define E1000_VT_CTL_DEFAULT_POOL_MASK  (0x7 << E1000_VT_CTL_DEFAULT_POOL_SHIFT)
+
+/* Other useful VMD_CTL register defines */
+#define E1000_VT_CTL_DISABLE_DEF_POOL   (1 << 29)
+#define E1000_VT_CTL_VM_REPL_EN (1 << 30)
+
+/* Per VM Offload register setup */
+#define E1000_VMOLR_LPE0x0001 /* Accept Long packet */
+#define E1000_VMOLR_AUPE   0x0100 /* Accept untagged packets */
+#define E1000_VMOLR_BAM0x0800 /* Accept Broadcast packets */
+#define E1000_VMOLR_MPME   0x1000 /* Multicast promiscuous mode */
+#define E1000_VMOLR_STRVLAN0x4000 /* Vlan stripping enable */
+
+#define E1000_P2VMAILBOX_STS   0x0001 /* Initiate message send to VF */
+#define E1000_P2VMAILBOX_ACK   0x0002 /* Ack message recv'd from VF */
+#define E1000_P2VMAILBOX_VFU   0x0004 /* VF owns the mailbox buffer */
+#define E1000_P2VMAILBOX_PFU   0x0008 /* PF owns the mailbox buffer */
+
+#define E1000_VLVF_ARRAY_SIZE 32
+#define E1000_VLVF_VLANID_MASK0x0FFF
+#define E1000_VLVF_POOLSEL_SHIFT  12
+#define E1000_VLVF_POOLSEL_MASK   (0xFF << E1000_VLVF_POOLSEL_SHIFT)
+#define E1000_VLVF_VLANID_ENABLE  0x8000
+
+#define E1000_VFMAILBOX_SIZE   16 /* 16 32 bit words - 64 bytes */
+
+/* If it's a E1000_VF_* msg then it originates in the VF and is sent to the
+ * PF.  The reverse is true if it is E1000_PF_*.
+ * Message ACK's are the value or'd with 0xF000
+ */
+#define E1000_VT_MSGTYPE_ACK  0xF000  /* Messages below or'd with
+   * this are the ACK */
+#define E1000_VT_MSGTYPE_NACK 0xFF00  /* Messages below or'd with
+   * this are the NACK */
+#define E1000_VT_MSGINFO_SHIFT16
+/* bits 23:16 are used for extra info for certain messages */
+#define E1000_VT_MSGINFO_MASK (0xFF << E1000_VT_MSGINFO_SHIFT)
+
+#define E1000_VF_MSGTYPE_REQ_MAC  1 /* VF needs to know its MAC */
+#define E1000_VF_MSGTYPE_VFLR 2 /* VF notifies VFLR to PF */
+#define E1000_VF_SET_MULTICAST3 /* VF requests PF to set MC addr */
+#define E1000_VF_SET_VLAN 4 /* VF requests PF to set VLAN */
+#define E1000_VF_SET_LPE  5 /* VF requests PF to set VMOLR.LPE */
+
+s32  e1000_send_mail_to_vf(struct e1000_hw *hw, u32 *msg,
+   u32 vf_number, s16 size);
+s32  e1000_receive_mail_from_vf(struct e1000_hw *hw, u32 *msg,
+u32 vf_number, s16 size);
+void e1000_vmdq_loopback_enable_vf(struct e1000_hw *hw);
+void e1000_vmdq_loopback_disable_vf(struct e1000_hw *hw);
+void e1000_vmdq_replication_enable_vf(struct e1000_hw *hw, u32 enables);
+void e1000_vmdq_replication_disable_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_ack_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_mail

Re: [patch 0/4] [RFC] Another proportional weight IO controller

2008-11-26 Thread Vivek Goyal
On Wed, Nov 26, 2008 at 03:40:18PM +0900, Fernando Luis Vázquez Cao wrote:
> On Thu, 2008-11-20 at 08:40 -0500, Vivek Goyal wrote:
> > > The dm approach has some merrits, the major one being that it'll fit
> > > directly into existing setups that use dm and can be controlled with
> > > familiar tools. That is a bonus. The draw back is partially the same -
> > > it'll require dm. So it's still not a fit-all approach, unfortunately.
> > > 
> > > So I'd prefer an approach that doesn't force you to use dm.
> > 
> > Hi Jens,
> > 
> > My patches met the goal of not using the dm for every device one wants
> > to control.
> > 
> > Having said that, few things come to mind.
> > 
> > - In what cases do we need to control the higher level logical devices
> >   like dm. It looks like real contention for resources is at leaf nodes.
> >   Hence any kind of resource management/fair queueing should probably be
> >   done at leaf nodes and not at higher level logical nodes.
> 
> The problem with stacking devices is that we do not know how the IO
> going through the leaf nodes contributes to the aggregate throughput
> seen by the application/cgroup that generated it, which is what end
> users care about.
> 

If we keep track of cgroup information in the bio and don't lose it while
the bio traverses the stack of devices, then the leaf node can still do
proportional fair-share allocation among the contending cgroups on that
device.
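
For example, something as simple as a new field in struct bio (bi_cgroup
is hypothetical, it does not exist today):

	struct cgroup *bi_cgroup;	/* set at submission time, preserved
					   by stacking drivers when they
					   clone or remap the bio */

With that, the elevator at the leaf device can charge the IO to the right
group no matter how many dm/md layers the bio has crossed.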

I think end users care about getting their fair share if there is contention
anywhere along the IO path. The real contention is at the leaf nodes. However
complex the logical device topology is, if two applications are not contending
for the disk at the lowest level, there is no point in doing any kind of
resource management between them. Though the applications might seem to be
contending for a higher-level logical device, at the leaf nodes their IOs
might be going to different disks altogether, so practically there is no
contention.

> The block device could be a plain old sata device, a loop device, a
> stacking device, a SSD, you name it, but their topologies and the fact
> that some of them do not even use an elevator should be transparent to
> the user.

Are there devices which don't use elevators at the leaf nodes? If not,
then it's not an issue.

> 
> If you wanted to do resource management at the leaf nodes some kind of
> topology information should be passed down to the elevators controlling
> the underlying devices, which in turn would need to work cooperatively.
> 

I am not able to understand why topology information needs to be passed
to the underlying elevators. As long as the end device can map a bio
correctly to the right cgroup (irrespective of how complex the topology is),
and the end device steps into resource management only if there is
contention for resources among cgroups on that device, things are fine.
We don't have to worry about the intermediate complex topology.

Let me take one hypothetical example. Let's assume there are two cgroups
A and B with weights 2048 and 1024 respectively. To me this information
means that if A and B really contend for the resources somewhere, then
make sure A gets 2/3 of the resources and B gets 1/3.

Now if tasks in these two groups happen to contend for the same disk at the
lowest level, we do resource management; otherwise we don't. Why do I need
to worry about intermediate logical devices in the IO path?

Maybe I am missing something. A detailed example would help here...

> >   If that makes sense, then probably we don't need to control dm device
> >   and we don't need such higher level solutions.
> 
> For the reasons stated above the two level scheduling approach seems
> cleaner to me.
> 
> > - Any kind of 2 level scheduler solution has the potential to break the
> >   underlying IO scheduler. Higher level solution requires buffering of
> >   bios and controlled release of bios to lower layers. This control breaks
> >   the assumptions of lower layer IO scheduler which knows in what order
> >   bios should be dispatched to device to meet the semantics exported by
> >   the IO scheduler.
> 
> Please notice that the such an IO controller would only get in the way
> of the elevator in case of contention for the device.

True. So are we saying that a user can get the expected CFQ or AS behavior
only if there is no contention, and if there is contention, then we don't
guarantee anything?

> What is more,
> depending on the workload it turns out that buffering at higher layers
> in a per-cgroup or per-task basis, like dm-band does, may actually
> increase the aggregate throughput (I think that the dm-band team
> observed this behavior too). The reason seems to be that bios buffered
> in such way tend to be highly correlated and thus very likely to get
> merged when released to the elevator.

The goal here is not to increase throughput by doing buffering at a higher
layer. That is what the IO scheduler currently does: it buffers bios and
selects them appropriately to boost throughput. If on

[SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

2008-11-26 Thread Yu Zhao
This patch integrates the IGB driver with the SR-IOV core. It shows how
the SR-IOV API is used to support the capability. Obviously, people do not
need to put much effort into integrating the PF driver with the SR-IOV core.
All the standard SR-IOV work is handled by the SR-IOV core, and the PF driver
only takes care of the device-specific resource allocation and deallocation
once it gets the necessary information (i.e. the number of Virtual Functions)
from the callback function.

---
 drivers/net/igb/igb_main.c |   30 ++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index bc063d4..b8c7dc6 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct 
e1000_hw *, int, u16);
 static int igb_vmm_control(struct igb_adapter *, bool);
 static int igb_set_vf_mac(struct net_device *, int, u8*);
 static void igb_mbox_handler(struct igb_adapter *);
+static int igb_virtual(struct pci_dev *, int);
 #endif
 
 static int igb_suspend(struct pci_dev *, pm_message_t);
@@ -184,6 +185,9 @@ static struct pci_driver igb_driver = {
 #endif
.shutdown = igb_shutdown,
.err_handler = &igb_err_handler,
+#ifdef CONFIG_PCI_IOV
+   .virtual = igb_virtual
+#endif
 };
 
 static int global_quad_port_a; /* global quad port a indication */
@@ -5107,6 +5111,32 @@ void igb_set_mc_list_pools(struct igb_adapter *adapter,
reg_data |= (1 << 25);
wr32(E1000_VMOLR(pool), reg_data);
 }
+
+static int
+igb_virtual(struct pci_dev *pdev, int nr_virtfn)
+{
+   unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
+   struct net_device *netdev = pci_get_drvdata(pdev);
+   struct igb_adapter *adapter = netdev_priv(netdev);
+   int i;
+
+   if (nr_virtfn > 7)
+   return -EINVAL;
+
+   if (nr_virtfn) {
+   for (i = 0; i < nr_virtfn; i++) {
+   printk(KERN_INFO "SR-IOV: VF %d is enabled\n", i);
+   my_mac_addr[5] = (unsigned char)i;
+   igb_set_vf_mac(netdev, i, my_mac_addr);
+   igb_set_vf_vmolr(adapter, i);
+   }
+   } else
+   printk(KERN_INFO "SR-IOV is disabled\n");
+
+   adapter->vfs_allocated_count = nr_virtfn;
+
+   return 0;
+}
 #endif
 
 /* igb_main.c */
-- 
1.5.4.4



[SR-IOV driver example 0/3] introduction

2008-11-26 Thread Yu Zhao
SR-IOV drivers for the Intel 82576 NIC are available. There are two parts
to the drivers: the Physical Function driver and the Virtual Function driver.
The PF driver is based on the IGB driver and is used to control the PF,
allocate hardware-specific resources, and interface with the SR-IOV core.
The VF driver is a new NIC driver that is the same as a traditional PCI
device driver. It works in both the host and the guest (Xen and KVM)
environments.

These two drivers are testing versions and they are *only* intended to
show how to use the SR-IOV API.

Intel 82576 NIC specification can be found at:
http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

[SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
[SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
[SR-IOV driver example 3/3] VF driver tar ball


Re: [patch 0/4] [RFC] Another proportional weight IO controller

2008-11-26 Thread Paolo Valente
Fabio and I are a little bit worried about the fact that the problem
of working in the time domain instead of the service domain is not
being properly dealt with.  Probably we did not express ourselves very
clearly, so we will try to put it in more practical terms.  Using B-WF2Q+
in the time domain instead of using CFQ (Round-Robin) means introducing
higher complexity than CFQ to get almost the same service properties
as CFQ.  With regard to (long term) fairness, B-WF2Q+ in the time domain
has exactly the same (un)fairness problems as CFQ.  As far as bandwidth
differentiation is concerned, it can be obtained with CFQ by just
increasing the time slice (e.g., double weight => double slice).  This
has no impact on long term guarantees and certainly does not decrease
the throughput.

With regard to short term guarantees (request completion time), one of
the properties of the reference ideal system of WF2Q+ is that, assuming
for simplicity that all the queues have the same weight, since the ideal
system serves each queue at the same speed, shorter budgets are completed
in shorter time intervals than longer budgets.  B-WF2Q+ guarantees
O(1) deviation from this ideal service.  Hence, the tight delay/jitter
measured in our experiments with BFQ is a consequence of the simple (and
probably still improvable) budget assignment mechanism of (the overall)
BFQ.  In contrast, if all the budgets are equal, as happens if we use
time slices, the resulting scheduler is exactly a Round-Robin, again
as in CFQ (see [1]).

Finally, with regard to completion time delay differentiation through
weight differentiation, this is probably the only case in which B-WF2Q+
would perform better than CFQ, because, in case of CFQ, reducing the
time slices may reduce the throughput, whereas increasing the time slice
would increase the worst-case delay/jitter.

In the end, BFQ succeeds in guaranteeing fairness (or in general the
desired bandwidth distribution) because it works in the service domain
(and this is probably the only way to achieve this goal), not because
it uses WF2Q+ instead of Round-Robin.  Similarly, it provides tight
delay/jitter only because B-WF2Q+ is used in combination with a simple
budget assignment (differentiation) mechanism (again in the service
domain).

[1] http://feanor.sssup.it/~fabio/linux/bfq/results.php

-- 
---
| Paolo Valente  ||
| Algogroup  ||
| Dip. Ing. Informazione | tel:   +39 059 2056318 |
| Via Vignolese 905/b| fax:   +39 059 2056199 |
| 41100 Modena   ||
| home:  http://algo.ing.unimo.it/people/paolo/   |
---



[PATCH] virtio_net: large tx MTU support

2008-11-26 Thread Mark McLoughlin
We don't really have a max tx packet size limit, so allow configuring
the device with up to 64k tx MTU.

Signed-off-by: Mark McLoughlin <[EMAIL PROTECTED]>
---
 drivers/net/virtio_net.c |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e6b5d6e..71ca29c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -613,6 +613,17 @@ static struct ethtool_ops virtnet_ethtool_ops = {
.set_tso = ethtool_op_set_tso,
 };
 
+#define MIN_MTU 68
+#define MAX_MTU 65535
+
+static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
+{
+   if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
+   return -EINVAL;
+   dev->mtu = new_mtu;
+   return 0;
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int err;
@@ -628,6 +639,7 @@ static int virtnet_probe(struct virtio_device *vdev)
dev->open = virtnet_open;
dev->stop = virtnet_close;
dev->hard_start_xmit = start_xmit;
+   dev->change_mtu = virtnet_change_mtu;
dev->features = NETIF_F_HIGHDMA;
 #ifdef CONFIG_NET_POLL_CONTROLLER
dev->poll_controller = virtnet_netpoll;
-- 
1.6.0.3



Re: Host<->guest channel interface advice needed

2008-11-26 Thread Gleb Natapov
On Wed, Nov 26, 2008 at 04:07:01PM +0300, Evgeniy Polyakov wrote:
> On Wed, Nov 26, 2008 at 02:39:19PM +0200, Gleb Natapov ([EMAIL PROTECTED]) 
> wrote:
> > The interfaces that are being considered are netlink socket (only datagram
> > semantics, linux specific), new socket family or character device with
> > different minor number for each channel. Which one better suits for
> > the purpose?  Is there other kind of interface to consider? New socket
> > family looks like a good choice, but it would be nice to hear other
> > opinions before starting to work on it.
> 
> What about X (or whatever else) protocol running over host-guest network
> device, which are in the kernel already?
> 
I should have mentioned that in my original mail. We don't want to use
the IP stack for communication between host and guest for a variety of
reasons. The user of the VM may interfere with our communication by
misconfiguring the firewall, for instance (and he/she may not even be
aware that the OS is running inside a VM). We also want to be able to
communicate with an agent inside a guest even when the guest's network
is not yet configured.

--
Gleb.


Re: Host<->guest channel interface advice needed

2008-11-26 Thread Arnd Bergmann
On Wednesday 26 November 2008, Gleb Natapov wrote:
> The interfaces that are being considered are netlink socket (only datagram
> semantics, linux specific), new socket family or character device with
> different minor number for each channel. Which one better suits for
> the purpose?  Is there other kind of interface to consider? New socket
> family looks like a good choice, but it would be nice to hear other
> opinions before starting to work on it.

I think a socket and a pty both look reasonable here, but one important
aspect IMHO is that you only need a new kernel driver for the guest, if
you just use the regular pty support or Unix domain sockets in the host.

Obviously, there needs to be some control over permissions, as a guest
must not be able to just open any socket or pty of the host, so a
reasonable approach might be that the guest can only create a socket
or pty that can be opened by the host, but not vice versa. Alternatively,
you create the socket/pty in host userspace and then allow passing that
down into the guest, which creates a virtio device from it.

Arnd <><


Re: [patch 0/4] [RFC] Another proportional weight IO controller

2008-11-26 Thread Ryo Tsuruta
Hi Vivek,

From: Vivek Goyal <[EMAIL PROTECTED]>
Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller
Date: Tue, 25 Nov 2008 11:27:20 -0500

> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> > > > > Ryo, do you still want to stick to two level scheduling? Given the 
> > > > > problem
> > > > > of it breaking down underlying scheduler's assumptions, probably it 
> > > > > makes
> > > > > more sense to the IO control at each individual IO scheduler.
> > > > 
> > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > algorithm into the block I/O layer experimentally.
> > > 
> > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > 2 level scheduling. We will still have the issue of breaking underlying
> > > CFQ and other schedulers. How to plan to resolve that conflict.
> > 
> > I think there is no conflict against I/O schedulers.
> > Could you expain to me about the conflict?
> 
> Because we do the buffering at higher level scheduler and mostly release
> the buffered bios in the FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of IO scheduler to determine in
> what order to release buffered bios.
> 
> For example, If there is one task of io priority 0 in a cgroup and rest of
> the tasks are of io prio 7. All the tasks belong to best effort class. If
> tasks of lower priority (7) do lot of IO, then due to buffering there is
> a chance that IO from lower prio tasks is seen by CFQ first and io from
> higher prio task is not seen by cfq for quite some time hence that task
> not getting it fair share with in the cgroup. Similiar situations can
> arise with RT tasks also.

Thanks for your explanation. 
I think that the same thing occurs without the higher level scheduler,
because all the tasks issuing I/Os are blocked while the underlying
device's request queue is full before those I/Os are sent to the I/O
scheduler.

> > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > may be little above that where one can try some code sharing among IO
> > > schedulers? 
> > 
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > can be made use of for the devices such as loop device.
> > 
> 
> What do you mean by that IO issued to underlying device does not go
> through IO scheduler? loop device will be associated with a file and
> IO will ultimately go to the IO scheduler which is serving those file
> blocks?

How about if the files is on an NFS-mounted file system?

> What's the use case scenario of doing IO control at loop device?
> Ultimately the resource contention will take place on actual underlying
> physical device where the file blocks are. Will doing the resource control
> there not solve the issue for you?

I can't come up with any use case, but I would like to make the
resource controller more flexible. Actually, a certain block device
that I'm using does not use the I/O scheduler.

Thanks,
Ryo Tsuruta


Host<->guest channel interface advice needed

2008-11-26 Thread Gleb Natapov
Hello,

I'd like to ask what would be the best user space interface for a generic
guest<->host communication channel. The channel will be used to pass
mouse events to/from a guest, by management software to communicate
with agents running in a guest, or for something similar.

The interfaces that are being considered are a netlink socket (only datagram
semantics, Linux specific), a new socket family, or a character device with
a different minor number for each channel. Which one better suits the
purpose?  Is there another kind of interface to consider? A new socket
family looks like a good choice, but it would be nice to hear other
opinions before starting to work on it.
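
For reference, the character-device option would look roughly like this on
the guest side (just a sketch; NR_CHANNELS and vmchannel_fops are made up):

#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/init.h>

#define NR_CHANNELS 8				/* arbitrary */

static const struct file_operations vmchannel_fops;	/* read/write/poll */
static dev_t vmchannel_devt;
static struct cdev vmchannel_cdev;

static int __init vmchannel_init(void)
{
	int err;

	/* reserve one minor per channel */
	err = alloc_chrdev_region(&vmchannel_devt, 0, NR_CHANNELS, "vmchannel");
	if (err)
		return err;

	cdev_init(&vmchannel_cdev, &vmchannel_fops);
	err = cdev_add(&vmchannel_cdev, vmchannel_devt, NR_CHANNELS);
	if (err)
		unregister_chrdev_region(vmchannel_devt, NR_CHANNELS);
	return err;
}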

Thanks,

--
Gleb.


Re: [patch 0/4] [RFC] Another proportional weight IO controller

2008-11-26 Thread Fernando Luis Vázquez Cao
On Tue, 2008-11-25 at 11:27 -0500, Vivek Goyal wrote:
> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> > > > > Ryo, do you still want to stick to two level scheduling? Given the 
> > > > > problem
> > > > > of it breaking down underlying scheduler's assumptions, probably it 
> > > > > makes
> > > > > more sense to the IO control at each individual IO scheduler.
> > > > 
> > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > algorithm into the block I/O layer experimentally.
> > > 
> > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > 2 level scheduling. We will still have the issue of breaking underlying
> > > CFQ and other schedulers. How to plan to resolve that conflict.
> > 
> > I think there is no conflict against I/O schedulers.
> > Could you expain to me about the conflict?
> 
> Because we do the buffering at higher level scheduler and mostly release
> the buffered bios in the FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of IO scheduler to determine in
> what order to release buffered bios.

It could be argued that the IO scheduler's primary goal is to maximize
usage of the underlying device according to its physical
characteristics. For hard disks this may imply minimizing time wasted by
seeks; other types of devices, such as SSDs, may impose different
requirements. This is something that clearly belongs in the elevator. On
the other hand, it could be argued that other non-hardware-related
scheduling disciplines would fit better in higher layers.

That said, as you pointed out such separation could impact performance,
so we will probably need to implement a feedback mechanism between the
elevator, which could collect statistics and provide hints, and the
upper layers. The elevator API looks like a good candidate for this,
though new functions might be needed.

> For example, If there is one task of io priority 0 in a cgroup and rest of
> the tasks are of io prio 7. All the tasks belong to best effort class. If
> tasks of lower priority (7) do lot of IO, then due to buffering there is
> a chance that IO from lower prio tasks is seen by CFQ first and io from
> higher prio task is not seen by cfq for quite some time hence that task
> not getting it fair share with in the cgroup. Similiar situations can
> arise with RT tasks also.

Well, this issue is not intrinsic to dm-band and similar solutions. In
the scenario you point out the problem is that the elevator and the IO
controller are not cooperating. The same could happen even if we
implemented everything at the elevator layer (or a little above): get
hierarchical scheduling wrong and you are likely to have a rough ride.

BFQ deals with hierarchical scheduling at just one layer which makes
things easier. BFQ chose the elevator layer, but a similar scheduling
discipline could be implemented higher in the block layer too. The HW
specific-bits we cannot take out the elevator, but when it comes to
task/cgroup based scheduling there are more possibilities, which
includes the middle-way approach we are discussing: two level
scheduling.

The two level model is not bad per se, we just need to get the two
levels to work in unison and for that we will certainly need to make
changes to the existing elevators.

> > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > may be little above that where one can try some code sharing among IO
> > > schedulers? 
> > 
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > can be made use of for the devices such as loop device.
> 
> What do you mean by that IO issued to underlying device does not go
> through IO scheduler? loop device will be associated with a file and
> IO will ultimately go to the IO scheduler which is serving those file
> blocks?

I think that Tsuruta-san's point is that the loop device driver uses its
own make_request_fn which means that bios entering a loop device do not
necessarily go through a IO scheduler after that.

We will always find ourselves in this situation when trying to manage
devices that provide their own make_request_fn, the reason being that
its behavior is driver and configuration dependent: in the loop device
case, whether we go through an IO scheduler or not depends on what has
been attached to it; in stacking device configurations, the effect that
the IO scheduling at one of the devices that constitute the multi-device
will have on the aggregate throughput depends on the topology.

The only way I can think of to address all cases in a sane way is
controlling the entry point to the block layer, which is precisely what
dm-band does.
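
/* Sketch of that idea: hook at the common entry point, before any driver's
 * make_request_fn gets involved (the throttle hook is hypothetical):
 *
 * generic_make_request(bio)
 *	-> io_controller_throttle(bio);	   hypothetical, applies to all devices
 *	-> q->make_request_fn(q, bio);	   loop/dm/md or __make_request()
 */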

The problem with dm-band is that it relies on the dm infrastructure. In
my opinion, if we could remove that dependency it would be a huge step
in the right direction.

> What's the use case scenario of