Re: [PATCH] virtio_net: large tx MTU support
From: Mark McLoughlin <[EMAIL PROTECTED]>
Date: Wed, 26 Nov 2008 13:58:11 +

> We don't really have a max tx packet size limit, so allow configuring
> the device with up to 64k tx MTU.
>
> Signed-off-by: Mark McLoughlin <[EMAIL PROTECTED]>

Rusty, ACK? If so, I'll toss this into net-next-2.6, thanks!

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [patch 0/4] [RFC] Another proportional weight IO controller
> From: Nauman Rafique <[EMAIL PROTECTED]> > Date: Wed, Nov 26, 2008 11:41:46AM -0800 > > On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <[EMAIL PROTECTED]> wrote: > > Fabio and I are a little bit worried about the fact that the problem > > of working in the time domain instead of the service domain is not > > being properly dealt with. Probably we did not express ourselves very > > clearly, so we will try to put in more practical terms. Using B-WF2Q+ > > in the time domain instead of using CFQ (Round-Robin) means introducing > > higher complexity than CFQ to get almost the same service properties > > of CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain > > Are we talking about a case where all the contenders have equal > weights and are continuously backlogged? That seems to be the only > case when B-WF2Q+ would behave like Round-Robin. Am I missing > something here? > It is the case with equal weights, but it is really a common one. > I can see that the only direct advantage of using WF2Q+ scheduling is > reduced jitter or latency in certain cases. But under heavy loads, > that might result in request latencies seen by RT threads to be > reduced from a few seconds to a few msec. > > > has exactly the same (un)fairness problems of CFQ. As far as bandwidth > > differentiation is concerned, it can be obtained with CFQ by just > > increasing the time slice (e.g., double weight => double slice). This > > has no impact on long term guarantees and certainly does not decrease > > the throughput. > > > > With regard to short term guarantees (request completion time), one of > > the properties of the reference ideal system of Wf2Q+ is that, assuming > > for simplicity that all the queues have the same weight, as the ideal > > system serves each queue at the same speed, shorter budgets are completed > > in a shorter time intervals than longer budgets. B-WF2Q+ guarantees > > O(1) deviation from this ideal service. 
Hence, the tight delay/jitter > > measured in our experiments with BFQ is a consequence of the simple (and > > probably still improvable) budget assignment mechanism of (the overall) > > BFQ. In contrast, if all the budgets are equal, as it happens if we use > > time slices, the resulting scheduler is exactly a Round-Robin, again > > as in CFQ (see [1]). > > Can the budget assignment mechanism of BFQ be converted to time slice > assignment mechanism? What I am trying to say here is that we can have > variable time slices, just like we have variable budgets. > Yes, it could be converted, and it would do in the time domain the same differentiation it does now in the service domain. What we would lose in the process is the fairness in the service domain. The service properties/guarantees of the resulting scheduler would _not_ be the same as the BFQ ones. Both long term and short term guarantees would be affected by the unfairness given by the different service rate experienced by the scheduled entities. > > > > Finally, with regard to completion time delay differentiation through > > weight differentiation, this is probably the only case in which B-WF2Q+ > > would perform better than CFQ, because, in case of CFQ, reducing the > > time slices may reduce the throughput, whereas increasing the time slice > > would increase the worst-case delay/jitter. > > > > In the end, BFQ succeeds in guaranteeing fairness (or in general the > > desired bandwidth distribution) because it works in the service domain > > (and this is probably the only way to achieve this goal), not because > > it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight > > delay/jitter only because B-WF2Q+ is used in combination with a simple > > budget assignment (differentiation) mechanism (again in the service > > domain). > > > > [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php > > > > -- > > --- > > | Paolo Valente || > > | Algogroup || > > | Dip. Ing. 
> > Informazione | tel: +39 059 2056318 |
> > | Via Vignolese 905/b    | fax: +39 059 2056199 |
> > | 41100 Modena                                  |
> > | home: http://algo.ing.unimo.it/people/paolo/  |
> > ---
Re: [SR-IOV driver example 0/3] introduction
Yu Zhao wrote:
> SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> of the drivers: Physical Function driver and Virtual Function driver.
> The PF driver is based on the IGB driver and is used to control PF to
> allocate hardware specific resources and interface with the SR-IOV core.
> The VF driver is a new NIC driver that is same as the traditional PCI
> device driver. It works in both the host and the guest (Xen and KVM)
> environment.
>
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.
>
> Intel 82576 NIC specification can be found at:
> http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
>
> [SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
> [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
> [SR-IOV driver example 3/3] VF driver tar ball

Please copy [EMAIL PROTECTED] on all network-related patches. This is
where the network developers live, and all patches on this list are
automatically archived for review and handling at
http://patchwork.ozlabs.org/project/netdev/list/

Jeff
Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
On Wed, Nov 26, 2008 at 11:27:10AM -0800, Nakajima, Jun wrote: > On 11/26/2008 8:58:59 AM, Greg KH wrote: > > On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote: > > > This patch integrates the IGB driver with the SR-IOV core. It shows > > > how the SR-IOV API is used to support the capability. Obviously > > > people does not need to put much effort to integrate the PF driver > > > with SR-IOV core. All SR-IOV standard stuff are handled by SR-IOV > > > core and PF driver once it gets the necessary information (i.e. > > > number of Virtual > > > Functions) from the callback function. > > > > > > --- > > > drivers/net/igb/igb_main.c | 30 ++ > > > 1 files changed, 30 insertions(+), 0 deletions(-) > > > > > > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c > > > index bc063d4..b8c7dc6 100644 > > > --- a/drivers/net/igb/igb_main.c > > > +++ b/drivers/net/igb/igb_main.c > > > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, > > > struct e1000_hw *, int, u16); static int igb_vmm_control(struct > > > igb_adapter *, bool); static int igb_set_vf_mac(struct net_device > > > *, int, u8*); static void igb_mbox_handler(struct igb_adapter *); > > > +static int igb_virtual(struct pci_dev *, int); > > > #endif > > > > > > static int igb_suspend(struct pci_dev *, pm_message_t); @@ -184,6 > > > +185,9 @@ static struct pci_driver igb_driver = { #endif > > > .shutdown = igb_shutdown, > > > .err_handler = &igb_err_handler, > > > +#ifdef CONFIG_PCI_IOV > > > + .virtual = igb_virtual > > > +#endif > > > > #ifdef should not be needed, right? > > > > Good point. I think this is because the driver is expected to build on > older kernels also, That should not be an issue for patches that are being submitted, right? And if this is the case, shouldn't it be called out in the changelog entry? 
> but the problem is that the driver (and probably others) is broken
> unless the kernel is built with CONFIG_PCI_IOV because of the
> following hunk, for example.
>
> However, we don't want to use #ifdef for the (*virtual) field in the
> header. One option would be to define a constant like the following
> along with those changes.
> #define PCI_DEV_IOV
>
> Any better idea?

Just always declare it in your driver, which will be merged _after_
this field gets added to the kernel tree. It's not a big deal, just a
patch-ordering issue. And remember: don't add #ifdefs to drivers; they
should not be needed at all.

thanks,

greg k-h
Re: [patch 0/4] [RFC] Another proportional weight IO controller
On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <[EMAIL PROTECTED]> wrote: > Fabio and I are a little bit worried about the fact that the problem > of working in the time domain instead of the service domain is not > being properly dealt with. Probably we did not express ourselves very > clearly, so we will try to put in more practical terms. Using B-WF2Q+ > in the time domain instead of using CFQ (Round-Robin) means introducing > higher complexity than CFQ to get almost the same service properties > of CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain Are we talking about a case where all the contenders have equal weights and are continuously backlogged? That seems to be the only case when B-WF2Q+ would behave like Round-Robin. Am I missing something here? I can see that the only direct advantage of using WF2Q+ scheduling is reduced jitter or latency in certain cases. But under heavy loads, that might result in request latencies seen by RT threads to be reduced from a few seconds to a few msec. > has exactly the same (un)fairness problems of CFQ. As far as bandwidth > differentiation is concerned, it can be obtained with CFQ by just > increasing the time slice (e.g., double weight => double slice). This > has no impact on long term guarantees and certainly does not decrease > the throughput. > > With regard to short term guarantees (request completion time), one of > the properties of the reference ideal system of Wf2Q+ is that, assuming > for simplicity that all the queues have the same weight, as the ideal > system serves each queue at the same speed, shorter budgets are completed > in a shorter time intervals than longer budgets. B-WF2Q+ guarantees > O(1) deviation from this ideal service. Hence, the tight delay/jitter > measured in our experiments with BFQ is a consequence of the simple (and > probably still improvable) budget assignment mechanism of (the overall) > BFQ. 
> In contrast, if all the budgets are equal, as it happens if we use
> time slices, the resulting scheduler is exactly a Round-Robin, again
> as in CFQ (see [1]).

Can the budget assignment mechanism of BFQ be converted to a time slice
assignment mechanism? What I am trying to say here is that we can have
variable time slices, just like we have variable budgets.

> Finally, with regard to completion time delay differentiation through
> weight differentiation, this is probably the only case in which B-WF2Q+
> would perform better than CFQ, because, in case of CFQ, reducing the
> time slices may reduce the throughput, whereas increasing the time slice
> would increase the worst-case delay/jitter.
>
> In the end, BFQ succeeds in guaranteeing fairness (or in general the
> desired bandwidth distribution) because it works in the service domain
> (and this is probably the only way to achieve this goal), not because
> it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight
> delay/jitter only because B-WF2Q+ is used in combination with a simple
> budget assignment (differentiation) mechanism (again in the service
> domain).
>
> [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
>
> --
> ---
> | Paolo Valente                                 |
> | Algogroup                                     |
> | Dip. Ing. Informazione | tel: +39 059 2056318 |
> | Via Vignolese 905/b    | fax: +39 059 2056199 |
> | 41100 Modena                                  |
> | home: http://algo.ing.unimo.it/people/paolo/  |
> ---
RE: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
On 11/26/2008 8:58:59 AM, Greg KH wrote: > On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote: > > This patch integrates the IGB driver with the SR-IOV core. It shows > > how the SR-IOV API is used to support the capability. Obviously > > people does not need to put much effort to integrate the PF driver > > with SR-IOV core. All SR-IOV standard stuff are handled by SR-IOV > > core and PF driver once it gets the necessary information (i.e. > > number of Virtual > > Functions) from the callback function. > > > > --- > > drivers/net/igb/igb_main.c | 30 ++ > > 1 files changed, 30 insertions(+), 0 deletions(-) > > > > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c > > index bc063d4..b8c7dc6 100644 > > --- a/drivers/net/igb/igb_main.c > > +++ b/drivers/net/igb/igb_main.c > > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, > > struct e1000_hw *, int, u16); static int igb_vmm_control(struct > > igb_adapter *, bool); static int igb_set_vf_mac(struct net_device > > *, int, u8*); static void igb_mbox_handler(struct igb_adapter *); > > +static int igb_virtual(struct pci_dev *, int); > > #endif > > > > static int igb_suspend(struct pci_dev *, pm_message_t); @@ -184,6 > > +185,9 @@ static struct pci_driver igb_driver = { #endif > > .shutdown = igb_shutdown, > > .err_handler = &igb_err_handler, > > +#ifdef CONFIG_PCI_IOV > > + .virtual = igb_virtual > > +#endif > > #ifdef should not be needed, right? > Good point. I think this is because the driver is expected to build on older kernels also, but the problem is that the driver (and probably others) is broken unless the kernel is built with CONFIG_PCI_IOV because of the following hunk, for example. However, we don't want to use #ifdef for the (*virtual) field in the header. One option would be to define a constant like the following along with those changes. #define PCI_DEV_IOV Any better idea? Thanks, . 
Jun Nakajima | Intel Open Source Technology Center

@@ -259,6 +266,7 @@ struct pci_dev {
 	struct list_head msi_list;
 #endif
 	struct pci_vpd *vpd;
+	struct pci_iov *iov;
 };
 
 extern struct pci_dev *alloc_pci_dev(void);
@@ -426,6 +434,7 @@ struct pci_driver {
 	int (*resume_early) (struct pci_dev *dev);
 	int (*resume) (struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown) (struct pci_dev *dev);
+	int (*virtual) (struct pci_dev *dev, int nr_virtfn);
 	struct pm_ext_ops *pm;
 	struct pci_error_handlers *err_handler;
 	struct device_driver driver;
Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
* Greg KH ([EMAIL PROTECTED]) wrote:
> > +static int
> > +igb_virtual(struct pci_dev *pdev, int nr_virtfn)
> > +{
> > +	unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
> > +	struct net_device *netdev = pci_get_drvdata(pdev);
> > +	struct igb_adapter *adapter = netdev_priv(netdev);
> > +	int i;
> > +
> > +	if (nr_virtfn > 7)
> > +		return -EINVAL;
>
> Why the check for 7? Is that the max virtual functions for this card?
> Shouldn't that be a define somewhere so it's easier to fix in future
> versions of this hardware? :)

IIRC it's 8 for the card, 1 reserved for the PF. I think both notions
should be captured w/ commented constants.

thanks,
-chris
Re: [PATCH 1/2] virtio: block: set max_segment_size and max_sectors to infinite.
* Rusty Russell ([EMAIL PROTECTED]) wrote:
> +	/* No real sector limit. */
> +	blk_queue_max_sectors(vblk->disk->queue, -1U);
> +

Is that actually legitimate? I think it'd still work out, but seems odd,
e.g. all the spots that do:

	q->max_hw_sectors << 9

will just toss the upper bits...

thanks,
-chris
Re: [SR-IOV driver example 3/3] VF driver tar ball
On Wed, Nov 26, 2008 at 10:40:43PM +0800, Yu Zhao wrote:
> The attachment is the VF driver for Intel 82576 NIC.

Please don't attach things as tarballs, we can't review or easily read
them at all.

Care to resend it?

thanks,

greg k-h
Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote: > This patch integrates the IGB driver with the SR-IOV core. It shows how > the SR-IOV API is used to support the capability. Obviously people does > not need to put much effort to integrate the PF driver with SR-IOV core. > All SR-IOV standard stuff are handled by SR-IOV core and PF driver only > concerns the device specific resource allocation and deallocation once it > gets the necessary information (i.e. number of Virtual Functions) from > the callback function. > > --- > drivers/net/igb/igb_main.c | 30 ++ > 1 files changed, 30 insertions(+), 0 deletions(-) > > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c > index bc063d4..b8c7dc6 100644 > --- a/drivers/net/igb/igb_main.c > +++ b/drivers/net/igb/igb_main.c > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct > e1000_hw *, int, u16); > static int igb_vmm_control(struct igb_adapter *, bool); > static int igb_set_vf_mac(struct net_device *, int, u8*); > static void igb_mbox_handler(struct igb_adapter *); > +static int igb_virtual(struct pci_dev *, int); > #endif > > static int igb_suspend(struct pci_dev *, pm_message_t); > @@ -184,6 +185,9 @@ static struct pci_driver igb_driver = { > #endif > .shutdown = igb_shutdown, > .err_handler = &igb_err_handler, > +#ifdef CONFIG_PCI_IOV > + .virtual = igb_virtual > +#endif #ifdef should not be needed, right? > }; > > static int global_quad_port_a; /* global quad port a indication */ > @@ -5107,6 +5111,32 @@ void igb_set_mc_list_pools(struct igb_adapter *adapter, > reg_data |= (1 << 25); > wr32(E1000_VMOLR(pool), reg_data); > } > + > +static int > +igb_virtual(struct pci_dev *pdev, int nr_virtfn) > +{ > + unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF}; > + struct net_device *netdev = pci_get_drvdata(pdev); > + struct igb_adapter *adapter = netdev_priv(netdev); > + int i; > + > + if (nr_virtfn > 7) > + return -EINVAL; Why the check for 7? 
Is that the max virtual functions for this card? Shouldn't that be a
define somewhere so it's easier to fix in future versions of this
hardware? :)

> +
> +	if (nr_virtfn) {
> +		for (i = 0; i < nr_virtfn; i++) {
> +			printk(KERN_INFO "SR-IOV: VF %d is enabled\n", i);

Use dev_info() please, that shows the exact pci device and driver that
emitted the message.

> +			my_mac_addr[5] = (unsigned char)i;
> +			igb_set_vf_mac(netdev, i, my_mac_addr);
> +			igb_set_vf_vmolr(adapter, i);
> +		}
> +	} else
> +		printk(KERN_INFO "SR-IOV is disabled\n");

Is that really true? (oh, use dev_info as well.) What happens if you
had called this with "5" and then later with "0"? You never destroyed
those existing virtual functions, yet the code does:

> +	adapter->vfs_allocated_count = nr_virtfn;

which makes the driver think they are not present. What happens when
the driver later goes to shut down? Are those resources freed up
properly?

thanks,

greg k-h
Re: [SR-IOV driver example 0/3] introduction
On Wed, Nov 26, 2008 at 10:03:03PM +0800, Yu Zhao wrote:
> SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> of the drivers: Physical Function driver and Virtual Function driver.
> The PF driver is based on the IGB driver and is used to control PF to
> allocate hardware specific resources and interface with the SR-IOV core.
> The VF driver is a new NIC driver that is same as the traditional PCI
> device driver. It works in both the host and the guest (Xen and KVM)
> environment.
>
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.

That's funny, as some distros are already shipping this driver. You
might want to tell them that this is an "example only" driver and not
to be used "for real"... :(

greg k-h
Re: [patch 0/4] [RFC] Another proportional weight IO controller
On Wed, Nov 26, 2008 at 09:47:07PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > From: Vivek Goyal <[EMAIL PROTECTED]> > Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller > Date: Tue, 25 Nov 2008 11:27:20 -0500 > > > On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote: > > > Hi Vivek, > > > > > > > > > Ryo, do you still want to stick to two level scheduling? Given the > > > > > > problem > > > > > > of it breaking down underlying scheduler's assumptions, probably it > > > > > > makes > > > > > > more sense to the IO control at each individual IO scheduler. > > > > > > > > > > I don't want to stick to it. I'm considering implementing dm-ioband's > > > > > algorithm into the block I/O layer experimentally. > > > > > > > > Thanks Ryo. Implementing a control at block layer sounds like another > > > > 2 level scheduling. We will still have the issue of breaking underlying > > > > CFQ and other schedulers. How to plan to resolve that conflict. > > > > > > I think there is no conflict against I/O schedulers. > > > Could you expain to me about the conflict? > > > > Because we do the buffering at higher level scheduler and mostly release > > the buffered bios in the FIFO order, it might break the underlying IO > > schedulers. Generally it is the decision of IO scheduler to determine in > > what order to release buffered bios. > > > > For example, If there is one task of io priority 0 in a cgroup and rest of > > the tasks are of io prio 7. All the tasks belong to best effort class. If > > tasks of lower priority (7) do lot of IO, then due to buffering there is > > a chance that IO from lower prio tasks is seen by CFQ first and io from > > higher prio task is not seen by cfq for quite some time hence that task > > not getting it fair share with in the cgroup. Similiar situations can > > arise with RT tasks also. > > Thanks for your explanation. 
> I think that the same thing occurs without the higher level scheduler, > because all the tasks issuing I/Os are blocked while the underlying > device's request queue is full before those I/Os are sent to the I/O > scheduler. > True and this issue was pointed out by Divyesh. I think we shall have to fix this by allocating the request descriptors in proportion to their share. One possible way is to make use of elv_may_queue() to determine if we can allocate furhter request descriptors or not. > > > > What do you think about the solution at IO scheduler level (like BFQ) or > > > > may be little above that where one can try some code sharing among IO > > > > schedulers? > > > > > > I would like to support any type of block device even if I/Os issued > > > to the underlying device doesn't go through IO scheduler. Dm-ioband > > > can be made use of for the devices such as loop device. > > > > > > > What do you mean by that IO issued to underlying device does not go > > through IO scheduler? loop device will be associated with a file and > > IO will ultimately go to the IO scheduler which is serving those file > > blocks? > > How about if the files is on an NFS-mounted file system? > Interesting. So on the surface it looks like contention for disk but it is more the contention for network and contention for disk on NFS server. True that leaf node IO control will not help here as IO is not going to leaf node at all. We can make the situation better by doing resource control on network IO though. > > What's the use case scenario of doing IO control at loop device? > > Ultimately the resource contention will take place on actual underlying > > physical device where the file blocks are. Will doing the resource control > > there not solve the issue for you? > > I don't come up with any use case, but I would like to make the > resource controller more flexible. Actually, a certain block device > that I'm using does not use the I/O scheduler. 
Isn't it equivalent to using No-op? If yes, then it should not be an issue?

Thanks
Vivek
[SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
This patch makes the IGB driver allocate hardware resources (rx/tx
queues) for Virtual Functions. All operations in this patch are
hardware specific.

---
 drivers/net/igb/Makefile        |    2 +-
 drivers/net/igb/e1000_82575.c   |    1 +
 drivers/net/igb/e1000_82575.h   |   61
 drivers/net/igb/e1000_defines.h |    7 +
 drivers/net/igb/e1000_hw.h      |    2 +
 drivers/net/igb/e1000_regs.h    |   13 +
 drivers/net/igb/e1000_vf.c      |  223 ++
 drivers/net/igb/igb.h           |   10 +
 drivers/net/igb/igb_main.c      |  604 ++-
 9 files changed, 910 insertions(+), 13 deletions(-)
 create mode 100644 drivers/net/igb/e1000_vf.c

diff --git a/drivers/net/igb/Makefile b/drivers/net/igb/Makefile
index 1927b3f..ab3944c 100644
--- a/drivers/net/igb/Makefile
+++ b/drivers/net/igb/Makefile
@@ -33,5 +33,5 @@ obj-$(CONFIG_IGB) += igb.o
 
 igb-objs := igb_main.o igb_ethtool.o e1000_82575.o \
-	    e1000_mac.o e1000_nvm.o e1000_phy.o
+	    e1000_mac.o e1000_nvm.o e1000_phy.o e1000_vf.o

diff --git a/drivers/net/igb/e1000_82575.c b/drivers/net/igb/e1000_82575.c
index f5e2e72..bb823ac 100644
--- a/drivers/net/igb/e1000_82575.c
+++ b/drivers/net/igb/e1000_82575.c
@@ -87,6 +87,7 @@ static s32 igb_get_invariants_82575(struct e1000_hw *hw)
 	case E1000_DEV_ID_82576:
 	case E1000_DEV_ID_82576_FIBER:
 	case E1000_DEV_ID_82576_SERDES:
+	case E1000_DEV_ID_82576_QUAD_COPPER:
 		mac->type = e1000_82576;
 		break;
 	default:

diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..8c488ab 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -170,4 +170,65 @@ struct e1000_adv_tx_context_desc {
 #define E1000_DCA_TXCTRL_CPUID_SHIFT 24 /* Tx CPUID now in the last byte */
 #define E1000_DCA_RXCTRL_CPUID_SHIFT 24 /* Rx CPUID now in the last byte */
 
+#define MAX_NUM_VFS 8
+
+#define E1000_DTXSWC_VMDQ_LOOPBACK_EN (1 << 31) /* global VF LB enable */
+
+/* Easy defines for setting default pool, would normally be left a zero */
+#define E1000_VT_CTL_DEFAULT_POOL_SHIFT 7
+#define E1000_VT_CTL_DEFAULT_POOL_MASK (0x7 << E1000_VT_CTL_DEFAULT_POOL_SHIFT)
+
+/* Other useful VMD_CTL register defines */
+#define E1000_VT_CTL_DISABLE_DEF_POOL (1 << 29)
+#define E1000_VT_CTL_VM_REPL_EN       (1 << 30)
+
+/* Per VM Offload register setup */
+#define E1000_VMOLR_LPE     0x0001 /* Accept Long packet */
+#define E1000_VMOLR_AUPE    0x0100 /* Accept untagged packets */
+#define E1000_VMOLR_BAM     0x0800 /* Accept Broadcast packets */
+#define E1000_VMOLR_MPME    0x1000 /* Multicast promiscuous mode */
+#define E1000_VMOLR_STRVLAN 0x4000 /* Vlan stripping enable */
+
+#define E1000_P2VMAILBOX_STS 0x0001 /* Initiate message send to VF */
+#define E1000_P2VMAILBOX_ACK 0x0002 /* Ack message recv'd from VF */
+#define E1000_P2VMAILBOX_VFU 0x0004 /* VF owns the mailbox buffer */
+#define E1000_P2VMAILBOX_PFU 0x0008 /* PF owns the mailbox buffer */
+
+#define E1000_VLVF_ARRAY_SIZE     32
+#define E1000_VLVF_VLANID_MASK    0x0FFF
+#define E1000_VLVF_POOLSEL_SHIFT  12
+#define E1000_VLVF_POOLSEL_MASK   (0xFF << E1000_VLVF_POOLSEL_SHIFT)
+#define E1000_VLVF_VLANID_ENABLE  0x8000
+
+#define E1000_VFMAILBOX_SIZE 16 /* 16 32 bit words - 64 bytes */
+
+/* If it's a E1000_VF_* msg then it originates in the VF and is sent to the
+ * PF. The reverse is true if it is E1000_PF_*.
+ * Message ACK's are the value or'd with 0xF000
+ */
+#define E1000_VT_MSGTYPE_ACK  0xF000 /* Messages below or'd with
+                                      * this are the ACK */
+#define E1000_VT_MSGTYPE_NACK 0xFF00 /* Messages below or'd with
+                                      * this are the NACK */
+#define E1000_VT_MSGINFO_SHIFT 16
+/* bits 23:16 are used for extra info for certain messages */
+#define E1000_VT_MSGINFO_MASK (0xFF << E1000_VT_MSGINFO_SHIFT)
+
+#define E1000_VF_MSGTYPE_REQ_MAC 1 /* VF needs to know its MAC */
+#define E1000_VF_MSGTYPE_VFLR    2 /* VF notifies VFLR to PF */
+#define E1000_VF_SET_MULTICAST   3 /* VF requests PF to set MC addr */
+#define E1000_VF_SET_VLAN        4 /* VF requests PF to set VLAN */
+#define E1000_VF_SET_LPE         5 /* VF requests PF to set VMOLR.LPE */
+
+s32 e1000_send_mail_to_vf(struct e1000_hw *hw, u32 *msg,
+			  u32 vf_number, s16 size);
+s32 e1000_receive_mail_from_vf(struct e1000_hw *hw, u32 *msg,
+			       u32 vf_number, s16 size);
+void e1000_vmdq_loopback_enable_vf(struct e1000_hw *hw);
+void e1000_vmdq_loopback_disable_vf(struct e1000_hw *hw);
+void e1000_vmdq_replication_enable_vf(struct e1000_hw *hw, u32 enables);
+void e1000_vmdq_replication_disable_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_ack_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_mail
Re: [patch 0/4] [RFC] Another proportional weight IO controller
On Wed, Nov 26, 2008 at 03:40:18PM +0900, Fernando Luis Vázquez Cao wrote: > On Thu, 2008-11-20 at 08:40 -0500, Vivek Goyal wrote: > > > The dm approach has some merrits, the major one being that it'll fit > > > directly into existing setups that use dm and can be controlled with > > > familiar tools. That is a bonus. The draw back is partially the same - > > > it'll require dm. So it's still not a fit-all approach, unfortunately. > > > > > > So I'd prefer an approach that doesn't force you to use dm. > > > > Hi Jens, > > > > My patches met the goal of not using the dm for every device one wants > > to control. > > > > Having said that, few things come to mind. > > > > - In what cases do we need to control the higher level logical devices > > like dm. It looks like real contention for resources is at leaf nodes. > > Hence any kind of resource management/fair queueing should probably be > > done at leaf nodes and not at higher level logical nodes. > > The problem with stacking devices is that we do not know how the IO > going through the leaf nodes contributes to the aggregate throughput > seen by the application/cgroup that generated it, which is what end > users care about. > If we keep track of cgroup information in bio and don't loose it while bio traverses through the stack of devices, then leaf node can still do the proportional fair share allocation among contending cgroups on that device. I think end users care about getting fair share if there is a contention anywhere along the IO path. Real contention is at leaf nodes. However complex the logical device topology is, if two applications are not contending for disk at lowest level, there is no point in doing any kind of resource management among them. Though the applications seemingly might be contending for higher level logical device, at leaf nodes, their IOs might be going to different disk altogether and practically there is no contention. 
> The block device could be a plain old sata device, a loop device, a > stacking device, a SSD, you name it, but their topologies and the fact > that some of them do not even use an elevator should be transparent to > the user. Are there some devices which don't use elevators at leaf nodes? If no, then its not a issue. > > If you wanted to do resource management at the leaf nodes some kind of > topology information should be passed down to the elevators controlling > the underlying devices, which in turn would need to work cooperatively. > I am not able to understand why some kind of topology information needs to be passed to underlying elevators. As long as end device can map a bio correctly to the right cgroup (irrespective of complex topology) and end device step into resource management only if there is contention for resources among cgroups on that device, things are fine. We don't have to worry about intermediate complex topology. I will take one hypothetical example. Lets assume there are two cgroups A and B with weights 2048 and 1024 respectively. To me this information means that if A, and B really conted for the resources somewhere, then make sure A gets 2/3 of resources and B gets 1/3 of resource. Now if tasks in these two groups happen to contend for same disk at lowest level, we do resource management otherwise we don't. Why do I need to worry about intermediate logical devices in the IO path? May be I am missing something. A detailed example will help here... > > If that makes sense, then probably we don't need to control dm device > > and we don't need such higher level solutions. > > For the reasons stated above the two level scheduling approach seems > cleaner to me. > > > - Any kind of 2 level scheduler solution has the potential to break the > > underlying IO scheduler. Higher level solution requires buffering of > > bios and controlled release of bios to lower layers. 
> > This control breaks
> > the assumptions of the lower layer IO scheduler, which knows in what
> > order bios should be dispatched to the device to meet the semantics
> > exported by the IO scheduler.
>
> Please notice that such an IO controller would only get in the way
> of the elevator in case of contention for the device.

True. So are we saying that a user can get the expected CFQ or AS behavior only if there is no contention? If there is contention, then we don't guarantee anything?

> What is more,
> depending on the workload, it turns out that buffering at higher layers
> on a per-cgroup or per-task basis, like dm-band does, may actually
> increase the aggregate throughput (I think the dm-band team
> observed this behavior too). The reason seems to be that bios buffered
> in such a way tend to be highly correlated and thus very likely to get
> merged when released to the elevator.

The goal here is not to increase throughput by doing buffering at a higher layer. This is what the IO scheduler currently does. It tries to buffer bios and select them appropriately to boost throughput. If on
[SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
This patch integrates the IGB driver with the SR-IOV core. It shows how the SR-IOV API is used to support the capability. People do not need to put much effort into integrating a PF driver with the SR-IOV core: all the standard SR-IOV handling is done by the SR-IOV core, and the PF driver only concerns itself with device specific resource allocation and deallocation once it gets the necessary information (i.e., the number of Virtual Functions) from the callback function.

---
 drivers/net/igb/igb_main.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index bc063d4..b8c7dc6 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
 static int igb_vmm_control(struct igb_adapter *, bool);
 static int igb_set_vf_mac(struct net_device *, int, u8*);
 static void igb_mbox_handler(struct igb_adapter *);
+static int igb_virtual(struct pci_dev *, int);
 #endif
 
 static int igb_suspend(struct pci_dev *, pm_message_t);
@@ -184,6 +185,9 @@ static struct pci_driver igb_driver = {
 #endif
 	.shutdown = igb_shutdown,
 	.err_handler = &igb_err_handler,
+#ifdef CONFIG_PCI_IOV
+	.virtual = igb_virtual
+#endif
 };
 
 static int global_quad_port_a;	/* global quad port a indication */
@@ -5107,6 +5111,32 @@ void igb_set_mc_list_pools(struct igb_adapter *adapter,
 		reg_data |= (1 << 25);
 		wr32(E1000_VMOLR(pool), reg_data);
 	}
 }
+
+static int
+igb_virtual(struct pci_dev *pdev, int nr_virtfn)
+{
+	unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	struct igb_adapter *adapter = netdev_priv(netdev);
+	int i;
+
+	if (nr_virtfn > 7)
+		return -EINVAL;
+
+	if (nr_virtfn) {
+		for (i = 0; i < nr_virtfn; i++) {
+			printk(KERN_INFO "SR-IOV: VF %d is enabled\n", i);
+			my_mac_addr[5] = (unsigned char)i;
+			igb_set_vf_mac(netdev, i, my_mac_addr);
+			igb_set_vf_vmolr(adapter, i);
+		}
+	} else
+		printk(KERN_INFO "SR-IOV is disabled\n");
+
+	adapter->vfs_allocated_count = nr_virtfn;
+
+	return 0;
+}
 #endif /* igb_main.c */
--
1.5.4.4
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
[SR-IOV driver example 0/3] introduction
SR-IOV drivers for the Intel 82576 NIC are available. There are two parts to the drivers: the Physical Function driver and the Virtual Function driver. The PF driver is based on the IGB driver. It is used to control the PF, allocating hardware specific resources and interfacing with the SR-IOV core. The VF driver is a new NIC driver that is the same as a traditional PCI device driver. It works in both host and guest (Xen and KVM) environments. These two drivers are test versions and they are *only* intended to show how to use the SR-IOV API.

The Intel 82576 NIC specification can be found at:
http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

[SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
[SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
[SR-IOV driver example 3/3] VF driver tar ball
Re: [patch 0/4] [RFC] Another proportional weight IO controller
Fabio and I are a little bit worried about the fact that the problem of working in the time domain instead of the service domain is not being properly dealt with. Probably we did not express ourselves very clearly, so we will try to put it in more practical terms. Using B-WF2Q+ in the time domain instead of using CFQ (Round-Robin) means introducing higher complexity than CFQ to get almost the same service properties as CFQ. With regard to (long term) fairness, B-WF2Q+ in the time domain has exactly the same (un)fairness problems as CFQ. As far as bandwidth differentiation is concerned, it can be obtained with CFQ by just increasing the time slice (e.g., double weight => double slice). This has no impact on long term guarantees and certainly does not decrease the throughput.

With regard to short term guarantees (request completion time), one of the properties of the reference ideal system of WF2Q+ is that, assuming for simplicity that all the queues have the same weight, as the ideal system serves each queue at the same speed, shorter budgets are completed in shorter time intervals than longer budgets. B-WF2Q+ guarantees O(1) deviation from this ideal service. Hence, the tight delay/jitter measured in our experiments with BFQ is a consequence of the simple (and probably still improvable) budget assignment mechanism of (the overall) BFQ. In contrast, if all the budgets are equal, as happens if we use time slices, the resulting scheduler is exactly a Round-Robin, again as in CFQ (see [1]).

Finally, with regard to completion time delay differentiation through weight differentiation, this is probably the only case in which B-WF2Q+ would perform better than CFQ, because, in the case of CFQ, reducing the time slices may reduce the throughput, whereas increasing the time slice would increase the worst-case delay/jitter.
In the end, BFQ succeeds in guaranteeing fairness (or in general the desired bandwidth distribution) because it works in the service domain (and this is probably the only way to achieve this goal), not because it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight delay/jitter only because B-WF2Q+ is used in combination with a simple budget assignment (differentiation) mechanism (again in the service domain).

[1] http://feanor.sssup.it/~fabio/linux/bfq/results.php

--
Paolo Valente
Algogroup
Dip. Ing. Informazione | tel: +39 059 2056318
Via Vignolese 905/b    | fax: +39 059 2056199
41100 Modena
home: http://algo.ing.unimo.it/people/paolo/
[PATCH] virtio_net: large tx MTU support
We don't really have a max tx packet size limit, so allow configuring the device with up to 64k tx MTU.

Signed-off-by: Mark McLoughlin <[EMAIL PROTECTED]>
---
 drivers/net/virtio_net.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e6b5d6e..71ca29c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -613,6 +613,17 @@ static struct ethtool_ops virtnet_ethtool_ops = {
 	.set_tso = ethtool_op_set_tso,
 };
 
+#define MIN_MTU 68
+#define MAX_MTU 65535
+
+static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int err;
@@ -628,6 +639,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	dev->open = virtnet_open;
 	dev->stop = virtnet_close;
 	dev->hard_start_xmit = start_xmit;
+	dev->change_mtu = virtnet_change_mtu;
 	dev->features = NETIF_F_HIGHDMA;
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	dev->poll_controller = virtnet_netpoll;
--
1.6.0.3
Re: Host<->guest channel interface advice needed
On Wed, Nov 26, 2008 at 04:07:01PM +0300, Evgeniy Polyakov wrote:
> On Wed, Nov 26, 2008 at 02:39:19PM +0200, Gleb Natapov ([EMAIL PROTECTED]) wrote:
> > The interfaces that are being considered are a netlink socket (only
> > datagram semantics, linux specific), a new socket family, or a character
> > device with a different minor number for each channel. Which one best
> > suits the purpose? Is there another kind of interface to consider? A new
> > socket family looks like a good choice, but it would be nice to hear
> > other opinions before starting to work on it.
>
> What about X (or whatever else) protocol running over a host-guest network
> device, which are in the kernel already?

I should have mentioned that in my original mail. We don't want to use the IP stack for communication between host and guest for a variety of reasons. A user of the VM may interfere with our communication by misconfiguring the firewall, for instance (and he/she may not even be aware that the OS is running inside a VM). We also want to be able to communicate with an agent inside a guest even when the guest's network is not yet configured.

--
Gleb.
Re: Host<->guest channel interface advice needed
On Wednesday 26 November 2008, Gleb Natapov wrote:
> The interfaces that are being considered are a netlink socket (only
> datagram semantics, linux specific), a new socket family, or a character
> device with a different minor number for each channel. Which one best
> suits the purpose? Is there another kind of interface to consider? A new
> socket family looks like a good choice, but it would be nice to hear
> other opinions before starting to work on it.

I think a socket and a pty both look reasonable here, but one important aspect IMHO is that you only need a new kernel driver for the guest if you just use the regular pty support or Unix domain sockets on the host.

Obviously, there needs to be some control over permissions, as a guest must not be able to just open any socket or pty of the host. So a reasonable approach might be that the guest can only create a socket or pty that can be opened by the host, but not vice versa. Alternatively, you create the socket/pty in host userspace and then allow passing it down into the guest, which creates a virtio device from it.

	Arnd <><
Re: [patch 0/4] [RFC] Another proportional weight IO controller
Hi Vivek,

From: Vivek Goyal <[EMAIL PROTECTED]>
Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller
Date: Tue, 25 Nov 2008 11:27:20 -0500

> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > > > > Ryo, do you still want to stick to two level scheduling? Given the
> > > > > problem of it breaking down the underlying scheduler's assumptions,
> > > > > probably it makes more sense to do the IO control at each
> > > > > individual IO scheduler.
> > > >
> > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > algorithm into the block I/O layer experimentally.
> > >
> > > Thanks Ryo. Implementing a control at the block layer sounds like
> > > another 2 level scheduling. We will still have the issue of breaking
> > > the underlying CFQ and other schedulers. How do you plan to resolve
> > > that conflict?
> >
> > I think there is no conflict against I/O schedulers.
> > Could you explain to me about the conflict?
>
> Because we do the buffering at the higher level scheduler and mostly
> release the buffered bios in FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of the IO scheduler to determine
> in what order to release buffered bios.
>
> For example, if there is one task of io priority 0 in a cgroup and the
> rest of the tasks are of io prio 7, all belonging to the best effort
> class: if the tasks of lower priority (7) do a lot of IO, then due to
> buffering there is a chance that IO from the lower prio tasks is seen by
> CFQ first and IO from the higher prio task is not seen by CFQ for quite
> some time, hence that task not getting its fair share within the cgroup.
> Similar situations can arise with RT tasks also.

Thanks for your explanation. I think the same thing occurs without the higher level scheduler, because all the tasks issuing I/Os are blocked while the underlying device's request queue is full, before those I/Os are sent to the I/O scheduler.
> > > What do you think about the solution at the IO scheduler level (like
> > > BFQ), or maybe a little above that, where one can try some code sharing
> > > among IO schedulers?
> >
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device don't go through an IO scheduler. Dm-ioband
> > can be made use of for devices such as the loop device.
>
> What do you mean by IO issued to the underlying device not going through
> an IO scheduler? A loop device will be associated with a file and the IO
> will ultimately go to the IO scheduler which is serving those file
> blocks? How about if the file is on an NFS-mounted file system?
> What's the use case scenario of doing IO control at the loop device?
> Ultimately the resource contention will take place on the actual
> underlying physical device where the file blocks are. Will doing the
> resource control there not solve the issue for you?

I can't come up with a use case right now, but I would like to make the resource controller more flexible. Actually, a certain block device that I'm using does not use the I/O scheduler.

Thanks,
Ryo Tsuruta
Host<->guest channel interface advice needed
Hello,

I'd like to ask what would be the best user space interface for a generic guest<->host communication channel. The channel will be used to pass mouse events to/from a guest, by management software to communicate with agents running in a guest, or for something similar.

The interfaces that are being considered are a netlink socket (only datagram semantics, linux specific), a new socket family, or a character device with a different minor number for each channel. Which one best suits the purpose? Is there another kind of interface to consider? A new socket family looks like a good choice, but it would be nice to hear other opinions before starting to work on it.

Thanks,

--
Gleb.
Re: [patch 0/4] [RFC] Another proportional weight IO controller
On Tue, 2008-11-25 at 11:27 -0500, Vivek Goyal wrote:
> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > > > > Ryo, do you still want to stick to two level scheduling? Given the
> > > > > problem of it breaking down the underlying scheduler's assumptions,
> > > > > probably it makes more sense to do the IO control at each
> > > > > individual IO scheduler.
> > > >
> > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > algorithm into the block I/O layer experimentally.
> > >
> > > Thanks Ryo. Implementing a control at the block layer sounds like
> > > another 2 level scheduling. We will still have the issue of breaking
> > > the underlying CFQ and other schedulers. How do you plan to resolve
> > > that conflict?
> >
> > I think there is no conflict against I/O schedulers.
> > Could you explain to me about the conflict?
>
> Because we do the buffering at the higher level scheduler and mostly
> release the buffered bios in FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of the IO scheduler to determine
> in what order to release buffered bios.

It could be argued that the IO scheduler's primary goal is to maximize usage of the underlying device according to its physical characteristics. For hard disks this may imply minimizing the time wasted by seeks; other types of devices, such as SSDs, may impose different requirements. This is something that clearly belongs in the elevator.

On the other hand, it could be argued that other non-hardware-related scheduling disciplines would fit better in higher layers. That said, as you pointed out, such a separation could impact performance, so we will probably need to implement a feedback mechanism between the elevator, which could collect statistics and provide hints, and the upper layers. The elevator API looks like a good candidate for this, though new functions might be needed.
> For example, if there is one task of io priority 0 in a cgroup and the
> rest of the tasks are of io prio 7, all belonging to the best effort
> class: if the tasks of lower priority (7) do a lot of IO, then due to
> buffering there is a chance that IO from the lower prio tasks is seen by
> CFQ first and IO from the higher prio task is not seen by CFQ for quite
> some time, hence that task not getting its fair share within the cgroup.
> Similar situations can arise with RT tasks also.

Well, this issue is not intrinsic to dm-band and similar solutions. In the scenario you point out the problem is that the elevator and the IO controller are not cooperating. The same could happen even if we implemented everything at the elevator layer (or a little above): get hierarchical scheduling wrong and you are likely to have a rough ride.

BFQ deals with hierarchical scheduling at just one layer, which makes things easier. BFQ chose the elevator layer, but a similar scheduling discipline could be implemented higher in the block layer too. The HW-specific bits we cannot take out of the elevator, but when it comes to task/cgroup based scheduling there are more possibilities, which include the middle-way approach we are discussing: two level scheduling. The two level model is not bad per se; we just need to get the two levels to work in unison, and for that we will certainly need to make changes to the existing elevators.

> > > What do you think about the solution at the IO scheduler level (like
> > > BFQ), or maybe a little above that, where one can try some code sharing
> > > among IO schedulers?
> >
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device don't go through an IO scheduler. Dm-ioband
> > can be made use of for devices such as the loop device.
>
> What do you mean by that IO issued to the underlying device does not go
> through the IO scheduler? A loop device will be associated with a file and
> the IO will ultimately go to the IO scheduler which is serving those file
> blocks?

I think that Tsuruta-san's point is that the loop device driver uses its own make_request_fn, which means that bios entering a loop device do not necessarily go through an IO scheduler after that.

We will always find ourselves in this situation when trying to manage devices that provide their own make_request_fn, the reason being that their behavior is driver and configuration dependent: in the loop device case, whether we go through an IO scheduler or not depends on what has been attached to it; in stacking device configurations, the effect that the IO scheduling at one of the devices that constitute the multi-device will have on the aggregate throughput depends on the topology.

The only way I can think of to address all cases in a sane way is controlling the entry point to the block layer, which is precisely what dm-band does. The problem with dm-band is that it relies on the dm infrastructure. In my opinion, if we could remove that dependency it would be a huge step in the right direction.

> What's the use case scenario of