Re: [RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
* Jan Kiszka  [2020-04-30 14:59:50]:

> >I believe ivshmem2-virtio requires the hypervisor to support PCI device
> >emulation (for life-cycle management of VMs), which our hypervisor may not
> >support. A simple shared-memory and doorbell or message-queue based
> >transport will work for us.
> 
> As written in our private conversation, a mapping of the ivshmem2 device
> discovery to platform mechanism (device tree etc.) and maybe even the
> register access for doorbell and life-cycle management to something
> hypercall-like would be imaginable. What would count more from virtio
> perspective is a common mapping on a shared memory transport.

Yes, that sounds simpler for us.

> That said, I also warned about all the features that PCI already defined
> (such as message-based interrupts) which you may have to add when going a
> different way for the shared memory device.

Is it really required to present this shared memory as belonging to a PCI
device? I would expect the device tree to indicate the presence of this shared
memory region, which we should be able to present to ivshmem2 as the shared
memory region to use (along with some handles for doorbell or message-queue use).

I understand the usefulness of modeling the shared memory as part of a device,
so that the hypervisor can send events related to peers going down or coming up.
In our case, there will be other means to discover those events, and avoiding
this requirement on the hypervisor (to emulate PCI) will simplify the solution
for us.
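As a rough illustration of the device-tree route (every compatible string and property below is made up for this sketch, not a defined binding), the shared region and its doorbell handle could be advertised along these lines:

```dts
/ {
	reserved-memory {
		#address-cells = <2>;
		#size-cells = <2>;

		/* Shared window between the two VMs (addresses invented). */
		vm_shmem: vm-shared@8f000000 {
			reg = <0x0 0x8f000000 0x0 0x100000>;
			no-map;
		};
	};

	/* Hypothetical node tying the region to a doorbell handle. */
	virtio-shmem-transport {
		compatible = "example,virtio-shmem"; /* not a real binding */
		memory-region = <&vm_shmem>;
		doorbell = <1>;	/* hypervisor-assigned doorbell handle */
	};
};
```

The guest would then hand the region and handle to the transport driver instead of probing a PCI device.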

Any idea when we can expect virtio over ivshmem2 to become available?!
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:41:50]:

> On Thu, Apr 30, 2020 at 04:04:46PM +0530, Srivatsa Vaddagiri wrote:
> > If CONFIG_VIRTIO_MMIO_OPS is defined, then I expect this to be
> > unconditionally set to 'magic_qcom_ops', which uses a hypervisor-supported
> > interface for IO (for example: message_queue_send() and
> > message_queue_receive() hypercalls).
> 
> Hmm, but then how would such a kernel work as a guest under all the
> spec-compliant hypervisors out there?

Ok, I see your point. Yes, for better binary compatibility, the ops have to be
set based on runtime detection of hypervisor capabilities.

> > Ok. I guess the other option is to standardize on a new virtio transport
> > (like ivshmem2-virtio)?
> 
> I haven't looked at that, but I suppose it depends on what your hypervisor
> folks are willing to accommodate.

I believe ivshmem2-virtio requires the hypervisor to support PCI device
emulation (for life-cycle management of VMs), which our hypervisor may not
support. A simple shared-memory and doorbell or message-queue based transport
will work for us.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:39:19]:

> Hi Vatsa,
> 
> On Thu, Apr 30, 2020 at 03:59:39PM +0530, Srivatsa Vaddagiri wrote:
> > > What's stopping you from implementing the trapping support in the
> > > hypervisor? Unlike the other patches you sent out, where the guest memory
> > > is not accessible to the host, there doesn't seem to be any advantage to
> > > not having trapping support, or am I missing something here?
> > 
> > I have had this discussion with hypervisor folks. They seem to be
> > concerned about complexity of having a VM's fault be handled in another
> > untrusted VM. They are not keen to add MMIO support.
> 
> Right, but I'm concerned about forking the implementation from the spec
> and I'm not keen to add these hooks ;)
> 
> What does your hook actually do? I'm assuming an HVC? 

Yes, it will issue message-queue-related hypercalls.

> If so, then where the
> fault is handled seems to be unrelated and whether the guest exit is due to
> an HVC or a stage-2 fault should be immaterial. 

A stage-2 fault would normally be considered fatal and result in termination of
the guest. Modifying that behavior to allow resumption in the case of a virtio
config space access, especially with the untrusted VM included in that flow, is
perhaps the concern. HVC calls, on the other hand, are already-vetted interfaces
that the hypervisor can handle without any additional work.

> In other words, I don't
> follow why the trapping mechanism necessitates the way in which the fault is
> handled.

Let me check with our hypervisor folks again. Thanks for your input.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-30 06:07:56]:

> On Thu, Apr 30, 2020 at 03:32:55PM +0530, Srivatsa Vaddagiri wrote:
> > The Type-1 hypervisor we are dealing with does not allow for MMIO transport.
> 
> How about PCI then?

Correct me if I am wrong, but doesn't virtio_pci use the same low-level
primitives as readl/writel on a platform such as ARM64? If so, the same issues
arise there as well.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:14:32]:

> > +#ifdef CONFIG_VIRTIO_MMIO_OPS
> >  
> > +static struct virtio_mmio_ops *mmio_ops;
> > +
> > +#define virtio_readb(a)		mmio_ops->mmio_readb((a))
> > +#define virtio_readw(a)		mmio_ops->mmio_readw((a))
> > +#define virtio_readl(a)		mmio_ops->mmio_readl((a))
> > +#define virtio_writeb(val, a)	mmio_ops->mmio_writeb((val), (a))
> > +#define virtio_writew(val, a)	mmio_ops->mmio_writew((val), (a))
> > +#define virtio_writel(val, a)	mmio_ops->mmio_writel((val), (a))
> 
> How exactly are these ops hooked up? I'm envisaging something like:
> 
>   ops = spec_compliant_ops;
>   [...]
>   if (firmware_says_hypervisor_is_buggy())
>   ops = magic_qcom_ops;
> 
> am I wrong?

If CONFIG_VIRTIO_MMIO_OPS is defined, then I expect this to be unconditionally
set to 'magic_qcom_ops', which uses a hypervisor-supported interface for IO (for
example: message_queue_send() and message_queue_receive() hypercalls).

> > +int register_virtio_mmio_ops(struct virtio_mmio_ops *ops)
> > +{
> > +   pr_info("Registered %s as mmio ops\n", ops->name);
> > +   mmio_ops = ops;
> 
> Not looking good, and really defeats the point of standardising this stuff
> imo.

Ok. I guess the other option is to standardize on a new virtio transport (like
ivshmem2-virtio)?

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:08:22]:

> > This patch is meant to seek comments. If it's considered to be in the right
> > direction, I will work on making it more complete and send the next version!
> 
> What's stopping you from implementing the trapping support in the
> hypervisor? Unlike the other patches you sent out, where the guest memory
> is not accessible to the host, there doesn't seem to be any advantage to
> not having trapping support, or am I missing something here?

Hi Will,
I have had this discussion with the hypervisor folks. They seem to be
concerned about the complexity of having a VM's fault handled in another,
untrusted VM. They are not keen to add MMIO support.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
Some hypervisors may not support the MMIO transport, i.e. trap
config space accesses and have them handled by a backend driver.
They may allow other ways to interact with the backend, such as a
message-queue or doorbell API. This patch allows for
hypervisor-specific methods for config space IO.

Signed-off-by: Srivatsa Vaddagiri 
---
 drivers/virtio/virtio_mmio.c | 131 ++-
 include/linux/virtio.h   |  14 +
 2 files changed, 94 insertions(+), 51 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 97d5725..69bfa35 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -100,7 +100,35 @@ struct virtio_mmio_vq_info {
struct list_head node;
 };
 
+#ifdef CONFIG_VIRTIO_MMIO_OPS
 
+static struct virtio_mmio_ops *mmio_ops;
+
+#define virtio_readb(a)		mmio_ops->mmio_readb((a))
+#define virtio_readw(a)		mmio_ops->mmio_readw((a))
+#define virtio_readl(a)		mmio_ops->mmio_readl((a))
+#define virtio_writeb(val, a)	mmio_ops->mmio_writeb((val), (a))
+#define virtio_writew(val, a)	mmio_ops->mmio_writew((val), (a))
+#define virtio_writel(val, a)	mmio_ops->mmio_writel((val), (a))
+
+int register_virtio_mmio_ops(struct virtio_mmio_ops *ops)
+{
+   pr_info("Registered %s as mmio ops\n", ops->name);
+   mmio_ops = ops;
+
+   return 0;
+}
+
+#else  /* CONFIG_VIRTIO_MMIO_OPS */
+
+#define virtio_readb(a)		readb((a))
+#define virtio_readw(a)		readw((a))
+#define virtio_readl(a)		readl((a))
+#define virtio_writeb(val, a)	writeb((val), (a))
+#define virtio_writew(val, a)	writew((val), (a))
+#define virtio_writel(val, a)	writel((val), (a))
+
+#endif /* CONFIG_VIRTIO_MMIO_OPS */
 
 /* Configuration interface */
 
@@ -109,12 +137,12 @@ static u64 vm_get_features(struct virtio_device *vdev)
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
u64 features;
 
-   writel(1, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
-   features = readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
+   virtio_writel(1, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
+   features = virtio_readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
features <<= 32;
 
-   writel(0, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
-   features |= readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
+   virtio_writel(0, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
+   features |= virtio_readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
 
return features;
 }
@@ -133,12 +161,12 @@ static int vm_finalize_features(struct virtio_device 
*vdev)
return -EINVAL;
}
 
-   writel(1, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
-   writel((u32)(vdev->features >> 32),
+   virtio_writel(1, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
+   virtio_writel((u32)(vdev->features >> 32),
vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES);
 
-   writel(0, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
-   writel((u32)vdev->features,
+   virtio_writel(0, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
+   virtio_writel((u32)vdev->features,
vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES);
 
return 0;
@@ -158,25 +186,25 @@ static void vm_get(struct virtio_device *vdev, unsigned 
offset,
int i;
 
for (i = 0; i < len; i++)
-   ptr[i] = readb(base + offset + i);
+   ptr[i] = virtio_readb(base + offset + i);
return;
}
 
switch (len) {
case 1:
-   b = readb(base + offset);
+   b = virtio_readb(base + offset);
memcpy(buf, &b, sizeof b);
break;
case 2:
-   w = cpu_to_le16(readw(base + offset));
+   w = cpu_to_le16(virtio_readw(base + offset));
memcpy(buf, &w, sizeof w);
break;
case 4:
-   l = cpu_to_le32(readl(base + offset));
+   l = cpu_to_le32(virtio_readl(base + offset));
memcpy(buf, &l, sizeof l);
break;
case 8:
-   l = cpu_to_le32(readl(base + offset));
+   l = cpu_to_le32(virtio_readl(base + offset));
memcpy(buf, &l, sizeof l);
l = cpu_to_le32(ioread32(base + offset + sizeof l));
memcpy(buf + sizeof l, &l, sizeof l);
@@ -200,7 +228,7 @@ static void vm_set(struct virtio_device *vdev, unsigned 
offset,
int i;
 
for (i = 0; i < len; i++)
-   writeb(ptr[i], base + offset + i);
+   virtio_writeb(ptr[i], base + offset + i);
 
   

[RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
The Type-1 hypervisor we are dealing with does not allow for MMIO transport.
[1] summarizes some of the problems we have in making virtio work on such
hypervisors. This patch proposes a solution to the transport problem, viz. how
we can do config space IO on such a hypervisor. The hypervisor-specific methods
introduced allow for seamless config space IO.

This patch is meant to seek comments. If it's considered to be in the right
direction, I will work on making it more complete and send the next version!

1. https://lkml.org/lkml/2020/4/28/427

Srivatsa Vaddagiri (1):
  virtio: Introduce MMIO ops

 drivers/virtio/virtio_mmio.c | 131 ++-
 include/linux/virtio.h   |  14 +
 2 files changed, 94 insertions(+), 51 deletions(-)

-- 
2.7.4

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-29 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-29 06:20:48]:

> On Wed, Apr 29, 2020 at 03:39:53PM +0530, Srivatsa Vaddagiri wrote:
> > That would still not work, I think, where swiotlb is used for pass-through
> > devices (when private memory is fine) as well as virtio devices (when
> > shared memory is required).
> 
> So that is a separate question. When there are multiple untrusted
> devices, at the moment it looks like a single bounce buffer is used.
> 
> Which to me seems like a security problem, I think we should protect
> untrusted devices from each other.

I think as a first step, let me see if we can make the swiotlb driver accept a
target memory segment as its working area. That may suffice for our needs. A
subsequent step could be to make the swiotlb driver recognize multiple pools.


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-29 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-29 05:52:05]:

> > > So it seems that with modern Linux, all one needs
> > > to do on x86 is mark the device as untrusted.
> > > It's already possible to do this with ACPI and with OF - would that be
> > > sufficient for achieving what this patchset is trying to do?
> > 
> > In my case, it's not sufficient to just mark the virtio device untrusted and
> > thus activate the use of swiotlb. All of the secondary VM's memory, including
> > that allocated by the swiotlb driver, is private to it.
> 
> So why not make the bounce buffer memory shared then?

It's a limitation of our hypervisor. When a secondary VM is created, two
memory segments are allocated - one private and the other shared. There is no
provision for the secondary VM to make part of its private memory shared after
it boots. I can perhaps consider a change in the swiotlb driver to accept the
second, shared memory segment as its main working area (rather than allocating
its own).

That would still not work, I think, where swiotlb is used for pass-through
devices (when private memory is fine) as well as virtio devices (when shared
memory is required).

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-29 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-29 02:50:41]:

> So it seems that with modern Linux, all one needs
> to do on x86 is mark the device as untrusted.
> It's already possible to do this with ACPI and with OF - would that be
> sufficient for achieving what this patchset is trying to do?

In my case, it's not sufficient to just mark the virtio device untrusted and
thus activate the use of swiotlb. All of the secondary VM's memory, including
that allocated by the swiotlb driver, is private to it. An additional piece of
memory is available to the secondary VM which is shared between the VMs, and
that is where I need the swiotlb driver to do its work.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
* Stefano Stabellini  [2020-04-28 16:04:34]:

> > > Is swiotlb commonly used for multiple devices that may be on different
> > > trust boundaries (and not behind a hardware iommu)?
> 
> The trust boundary is not a good way of describing the scenario and I
> think it leads to miscommunication.
> 
> A better way to describe the scenario would be that the device can only
> DMA to/from a small reserved-memory region advertised on device tree.
> 
> Do we have other instances of devices that can only DMA to/from very
> specific and non-configurable address ranges? If so, this series could
> follow their example.

AFAICT there is no such notion in the current DMA API.

static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size,
		bool is_ram)
{
	dma_addr_t end = addr + size - 1;

	return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
}

Only the maximum address a device can access is defined, and not a range such
as we seem to need here. I think we need to set the bus_dma_limit to 0 for
virtio devices, which will force the use of the swiotlb_map API. We should also
have a per-device swiotlb pool defined, so that swiotlb can use the pool meant
for the given device.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-28 16:41:04]:

> > Won't we still need some changes to virtio to make use of its own pool (to
> > bounce buffers)? Something similar to its own DMA ops proposed in this 
> > patch?
> 
> If you are doing this for all devices, you need to either find a way
> to do this without chaning DMA ops, or by doing some automatic change
> to all drivers.

Ok, thanks for this input. I will see how we can hide this in the DMA API
itself.

Can you also comment on the virtio transport problem I cited? The hypervisor we
are dealing with does not support the MMIO transport. It supports message-queue
send/recv and also a doorbell, which I think can be used if we can make some
change like this to virtio_mmio.c:

+static inline u32
+virtio_readl(struct virtio_mmio_device *vm_dev, u32 reg_offset)
+{
+	return vm_dev->mmio_ops->readl(vm_dev, reg_offset);
+}
+
+static inline void
+virtio_writel(struct virtio_mmio_device *vm_dev, u32 reg_offset, u32 data)
+{
+	vm_dev->mmio_ops->writel(vm_dev, reg_offset, data);
+}


 /* Check magic value */
-magic = readl(vm_dev->base + VIRTIO_MMIO_MAGIC_VALUE);
+magic = virtio_readl(vm_dev, VIRTIO_MMIO_MAGIC_VALUE);

mmio_ops->readl on most platforms can default to readl itself, while on a
platform like ours it can boil down to message-queue send/recv. Would such a
change be acceptable?

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-28 12:17:57]:

> Okay, but how is all this virtio specific?  For example, why not allow
> separate swiotlbs for any type of device?
> For example, this might make sense if a given device is from a
> different, less trusted vendor.

Is swiotlb commonly used for multiple devices that may be on different trust
boundaries (and not behind a hardware iommu)? If so, then yes, it sounds like a
good application for multiple swiotlb pools.

> All this can then maybe be hidden behind the DMA API.

Won't we still need some changes to virtio to make use of its own pool (to
bounce buffers)? Something similar to the DMA ops proposed in this patch?

> > +void virtio_bounce_set_dma_ops(struct virtio_device *vdev)
> > +{
> > +   if (!bounce_buf_paddr)
> > +   return;
> > +
> > +   set_dma_ops(vdev->dev.parent, &virtio_dma_ops);
> 
> 
> I don't think DMA API maintainers will be happy with new users
> of set_dma_ops.

Is there an alternate API that is preferred?

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[PATCH 3/5] swiotlb: Add alloc and free APIs

2020-04-28 Thread Srivatsa Vaddagiri
Move the memory allocation and free portions of the swiotlb
driver into independent routines. They will be useful for drivers
that need the swiotlb driver to just allocate/free memory chunks
and not additionally bounce memory.

Signed-off-by: Srivatsa Vaddagiri 
---
 include/linux/swiotlb.h |  17 ++
 kernel/dma/swiotlb.c| 151 
 2 files changed, 106 insertions(+), 62 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index c634b4d..957697e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -186,6 +186,10 @@ void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(void);
+extern phys_addr_t swiotlb_alloc(struct swiotlb_pool *pool, size_t alloc_size,
+   unsigned long tbl_dma_addr, unsigned long mask);
+extern void swiotlb_free(struct swiotlb_pool *pool,
+   phys_addr_t tlb_addr, size_t alloc_size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
 
@@ -219,6 +223,19 @@ static inline bool is_swiotlb_active(void)
 {
return false;
 }
+
+static inline phys_addr_t swiotlb_alloc(struct swiotlb_pool *pool,
+   size_t alloc_size, unsigned long tbl_dma_addr,
+   unsigned long mask)
+{
+   return DMA_MAPPING_ERROR;
+}
+
+static inline void swiotlb_free(struct swiotlb_pool *pool,
+   phys_addr_t tlb_addr, size_t alloc_size)
+{
+}
+
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8cf0b57..7411ce5 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -444,37 +444,14 @@ static inline void *tlb_vaddr(struct swiotlb_pool *pool, 
phys_addr_t tlb_addr)
return pool->io_tlb_vstart + (tlb_addr - pool->io_tlb_start);
 }
 
-phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
-  struct device *hwdev,
-  dma_addr_t tbl_dma_addr,
-  phys_addr_t orig_addr,
-  size_t mapping_size,
-  size_t alloc_size,
-  enum dma_data_direction dir,
-  unsigned long attrs)
+phys_addr_t swiotlb_alloc(struct swiotlb_pool *pool, size_t alloc_size,
+   unsigned long tbl_dma_addr, unsigned long mask)
 {
unsigned long flags;
phys_addr_t tlb_addr;
-   unsigned int nslots, stride, index, wrap;
-   int i;
-   unsigned long mask;
+   unsigned int i, nslots, stride, index, wrap;
unsigned long offset_slots;
unsigned long max_slots;
-   unsigned long tmp_io_tlb_used;
-
-   if (pool->no_iotlb_memory)
-   panic("Can not allocate SWIOTLB buffer earlier and can't now 
provide you with the DMA bounce buffer");
-
-   if (mem_encrypt_active())
-   pr_warn_once("Memory encryption is active and system is using 
DMA bounce buffers\n");
-
-   if (mapping_size > alloc_size) {
-   dev_warn_once(hwdev, "Invalid sizes (mapping: %zd bytes, alloc: 
%zd bytes)",
- mapping_size, alloc_size);
-   return (phys_addr_t)DMA_MAPPING_ERROR;
-   }
-
-   mask = dma_get_seg_boundary(hwdev);
 
tbl_dma_addr &= mask;
 
@@ -555,54 +532,23 @@ phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool 
*pool,
} while (index != wrap);
 
 not_found:
-   tmp_io_tlb_used = pool->io_tlb_used;
-
spin_unlock_irqrestore(&pool->io_tlb_lock, flags);
-   if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
-   dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes), total 
%lu (slots), used %lu (slots)\n",
-alloc_size, pool->io_tlb_nslabs, tmp_io_tlb_used);
return (phys_addr_t)DMA_MAPPING_ERROR;
+
 found:
pool->io_tlb_used += nslots;
spin_unlock_irqrestore(&pool->io_tlb_lock, flags);
 
-   /*
-* Save away the mapping from the original address to the DMA address.
-* This is needed when we sync the memory.  Then we sync the buffer if
-* needed.
-*/
-   for (i = 0; i < nslots; i++)
-   pool->io_tlb_orig_addr[index+i] = orig_addr +
-   (i << IO_TLB_SHIFT);
-   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-   (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
-   mapping_size, DMA_TO_DEVICE);
-
return tlb_addr;
 }
 
-/*
- * tlb_addr is the physical address of the bounce buffer to unmap.
- */
-void _swiotlb_tbl_unmap_single(struc

[PATCH 4/5] swiotlb: Add API to register new pool

2020-04-28 Thread Srivatsa Vaddagiri
This patch adds an interface for the swiotlb driver to recognize
a new memory pool. Upon successful initialization of the pool,
swiotlb returns a handle, which needs to be passed as an argument
to any future operations on the pool (map/unmap/alloc/free).

Signed-off-by: Srivatsa Vaddagiri 
---
 include/linux/swiotlb.h |  9 
 kernel/dma/swiotlb.c| 60 +
 2 files changed, 69 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 957697e..97ac82a 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -182,6 +182,9 @@ static inline bool is_swiotlb_buffer(phys_addr_t paddr)
return paddr >= swiotlb_start() && paddr < swiotlb_end();
 }
 
+extern struct swiotlb_pool *swiotlb_register_pool(char *name,
+   phys_addr_t start, void *vstart, size_t size);
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
@@ -236,6 +239,12 @@ static inline void swiotlb_free(struct swiotlb_pool *pool,
 {
 }
 
+static inline struct swiotlb_pool *swiotlb_register_pool(char *name,
+   phys_addr_t start, void *vstart, size_t size)
+{
+   return NULL;
+}
+
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 7411ce5..9883744 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef CONFIG_DEBUG_FS
 #include 
 #endif
@@ -736,6 +737,65 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE;
 }
 
+struct swiotlb_pool *swiotlb_register_pool(char *name, phys_addr_t start,
+   void *vstart, size_t size)
+{
+   struct swiotlb_pool *pool;
+   unsigned long i, bytes;
+   unsigned long nslabs;
+
+   nslabs = size >> IO_TLB_SHIFT;
+   if (!nslabs)
+   return ERR_PTR(-EINVAL);
+
+   pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+   if (!pool)
+   return ERR_PTR(-ENOMEM);
+
+   bytes = nslabs << IO_TLB_SHIFT;
+
+   strncpy(pool->name, name, sizeof(pool->name));
+   spin_lock_init(&pool->io_tlb_lock);
+   pool->late_alloc = 1;
+   pool->io_tlb_start = start;
+   pool->io_tlb_end = start + bytes;
+   pool->io_tlb_vstart = vstart;
+   pool->io_tlb_nslabs = nslabs;
+   pool->max_segment = rounddown(bytes, PAGE_SIZE);
+
+   /*
+* Allocate and initialize the free list array.  This array is used
+* to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
+* between io_tlb_start and io_tlb_end.
+*/
+   pool->io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
+   get_order(pool->io_tlb_nslabs * sizeof(int)));
+   if (!pool->io_tlb_list)
+   goto cleanup;
+
+   pool->io_tlb_orig_addr = (phys_addr_t *)
+   __get_free_pages(GFP_KERNEL,
+get_order(pool->io_tlb_nslabs *
+  sizeof(phys_addr_t)));
+   if (!pool->io_tlb_orig_addr)
+   goto cleanup;
+
+   for (i = 0; i < pool->io_tlb_nslabs; i++) {
+   pool->io_tlb_list[i] = IO_TLB_SEGSIZE -
+   OFFSET(i, IO_TLB_SEGSIZE);
+   pool->io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
+   }
+
+   return pool;
+
+cleanup:
+   kfree(pool->io_tlb_list);
+   kfree(pool->io_tlb_orig_addr);
+   kfree(pool);
+
+   return ERR_PTR(-ENOMEM);
+}
+
 bool is_swiotlb_active(void)
 {
/*
-- 
2.7.4

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
For better security, it's desirable that a guest VM's memory is
not accessible to any entity that executes outside the context of
the guest VM. In the case of virtio, backend drivers execute
outside the context of the guest VM and in general will need
access to the complete guest VM memory.  One option to restrict
the access provided to the backend driver is to make use of a
bounce buffer. The bounce buffer is accessible to both backend
and frontend drivers. All IO buffers that are in the private
space of the guest VM are bounced to be accessible to the
backend.

This patch proposes a new memory pool to be used for this bounce
purpose, rather than the default swiotlb memory pool. That will
avoid any conflicts that may arise in situations where a VM needs
to use the swiotlb pool for driving pass-through devices (in
which case swiotlb memory need not be shared with another VM) as
well as virtio devices (which will require swiotlb memory to be
shared with the backend VM). As a possible extension to this
patch, we can provide an option for virtio to make use of the
default swiotlb memory pool itself, where no such conflicts may
exist in a given deployment.

Signed-off-by: Srivatsa Vaddagiri 
---
 drivers/virtio/Makefile|   2 +-
 drivers/virtio/virtio.c|   2 +
 drivers/virtio/virtio_bounce.c | 150 +
 include/linux/virtio.h |   4 ++
 4 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 drivers/virtio/virtio_bounce.c

diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 29a1386e..3fd3515 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o
+obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o virtio_bounce.o
 obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index a977e32..bc2f779 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -329,6 +329,7 @@ int register_virtio_device(struct virtio_device *dev)
 
dev->index = err;
dev_set_name(&dev->dev, "virtio%u", dev->index);
+   virtio_bounce_set_dma_ops(dev);
 
spin_lock_init(&dev->config_lock);
dev->config_enabled = false;
@@ -431,6 +432,7 @@ EXPORT_SYMBOL_GPL(virtio_device_restore);
 
 static int virtio_init(void)
 {
+   virtio_map_bounce_buffer();
if (bus_register(&virtio_bus) != 0)
panic("virtio bus registration failed");
return 0;
diff --git a/drivers/virtio/virtio_bounce.c b/drivers/virtio/virtio_bounce.c
new file mode 100644
index 000..3de8e0e
--- /dev/null
+++ b/drivers/virtio/virtio_bounce.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtio DMA ops to bounce buffers
+ *
+ *  Copyright (c) 2020, The Linux Foundation. All rights reserved.
+ *
+ * This module allows bouncing of IO buffers to a region which will be
+ * accessible to backend drivers.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static phys_addr_t bounce_buf_paddr;
+static void *bounce_buf_vaddr;
+static size_t bounce_buf_size;
+struct swiotlb_pool *virtio_pool;
+
+#define VIRTIO_MAX_BOUNCE_SIZE (16*4096)
+
+static void *virtio_alloc_coherent(struct device *dev, size_t size,
+   dma_addr_t *dma_handle, gfp_t gfp_flags, unsigned long attrs)
+{
+   phys_addr_t addr;
+
+   if (!virtio_pool)
+   return NULL;
+
+   addr = swiotlb_alloc(virtio_pool, size, bounce_buf_paddr, ULONG_MAX);
+   if (addr == DMA_MAPPING_ERROR)
+   return NULL;
+
+   *dma_handle = (addr - bounce_buf_paddr);
+
+   return bounce_buf_vaddr + (addr - bounce_buf_paddr);
+}
+
+static void virtio_free_coherent(struct device *dev, size_t size, void *vaddr,
+   dma_addr_t dma_handle, unsigned long attrs)
+{
+   phys_addr_t addr = (dma_handle + bounce_buf_paddr);
+
+   swiotlb_free(virtio_pool, addr, size);
+}
+
+static dma_addr_t virtio_map_page(struct device *dev, struct page *page,
+   unsigned long offset, size_t size,
+   enum dma_data_direction dir, unsigned long attrs)
+{
+   void *ptr = page_address(page) + offset;
+   phys_addr_t paddr = virt_to_phys(ptr);
+   dma_addr_t handle;
+
+   if (!virtio_pool)
+   return DMA_MAPPING_ERROR;
+
+   handle = _swiotlb_tbl_map_single(virtio_pool, dev, bounce_buf_paddr,
+paddr, size, size, dir, attrs);
+   if (handle == (phys_addr_t)DMA_MAPPING_ERROR)
+   return DMA_MAPPING_ERROR;
+
+   return handle - bounce_buf_paddr;
+}
+
+static void virtio_unmap_page(struct device *dev, dma_addr_t dev_addr,
+   size_t size, enum dma_data_direction dir, unsigne

[PATCH 1/5] swiotlb: Introduce concept of swiotlb_pool

2020-04-28 Thread Srivatsa Vaddagiri
Currently the swiotlb driver manages a single global pool of memory,
which acts as bounce buffers for memory that is not accessible to some
devices. The core functions provided by this driver to
allocate/free/bounce memory chunks would be more
useful if the driver could manage more than one pool. An immediate
application of such an extension to the swiotlb driver is to bounce
virtio buffers between the private and shared space of a VM.

This patch introduces the concept of a swiotlb memory pool and
reorganizes the current driver to work with a default global pool.
There is no functional change introduced by this patch.
Subsequent patches allow the swiotlb driver to work with more
than one pool of memory.
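The per-pool idea can be illustrated with a small userspace sketch. All names here (`demo_pool`, `demo_alloc`, `demo_free`) are invented for this example and are not the kernel API; the real patch moves swiotlb's global state into `struct swiotlb_pool`, but the principle is the same: once the bookkeeping lives in a structure, several independent pools can coexist.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace sketch of per-pool bookkeeping (illustrative only).
 * Each pool carries its own memory and allocation map, so multiple
 * pools can be managed by the same alloc/free code.
 */
#define DEMO_SLAB_SIZE 64
#define DEMO_MAX_SLABS 16

struct demo_pool {
	const char *name;
	unsigned char mem[DEMO_SLAB_SIZE * DEMO_MAX_SLABS];
	int used[DEMO_MAX_SLABS];	/* per-slab allocation map */
};

static void *demo_alloc(struct demo_pool *p)
{
	for (int i = 0; i < DEMO_MAX_SLABS; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			return p->mem + (size_t)i * DEMO_SLAB_SIZE;
		}
	}
	return NULL;			/* this pool is exhausted */
}

static void demo_free(struct demo_pool *p, void *addr)
{
	size_t off = (size_t)((unsigned char *)addr - p->mem);

	p->used[off / DEMO_SLAB_SIZE] = 0;
}
```

With per-pool state, a second pool (e.g. one backed by VM-shared memory) is just another `struct demo_pool` instance; none of the allocation logic changes.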

Signed-off-by: Srivatsa Vaddagiri 
---
 drivers/xen/swiotlb-xen.c |   4 +-
 include/linux/swiotlb.h   | 129 -
 kernel/dma/swiotlb.c  | 359 +++---
 3 files changed, 307 insertions(+), 185 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index b6d2776..c2dc9c8 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -190,8 +190,8 @@ int __ref xen_swiotlb_init(int verbose, bool early)
/*
 * IO TLB memory already allocated. Just use it.
 */
-   if (io_tlb_start != 0) {
-   xen_io_tlb_start = phys_to_virt(io_tlb_start);
+   if (swiotlb_start()) {
+   xen_io_tlb_start = phys_to_virt(swiotlb_start());
goto end;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 046bb94..8c7843f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -44,7 +44,59 @@ enum dma_sync_target {
SYNC_FOR_DEVICE = 1,
 };
 
-extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
+#define MAX_POOL_NAME_SIZE 16
+
+struct swiotlb_pool {
+   char name[MAX_POOL_NAME_SIZE];
+   bool no_iotlb_memory;
+   int late_alloc;
+
+   spinlock_t io_tlb_lock;
+
+   /*
+* Used to do a quick range check in swiotlb_tbl_unmap_single and
+* swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated
+* by this API.
+*/
+
+   phys_addr_t io_tlb_start, io_tlb_end;
+
+   /*
+* The number of IO TLB blocks (in groups of 64) between io_tlb_start
+* and io_tlb_end.  This is command line adjustable via
+* setup_io_tlb_npages.
+*/
+   unsigned long io_tlb_nslabs;
+
+   /*
+* The number of used IO TLB block
+*/
+   unsigned long io_tlb_used;
+
+   /*
+* This is a free list describing the number of free entries available
+* from each index
+*/
+   unsigned int *io_tlb_list;
+   unsigned int io_tlb_index;
+
+   /*
+* We need to save away the original address corresponding to a mapped
+* entry for the sync operations.
+*/
+   phys_addr_t *io_tlb_orig_addr;
+
+   /*
+* Max segment that we can provide which (if pages are contiguous) will
+* not be bounced (unless SWIOTLB_FORCE is set).
+*/
+   unsigned int max_segment;
+};
+
+extern struct swiotlb_pool default_swiotlb_pool;
+
+extern phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
+ struct device *hwdev,
  dma_addr_t tbl_dma_addr,
  phys_addr_t phys,
  size_t mapping_size,
@@ -52,28 +104,80 @@ extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
  enum dma_data_direction dir,
  unsigned long attrs);
 
-extern void swiotlb_tbl_unmap_single(struct device *hwdev,
+extern void _swiotlb_tbl_unmap_single(struct swiotlb_pool *pool,
+struct device *hwdev,
 phys_addr_t tlb_addr,
 size_t mapping_size,
 size_t alloc_size,
 enum dma_data_direction dir,
 unsigned long attrs);
 
-extern void swiotlb_tbl_sync_single(struct device *hwdev,
+extern void _swiotlb_tbl_sync_single(struct swiotlb_pool *pool,
+   struct device *hwdev,
phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir,
enum dma_sync_target target);
 
-dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
-   size_t size, enum dma_data_direction dir, unsigned long attrs);
+dma_addr_t _swiotlb_map(struct swiotlb_pool *pool, struct device *dev,
+   phys_addr_t phys, size_t size, enum dma_data_direction dir,
+   unsigned long attrs);
+
+static inline phys_addr_t

[PATCH 2/5] swiotlb: Allow for non-linear mapping between paddr and vaddr

2020-04-28 Thread Srivatsa Vaddagiri
Some of the memory pools managed by the swiotlb driver could fall
outside the direct-mapped range, being made accessible via the memremap()
routine instead. To facilitate easy conversion between virtual and
physical addresses of such memory, store the virtual address of the
memory pool in addition to its physical address.
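The conversion this enables can be sketched in userspace (type and function names are invented stand-ins; the patch's own helper is `tlb_vaddr()`): because the pool is contiguous in both address spaces, an address's offset within the pool is the same physically and virtually, so no direct-map assumption like `phys_to_virt()` is needed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_phys_addr_t;	/* stand-in for phys_addr_t */

struct demo_pool {
	demo_phys_addr_t io_tlb_start;	/* physical base of the pool */
	void *io_tlb_vstart;		/* virtual base, e.g. from memremap() */
};

/* Mirrors the patch's tlb_vaddr(): convert by offset within the pool */
static void *demo_tlb_vaddr(struct demo_pool *p, demo_phys_addr_t tlb_addr)
{
	return (char *)p->io_tlb_vstart + (tlb_addr - p->io_tlb_start);
}
```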

Signed-off-by: Srivatsa Vaddagiri 
---
 include/linux/swiotlb.h |  2 ++
 kernel/dma/swiotlb.c| 20 ++--
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8c7843f..c634b4d 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -61,6 +61,8 @@ struct swiotlb_pool {
 
phys_addr_t io_tlb_start, io_tlb_end;
 
+   void *io_tlb_vstart;
+
/*
 * The number of IO TLB blocks (in groups of 64) between io_tlb_start
 * and io_tlb_end.  This is command line adjustable via
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 9c504d3..8cf0b57 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -178,6 +178,7 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
default_swiotlb_pool.io_tlb_start = __pa(tlb);
default_swiotlb_pool.io_tlb_end =
default_swiotlb_pool.io_tlb_start + bytes;
+   default_swiotlb_pool.io_tlb_vstart = tlb;
 
/*
 * Allocate and initialize the free list array.  This array is used
@@ -307,6 +308,7 @@ static void swiotlb_cleanup(void)
default_swiotlb_pool.io_tlb_start = 0;
default_swiotlb_pool.io_tlb_nslabs = 0;
default_swiotlb_pool.max_segment = 0;
+   default_swiotlb_pool.io_tlb_vstart = NULL;
 }
 
 int
@@ -320,6 +322,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
default_swiotlb_pool.io_tlb_start = virt_to_phys(tlb);
default_swiotlb_pool.io_tlb_end =
default_swiotlb_pool.io_tlb_start + bytes;
+   default_swiotlb_pool.io_tlb_vstart = tlb;
 
set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
memset(tlb, 0, bytes);
@@ -400,11 +403,10 @@ void __init swiotlb_exit(void)
 /*
  * Bounce: copy the swiotlb buffer from or back to the original dma location
  */
-static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
+static void swiotlb_bounce(phys_addr_t orig_addr, void *vaddr,
   size_t size, enum dma_data_direction dir)
 {
unsigned long pfn = PFN_DOWN(orig_addr);
-   unsigned char *vaddr = phys_to_virt(tlb_addr);
 
if (PageHighMem(pfn_to_page(pfn))) {
/* The buffer does not have a mapping.  Map it in and copy */
@@ -437,6 +439,11 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
}
 }
 
+static inline void *tlb_vaddr(struct swiotlb_pool *pool, phys_addr_t tlb_addr)
+{
+   return pool->io_tlb_vstart + (tlb_addr - pool->io_tlb_start);
+}
+
 phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
   struct device *hwdev,
   dma_addr_t tbl_dma_addr,
@@ -569,7 +576,7 @@ phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
(i << IO_TLB_SHIFT);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_addr,
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
mapping_size, DMA_TO_DEVICE);
 
return tlb_addr;
@@ -594,7 +601,8 @@ void _swiotlb_tbl_unmap_single(struct swiotlb_pool *pool,
if (orig_addr != INVALID_PHYS_ADDR &&
!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(orig_addr, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
+   mapping_size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list by setting the corresponding
@@ -643,14 +651,14 @@ void _swiotlb_tbl_sync_single(struct swiotlb_pool *pool,
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_addr,
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
   size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_addr,
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),

[PATCH 0/5] virtio on Type-1 hypervisor

2020-04-28 Thread Srivatsa Vaddagiri
We ran into several problems in using virtio for IO paravirtualization on a
Type-1 hypervisor with these characteristics:

* By default, all of a guest's memory is private to it (no other guest can
  access its memory).

* One of the VMs is considered primary and has access to most IO devices. This
  is similar to the dom0 VM in the case of Xen. All other VMs are considered
  secondary VMs.

* virtio-backend drivers for all secondary VMs need to be hosted in primary VM

* Since a secondary VM's memory is not accessible to the primary VM, an
  additional piece of memory is provisioned by the hypervisor and shared
  between the primary and secondary VMs to make the virtio backend driver
  work. This shared memory can be used, for example, to host virtio-ring
  structures and also to bounce IO buffers of the secondary VM.

* Message-queue and doorbell interfaces are available in the hypervisor to
  support inter-VM communication. The message-queue API (send/recv) allows one
  VM to send short messages to another VM. The doorbell interface allows a VM
  to inject an interrupt into another VM.

* No support for MMIO transport, i.e. the hypervisor does not support trapping
  MMIO config space accesses by the front-end driver and having them handled in
  backend drivers.
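As a rough illustration of hiding such hypervisor primitives behind an ops table (in the spirit of the 'mmio_ops' idea discussed in this thread), here is a userspace sketch with a loopback backend. Every name and the single-message queue model are invented for this example; the real hypercall ABI is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical transport ops a hypervisor-specific backend could fill in */
struct demo_transport_ops {
	int (*msgq_send)(const void *buf, size_t len);
	int (*msgq_recv)(void *buf, size_t len);
	void (*ring_doorbell)(void);
};

/* Loopback "backend": one pending message and a doorbell counter */
static unsigned char demo_msg[64];
static size_t demo_msg_len;
static int demo_doorbells;

static int demo_send(const void *buf, size_t len)
{
	if (len > sizeof(demo_msg))
		return -1;		/* message too large for the queue */
	memcpy(demo_msg, buf, len);
	demo_msg_len = len;
	return 0;
}

static int demo_recv(void *buf, size_t len)
{
	if (len < demo_msg_len)
		return -1;		/* caller's buffer too small */
	memcpy(buf, demo_msg, demo_msg_len);
	return (int)demo_msg_len;
}

static void demo_ring(void)
{
	demo_doorbells++;		/* would inject an interrupt into the peer VM */
}

static const struct demo_transport_ops demo_ops = {
	.msgq_send	= demo_send,
	.msgq_recv	= demo_recv,
	.ring_doorbell	= demo_ring,
};
```

A front-end written against such an ops table would not care whether the backend is MMIO traps, hypercalls, or this loopback stub.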

A few problem statements arising out of this:

1) How can we make use of the shared memory region effectively to make virtio
work in this scenario?

What is proposed in the patch series for this problem is a virtio bounce driver
that recognizes a memory region shared between VMs and makes use of
swiotlb driver interfaces to bounce IO buffers between private and shared space.
Some changes are proposed to the swiotlb driver in this regard that let us
reuse swiotlb functions to work with the shared memory pool. The swiotlb driver
can then recognize more than one pool of memory and extend its functions
(allocate/free/bounce memory chunks) to each pool.
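The bounce step itself reduces to two copies, shown here as a userspace model with invented names (the real code goes through `_swiotlb_tbl_map_single()` and `swiotlb_bounce()`): guest-private data is copied into the shared region before the backend touches it, and copied back for device-to-guest transfers.

```c
#include <stddef.h>
#include <string.h>

static unsigned char demo_shared[4096];	/* stands in for the VM-shared pool */

/* "map" for a to-device transfer: private -> shared, return a handle */
static size_t demo_bounce_to_shared(const void *priv, size_t len)
{
	memcpy(demo_shared, priv, len);
	return 0;			/* offset into the shared region */
}

/* "unmap" for a from-device transfer: shared -> private */
static void demo_bounce_from_shared(void *priv, size_t off, size_t len)
{
	memcpy(priv, demo_shared + off, len);
}
```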

2) What transport mechanism works best in this case? 

I realize that the ivshmem2-virtio proposal has discussed the possibility of
using shared memory + doorbell as a virtio transport option. We can consider
using that as a transport in case it will be acceptable upstream. Another option
we considered was to modify virtio_mmio.c itself to use the message-queue
send/recv hypercall interface (in place of readl/writel). That could be
abstracted via an 'mmio_ops' structure providing a suitable implementation of
readl/writel. A third option, suggested by Christoffer Dall, is to unmap the
config space from the kernel address space and, as part of the fault handler,
use hypervisor-specific APIs to achieve config space IO.

3) Which virtio backend drivers to leverage?

We realized there are multiple implementations of virtio backend drivers,
bundled as part of separate projects (QEMU, lkvm, etc.). We think it would be
nice to have some consolidation in that regard, so that we can make use of
backend drivers that are not tightly coupled with a VMM. In our case, we need to
be able to run virtio backend drivers as standalone programs (not coupled with
any VMM).


Srivatsa Vaddagiri (5):
  swiotlb: Introduce concept of swiotlb_pool
  swiotlb: Allow for non-linear mapping between paddr and vaddr
  swiotlb: Add alloc and free APIs
  swiotlb: Add API to register new pool
  virtio: Add bounce DMA ops

 drivers/virtio/Makefile|   2 +-
 drivers/virtio/virtio.c|   2 +
 drivers/virtio/virtio_bounce.c | 150 +++
 drivers/xen/swiotlb-xen.c  |   4 +-
 include/linux/swiotlb.h| 157 +++-
 include/linux/virtio.h |   4 +
 kernel/dma/swiotlb.c   | 556 -
 7 files changed, 638 insertions(+), 237 deletions(-)
 create mode 100644 drivers/virtio/virtio_bounce.c

-- 
2.7.4

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[tip:sched/core] sched/fair: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-22 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  8bf46a39be910937d4c9e8d999a7438a7ae1a75b
Gitweb: http://git.kernel.org/tip/8bf46a39be910937d4c9e8d999a7438a7ae1a75b
Author: Srivatsa Vaddagiri 
AuthorDate: Fri, 16 Sep 2016 18:28:51 -0700
Committer:  Ingo Molnar 
CommitDate: Thu, 22 Sep 2016 15:20:18 +0200

sched/fair: Fix SCHED_HRTICK bug leading to late preemption of tasks

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime).

It makes use of a per-CPU hrtimer resource and hence arming that
hrtimer should be based on total SCHED_FAIR tasks a CPU has across its
various cfs_rqs, rather than being based on number of tasks in a
particular cfs_rq (as implemented currently).

As a result, with the current code, it's possible for a running task (which
is the sole task in its cfs_rq) to be preempted much after its
ideal_runtime has elapsed, resulting in increased latency for tasks in
other cfs_rqs on the same CPU.

Fix this by arming sched hrtimer based on total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.
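The condition change can be illustrated with a tiny userspace model (the types below are invented for this sketch): with two task groups each contributing one runnable task, every per-cfs_rq count is 1, so the old check never arms the hrtimer, while a check on the CPU-wide total does.

```c
/*
 * Illustrative model only. demo_rq stands in for struct rq; its
 * h_nr_running field plays the role of rq->cfs.h_nr_running, the
 * CPU-wide count of runnable SCHED_FAIR tasks.
 */
struct demo_rq {
	int h_nr_running;	/* total fair tasks on this CPU */
};

/* old condition: based only on the task's own cfs_rq */
static int old_should_arm(int cfs_rq_nr_running)
{
	return cfs_rq_nr_running > 1;
}

/* fixed condition: based on the CPU-wide total */
static int new_should_arm(const struct demo_rq *rq)
{
	return rq->h_nr_running > 1;
}
```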

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Joonwoo Park 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1474075731-11550-1-git-send-email-joonw...@codeaurora.org
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 986c10c..8fb4d19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4469,7 +4469,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
WARN_ON(task_rq(p) != rq);
 
-   if (cfs_rq->nr_running > 1) {
+   if (rq->cfs.h_nr_running > 1) {
u64 slice = sched_slice(cfs_rq, se);
u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
s64 delta = slice - ran;


[tip:x86/spinlocks] kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-14 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  92b75202e5e8790905f9441ccaea2456cc4621a5
Gitweb: http://git.kernel.org/tip/92b75202e5e8790905f9441ccaea2456cc4621a5
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 14:55:41 +0530
Committer:  Ingo Molnar 
CommitDate: Wed, 14 Aug 2013 13:12:35 +0200

kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus, a paravirtualized KVM guest detects whether the hypervisor
has the required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If
so, support for pv-ticketlocks is registered via pv_lock_ops.

Use the KVM_HC_KICK_CPU hypercall to wake up a waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri 
Link: http://lkml.kernel.org/r/20130810193849.ga25...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose 
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..b8ef630 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spi

[tip:x86/spinlocks] kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-12 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  f9021f7fd9c8c8101c90b901053f99bfd0288021
Gitweb: http://git.kernel.org/tip/f9021f7fd9c8c8101c90b901053f99bfd0288021
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 14:55:41 +0530
Committer:  H. Peter Anvin 
CommitDate: Mon, 12 Aug 2013 09:03:57 -0700

kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus, a paravirtualized KVM guest detects whether the hypervisor
has the required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If
so, support for pv-ticketlocks is registered via pv_lock_ops.

Use the KVM_HC_KICK_CPU hypercall to wake up a waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri 
Link: http://lkml.kernel.org/r/20130810193849.ga25...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose 
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..d442471 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spi

[tip:x86/spinlocks] kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-10 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  23f659a237e8f633f9605fdf9408a8d130ab72c9
Gitweb: http://git.kernel.org/tip/23f659a237e8f633f9605fdf9408a8d130ab72c9
Author: Srivatsa Vaddagiri 
AuthorDate: Fri, 9 Aug 2013 19:52:02 +0530
Committer:  H. Peter Anvin 
CommitDate: Fri, 9 Aug 2013 07:54:24 -0700

kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus, a paravirtualized KVM guest detects whether the hypervisor
has the required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If
so, support for pv-ticketlocks is registered via pv_lock_ops.

Use the KVM_HC_KICK_CPU hypercall to wake up a waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri 
Link: http://lkml.kernel.org/r/1376058122-8248-15-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose 
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..9b33a27 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug

[tip:x86/spinlocks] kvm guest: Add configuration support to enable debug information for KVM Guests

2013-08-10 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  1e20eb8557cdabf76473b09572be8aa8a2bb9bc0
Gitweb: http://git.kernel.org/tip/1e20eb8557cdabf76473b09572be8aa8a2bb9bc0
Author: Srivatsa Vaddagiri 
AuthorDate: Fri, 9 Aug 2013 19:52:01 +0530
Committer:  H. Peter Anvin 
CommitDate: Fri, 9 Aug 2013 07:54:18 -0700

kvm guest: Add configuration support to enable debug information for KVM Guests

Signed-off-by: Srivatsa Vaddagiri 
Link: http://lkml.kernel.org/r/1376058122-8248-14-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/Kconfig | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 112e712..b1fb846 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -657,6 +657,15 @@ config KVM_GUEST
  underlying device model, the host provides the guest with
  timing infrastructure such as time of day, and system time
 
+config KVM_DEBUG_FS
+   bool "Enable debug information for KVM Guests in debugfs"
+   depends on KVM_GUEST && DEBUG_FS
+   default n
+   ---help---
+ This option enables collection of various statistics for KVM guest.
+ Statistics are displayed in debugfs filesystem. Enabling this option
+ may incur significant overhead.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT_TIME_ACCOUNTING
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/spinlocks] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-08 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  b5eaeb3303fc3086f1d04deea48b5dfcfc4344c0
Gitweb: http://git.kernel.org/tip/b5eaeb3303fc3086f1d04deea48b5dfcfc4344c0
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 17:15:21 +0530
Committer:  H. Peter Anvin 
CommitDate: Thu, 8 Aug 2013 16:07:34 -0700

kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus, a paravirtualized KVM guest detects whether the hypervisor
has the required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If
so, support for pv-ticketlocks is registered via pv_lock_ops.

Use the KVM_HC_KICK_CPU hypercall to wake up a waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri 
Link: http://lkml.kernel.org/r/20130806114521.20643.29839.sendpatch...@codeblue.in.ibm.com
Signed-off-by: Suzuki Poulose 
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..9b33a27 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug;
+}
+
+static int 

[tip:x86/spinlocks] kvm guest : Add configuration support to enable debug information for KVM Guests

2013-08-08 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  20a89c88f7d2458e12f66d7850cf17deec7daa1c
Gitweb: http://git.kernel.org/tip/20a89c88f7d2458e12f66d7850cf17deec7daa1c
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 17:15:01 +0530
Committer:  H. Peter Anvin 
CommitDate: Thu, 8 Aug 2013 16:07:30 -0700

kvm guest : Add configuration support to enable debug information for KVM Guests

Signed-off-by: Srivatsa Vaddagiri 
Link: http://lkml.kernel.org/r/20130806114501.20643.5734.sendpatch...@codeblue.in.ibm.com
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/Kconfig | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 112e712..b1fb846 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -657,6 +657,15 @@ config KVM_GUEST
  underlying device model, the host provides the guest with
  timing infrastructure such as time of day, and system time
 
+config KVM_DEBUG_FS
+   bool "Enable debug information for KVM Guests in debugfs"
+   depends on KVM_GUEST && DEBUG_FS
+   default n
+   ---help---
+ This option enables collection of various statistics for KVM guest.
+ Statistics are displayed in debugfs filesystem. Enabling this option
+ may incur significant overhead.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT_TIME_ACCOUNTING
--


Re: [PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state

2013-01-07 Thread Srivatsa Vaddagiri
* Russell King - ARM Linux  [2013-01-05 10:36:27]:

> On Thu, Jan 03, 2013 at 06:58:38PM -0800, Srivatsa Vaddagiri wrote:
> > I also think that the
> > wait_for_completion() based wait in ARM's __cpu_die() can be replaced with a
> > busy-loop based one, as the wait there in general should be terminated 
> > within
> > few cycles.
> 
> Why open-code this stuff when we have infrastructure already in the kernel
> for waiting for stuff to happen?  I chose to use the standard infrastructure
> because its better tested, and avoids having to think about whether we need
> CPU barriers and such like to ensure that updates are seen in a timely
> manner.

I was primarily thinking of calling as few generic functions as possible on
a dead cpu. I recall several "am I running on a dead cpu?" checks
(cpu_is_offline(this_cpu)) that were put in generic routines during early
versions of cpu hotplug [1] to educate code running on a dead cpu, the need for
which went away though with the introduction of the atomic/stop-machine variant.
The need to add a RCU_NONIDLE() wrapper around ARM's cpu_die() [2] is perhaps a
more recent example of educating code running on a dead cpu. The more quickly we
die after the idle thread of the dying cpu gains control, the better!

1. http://lwn.net/Articles/69040/
2. http://lists.infradead.org/pipermail/linux-arm-kernel/2012-July/107971.html

- vatsa
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation
--


Re: [PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).

2013-01-04 Thread Srivatsa Vaddagiri
* Sergei Shtylyov  [2013-01-04 16:13:42]:

> >With offline cpus no longer beeing seen in nohz mode (ts->idle_active=0), we
> >don't need the check for cpu_online() introduced in commit 7386cdbf. Offline
> 
>Please also specify the summary of that commit in parens (or
> however you like).

I had that in Subject line, but yes would be good to include in commit message
as well. I will incorporate that change alongwith anything else required in
next version of this patch.

- vatsa
--


[PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).

2013-01-03 Thread Srivatsa Vaddagiri
With offline cpus no longer being seen in nohz mode (ts->idle_active=0), we
don't need the check for cpu_online() introduced in commit 7386cdbf. An offline
cpu's idle time as last recorded in its ts->idle_sleeptime will be reported
(thus excluding its offline time from idle time statistics).

Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri 
---
 fs/proc/stat.c |   14 --
 1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..64c3b31 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,13 +45,10 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-   u64 idle, idle_time = -1ULL;
-
-   if (cpu_online(cpu))
-   idle_time = get_cpu_idle_time_us(cpu, NULL);
+   u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
 
if (idle_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.idle */
+   /* !NO_HZ so we can rely on cpustat.idle */
idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
else
idle = usecs_to_cputime64(idle_time);
@@ -61,13 +58,10 @@ static u64 get_idle_time(int cpu)
 
 static u64 get_iowait_time(int cpu)
 {
-   u64 iowait, iowait_time = -1ULL;
-
-   if (cpu_online(cpu))
-   iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+   u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
 
if (iowait_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
+   /* !NO_HZ so we can rely on cpustat.iowait */
iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
else
iowait = usecs_to_cputime64(iowait_time);
-- 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state

2013-01-03 Thread Srivatsa Vaddagiri
Modify the idle loop of the arm, mips, s390, sh and x86 architectures to exit
from nohz state before dying upon hot-remove. This change is needed to prevent
userspace tools like the top command from seeing a rollback in total idle time
over some sampling periods.

Additionally, modify the idle loop on all architectures supporting cpu hotplug
to have the idle thread of a dying cpu die immediately after the scheduler
returns control to it. There is no point in wasting time via calls to
*_enter()/*_exit() before noticing the need to die and dying.

Additional ARM specific change:
Revert commit ff081e05 ("ARM: 7457/1: smp: Fix suspicious
RCU originating from cpu_die()"), which added a RCU_NONIDLE() wrapper
around call to complete(). That wrapper is no longer needed as cpu_die() is
now called outside of a rcu_idle_enter()/exit() section. I also think that the
wait_for_completion() based wait in ARM's __cpu_die() can be replaced with a
busy-loop based one, as the wait there in general should be terminated within
a few cycles.

Cc: Russell King 
Cc: Paul E. McKenney 
Cc: Stephen Boyd 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mike Frysinger 
Cc: uclinux-dist-de...@blackfin.uclinux.org
Cc: Ralf Baechle 
Cc: linux-m...@linux-mips.org
Cc: Benjamin Herrenschmidt 
Cc: linuxppc-...@lists.ozlabs.org
Cc: Martin Schwidefsky 
Cc: linux-s...@vger.kernel.org
Cc: Paul Mundt 
Cc: linux...@vger.kernel.org
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri 
---
 arch/arm/kernel/process.c  |9 -
 arch/arm/kernel/smp.c  |2 +-
 arch/blackfin/kernel/process.c |8 
 arch/mips/kernel/process.c |6 +++---
 arch/powerpc/kernel/idle.c |2 +-
 arch/s390/kernel/process.c |4 ++--
 arch/sh/kernel/idle.c  |5 ++---
 arch/sparc/kernel/process_64.c |3 ++-
 arch/x86/kernel/process.c  |5 ++---
 9 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index c6dec5f..254099b 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -191,11 +191,6 @@ void cpu_idle(void)
rcu_idle_enter();
ledtrig_cpu(CPU_LED_IDLE_START);
while (!need_resched()) {
-#ifdef CONFIG_HOTPLUG_CPU
-   if (cpu_is_offline(smp_processor_id()))
-   cpu_die();
-#endif
-
/*
 * We need to disable interrupts here
 * to ensure we don't miss a wakeup call.
@@ -224,6 +219,10 @@ void cpu_idle(void)
rcu_idle_exit();
tick_nohz_idle_exit();
schedule_preempt_disabled();
+#ifdef CONFIG_HOTPLUG_CPU
+   if (cpu_is_offline(smp_processor_id()))
+   cpu_die();
+#endif
}
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 84f4cbf..a8e3b8a 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -251,7 +251,7 @@ void __ref cpu_die(void)
mb();
 
/* Tell __cpu_die() that this CPU is now safe to dispose of */
-   RCU_NONIDLE(complete(&cpu_died));
+   complete(&cpu_died);
 
/*
 * actual CPU shutdown procedure is at least platform (if not
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 3e16ad9..2bee1af 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -83,10 +83,6 @@ void cpu_idle(void)
while (1) {
void (*idle)(void) = pm_idle;
 
-#ifdef CONFIG_HOTPLUG_CPU
-   if (cpu_is_offline(smp_processor_id()))
-   cpu_die();
-#endif
if (!idle)
idle = default_idle;
tick_nohz_idle_enter();
@@ -98,6 +94,10 @@ void cpu_idle(void)
preempt_enable_no_resched();
schedule();
preempt_disable();
+#ifdef CONFIG_HOTPLUG_CPU
+   if (cpu_is_offline(smp_processor_id()))
+   cpu_die();
+#endif
}
 }
 
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index a11c6f9..41102a0 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -71,13 +71,13 @@ void __noreturn cpu_idle(void)
start_critical_timings();
}
}
+   rcu_idle_exit();
+   tick_nohz_idle_exit();
+   schedule_preempt_disabled();
 #ifdef CONFIG_HOTPLUG_CPU
if (!cpu_online(cpu) && !cpu_isset(cpu, cpu_callin_map))
play_dead();
 #endif
-   rcu_idle_exit();
-   tick_nohz_idle_exit();
-   schedule_preempt_disabled();

[PATCH 0/2] cpuhotplug/nohz: Fix issue of "negative" idle time

2013-01-03 Thread Srivatsa Vaddagiri
On most architectures (arm, mips, s390, sh and x86) the idle thread of a cpu
does not cleanly exit nohz state before dying upon hot-remove. As a result, an
offline cpu is seen to be in nohz mode (ts->idle_active = 1) and its offline
time can potentially be included in the total idle time reported via /proc/stat.
When the same cpu later comes online, its offline time however is not included
in its idle time statistics, causing applications like top to observe a
rollback in total idle time.

Example output from Android top command highlighting this issue is below:

User 232%, System 70%, IOW 46%, IRQ 1%
User 1322 + Nice 0 + Sys 399 + Idle -1423 + IOW 264 + IRQ 0 + SIRQ 7 = 569

top is reporting the system to be idle for -1423 ticks over some sampling
period. This happens because the total idle time reported in the cpu line of
/proc/stat *dropped* from the last value observed (cached) by top.

While this was originally seen on a ARM platform running 3.4 based kernel, I
could easily recreate it on my x86 desktop running latest tip/master kernel
(HEAD 3a7bfcad). Online/offline a cpu in a tight loop and in another loop read
/proc/stat and observe if total idle time drops from previously read value.

Although commit 7386cdbf (nohz: Fix idle ticks in cpu summary line of
/proc/stat) aims to avoid this bug, it is not preemption-proof. A
thread could get preempted after the cpu_online() check in get_idle_time(), thus
potentially leading to get_cpu_idle_time_us() being invoked on an offline cpu.

One potential fix is to serialize hotplug with /proc/stat read operation (via
use of get/put_online_cpus()), which I disliked in favor of the other
solution proposed in this series.

In this patch series:

- Patch 1/2 modifies the idle loop on the arm, mips, s390, sh and x86
  architectures to exit nohz state before the associated idle thread dies upon
  hot-remove. This fixes the idle time accounting bug.

  Patch 1/2 also modifies idle loop on all architectures supporting cpu hotplug
  to have idle thread of a dying cpu die immediately after schedule() returns
  control to it. I see no point in wasting time via calls to *_enter()/*_exit()
  before noticing the need to die and dying.

- Patch 2/2 reverts commit 7386cdbf (nohz: Fix idle ticks in cpu summary line of
  /proc/stat). The cpu_online() check introduced by it is no longer necessary
  with Patch 1/2 applied. Having fewer code sites worry about online status of
  cpus is a good thing!

---

 arch/arm/kernel/process.c  |9 -
 arch/arm/kernel/smp.c  |2 +-
 arch/blackfin/kernel/process.c |8 
 arch/mips/kernel/process.c |6 +++---
 arch/powerpc/kernel/idle.c |2 +-
 arch/s390/kernel/process.c |4 ++--
 arch/sh/kernel/idle.c  |5 ++---
 arch/sparc/kernel/process_64.c |3 ++-
 arch/x86/kernel/process.c  |5 ++---
 fs/proc/stat.c |   14 --
 10 files changed, 25 insertions(+), 33 deletions(-)
-- 

--


[tip:sched/core] sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task

2012-07-24 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  88b8dac0a14c511ff41486b83a8c3d688936eec0
Gitweb: http://git.kernel.org/tip/88b8dac0a14c511ff41486b83a8c3d688936eec0
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 19 Jun 2012 17:43:15 +0530
Committer:  Ingo Molnar 
CommitDate: Tue, 24 Jul 2012 13:58:06 +0200

sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task

Current load balance scheme requires only one cpu in a
sched_group (balance_cpu) to look at other peer sched_groups for
imbalance and pull tasks towards itself from a busy cpu. Tasks
thus pulled by balance_cpu could later get picked up by cpus
that are in the same sched_group as that of balance_cpu.

This scheme however fails to pull tasks that are not allowed to
run on balance_cpu (but are allowed to run on other cpus in its
sched_group). That can affect fairness and in some worst case
scenarios cause starvation.

Consider a two core (2 threads/core) system running tasks as
below:

  Core0Core1
 / \  / \
C0 C1C2 C3
|  | |  |
v  v v  v
F0 T1F1 [idle]
 T2

 F0 = SCHED_FIFO task (pinned to C0)
 F1 = SCHED_FIFO task (pinned to C2)
 T1 = SCHED_OTHER task (pinned to C1)
 T2 = SCHED_OTHER task (pinned to C1 and C2)

F1 could become a cpu hog, which will starve T2 unless C1 pulls
it. Between C0 and C1 however, C0 is required to look for
imbalance between cores, which will fail to pull T2 towards
Core0. T2 will starve eternally in this case. The same scenario
can arise in presence of non-rt tasks as well (say we replace F1
with high irq load).

We tackle this problem by having balance_cpu move pinned tasks
to one of its sibling cpus (where they can run). We first check
if load balance goal can be met by ignoring pinned tasks,
failing which we retry move_tasks() with a new env->dst_cpu.

This patch modifies load balance semantics on who can move load
towards a given cpu in a given sched_domain.

Before this patch, a given_cpu or a ilb_cpu acting on behalf of
an idle given_cpu is responsible for moving load to given_cpu.

With this patch applied, balance_cpu can in addition decide on
moving some load to a given_cpu.

There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
*independently* and at *same* time to move some load to a
given_cpu). However we should see less of such conflicting
decisions in practice and moreover subsequent load balance
cycles should correct the excess load moved to given_cpu.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Prashanth Nageshappa 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/4fe06cdb.2060...@linux.vnet.ibm.com
[ minor edits ]
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c |   78 --
 1 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9f9aa0..22321db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3054,6 +3054,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED 0x01
 #define LBF_NEED_BREAK 0x02
+#define LBF_SOME_PINNED 0x04
 
 struct lb_env {
struct sched_domain *sd;
@@ -3064,6 +3065,8 @@ struct lb_env {
int dst_cpu;
struct rq   *dst_rq;
 
+   struct cpumask  *dst_grpmask;
+   int new_dst_cpu;
enum cpu_idle_type  idle;
longimbalance;
unsigned intflags;
@@ -3131,9 +3134,31 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 * 3) are cache-hot on their current CPU.
 */
if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+   int new_dst_cpu;
+
schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
+   /*
+* Remember if this task can be migrated to any other cpu in
+* our sched_group. We may want to revisit it if we couldn't
+* meet load balance goals by pulling other tasks on src_cpu.
+*
+* Also avoid computing new_dst_cpu if we have already computed
+* one in current iteration.
+*/
+   if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+   return 0;
+
+   new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+   tsk_cpus_allowed(p));
+   if (new_dst_cpu < nr_cpu_ids) {
+   env->flags |= LBF_SOME_PINNED;
+   env->new_dst_cpu = new_dst_cpu;
+   }
return 0;
}
+
+   /* Record that we found at least one task that could run

Re: [PATCH] sched: revert load_balance_monitor()

2008-02-25 Thread Srivatsa Vaddagiri
On Mon, Feb 25, 2008 at 04:28:02PM +0100, Peter Zijlstra wrote:
> Vatsa, would it make sense to take just that out, or just do a full
> revert?

Peter,
6b2d7700266b9402e12824e11e0099ae6a4a6a79 and
58e2d4ca581167c2a079f4ee02be2f0bc52e8729 are closely related. The
latter changes how cpu load is calculated so that the former works
properly.

I would suggest we do a full revert for now, until the latency issues are sorted
out.

-- 
Regards,
vatsa
--


Re: 2.6.24-git4+ regression

2008-02-18 Thread Srivatsa Vaddagiri
On Mon, Feb 18, 2008 at 08:38:24AM +0100, Mike Galbraith wrote:
> Here, it does not.  It seems fine without CONFIG_FAIR_GROUP_SCHED.

My hunch is it's because of the vruntime-driven preemption, which shoots
up latencies (and perhaps the fact that Peter hasn't focused on the SMP
case yet!).

Curiously, do you observe the same results when in UP mode (maxcpus=1)? 

> Oddity:  mainline git with Srivatsa's test patch improves markedly, and
> using sched_latency_ns and sched_wakeup_granularity_ns, I can tweak the

Lukas,
Does tweaking these make any difference for you?

# echo 1000 > /proc/sys/kernel/sched_latency_ns
# echo 1000 > /proc/sys/kernel/sched_wakeup_granularity_ns

[or some other lower value]

> regression into submission.  With sched-devel, I cannot tweak it away
> with or without the test patch.  Dunno how useful that info is.

FWIW, the test patch I had sent earlier didn't address the needs of UP, as
Peter pointed out to me. In that direction, I did more experimentation with the
patch below, which seemed to improve UP latencies as well. Note that I
don't particularly like the first hunk below; perhaps it needs to be
surrounded by an if (something).

---
 kernel/sched_fair.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -523,8 +523,6 @@ place_entity(struct cfs_rq *cfs_rq, stru
if (sched_feat(NEW_FAIR_SLEEPERS))
vruntime -= sysctl_sched_latency;
 
-   /* ensure we never gain time by being placed backwards. */
-   vruntime = max_vruntime(se->vruntime, vruntime);
}
 
se->vruntime = vruntime;
@@ -816,6 +814,13 @@ hrtick_start_fair(struct rq *rq, struct 
 }
 #endif
 
+static inline void dequeue_stack(struct sched_entity *se)
+{
+   for_each_sched_entity(se)
+   if (se->on_rq)
+   dequeue_entity(cfs_rq_of(se), se, 0);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -828,6 +833,9 @@ static void enqueue_task_fair(struct rq 
*topse = NULL;  /* Highest schedulable entity */
int incload = 1;
 
+   if (wakeup)
+   dequeue_stack(se);
+
for_each_sched_entity(se) {
topse = se;
if (se->on_rq) {



P.S : Sorry about slow responses, since I am now in a different project :(

-- 
Regards,
vatsa
--


Re: 2.6.24-git4+ regression

2008-02-14 Thread Srivatsa Vaddagiri
On Wed, Jan 30, 2008 at 02:56:09PM +0100, Lukas Hejtmanek wrote:
> Hello,
> 
> I noticed short thread in LKM regarding "sched: add vslice" causes horrible
> interactivity under load.
> 
> I can see similar behavior. If I stress both CPU cores, even typing on
> keyboard suffers from huge latencies, I can see letters appearing with delay
> (typing into xterm). No swap is used at all, having 1GB free RAM.
> 
> I noticed this bad behavior with 2.6.24-git[46], 2.6.24-rc8-git was OK.

Hi Lukas,
Can you check if the patch below helps improve interactivity for you?

The patch is against 2.6.25-rc1. I would request you to check what
difference it makes with CONFIG_FAIR_GROUP_SCHED and
CONFIG_FAIR_USER_SCHED turned on.

---
 kernel/sched.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -7431,8 +7431,8 @@
 
local_load = tg->cfs_rq[i]->load.weight;
local_shares = (local_load * total_shares) / total_load;
-   if (!local_shares)
-   local_shares = MIN_GROUP_SHARES;
+   if (!local_load)
+   local_shares = tg->shares;
if (local_shares == tg->se[i]->load.weight)
continue;
 
@@ -7710,7 +7710,7 @@
struct rq *rq = cfs_rq->rq;
int on_rq;
 
-   if (!shares)
+   if (shares < MIN_GROUP_SHARES)
shares = MIN_GROUP_SHARES;
 
on_rq = se->on_rq;

-- 
Regards,
vatsa
--


Re: Regression in latest sched-git

2008-02-13 Thread Srivatsa Vaddagiri
On Wed, Feb 13, 2008 at 10:04:44PM +0530, Dhaval Giani wrote:
> I know I am missing something, but aren't we trying to reduce latencies
> here?

I guess Peter is referring to the latency in seeing fairness results. In
other words, with the single-rq approach, the groups may require more time
to converge on fairness.

-- 
Regards,
vatsa
--


Re: 2.6.25-rc1: volanoMark 45% regression

2008-02-13 Thread Srivatsa Vaddagiri
On Wed, Feb 13, 2008 at 03:15:16PM +0530, Balbir Singh wrote:
> Zhang, Yanmin wrote:
> > volanoMark has 45% regression with kernel 2.6.25-rc1 on my both 8-core
> > stoakley and 16-core Tigerton.
> > 
> > I used bisect to locate below patch.
> > 
> > commit 58e2d4ca581167c2a079f4ee02be2f0bc52e8729
> > Author: Srivatsa Vaddagiri <[EMAIL PROTECTED]>
> > Date:   Fri Jan 25 21:08:00 2008 +0100
> > 
> > sched: group scheduling, change how cpu load is calculated
> > 
> > 
> > 
> > hackbench has about 30% regression on 16-core tigerton, but has about 10% 
> > improvement
> > on 8-core stoakley.
> > 
> > In addition, tbench has about 6% regression on my 8-core stoakley and
> > 25% regression on 16-core stoakley. Some other benchmarks, like netperf/aim7
> > also have some regression. I will verify if they are all related to the
> > patch.
> > 
> > -yanmin
> 
> Hi, Yanmin,
> 
> Thanks for reporting the issue. Any chance we could get the Oprofile output for
> the run? The exact command line and .config being used would also help.

Yanmin,
I would also like to know against which previous version this
regression is being compared. Is it 2.6.24? Did you have
CONFIG_FAIR_USER_SCHED enabled in both cases? It would also help to know if you
see the same regression with FAIR_GROUP_SCHED turned off.

-- 
Regards,
vatsa
--


Re: Regression in latest sched-git

2008-02-12 Thread Srivatsa Vaddagiri
On Tue, Feb 12, 2008 at 08:40:08PM +0100, Peter Zijlstra wrote:
> Yes, latency isolation is the one thing I had to sacrifice in order to
> get the normal latencies under control.

Hi Peter,
I don't have an easy solution in mind either that meets both fairness
and latency goals in an acceptable way.

But I am puzzled at the max latency numbers you have provided below:

> The problem with the old code is that under light load: a kernel make
> -j2 as root, under an otherwise idle X session, generates latencies up
> to 120ms on my UP laptop. (uid grouping; two active users: peter, root).

If it was just two active users, then max latency should be:

latency to schedule the user entity (~10ms?) +
latency to schedule the task within that user

20-30 ms seems a more reasonable max latency to expect in this scenario.
120ms seems abnormal, unless the user had a large number of tasks.

Along the same lines, I can't understand how we can be seeing 700ms latency
(below) unless we had a large number of active groups/users and a large number
of tasks within each group/user.

> Others have reported latencies up to 300ms, and Ingo found a 700ms
> latency on his machine.
> 
> The source for this problem is I think the vruntime driven wakeup
> preemption (but I'm not quite sure). The other things that rely on
> global vruntime are sleeper fairness and yield. Now while I can't
> possibly care less about yield, the loss of sleeper fairness is somewhat
> sad (NB. turning it off with the old group scheduling does improve life
> somewhat).
> 
> So my first attempt at getting a global vruntime was flattening the
> whole RQ structure, you can see that patch in sched.git (I really ought
> to have posted that, will do so tomorrow).

We will do some exhaustive testing with this approach. My main concern
with this is that it may compromise the level of isolation between two
groups (imagine one group does a fork-bomb and how it would affect
fairness for other groups).

> With the experience gained from doing that, I think it might be possible
> to construct a hierarchical RQ model that has synced vruntime; but
> thinking about that still makes my head hurt.
> 
> Anyway, yes, its not ideal, but it does the more common case of light
> load much better - I basically had to tell people to disable
> CONFIG_FAIR_GROUP_SCHED in order to use their computer, which is sad,
> because its the default and we want it to be the default in the cgroup
> future.
> 
> So yes, I share your concern, lets work on this together.

-- 
Regards,
vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Default child of a cgroup

2008-02-01 Thread Srivatsa Vaddagiri
On Thu, Jan 31, 2008 at 06:39:56PM -0800, Paul Menage wrote:
> On Jan 30, 2008 6:40 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> >
> > Here are some questions that arise in this picture:
> >
> > 1. What is the relationship of the task-group in A/tasks with the
> >task-group in A/a1/tasks? In otherwords do they form siblings
> >of the same parent A?
> 
> I'd argue the same as Balbir - tasks in A/tasks are are children of A
> and are siblings of a1, a2, etc.

> > 2. Somewhat related to the above question, how much resource should the
> >task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent
> >A's share or 1/(1 + N) of parent A's share (where N = number of tasks
> >in A/tasks)?
> 
> Each process in A should have a scheduler weight that's derived from
> its static_prio field. Similarly each subgroup of A will have a
> scheduler weight that's determined by its cpu.shares value. So the cpu
> share of any child (be it a task or a subgroup) would be equal to its
> own weight divided by the sum of weights of all children.

Assuming all tasks are of the same prio, then what you are saying is that
A/a1/tasks should cumulatively receive 1/(1 + N) of the parent's share.

After some thought, that seems like a reasonable expectation. The only issue
I have with that is that it breaks current behavior in mainline. Assume this
structure:

   /
   |-- tasks
   |-- cpu.shares
   |-- ...
   |
   |---[A]
   |     |-- tasks
   |     |-- cpu.shares
   |     |-- ...

then, going by the above argument, /A/tasks should receive 1/(1+M) of system
resources (M -> number of tasks in /tasks), whereas it receives 1/2 of
system resources currently (assuming /cpu.shares and /A/cpu.shares are the
same).
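To make the two interpretations concrete, here is a rough Python sketch of the weight arithmetic (the NICE0_WEIGHT value and the flat "weight / sum of sibling weights" model are simplifications of what CFS actually does):

```python
NICE0_WEIGHT = 1024  # default weight of a nice-0 task (groups get the same by default)

def shares(children):
    """CPU share of each child = its own weight / sum of all sibling weights."""
    total = sum(children.values())
    return {name: w / total for name, w in children.items()}

# Interpretation 1 (current mainline behavior): /A competes only against the
# root group, so it gets 1/2 regardless of how many tasks sit in /tasks.
print(shares({"/A": NICE0_WEIGHT, "/": NICE0_WEIGHT}))
# {'/A': 0.5, '/': 0.5}

# Interpretation 2 (tasks as siblings): M = 3 tasks in /tasks each compete
# individually with /A, which then receives 1/(1 + M) = 1/4.
m_tasks = {f"task{i}": NICE0_WEIGHT for i in range(3)}
print(shares({"/A": NICE0_WEIGHT, **m_tasks}))
# {'/A': 0.25, 'task0': 0.25, 'task1': 0.25, 'task2': 0.25}
```

The deviation the mail describes is exactly the gap between those two printed values for /A.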

Balbir, is this behaviour same for memory controller as well?

So whichever option we pick, we are talking of deviating from current
behavior, which perhaps is a non-issue if we want to DTRT.

> So yes, if a task in A forks lots of children, those children could
> end up getting a disproportionate amount of the CPU compared to tasks
> in A/a1 - but that's the same as the situation without cgroups. If you
> want to control cpu usage between different sets of processes in A,
> they should be in sibling cgroups, not directly in A.
> 
> Is there a restriction in CFS that stops a given group from
> simultaneously holding tasks and sub-groups? If so, couldn't we change
> CFS to make it possible rather than enforcing awkward restructions on
> cgroups?

Should be possible, need to look closely at what will need to change
(load_balance routines for sure).

-- 
Regards,
vatsa


[RFC] Default child of a cgroup

2008-01-30 Thread Srivatsa Vaddagiri
Hi,
As we were implementing multiple-hierarchy support for CPU
controller, we hit some oddities in its implementation, partly related
to current cgroups implementation. Peter and I have been debating on the 
exact solution and I thought of bringing that discussion to lkml.

Consider the cgroup filesystem structure for managing cpu resource.

# mount -t cgroup -ocpu,cpuacct none /cgroup
# mkdir /cgroup/A
# mkdir /cgroup/B
# mkdir /cgroup/A/a1

will result in:

/cgroup
   |-- tasks
   |-- cpu.shares
   |-- cpuacct.usage
   |
   |---[A]
   |     |-- tasks
   |     |-- cpu.shares
   |     |-- cpuacct.usage
   |     |
   |     |---[a1]
   |           |-- tasks
   |           |-- cpu.shares
   |           |-- cpuacct.usage
   |
   |---[B]
         |-- tasks
         |-- cpu.shares
         |-- cpuacct.usage


Here are some questions that arise in this picture:

1. What is the relationship of the task-group in A/tasks with the
   task-group in A/a1/tasks? In other words, do they form siblings
   of the same parent A?

2. Somewhat related to the above question, how much resource should the 
   task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent
   A's share or 1/(1 + N) of parent A's share (where N = number of tasks
   in A/tasks)?

3. What should A/cpuacct.usage reflect? CPU usage of A/tasks? Or CPU usage
   of all siblings put together? It can reflect only one, in which case the
   user has to manually derive the other component of the statistics.
 
It seems to me that tasks in A/tasks form what can be called the
"default" child group of A, in which case:

4. Should modifications to A/cpu.shares affect the parent or its default
   child group (A/tasks)?

To avoid these ambiguities, it may be good if cgroups created this
"default child group" automatically whenever a cgroup is created.
Something like below (note the absence of the tasks file in some
directories now):


/cgroup
   |-- cpu.shares
   |-- cpuacct.usage
   |
   |---[def_child]
   |     |-- tasks
   |     |-- cpu.shares
   |     |-- cpuacct.usage
   |
   |---[A]
   |     |-- cpu.shares
   |     |-- cpuacct.usage
   |     |
   |     |---[def_child]
   |     |     |-- tasks
   |     |     |-- cpu.shares
   |     |     |-- cpuacct.usage
   |     |
   |     |---[a1]
   |           |-- cpu.shares
   |           |-- cpuacct.usage
   |           |
   |           |---[def_child]
   |                 |-- tasks
   |                 |-- cpu.shares
   |                 |-- cpuacct.usage
   |
   |---[B]
         |-- cpu.shares
         |-- cpuacct.usage
         |
         |---[def_child]
               |-- tasks
               |-- cpu.shares
               |-- cpuacct.usage

Note that the user cannot create subdirectories under def_child with this
scheme! I am also not sure what impact this will have on other resources
like cpusets.

Thoughts?


-- 
Regards,
vatsa


Re: High wake up latencies with FAIR_USER_SCHED

2008-01-29 Thread Srivatsa Vaddagiri
On Tue, Jan 29, 2008 at 04:53:56PM +0100, Guillaume Chazarain wrote:
> I just thought about something to restore low latencies with
> FAIR_GROUP_SCHED, but it's possibly utter nonsense, so bear with me
> ;-) The idea would be to reverse the trees upside down. The scheduler
> would only see tasks (on the leaves) so could apply its interactivity
> magic, but the hierarchical groups would be used to compute dynamic
> loads for each task according to their position in the tree:

I think this is equivalent to flattening the hierarchy? We discussed this a bit
some time back [1], but one of its weaknesses is in providing strong
partitioning between groups when it comes to ensuring fairness. Ex: imagine a
group which does a fork-bomb. With the flattened tree, it affects other groups
more than it would with a 1-level-deep hierarchy.
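A toy model of that isolation concern (assuming two equal-weight groups with nice-0 tasks; note that a correct flattened implementation would rescale each task's weight by its group's share, which is precisely the hard part the flattening approach has to get right):

```python
NICE0 = 1024  # weight of a nice-0 task

def hierarchical(victim_tasks, bomb_tasks):
    """Two-level CFS: groups A and B split the CPU first (1/2 each);
    the lone victim task in A keeps its group's whole slice."""
    return 0.5 / victim_tasks

def flat_unscaled(victim_tasks, bomb_tasks):
    """Naively flattened: every task sits on one runqueue with its raw
    weight, so the fork-bomb dilutes everyone else's share."""
    total = (victim_tasks + bomb_tasks) * NICE0
    return victim_tasks * NICE0 / total

for n in (1, 10, 1000):
    # victim's CPU share as group B grows from 1 to 1000 tasks
    print(n, hierarchical(1, n), flat_unscaled(1, n))
```

With the hierarchy, the victim's share stays at 0.5 no matter how many tasks the other group forks; naively flattened, it collapses to 1/(1+n).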

Having said that, I would be interested to hear other solutions that maintain
this good partitioning between groups and still provide good interactivity!

1. http://lkml.org/lkml/2007/5/30/300

> - now:
>   - we schedule each level of the tree starting from the root
> 
> - with my proposition:
>   - we schedule tasks like with !FAIR_GROUP_SCHED, but
> calc_delta_fair() would traverse the tree starting from the leaves to
> compute the dynamic load.

-- 
Regards,
vatsa


Re: scheduler scalability - cgroups, cpusets and load-balancing

2008-01-29 Thread Srivatsa Vaddagiri
On Tue, Jan 29, 2008 at 11:57:22AM +0100, Peter Zijlstra wrote:
> On Tue, 2008-01-29 at 10:53 +0100, Peter Zijlstra wrote:
> 
> > My thoughts were to make stronger use of disjoint cpu-sets. cgroups and
> > cpusets are related, in that cpusets provide a property to a cgroup.
> > However, load_balance_monitor()'s interaction with sched domains
> > confuses me - it might DTRT, but I can't tell.
> > 
> > [ It looks to me it balances a group over the largest SD the current cpu
> >   has access to, even though that might be larger than the SD associated
> >   with the cpuset of that particular cgroup. ]
> 
> Hmm, with a bit more thought I think that does indeed DTRT. Because, if
> the cpu belongs to a disjoint cpuset, the highest sd (with
> load-balancing enabled) would be that. Right?

Hi Peter,
Yes, I had this in mind when I wrote the load_balance_monitor()
function - to only balance across cpus that form a disjoint cpuset in the
system.

> [ Just a bit of a shame we have all cgroups represented on each cpu. ]

After reading your explanation in the other mail about what you mean here,
I agree. Your suggestion to remove/add cfs_rq from/to the leaf_cfs_rq_list
upon dequeue_of_last_task/enqueue_of_first_task AND

> Also, might be a nice idea to split the daemon up if there are indeed
> disjoint sets - currently there is only a single daemon which touches
> the whole system.

the above suggestions seem like good ideas. I can also look at reducing
the frequency at which the thread runs.

-- 
Regards,
vatsa


Re: High wake up latencies with FAIR_USER_SCHED

2008-01-28 Thread Srivatsa Vaddagiri
On Mon, Jan 28, 2008 at 09:13:53PM +0100, Guillaume Chazarain wrote:
> Unfortunately it seems to not be completely fixed, with this script:

The maximum scheduling latency of a task with group scheduler is:

Lmax = latency to schedule group entity at level0 + 
   latency to schedule group entity at level1 +
   ...
   latency to schedule task entity at last level

The more hierarchical levels there are, the higher the latency can be. This is
particularly so because vruntime (and not wall-clock time) is used as the basis
for preemption of entities. The latency at each level also depends on the number
of entities at that level and the sysctl_sched_latency/sched_nr_latency
settings.

In this case, we only have two levels - userid + task. So the max scheduling 
latency is:

  Lmax = latency to schedule uid1 group entity (L0) +
 latency to schedule the sleeper task within uid1 group (L1)

In the first script that you had, uid1 had only one sleeper task, while uid2 had
two cpu-hogs. This means L1 was always zero for the sleeper task. L0 was also
substantially reduced with the patch I sent (giving sleep credit to group-level
entities). Thus we were able to get low scheduling latencies in the case of the
first script.

The second script you have sent generates two tasks (sleeper + hog) under
uid 1 and one cpu-hog task under uid 2. Consequently the group entity
corresponding to uid 1 is always active and hence there is no question of
giving it credit for sleeping.

As a result, we should expect worst-case latencies of up to [2 * 10ms = 20ms] in
this case. The results you have fall within this range.
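The worst-case arithmetic above can be sketched as follows (using 10ms as an illustrative value of sysctl_sched_latency; the real per-level latency also depends on the number of entities at that level and their weights):

```python
SCHED_LATENCY_MS = 10  # illustrative sysctl_sched_latency

def max_wakeup_latency(entities_active_per_level):
    """Worst-case wakeup latency ~= sum of per-level scheduling latencies.

    A level contributes nothing if the waking entity is the only one
    runnable there (it gets picked immediately).
    """
    return sum(SCHED_LATENCY_MS if n > 1 else 0
               for n in entities_active_per_level)

# First script: uid1's group entity competes with uid2 at L0, but the
# sleeper is alone inside uid1 (L1 = 0) -> up to ~10ms, further reduced
# by the sleep-credit patch.
print(max_wakeup_latency([2, 1]))  # -> 10

# Second script: uid1 vs uid2 at L0, sleeper vs hog inside uid1 at L1
# -> up to ~2 * 10ms = 20ms, matching the observed range.
print(max_wakeup_latency([2, 2]))  # -> 20
```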

In case of !FAIR_USER_SCHED, the sleeper task always gets sleep-credits
and hence its latency is drastically reduced.

IMHO these are expected results, and if someone really needs to cut down
this latency, they can reduce sysctl_sched_latency (which will be bad
from a performance standpoint, as we will cause more cache thrashing).


> #!/usr/bin/python
> 
> import os
> import time
> 
> SLEEP_TIME = 0.1
> SAMPLES = 5
> PRINT_DELAY = 0.5
> 
> def print_wakeup_latency():
> times = []
> last_print = 0
> while True:
> start = time.time()
> time.sleep(SLEEP_TIME)
> end = time.time()
> times.insert(0, end - start - SLEEP_TIME)
> del times[SAMPLES:]
> if end > last_print + PRINT_DELAY:
> copy = times[:]
> copy.sort()
> print '%f ms' % (copy[len(copy)/2] * 1000)
> last_print = end
> 
> if os.fork() == 0:
> if os.fork() == 0:
> os.setuid(1)
> while True:
> pass
> else:
> os.setuid(2)
> while True:
> pass
> else:
> os.setuid(1)
> print_wakeup_latency()
> 
> I get seemingly unpredictable latencies (with or without the patch applied):
> 
> # ./sched.py
> 14.810944 ms
> 19.829893 ms
> 1.968050 ms
> 8.021021 ms
> -0.017977 ms
> 4.926109 ms
> 11.958027 ms
> 5.995893 ms
> 1.992130 ms
> 0.007057 ms
> 0.217819 ms
> -0.004864 ms
> 5.907202 ms
> 6.547832 ms
> -0.012970 ms
> 0.209951 ms
> -0.002003 ms
> 4.989052 ms
> 
> Without FAIR_USER_SCHED, latencies are consistently in the noise.
> Also, I forgot to mention that I'm on a single CPU.
> 
> Thanks for the help.
> 
> -- 
> Guillaume

-- 
Regards,
vatsa


Re: High wake up latencies with FAIR_USER_SCHED

2008-01-27 Thread Srivatsa Vaddagiri
On Sun, Jan 27, 2008 at 09:01:15PM +0100, Guillaume Chazarain wrote:
> I noticed some strangely high wake up latencies with FAIR_USER_SCHED
> using this script:



> We have two busy loops with UID=1.
> And UID=2 maintains the running median of its wake up latency.
> I get these latencies:
> 
> # ./sched.py
> 4.300022 ms
> 4.801178 ms
> 4.604006 ms

Given that sysctl_sched_wakeup_granularity is set to 10ms by default,
this doesn't sound abnormal.



> Disabling FAIR_USER_SCHED restores wake up latencies in the noise:
> 
> # ./sched.py
> -0.156975 ms
> -0.067091 ms
> -0.022984 ms

The reason why we are getting better wakeup latencies for !FAIR_USER_SCHED is 
because of this snippet of code in place_entity():

if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
 ^^
vruntime -= sysctl_sched_latency;

/* ensure we never gain time by being placed backwards. */
vruntime = max_vruntime(se->vruntime, vruntime);
}


NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks and
not group-level entities. With the patch attached, I could see that wakeup 
latencies with FAIR_USER_SCHED are restored to the same level as 
!FAIR_USER_SCHED. 

However I am not sure whether that is the way to go. We want to let one group of
tasks run as much as possible until the fairness/wakeup-latency threshold is
exceeded. If someone does want better wakeup latencies between groups too, they
can always tune sysctl_sched_wakeup_granularity.
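The effect of the patch can be modeled roughly as follows (a simplified Python model of the place_entity() snippet quoted above; the constant is illustrative and the real code works on per-entity vruntimes within a cfs_rq):

```python
SCHED_LATENCY = 10_000_000  # sysctl_sched_latency in ns (illustrative)

def place_entity(cfs_min_vruntime, se_vruntime, is_task,
                 new_fair_sleepers=True, credit_groups=False):
    """Model of where a waking (non-initial) entity lands on the runqueue.

    Mainline grants the sleep credit only when entity_is_task(se);
    the attached patch (credit_groups=True) extends it to group entities.
    """
    vruntime = cfs_min_vruntime
    if new_fair_sleepers and (is_task or credit_groups):
        vruntime -= SCHED_LATENCY  # sleeps up to one latency don't count
    # ensure we never gain time by being placed backwards
    return max(se_vruntime, vruntime)

# A waking group entity: no credit in mainline, credit with the patch.
print(place_entity(100_000_000, 0, is_task=False))                      # 100000000
print(place_entity(100_000_000, 0, is_task=False, credit_groups=True))  # 90000000
```

The 10ms head start in the second case is what lets the sleeper's group preempt the busy group promptly, restoring the !FAIR_USER_SCHED latencies.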



> Strangely enough, another way to restore normal latencies is to change
> setuid(2) to setuid(1), that is, putting the latency measurement in
> the same group as the two busy loops.



-- 
Regards,
vatsa

---
 kernel/sched_fair.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -511,7 +511,7 @@
 
if (!initial) {
/* sleeps upto a single latency don't count. */
-   if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
+   if (sched_feat(NEW_FAIR_SLEEPERS))
vruntime -= sysctl_sched_latency;
 
/* ensure we never gain time by being placed backwards. */


Re: (ondemand) CPU governor regression between 2.6.23 and 2.6.24

2008-01-27 Thread Srivatsa Vaddagiri
On Sun, Jan 27, 2008 at 04:06:17PM +0100, Toralf Förster wrote:
> > The third line (giving overall cpu usage stats) is what is interesting here.
> > If you have more than one cpu, you can get cpu usage stats for each cpu
> > in top by pressing 1. Can you provide this information with and w/o 
> > CONFIG_FAIR_GROUP_SCHED?
> 
> This is what I get if I set CONFIG_FAIR_GROUP_SCHED to "y"
> 
> top - 16:00:59 up 2 min,  1 user,  load average: 2.56, 1.60, 0.65
> Tasks:  84 total,   3 running,  81 sleeping,   0 stopped,   0 zombie
> Cpu(s): 49.7%us,  0.3%sy, 49.7%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
> Mem:   1036180k total,   322876k used,   713304k free,13164k buffers
> Swap:   997880k total,0k used,   997880k free,   149208k cached
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  6070 dnetc 39  19   664  348  264 R 49.7  0.0   1:09.71 dnetc
>  6676 tfoerste  20   0  1796  488  428 R 49.3  0.0   0:02.72 factor
> 
> Stopping dnetc gives:
> 
> top - 16:02:36 up 4 min,  1 user,  load average: 2.50, 1.87, 0.83
> Tasks:  89 total,   3 running,  86 sleeping,   0 stopped,   0 zombie
> Cpu(s): 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   1036180k total,   378760k used,   657420k free,14736k buffers
> Swap:   997880k total,0k used,   997880k free,   180868k cached
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  6766 tfoerste  20   0  1796  488  428 R 84.9  0.0   0:05.41 factor

Thanks for this response. This confirms that the cpu's idle time is close to
zero, as I intended to verify.

> > If I am not mistaken, cpu ondemand gov goes by the cpu idle time stats,
> > which should not be affected by FAIR_GROUP_SCHED. I will lookaround for
> > other possible causes.

On further examination, the ondemand governor seems to have a tunable to
ignore nice load. In your case, I see that dnetc is running at a
positive nice value (19), which could explain why the ondemand governor
thinks that the cpu is only ~50% loaded.

Can you check what is the setting of this knob in your case?

# cat /sys/devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load

You can set that to 0 to ask ondemand gov to include nice load into
account while calculating cpu freq changes:

# echo 0 > /sys/devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load

This should restore the behavior of ondemand governor as seen in 2.6.23
in your case (even with CONFIG_FAIR_GROUP_SCHED enabled). Can you pls confirm 
if that happens?

> As I stated our in http://lkml.org/lkml/2008/1/26/207 the issue is solved
> after unselecting FAIR_GROUP_SCHED. 

I understand, but we want to keep CONFIG_FAIR_GROUP_SCHED enabled by
default.

Ingo,
Most folks seem to be used to a global nice-domain, where a nice-19
task gives up cpu in competition with a nice-0 task (irrespective of which
userids they belong to). CONFIG_FAIR_USER_SCHED brings noticeable changes wrt
that. We could possibly let it be as it is (since that is what a server
admin may want when managing university servers) or modify it to be
aware of nice levels (the priority of a user's sched entity becomes equivalent
to that of the highest-prio task it has).

In any case, I will send across a patch to turn off CONFIG_FAIR_USER_SCHED by 
default (and instead turn on CONFIG_FAIR_CGROUP_SCHED by default).

-- 
Regards,
vatsa


Re: (ondemand) CPU governor regression between 2.6.23 and 2.6.24

2008-01-27 Thread Srivatsa Vaddagiri
On Sat, Jan 26, 2008 at 07:46:51PM +0100, Toralf Förster wrote:
> 
> The problem is the same as described here : http://lkml.org/lkml/2007/10/21/85
> If I run dnetc even with lowest prority than the CPU stays at 600 MHz 
> regardless
> of any other load (eg. rsyncing, svn update, compiling, ...)
> 
> Stopping the dnetc process immediately speeds up the CPU up to 1.7 GHz.
> 
> 
> Am Samstag, 26. Januar 2008 schrieben Sie:
> > During the test, run top, and watch your CPU usage. Does it go above 80% 
> > (the default for 
> > /sys/devices/system/cpu/cpu0/cpufreq/ondemand/up_threshold).
> 
> No, instead I get :
> 
>  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  7294 dnetc 39  19   664  348  264 R 49.5  0.0   0:48.68 dnetc
>  7310 tfoerste  20   0  1796  492  428 R 48.5  0.0   0:07.19 factor
>  7050 root  20   0 96736 8872 3972 S  0.7  0.9   0:02.99 X

Hi Toralf,
Can you list the output you see for overall cpu usage? You should
see something like below right at the top of the output:

top - 20:03:59 up 12 days, 21:39, 18 users,  load average: 0.22, 0.20, 0.25
Tasks: 200 total,   5 running, 193 sleeping,   0 stopped,   2 zombie
Cpu(s):  2.6% us,  1.3% sy,  0.0% ni, 96.0% id,  0.0% wa,  0.0% hi, 0.0% si,  
0.0% st

The third line (giving overall cpu usage stats) is what is interesting here.
If you have more than one cpu, you can get cpu usage stats for each cpu
in top by pressing 1. Can you provide this information with and without
CONFIG_FAIR_GROUP_SCHED?

If I am not mistaken, the cpu ondemand governor goes by the cpu idle time stats,
which should not be affected by FAIR_GROUP_SCHED. I will look around for
other possible causes.

-- 
Regards,
vatsa


Re: [PATCH] sched: don't take a mutex from interrupt context

2008-01-22 Thread Srivatsa Vaddagiri
On Tue, Jan 22, 2008 at 05:47:34PM +0100, Peter Zijlstra wrote:
> It should not, that would be another bug, but from a quick glance at the
> code it doesn't do that.

Hmm, I had it in the back of my mind that printk() could sleep. Looks like
that has changed, and so the patch you sent should be fine.

Thanks!

- vatsa


Re: [PATCH] sched: don't take a mutex from interrupt context

2008-01-22 Thread Srivatsa Vaddagiri
On Tue, Jan 22, 2008 at 05:25:38PM +0100, Peter Zijlstra wrote:
> @@ -1428,9 +1428,9 @@ static void print_cfs_stats(struct seq_f
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
>  #endif
> - lock_task_group_list();
> + rcu_read_lock();
>   for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
>   print_cfs_rq(m, cpu, cfs_rq);

Isn't there a possibility that print_cfs_rq() can block?

> - unlock_task_group_list();
> + rcu_read_unlock();
>  }
>  #endif
> 

-- 
Regards,
vatsa


Re: Regression with idle cpu cycle handling in 2.6.24 (compared to 2.6.22)

2008-01-21 Thread Srivatsa Vaddagiri
On Sun, Jan 20, 2008 at 09:03:38AM +0530, Dhaval Giani wrote:
> > btw: writing 1 into "cpu_share" totally locks up the computer!
> > 
> 
> Can you please provide some more details. Can you go into another
> console (try ctrl-alt-f1) and try to reproduce the issue there. Could
> you take a photo of the oops/panic and upload it somewhere so that we
> can see it?

I couldn't recreate this problem either on 2.6.24-rc8.

Daniel,
Could you pass along the config file you used, please?

-- 
Regards,
vatsa


Re: [PATCH 00/11] another rt group sched update

2008-01-07 Thread Srivatsa Vaddagiri
On Mon, Jan 07, 2008 at 11:51:20AM +0100, Peter Zijlstra wrote:
>  - figure out what to do for UID based group scheduling, the current
>implementation leaves it impossible for !root users to execute
>real time tasks by setting rt_runtime_us to 0, and it has no way
>to change it.
> 
>Srivatsa, what happened to the per uid weight patches?, Perhaps we
>can extend that interface to allow changing this.

Hi Peter,
The sysfs interface for tweaking each user's share should be in
mainline already (sysfs_create_file() in user_kobject_create()). This
could be extended for your purpose, hopefully in a straightforward
manner (you never know with sysfs :-()

-- 
Regards,
vatsa


Re: [PATCH] sched: cpu accounting controller (V2)

2007-11-30 Thread Srivatsa Vaddagiri
On Fri, Nov 30, 2007 at 01:35:13PM +0100, Ingo Molnar wrote:
> * Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> 
> > Here's V2 of the cpu acccounting controller patch, which makes 
> > accounting scale better on SMP systems by splitting the usage counter 
> > to be per-cpu.
> 
> thanks, applied. But you dont seem to have incorporated all of the 
> review feedback from Andrew. (such as making cpuacct_subsys static)

cpuacct_subsys can't be made static, as I have pointed out at
http://marc.info/?l=linux-kernel&m=119636730930886. Other than that,
AFAICS, the rest of Andrew's comments have been taken care of. If there are any
remaining that I have overlooked, I'd be happy to fix them.

-- 
Regards,
vatsa


[PATCH] sched: cpu accounting controller (V2)

2007-11-30 Thread Srivatsa Vaddagiri
On Fri, Nov 30, 2007 at 01:48:33AM +0530, Srivatsa Vaddagiri wrote:
> It is indeed an important todo. Right now we take a per-group global
> lock on every accounting update (which can be very frequent) and hence
> it is pretty bad.
> 
> Ingo had expressed the need to reintroduce this patch asap and hence I resent 
> it w/o attacking the scalability part. I will take a shot at scalability
> enhancements tomorrow and send it as a separate patch.

Here's V2 of the cpu acccounting controller patch, which makes
accounting scale better on SMP systems by splitting the usage counter to
be per-cpu.
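The scalability idea - replacing the single locked counter with per-cpu counters that are only summed on read - can be sketched in miniature (a toy Python model, not the kernel code; the real patch uses percpu_counter-style storage and takes rq->lock only on the read side for 32-bit atomicity):

```python
NR_CPUS = 4  # illustrative

class CpuAcct:
    """Toy model of the per-cpu usage split in the V2 patch."""

    def __init__(self):
        self.usage = [0] * NR_CPUS  # one counter per cpu, no shared lock

    def charge(self, cpu, ns):
        # hot path: each cpu touches only its own counter, so frequent
        # accounting updates no longer contend on a per-group global lock
        self.usage[cpu] += ns

    def read(self):
        # slow path (reading the cgroup file): aggregate across cpus
        return sum(self.usage)

ca = CpuAcct()
ca.charge(0, 5_000_000)
ca.charge(3, 7_000_000)
print(ca.read())  # 12000000
```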

>

Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for
us, which provided a cpu accounting resource controller. This feature would be
useful if someone wants to group tasks only for accounting purposes and doesn't
really want to exercise any control over their cpu consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
these differences:

- Removed load average information. I felt it needs more thought (especially
  to deal with SMP and virtualized platforms) and can be added for 
  2.6.25 after more discussions.
- Convert group cpu usage to be nanosecond accurate (as rest of the cfs 
  stats are) and invoke cpuacct_charge() from the respective scheduler 
  classes
- Make accounting scalable on SMP systems by splitting the usage 
  counter to be per-cpu
- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the 
  code is not big enough to warrant a new file and also this rightly 
  needs to live inside the scheduler. Also things like accessing 
  rq->lock while reading cpu usage becomes easier if the code lived in 
  kernel/sched.c)

The patch also modifies the cpu controller not to provide the same accounting 
information.


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/cgroup_subsys.h |7 +
 init/Kconfig  |7 +
 kernel/sched.c|  155 ++
 kernel/sched_fair.c   |6 +
 kernel/sched_rt.c |1 
 5 files changed, 150 insertions(+), 26 deletions(-)

Index: current/include/linux/cgroup_subsys.h
===
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -30,3 +30,10 @@ SUBSYS(cpu_cgroup)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_CPUACCT
+SUBSYS(cpuacct)
+#endif
+
+/* */
+
Index: current/init/Kconfig
===
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -354,6 +354,13 @@ config FAIR_CGROUP_SCHED
 
 endchoice
 
+config CGROUP_CPUACCT
+   bool "Simple CPU accounting cgroup subsystem"
+   depends on CGROUPS
+   help
+ Provides a simple Resource Controller for monitoring the
+ total CPU consumed by the tasks in a cgroup
+
 config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -854,6 +854,12 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
+#ifdef CONFIG_CGROUP_CPUACCT
+static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+#endif
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -7221,38 +7227,12 @@ static u64 cpu_shares_read_uint(struct c
return (u64) tg->shares;
 }
 
-static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft)
-{
-   struct task_group *tg = cgroup_tg(cgrp);
-   unsigned long flags;
-   u64 res = 0;
-   int i;
-
-   for_each_possible_cpu(i) {
-   /*
-* Lock to prevent races with updating 64-bit counters
-* on 32-bit arches.
-*/
-   spin_lock_irqsave(&cpu_rq(i)->lock, flags);
-   res += tg->se[i]->sum_exec_runtime;
-   spin_unlock_irqrestore(&cpu_rq(i)->lock, flags);
-   }
-   /* Convert from ns to ms */
-   do_div(res, NSEC_PER_MSEC);
-
-   return res;
-}
-
 static struct cftype cpu_files[] = {
{
.name = "shares",
.read_uint = cpu_shares_read_uint,
.write_uint = cpu_shares_write_uint,
},
-   {
-   .name = "usage",
-   .read_uint = cpu_usage_read,
-   },
 };
 
 static int cpu_cgroup_populate(struct cgroup_sub

Re: [PATCH] sched: cpu accounting controller

2007-11-29 Thread Srivatsa Vaddagiri
On Thu, Nov 29, 2007 at 11:30:35AM -0800, Andrew Morton wrote:
> > - Make the accounting scalable on SMP systems (perhaps
> >   for 2.6.25)
> 
> That sounds like a rather important todo.  How bad is it now?

It is indeed an important todo. Right now we take a per-group global
lock on every accounting update (which can be very frequent) and hence
it is pretty bad.

Ingo had expressed the need to reintroduce this patch asap and hence I resent 
it w/o attacking the scalability part. I will take a shot at scalability
enhancements tomorrow and send it as a separate patch.

> > +struct cpuacct {
> > + struct cgroup_subsys_state css;
> > + spinlock_t lock;
> > + /* total time used by this class (in nanoseconds) */
> > + u64 time;
> > +};
> > +
> > +struct cgroup_subsys cpuacct_subsys;
> 
> This can be made static.

This symbol is needed in kernel/cgroup.c file, where it does this:

static struct cgroup_subsys *subsys[] = {
#include <linux/cgroup_subsys.h>
};

and hence it can't be static. Thanks for the rest of your comments. I
have fixed them in the version below:

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/cgroup_subsys.h |6 ++
 include/linux/cpu_acct.h  |   14 +
 init/Kconfig  |7 ++
 kernel/Makefile   |1 
 kernel/cpu_acct.c |  101 ++
 kernel/sched.c|   27 ---
 kernel/sched_fair.c   |5 ++
 kernel/sched_rt.c |1 
 8 files changed, 136 insertions(+), 26 deletions(-)

Index: current/include/linux/cgroup_subsys.h
===
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -11,6 +11,12 @@
 SUBSYS(cpuset)
 #endif
 
+#ifdef CONFIG_CGROUP_CPUACCT
+SUBSYS(cpuacct)
+#endif
+
+/* */
+
 /* */
 
 #ifdef CONFIG_CGROUP_DEBUG
Index: current/include/linux/cpu_acct.h
===
--- /dev/null
+++ current/include/linux/cpu_acct.h
@@ -0,0 +1,14 @@
+
+#ifndef _LINUX_CPU_ACCT_H
+#define _LINUX_CPU_ACCT_H
+
+#include 
+#include 
+
+#ifdef CONFIG_CGROUP_CPUACCT
+extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+#endif
+
+#endif
Index: current/init/Kconfig
===
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -354,6 +354,13 @@ config FAIR_CGROUP_SCHED
 
 endchoice
 
+config CGROUP_CPUACCT
+   bool "Simple CPU accounting cgroup subsystem"
+   depends on CGROUPS
+   help
+ Provides a simple Resource Controller for monitoring the
+ total CPU consumed by the tasks in a cgroup
+
 config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/Makefile
===
--- current.orig/kernel/Makefile
+++ current/kernel/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_CPUACCT) += cpu_acct.o
 obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
Index: current/kernel/cpu_acct.c
===
--- /dev/null
+++ current/kernel/cpu_acct.c
@@ -0,0 +1,101 @@
+/*
+ * kernel/cpu_acct.c - CPU accounting cgroup subsystem
+ *
+ * Copyright (C) Google Inc, 2006
+ *
+ * Developed by Paul Menage ([EMAIL PROTECTED]) and Balbir Singh
+ * ([EMAIL PROTECTED])
+ *
+ */
+
+/*
+ * Example cgroup subsystem for reporting total CPU usage of tasks in a
+ * cgroup.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+struct cpuacct {
+   struct cgroup_subsys_state css;
+   spinlock_t lock;
+   /* total time used by this class (in nanoseconds) */
+   u64 time;
+};
+
+struct cgroup_subsys cpuacct_subsys;
+
+static inline struct cpuacct *cgroup_ca(struct cgroup *cont)
+{
+   return container_of(cgroup_subsys_state(cont, cpuacct_subsys_id),
+   struct cpuacct, css);
+}
+
+static inline struct cpuacct *task_ca(struct task_struct *task)
+{
+   return container_of(task_subsys_state(task, cpuacct_subsys_id),
+   struct cpuacct, css);
+}
+
+static struct cgroup_subsys_state *cpuacct_create(
+   struct cgroup_subsys *ss, struct cgroup *cont)
+{
+   struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
+
+   if (!ca)
+   return ERR_PTR(-ENOMEM);
+   spin_lock_init(&ca->lock);
+   return &ca->css;
+}
+
+static void cpuacct_destroy(struct cgroup_subsys *ss,
+   

Re: [PATCH] sched: cpu accounting controller

2007-11-29 Thread Srivatsa Vaddagiri
On Thu, Nov 29, 2007 at 08:20:58PM +0100, Ingo Molnar wrote:
> ok, this looks certainly doable for v2.6.24. I've added it to the 
> scheduler fixes queue and will let it brew there for a few days and send 
> it to Linus after that if everything goes fine - unless anyone objects. 

Thanks.

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: cpu accounting controller

2007-11-29 Thread Srivatsa Vaddagiri
On Mon, Nov 12, 2007 at 11:57:03PM -0800, Paul Menage wrote:
> > Regarding your concern about tracking cpu usage in different ways, it
> > could be mitigated if we have cpuacct controller track usage as per
> > information present in a task's sched entity structure
> > (tsk->se.sum_exec_runtime) i.e call cpuacct_charge() from
> > __update_curr() which would accumulate the execution time of the
> > group in a SMP friendly manner (i.e dump it in a per-cpu per-group counter
> > first and then aggregate to a global per-group counter).
> 
> That seems more reasonable than the current approach in cpu_acct.c

Paul,
Sorry about the delay in getting back to this thread. I realized
very recently that the cpuacct controller has been removed from Linus's tree
and have attempted to rework it as per our discussions.

Linus/Ingo,
Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful
feature for us, which provided a cpu accounting resource controller. This
feature would be useful if someone wants to group tasks only for accounting
purposes and doesn't really want to exercise any control over their cpu
consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
these differences:

- Removed load average information. I felt it needs
  more thought (esp to deal with SMP and virtualized platforms)
  and can be added for 2.6.25 after more discussions.
- Convert group cpu usage to be nanosecond accurate (as the rest
  of the cfs stats are) and invoke cpuacct_charge() from
  the respective scheduler classes

The patch also modifies the cpu controller not to provide the same
accounting information.

Todo:
- Make the accounting scalable on SMP systems (perhaps
  for 2.6.25)


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/cgroup_subsys.h |6 ++
 include/linux/cpu_acct.h  |   14 +
 init/Kconfig  |7 ++
 kernel/Makefile   |1 
 kernel/cpu_acct.c |  103 ++
 kernel/sched.c|   27 ---
 kernel/sched_fair.c   |5 ++
 kernel/sched_rt.c |1 
 8 files changed, 138 insertions(+), 26 deletions(-)

Index: current/include/linux/cgroup_subsys.h
===
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -11,6 +11,12 @@
 SUBSYS(cpuset)
 #endif
 
+#ifdef CONFIG_CGROUP_CPUACCT
+SUBSYS(cpuacct)
+#endif
+
+/* */
+
 /* */
 
 #ifdef CONFIG_CGROUP_DEBUG
Index: current/include/linux/cpu_acct.h
===
--- /dev/null
+++ current/include/linux/cpu_acct.h
@@ -0,0 +1,14 @@
+
+#ifndef _LINUX_CPU_ACCT_H
+#define _LINUX_CPU_ACCT_H
+
+#include 
+#include 
+
+#ifdef CONFIG_CGROUP_CPUACCT
+extern void cpuacct_charge(struct task_struct *, u64 cputime);
+#else
+static inline void cpuacct_charge(struct task_struct *p, u64 cputime) {}
+#endif
+
+#endif
Index: current/init/Kconfig
===
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -354,6 +354,13 @@ config FAIR_CGROUP_SCHED
 
 endchoice
 
+config CGROUP_CPUACCT
+   bool "Simple CPU accounting cgroup subsystem"
+   depends on CGROUPS
+   help
+ Provides a simple Resource Controller for monitoring the
+ total CPU consumed by the tasks in a cgroup
+
 config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/Makefile
===
--- current.orig/kernel/Makefile
+++ current/kernel/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_CPUACCT) += cpu_acct.o
 obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
Index: current/kernel/cpu_acct.c
===
--- /dev/null
+++ current/kernel/cpu_acct.c
@@ -0,0 +1,103 @@
+/*
+ * kernel/cpu_acct.c - CPU accounting cgroup subsystem
+ *
+ * Copyright (C) Google Inc, 2006
+ *
+ * Developed by Paul Menage ([EMAIL PROTECTED]) and Balbir Singh
+ * ([EMAIL PROTECTED])
+ *
+ */
+
+/*
+ * Example cgroup subsystem for reporting total CPU usage of tasks in a
+ * cgroup.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+struct cpuacct {
+   struct cgroup_subsys_state css;
+   spinlock_t lock;
+   /* total time used by this class (in nanoseconds) */
+   u64 time;
+};
+
+struct cgroup_subsys cpuac

Re: [Patch 0/5] sched: group scheduler related patches (V4)

2007-11-27 Thread Srivatsa Vaddagiri
On Tue, Nov 27, 2007 at 01:53:12PM +0100, Ingo Molnar wrote:
> > Fine. I will resubmit this patchset then once we are into 2.6.25 
> > cycle.
> 
> no need (unless you have bugfixes) i'm carrying this around in the 
> scheduler git tree. (it will show up in sched-devel.git)

Cool .. Thx! It will get me some testing results (from those who test
sched-devel tree) and also will save me some maintenance trouble :)

-- 
Regards,
vatsa


Re: [Patch 0/5] sched: group scheduler related patches (V4)

2007-11-27 Thread Srivatsa Vaddagiri
On Tue, Nov 27, 2007 at 12:09:10PM +0100, Ingo Molnar wrote:
> thanks, it looks good - but the fact that we are at v4 of the patchset 
> underlines the point that this is more of a v2.6.25 patchset than a 
> v2.6.24 one.

Fine. I will resubmit this patchset then once we are into 2.6.25 cycle.

> Group fairness certainly works fine enough in most setups 
> and we are late into the v2.6.24 -rc stage. We'll merge this patchset 
> (or any later versions of it) into v2.6.25 for sure,

Ok ..thx.

> so distros that 
> happen to base things off v2.6.24 can pick up your patches just fine.

-- 
Regards,
vatsa


[Patch 5/5] sched: Improve fairness of cpu bandwidth allocation for task groups

2007-11-26 Thread Srivatsa Vaddagiri

The current load balancing scheme isn't good for group fairness.

For ex: on a 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024) 
b = 4 tasks (cpu.shares = 1024) 
c = 3 tasks (cpu.shares = 1024) 

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:


Col1  | Col2| Col3  |  Col4
--|-|---|---
a | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
b | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
c |  86.326 | 18.0% | 47.5%  47.9%  48.5%


Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that 
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in 
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:


Col1  | Col2| Col3  |  Col4
--|-|---|---
a | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
b | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
c | 160.653 | 33.5% | 85.8%  90.6%  91.4%


Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most of the other runs have yielded a deviation within
+-2% at the most, which is good.

Why do we see bad (group) fairness with the current scheduler?
==============================================================

Currently cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

CPU0    CPU1
----------------
A (10)  B(5)
        C(5)
----------------

Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

A = 50%
B = 25%
C = 25%

which is not the desired result.

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
  defined (see below)
- API Change 
- Two tunables introduced in sysfs (under SCHED_DEBUG) to 
  control the frequency at which the load balance monitor
  thread runs. 

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. It's now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
W(K,i)  = Weight of group K on cpu i
T(K,i)  = Task load present in group K's cfs_rq on cpu i
T(K)= Total task load of group K across various cpus
S(K)= Shares allocated to group K
NRCPUS  = Number of online cpus in the scheduler domain to
  which group K is assigned.

Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)

A load balance monitor thread is created at bootup, which periodically
runs and adjusts group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which
it runs.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/sched.h |4 
 kernel/sched.c|  259 --
 kernel/sched_fair.c   |   88 ++--
 kernel/sysctl.c   |   18 +++
 4 files changed, 330 insertions(+), 39 deletions(-)

Index: current/include/linux/sched.h
===

[Patch 4/5] sched: introduce a mutex and corresponding API to serialize access to doms_cur[] array

2007-11-26 Thread Srivatsa Vaddagiri
The doms_cur[] array represents various scheduling domains which are mutually
exclusive. Currently the cpusets code can modify this array (by calling
partition_sched_domains()) as a result of the user modifying the
sched_load_balance flag for various cpusets.

This patch introduces a mutex and a corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read the
doms_cur[] array without worrying about concurrent modifications to the array.

The fair group scheduler code (introduced in the next patch of this series)
makes use of this mutex to walk through the doms_cur[] array while rebalancing
shares of task groups across cpus.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   19 +++
 1 files changed, 19 insertions(+)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -186,6 +186,9 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
  */
 static DEFINE_MUTEX(task_group_mutex);
 
+/* doms_cur_mutex serializes access to doms_cur[] array */
+static DEFINE_MUTEX(doms_cur_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -236,11 +239,23 @@ static inline void unlock_task_group_lis
mutex_unlock(&task_group_mutex);
 }
 
+static inline void lock_doms_cur(void)
+{
+   mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+   mutex_unlock(&doms_cur_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
 static inline void lock_task_group_list(void) { }
 static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6547,6 +6562,8 @@ void partition_sched_domains(int ndoms_n
 {
int i, j;
 
+   lock_doms_cur();
+
/* always unregister in case we don't destroy any domains */
unregister_sched_domain_sysctl();
 
@@ -6587,6 +6604,8 @@ match2:
ndoms_cur = ndoms_new;
 
register_sched_domain_sysctl();
+
+   unlock_doms_cur();
 }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)


-- 
Regards,
vatsa


[Patch 3/5 v2] sched: change how cpu load is calculated

2007-11-26 Thread Srivatsa Vaddagiri
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch (v2 of Patch 3/5) has a minor impact on code size
(but should have no runtime/functional impact) for !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/5
(because of lesser #ifdefs).

I prefer v2 of Patch 3/5.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   27 +++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,6 +870,16 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -880,26 +890,14 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_sub(&rq->load, p->se.load.weight);
-}
-
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
-   inc_load(rq, p);
 }
 
 static void dec_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running--;
-   dec_load(rq, p);
 }
 
 static void set_load_weight(struct task_struct *p)
@@ -4071,10 +4069,8 @@ void set_user_nice(struct task_struct *p
goto out_unlock;
}
on_rq = p->se.on_rq;
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, p, 0);
-   dec_load(rq, p);
-   }
 
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -4084,7 +4080,6 @@ void set_user_nice(struct task_struct *p
 
if (on_rq) {
enqueue_task(rq, p, 0);
-   inc_load(rq, p);
/*
 * If the task increased its priority or is running and
 * lowered its priority, then reschedule its CPU:
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /*
+* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq 
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /*
+* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' 

[Patch 3/5 v1] sched: change how cpu load is calculated

2007-11-26 Thread Srivatsa Vaddagiri
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch (v1 of Patch 3/5) has zero impact for
!CONFIG_FAIR_GROUP_SCHED case.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>


---
 kernel/sched.c  |   38 ++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,15 +870,25 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
 
-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else  /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }
 
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
@@ -890,6 +900,18 @@ static inline void dec_load(struct rq *r
update_load_sub(&rq->load, p->se.load.weight);
 }
 
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#include "sched_fair.c"
+#include "sched_rt.c"
+#ifdef CONFIG_SCHED_DEBUG
+# include "sched_debug.c"
+#endif
+
+#define sched_class_highest (&rt_sched_class)
+
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /*
+* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq 
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /*
+* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (decload)
+   dec_cpu_load(rq, topse->load.weight);
 }
 
 /*
Index: current/kernel/sched_rt.c
===
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@ static void enqueu

[Patch 2/5] sched: minor fixes for group scheduler

2007-11-26 Thread Srivatsa Vaddagiri
Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
  changing shares of a task group. Use the same mutex when printing cfs_rq
  stats for various task groups.
- Use list_for_each_entry_rcu in the for_each_leaf_cfs_rq macro (when
  walking the task group list)


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   34 ++
 kernel/sched_fair.c |4 +++-
 2 files changed, 29 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
-   /* spinlock to serialize modification to shares */
-   spinlock_t lock;
struct rcu_head rcu;
 };
 
@@ -182,6 +180,12 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
 
+/*
+ * task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -222,9 +226,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
 }
 
+static inline void lock_task_group_list(void)
+{
+   mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+   mutex_unlock(&task_group_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6747,7 +6763,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
-   spin_lock_init(&init_task_group.lock);
 #endif
 
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6987,14 +7002,15 @@ struct task_group *sched_create_group(vo
se->parent = NULL;
}
 
+   tg->shares = NICE_0_LOAD;
+
+   lock_task_group_list();
for_each_possible_cpu(i) {
rq = cpu_rq(i);
cfs_rq = tg->cfs_rq[i];
list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}
-
-   tg->shares = NICE_0_LOAD;
-   spin_lock_init(&tg->lock);
+   unlock_task_group_list();
 
return tg;
 
@@ -7040,10 +7056,12 @@ void sched_destroy_group(struct task_gro
struct cfs_rq *cfs_rq = NULL;
int i;
 
+   lock_task_group_list();
for_each_possible_cpu(i) {
cfs_rq = tg->cfs_rq[i];
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}
+   unlock_task_group_list();
 
BUG_ON(!cfs_rq);
 
@@ -7117,7 +7135,7 @@ int sched_group_set_shares(struct task_g
 {
int i;
 
-   spin_lock(&tg->lock);
+   lock_task_group_list();
if (tg->shares == shares)
goto done;
 
@@ -7126,7 +7144,7 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);
 
 done:
-   spin_unlock(&tg->lock);
+   unlock_task_group_list();
return 0;
 }
 
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
-   list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+   list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline int
@@ -1126,7 +1126,9 @@ static void print_cfs_stats(struct seq_f
 #ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
 #endif
+   lock_task_group_list();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+   unlock_task_group_list();
 }
 #endif

-- 
Regards,
vatsa


[Patch 1/5] sched: code cleanup

2007-11-26 Thread Srivatsa Vaddagiri
Minor cleanups:

- Fix coding style
- remove obsolete comment


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   21 +++--
 1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
 };
 
 #ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  2*NICE_0_LOAD
 #else
-# define INIT_TASK_GRP_LOADNICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  NICE_0_LOAD
 #endif
 
-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
 static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
update_load_add(&rq->load, p->se.load.weight);


-- 
Regards,
vatsa


[Patch 0/5] sched: group scheduler related patches (V4)

2007-11-26 Thread Srivatsa Vaddagiri
On Mon, Nov 26, 2007 at 09:28:36PM +0100, Ingo Molnar wrote:
> the first SCHED_RR priority is 1, not 0 - so this call will always fail.

Thanks for spotting this bug and rest of your review comments.

Here's V4 of the patchset, aimed at improving fairness of cpu bandwidth
allocation for task groups.

Changes since V3 (http://marc.info/?l=linux-kernel&m=119605252303359):

- Fix bug in setting SCHED_RR priority for load_balance_monitor thread
- Fix coding style related issues
- Separate "introduction of lock_doms_cur() API" into a separate patch

I have also tested this patchset against your latest git tree as of
this morning.

Please apply if there are no major concerns.


-- 
Regards,
vatsa


[Patch 4/4] sched: Improve fairness of cpu bandwidth allocation for task groups

2007-11-25 Thread Srivatsa Vaddagiri

The current load balancing scheme isn't good for group fairness.

For ex: on a 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024) 
b = 4 tasks (cpu.shares = 1024) 
c = 3 tasks (cpu.shares = 1024) 

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:


Col1  | Col2| Col3  |  Col4
--|-|---|---
a | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
b | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
c |  86.326 | 18.0% | 47.5%  47.9%  48.5%


Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that 
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in 
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:


Col1  | Col2| Col3  |  Col4
--|-|---|---
a | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
b | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
c | 160.653 | 33.5% | 85.8%  90.6%  91.4%


Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most of the other runs have yielded a deviation within
+-2% at the most, which is good.

Why do we see bad (group) fairness with the current scheduler?
==============================================================

Currently a cpu's weight is just the summation of the individual task
weights on it. This can yield incorrect results. For ex: consider three
groups as below on a 2-cpu system:

CPU0    CPU1
---------------
A (10)  B (5)
        C (5)
---------------

Group A has 10 tasks, all on CPU0; groups B and C have 5 tasks each,
all of them on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

A = 50%
B = 25%
C = 25%

which is not the desired result.

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
  defined (see below)
- API change: two tunables introduced in sysfs (under SCHED_DEBUG) to
  control the frequency at which the load balance monitor thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. It is now calculated as the summation of group weights on a cpu,
rather than the summation of task weights. Weight exerted by a group on a
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
W(K,i)  = Weight of group K on cpu i
T(K,i)  = Task load present in group K's cfs_rq on cpu i
T(K)= Total task load of group K across various cpus
S(K)= Shares allocated to group K
NRCPUS  = Number of online cpus in the scheduler domain to
  which group K is assigned.

Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)

A load balance monitor thread is created at bootup, which periodically
runs and adjusts each group's weight on every cpu. To limit its overhead,
two min/max tunables are introduced (under SCHED_DEBUG) to control the
rate at which it runs.


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/sched.h |4 
 kernel/sched.c|  265 --
 kernel/sched_fair.c   |   86 ++--
 kernel/sysctl.c   |   18 +++
 4 files changed, 334 insertions(+), 39 deletions(-)

Index: current/include/linux/sched.h
===================================================================

[Patch 3/4 v2] sched: change how cpu load is calculated

2007-11-25 Thread Srivatsa Vaddagiri

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of the patch (v2 of Patch 3/4) has a minor impact on code size
(but should have no runtime/functional impact) for the !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/4
(because of fewer #ifdefs).

I prefer v2 of Patch 3/4.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   27 +++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,6 +869,16 @@
   struct rq_iterator *iterator);
 #endif
 
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -879,26 +889,14 @@
 
 #define sched_class_highest (&rt_sched_class)
 
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_sub(&rq->load, p->se.load.weight);
-}
-
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
-   inc_load(rq, p);
 }
 
 static void dec_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running--;
-   dec_load(rq, p);
 }
 
 static void set_load_weight(struct task_struct *p)
@@ -4070,10 +4068,8 @@
goto out_unlock;
}
on_rq = p->se.on_rq;
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, p, 0);
-   dec_load(rq, p);
-   }
 
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -4083,7 +4079,6 @@
 
if (on_rq) {
enqueue_task(rq, p, 0);
-   inc_load(rq, p);
/*
 * If the task increased its priority or is running and
 * lowered its priority, then reschedule its CPU:
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se,
+   *topse = NULL;  /* Highest schedulable entity */
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se,
+   *topse = NULL;  /* Highest schedulable entity */
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (decload)
+   dec_cpu_load(rq, topse->load.weight);

[Patch 3/4 v1] sched: change how cpu load is calculated

2007-11-25 Thread Srivatsa Vaddagiri
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated to it.

This version of patch (v1 of Patch 3/4) has zero impact for 
!CONFIG_FAIR_GROUP_SCHED case.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   38 ++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,15 +869,25 @@
   struct rq_iterator *iterator);
 #endif
 
-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
 
-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else  /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }
 
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
@@ -889,6 +899,18 @@
update_load_sub(&rq->load, p->se.load.weight);
 }
 
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#include "sched_fair.c"
+#include "sched_rt.c"
+#ifdef CONFIG_SCHED_DEBUG
+# include "sched_debug.c"
+#endif
+
+#define sched_class_highest (&rt_sched_class)
+
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se,
+   *topse = NULL;  /* Highest schedulable entity */
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se,
+   *topse = NULL;  /* Highest schedulable entity */
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (decload)
+   dec_cpu_load(rq, topse->load.weight);
 }
 
 /*
Index: current/kernel/sched_rt.c
===================================================================
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@
 
list_add_tail(&p->run_list, array->queue + p->prio);

[PATCH 2/4] sched: minor fixes for group scheduler

2007-11-25 Thread Srivatsa Vaddagiri

Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
  changing shares of a task group. Use the same mutex when printing cfs_rq
  stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when walking task 
  group list)


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   33 +
 kernel/sched_fair.c |4 +++-
 2 files changed, 28 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
-   /* spinlock to serialize modification to shares */
-   spinlock_t lock;
struct rcu_head rcu;
 };
 
@@ -182,6 +180,11 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
 
+/* task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -222,9 +225,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
 }
 
+static inline void lock_task_group_list(void)
+{
+   mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+   mutex_unlock(&task_group_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6747,7 +6762,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
-   spin_lock_init(&init_task_group.lock);
 #endif
 
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6987,14 +7001,15 @@ struct task_group *sched_create_group(vo
se->parent = NULL;
}
 
+   tg->shares = NICE_0_LOAD;
+
+   lock_task_group_list();
for_each_possible_cpu(i) {
rq = cpu_rq(i);
cfs_rq = tg->cfs_rq[i];
list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}
-
-   tg->shares = NICE_0_LOAD;
-   spin_lock_init(&tg->lock);
+   unlock_task_group_list();
 
return tg;
 
@@ -7040,10 +7055,12 @@ void sched_destroy_group(struct task_gro
struct cfs_rq *cfs_rq = NULL;
int i;
 
+   lock_task_group_list();
for_each_possible_cpu(i) {
cfs_rq = tg->cfs_rq[i];
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}
+   unlock_task_group_list();
 
BUG_ON(!cfs_rq);
 
@@ -7117,7 +7134,7 @@ int sched_group_set_shares(struct task_g
 {
int i;
 
-   spin_lock(&tg->lock);
+   lock_task_group_list();
if (tg->shares == shares)
goto done;
 
@@ -7126,7 +7143,7 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);
 
 done:
-   spin_unlock(&tg->lock);
+   unlock_task_group_list();
return 0;
 }
 
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
-   list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+   list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline int
@@ -1126,7 +1126,9 @@ static void print_cfs_stats(struct seq_f
 #ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
 #endif
+   lock_task_group_list();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+   unlock_task_group_list();
 }
 #endif

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/4] sched: code cleanup

2007-11-25 Thread Srivatsa Vaddagiri

Minor cleanups:

- Fix coding style
- remove obsolete comment

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   21 +++--
 1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
 };
 
 #ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  2*NICE_0_LOAD
 #else
-# define INIT_TASK_GRP_LOADNICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  NICE_0_LOAD
 #endif
 
-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
 static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
update_load_add(&rq->load, p->se.load.weight);

-- 
Regards,
vatsa


[PATCH 0/4] sched: group scheduler related patches (V3)

2007-11-25 Thread Srivatsa Vaddagiri
Here's V3 of the group scheduler related patches, which mainly addresses
improved fairness of cpu bandwidth allocation for task groups.

Patch 1/4   -> coding style cleanup
Patch 2/4   -> Minor group scheduling related bug fixes

Patch 3/4 (v1)  -> Modifies how cpu load is calculated, such that there is zero
   impact on !CONFIG_FAIR_GROUP_SCHED
Patch 3/4 (v2)  -> Modifies how cpu load is calculated, such that there is a
   small impact on code size (but should have NO impact on
   functionality or runtime behavior) for
   !CONFIG_FAIR_GROUP_SCHED case. The resulting code however is
   much neater since it avoids some #ifdefs. I prefer v2.

Patch 4/4   -> Updates load balance logic to provide improved fairness for
   task groups.

To have zero impact on !CONFIG_FAIR_GROUP_SCHED case, please apply the
following patches:

- Patch 1/4
- Patch 2/4 
- Patch 3/4 (v1)
- Patch 4/4

I personally prefer v2 of Patch 3/4. Even though it has a minor impact
on code size for !CONFIG_FAIR_GROUP_SCHED case, the overall code is much
neater IMHO.

Impact on sched.o size:
=======================

!CONFIG_FAIR_GROUP_SCHED:

   text    data   bss     dec    hex  filename
  36829    2766    48   39643   9adb  sched.o-before-nofgs
  36829    2766    48   39643   9adb  sched.o-after-v1-nofgs (v1 of Patch 3/4)
  36843    2766    48   39657   9ae9  sched.o-after-v2-nofgs (v2 of Patch 3/4)

CONFIG_FAIR_GROUP_SCHED:

   text    data   bss     dec    hex  filename
  39019    3346   336   42701   a6cd  sched.o-before-fgs
  40303    3482   308   44093   ac3d  sched.o-after-v1-fgs (v1 of Patch 3/4)
  40303    3482   308   44093   ac3d  sched.o-after-v2-fgs (v2 of Patch 3/4)


Changes since V2 of this patchset [1]

- Split the patches better and make them pass under checkpatch.pl
  script
- Fixed compile issues under different config options and also
  a suspend failure (as posted by Ingo at [2])
- Make load_balance_monitor thread run as real-time task,
  so that its execution is not limited by shares allocated to
  default task group (init_task_group).
- Reduced minimum shares that can be allocated to a group to 1
  (from 100). Would be useful if someone wants a task group
  to get very low bandwidth or get bandwidth only when other groups
  are idle.
- Removed check for tg->last_total_load check in rebalance_shares()
  (which was incorrect in V2)

Changes since V1 of this patchset [3]:

- Introduced a task_group_mutex to serialize add/removal of task groups (as 
  pointed by Dipankar)

Please apply if there are no major concerns.


References:

1. http://marc.info/?l=linux-kernel&m=119549585223262
2. http://lkml.org/lkml/2007/11/19/127
3. http://marc.info/?l=linux-kernel&m=119547452517055


-- 
Regards,
vatsa


Re: [patch/backport] CFS scheduler, -v24, for v2.6.24-rc3, v2.6.23.8, v2.6.22.13, v2.6.21.7

2007-11-20 Thread Srivatsa Vaddagiri
On Tue, Nov 20, 2007 at 12:24:59PM +0100, Damien Wyart wrote:
> Testing sched-cfs-v2.6.24-rc3-v24.patch on top of 2.6.24-rc3-git1
> (ignored the two "already applied" messages coming from git1 commits),
> I get a 1.00 minimum load in top, coming from the load_balance_mo thread
> staying in D-state. I get this on 2 different computers with similar
> configs, so I am attaching one of them here.

Ingo,
I am reworking the fair-group load balance patches. Can you drop
what you have until I resubmit?

-- 
Regards,
vatsa


[PATCH 2/2] sched: Improve fairness of cpu allocation for task groups

2007-11-19 Thread Srivatsa Vaddagiri
  36843    2766    48   39657   9ae9  sched.o-after
  ----------------------------------------------------
    +14
  ----------------------------------------------------

CONFIG_FAIR_GROUP_SCHED defined ->

   text    data   bss     dec    hex  filename
  39019    3346   336   42701   a6cd  sched.o-before
  40206    3482   308   43996   abdc  sched.o-yes-after
  ----------------------------------------------------
  +1295
  ----------------------------------------------------

Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/sched.h |4 
 kernel/sched.c|  278 ++
 kernel/sched_fair.c   |   95 +++--
 kernel/sched_rt.c |2 
 kernel/sysctl.c   |   18 +++
 5 files changed, 342 insertions(+), 55 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -1467,6 +1467,10 @@ extern unsigned int sysctl_sched_child_r
 extern unsigned int sysctl_sched_features;
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+extern unsigned int sysctl_sched_min_bal_int_shares;
+extern unsigned int sysctl_sched_max_bal_int_shares;
+#endif
 
 int sched_nr_latency_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length,
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -168,7 +168,44 @@ struct task_group {
struct sched_entity **se;
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
+
+   /*
+* shares assigned to a task group governs how much of cpu bandwidth
+* is allocated to the group. The more shares a group has, the more is
+* the cpu bandwidth allocated to it.
+*
+* For ex, lets say that there are three task groups, A, B and C which
+* have been assigned shares 1000, 2000 and 3000 respectively. Then,
+* cpu bandwidth allocated by the scheduler to task groups A, B and C
+* should be:
+*
+*  Bw(A) = 1000/(1000+2000+3000) * 100 = 16.66%
+*  Bw(B) = 2000/(1000+2000+3000) * 100 = 33.33%
+*  Bw(C) = 3000/(1000+2000+3000) * 100 = 50%
+*
+* The weight assigned to a task group's schedulable entities on every
+* cpu (task_group.se[a_cpu]->load.weight) is derived from the task
+* group's shares. For ex: lets say that task group A has been
+* assigned shares of 1000 and there are two CPUs in a system. Then,
+*
+*  tg_A->se[0]->load.weight = tg_A->se[1]->load.weight = 1000;
+*
+* Note: It's not necessary that each of a task's group schedulable
+*   entity have the same weight on all CPUs. If the group
+*   has 2 of its tasks on CPU0 and 1 task on CPU1, then a
+*   better distribution of weight could be:
+*
+*  tg_A->se[0]->load.weight = 2/3 * 2000 = 1333
+*  tg_A->se[1]->load.weight = 1/2 * 2000 =  667
+*
+* rebalance_shares() is responsible for distributing the shares of a
+* task groups like this among the group's schedulable entities across
+* cpus.
+*
+*/
unsigned long shares;
+
+   unsigned long last_total_load;
struct rcu_head rcu;
 };
 
@@ -184,6 +221,13 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
  * a task group's cpu shares.
  */
 static DEFINE_MUTEX(task_group_mutex);
+static DEFINE_MUTEX(doms_cur_mutex); /* serialize access to doms_curr[] array */
+
+/* kernel thread that runs rebalance_shares() periodically */
+static struct task_struct *lb_monitor_task;
+
+static void set_se_shares(struct sched_entity *se, unsigned long shares);
+static int load_balance_monitor(void *unused);
 
 /* Default task group.
  * Every task in system belong to this group at bootup.
@@ -199,6 +243,8 @@ struct task_group init_task_group = {
 # define INIT_TASK_GROUP_LOAD  NICE_0_LOAD
 #endif
 
+#define MIN_GROUP_SHARES   100
+
 static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
@@ -235,11 +281,23 @@ static inline void unlock_task_group_lis
mutex_unlock(&task_group_mutex);
 }
 
+static inline void lock_doms_cur(void)
+{
+   mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+   mutex_unlock(&doms_cur_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
 static inline void lock_task_group_list(void) { }
 

[PATCH 1/2] sched: Fix minor bugs and coding style issues

2007-11-19 Thread Srivatsa Vaddagiri

Fix these minor bugs/coding-style issues:

- s/INIT_TASK_GRP_LOAD/INIT_TASK_GROUP_LOAD
- remove obsolete comment
- Use list_for_each_entry_rcu when walking task group list in
  for_each_leaf_cfs_rq() macro
- Serialize addition/removal of task groups from rq->leaf_cfs_rq_list
  by using a mutex(task_group_mutex). Use the same mutex in
  print_cfs_stats when walking the task group list.


Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>


---
 kernel/sched.c  |   55 +---
 kernel/sched_fair.c |5 +++-
 2 files changed, 35 insertions(+), 25 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
-   /* spinlock to serialize modification to shares */
-   spinlock_t lock;
struct rcu_head rcu;
 };
 
@@ -182,6 +180,11 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
 
+/* task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -191,12 +194,12 @@ struct task_group init_task_group = {
 };
 
 #ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  2*NICE_0_LOAD   /* root user's cpu share */
 #else
-# define INIT_TASK_GRP_LOADNICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  NICE_0_LOAD
 #endif
 
-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
 static inline struct task_group *task_group(struct task_struct *p)
@@ -222,9 +225,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
 }
 
+static inline void lock_task_group_list(void)
+{
+   mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+   mutex_unlock(&task_group_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -864,21 +879,6 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
update_load_add(&rq->load, p->se.load.weight);
@@ -6762,7 +6762,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
-   spin_lock_init(&init_task_group.lock);
 #endif
 
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -7002,14 +7001,17 @@ struct task_group *sched_create_group(vo
se->parent = NULL;
}
 
+   lock_task_group_list();
+
for_each_possible_cpu(i) {
rq = cpu_rq(i);
cfs_rq = tg->cfs_rq[i];
list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}
 
+   unlock_task_group_list();
+
tg->shares = NICE_0_LOAD;
-   spin_lock_init(&tg->lock);
 
return tg;
 
@@ -7055,11 +7057,15 @@ void sched_destroy_group(struct task_gro
struct cfs_rq *cfs_rq = NULL;
int i;
 
+   lock_task_group_list();
+
for_each_possible_cpu(i) {
cfs_rq = tg->cfs_rq[i];
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}
 
+   unlock_task_group_list();
+
BUG_ON(!cfs_rq);
 
/* wait for possible concurrent references to cfs_rqs complete */
@@ -7132,7 +7138,8 @@ int sched_group_set_shares(struct task_g
 {
int i;
 
-   spin_lock(&tg->lock);
+   lock_task_group_list();
+
if (tg->shares == shares)
goto done;
 
@@ -7141,7 +7148,7 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);
 
done:

[PATCH 0/2] sched: group scheduler related patches (V2)

2007-11-19 Thread Srivatsa Vaddagiri
Hi Ingo,
Here's V2 of the two patches related to group cpu scheduling I
sent a while ago:

Patch 1/2 -> Fixes minor bugs and coding style issues
Patch 2/2 -> Improves group fairness on SMP systems.

Changes since V1:
- Introduced a task_group_mutex to serialize add/removal of
  task groups (as pointed by Dipankar)

Both the patches should not have any functional/runtime effect to the scheduler
for !CONFIG_FAIR_GROUP_SCHED case.

This is also probably one of the last big changes related to group scheduler
that I had on my plate. Pls apply if there is no objection from anyone.

-- 
Regards,
vatsa


Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups

2007-11-19 Thread Srivatsa Vaddagiri
On Mon, Nov 19, 2007 at 04:22:58PM +0100, Ingo Molnar wrote:
> 
> * Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> 
> > - inc/dec_load() takes a load input instead of task pointer input as their
> >   2nd arg
> > - inc/dec_nr_running don't call inc/dec_load. Instead,
> > - enqueue/dequeue_task class callbacks call inc/dec_load
> > - [Unintended/will-fix change] min/max tunables added in
> >   /proc/sys/kernel
> > 
> > All of above changes (except last, which I will fix) should have zero 
> > functional+runtime effect for !CONFIG_FAIR_GROUP_SCHED case. So I 
> > don't see how I can split Patch 2/2 further.
> 
> ok, as long as it's NOP for the CONFIG_FAIR_GROUP_SCHED, we could try 
> it.

Ok, thx. I was beginning to make changes to avoid even the above minor
changes for the !CONFIG_FAIR_GROUP_SCHED case, but it doesn't look neat,
hence I will drop that effort.

I am fixing other problems observed with Patch 1/2 (usage of a mutex to
serialize create/destroy groups) and will resend the series very soon.

-- 
Regards,
vatsa


Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups

2007-11-19 Thread Srivatsa Vaddagiri
On Mon, Nov 19, 2007 at 02:12:01PM +0100, Ingo Molnar wrote:
> >  include/linux/sched.h |4 
> >  kernel/sched.c|  292 
> > +-
> >  kernel/sched_fair.c   |   95 ++--
> >  kernel/sched_rt.c |2 
> >  kernel/sysctl.c   |   16 ++
> >  5 files changed, 348 insertions(+), 61 deletions(-)
> 
> i'm leaning towards making this v2.6.25 material, as it affects the 
> non-group-scheduling bits too and is rather large. When i tested it, 
> group scheduling worked pretty well - at least for CPU bound tasks - and 
> on SMP too. Could we live with what we have for now and defer this patch 
> to v2.6.25?

Hi Ingo,
I would prefer this to go in 2.6.24 if possible. 2.6.24 would be the
first kernel to support a group scheduler in its entirety (user interface +
related support in scheduler) and also that works reasonably well :) It would 
also give me early test feedback.

> If not, could you split up this patch in a way to defer all 
> the FAIR_GROUP_SCHED relevant changes to a separate patch which will not 
> affect the !FAIR_GROUP_SCHED case at all? That will make the case much 
> clearer.

From my inspection, here are the changes introduced by this patch
for !CONFIG_FAIR_GROUP_SCHED case:

- inc/dec_load() take a load value instead of a task pointer as their
  2nd arg
- inc/dec_nr_running don't call inc/dec_load; instead, the
  enqueue/dequeue_task class callbacks call inc/dec_load
- [Unintended/will-fix change] min/max tunables added in
  /proc/sys/kernel

All of the above changes (except the last, which I will fix) should have zero
functional/runtime effect for the !CONFIG_FAIR_GROUP_SCHED case. So I don't see
how I can split Patch 2/2 further.

Or do you prefer I introduce #ifdef's such that even these minor changes to 
inc/dec_load are avoided for !CONFIG_FAIR_GROUP_SCHED case? That would
make the code slightly ugly I suspect.

-- 
Regards,
vatsa


Re: [PATCH 1/2] sched: Minor cleanups

2007-11-19 Thread Srivatsa Vaddagiri
On Mon, Nov 19, 2007 at 02:08:03PM +0100, Ingo Molnar wrote:
> >  #define for_each_leaf_cfs_rq(rq, cfs_rq) \
> > -   list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> > +   list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> >  
> >  /* Do the two (enqueued) entities belong to the same group ? */
> >  static inline int
> > @@ -1126,7 +1126,10 @@ static void print_cfs_stats(struct seq_f
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
> >  #endif
> > +
> > +   rcu_read_lock();
> > for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> > print_cfs_rq(m, cpu, cfs_rq);
> > +   rcu_read_unlock();
> 
> hm, why is this a cleanup?

Sorry for the wrong subject. It was supposed to include the above bug fix,
related to how we walk the task group list.

Thinking about it now, I realize that print_cfs_rq() can potentially
sleep and hence cannot be surrounded by rcu_read_lock()/unlock().

And as Dipankar just pointed out to me, sched_create/destroy_group aren't
serialized at all currently, so we need a mutex to protect them. The
same mutex can then be used when walking the list in print_cfs_stats() ..

Will send update patches soon ..


-- 
Regards,
vatsa


[PATCH 2/2] sched: Improve fairness of cpu allocation for task groups

2007-11-19 Thread Srivatsa Vaddagiri
36912   2766     48   39726   9b2e  sched.o-after
                      -----
                      +83
                      -----

CONFIG_FAIR_GROUP_SCHED defined ->

text    data    bss     dec     hex     filename
39019   3346    336     42701   a6cd    sched.o-before
40223   3482    340     44045   ac0d    sched.o-after
                        ------
                        +1344
                        ------

Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/sched.h |4 
 kernel/sched.c|  292 +-
 kernel/sched_fair.c   |   95 ++--
 kernel/sched_rt.c |2 
 kernel/sysctl.c   |   16 ++
 5 files changed, 348 insertions(+), 61 deletions(-)

Index: current/include/linux/sched.h
===
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -1467,6 +1467,10 @@ extern unsigned int sysctl_sched_child_r
 extern unsigned int sysctl_sched_features;
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+extern unsigned int sysctl_sched_min_bal_int_shares;
+extern unsigned int sysctl_sched_max_bal_int_shares;
+#endif
 
 int sched_nr_latency_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length,
Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -168,9 +168,46 @@ struct task_group {
struct sched_entity **se;
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
+
+   /* shares assigned to a task group governs how much of cpu bandwidth
+* is allocated to the group. The more shares a group has, the more is
+* the cpu bandwidth allocated to it.
+*
+* For ex, lets say that there are three task groups, A, B and C which
+* have been assigned shares 1000, 2000 and 3000 respectively. Then,
+* cpu bandwidth allocated by the scheduler to task groups A, B and C
+* should be:
+*
+*  Bw(A) = 1000/(1000+2000+3000) * 100 = 16.66%
+*  Bw(B) = 2000/(1000+2000+3000) * 100 = 33.33%
+*  Bw(C) = 3000/(1000+2000+3000) * 100 = 50%
+*
+* The weight assigned to a task group's schedulable entities on every
+* cpu (task_group.se[a_cpu]->load.weight) is derived from the task
+* group's shares. For ex: lets say that task group A has been
+* assigned shares of 1000 and there are two CPUs in a system. Then,
+*
+*  tg_A->se[0]->load.weight = tg_A->se[1]->load.weight = 1000;
+*
+* Note: It's not necessary that each of a task's group schedulable
+*   entity have the same weight on all CPUs. If the group
+*   has 2 of its tasks on CPU0 and 1 task on CPU1, then a
+*   better distribution of weight could be:
+*
+*  tg_A->se[0]->load.weight = 2/3 * 2000 = 1333
+*  tg_A->se[1]->load.weight = 1/2 * 2000 =  667
+*
+* rebalance_shares() is responsible for distributing the shares of a
+* task groups like this among the group's schedulable entities across
+* cpus.
+*
+*/
unsigned long shares;
-   /* spinlock to serialize modification to shares */
-   spinlock_t lock;
+
+   /* lock to serialize modification to shares */
+   struct mutex lock;
+
+   unsigned long last_total_load;
struct rcu_head rcu;
 };
 
@@ -182,6 +219,14 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
 
+static DEFINE_MUTEX(doms_cur_mutex); /* serialize access to doms_curr[] array */
+
+/* kernel thread that runs rebalance_shares() periodically */
+static struct task_struct *lb_monitor_task;
+
+static void set_se_shares(struct sched_entity *se, unsigned long shares);
+static int load_balance_monitor(void *unused);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -196,6 +241,8 @@ struct task_group init_task_group = {
 # define INIT_TASK_GROUP_LOAD  NICE_0_LOAD
 #endif
 
+#define MIN_GROUP_SHARES   100
+
 static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
@@ -222,9 +269,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
 }
 
+static inline void lock_doms_cur(void)
+{
+   mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{

[PATCH 1/2] sched: Minor cleanups

2007-11-19 Thread Srivatsa Vaddagiri

Minor scheduler cleanups:

- Fix coding style
- remove obsolete comment
- Use list_for_each_entry_rcu when walking task group list


Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   21 +++--
 kernel/sched_fair.c |5 -
 2 files changed, 7 insertions(+), 19 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
 };
 
 #ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  2*NICE_0_LOAD   /* root user's cpu share */
 #else
-# define INIT_TASK_GRP_LOADNICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  NICE_0_LOAD
 #endif
 
-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
 static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
update_load_add(&rq->load, p->se.load.weight);
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
-   list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+   list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline int
@@ -1126,7 +1126,10 @@ static void print_cfs_stats(struct seq_f
 #ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
 #endif
+
+   rcu_read_lock();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+   rcu_read_unlock();
 }
 #endif

-- 
Regards,
vatsa


[PATCH 0/2] sched: Group scheduler related patches

2007-11-19 Thread Srivatsa Vaddagiri
Hi Ingo,
Here are two patches related to group cpu scheduling.

Patch 1/2 -> Fixes minor bugs and coding style issues
Patch 2/2 -> Improves group fairness on SMP systems.

This is probably one of the last big changes related to the group scheduler
that I had on my plate. Pls apply if there is no objection from anyone.

-- 
Regards,
vatsa


Re: 2.6.24-rc1-gb4f5550 oops

2007-11-14 Thread Srivatsa Vaddagiri
On Wed, Nov 14, 2007 at 06:17:08PM +0300, Oleg Nesterov wrote:
> Suppose that old user_struct/task_group is freed/reused, and the task does

Shouldn't this old user actually be the root user_struct?

-- 
Regards,
vatsa


Re: Revert for cgroups CPU accounting subsystem patch

2007-11-12 Thread Srivatsa Vaddagiri
On Tue, Nov 13, 2007 at 12:59:59PM +0530, Balbir Singh wrote:
> Paul Menage wrote:
> > On Nov 12, 2007 11:00 PM, Balbir Singh <[EMAIL PROTECTED]> wrote:
> >> Right now, one of the limitations of the CPU controller is that
> >> the moment you create another control group, the bandwidth gets
> >> divided by the default number of shares. We can't create groups
> >> just for monitoring.
> > 
> > Could we get around this with, say, a flag that always treats a CFS
> > schedulable entity as having a weight equal to the number of runnable
> > tasks in it? So CPU bandwidth would be shared between groups in
> > proportion to the number of runnable tasks, which would distribute the
> > cycles approximately equivalently to them all being separate
> > schedulable entities.
> > 
> 
> I think it's a good hack, but not sure about the complexity to implement
> the code.

I agree that it would be adding unnecessary complexity, just to meet the
accounting needs.

Thinking of it more, this requirement to "group tasks for accounting
purposes only" may be required for other resources (mem, io, network etc) as
well. Should we have a generic accounting controller which can provide these
various resource usage stats for a group (cpu, mem etc) by iterating through the
task list for the group and summing up the corresponding stats already present
in the task structure?

-- 
Regards,
vatsa


Re: Revert for cgroups CPU accounting subsystem patch

2007-11-12 Thread Srivatsa Vaddagiri
On Mon, Nov 12, 2007 at 10:05:24PM -0800, Paul Menage wrote:
> On Nov 12, 2007 10:00 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> > On second thoughts, this may be a usefull controller of its own.
> > Say I just want to "monitor" usage (for accounting purpose) of a group of
> > tasks, but don't want to control their cpu consumption, then cpuacct
> > controller would come in handy.
> >
> 
> That's plausible, but having two separate ways of tracking and
> reporting the CPU usage of a cgroup seems wrong.
> 
> How bad would it be in your suggested case if you just give each
> cgroup the same weight?

That's still introducing a deviation from the normal behavior we would
have had, had we allowed all tasks to be part of the same "control" group/runqueue.

For ex: using nice value to vary bandwidth between tasks makes sense if
they are all part of the same group.

Also an application with more tasks will get more cpu power (as intended)
compared to another app with fewer tasks, provided they are all
part of the same group.

Regarding your concern about tracking cpu usage in different ways, it
could be mitigated if we have the cpuacct controller track usage as per the
information present in a task's sched entity structure
(tsk->se.sum_exec_runtime), i.e. call cpuacct_charge() from
__update_curr(), which would accumulate the execution time of the
group in an SMP friendly manner (i.e. dump it in a per-cpu per-group counter
first and then aggregate to a global per-group counter).

This will let account and control grouping to be independent if desired.

What do you think?

> So there would be fair scheduling between
> cgroups, which seems as reasonable as any other choice in the event
> that the CPU is contended.

-- 
Regards,
vatsa


Re: Revert for cgroups CPU accounting subsystem patch

2007-11-12 Thread Srivatsa Vaddagiri
On Mon, Nov 12, 2007 at 09:25:32PM -0800, Paul Menage wrote:
> Hi Linus,
> 
> Please can you revert commit 62d0df64065e7c135d0002f069444fbdfc64768f,
> entitled "Task Control Groups: example CPU accounting subsystem" ?

Hi Paul,
On second thoughts, this may be a useful controller of its own.
Say I just want to "monitor" usage (for accounting purposes) of a group of
tasks, but don't want to control their cpu consumption; then the cpuacct
controller would come in handy.

-- 
Regards,
vatsa


Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

2007-11-09 Thread Srivatsa Vaddagiri
On Fri, Nov 09, 2007 at 11:59:15AM +0100, Dmitry Adamushko wrote:
> > - The second problem exposed by this test is that task_new_fair()
> >   assumes that parent and child will be part of the same group (which
> >   needn't be as this test shows). As a result, cfs_rq->curr can be NULL
> >   for the child.
> 
> Would it be better, logically-wise, to use is_same_group() instead?
> Although, we can't have 2 groups with cfs_rq->curr != NULL on the same
> CPU... so if the child belongs to another group, it's cfs_rq->curr is
> automatically NULL indeed.

Yeah.. I feel safe with an explicit !curr check, perhaps with a comment like
the one below added to explain when curr can be NULL?


---
 kernel/sched_fair.c |1 +
 1 files changed, 1 insertion(+)

Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1022,6 +1022,7 @@ static void task_new_fair(struct rq *rq,
update_curr(cfs_rq);
place_entity(cfs_rq, se, 1);
 
+   /* 'curr' will be NULL if the child belongs to a different group */
if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
curr && curr->vruntime < se->vruntime) {
/*



-- 
Regards,
vatsa


Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

2007-11-09 Thread Srivatsa Vaddagiri
On Fri, Nov 09, 2007 at 09:45:21AM +0100, Dmitry Adamushko wrote:
> Humm... the 'current' is not kept within the tree but
> current->se.on_rq is supposed to be '1' ,
> so the old code looks ok to me (at least for the 'leaf' elements).

You are damned right! Sorry, my mistake with the previous analysis and
(as I now find out) testing :(

There are a couple of problems discovered by Suka's test:

- The test requires the cgroup filesystem to be mounted with
  at least the cpu and ns options (i.e. both namespace and cpu
  controllers are active in the same hierarchy).

# mkdir /dev/cpuctl
# mount -t cgroup -ocpu,ns none cpuctl
(or simply)
# mount -t cgroup none cpuctl -> Will activate all controllers
 in same hierarchy.

- The test invokes clone() with CLONE_NEWNS set. This causes a new child
  to be created, and also a new group (do_fork->copy_namespaces->ns_cgroup_clone->
  cgroup_clone), and the child is attached to the new group (cgroup_clone->
  attach_task->sched_move_task). At this point in time, the child's scheduler
  related fields are uninitialized (including its on_rq field, which it has
  inherited from the parent). As a result sched_move_task thinks it's on the
  runqueue, when it isn't.

  As a solution to this problem, I moved the sched_fork() call, which
  initializes scheduler related fields of a new task, before
  copy_namespaces(). I am not sure though whether moving it up will
  cause other side-effects. Do you see any issue?

- The second problem exposed by this test is that task_new_fair()
  assumes that parent and child will be part of the same group (which
  needn't be the case, as this test shows). As a result, cfs_rq->curr can
  be NULL for the child.

  The solution is to test for curr pointer being NULL in
  task_new_fair().


With the patch below, I could run ns_exec() fine w/o a crash.

Suka, can you verify whether this patch fixes your problem?


--

Fix copy_namespace() <-> sched_fork() dependency in do_fork, by moving
up sched_fork().

Also introduce a NULL pointer check for 'curr' in task_new_fair().

Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/fork.c   |6 +++---
 kernel/sched_fair.c |2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: current/kernel/fork.c
===
--- current.orig/kernel/fork.c
+++ current/kernel/fork.c
@@ -1121,6 +1121,9 @@ static struct task_struct *copy_process(
p->blocked_on = NULL; /* not blocked yet */
 #endif
 
+   /* Perform scheduler related setup. Assign this task to a CPU. */
+   sched_fork(p, clone_flags);
+
if ((retval = security_task_alloc(p)))
goto bad_fork_cleanup_policy;
if ((retval = audit_alloc(p)))
@@ -1210,9 +1213,6 @@ static struct task_struct *copy_process(
INIT_LIST_HEAD(&p->ptrace_children);
INIT_LIST_HEAD(&p->ptrace_list);
 
-   /* Perform scheduler related setup. Assign this task to a CPU. */
-   sched_fork(p, clone_flags);
-
/* Now that the task is set up, run cgroup callbacks if
 * necessary. We need to run them before the task is visible
 * on the tasklist. */
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1023,7 +1023,7 @@ static void task_new_fair(struct rq *rq,
place_entity(cfs_rq, se, 1);
 
if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
-   curr->vruntime < se->vruntime) {
+   curr && curr->vruntime < se->vruntime) {
/*
 * Upon rescheduling, sched_class::put_prev_task() will place
 * 'current' within the tree based on its new key value.


Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

2007-11-08 Thread Srivatsa Vaddagiri
On Thu, Nov 08, 2007 at 03:48:05PM -0800, [EMAIL PROTECTED] wrote:
> With CONFIG_FAIR_CGROUP_SCHED=y, following commands on 2.6.24-rc1 crash
> the system.

Thanks for reporting the problem. It was caused by the fact that the
current task isn't kept in its runqueue in the case of sched_fair class
tasks.

With the patch below, I could run ns_exec w/o any crash. Can you pls
verify it works for you as well?

Ingo,
Once Suka verifies that the patch fixes his crash, I would request you 
to include the same in your tree and route it to Linus.

--

The current task is not present in its runqueue in the case of sched_fair class
tasks. Take care of this fact in the rt_mutex_setprio(),
sched_setscheduler() and sched_move_task() routines.

Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>


---
 kernel/sched.c |   45 +
 1 files changed, 25 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -3986,11 +3986,13 @@ void rt_mutex_setprio(struct task_struct
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_running(rq, p);
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, p, 0);
-   if (running)
-   p->sched_class->put_prev_task(rq, p);
-   }
+   /* current task is not kept in its runqueue in case of sched_fair class.
+* Hence we need the 'on_rq?' and 'running?' tests to be separate.
+*/
+   if (running)
+   p->sched_class->put_prev_task(rq, p);
 
if (rt_prio(prio))
p->sched_class = &rt_sched_class;
@@ -3999,9 +4001,9 @@ void rt_mutex_setprio(struct task_struct
 
p->prio = prio;
 
+   if (running)
+   p->sched_class->set_curr_task(rq);
if (on_rq) {
-   if (running)
-   p->sched_class->set_curr_task(rq);
enqueue_task(rq, p, 0);
inc_load(rq, p);
/*
@@ -4298,18 +4300,20 @@ recheck:
update_rq_clock(rq);
on_rq = p->se.on_rq;
running = task_running(rq, p);
-   if (on_rq) {
+   if (on_rq)
deactivate_task(rq, p, 0);
-   if (running)
-   p->sched_class->put_prev_task(rq, p);
-   }
+   /* current task is not kept in its runqueue in case of sched_fair class.
+* Hence we need the 'on_rq?' and 'running?' tests to be separate.
+*/
+   if (running)
+   p->sched_class->put_prev_task(rq, p);
 
oldprio = p->prio;
__setscheduler(rq, p, policy, param->sched_priority);
 
+   if (running)
+   p->sched_class->set_curr_task(rq);
if (on_rq) {
-   if (running)
-   p->sched_class->set_curr_task(rq);
activate_task(rq, p, 0);
/*
 * Reschedule if we are currently running on this runqueue and
@@ -7036,19 +7040,20 @@ void sched_move_task(struct task_struct 
running = task_running(rq, tsk);
on_rq = tsk->se.on_rq;
 
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, tsk, 0);
-   if (unlikely(running))
-   tsk->sched_class->put_prev_task(rq, tsk);
-   }
+   /* current task is not kept in its runqueue in case of sched_fair class.
+* Hence we need the 'on_rq?' and 'running?' tests to be separate.
+*/
+   if (unlikely(running))
+   tsk->sched_class->put_prev_task(rq, tsk);
 
set_task_cfs_rq(tsk);
 
-   if (on_rq) {
-   if (unlikely(running))
-   tsk->sched_class->set_curr_task(rq);
+   if (unlikely(running))
+   tsk->sched_class->set_curr_task(rq);
+   if (on_rq)
enqueue_task(rq, tsk, 0);
-   }
 
 done:
task_rq_unlock(rq, &flags);


-- 
Regards,
vatsa


Re: [PATCH 2/6] sched: make sched_slice() group scheduling savvy

2007-11-01 Thread Srivatsa Vaddagiri
On Thu, Nov 01, 2007 at 01:20:08PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-11-01 at 13:03 +0100, Peter Zijlstra wrote:
> > On Thu, 2007-11-01 at 12:58 +0100, Peter Zijlstra wrote:
> > 
> > > > sched_slice() is about lantecy, its intended purpose is to ensure each
> > > > task is ran exactly once during sched_period() - which is
> > > > sysctl_sched_latency when nr_running <= sysctl_sched_nr_latency, and
> > > > otherwise linearly scales latency.
> > 
> > The thing that got my brain in a twist is what to do about the non-leaf
> > nodes, for those it seems I'm not doing the right thing - I think.
> 
> Ok, suppose a tree like so:
> 
> 
> level 2           cfs_rq
>                  A        B
> 
> level 1      cfs_rq A    cfs_rq B
>                A0        B0 - B99
> 
> 
> So for sake of determinism, we want to impose a period in which all
> level 1 tasks will have ran (at least) once.

Peter,
I fail to see why this requirement to "determine a period in
which all level 1 tasks will have ran (at least) once" is essential.

I am visualizing each of the groups as similar to Xen-like partitions
which are given fair timeslices by the hypervisor (the Linux kernel in this
case). How each partition (group in this case) manages its allocated
timeslice(s) to provide fairness to tasks within that partition/group should
not (IMHO) depend on other groups, and especially not on how many tasks the
other groups have.

For ex: before this patch, fair time would be allocated to group and
their tasks as below:

  A0     B0-B9    A0     B10-B19   A0     B20-B29
 |------|--------|------|---------|------|---------|--//--|
 0     10ms     20ms   30ms      40ms   50ms      60ms

i.e during the first 10ms allocated to group B, B0-B9 run,
during the next  10ms allocated to group B, B10-B19 run etc

What's wrong with this scheme?

By letting __sched_period() be determined for each group independently,
we are building stronger isolation between them, which is good IMO
(imagine a rogue container that does a fork bomb).

> Now what sched_slice() does is calculate the weighted proportion of the
> given period for each task to run, so that each task runs exactly once.
> 
> Now level 2, can introduce these large weight differences, which in this
> case result in 'lumps' of time.
> 
> In the given example above the weight difference is 1:100, which is
> already at the edges of what regular nice levels could do.
> 
> How about limiting the max output of sched_slice() to
> sysctl_sched_latency in order to break up these large stretches of time?
> 
> Index: linux-2.6/kernel/sched_fair.c
> ===
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -341,7 +341,7 @@ static u64 sched_slice(struct cfs_rq *cf
>   do_div(slice, cfs_rq->load.weight);
>   }
> 
> - return slice;
> + return min_t(u64, sysctl_sched_latency, slice);

Hmm, going back to the previous example I cited, this will lead to:

sched_slice(grp A) = min(20ms, 500ms) = 20ms
sched_slice(A0) = min(20ms, 500ms) = 20ms

sched_slice(grp B) = min(20ms, 500ms) = 20ms
sched_slice(B0) = min(20ms, 0.5ms) = 0.5ms

Fairness between groups and tasks would be obtained as below:

  A0      B0-B39    A0      B40-B79    A0
 |-------|---------|-------|----------|-------|
 0      20ms      40ms    60ms       80ms

which seems to be more or less what we already have w/o the patch?

-- 
Regards,
vatsa


Re: [PATCH 2/6] sched: make sched_slice() group scheduling savvy

2007-11-01 Thread Srivatsa Vaddagiri
On Wed, Oct 31, 2007 at 10:10:32PM +0100, Peter Zijlstra wrote:
> Currently the ideal slice length does not take group scheduling into account.
> Change it so that it properly takes all the runnable tasks on this cpu into
> account and caluclate the weight according to the grouping hierarchy.
> 
> Also fixes a bug in vslice which missed a factor NICE_0_LOAD.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> CC: Srivatsa Vaddagiri <[EMAIL PROTECTED]>
> ---
>  kernel/sched_fair.c |   42 +++---
>  1 file changed, 31 insertions(+), 11 deletions(-)
> 
> Index: linux-2.6/kernel/sched_fair.c
> ===
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -331,10 +331,15 @@ static u64 __sched_period(unsigned long 
>   */
>  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> - u64 slice = __sched_period(cfs_rq->nr_running);
> + unsigned long nr_running = rq_of(cfs_rq)->nr_running;
> + u64 slice = __sched_period(nr_running);
> 
> - slice *= se->load.weight;
> - do_div(slice, cfs_rq->load.weight);
> + for_each_sched_entity(se) {
> + cfs_rq = cfs_rq_of(se);
> +
> + slice *= se->load.weight;
> + do_div(slice, cfs_rq->load.weight);
> + }
> 
>   return slice;


Let's say we have two groups A and B on CPU0, of equal weight (1024).

Further,

A has 1 task (A0)
B has 1000 tasks (B0 .. B999) 

Agreed it's an extreme case, but it illustrates the problem I have in mind
with this patch.

All tasks have the same weight = 1024.

Before this patch
=

sched_slice(grp A) = 20ms * 1/2 = 10ms
sched_slice(A0) = 20ms

sched_slice(grp B) = 20ms * 1/2 = 10ms
sched_slice(B0) = (20ms * 1000/20) * 1 / 1000 = 1ms
sched_slice(B1) = ... = sched_slice(B999) = 1 ms

Fairness between groups and tasks would be obtained as below:

  A0     B0-B9    A0     B10-B19   A0     B20-B29
 |------|--------|------|---------|------|---------|--//--|
 0     10ms     20ms   30ms      40ms   50ms      60ms

After this patch


sched_slice(grp A) = (20ms * 1001/20) * 1/2 ~= 500ms
sched_slice(A0) = 500ms

sched_slice(grp B) = 500ms
sched_slice(B0) = 0.5ms 

Fairness between groups and tasks would be obtained as below:

  A0          B0 - B999          A0
 |-----------|------------------|-----------|
 0         500ms             1000ms      1500ms

Did I get it right? If so, I don't like the fact that group A is allowed to run
for a long time (500ms) before giving group B a chance.

Can I know what real problem is being addressed by this change to
sched_slice()?

-- 
Regards,
vatsa


Re: sched: fix new task startup crash

2007-10-31 Thread Srivatsa Vaddagiri
On Wed, Oct 31, 2007 at 05:00:59PM +0100, Peter Zijlstra wrote:
> Hi,

Hi Peter,

> Commit: b9dca1e0fcb696716840a3bc8f20a6941b484dbf
> 
> seems to me that by calling enqueue_fair_task() from task_new_fair() is
> wrong. The wakeup=1 in enqueue_fair_task() will cause all non-top

s/non-top/non-task?

> sched_entities to be re-positioned by place_entity().

yes, provided they weren't on the tree before (i.e. the task being
enqueued is the first task of that group on that particular cpu).

If a group doesn't have any tasks on a particular cpu, then the group's
sched entity object for that cpu will not be present in a cfs_rq
(tg->se[i]->on_rq = 0). Subsequently, after some time T, when a task gets
added to the group's cfs_rq on that cpu, the group sched entity
object also gets added to the cfs_rq.

This time T is effectively the sleep time for the group sched entity
object, when it was sleeping waiting for tasks to get added on that cpu.
Hence the enqueue of group level sched entity objects is always treated as
if they woke up from sleep (and they are repositioned in the tree via
place_entity, rather than going back to the same position (se->vruntime)
they were at when they went to sleep).

That sounds correct to me. If we don't treat the enqueue of group level
entities as "wakeup from sleep", then a group that gets a task added
on a cpu after a long time will "hog" the cpu trying to catch up for all
the lost time?

> Although the current implementation thereof seems to avoid doing
> something horrible.

-- 
Regards,
vatsa


Re: [PATCH 1/2] CFS CGroup: Code cleanup

2007-10-23 Thread Srivatsa Vaddagiri
On Tue, Oct 23, 2007 at 07:32:27PM -0700, Paul Menage wrote:
> Clean up some CFS CGroup code
> 
> - replace "cont" with "cgrp" in a few places in the CFS cgroup code, 
> - use write_uint rather than write for cpu.shares write function
> 
> Signed-off-by: Paul Menage <[EMAIL PROTECTED]>

Acked-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>

-- 
Regards,
vatsa


Re: [PATCH 2/2] CFS CGroup: Report usage

2007-10-23 Thread Srivatsa Vaddagiri
On Tue, Oct 23, 2007 at 07:28:22PM -0700, Paul Menage wrote:
> > Good point. I think we need to subtract out the time it was waiting on 
> > runqueue
> > when calculating idle time.
> >
> > |--- . . . . . . -...---|
> > t0  t1  t2   t3  t4  t5  t6
> >
> >
> > -> Running time
> > -> Waiting time (to get on the cpu)
> > -> Sleeping time (when it didn't want to run because of
> >    lack of tasks)
> >
> > So, in this case,
> >
> > idle time = (t4 - t3) / [ (t6 - t1) - (t2 - t1) - (t5 - t4) ]
> >
> 
> Do you mean (t6 - t0) where you have (t6 - t1)?

Ah ..yes.

> > ?
> >
> > This idle time will be a per-cpu stat for every cgroup and needs to be
> > consolidated across cpus into a single idle-stat number, just like how
> > top does it.
> 
> This would be an idle fraction, not an idle time. (seconds divided by seconds)

agreed, we need to be reporting idle time in (milli)seconds, although
the requirement we had was to report it back in percentage. I guess the
percentage figure can be derived from the raw idle time number.

How about:

idle time = t4-t3 (effectively sleep time)

in the above example?

> It doesn't seem quite right to me that a cgroup's idle time metric be
> affected by the activity of other cgroups on the machine,

I don't see how the idle time metric defined above (t4-t3) can be
affected by other cgroup activity, unless the execution pattern of one
cgroup is dependent on the others.

However, the minute you translate this idle time into a percentage wrt
wall-clock time, then yes, a cgroup's idle percentage can be affected by
others. For the idle percentage to be meaningful, I would imagine that
user-space tools will need to calculate it after discarding a group's
wait time.

> but it's
> hard to come up with a way of measuring it that doesn't have this
> behaviour - which is why, in the absence of hard CPU partitioning,
> it's not clear to me how much use this would be.
> 
> What would people be planning to use it for?

I think primarily for systems-management tools to report back various
statistics about an OpenVZ/VServer-like container (just like top reports stats
for an OS currently). Let me check if there are other uses envisioned for
it.

-- 
Regards,
vatsa


Re: [PATCH 2/2] CFS CGroup: Report usage

2007-10-23 Thread Srivatsa Vaddagiri
On Tue, Oct 23, 2007 at 09:41:49AM -0700, Paul Menage wrote:
> > > Adds a cpu.usage file to the CFS cgroup that reports CPU usage in
> > > milliseconds for that cgroup's tasks
> >
> > It would be nice to split this into user and sys time at some point.
> 
> Sounds reasonable - but does CFS track this?

No, not for a group. We could extend account_user_time() and
account_system_time() in this regard.

> > We have also received request to provide idle time for a
> > container/cgroup.
> 
> The semantics of "idle time" for a cgroup on a shared system seem a
> bit fuzzy. How would you define it?

I think the percentage of time when it didn't have any runnable task in
its runqueues.

> Suppose you have two cgroups that would each want to use, say, 55% of
> a CPU - technically they should each be regarded as having 45% idle
> time, but if they run on a the same CPU the chances are that they will
> both always have some processes on their runqueue due to contention
> with the other group. So how would you measure the difference between
> this and a cgroup that really is trying to use 100%?

Good point. I think we need to subtract out the time it was waiting on runqueue
when calculating idle time.

|--- . . . . . . -...---|
t0  t1  t2   t3  t4  t5  t6


-> Running time
-> Waiting time (to get on the cpu)
-> Sleeping time (when it didn't want to run because of
   lack of tasks)

So, in this case,

idle time = (t4 - t3) / [ (t6 - t1) - (t2 - t1) - (t5 - t4) ]

?

This idle time will be a per-cpu stat for every cgroup and needs to be
consolidated across cpus into a single idle-stat number, just like how
top does it.

-- 
Regards,
vatsa


Re: [PATCH 2/2] CFS CGroup: Report usage

2007-10-23 Thread Srivatsa Vaddagiri
On Mon, Oct 22, 2007 at 05:49:39PM -0700, Paul Menage wrote:
> Report CPU usage in CFS Cgroup directories
> 
> Adds a cpu.usage file to the CFS cgroup that reports CPU usage in
> milliseconds for that cgroup's tasks

It would be nice to split this into user and sys time at some point.
We have also received a request to provide idle time for a
container/cgroup. I presume these will need to be provided as additional
files (in addition to cpu.usage, that is)?

-- 
Regards,
vatsa


Re: [PATCH 2/2] CFS CGroup: Report usage

2007-10-23 Thread Srivatsa Vaddagiri
On Mon, Oct 22, 2007 at 05:49:39PM -0700, Paul Menage wrote:
> +	for_each_possible_cpu(i) {
> +		unsigned long flags;
> +		spin_lock_irqsave(&tg->cfs_rq[i]->rq->lock, flags);

A simpler form of this would be just:

spin_lock_irqsave(&cpu_rq(i)->lock, flags);

-- 
Regards,
vatsa


Re: [PATCH 2/2] CFS CGroup: Report usage

2007-10-23 Thread Srivatsa Vaddagiri
On Mon, Oct 22, 2007 at 11:06:54PM -0700, Paul Menage wrote:
> > > +	for_each_possible_cpu(i) {
> > > +		unsigned long flags;
> > > +		spin_lock_irqsave(&tg->cfs_rq[i]->rq->lock, flags);
> >
> > Is the lock absolutely required here?
> 
> I'm not sure, I was hoping you or Ingo could comment on this. But some
> kind of locking seems to required at least on 32-bit platforms, since
> sum_exec_runtime is a 64-bit number.

I tend to agree that 32-bit platforms require a lock to read the 64-bit
sum_exec_runtime field.

Ingo/Dmitry, what do you think? fs/proc/array.c:task_utime() is also
buggy in that case.

-- 
Regards,
vatsa


Re: [PATCH 1/2] CFS CGroup: Code cleanup

2007-10-22 Thread Srivatsa Vaddagiri
On Mon, Oct 22, 2007 at 05:49:20PM -0700, Paul Menage wrote:
> Clean up some CFS CGroup code
> 
> - replace "cont" with "cgrp" in a few places in the CFS cgroup code, 

This change looks good to me. Thanks for doing it.

> - use write_uint rather than write for cpu.shares write function

Minor nit: from the PoV of making this patch series git-bisect safe, shouldn't we
be registering a write_uint() handler in this patch as well?
 
-- 
Regards,
vatsa


Re: [PATCH 2/2] CFS CGroup: Report usage

2007-10-22 Thread Srivatsa Vaddagiri
On Mon, Oct 22, 2007 at 05:49:39PM -0700, Paul Menage wrote:
> +static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct task_group *tg = cgroup_tg(cgrp);
> +	int i;
> +	u64 res = 0;
> +	for_each_possible_cpu(i) {
> +		unsigned long flags;
> +		spin_lock_irqsave(&tg->cfs_rq[i]->rq->lock, flags);

Is the lock absolutely required here?

Hmm .. I hope the cgroup code prevents a task group from being destroyed while
we are still reading a task group's cpu usage. Is that so?

> +		res += tg->se[i]->sum_exec_runtime;
> +		spin_unlock_irqrestore(&tg->cfs_rq[i]->rq->lock, flags);
> +	}
> +	/* Convert from ns to ms */
> +	do_div(res, 1000000);
> +	return res;
> +}

-- 
Regards,
vatsa


Re: Buggy idle time reported in /proc/uptime for SMP systems?

2007-10-18 Thread Srivatsa Vaddagiri
On Thu, Oct 18, 2007 at 09:55:13PM +0530, Srivatsa Vaddagiri wrote:
> The idle time reported in /proc/uptime (2nd field) seems to be the idle
> time of only CPU0. Shouldn't it reflect the total idle time on a SMP system? 
> Do folks agree that it is a bug that needs to be fixed? If so, will send
> a patch ...

On further reflection, it doesn't sound buggy. Reporting idle time across all
cpus in /proc/uptime will cause idle time > uptime, which will be odd. I
guess if anyone wants strict idle time from all cpus, they can always
look at /proc/stat.

> Fyi, I see patches submitted to fix this way back in 2001
> [http://lkml.org/lkml/2001/3/17/55], but I don't think they made it in ..

-- 
Regards,
vatsa

