Re: [RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
* Jan Kiszka  [2020-04-30 14:59:50]:

> >I believe ivshmem2_virtio requires hypervisor to support PCI device emulation
> >(for life-cycle management of VMs), which our hypervisor may not support. A
> >simple shared memory and doorbell or message-queue based transport will work 
> >for
> >us.
> 
> As written in our private conversation, a mapping of the ivshmem2 device
> discovery to platform mechanism (device tree etc.) and maybe even the
> register access for doorbell and life-cycle management to something
> hypercall-like would be imaginable. What would count more from virtio
> perspective is a common mapping on a shared memory transport.

Yes, that sounds simpler for us.

> That said, I also warned about all the features that PCI already defined
> (such as message-based interrupts) which you may have to add when going a
> different way for the shared memory device.

Is it really required to present this shared memory as belonging to a PCI
device? I would expect the device-tree to indicate the presence of this shared
memory region, which we should be able to present to ivshmem2 as the shared
memory region to use (along with some handles for doorbell or message-queue use).

I understand the usefulness of modeling the shared memory as part of a device, so
that the hypervisor can send events related to peers going down or coming up. In our
case, there will be other means to discover those events, and avoiding this
requirement on the hypervisor (to emulate PCI) will simplify the solution for us.

Any idea when we can expect virtio over ivshmem2 to become available?!
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:41:50]:

> On Thu, Apr 30, 2020 at 04:04:46PM +0530, Srivatsa Vaddagiri wrote:
> > If CONFIG_VIRTIO_MMIO_OPS is defined, then I expect this to be 
> > unconditionally
> > set to 'magic_qcom_ops' that uses hypervisor-supported interface for IO (for
> > example: message_queue_send() and message_queue_receive() hypercalls).
> 
> Hmm, but then how would such a kernel work as a guest under all the
> spec-compliant hypervisors out there?

Ok, I see your point. Yes, for better binary compatibility, the ops would have to be
selected based on runtime detection of hypervisor capabilities.
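
Something along these lines, perhaps (just a sketch; the detection hook and the
hypervisor-specific ops names below are placeholders for whatever we end up with):

static void __init virtio_mmio_select_ops(void)
{
        /*
         * Default to the spec-compliant readl()/writel() based ops and switch
         * to the hypervisor-specific ops only when the platform tells us that
         * standard MMIO trapping is unavailable.
         */
        if (hypervisor_has_mmio_trap())
                register_virtio_mmio_ops(&spec_compliant_mmio_ops);
        else
                register_virtio_mmio_ops(&magic_qcom_ops);
}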

> > Ok. I guess the other option is to standardize on a new virtio transport 
> > (like
> > ivshmem2-virtio)?
> 
> I haven't looked at that, but I suppose it depends on what your hypervisor
> folks are willing to accommodate.

I believe ivshmem2-virtio requires the hypervisor to support PCI device emulation
(for life-cycle management of VMs), which our hypervisor may not support. A
simple shared-memory and doorbell or message-queue based transport will work for
us.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:39:19]:

> Hi Vatsa,
> 
> On Thu, Apr 30, 2020 at 03:59:39PM +0530, Srivatsa Vaddagiri wrote:
> > > What's stopping you from implementing the trapping support in the
> > > hypervisor? Unlike the other patches you sent out, where the guest memory
> > > is not accessible to the host, there doesn't seem to be any advantage to
> > > not having trapping support, or am I missing something here?
> > 
> > I have had this discussion with hypervisor folks. They seem to be
> > concerned about complexity of having a VM's fault be handled in another
> > untrusted VM. They are not keen to add MMIO support.
> 
> Right, but I'm concerned about forking the implementation from the spec
> and I'm not keen to add these hooks ;)
> 
> What does your hook actually do? I'm assuming an HVC? 

Yes, it will issue message-queue related hypercalls.

> If so, then where the
> fault is handled seems to be unrelated and whether the guest exit is due to
> an HVC or a stage-2 fault should be immaterial. 

A stage-2 fault would normally be considered fatal and result in termination of the
guest. Modifying that behavior to allow resumption in the case of a virtio config
space access, and especially involving the untrusted VM in that flow, is
perhaps the concern. HVC calls, OTOH, are already well-vetted interfaces that the
hypervisor needs to do nothing additional to handle.

> In other words, I don't
> follow why the trapping mechanism necessitates the way in which the fault is
> handled.

Let me check with our hypervisor folks again. Thanks for your inputs.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-30 06:07:56]:

> On Thu, Apr 30, 2020 at 03:32:55PM +0530, Srivatsa Vaddagiri wrote:
> > The Type-1 hypervisor we are dealing with does not allow for MMIO 
> > transport. 
> 
> How about PCI then?

Correct me if I am wrong, but virtio_pci basically uses the same low-level
primitives (readl/writel) on a platform such as ARM64? So we would have similar
issues there as well.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:14:32]:

> > +#ifdef CONFIG_VIRTIO_MMIO_OPS
> >  
> > +static struct virtio_mmio_ops *mmio_ops;
> > +
> > +#define virtio_readb(a)        mmio_ops->mmio_readb((a))
> > +#define virtio_readw(a)        mmio_ops->mmio_readw((a))
> > +#define virtio_readl(a)        mmio_ops->mmio_readl((a))
> > +#define virtio_writeb(val, a)  mmio_ops->mmio_writeb((val), (a))
> > +#define virtio_writew(val, a)  mmio_ops->mmio_writew((val), (a))
> > +#define virtio_writel(val, a)  mmio_ops->mmio_writel((val), (a))
> 
> How exactly are these ops hooked up? I'm envisaging something like:
> 
>   ops = spec_compliant_ops;
>   [...]
>   if (firmware_says_hypervisor_is_buggy())
>   ops = magic_qcom_ops;
> 
> am I wrong?

If CONFIG_VIRTIO_MMIO_OPS is defined, then I expect this to be unconditionally
set to 'magic_qcom_ops', which uses a hypervisor-supported interface for IO (for
example: message_queue_send() and message_queue_receive() hypercalls).

> > +int register_virtio_mmio_ops(struct virtio_mmio_ops *ops)
> > +{
> > +   pr_info("Registered %s as mmio ops\n", ops->name);
> > +   mmio_ops = ops;
> 
> Not looking good, and really defeats the point of standardising this stuff
> imo.

Ok. I guess the other option is to standardize on a new virtio transport (like
ivshmem2-virtio)?

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
* Will Deacon  [2020-04-30 11:08:22]:

> > This patch is meant to seek comments. If its considered to be in right
> > direction, will work on making it more complete and send the next version!
> 
> What's stopping you from implementing the trapping support in the
> hypervisor? Unlike the other patches you sent out, where the guest memory
> is not accessible to the host, there doesn't seem to be any advantage to
> not having trapping support, or am I missing something here?

Hi Will,
I have had this discussion with the hypervisor folks. They seem to be
concerned about the complexity of having a VM's fault handled in another,
untrusted VM. They are not keen to add MMIO support.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[RFC/PATCH 1/1] virtio: Introduce MMIO ops

2020-04-30 Thread Srivatsa Vaddagiri
Some hypervisors may not support the MMIO transport, i.e. trapping
config space accesses and having them handled by the backend
driver. They may instead allow other ways to interact with the
backend, such as a message-queue or doorbell API. This patch
allows for hypervisor-specific methods for config space IO.

Signed-off-by: Srivatsa Vaddagiri 
---
 drivers/virtio/virtio_mmio.c | 131 ++-
 include/linux/virtio.h   |  14 +
 2 files changed, 94 insertions(+), 51 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 97d5725..69bfa35 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -100,7 +100,35 @@ struct virtio_mmio_vq_info {
struct list_head node;
 };
 
+#ifdef CONFIG_VIRTIO_MMIO_OPS
 
+static struct virtio_mmio_ops *mmio_ops;
+
+#define virtio_readb(a)        mmio_ops->mmio_readb((a))
+#define virtio_readw(a)        mmio_ops->mmio_readw((a))
+#define virtio_readl(a)        mmio_ops->mmio_readl((a))
+#define virtio_writeb(val, a)  mmio_ops->mmio_writeb((val), (a))
+#define virtio_writew(val, a)  mmio_ops->mmio_writew((val), (a))
+#define virtio_writel(val, a)  mmio_ops->mmio_writel((val), (a))
+
+int register_virtio_mmio_ops(struct virtio_mmio_ops *ops)
+{
+   pr_info("Registered %s as mmio ops\n", ops->name);
+   mmio_ops = ops;
+
+   return 0;
+}
+
+#else  /* CONFIG_VIRTIO_MMIO_OPS */
+
+#define virtio_readb(a)        readb((a))
+#define virtio_readw(a)        readw((a))
+#define virtio_readl(a)        readl((a))
+#define virtio_writeb(val, a)  writeb((val), (a))
+#define virtio_writew(val, a)  writew((val), (a))
+#define virtio_writel(val, a)  writel((val), (a))
+
+#endif /* CONFIG_VIRTIO_MMIO_OPS */
 
 /* Configuration interface */
 
@@ -109,12 +137,12 @@ static u64 vm_get_features(struct virtio_device *vdev)
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
u64 features;
 
-   writel(1, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
-   features = readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
+   virtio_writel(1, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
+   features = virtio_readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
features <<= 32;
 
-   writel(0, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
-   features |= readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
+   virtio_writel(0, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL);
+   features |= virtio_readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES);
 
return features;
 }
@@ -133,12 +161,12 @@ static int vm_finalize_features(struct virtio_device *vdev)
return -EINVAL;
}
 
-   writel(1, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
-   writel((u32)(vdev->features >> 32),
+   virtio_writel(1, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
+   virtio_writel((u32)(vdev->features >> 32),
vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES);
 
-   writel(0, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
-   writel((u32)vdev->features,
+   virtio_writel(0, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL);
+   virtio_writel((u32)vdev->features,
vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES);
 
return 0;
@@ -158,25 +186,25 @@ static void vm_get(struct virtio_device *vdev, unsigned offset,
int i;
 
for (i = 0; i < len; i++)
-   ptr[i] = readb(base + offset + i);
+   ptr[i] = virtio_readb(base + offset + i);
return;
}
 
switch (len) {
case 1:
-   b = readb(base + offset);
+   b = virtio_readb(base + offset);
memcpy(buf, &b, sizeof b);
break;
case 2:
-   w = cpu_to_le16(readw(base + offset));
+   w = cpu_to_le16(virtio_readw(base + offset));
memcpy(buf, &w, sizeof w);
break;
case 4:
-   l = cpu_to_le32(readl(base + offset));
+   l = cpu_to_le32(virtio_readl(base + offset));
memcpy(buf, &l, sizeof l);
break;
case 8:
-   l = cpu_to_le32(readl(base + offset));
+   l = cpu_to_le32(virtio_readl(base + offset));
memcpy(buf, &l, sizeof l);
l = cpu_to_le32(ioread32(base + offset + sizeof l));
memcpy(buf + sizeof l, &l, sizeof l);
@@ -200,7 +228,7 @@ static void vm_set(struct virtio_device *vdev, unsigned offset,
int i;
 
for (i = 0; i < len; i++)
-   writeb(ptr[i], base + offset + i);
+   virtio_writeb(ptr[i], base + offset + i);
 
return;
   

[RFC/PATCH 0/1] virtio_mmio: hypervisor specific interfaces for MMIO

2020-04-30 Thread Srivatsa Vaddagiri
The Type-1 hypervisor we are dealing with does not allow for the MMIO transport.
[1] summarizes some of the problems we have in making virtio work on such
hypervisors. This patch proposes a solution for the transport problem, viz. how we
can do config space IO on such a hypervisor. The hypervisor-specific methods
introduced allow for seamless IO of the config space.

This patch is meant to seek comments. If it's considered to be in the right
direction, I will work on making it more complete and send the next version!

1. https://lkml.org/lkml/2020/4/28/427

Srivatsa Vaddagiri (1):
  virtio: Introduce MMIO ops

 drivers/virtio/virtio_mmio.c | 131 ++-
 include/linux/virtio.h   |  14 +
 2 files changed, 94 insertions(+), 51 deletions(-)

-- 
2.7.4

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-29 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-29 06:20:48]:

> On Wed, Apr 29, 2020 at 03:39:53PM +0530, Srivatsa Vaddagiri wrote:
> > That would still not work I think where swiotlb is used for pass-thr devices
> > (when private memory is fine) as well as virtio devices (when shared memory 
> > is
> > required).
> 
> So that is a separate question. When there are multiple untrusted
> devices, at the moment it looks like a single bounce buffer is used.
> 
> Which to me seems like a security problem, I think we should protect
> untrusted devices from each other.

I think as a first step, let me see if we can make the swiotlb driver accept a target
memory segment as its working area. That may suffice for our needs. A
subsequent step could be to make the swiotlb driver recognize multiple pools.
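
With the interfaces in the series as posted, that first step could boil down to
registering the shared segment as a pool and pointing the virtio bounce code at
it, roughly like this (seg_paddr/seg_size stand in for however the hypervisor
advertises the segment):

        void *vaddr = memremap(seg_paddr, seg_size, MEMREMAP_WB);
        struct swiotlb_pool *pool;

        /* make the shared segment the working area for virtio bouncing */
        pool = swiotlb_register_pool("virtio-bounce", seg_paddr, vaddr, seg_size);
        if (IS_ERR(pool))
                return PTR_ERR(pool);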


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-29 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-29 05:52:05]:

> > > So it seems that with modern Linux, all one needs
> > > to do on x86 is mark the device as untrusted.
> > > It's already possible to do this with ACPI and with OF - would that be
> > > sufficient for achieving what this patchset is trying to do?
> > 
> > In my case, its not sufficient to just mark virtio device untrusted and thus
> > activate the use of swiotlb. All of the secondary VM memory, including those
> > allocate by swiotlb driver, is private to it.
> 
> So why not make the bounce buffer memory shared then?

It's a limitation of our hypervisor. When a secondary VM is created, two
memory segments are allocated - one private and the other shared. There is no
provision for the secondary VM to make part of its private memory shared after
it boots. I can perhaps consider a change in the swiotlb driver to accept the
shared memory segment as its main working area (rather than allocating its own).

That would still not work, I think, where swiotlb is used both for pass-through
devices (when private memory is fine) and for virtio devices (when shared memory
is required).

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-29 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-29 02:50:41]:

> So it seems that with modern Linux, all one needs
> to do on x86 is mark the device as untrusted.
> It's already possible to do this with ACPI and with OF - would that be
> sufficient for achieving what this patchset is trying to do?

In my case, it's not sufficient to just mark the virtio device untrusted and thus
activate the use of swiotlb. All of the secondary VM's memory, including that
allocated by the swiotlb driver, is private to it. An additional piece of memory
is available to the secondary VM which is shared between the VMs, and that is
where I need the swiotlb driver to do its work.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
* Stefano Stabellini  [2020-04-28 16:04:34]:

> > > Is swiotlb commonly used for multiple devices that may be on different 
> > > trust
> > > boundaries (and not behind a hardware iommu)?
> 
> The trust boundary is not a good way of describing the scenario and I
> think it leads to miscommunication.
> 
> A better way to describe the scenario would be that the device can only
> DMA to/from a small reserved-memory region advertised on device tree.
> 
> Do we have other instances of devices that can only DMA to/from very
> specific and non-configurable address ranges? If so, this series could
> follow their example.

AFAICT there is no such notion in the current DMA API.

static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size,
                bool is_ram)
{
        dma_addr_t end = addr + size - 1;

        return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
}

Only the max address a device can access is defined, and not a range, which is
what we seem to need here. I think we need to set bus_dma_limit to 0 for virtio
devices, which will force the use of the swiotlb_map API. We should also have a
per-device swiotlb pool defined, so that swiotlb can use the pool meant for the
given device.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-28 16:41:04]:

> > Won't we still need some changes to virtio to make use of its own pool (to
> > bounce buffers)? Something similar to its own DMA ops proposed in this 
> > patch?
> 
> If you are doing this for all devices, you need to either find a way
> to do this without chaning DMA ops, or by doing some automatic change
> to all drivers.

Ok, thanks for this input. I will see how we can hide this within the DMA APIs
themselves.

Can you also comment on the virtio transport problem I cited? The hypervisor we
are dealing with does not support the MMIO transport. It supports message-queue
send/recv and also doorbells, which I think can be used if we can make a change
like this to virtio_mmio.c:

+static inline u32
+virtio_readl(struct virtio_mmio_device *vm_dev, u32 reg_offset)
+{
+        return vm_dev->mmio_ops->readl(vm_dev, reg_offset);
+}
+
+static inline void
+virtio_writel(struct virtio_mmio_device *vm_dev, u32 reg_offset, u32 data)
+{
+        vm_dev->mmio_ops->writel(vm_dev, reg_offset, data);
+}


/* Check magic value */
-magic = readl(vm_dev->base + VIRTIO_MMIO_MAGIC_VALUE);
+magic = virtio_readl(vm_dev, VIRTIO_MMIO_MAGIC_VALUE);

mmio_ops->readl on most platforms can default to readl itself, while on a
platform like ours it can boil down to message-queue send/recv. Would such a
change be acceptable?
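
For reference, the default (spec-compliant) implementation of those ops would
just wrap the existing accessors, something like this (a sketch, field names as
in the snippet above):

static u32 default_mmio_readl(struct virtio_mmio_device *vm_dev, u32 reg_offset)
{
        return readl(vm_dev->base + reg_offset);
}

static void default_mmio_writel(struct virtio_mmio_device *vm_dev,
                                u32 reg_offset, u32 data)
{
        writel(data, vm_dev->base + reg_offset);
}

static const struct virtio_mmio_ops default_mmio_ops = {
        .readl  = default_mmio_readl,
        .writel = default_mmio_writel,
};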

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
* Michael S. Tsirkin  [2020-04-28 12:17:57]:

> Okay, but how is all this virtio specific?  For example, why not allow
> separate swiotlbs for any type of device?
> For example, this might make sense if a given device is from a
> different, less trusted vendor.

Is swiotlb commonly used for multiple devices that may be on different trust
boundaries (and not behind a hardware iommu)? If so, then yes it sounds like a
good application of multiple swiotlb pools.

> All this can then maybe be hidden behind the DMA API.

Won't we still need some changes to virtio to make use of its own pool (to
bounce buffers)? Something similar to its own DMA ops proposed in this patch?

> > +void virtio_bounce_set_dma_ops(struct virtio_device *vdev)
> > +{
> > +   if (!bounce_buf_paddr)
> > +   return;
> > +
> > +   set_dma_ops(vdev->dev.parent, _dma_ops);
> 
> 
> I don't think DMA API maintainers will be happy with new users
> of set_dma_ops.

Is there an alternate API that is preferred?

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[PATCH 3/5] swiotlb: Add alloc and free APIs

2020-04-28 Thread Srivatsa Vaddagiri
Move the memory allocation and free portions of the swiotlb driver
into independent routines. They will be useful for drivers that
need the swiotlb driver to just allocate/free memory chunks and
not additionally bounce memory.
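
For instance, a driver that only needs chunks of bounce-capable memory (and no
bouncing) could then do something like the below ('pool' is a handle obtained via
the pool-registration interface added later in this series, 'pool_start_paddr'
its start address; sizes are illustrative):

        phys_addr_t buf;

        buf = swiotlb_alloc(pool, PAGE_SIZE, pool_start_paddr, ULONG_MAX);
        if (buf == (phys_addr_t)DMA_MAPPING_ERROR)
                return -ENOMEM;

        /* ... use the chunk ... */

        swiotlb_free(pool, buf, PAGE_SIZE);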

Signed-off-by: Srivatsa Vaddagiri 
---
 include/linux/swiotlb.h |  17 ++
 kernel/dma/swiotlb.c| 151 
 2 files changed, 106 insertions(+), 62 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index c634b4d..957697e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -186,6 +186,10 @@ void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(void);
+extern phys_addr_t swiotlb_alloc(struct swiotlb_pool *pool, size_t alloc_size,
+   unsigned long tbl_dma_addr, unsigned long mask);
+extern void swiotlb_free(struct swiotlb_pool *pool,
+   phys_addr_t tlb_addr, size_t alloc_size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
 
@@ -219,6 +223,19 @@ static inline bool is_swiotlb_active(void)
 {
return false;
 }
+
+static inline phys_addr_t swiotlb_alloc(struct swiotlb_pool *pool,
+   size_t alloc_size, unsigned long tbl_dma_addr,
+   unsigned long mask)
+{
+   return DMA_MAPPING_ERROR;
+}
+
+static inline void swiotlb_free(struct swiotlb_pool *pool,
+   phys_addr_t tlb_addr, size_t alloc_size)
+{
+}
+
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8cf0b57..7411ce5 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -444,37 +444,14 @@ static inline void *tlb_vaddr(struct swiotlb_pool *pool, phys_addr_t tlb_addr)
return pool->io_tlb_vstart + (tlb_addr - pool->io_tlb_start);
 }
 
-phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
-  struct device *hwdev,
-  dma_addr_t tbl_dma_addr,
-  phys_addr_t orig_addr,
-  size_t mapping_size,
-  size_t alloc_size,
-  enum dma_data_direction dir,
-  unsigned long attrs)
+phys_addr_t swiotlb_alloc(struct swiotlb_pool *pool, size_t alloc_size,
+   unsigned long tbl_dma_addr, unsigned long mask)
 {
unsigned long flags;
phys_addr_t tlb_addr;
-   unsigned int nslots, stride, index, wrap;
-   int i;
-   unsigned long mask;
+   unsigned int i, nslots, stride, index, wrap;
unsigned long offset_slots;
unsigned long max_slots;
-   unsigned long tmp_io_tlb_used;
-
-   if (pool->no_iotlb_memory)
-   panic("Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer");
-
-   if (mem_encrypt_active())
-   pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
-
-   if (mapping_size > alloc_size) {
-   dev_warn_once(hwdev, "Invalid sizes (mapping: %zd bytes, alloc: %zd bytes)",
- mapping_size, alloc_size);
-   return (phys_addr_t)DMA_MAPPING_ERROR;
-   }
-
-   mask = dma_get_seg_boundary(hwdev);
 
tbl_dma_addr &= mask;
 
@@ -555,54 +532,23 @@ phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
} while (index != wrap);
 
 not_found:
-   tmp_io_tlb_used = pool->io_tlb_used;
-
spin_unlock_irqrestore(&pool->io_tlb_lock, flags);
-   if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
-   dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-alloc_size, pool->io_tlb_nslabs, tmp_io_tlb_used);
return (phys_addr_t)DMA_MAPPING_ERROR;
+
 found:
pool->io_tlb_used += nslots;
spin_unlock_irqrestore(&pool->io_tlb_lock, flags);
 
-   /*
-* Save away the mapping from the original address to the DMA address.
-* This is needed when we sync the memory.  Then we sync the buffer if
-* needed.
-*/
-   for (i = 0; i < nslots; i++)
-   pool->io_tlb_orig_addr[index+i] = orig_addr +
-   (i << IO_TLB_SHIFT);
-   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-   (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
-   mapping_size, DMA_TO_DEVICE);
-
return tlb_addr;
 }
 
-/*
- * tlb_addr is the physical address of the bounce buffer to unmap.
- */
-void _swiotlb_tbl_unmap_single(struct swiotlb_pool *pool,
-   

[PATCH 4/5] swiotlb: Add API to register new pool

2020-04-28 Thread Srivatsa Vaddagiri
This patch adds an interface for the swiotlb driver to recognize
a new memory pool. Upon successful initialization of the pool,
swiotlb returns a handle, which needs to be passed as an argument
for any future operations on the pool (map/unmap/alloc/free).

Signed-off-by: Srivatsa Vaddagiri 
---
 include/linux/swiotlb.h |  9 
 kernel/dma/swiotlb.c| 60 +
 2 files changed, 69 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 957697e..97ac82a 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -182,6 +182,9 @@ static inline bool is_swiotlb_buffer(phys_addr_t paddr)
return paddr >= swiotlb_start() && paddr < swiotlb_end();
 }
 
+extern struct swiotlb_pool *swiotlb_register_pool(char *name,
+   phys_addr_t start, void *vstart, size_t size);
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
@@ -236,6 +239,12 @@ static inline void swiotlb_free(struct swiotlb_pool *pool,
 {
 }
 
+static inline struct swiotlb_pool *swiotlb_register_pool(char *name,
+   phys_addr_t start, void *vstart, size_t size)
+{
+   return NULL;
+}
+
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 7411ce5..9883744 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef CONFIG_DEBUG_FS
 #include 
 #endif
@@ -736,6 +737,65 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE;
 }
 
+struct swiotlb_pool *swiotlb_register_pool(char *name, phys_addr_t start,
+   void *vstart, size_t size)
+{
+   struct swiotlb_pool *pool;
+   unsigned long i, bytes;
+   unsigned long nslabs;
+
+   nslabs = size >> IO_TLB_SHIFT;
+   if (!nslabs)
+   return ERR_PTR(-EINVAL);
+
+   pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+   if (!pool)
+   return ERR_PTR(-ENOMEM);
+
+   bytes = nslabs << IO_TLB_SHIFT;
+
+   strncpy(pool->name, name, sizeof(pool->name));
+   spin_lock_init(&pool->io_tlb_lock);
+   pool->late_alloc = 1;
+   pool->io_tlb_start = start;
+   pool->io_tlb_end = start + bytes;
+   pool->io_tlb_vstart = vstart;
+   pool->io_tlb_nslabs = nslabs;
+   pool->max_segment = rounddown(bytes, PAGE_SIZE);
+
+   /*
+* Allocate and initialize the free list array.  This array is used
+* to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
+* between io_tlb_start and io_tlb_end.
+*/
+   pool->io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
+   get_order(pool->io_tlb_nslabs * sizeof(int)));
+   if (!pool->io_tlb_list)
+   goto cleanup;
+
+   pool->io_tlb_orig_addr = (phys_addr_t *)
+   __get_free_pages(GFP_KERNEL,
+get_order(pool->io_tlb_nslabs *
+  sizeof(phys_addr_t)));
+   if (!pool->io_tlb_orig_addr)
+   goto cleanup;
+
+   for (i = 0; i < pool->io_tlb_nslabs; i++) {
+   pool->io_tlb_list[i] = IO_TLB_SEGSIZE -
+   OFFSET(i, IO_TLB_SEGSIZE);
+   pool->io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
+   }
+
+   return pool;
+
+cleanup:
+   free_pages((unsigned long)pool->io_tlb_orig_addr,
+  get_order(pool->io_tlb_nslabs * sizeof(phys_addr_t)));
+   free_pages((unsigned long)pool->io_tlb_list,
+  get_order(pool->io_tlb_nslabs * sizeof(int)));
+   kfree(pool);
+
+   return ERR_PTR(-ENOMEM);
+}
+
 bool is_swiotlb_active(void)
 {
/*
-- 
2.7.4

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[PATCH 5/5] virtio: Add bounce DMA ops

2020-04-28 Thread Srivatsa Vaddagiri
For better security, it's desirable that a guest VM's memory is
not accessible to any entity that executes outside the context of
the guest VM. In the case of virtio, backend drivers execute
outside the context of the guest VM and in general will need
access to the complete guest VM memory. One option to restrict
the access provided to the backend driver is to make use of a
bounce buffer. The bounce buffer is accessible to both backend
and frontend drivers. All IO buffers that are in the private
space of the guest VM are bounced so that they are accessible to
the backend.

This patch proposes a new memory pool to be used for this bounce
purpose, rather than the default swiotlb memory pool. That will
avoid any conflicts that may arise in situations where a VM needs
to use the swiotlb pool for driving pass-through devices (in
which case the swiotlb memory need not be shared with another VM)
as well as virtio devices (which require the swiotlb memory to be
shared with the backend VM). As a possible extension to this
patch, we can provide an option for virtio to make use of the
default swiotlb memory pool itself, where no such conflicts exist
in a given deployment.

Signed-off-by: Srivatsa Vaddagiri 
---
 drivers/virtio/Makefile|   2 +-
 drivers/virtio/virtio.c|   2 +
 drivers/virtio/virtio_bounce.c | 150 +
 include/linux/virtio.h |   4 ++
 4 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 drivers/virtio/virtio_bounce.c

diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 29a1386e..3fd3515 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o
+obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o virtio_bounce.o
 obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index a977e32..bc2f779 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -329,6 +329,7 @@ int register_virtio_device(struct virtio_device *dev)
 
dev->index = err;
dev_set_name(>dev, "virtio%u", dev->index);
+   virtio_bounce_set_dma_ops(dev);
 
spin_lock_init(>config_lock);
dev->config_enabled = false;
@@ -431,6 +432,7 @@ EXPORT_SYMBOL_GPL(virtio_device_restore);
 
 static int virtio_init(void)
 {
+   virtio_map_bounce_buffer();
if (bus_register(_bus) != 0)
panic("virtio bus registration failed");
return 0;
diff --git a/drivers/virtio/virtio_bounce.c b/drivers/virtio/virtio_bounce.c
new file mode 100644
index 000..3de8e0e
--- /dev/null
+++ b/drivers/virtio/virtio_bounce.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtio DMA ops to bounce buffers
+ *
+ *  Copyright (c) 2020, The Linux Foundation. All rights reserved.
+ *
+ * This module allows bouncing of IO buffers to a region which will be
+ * accessible to backend drivers.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static phys_addr_t bounce_buf_paddr;
+static void *bounce_buf_vaddr;
+static size_t bounce_buf_size;
+struct swiotlb_pool *virtio_pool;
+
+#define VIRTIO_MAX_BOUNCE_SIZE (16*4096)
+
+static void *virtio_alloc_coherent(struct device *dev, size_t size,
+   dma_addr_t *dma_handle, gfp_t gfp_flags, unsigned long attrs)
+{
+   phys_addr_t addr;
+
+   if (!virtio_pool)
+   return NULL;
+
+   addr = swiotlb_alloc(virtio_pool, size, bounce_buf_paddr, ULONG_MAX);
+   if (addr == DMA_MAPPING_ERROR)
+   return NULL;
+
+   *dma_handle = (addr - bounce_buf_paddr);
+
+   return bounce_buf_vaddr + (addr - bounce_buf_paddr);
+}
+
+static void virtio_free_coherent(struct device *dev, size_t size, void *vaddr,
+   dma_addr_t dma_handle, unsigned long attrs)
+{
+   phys_addr_t addr = (dma_handle + bounce_buf_paddr);
+
+   swiotlb_free(virtio_pool, addr, size);
+}
+
+static dma_addr_t virtio_map_page(struct device *dev, struct page *page,
+   unsigned long offset, size_t size,
+   enum dma_data_direction dir, unsigned long attrs)
+{
+   void *ptr = page_address(page) + offset;
+   phys_addr_t paddr = virt_to_phys(ptr);
+   dma_addr_t handle;
+
+   if (!virtio_pool)
+   return DMA_MAPPING_ERROR;
+
+   handle = _swiotlb_tbl_map_single(virtio_pool, dev, bounce_buf_paddr,
+paddr, size, size, dir, attrs);
+   if (handle == (phys_addr_t)DMA_MAPPING_ERROR)
+   return DMA_MAPPING_ERROR;
+
+   return handle - bounce_buf_paddr;
+}
+
+static void virtio_unmap_page(struct device *dev, dma_addr_t dev_addr,
+   size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+   phys_addr_t

[PATCH 1/5] swiotlb: Introduce concept of swiotlb_pool

2020-04-28 Thread Srivatsa Vaddagiri
Currently the swiotlb driver manages a global pool of memory which
acts as bounce buffers for memory that is not accessible to some
devices. The core functions provided by this driver to
allocate/free/bounce memory chunks will be more useful if the
driver can manage more than one pool. An immediate application of
such an extension to the swiotlb driver is to bounce virtio
buffers between the private and shared space of a VM.

This patch introduces the concept of a swiotlb memory pool and
reorganizes the current driver to work with the default global pool.
There is no functional change introduced by this patch.
Subsequent patches allow the swiotlb driver to work with more
than one pool of memory.
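
The existing entry points then become thin wrappers that pass the default pool,
presumably along these lines (a sketch of the pattern only):

static inline phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
                dma_addr_t tbl_dma_addr, phys_addr_t phys,
                size_t mapping_size, size_t alloc_size,
                enum dma_data_direction dir, unsigned long attrs)
{
        return _swiotlb_tbl_map_single(&default_swiotlb_pool, hwdev,
                        tbl_dma_addr, phys, mapping_size, alloc_size,
                        dir, attrs);
}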

Signed-off-by: Srivatsa Vaddagiri 
---
 drivers/xen/swiotlb-xen.c |   4 +-
 include/linux/swiotlb.h   | 129 -
 kernel/dma/swiotlb.c  | 359 +++---
 3 files changed, 307 insertions(+), 185 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index b6d2776..c2dc9c8 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -190,8 +190,8 @@ int __ref xen_swiotlb_init(int verbose, bool early)
/*
 * IO TLB memory already allocated. Just use it.
 */
-   if (io_tlb_start != 0) {
-   xen_io_tlb_start = phys_to_virt(io_tlb_start);
+   if (swiotlb_start()) {
+   xen_io_tlb_start = phys_to_virt(swiotlb_start());
goto end;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 046bb94..8c7843f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -44,7 +44,59 @@ enum dma_sync_target {
SYNC_FOR_DEVICE = 1,
 };
 
-extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
+#define MAX_POOL_NAME_SIZE 16
+
+struct swiotlb_pool {
+   char name[MAX_POOL_NAME_SIZE];
+   bool no_iotlb_memory;
+   int late_alloc;
+
+   spinlock_t io_tlb_lock;
+
+   /*
+* Used to do a quick range check in swiotlb_tbl_unmap_single and
+* swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated
+* by this API.
+*/
+
+   phys_addr_t io_tlb_start, io_tlb_end;
+
+   /*
+* The number of IO TLB blocks (in groups of 64) between io_tlb_start
+* and io_tlb_end.  This is command line adjustable via
+* setup_io_tlb_npages.
+*/
+   unsigned long io_tlb_nslabs;
+
+   /*
+* The number of used IO TLB block
+*/
+   unsigned long io_tlb_used;
+
+   /*
+* This is a free list describing the number of free entries available
+* from each index
+*/
+   unsigned int *io_tlb_list;
+   unsigned int io_tlb_index;
+
+   /*
+* We need to save away the original address corresponding to a mapped
+* entry for the sync operations.
+*/
+   phys_addr_t *io_tlb_orig_addr;
+
+   /*
+* Max segment that we can provide which (if pages are contiguous) will
+* not be bounced (unless SWIOTLB_FORCE is set).
+*/
+   unsigned int max_segment;
+};
+
+extern struct swiotlb_pool default_swiotlb_pool;
+
+extern phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
+ struct device *hwdev,
  dma_addr_t tbl_dma_addr,
  phys_addr_t phys,
  size_t mapping_size,
@@ -52,28 +104,80 @@ extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
  enum dma_data_direction dir,
  unsigned long attrs);
 
-extern void swiotlb_tbl_unmap_single(struct device *hwdev,
+extern void _swiotlb_tbl_unmap_single(struct swiotlb_pool *pool,
+struct device *hwdev,
 phys_addr_t tlb_addr,
 size_t mapping_size,
 size_t alloc_size,
 enum dma_data_direction dir,
 unsigned long attrs);
 
-extern void swiotlb_tbl_sync_single(struct device *hwdev,
+extern void _swiotlb_tbl_sync_single(struct swiotlb_pool *pool,
+   struct device *hwdev,
phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir,
enum dma_sync_target target);
 
-dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
-   size_t size, enum dma_data_direction dir, unsigned long attrs);
+dma_addr_t _swiotlb_map(struct swiotlb_pool *pool, struct device *dev,
+   phys_addr_t phys, size_t size, enum dma_data_direction dir,
+   unsigned long attrs);
+
+static inline phys_addr_t

[PATCH 2/5] swiotlb: Allow for non-linear mapping between paddr and vaddr

2020-04-28 Thread Srivatsa Vaddagiri
Some of the memory pools managed by the swiotlb driver could fall
outside the direct-mapped range and instead be made accessible via
the memremap() routine. To facilitate easy conversion between the
virtual and physical addresses of such memory, store the virtual
address of a memory pool in addition to its physical address.

Signed-off-by: Srivatsa Vaddagiri 
---
 include/linux/swiotlb.h |  2 ++
 kernel/dma/swiotlb.c| 20 ++--
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8c7843f..c634b4d 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -61,6 +61,8 @@ struct swiotlb_pool {
 
phys_addr_t io_tlb_start, io_tlb_end;
 
+   void *io_tlb_vstart;
+
/*
 * The number of IO TLB blocks (in groups of 64) between io_tlb_start
 * and io_tlb_end.  This is command line adjustable via
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 9c504d3..8cf0b57 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -178,6 +178,7 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
default_swiotlb_pool.io_tlb_start = __pa(tlb);
default_swiotlb_pool.io_tlb_end =
default_swiotlb_pool.io_tlb_start + bytes;
+   default_swiotlb_pool.io_tlb_vstart = tlb;
 
/*
 * Allocate and initialize the free list array.  This array is used
@@ -307,6 +308,7 @@ static void swiotlb_cleanup(void)
default_swiotlb_pool.io_tlb_start = 0;
default_swiotlb_pool.io_tlb_nslabs = 0;
default_swiotlb_pool.max_segment = 0;
+   default_swiotlb_pool.io_tlb_vstart = NULL;
 }
 
 int
@@ -320,6 +322,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
default_swiotlb_pool.io_tlb_start = virt_to_phys(tlb);
default_swiotlb_pool.io_tlb_end =
default_swiotlb_pool.io_tlb_start + bytes;
+   default_swiotlb_pool.io_tlb_vstart = tlb;
 
set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
memset(tlb, 0, bytes);
@@ -400,11 +403,10 @@ void __init swiotlb_exit(void)
 /*
  * Bounce: copy the swiotlb buffer from or back to the original dma location
  */
-static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
+static void swiotlb_bounce(phys_addr_t orig_addr, void *vaddr,
   size_t size, enum dma_data_direction dir)
 {
unsigned long pfn = PFN_DOWN(orig_addr);
-   unsigned char *vaddr = phys_to_virt(tlb_addr);
 
if (PageHighMem(pfn_to_page(pfn))) {
/* The buffer does not have a mapping.  Map it in and copy */
@@ -437,6 +439,11 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
}
 }
 
+static inline void *tlb_vaddr(struct swiotlb_pool *pool, phys_addr_t tlb_addr)
+{
+   return pool->io_tlb_vstart + (tlb_addr - pool->io_tlb_start);
+}
+
 phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
   struct device *hwdev,
   dma_addr_t tbl_dma_addr,
@@ -569,7 +576,7 @@ phys_addr_t _swiotlb_tbl_map_single(struct swiotlb_pool *pool,
(i << IO_TLB_SHIFT);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_addr,
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
mapping_size, DMA_TO_DEVICE);
 
return tlb_addr;
@@ -594,7 +601,8 @@ void _swiotlb_tbl_unmap_single(struct swiotlb_pool *pool,
if (orig_addr != INVALID_PHYS_ADDR &&
!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
-   swiotlb_bounce(orig_addr, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
+   mapping_size, DMA_FROM_DEVICE);
 
/*
 * Return the buffer to the free list by setting the corresponding
@@ -643,14 +651,14 @@ void _swiotlb_tbl_sync_single(struct swiotlb_pool *pool,
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_addr,
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),
   size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(orig_addr, tlb_addr,
+   swiotlb_bounce(orig_addr, tlb_vaddr(pool, tlb_addr),

[PATCH 0/5] virtio on Type-1 hypervisor

2020-04-28 Thread Srivatsa Vaddagiri
We ran into several problems in using virtio for IO paravirtualization on a
Type-1 hypervisor with these characteristics:

* By default, all of a guest's memory is private to it (no other guest can
  access its memory).

* One of the VMs is considered the primary and has access to most IO devices.
  This is similar to the dom0 VM in the case of Xen. All other VMs are
  considered secondary VMs.

* virtio-backend drivers for all secondary VMs need to be hosted in the primary VM

* Since a secondary VM's memory is not accessible to the primary VM, to make the
  virtio backend driver work, an additional piece of memory is instead
  provisioned by the hypervisor that is shared between the primary and secondary
  VMs. This shared memory can be used, for example, to host virtio-ring
  structures and also to bounce IO buffers of the secondary VM.

* Message-queue and doorbell interfaces are available in the hypervisor to
  support inter-VM communication. The message-queue API (send/recv) allows one VM
  to send short messages to another VM. The doorbell interface allows a VM to
  inject an interrupt into another VM.

* No support for the MMIO transport, i.e. the hypervisor does not support
  trapping MMIO config space accesses by the front-end driver and having them
  handled in backend drivers.

A few problem statements arising out of this:

1) How can we make use of the shared memory region effectively to make virtio
work in this scenario?

What is proposed in the patch series for this problem is a virtio bounce driver
that recognizes a shared memory region (shared between VMs) and makes use of the
swiotlb driver interfaces to bounce IO buffers between private and shared space.
Some changes are proposed to the swiotlb driver in this regard that let us reuse
the swiotlb functions to work with the shared memory pool. The swiotlb driver can
now recognize more than one pool of memory and extends its functions
(allocate/free/bounce memory chunks) to each pool.
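
To illustrate the intended flow with the interfaces from this series
(shared_paddr/shared_size, dev, buf_paddr, len, dir and attrs stand in for values
the caller would supply; the discovery of the shared region itself is outside the
scope of these patches):

        struct swiotlb_pool *virtio_pool;
        phys_addr_t tlb_addr;
        dma_addr_t dev_addr;

        /* at init: register the hypervisor-provided shared region as a pool */
        virtio_pool = swiotlb_register_pool("virtio-bounce", shared_paddr,
                        memremap(shared_paddr, shared_size, MEMREMAP_WB),
                        shared_size);

        /*
         * per IO buffer: bounce the private buffer into the shared pool; the
         * address handed to the backend is an offset within the shared region
         */
        tlb_addr = _swiotlb_tbl_map_single(virtio_pool, dev, shared_paddr,
                                           buf_paddr, len, len, dir, attrs);
        dev_addr = tlb_addr - shared_paddr;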

2) What transport mechanism works best in this case? 

I realize that the ivshmem2-virtio proposal has discussed the possibility of using
shared memory + doorbell as a virtio transport option. We can consider using
that as a transport in case it is acceptable upstream. Another option we had
considered was to modify virtio_mmio.c itself to use the message-queue send/recv
hypercall interface (in place of readl/writel). That could be abstracted via an
'mmio_ops' structure providing a suitable implementation of readl/writel. Yet
another option, suggested by Christoffer Dall, is to unmap the config space from
the kernel address space and, as part of the fault handler, use hypervisor-specific
APIs to achieve config space IO.
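
For the 'mmio_ops' option, the hypervisor-specific implementation would reduce to
something like the below (a sketch only: the message-queue call signatures, the
request encoding and the msgq_handle member are placeholders for whatever our
hypervisor actually exposes):

static u32 hyp_mmio_readl(struct virtio_mmio_device *vm_dev, u32 reg_offset)
{
        u32 data;

        /* ask the backend VM to perform the config space read */
        message_queue_send(vm_dev->msgq_handle, VIRTIO_CFG_READ, reg_offset, 0);
        message_queue_receive(vm_dev->msgq_handle, &data, sizeof(data));

        return data;
}

static void hyp_mmio_writel(struct virtio_mmio_device *vm_dev, u32 reg_offset,
                            u32 data)
{
        message_queue_send(vm_dev->msgq_handle, VIRTIO_CFG_WRITE, reg_offset, data);
}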

3) Which virtio backend drivers to leverage?

We realized there are multiple implementations of virtio backend drivers,
bundled as part of separate projects (QEMU, lkvm, etc.). We think it would be nice
if there were some consolidation in that regard, so that we can make use of
backend drivers that are not tightly coupled with a VMM. In our case, we need to
be able to run virtio backend drivers as standalone programs (not coupled with
any VMM).


Srivatsa Vaddagiri (5):
  swiotlb: Introduce concept of swiotlb_pool
  swiotlb: Allow for non-linear mapping between paddr and vaddr
  swiotlb: Add alloc and free APIs
  swiotlb: Add API to register new pool
  virtio: Add bounce DMA ops

 drivers/virtio/Makefile|   2 +-
 drivers/virtio/virtio.c|   2 +
 drivers/virtio/virtio_bounce.c | 150 +++
 drivers/xen/swiotlb-xen.c  |   4 +-
 include/linux/swiotlb.h| 157 +++-
 include/linux/virtio.h |   4 +
 kernel/dma/swiotlb.c   | 556 -
 7 files changed, 638 insertions(+), 237 deletions(-)
 create mode 100644 drivers/virtio/virtio_bounce.c

-- 
2.7.4

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


[tip:sched/core] sched/fair: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-22 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  8bf46a39be910937d4c9e8d999a7438a7ae1a75b
Gitweb: http://git.kernel.org/tip/8bf46a39be910937d4c9e8d999a7438a7ae1a75b
Author: Srivatsa Vaddagiri <va...@codeaurora.org>
AuthorDate: Fri, 16 Sep 2016 18:28:51 -0700
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 22 Sep 2016 15:20:18 +0200

sched/fair: Fix SCHED_HRTICK bug leading to late preemption of tasks

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime).

It makes use of a per-CPU hrtimer resource and hence arming that
hrtimer should be based on total SCHED_FAIR tasks a CPU has across its
various cfs_rqs, rather than being based on number of tasks in a
particular cfs_rq (as implemented currently).

As a result, with the current code, it's possible for a running task (which
is the sole task in its cfs_rq) to be preempted much after its
ideal_runtime has elapsed, resulting in increased latency for tasks in
other cfs_rqs on the same CPU.

Fix this by arming sched hrtimer based on total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.

Signed-off-by: Srivatsa Vaddagiri <va...@codeaurora.org>
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: http://lkml.kernel.org/r/1474075731-11550-1-git-send-email-joonw...@codeaurora.org
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 986c10c..8fb4d19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4469,7 +4469,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
WARN_ON(task_rq(p) != rq);
 
-   if (cfs_rq->nr_running > 1) {
+   if (rq->cfs.h_nr_running > 1) {
u64 slice = sched_slice(cfs_rq, se);
u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
s64 delta = slice - ran;


[tip:x86/spinlocks] kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-14 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  92b75202e5e8790905f9441ccaea2456cc4621a5
Gitweb: http://git.kernel.org/tip/92b75202e5e8790905f9441ccaea2456cc4621a5
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 14:55:41 +0530
Committer:  Ingo Molnar 
CommitDate: Wed, 14 Aug 2013 13:12:35 +0200

kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus, a paravirtualized KVM guest detects if the hypervisor has
the required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
support for pv-ticketlocks is registered via pv_lock_ops.

Use the KVM_HC_KICK_CPU hypercall to wake up a waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri <va...@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130810193849.ga25...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose <suz...@in.ibm.com>
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Acked-by: Gleb Natapov <g...@redhat.com>
Acked-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..b8ef630 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/kprobes.h>
+#include <linux/debugfs.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debugfs(void)
+{
+   stru

[tip:x86/spinlocks] kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-12 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  f9021f7fd9c8c8101c90b901053f99bfd0288021
Gitweb: http://git.kernel.org/tip/f9021f7fd9c8c8101c90b901053f99bfd0288021
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 14:55:41 +0530
Committer:  H. Peter Anvin 
CommitDate: Mon, 12 Aug 2013 09:03:57 -0700

kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
 support for pv-ticketlocks is registered via pv_lock_ops.

Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.
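
The detection above amounts to an early feature check when the pv_lock_ops
hooks are registered; roughly as follows (a sketch only -- the slowpath helper
names kvm_lock_spinning/kvm_unlock_kick are assumed from the rest of the
series, the full body is in the diff below, which this archive truncates):

	void __init kvm_spinlock_init(void)
	{
		if (!kvm_para_available())
			return;
		/* Does the host offer KVM_FEATURE_PV_UNHALT? If not, leave
		 * the native ticketlock paths alone. */
		if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
			return;

		pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
		pv_lock_ops.unlock_kick = kvm_unlock_kick;
	}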

Signed-off-by: Srivatsa Vaddagiri 
Link: http://lkml.kernel.org/r/20130810193849.ga25...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose 
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related 
changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..d442471 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/kprobes.h>
+#include <linux/debugfs.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debugfs(void)
+{
+   stru

[tip:x86/spinlocks] kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-10 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  23f659a237e8f633f9605fdf9408a8d130ab72c9
Gitweb: http://git.kernel.org/tip/23f659a237e8f633f9605fdf9408a8d130ab72c9
Author: Srivatsa Vaddagiri 
AuthorDate: Fri, 9 Aug 2013 19:52:02 +0530
Committer:  H. Peter Anvin 
CommitDate: Fri, 9 Aug 2013 07:54:24 -0700

kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
support for pv-ticketlocks is registered via pv_lock_ops.

Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri 
Link: 
http://lkml.kernel.org/r/1376058122-8248-15-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose 
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related 
changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..9b33a27 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/kprobes.h>
+#include <linux/debugfs.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debug

[tip:x86/spinlocks] kvm guest: Add configuration support to enable debug information for KVM Guests

2013-08-10 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  1e20eb8557cdabf76473b09572be8aa8a2bb9bc0
Gitweb: http://git.kernel.org/tip/1e20eb8557cdabf76473b09572be8aa8a2bb9bc0
Author: Srivatsa Vaddagiri 
AuthorDate: Fri, 9 Aug 2013 19:52:01 +0530
Committer:  H. Peter Anvin 
CommitDate: Fri, 9 Aug 2013 07:54:18 -0700

kvm guest: Add configuration support to enable debug information for KVM Guests

Signed-off-by: Srivatsa Vaddagiri 
Link: 
http://lkml.kernel.org/r/1376058122-8248-14-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/Kconfig | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 112e712..b1fb846 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -657,6 +657,15 @@ config KVM_GUEST
  underlying device model, the host provides the guest with
  timing infrastructure such as time of day, and system time
 
+config KVM_DEBUG_FS
+   bool "Enable debug information for KVM Guests in debugfs"
+   depends on KVM_GUEST && DEBUG_FS
+   default n
+   ---help---
+ This option enables collection of various statistics for KVM guest.
+ Statistics are displayed in debugfs filesystem. Enabling this option
+ may incur significant overhead.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT_TIME_ACCOUNTING
--


[tip:x86/spinlocks] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-08 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  b5eaeb3303fc3086f1d04deea48b5dfcfc4344c0
Gitweb: http://git.kernel.org/tip/b5eaeb3303fc3086f1d04deea48b5dfcfc4344c0
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 17:15:21 +0530
Committer:  H. Peter Anvin 
CommitDate: Thu, 8 Aug 2013 16:07:34 -0700

kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
 support for pv-ticketlocks is registered via pv_lock_ops.

Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri 
Link: 
http://lkml.kernel.org/r/20130806114521.20643.29839.sendpatch...@codeblue.in.ibm.com
Signed-off-by: Suzuki Poulose 
[Raghu: check_zero race fix, enum for kvm_contention_stat, jumplabel related 
changes,
addition of safe_halt for irq enabled case, bailout spinning in nmi case(Gleb)]
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/kvm_para.h |  14 ++-
 arch/x86/kernel/kvm.c   | 262 
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 695399f..427afcb 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a96d32c..9b33a27 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/kprobes.h>
+#include <linux/debugfs.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
WARN_ON(kvm_register_clock("primary cpu clock"));
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void kvm_guest_cpu_online(void *dummy)
@@ -523,3 +525,263 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+   unsigned long flags = 0;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debugfs(void

[tip:x86/spinlocks] kvm guest : Add configuration support to enable debug information for KVM Guests

2013-08-08 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  20a89c88f7d2458e12f66d7850cf17deec7daa1c
Gitweb: http://git.kernel.org/tip/20a89c88f7d2458e12f66d7850cf17deec7daa1c
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 6 Aug 2013 17:15:01 +0530
Committer:  H. Peter Anvin 
CommitDate: Thu, 8 Aug 2013 16:07:30 -0700

kvm guest : Add configuration support to enable debug information for KVM Guests

Signed-off-by: Srivatsa Vaddagiri 
Link: 
http://lkml.kernel.org/r/20130806114501.20643.5734.sendpatch...@codeblue.in.ibm.com
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
Acked-by: Gleb Natapov 
Acked-by: Ingo Molnar 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/Kconfig | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 112e712..b1fb846 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -657,6 +657,15 @@ config KVM_GUEST
  underlying device model, the host provides the guest with
  timing infrastructure such as time of day, and system time
 
+config KVM_DEBUG_FS
+   bool "Enable debug information for KVM Guests in debugfs"
+   depends on KVM_GUEST && DEBUG_FS
+   default n
+   ---help---
+ This option enables collection of various statistics for KVM guest.
+ Statistics are displayed in debugfs filesystem. Enabling this option
+ may incur significant overhead.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT_TIME_ACCOUNTING
--


Re: [PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state

2013-01-07 Thread Srivatsa Vaddagiri
* Russell King - ARM Linux <li...@arm.linux.org.uk> [2013-01-05 10:36:27]:

> On Thu, Jan 03, 2013 at 06:58:38PM -0800, Srivatsa Vaddagiri wrote:
> > I also think that the
> > wait_for_completion() based wait in ARM's __cpu_die() can be replaced with a
> > busy-loop based one, as the wait there in general should be terminated 
> > within
> > few cycles.
> 
> Why open-code this stuff when we have infrastructure already in the kernel
> for waiting for stuff to happen?  I chose to use the standard infrastructure
> because its better tested, and avoids having to think about whether we need
> CPU barriers and such like to ensure that updates are seen in a timely
> manner.

I was primarily thinking of calling as few generic functions as possible on
a dead cpu. I recall several "am I running on a dead cpu?" checks
(cpu_is_offline(this_cpu) that were put in generic routines during early
versions of cpu hotplug [1] to educate code running on dead cpu, the need for
which went away though with introduction of atomic/stop-machine variant. The
need to add a RCU_NONIDLE() wrapper around ARM's cpu_die() [2] is perhaps a more
recent example of educating code running on a dead cpu. The quicker we die
after the idle thread of the dying cpu gains control, the better!

1. http://lwn.net/Articles/69040/
2. http://lists.infradead.org/pipermail/linux-arm-kernel/2012-July/107971.html
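
Purely as an illustration of the busy-loop wait mentioned above (a hypothetical
sketch, not from any posted patch; it assumes the dying cpu simply sets a flag
instead of calling complete()):

	static int cpu_dead_flag;	/* hypothetical flag, set by the dying cpu */

	void __cpu_die(unsigned int cpu)
	{
		unsigned long timeout = jiffies + msecs_to_jiffies(5000);

		/* expected to terminate within a few cycles in practice */
		while (!ACCESS_ONCE(cpu_dead_flag)) {
			if (time_after(jiffies, timeout)) {
				pr_err("CPU%u: cpu didn't die\n", cpu);
				return;
			}
			cpu_relax();
		}
		/* rest of __cpu_die() (platform cpu_kill() etc.) unchanged */
	}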

- vatsa
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation
--


Re: [PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).

2013-01-04 Thread Srivatsa Vaddagiri
* Sergei Shtylyov <sshtyl...@mvista.com> [2013-01-04 16:13:42]:

> >With offline cpus no longer beeing seen in nohz mode (ts->idle_active=0), we
> >don't need the check for cpu_online() introduced in commit 7386cdbf. Offline
> 
>Please also specify the summary of that commit in parens (or
> however you like).

I had that in Subject line, but yes would be good to include in commit message
as well. I will incorporate that change along with anything else required in
next version of this patch.

- vatsa
--


[PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).

2013-01-03 Thread Srivatsa Vaddagiri
With offline cpus no longer being seen in nohz mode (ts->idle_active=0), we
don't need the check for cpu_online() introduced in commit 7386cdbf. Offline
cpu's idle time as last recorded in its ts->idle_sleeptime will be reported
(thus excluding its offline time as part of idle time statistics).

Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri 
---
 fs/proc/stat.c |   14 --
 1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..64c3b31 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,13 +45,10 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-   u64 idle, idle_time = -1ULL;
-
-   if (cpu_online(cpu))
-   idle_time = get_cpu_idle_time_us(cpu, NULL);
+   u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
 
if (idle_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.idle */
+   /* !NO_HZ so we can rely on cpustat.idle */
idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
else
idle = usecs_to_cputime64(idle_time);
@@ -61,13 +58,10 @@ static u64 get_idle_time(int cpu)
 
 static u64 get_iowait_time(int cpu)
 {
-   u64 iowait, iowait_time = -1ULL;
-
-   if (cpu_online(cpu))
-   iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+   u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
 
if (iowait_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
+   /* !NO_HZ so we can rely on cpustat.iowait */
iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
else
iowait = usecs_to_cputime64(iowait_time);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

--


[PATCH 0/2] cpuhotplug/nohz: Fix issue of "negative" idle time

2013-01-03 Thread Srivatsa Vaddagiri
On most architectures (arm, mips, s390, sh and x86) idle thread of a cpu does
not cleanly exit nohz state before dying upon hot-remove. As a result,
offline cpu is seen to be in nohz mode (ts->idle_active = 1) and its offline
time can potentially be included in total idle time reported via /proc/stat.
When the same cpu later comes online, its offline time however is not included
in its idle time statistics, thus causing a rollback in total idle time to be
observed by applications like top.

Example output from Android top command highlighting this issue is below:

User 232%, System 70%, IOW 46%, IRQ 1%
User 1322 + Nice 0 + Sys 399 + Idle -1423 + IOW 264 + IRQ 0 + SIRQ 7 = 569

top is reporting system to be idle for -1423 ticks over some sampling period.
This happens as total idle time reported in cpu line of /proc/stat *dropped*
from the last value observed (cached) by top command.

While this was originally seen on an ARM platform running a 3.4 based kernel, I
could easily recreate it on my x86 desktop running latest tip/master kernel
(HEAD 3a7bfcad). Online/offline a cpu in a tight loop and in another loop read
/proc/stat and observe if total idle time drops from previously read value.
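
A minimal userspace loop along these lines is enough to see it (illustrative
only; it assumes cpu1 is hot-pluggable and the program runs as root):

	#include <stdio.h>

	static void set_cpu1_online(const char *val)
	{
		FILE *f = fopen("/sys/devices/system/cpu/cpu1/online", "w");

		if (f) {
			fputs(val, f);
			fclose(f);
		}
	}

	int main(void)
	{
		long long user, nice, sys, idle, prev = -1;
		char line[256];
		FILE *f;
		int i;

		for (i = 0; i < 1000; i++) {
			set_cpu1_online("0");
			set_cpu1_online("1");

			f = fopen("/proc/stat", "r");
			if (!f)
				return 1;
			if (!fgets(line, sizeof(line), f)) {
				fclose(f);
				return 1;
			}
			fclose(f);

			/* first line: "cpu  user nice system idle ..." (totals) */
			sscanf(line, "cpu %lld %lld %lld %lld",
			       &user, &nice, &sys, &idle);
			if (prev != -1 && idle < prev)
				printf("total idle went backwards: %lld -> %lld\n",
				       prev, idle);
			prev = idle;
		}
		return 0;
	}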

Although commit 7386cdbf (nohz: Fix idle ticks in cpu summary line of
/proc/stat) aims to avoid this bug, it's not preemption proof. A
thread could get preempted after the cpu_online() check in get_idle_time(), thus
potentially leading to get_cpu_idle_time_us() being invoked on an offline cpu.

One potential fix is to serialize hotplug with /proc/stat read operation (via
use of get/put_online_cpus()), which I disliked in favor of the other
solution proposed in this series.
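
For completeness, that rejected alternative would have looked roughly like this
in fs/proc/stat.c (sketch only, never posted):

	static u64 get_idle_time(int cpu)
	{
		u64 idle, idle_time = -1ULL;

		get_online_cpus();	/* keep the cpu from being offlined under us */
		if (cpu_online(cpu))
			idle_time = get_cpu_idle_time_us(cpu, NULL);
		put_online_cpus();

		if (idle_time == -1ULL)
			/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
			idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
		else
			idle = usecs_to_cputime64(idle_time);

		return idle;
	}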

In this patch series:

- Patch 1/2 modifies idle loop on architectures arm, mips, s390, sh and x86 to
  exit nohz state before the associated idle thread dies upon hotremove. This
  fixes the idle time accounting bug.

  Patch 1/2 also modifies idle loop on all architectures supporting cpu hotplug
  to have idle thread of a dying cpu die immediately after schedule() returns
  control to it. I see no point in wasting time via calls to *_enter()/*_exit()
  before noticing the need to die and dying.

- Patch 2/2 reverts commit 7386cdbf (nohz: Fix idle ticks in cpu summary line of
  /proc/stat). The cpu_online() check introduced by it is no longer necessary
  with Patch 1/2 applied. Having fewer code sites worry about online status of
  cpus is a good thing!

---

 arch/arm/kernel/process.c  |9 -
 arch/arm/kernel/smp.c  |2 +-
 arch/blackfin/kernel/process.c |8 
 arch/mips/kernel/process.c |6 +++---
 arch/powerpc/kernel/idle.c |2 +-
 arch/s390/kernel/process.c |4 ++--
 arch/sh/kernel/idle.c  |5 ++---
 arch/sparc/kernel/process_64.c |3 ++-
 arch/x86/kernel/process.c  |5 ++---
 fs/proc/stat.c |   14 --
 10 files changed, 25 insertions(+), 33 deletions(-)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

--


[PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state

2013-01-03 Thread Srivatsa Vaddagiri
Modify idle loop of arm, mips, s390, sh and x86 architectures to exit from nohz
state before dying upon hot-remove. This change is needed to avoid userspace
tools like top command from seeing a rollback in total idle time over some
sampling periods.

Additionally, modify idle loop on all architectures supporting cpu hotplug to
have idle thread of a dying cpu die immediately after scheduler returns control
to it. There is no point in wasting time via calls to *_enter()/*_exit() before
noticing the need to die and dying.

Additional ARM specific change:
Revert commit ff081e05 (ARM: 7457/1: smp: Fix suspicious
RCU originating from cpu_die()), which added a RCU_NONIDLE() wrapper
around call to complete(). That wrapper is no longer needed as cpu_die() is
now called outside of a rcu_idle_enter()/exit() section. I also think that the
wait_for_completion() based wait in ARM's __cpu_die() can be replaced with a
busy-loop based one, as the wait there in general should be terminated within
few cycles.
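
For illustration, the busy-loop variant I have in mind is roughly the
following (sketch only, not part of this patch; the helper name and the
5 second bound are arbitrary):

/* Poll the existing cpu_died completion instead of sleeping on it. */
static int wait_for_dying_cpu(struct completion *cpu_died)
{
	unsigned long timeout = jiffies + 5 * HZ;

	while (!try_wait_for_completion(cpu_died)) {
		if (time_after(jiffies, timeout))
			return -ETIMEDOUT;
		cpu_relax();
	}
	return 0;
}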

Cc: Russell King li...@arm.linux.org.uk
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Stephen Boyd sb...@codeaurora.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mike Frysinger vap...@gentoo.org
Cc: uclinux-dist-de...@blackfin.uclinux.org
Cc: Ralf Baechle r...@linux-mips.org
Cc: linux-m...@linux-mips.org
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: linux-s...@vger.kernel.org
Cc: Paul Mundt let...@linux-sh.org
Cc: linux...@vger.kernel.org
Cc: David S. Miller da...@davemloft.net
Cc: sparcli...@vger.kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: x...@kernel.org
Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri va...@codeaurora.org
---
 arch/arm/kernel/process.c  |9 -
 arch/arm/kernel/smp.c  |2 +-
 arch/blackfin/kernel/process.c |8 
 arch/mips/kernel/process.c |6 +++---
 arch/powerpc/kernel/idle.c |2 +-
 arch/s390/kernel/process.c |4 ++--
 arch/sh/kernel/idle.c  |5 ++---
 arch/sparc/kernel/process_64.c |3 ++-
 arch/x86/kernel/process.c  |5 ++---
 9 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index c6dec5f..254099b 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -191,11 +191,6 @@ void cpu_idle(void)
rcu_idle_enter();
ledtrig_cpu(CPU_LED_IDLE_START);
while (!need_resched()) {
-#ifdef CONFIG_HOTPLUG_CPU
-   if (cpu_is_offline(smp_processor_id()))
-   cpu_die();
-#endif
-
/*
 * We need to disable interrupts here
 * to ensure we don't miss a wakeup call.
@@ -224,6 +219,10 @@ void cpu_idle(void)
rcu_idle_exit();
tick_nohz_idle_exit();
schedule_preempt_disabled();
+#ifdef CONFIG_HOTPLUG_CPU
+   if (cpu_is_offline(smp_processor_id()))
+   cpu_die();
+#endif
}
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 84f4cbf..a8e3b8a 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -251,7 +251,7 @@ void __ref cpu_die(void)
mb();
 
/* Tell __cpu_die() that this CPU is now safe to dispose of */
-   RCU_NONIDLE(complete(cpu_died));
+   complete(cpu_died);
 
/*
 * actual CPU shutdown procedure is at least platform (if not
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 3e16ad9..2bee1af 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -83,10 +83,6 @@ void cpu_idle(void)
while (1) {
void (*idle)(void) = pm_idle;
 
-#ifdef CONFIG_HOTPLUG_CPU
-   if (cpu_is_offline(smp_processor_id()))
-   cpu_die();
-#endif
if (!idle)
idle = default_idle;
tick_nohz_idle_enter();
@@ -98,6 +94,10 @@ void cpu_idle(void)
preempt_enable_no_resched();
schedule();
preempt_disable();
+#ifdef CONFIG_HOTPLUG_CPU
+   if (cpu_is_offline(smp_processor_id()))
+   cpu_die();
+#endif
}
 }
 
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index a11c6f9..41102a0 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -71,13 +71,13 @@ void __noreturn cpu_idle(void)
start_critical_timings();
}
}
+   rcu_idle_exit();
+   tick_nohz_idle_exit();
+   schedule_preempt_disabled();
 #ifdef CONFIG_HOTPLUG_CPU
if (!cpu_online(cpu

[PATCH 2/2] Revert nohz: Fix idle ticks in cpu summary line of /proc/stat (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).

2013-01-03 Thread Srivatsa Vaddagiri
With offline cpus no longer being seen in nohz mode (ts->idle_active=0), we
don't need the check for cpu_online() introduced in commit 7386cdbf. Offline
cpu's idle time as last recorded in its ts->idle_sleeptime will be reported
(thus excluding its offline time as part of idle time statistics).

Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri va...@codeaurora.org
---
 fs/proc/stat.c |   14 --
 1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..64c3b31 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,13 +45,10 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-   u64 idle, idle_time = -1ULL;
-
-   if (cpu_online(cpu))
-   idle_time = get_cpu_idle_time_us(cpu, NULL);
+   u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
 
if (idle_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.idle */
+   /* !NO_HZ so we can rely on cpustat.idle */
idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
else
idle = usecs_to_cputime64(idle_time);
@@ -61,13 +58,10 @@ static u64 get_idle_time(int cpu)
 
 static u64 get_iowait_time(int cpu)
 {
-   u64 iowait, iowait_time = -1ULL;
-
-   if (cpu_online(cpu))
-   iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+   u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
 
if (iowait_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
+   /* !NO_HZ so we can rely on cpustat.iowait */
iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
else
iowait = usecs_to_cputime64(iowait_time);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation



[tip:sched/core] sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task

2012-07-24 Thread tip-bot for Srivatsa Vaddagiri
Commit-ID:  88b8dac0a14c511ff41486b83a8c3d688936eec0
Gitweb: http://git.kernel.org/tip/88b8dac0a14c511ff41486b83a8c3d688936eec0
Author: Srivatsa Vaddagiri 
AuthorDate: Tue, 19 Jun 2012 17:43:15 +0530
Committer:  Ingo Molnar 
CommitDate: Tue, 24 Jul 2012 13:58:06 +0200

sched: Improve balance_cpu() to consider other cpus in its group as target of 
(pinned) task

Current load balance scheme requires only one cpu in a
sched_group (balance_cpu) to look at other peer sched_groups for
imbalance and pull tasks towards itself from a busy cpu. Tasks
thus pulled by balance_cpu could later get picked up by cpus
that are in the same sched_group as that of balance_cpu.

This scheme however fails to pull tasks that are not allowed to
run on balance_cpu (but are allowed to run on other cpus in its
sched_group). That can affect fairness and in some worst case
scenarios cause starvation.

Consider a two core (2 threads/core) system running tasks as
below:

  Core0Core1
 / \  / \
C0 C1C2 C3
|  | |  |
v  v v  v
F0 T1F1 [idle]
 T2

 F0 = SCHED_FIFO task (pinned to C0)
 F1 = SCHED_FIFO task (pinned to C2)
 T1 = SCHED_OTHER task (pinned to C1)
 T2 = SCHED_OTHER task (pinned to C1 and C2)

F1 could become a cpu hog, which will starve T2 unless C1 pulls
it. Between C0 and C1 however, C0 is required to look for
imbalance between cores, which will fail to pull T2 towards
Core0. T2 will starve eternally in this case. The same scenario
can arise in presence of non-rt tasks as well (say we replace F1
with high irq load).

We tackle this problem by having balance_cpu move pinned tasks
to one of its sibling cpus (where they can run). We first check
if load balance goal can be met by ignoring pinned tasks,
failing which we retry move_tasks() with a new env->dst_cpu.

This patch modifies load balance semantics on who can move load
towards a given cpu in a given sched_domain.

Before this patch, a given_cpu or a ilb_cpu acting on behalf of
an idle given_cpu is responsible for moving load to given_cpu.

With this patch applied, balance_cpu can in addition decide on
moving some load to a given_cpu.

There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
*independently* and at *same* time to move some load to a
given_cpu). However we should see less of such conflicting
decisions in practice and moreover subsequent load balance
cycles should correct the excess load moved to given_cpu.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Prashanth Nageshappa 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/4fe06cdb.2060...@linux.vnet.ibm.com
[ minor edits ]
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c |   78 --
 1 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9f9aa0..22321db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3054,6 +3054,7 @@ static unsigned long __read_mostly 
max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED 0x01
 #define LBF_NEED_BREAK 0x02
+#define LBF_SOME_PINNED 0x04
 
 struct lb_env {
struct sched_domain *sd;
@@ -3064,6 +3065,8 @@ struct lb_env {
int dst_cpu;
struct rq   *dst_rq;
 
+   struct cpumask  *dst_grpmask;
+   int new_dst_cpu;
enum cpu_idle_type  idle;
longimbalance;
unsigned intflags;
@@ -3131,9 +3134,31 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
 * 3) are cache-hot on their current CPU.
 */
if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+   int new_dst_cpu;
+
schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
+   /*
+* Remember if this task can be migrated to any other cpu in
+* our sched_group. We may want to revisit it if we couldn't
+* meet load balance goals by pulling other tasks on src_cpu.
+*
+* Also avoid computing new_dst_cpu if we have already computed
+* one in current iteration.
+*/
+   if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+   return 0;
+
+   new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+   tsk_cpus_allowed(p));
+   if (new_dst_cpu < nr_cpu_ids) {
+   env->flags |= LBF_SOME_PINNED;
+   env->new_dst_cpu = new_dst_cpu;
+   }
return 0;
}
+
+   /* Record that we found atleast one task that could run on d
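
(The diff is truncated here in the archive. Purely as an illustration of
the retry described in the changelog, and not the verbatim hunk, the idea
in load_balance() is roughly:)

	ld_moved = move_tasks(&env);

	/*
	 * Goal not met, but some tasks were pinned away from dst_cpu and
	 * could run on a sibling cpu in its group: retry the pull with
	 * that sibling as the destination.
	 */
	if (!ld_moved && (env.flags & LBF_SOME_PINNED)) {
		env.dst_rq = cpu_rq(env.new_dst_cpu);
		env.dst_cpu = env.new_dst_cpu;
		env.flags &= ~LBF_SOME_PINNED;
		ld_moved = move_tasks(&env);
	}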


Re: [PATCH] sched: revert load_balance_monitor()

2008-02-25 Thread Srivatsa Vaddagiri
On Mon, Feb 25, 2008 at 04:28:02PM +0100, Peter Zijlstra wrote:
> Vatsa, would it make sense to take just that out, or just do a full
> revert?

Peter,
6b2d7700266b9402e12824e11e0099ae6a4a6a79 and 
58e2d4ca581167c2a079f4ee02be2f0bc52e8729 are very much related. The
latter changes how cpu load is calculated for the former to work
properly.

I would suggest we do a full revert for now, until the latency issues are sorted
out.

-- 
Regards,
vatsa

Re: 2.6.24-git4+ regression

2008-02-18 Thread Srivatsa Vaddagiri
On Mon, Feb 18, 2008 at 08:38:24AM +0100, Mike Galbraith wrote:
> Here, it does not.  It seems fine without CONFIG_FAIR_GROUP_SCHED.

My hunch is it's because of the vruntime driven preemption which shoots
up latencies (and the fact perhaps that Peter hasn't focused more on the SMP case
yet!).

Curiously, do you observe the same results when in UP mode (maxcpus=1)? 

> Oddity:  mainline git with Srivatsa's test patch improves markedly, and
> using sched_latency_ns and sched_wakeup_granularity_ns, I can tweak the

Lukas,
Does tweaking these make any difference for you?

# echo 1000 > /proc/sys/kernel/sched_latency_ns
# echo 1000 > /proc/sys/kernel/sched_wakeup_granularity_ns

[or some other lower value]

> regression into submission.  With sched-devel, I cannot tweak it away
> with or without the test patch.  Dunno how useful that info is.

FWIW, the test patch I had sent earlier didn't address the needs of UP, as Peter
pointed out to me. In that direction, I had done more experimentation with the
patch below, which seemed to improve UP latencies also. Note that I
don't particularly like the first hunk below, perhaps it needs to be
surrounded by an if(something) ..

---
 kernel/sched_fair.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -523,8 +523,6 @@ place_entity(struct cfs_rq *cfs_rq, stru
if (sched_feat(NEW_FAIR_SLEEPERS))
vruntime -= sysctl_sched_latency;
 
-   /* ensure we never gain time by being placed backwards. */
-   vruntime = max_vruntime(se->vruntime, vruntime);
}
 
se->vruntime = vruntime;
@@ -816,6 +814,13 @@ hrtick_start_fair(struct rq *rq, struct 
 }
 #endif
 
+static inline void dequeue_stack(struct sched_entity *se)
+{
+   for_each_sched_entity(se)
+   if (se->on_rq)
+   dequeue_entity(cfs_rq_of(se), se, 0);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -828,6 +833,9 @@ static void enqueue_task_fair(struct rq 
*topse = NULL;  /* Highest schedulable entity */
int incload = 1;
 
+   if (wakeup)
+   dequeue_stack(se);
+
for_each_sched_entity(se) {
topse = se;
if (se->on_rq) {



P.S : Sorry about slow responses, since I am now in a different project :(

-- 
Regards,
vatsa

Re: 2.6.24-git4+ regression

2008-02-14 Thread Srivatsa Vaddagiri
On Wed, Jan 30, 2008 at 02:56:09PM +0100, Lukas Hejtmanek wrote:
> Hello,
> 
> I noticed short thread in LKM regarding "sched: add vslice" causes horrible
> interactivity under load.
> 
> I can see similar behavior. If I stress both CPU cores, even typing on
> keyboard suffers from huge latencies, I can see letters appearing with delay
> (typing into xterm). No swap is used at all, having 1GB free RAM.
> 
> I noticed this bad behavior with 2.6.24-git[46], 2.6.24-rc8-git was OK.

Hi Lukas,
Can you check if the patch below helps improve interactivity for you?

The patch is against 2.6.25-rc1. I would request you to check what
difference it makes with CONFIG_FAIR_GROUP_SCHED and
CONFIG_FAIR_USER_SCHED turned on.

---
 kernel/sched.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -7431,8 +7431,8 @@
 
local_load = tg->cfs_rq[i]->load.weight;
local_shares = (local_load * total_shares) / total_load;
-   if (!local_shares)
-   local_shares = MIN_GROUP_SHARES;
+   if (!local_load)
+   local_shares = tg->shares;
if (local_shares == tg->se[i]->load.weight)
continue;
 
@@ -7710,7 +7710,7 @@
struct rq *rq = cfs_rq->rq;
int on_rq;
 
-   if (!shares)
+   if (shares < MIN_GROUP_SHARES)
shares = MIN_GROUP_SHARES;
 
on_rq = se->on_rq;

-- 
Regards,
vatsa

Re: Regression in latest sched-git

2008-02-13 Thread Srivatsa Vaddagiri
On Wed, Feb 13, 2008 at 10:04:44PM +0530, Dhaval Giani wrote:
> I know I am missing something, but aren't we trying to reduce latencies
> here?

I guess Peter is referring to the latency in seeing fairness results. In
other words, with single rq approach, you may require more time for the groups 
to converge on fairness.

-- 
Regards,
vatsa


Re: 2.6.25-rc1: volanoMark 45% regression

2008-02-13 Thread Srivatsa Vaddagiri
On Wed, Feb 13, 2008 at 03:15:16PM +0530, Balbir Singh wrote:
> Zhang, Yanmin wrote:
> > volanoMark has 45% regression with kernel 2.6.25-rc1 on my both 8-core
> > stoakley and 16-core Tigerton.
> > 
> > I used bisect to locate below patch.
> > 
> > commit 58e2d4ca581167c2a079f4ee02be2f0bc52e8729
> > Author: Srivatsa Vaddagiri <[EMAIL PROTECTED]>
> > Date:   Fri Jan 25 21:08:00 2008 +0100
> > 
> > sched: group scheduling, change how cpu load is calculated
> > 
> > 
> > 
> > hackbench has about 30% regression on 16-core tigerton, but has about 10% 
> > improvement
> > on 8-core stoakley.
> > 
> > In addition, tbench has about 6% regression on my 8-core stoakley and
> > 25% regression on 16-core stoakley. Some other benchmarks, like netperf/aim7
> > also have some regression. I will verify if they are all related to the
> > patch.
> > 
> > -yanmin
> 
> Hi, Yamin,
> 
> Thanks for reporting the issue. Any chance we could get the Oprofile output for
> the run? The exact commandline and .config being used would also help.

Yanmin,
I would also like to know against which previous version this
regression is being compared. Is it 2.6.24? Did you have
CONFIG_FAIR_USER_SCHED enabled in both cases? It would also help to know if you
see the same regression with FAIR_GROUP_SCHED turned off.

-- 
Regards,
vatsa

Re: Regression in latest sched-git

2008-02-12 Thread Srivatsa Vaddagiri
On Tue, Feb 12, 2008 at 08:40:08PM +0100, Peter Zijlstra wrote:
> Yes, latency isolation is the one thing I had to sacrifice in order to
> get the normal latencies under control.

Hi Peter,
I don't have an easy solution in mind either to meet both fairness
and latency goals in an acceptable way.

But I am puzzled at the max latency numbers you have provided below:

> The problem with the old code is that under light load: a kernel make
> -j2 as root, under an otherwise idle X session, generates latencies up
> to 120ms on my UP laptop. (uid grouping; two active users: peter, root).

If it was just two active users, then max latency should be:

latency to schedule user entity (~10ms?) +
latency to schedule task within that user 

20-30 ms seems a more reasonable max latency to expect in this scenario.
120ms seems abnormal, unless the user had a large number of tasks.

On the same lines, I can't understand how we can be seeing 700ms latency
(below) unless we had a large number of active groups/users and a large number of
tasks within each group/user.

> Others have reported latencies up to 300ms, and Ingo found a 700ms
> latency on his machine.
> 
> The source for this problem is I think the vruntime driven wakeup
> preemption (but I'm not quite sure). The other things that rely on
> global vruntime are sleeper fairness and yield. Now while I can't
> possibly care less about yield, the loss of sleeper fairness is somewhat
> sad (NB. turning it off with the old group scheduling does improve life
> somewhat).
> 
> So my first attempt at getting a global vruntime was flattening the
> whole RQ structure, you can see that patch in sched.git (I really ought
> to have posted that, will do so tomorrow).

We will do some exhaustive testing with this approach. My main concern
with this is that it may compromise the level of isolation between two
groups (imagine one group does a fork-bomb and how it would affect
fairness for other groups).

> With the experience gained from doing that, I think it might be possible
> to construct a hierarchical RQ model that has synced vruntime; but
> thinking about that still makes my head hurt.
> 
> Anyway, yes, its not ideal, but it does the more common case of light
> load much better - I basically had to tell people to disable
> CONFIG_FAIR_GROUP_SCHED in order to use their computer, which is sad,
> because its the default and we want it to be the default in the cgroup
> future.
> 
> So yes, I share your concern, lets work on this together.

-- 
Regards,
vatsa

Re: [RFC] Default child of a cgroup

2008-02-01 Thread Srivatsa Vaddagiri
On Thu, Jan 31, 2008 at 06:39:56PM -0800, Paul Menage wrote:
> On Jan 30, 2008 6:40 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> >
> > Here are some questions that arise in this picture:
> >
> > 1. What is the relationship of the task-group in A/tasks with the
> >task-group in A/a1/tasks? In otherwords do they form siblings
> >of the same parent A?
> 
> I'd argue the same as Balbir - tasks in A/tasks are are children of A
> and are siblings of a1, a2, etc.

> > 2. Somewhat related to the above question, how much resource should the
> >task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent
> >A's share or 1/(1 + N) of parent A's share (where N = number of tasks
> >in A/tasks)?
> 
> Each process in A should have a scheduler weight that's derived from
> its static_prio field. Similarly each subgroup of A will have a
> scheduler weight that's determined by its cpu.shares value. So the cpu
> share of any child (be it a task or a subgroup) would be equal to its
> own weight divided by the sum of weights of all children.

Assuming all tasks are of the same prio, what you are saying is that
A/a1/tasks should cumulatively receive 1/(1 + N) of the parent's share.

After some thought, that seems like a reasonable expectation. The only issue
I have with that is that it breaks current behavior in mainline. Assume this
structure:

   /
   |--tasks
   |--cpuacct.usage
   |--cpu.shares
   |
   |[A]
   | |tasks
   | |cpuacct.usage
   | |cpu.shares


then, going by the above argument, /A/tasks should receive 1/(1+M) of system
resources (M -> number of tasks in /tasks), whereas it receives 1/2 of
system resources currently (assuming /cpu.shares and /A/cpu.shares are
same).
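
As a made-up numeric illustration of the above weight arithmetic (assuming
nice-0 tasks have weight 1024 and a1 keeps a default cpu.shares of 1024):

#include <stdio.h>

#define NICE_0_WEIGHT	1024	/* assumed weight of a nice-0 task */

int main(void)
{
	unsigned int ntasks = 4;	/* tasks directly in A/tasks */
	unsigned int a1_shares = 1024;	/* A/a1/cpu.shares */
	unsigned int total = ntasks * NICE_0_WEIGHT + a1_shares;

	/* a1 ends up with 1/(1 + N) of A: 20% here, each task 20% too. */
	printf("a1 gets %.1f%% of A, each task gets %.1f%%\n",
	       100.0 * a1_shares / total, 100.0 * NICE_0_WEIGHT / total);
	return 0;
}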

Balbir, is this behaviour same for memory controller as well?

So whichever option we pick, we are talking of deviating from current
behavior, which perhaps is a non-issue if we want to DTRT.

> So yes, if a task in A forks lots of children, those children could
> end up getting a disproportionate amount of the CPU compared to tasks
> in A/a1 - but that's the same as the situation without cgroups. If you
> want to control cpu usage between different sets of processes in A,
> they should be in sibling cgroups, not directly in A.
> 
> Is there a restriction in CFS that stops a given group from
> simultaneously holding tasks and sub-groups? If so, couldn't we change
> CFS to make it possible rather than enforcing awkward restructions on
> cgroups?

Should be possible, need to look closely at what will need to change
(load_balance routines for sure).

-- 
Regards,
vatsa

[RFC] Default child of a cgroup

2008-01-30 Thread Srivatsa Vaddagiri
Hi,
As we were implementing multiple-hierarchy support for CPU
controller, we hit some oddities in its implementation, partly related
to current cgroups implementation. Peter and I have been debating on the 
exact solution and I thought of bringing that discussion to lkml.

Consider the cgroup filesystem structure for managing cpu resource.

# mount -t cgroup -ocpu,cpuacct none /cgroup
# mkdir /cgroup/A
# mkdir /cgroup/B
# mkdir /cgroup/A/a1

will result in:

/cgroup
   |--tasks
   |--cpuacct.usage
   |--cpu.shares
   |
   |[A]
   | |tasks
   | |cpuacct.usage
   | |cpu.shares
   | |
   | |---[a1]
   |   |tasks
   |   |cpuacct.usage
   |   |cpu.shares
   |   |
   |
   |[B]
   | |tasks
   | |cpuacct.usage
   | |cpu.shares
   | 


Here are some questions that arise in this picture:

1. What is the relationship of the task-group in A/tasks with the
   task-group in A/a1/tasks? In otherwords do they form siblings
   of the same parent A?

2. Somewhat related to the above question, how much resource should the 
   task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent
   A's share or 1/(1 + N) of parent A's share (where N = number of tasks
   in A/tasks)?

3. What should A/cpuacct.usage reflect? CPU usage of A/tasks? Or CPU usage
   of all siblings put together? It can reflect only one, in which case
   user has to manually derive the other component of the statistics.
 
It seems to me that tasks in A/tasks form what can be called the
"default" child group of A, in which case:

4. Modifications to A/cpu.shares should affect the parent or its default
   child group (A/tasks)?

To avoid these ambiguities, it may be good if cgroups created this
"default child group" automatically whenever a cgroup is created?
Something like below (note the absence of the tasks file in some directories
now):


/cgroup
   |
   |--cpuacct.usage
   |--cpu.shares
   |
   |---[def_child]
   | |tasks
   | |cpuacct.usage
   | |cpu.shares
   | |
   |
   |[A]
   | |
   | |cpuacct.usage
   | |cpu.shares
   | |
   | |---[def_child]
   | | |tasks
   | | |cpuacct.usage
   | | |cpu.shares
   | | |
   | | 
   | |---[a1]
   |   |
   |   |cpuacct.usage
   |   |cpu.shares
   |   |
   |   |---[def_child]
   |   |   |---tasks
   |   |   |---cpuacct.usage
   |   |   |---cpu.shares
   |   |   |
   |
   |[B]
   | |
   | |cpuacct.usage
   | |cpu.shares
   | | 
   | |---[def_child]
   | | |tasks
   | | |cpuacct.usage
   | | |cpu.shares
   | | |

Note that user cannot create subdirectories under def_child with this
scheme! I am also not sure what impact this will have on other resources
like cpusets ..

Thoughts?


-- 
Regards,
vatsa

Re: High wake up latencies with FAIR_USER_SCHED

2008-01-29 Thread Srivatsa Vaddagiri
On Tue, Jan 29, 2008 at 04:53:56PM +0100, Guillaume Chazarain wrote:
> I just thought about something to restore low latencies with
> FAIR_GROUP_SCHED, but it's possibly utter nonsense, so bear with me
> ;-) The idea would be to reverse the trees upside down. The scheduler
> would only see tasks (on the leaves) so could apply its interactivity
> magic, but the hierarchical groups would be used to compute dynamic
> loads for each task according to their position in the tree:

I think this is equivalent to flattening the hierarchy? We discussed this a bit 
sometime back [1], but one of its weaknesses is providing strong
partitioning between groups when it comes to ensuring fairness. Ex: imagine a 
group which does a fork-bomb. With the flattened tree, it affects other groups 
more than it would with a 1-level deep hierarchy.

Having said that, I would be interested to hear other solutions that maintain
this good partitioning between groups and still provide good interactivity!

1. http://lkml.org/lkml/2007/5/30/300

> - now:
>   - we schedule each level of the tree starting from the root
> 
> - with my proposition:
>   - we schedule tasks like with !FAIR_GROUP_SCHED, but
> calc_delta_fair() would traverse the tree starting from the leaves to
> compute the dynamic load.

-- 
Regards,
vatsa


Re: scheduler scalability - cgroups, cpusets and load-balancing

2008-01-29 Thread Srivatsa Vaddagiri
On Tue, Jan 29, 2008 at 11:57:22AM +0100, Peter Zijlstra wrote:
> On Tue, 2008-01-29 at 10:53 +0100, Peter Zijlstra wrote:
> 
> > My thoughts were to make stronger use of disjoint cpu-sets. cgroups and
> > cpusets are related, in that cpusets provide a property to a cgroup.
> > However, load_balance_monitor()'s interaction with sched domains
> > confuses me - it might DTRT, but I can't tell.
> > 
> > [ It looks to me it balances a group over the largest SD the current cpu
> >   has access to, even though that might be larger than the SD associated
> >   with the cpuset of that particular cgroup. ]
> 
> Hmm, with a bit more thought I think that does indeed DTRT. Because, if
> the cpu belongs to a disjoint cpuset, the highest sd (with
> load-balancing enabled) would be that. Right?

Hi Peter,
Yes, I was having this in mind when I wrote the load_balance_monitor() 
function - to only balance across cpus that form a disjoint cpuset in the 
system.

> [ Just a bit of a shame we have all cgroups represented on each cpu. ]

After reading your explanation in the other mail about what you mean here,
I agree. Your suggestion to remove/add cfs_rq from/to the leaf_cfs_rq_list
upon dequeue_of_last_task/enqueue_of_first_task  AND

> Also, might be a nice idea to split the daemon up if there are indeed
> disjoint sets - currently there is only a single daemon which touches
> the whole system.

the above suggestions seem like good ideas. I can also look at reducing
the frequency at which the thread runs ..
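
Roughly what I have in mind for the first part (sketch only, helper names
made up; leaf_cfs_rq_list is the existing per-rq list the balance code
walks):

/* Call just before accounting the first enqueued entity on this cfs_rq. */
static void leaf_list_join(struct rq *rq, struct cfs_rq *cfs_rq)
{
	if (!cfs_rq->nr_running)
		list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}

/* Call just after accounting the last dequeued entity on this cfs_rq. */
static void leaf_list_leave(struct cfs_rq *cfs_rq)
{
	if (!cfs_rq->nr_running)
		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}

That way the load-balance walk simply skips groups that are idle on a cpu.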

-- 
Regards,
vatsa

Re: High wake up latencies with FAIR_USER_SCHED

2008-01-28 Thread Srivatsa Vaddagiri
On Mon, Jan 28, 2008 at 09:13:53PM +0100, Guillaume Chazarain wrote:
> Unfortunately it seems to not be completely fixed, with this script:

The maximum scheduling latency of a task with group scheduler is:

Lmax = latency to schedule group entity at level0 + 
   latency to schedule group entity at level1 +
   ...
   latency to schedule task entity at last level

The more hierarchical levels there are, the higher the worst-case latency. This is particularly 
so because vruntime (and not wall-clock time) is used as the basis of preemption
of entities.  The latency at each level also depends on number of entities at 
that level and sysctl_sched_latency/sched_nr_latency setting.
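
(For reference, the per-level bound comes from the scheduling period
computation; the sketch below is a rough paraphrase of that behaviour, not
the actual kernel code, with the knobs passed in as parameters:)

static unsigned long approx_sched_period_ns(unsigned long nr_running,
					    unsigned long sched_latency_ns,
					    unsigned long min_granularity_ns)
{
	unsigned long nr_latency = sched_latency_ns / min_granularity_ns;

	/* Few entities: share sched_latency. Many: grow the period linearly. */
	if (nr_running > nr_latency)
		return min_granularity_ns * nr_running;
	return sched_latency_ns;
}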

In this case, we only have two levels - userid + task. So the max scheduling 
latency is:

  Lmax = latency to schedule uid1 group entity (L0) +
 latency to schedule the sleeper task within uid1 group (L1)

In the first script that you had, uid1 had only one sleeper task, while uid2 has
two cpu-hogs. This means L1 is always zero for the sleeper task. L0 is also 
substantially reduced with the patch I sent (giving sleep credit for group 
level entities). Thus we were able to get low scheduling latencies in the case 
of first script.

The second script you have sent is generating two tasks (sleeper + hog) under 
uid 1 and one cpuhog task under uid 2. Consequently the group-entity 
corresponding to uid 1 is always active and hence there is no question of giving
credit to it for sleeping.

As a result, we should expect worst-case latencies of up to [2 * 10 = 20ms] in 
this case. The results you have fall within this range.

In case of !FAIR_USER_SCHED, the sleeper task always gets sleep-credits
and hence its latency is drastically reduced.

IMHO these are expected results, and if someone really needs to cut down
this latency, they can reduce sysctl_sched_latency (which will be bad
from perf standpoint, as we will cause more cache thrashing with that).


> #!/usr/bin/python
> 
> import os
> import time
> 
> SLEEP_TIME = 0.1
> SAMPLES = 5
> PRINT_DELAY = 0.5
> 
> def print_wakeup_latency():
> times = []
> last_print = 0
> while True:
> start = time.time()
> time.sleep(SLEEP_TIME)
> end = time.time()
> times.insert(0, end - start - SLEEP_TIME)
> del times[SAMPLES:]
> if end > last_print + PRINT_DELAY:
> copy = times[:]
> copy.sort()
> print '%f ms' % (copy[len(copy)/2] * 1000)
> last_print = end
> 
> if os.fork() == 0:
> if os.fork() == 0:
> os.setuid(1)
> while True:
> pass
> else:
> os.setuid(2)
> while True:
> pass
> else:
> os.setuid(1)
> print_wakeup_latency()
> 
> I get seemingly unpredictable latencies (with or without the patch applied):
> 
> # ./sched.py
> 14.810944 ms
> 19.829893 ms
> 1.968050 ms
> 8.021021 ms
> -0.017977 ms
> 4.926109 ms
> 11.958027 ms
> 5.995893 ms
> 1.992130 ms
> 0.007057 ms
> 0.217819 ms
> -0.004864 ms
> 5.907202 ms
> 6.547832 ms
> -0.012970 ms
> 0.209951 ms
> -0.002003 ms
> 4.989052 ms
> 
> Without FAIR_USER_SCHED, latencies are consistently in the noise.
> Also, I forgot to mention that I'm on a single CPU.
> 
> Thanks for the help.
> 
> -- 
> Guillaume

-- 
Regards,
vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: High wake up latencies with FAIR_USER_SCHED

2008-01-27 Thread Srivatsa Vaddagiri
On Sun, Jan 27, 2008 at 09:01:15PM +0100, Guillaume Chazarain wrote:
> I noticed some strangely high wake up latencies with FAIR_USER_SCHED
> using this script:



> We have two busy loops with UID=1.
> And UID=2 maintains the running median of its wake up latency.
> I get these latencies:
> 
> # ./sched.py
> 4.300022 ms
> 4.801178 ms
> 4.604006 ms

Given that sysctl_sched_wakeup_granularity is set to 10ms by default,
this doesn't sound abnormal.



> Disabling FAIR_USER_SCHED restores wake up latencies in the noise:
> 
> # ./sched.py
> -0.156975 ms
> -0.067091 ms
> -0.022984 ms

The reason why we are getting better wakeup latencies for !FAIR_USER_SCHED is 
because of this snippet of code in place_entity():

if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
 ^^
vruntime -= sysctl_sched_latency;

/* ensure we never gain time by being placed backwards. */
vruntime = max_vruntime(se->vruntime, vruntime);
}


NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks and
not group-level entities. With the patch attached, I could see that wakeup 
latencies with FAIR_USER_SCHED are restored to the same level as 
!FAIR_USER_SCHED. 

However I am not sure whether that is the way to go. We want to let one group of
tasks run as much as possible until the fairness/wakeup-latency threshold is
exceeded. If someone does want better wakeup latencies between groups too, they
can always tune sysctl_sched_wakeup_granularity.



> Strangely enough, another way to restore normal latencies is to change
> setuid(2) to setuid(1), that is, putting the latency measurement in
> the same group as the two busy loops.



-- 
Regards,
vatsa

---
 kernel/sched_fair.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -511,7 +511,7 @@
 
if (!initial) {
/* sleeps upto a single latency don't count. */
-   if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
+   if (sched_feat(NEW_FAIR_SLEEPERS))
vruntime -= sysctl_sched_latency;
 
/* ensure we never gain time by being placed backwards. */


Re: (ondemand) CPU governor regression between 2.6.23 and 2.6.24

2008-01-27 Thread Srivatsa Vaddagiri
On Sun, Jan 27, 2008 at 04:06:17PM +0100, Toralf Förster wrote:
> > The third line (giving overall cpu usage stats) is what is interesting here.
> > If you have more than one cpu, you can get cpu usage stats for each cpu
> > in top by pressing 1. Can you provide this information with and w/o 
> > CONFIG_FAIR_GROUP_SCHED?
> 
> This is what I get if I set CONFIG_FAIR_GROUP_SCHED to "y"
> 
> top - 16:00:59 up 2 min,  1 user,  load average: 2.56, 1.60, 0.65
> Tasks:  84 total,   3 running,  81 sleeping,   0 stopped,   0 zombie
> Cpu(s): 49.7%us,  0.3%sy, 49.7%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
> Mem:   1036180k total,   322876k used,   713304k free,13164k buffers
> Swap:   997880k total,0k used,   997880k free,   149208k cached
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  6070 dnetc 39  19   664  348  264 R 49.7  0.0   1:09.71 dnetc
>  6676 tfoerste  20   0  1796  488  428 R 49.3  0.0   0:02.72 factor
> 
> Stopping dnetc gives:
> 
> top - 16:02:36 up 4 min,  1 user,  load average: 2.50, 1.87, 0.83
> Tasks:  89 total,   3 running,  86 sleeping,   0 stopped,   0 zombie
> Cpu(s): 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   1036180k total,   378760k used,   657420k free,14736k buffers
> Swap:   997880k total,0k used,   997880k free,   180868k cached
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  6766 tfoerste  20   0  1796  488  428 R 84.9  0.0   0:05.41 factor

Thanks for this response. This confirms that the cpu's idle time is close to
zero, as I intended to verify.

> > If I am not mistaken, cpu ondemand gov goes by the cpu idle time stats,
> > which should not be affected by FAIR_GROUP_SCHED. I will lookaround for
> > other possible causes.

On further examination, ondemand governor seems to have a tunable to
ignore nice load. In your case, I see that dnetc is running at a
positive nice value (19) which could explain why ondemand gov thinks
that the cpu is only ~50% loaded.
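
As a rough illustration of what that knob changes (a simplified model only,
not the actual cpufreq_ondemand.c driver code), the load the governor
compares against up_threshold behaves roughly like this:

unsigned int estimated_load(unsigned int wall, unsigned int idle,
                            unsigned int nice, int ignore_nice_load)
{
        /*
         * Simplified model: with ignore_nice_load set, time consumed by
         * niced tasks is treated as idle time, so a nice-19 cpu hog never
         * pushes the estimated load over up_threshold.
         */
        if (ignore_nice_load)
                idle += nice;                   /* count niced time as idle */

        return 100 * (wall - idle) / wall;      /* % busy over the window */
}

With ignore_nice_load = 1, a window split roughly 50/50 between the niced
dnetc hog and the nice-0 factor task reports only ~50% load, which stays
below the default 80% up_threshold.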

Can you check what is the setting of this knob in your case?

# cat /sys/devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load

You can set that to 0 to ask ondemand gov to include nice load into
account while calculating cpu freq changes:

# echo 0 > /sys/devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load

This should restore the behavior of ondemand governor as seen in 2.6.23
in your case (even with CONFIG_FAIR_GROUP_SCHED enabled). Can you pls confirm 
if that happens?

> As I stated our in http://lkml.org/lkml/2008/1/26/207 the issue is solved
> after unselecting FAIR_GROUP_SCHED. 

I understand, but we want to keep CONFIG_FAIR_GROUP_SCHED enabled by
default.

Ingo,
Most folks seem to be used to a global nice-domain, where a nice-19
task gives up cpu in competition with a nice-0 task (irrespective of which
userids they belong to). CONFIG_FAIR_USER_SCHED brings noticeable changes wrt 
that. We could possibly let it be as it is (since that is what a server
admin may possibly want when managing university servers) or modify it to be 
aware of nice-level (priority of user-sched entity is equivalent to highest 
prio task it has).

In any case, I will send across a patch to turn off CONFIG_FAIR_USER_SCHED by 
default (and instead turn on CONFIG_FAIR_CGROUP_SCHED by default).

-- 
Regards,
vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: (ondemand) CPU governor regression between 2.6.23 and 2.6.24

2008-01-27 Thread Srivatsa Vaddagiri
On Sat, Jan 26, 2008 at 07:46:51PM +0100, Toralf Förster wrote:
> 
> The problem is the same as described here : http://lkml.org/lkml/2007/10/21/85
> If I run dnetc even with lowest prority than the CPU stays at 600 MHz 
> regardless
> of any other load (eg. rsyncing, svn update, compiling, ...)
> 
> Stopping the dnetc process immediately speeds up the CPU up to 1.7 GHz.
> 
> 
> Am Samstag, 26. Januar 2008 schrieben Sie:
> > During the test, run top, and watch your CPU usage. Does it go above 80% 
> > (the default for 
> > /sys/devices/system/cpu/cpu0/cpufreq/ondemand/up_threshold).
> 
> No, instead I get :
> 
>  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  7294 dnetc 39  19   664  348  264 R 49.5  0.0   0:48.68 dnetc
>  7310 tfoerste  20   0  1796  492  428 R 48.5  0.0   0:07.19 factor
>  7050 root  20   0 96736 8872 3972 S  0.7  0.9   0:02.99 X

Hi Toralf,
Can you list the o/p you see for overall cpu usage? You should
see something like below right at the top of the o/p:

top - 20:03:59 up 12 days, 21:39, 18 users,  load average: 0.22, 0.20, 0.25
Tasks: 200 total,   5 running, 193 sleeping,   0 stopped,   2 zombie
Cpu(s):  2.6% us,  1.3% sy,  0.0% ni, 96.0% id,  0.0% wa,  0.0% hi, 0.0% si,  
0.0% st

The third line (giving overall cpu usage stats) is what is interesting here.
If you have more than one cpu, you can get cpu usage stats for each cpu
in top by pressing 1. Can you provide this information with and w/o 
CONFIG_FAIR_GROUP_SCHED?

If I am not mistaken, the cpu ondemand gov goes by the cpu idle time stats,
which should not be affected by FAIR_GROUP_SCHED. I will look around for
other possible causes.

-- 
Regards,
vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: don't take a mutex from interrupt context

2008-01-22 Thread Srivatsa Vaddagiri
On Tue, Jan 22, 2008 at 05:47:34PM +0100, Peter Zijlstra wrote:
> It should not, that would be another bug, but from a quick glance at the
> code it doesn't do that.

Hmm, I had it in the back of my mind that printk() could sleep. Looks like
that has changed, so the patch you sent should be fine.

Thanks!

- vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: don't take a mutex from interrupt context

2008-01-22 Thread Srivatsa Vaddagiri
On Tue, Jan 22, 2008 at 05:25:38PM +0100, Peter Zijlstra wrote:
> @@ -1428,9 +1428,9 @@ static void print_cfs_stats(struct seq_f
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
>  #endif
> - lock_task_group_list();
> + rcu_read_lock();
>   for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
>   print_cfs_rq(m, cpu, cfs_rq);

Isn't there a possibility that print_cfs_rq() can block?

> - unlock_task_group_list();
> + rcu_read_unlock();
>  }
>  #endif
> 

-- 
Regards,
vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression with idle cpu cycle handling in 2.6.24 (compared to 2.6.22)

2008-01-21 Thread Srivatsa Vaddagiri
On Sun, Jan 20, 2008 at 09:03:38AM +0530, Dhaval Giani wrote:
> > btw: writing 1 into "cpu_share" totally locks up the computer!
> > 
> 
> Can you please provide some more details. Can you go into another
> console (try ctrl-alt-f1) and try to reproduce the issue there. Could
> you take a photo of the oops/panic and upload it somewhere so that we
> can see it?

I couldn't recreate this problem either on 2.6.24-rc8.

Daniel,
Could you pass the config file you used pls?

-- 
Regards,
vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/11] another rt group sched update

2008-01-07 Thread Srivatsa Vaddagiri
On Mon, Jan 07, 2008 at 11:51:20AM +0100, Peter Zijlstra wrote:
>  - figure out what to do for UID based group scheduling, the current
>implementation leaves it impossible for !root users to execute
>real time tasks by setting rt_runtime_us to 0, and it has no way
>to change it.
> 
>Srivatsa, what happened to the per uid weight patches?, Perhaps we
>can extend that interface to allow changing this.

Hi Peter,
The sysfs interface for tweaking each user's share should be in
mainline already (sysfs_create_file() in user_kobject_create()). This
could be extended for your purpose, hopefully in a straightforward
manner (you never know that with sysfs :(

-- 
Regards,
vatsa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: cpu accounting controller (V2)

2007-11-30 Thread Srivatsa Vaddagiri
On Fri, Nov 30, 2007 at 01:35:13PM +0100, Ingo Molnar wrote:
> * Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> 
> > Here's V2 of the cpu acccounting controller patch, which makes 
> > accounting scale better on SMP systems by splitting the usage counter 
> > to be per-cpu.
> 
> thanks, applied. But you dont seem to have incorporated all of the 
> review feedback from Andrew. (such as making cpuacct_subsys static)

cpuacct_subsys can't be made static, as I have pointed out at
http://marc.info/?l=linux-kernel&m=119636730930886. Other than that,
AFAICS, the rest of Andrew's comments have been taken care of. If there are any
remaining that I have overlooked, I'd be happy to fix them.

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: cpu accounting controller (V2)

2007-11-30 Thread Srivatsa Vaddagiri
On Fri, Nov 30, 2007 at 01:48:33AM +0530, Srivatsa Vaddagiri wrote:
> It is indeed an important todo. Right now we take a per-group global
> lock on every accounting update (which can be very frequent) and hence
> it is pretty bad.
> 
> Ingo had expressed the need to reintroduce this patch asap and hence I resent 
> it w/o attacking the scalability part. I will take a shot at scalability
> enhancements tomorrow and send it as a separate patch.

Here's V2 of the cpu accounting controller patch, which makes
accounting scale better on SMP systems by splitting the usage counter to
be per-cpu.

>

Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for 
us, which provided a cpu accounting resource controller.  This feature would be 
useful if someone wants to group tasks only for accounting purposes and doesn't 
really want to exercise any control over their cpu consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
these differences:

- Removed load average information. I felt it needs more thought (esp 
  to deal with SMP and virtualized platforms) and can be added for 
  2.6.25 after more discussions.
- Convert group cpu usage to be nanosecond accurate (as rest of the cfs 
  stats are) and invoke cpuacct_charge() from the respective scheduler 
  classes
- Make accounting scalable on SMP systems by splitting the usage 
  counter to be per-cpu
- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the 
  code is not big enough to warrant a new file and also this rightly 
  needs to live inside the scheduler. Also things like accessing 
  rq->lock while reading cpu usage becomes easier if the code lived in 
  kernel/sched.c)

The patch also modifies the cpu controller not to provide the same accounting 
information.
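
To illustrate the per-cpu split in userspace terms (hypothetical names, not
the patch code itself): each cpu gets its own counter, so the charge path
needs no shared lock and only readers pay the aggregation cost.

#include <stdint.h>

#define NR_CPUS 8

struct group_usage {
        uint64_t percpu[NR_CPUS];       /* ns charged on each cpu */
};

/* hot path: touches only the charging cpu's slot, no global lock */
static void charge(struct group_usage *u, int cpu, uint64_t ns)
{
        u->percpu[cpu] += ns;
}

/* read side (e.g. the cgroup "usage" file): sum the per-cpu counters */
static uint64_t total_usage(const struct group_usage *u)
{
        uint64_t sum = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                sum += u->percpu[cpu];
        return sum;
}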


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/cgroup_subsys.h |7 +
 init/Kconfig  |7 +
 kernel/sched.c|  155 ++
 kernel/sched_fair.c   |6 +
 kernel/sched_rt.c |1 
 5 files changed, 150 insertions(+), 26 deletions(-)

Index: current/include/linux/cgroup_subsys.h
===
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -30,3 +30,10 @@ SUBSYS(cpu_cgroup)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_CPUACCT
+SUBSYS(cpuacct)
+#endif
+
+/* */
+
Index: current/init/Kconfig
===
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -354,6 +354,13 @@ config FAIR_CGROUP_SCHED
 
 endchoice
 
+config CGROUP_CPUACCT
+   bool "Simple CPU accounting cgroup subsystem"
+   depends on CGROUPS
+   help
+ Provides a simple Resource Controller for monitoring the
+ total CPU consumed by the tasks in a cgroup
+
 config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -854,6 +854,12 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
+#ifdef CONFIG_CGROUP_CPUACCT
+static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+#endif
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -7221,38 +7227,12 @@ static u64 cpu_shares_read_uint(struct c
return (u64) tg->shares;
 }
 
-static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft)
-{
-   struct task_group *tg = cgroup_tg(cgrp);
-   unsigned long flags;
-   u64 res = 0;
-   int i;
-
-   for_each_possible_cpu(i) {
-   /*
-* Lock to prevent races with updating 64-bit counters
-* on 32-bit arches.
-*/
-   spin_lock_irqsave(&cpu_rq(i)->lock, flags);
-   res += tg->se[i]->sum_exec_runtime;
-   spin_unlock_irqrestore(&cpu_rq(i)->lock, flags);
-   }
-   /* Convert from ns to ms */
-   do_div(res, NSEC_PER_MSEC);
-
-   return res;
-}
-
 static struct cftype cpu_files[] = {
{
.name = "shares",
.read_uint = cpu_shares_read_uint,
.write_uint = cpu_shares_write_uint,
},
-   {
-   .name = "usage",
-   .read_uint = cpu_usage_read,
-   },
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
@


Re: [PATCH] sched: cpu accounting controller

2007-11-29 Thread Srivatsa Vaddagiri
On Thu, Nov 29, 2007 at 11:30:35AM -0800, Andrew Morton wrote:
> > - Make the accounting scalable on SMP systems (perhaps
> >   for 2.6.25)
> 
> That sounds like a rather important todo.  How bad is it now?

It is indeed an important todo. Right now we take a per-group global
lock on every accounting update (which can be very frequent) and hence
it is pretty bad.

Ingo had expressed the need to reintroduce this patch asap and hence I resent 
it w/o attacking the scalability part. I will take a shot at scalability
enhancements tomorrow and send it as a separate patch.

> > +struct cpuacct {
> > + struct cgroup_subsys_state css;
> > + spinlock_t lock;
> > + /* total time used by this class (in nanoseconds) */
> > + u64 time;
> > +};
> > +
> > +struct cgroup_subsys cpuacct_subsys;
> 
> This can be made static.

This symbol is needed in kernel/cgroup.c file, where it does this:

static struct cgroup_subsys *subsys[] = {
#include <linux/cgroup_subsys.h>
};

and hence it can't be static. Thanks for the rest of your comments. I
have fixed them in this version below:

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/cgroup_subsys.h |6 ++
 include/linux/cpu_acct.h  |   14 +
 init/Kconfig  |7 ++
 kernel/Makefile   |1 
 kernel/cpu_acct.c |  101 ++
 kernel/sched.c|   27 ---
 kernel/sched_fair.c   |5 ++
 kernel/sched_rt.c |1 
 8 files changed, 136 insertions(+), 26 deletions(-)

Index: current/include/linux/cgroup_subsys.h
===
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -11,6 +11,12 @@
 SUBSYS(cpuset)
 #endif
 
+#ifdef CONFIG_CGROUP_CPUACCT
+SUBSYS(cpuacct)
+#endif
+
+/* */
+
 /* */
 
 #ifdef CONFIG_CGROUP_DEBUG
Index: current/include/linux/cpu_acct.h
===
--- /dev/null
+++ current/include/linux/cpu_acct.h
@@ -0,0 +1,14 @@
+
+#ifndef _LINUX_CPU_ACCT_H
+#define _LINUX_CPU_ACCT_H
+
+#include <linux/cgroup.h>
+#include <asm/cputime.h>
+
+#ifdef CONFIG_CGROUP_CPUACCT
+extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+#endif
+
+#endif
Index: current/init/Kconfig
===
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -354,6 +354,13 @@ config FAIR_CGROUP_SCHED
 
 endchoice
 
+config CGROUP_CPUACCT
+   bool "Simple CPU accounting cgroup subsystem"
+   depends on CGROUPS
+   help
+ Provides a simple Resource Controller for monitoring the
+ total CPU consumed by the tasks in a cgroup
+
 config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/Makefile
===
--- current.orig/kernel/Makefile
+++ current/kernel/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_CPUACCT) += cpu_acct.o
 obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
Index: current/kernel/cpu_acct.c
===
--- /dev/null
+++ current/kernel/cpu_acct.c
@@ -0,0 +1,101 @@
+/*
+ * kernel/cpu_acct.c - CPU accounting cgroup subsystem
+ *
+ * Copyright (C) Google Inc, 2006
+ *
+ * Developed by Paul Menage ([EMAIL PROTECTED]) and Balbir Singh
+ * ([EMAIL PROTECTED])
+ *
+ */
+
+/*
+ * Example cgroup subsystem for reporting total CPU usage of tasks in a
+ * cgroup.
+ */
+
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/rcupdate.h>
+
+struct cpuacct {
+   struct cgroup_subsys_state css;
+   spinlock_t lock;
+   /* total time used by this class (in nanoseconds) */
+   u64 time;
+};
+
+struct cgroup_subsys cpuacct_subsys;
+
+static inline struct cpuacct *cgroup_ca(struct cgroup *cont)
+{
+   return container_of(cgroup_subsys_state(cont, cpuacct_subsys_id),
+   struct cpuacct, css);
+}
+
+static inline struct cpuacct *task_ca(struct task_struct *task)
+{
+   return container_of(task_subsys_state(task, cpuacct_subsys_id),
+   struct cpuacct, css);
+}
+
+static struct cgroup_subsys_state *cpuacct_create(
+   struct cgroup_subsys *ss, struct cgroup *cont)
+{
+   struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
+
+   if (!ca)
+   return ERR_PTR(-ENOMEM);
+   spin_lock_init(&ca->lock);
+   return &ca->css;
+}
+
+static void cpuacct_destroy(struct cgroup_subsys *ss,
+   st

Re: [PATCH] sched: cpu accounting controller

2007-11-29 Thread Srivatsa Vaddagiri
On Thu, Nov 29, 2007 at 08:20:58PM +0100, Ingo Molnar wrote:
> ok, this looks certainly doable for v2.6.24. I've added it to the 
> scheduler fixes queue and will let it brew there for a few days and send 
> it to Linus after that if everything goes fine - unless anyone objects. 

Thanks.

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: cpu accounting controller

2007-11-29 Thread Srivatsa Vaddagiri
On Mon, Nov 12, 2007 at 11:57:03PM -0800, Paul Menage wrote:
> > Regarding your concern about tracking cpu usage in different ways, it
> > could be mitigated if we have cpuacct controller track usage as per
> > information present in a task's sched entity structure
> > (tsk->se.sum_exec_runtime) i.e call cpuacct_charge() from
> > __update_curr() which would accumulate the execution time of the
> > group in a SMP friendly manner (i.e dump it in a per-cpu per-group counter
> > first and then aggregate to a global per-group counter).
> 
> That seems more reasonable than the current approach in cpu_acct.c

Paul,
Sorry about the delay in getting back to this thread. I realized
very recently that cpuacct controller has been removed from Linus's tree
and have attempted to rework it as per our discussions.

Linus/Ingo,
Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful 
feature for us, which provided a cpu accounting resource controller. This 
feature would be useful if someone wants to group tasks only for accounting 
purposes and doesn't really want to exercise any control over their cpu 
consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
these differences:

- Removed load average information. I felt it needs
  more thought (esp to deal with SMP and virtualized platforms)
  and can be added for 2.6.25 after more discussions.
- Convert group cpu usage to be nanosecond accurate (as rest
  of the cfs stats are) and invoke cpuacct_charge() from 
  the respective scheduler classes

The patch also modifies the cpu controller not to provide the same
accounting information.
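
As a rough userspace illustration of where the charging fits (hypothetical
types and names; the real hook lives in the exec-time update path of the
fair and rt scheduling classes and calls cpuacct_charge() with nanoseconds):

#include <stdint.h>

struct group_acct {
        uint64_t usage_ns;              /* total cpu consumed by the group */
};

struct task {
        struct group_acct *grp;         /* accounting group of the task */
};

/* called whenever delta_ns of execution time is accounted to a task */
static void charge_group(struct task *t, uint64_t delta_ns)
{
        t->grp->usage_ns += delta_ns;
}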

Todo:
- Make the accounting scalable on SMP systems (perhaps
  for 2.6.25)


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/cgroup_subsys.h |6 ++
 include/linux/cpu_acct.h  |   14 +
 init/Kconfig  |7 ++
 kernel/Makefile   |1 
 kernel/cpu_acct.c |  103 ++
 kernel/sched.c|   27 ---
 kernel/sched_fair.c   |5 ++
 kernel/sched_rt.c |1 
 8 files changed, 138 insertions(+), 26 deletions(-)

Index: current/include/linux/cgroup_subsys.h
===
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -11,6 +11,12 @@
 SUBSYS(cpuset)
 #endif
 
+#ifdef CONFIG_CGROUP_CPUACCT
+SUBSYS(cpuacct)
+#endif
+
+/* */
+
 /* */
 
 #ifdef CONFIG_CGROUP_DEBUG
Index: current/include/linux/cpu_acct.h
===
--- /dev/null
+++ current/include/linux/cpu_acct.h
@@ -0,0 +1,14 @@
+
+#ifndef _LINUX_CPU_ACCT_H
+#define _LINUX_CPU_ACCT_H
+
+#include <linux/cgroup.h>
+#include <asm/cputime.h>
+
+#ifdef CONFIG_CGROUP_CPUACCT
+extern void cpuacct_charge(struct task_struct *, u64 cputime);
+#else
+static inline void cpuacct_charge(struct task_struct *p, u64 cputime) {}
+#endif
+
+#endif
Index: current/init/Kconfig
===
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -354,6 +354,13 @@ config FAIR_CGROUP_SCHED
 
 endchoice
 
+config CGROUP_CPUACCT
+   bool "Simple CPU accounting cgroup subsystem"
+   depends on CGROUPS
+   help
+ Provides a simple Resource Controller for monitoring the
+ total CPU consumed by the tasks in a cgroup
+
 config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/Makefile
===
--- current.orig/kernel/Makefile
+++ current/kernel/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_CPUACCT) += cpu_acct.o
 obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
Index: current/kernel/cpu_acct.c
===
--- /dev/null
+++ current/kernel/cpu_acct.c
@@ -0,0 +1,103 @@
+/*
+ * kernel/cpu_acct.c - CPU accounting cgroup subsystem
+ *
+ * Copyright (C) Google Inc, 2006
+ *
+ * Developed by Paul Menage ([EMAIL PROTECTED]) and Balbir Singh
+ * ([EMAIL PROTECTED])
+ *
+ */
+
+/*
+ * Example cgroup subsystem for reporting total CPU usage of tasks in a
+ * cgroup.
+ */
+
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/rcupdate.h>
+
+#include <asm/div64.h>
+
+struct cpuacct {
+   struct cgroup_subsys_state css;
+   spinlock_t lock;
+   /* total time used by this class (in nanoseconds) */
+   u64 time;
+};
+
+struct cgroup_subsys cpuacct_subsys;
+
+s


Re: [Patch 0/5] sched: group scheduler related patches (V4)

2007-11-27 Thread Srivatsa Vaddagiri
On Tue, Nov 27, 2007 at 01:53:12PM +0100, Ingo Molnar wrote:
> > Fine. I will resubmit this patchset then once we are into 2.6.25 
> > cycle.
> 
> no need (unless you have bugfixes) i'm carrying this around in the 
> scheduler git tree. (it will show up in sched-devel.git)

Cool .. Thx! It will get me some testing results (from those who test
sched-devel tree) and also will save me some maintenance trouble :)

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Patch 0/5] sched: group scheduler related patches (V4)

2007-11-27 Thread Srivatsa Vaddagiri
On Tue, Nov 27, 2007 at 12:09:10PM +0100, Ingo Molnar wrote:
> thanks, it looks good - but the fact that we are at v4 of the patchset 
> underlines the point that this is more of a v2.6.25 patchset than a 
> v2.6.24 one.

Fine. I will resubmit this patchset then once we are into 2.6.25 cycle.

> Group fairness certainly works fine enough in most setups 
> and we are late into the v2.6.24 -rc stage. We'll merge this patchset 
> (or any later versions of it) into v2.6.25 for sure,

Ok ..thx.

> so distros that 
> happen to base things off v2.6.24 can pick up your patches just fine.

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch 5/5] sched: Improve fairness of cpu bandwidth allocation for task groups

2007-11-26 Thread Srivatsa Vaddagiri

The current load balancing scheme isn't good for group fairness.

For ex: on an 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024) 
b = 4 tasks (cpu.shares = 1024) 
c = 3 tasks (cpu.shares = 1024) 

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:


--------------------------------------------------------------------------------
Col1  | Col2    | Col3  | Col4
------|---------|-------|-------------------------------------------------------
a     | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8%  64.5%
b     | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
c     |  86.326 | 18.0% | 47.5%  47.9%  48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
        group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
        percentage. Col3 data is derived as:
                Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
                Col4 = 100 * cpu_time_recd_by_task / 60
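
For example, plugging group a's numbers from the table above into these
formulas: Col3(a) = 100 * 277.676 / (8 * 60) ~= 57.8%, which is the value
reported in the table.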

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:


--------------------------------------------------------------------------------
Col1  | Col2    | Col3  | Col4
------|---------|-------|-------------------------------------------------------
a     | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8%  35.3%
b     | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
c     | 160.653 | 33.5% | 85.8%  90.6%  91.4%
--------------------------------------------------------------------------------

Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most of the other runs have yielded a deviation within
+-2% at the most, which is good.

Why do we see bad (group) fairness with the current scheduler?
===============================================================

Currently a cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

CPU0            CPU1
------------------------
A (10)          B (5)
                C (5)
------------------------

Group A has 10 tasks, all on CPU0; groups B and C have 5 tasks each, all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu, so
the load balancer would think both CPUs are perfectly balanced and wouldn't
move any tasks around. This, however, would yield this bandwidth split:

A = 50%
B = 25%
C = 25%

which is not the desired result.

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
  defined (see below)
- API change: two tunables introduced in sysfs (under SCHED_DEBUG) to
  control the frequency at which the load balance monitor thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is 
calculated. It's now calculated as the summation of group weights on a cpu,
rather than the summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
W(K,i)  = Weight of group K on cpu i
T(K,i)  = Task load present in group K's cfs_rq on cpu i
T(K)= Total task load of group K across various cpus
S(K)= Shares allocated to group K
NRCPUS  = Number of online cpus in the scheduler domain to
  which group K is assigned.

Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
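
To make the formula concrete, here is a small userspace sketch (not kernel
code; group_cpu_weight() is just an illustrative helper) that applies it to
the 2-cpu example above, assuming NICE_0_LOAD = 1024 and cpu.shares = 1024
for all three groups:

#include <stdio.h>

#define NICE_0_LOAD	1024

/* W(K,i) = S(K) * NRCPUS * T(K,i) / T(K) */
static unsigned long group_cpu_weight(unsigned long shares, int nrcpus,
				      unsigned long task_load_cpu,
				      unsigned long task_load_total)
{
	return shares * nrcpus * task_load_cpu / task_load_total;
}

int main(void)
{
	/* Group A: 10 tasks on CPU0; groups B and C: 5 tasks each on CPU1 */
	unsigned long w_a0 = group_cpu_weight(1024, 2, 10 * NICE_0_LOAD, 10 * NICE_0_LOAD);
	unsigned long w_b1 = group_cpu_weight(1024, 2, 5 * NICE_0_LOAD, 5 * NICE_0_LOAD);
	unsigned long w_c1 = group_cpu_weight(1024, 2, 5 * NICE_0_LOAD, 5 * NICE_0_LOAD);

	printf("CPU0 weight = %lu\n", w_a0);		/* 2048 */
	printf("CPU1 weight = %lu\n", w_b1 + w_c1);	/* 4096 */
	return 0;
}

With the old scheme both cpus had a weight of 10240 and looked perfectly
balanced; with the new weights CPU1 (4096) is seen as more loaded than CPU0
(2048), so the load balancer starts pulling tasks of groups B and C over to
CPU0, which is what group fairness requires.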

A load balance monitor thread is created at bootup, which periodically
runs and adjusts each group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which
it runs.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/sched.h |4 
 kernel/sched.c|  259 --
 kernel/sched_fair.c   |   88 ++--
 kernel/sysctl.c   |   18 +++
 4 files changed, 330 insertions(+), 39 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- curre

[Patch 4/5] sched: introduce a mutex and corresponding API to serialize access to doms_cur[] array

2007-11-26 Thread Srivatsa Vaddagiri
The doms_cur[] array represents various scheduling domains which are mutually
exclusive. Currently the cpusets code can modify this array (by calling
partition_sched_domains()) as a result of the user modifying the
sched_load_balance flag for various cpusets.

This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read the
doms_cur[] array without worrying about concurrent modifications to it.

The fair group scheduler code (introduced in the next patch of this series)
makes use of this mutex to walk through the doms_cur[] array while rebalancing
shares of task groups across cpus.
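
As an illustration, a reader of doms_cur[] is expected to use the new API
roughly as in the sketch below (rebalance_shares() is only a placeholder name
for the per-domain rebalancing work, not necessarily what the next patch calls
it):

	int i;

	lock_doms_cur();
	for (i = 0; i < ndoms_cur; i++)
		rebalance_shares(doms_cur + i);	/* read doms_cur[i] safely */
	unlock_doms_cur();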

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   19 +++
 1 files changed, 19 insertions(+)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -186,6 +186,9 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
  */
 static DEFINE_MUTEX(task_group_mutex);
 
+/* doms_cur_mutex serializes access to doms_cur[] array */
+static DEFINE_MUTEX(doms_cur_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -236,11 +239,23 @@ static inline void unlock_task_group_lis
mutex_unlock(&task_group_mutex);
 }
 
+static inline void lock_doms_cur(void)
+{
+   mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+   mutex_unlock(&doms_cur_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
 static inline void lock_task_group_list(void) { }
 static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6547,6 +6562,8 @@ void partition_sched_domains(int ndoms_n
 {
int i, j;
 
+   lock_doms_cur();
+
/* always unregister in case we don't destroy any domains */
unregister_sched_domain_sysctl();
 
@@ -6587,6 +6604,8 @@ match2:
ndoms_cur = ndoms_new;
 
register_sched_domain_sysctl();
+
+   unlock_doms_cur();
 }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)


-- 
Regards,
vatsa


[Patch 3/5 v2] sched: change how cpu load is calculated

2007-11-26 Thread Srivatsa Vaddagiri
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.
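
For ex: consider two task groups on a cpu, each with cpu.shares = 1024, one
running 3 nice-0 tasks and the other 1. Previously the fair-class contribution
to rq->load was 4 * 1024 = 4096; with this patch (assuming each group entity's
weight equals its shares) it is the sum of the two group weights,
1024 + 1024 = 2048, independent of how many tasks each group has queued on
that cpu.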

This version of the patch (v2 of Patch 3/5) has a minor impact on code size
(but should have no runtime/functional impact) for the !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/5
(because of fewer #ifdefs).

I prefer v2 of Patch 3/5.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   27 +++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,6 +870,16 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -880,26 +890,14 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_sub(&rq->load, p->se.load.weight);
-}
-
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
-   inc_load(rq, p);
 }
 
 static void dec_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running--;
-   dec_load(rq, p);
 }
 
 static void set_load_weight(struct task_struct *p)
@@ -4071,10 +4069,8 @@ void set_user_nice(struct task_struct *p
goto out_unlock;
}
on_rq = p->se.on_rq;
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, p, 0);
-   dec_load(rq, p);
-   }
 
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -4084,7 +4080,6 @@ void set_user_nice(struct task_struct *p
 
if (on_rq) {
enqueue_task(rq, p, 0);
-   inc_load(rq, p);
/*
 * If the task increased its priority or is running and
 * lowered its priority, then reschedule its CPU:
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /*
+* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq 
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /*
+* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (declo

[Patch 3/5 v1] sched: change how cpu load is calculated

2007-11-26 Thread Srivatsa Vaddagiri
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of the patch (v1 of Patch 3/5) has zero impact for the
!CONFIG_FAIR_GROUP_SCHED case.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>


---
 kernel/sched.c  |   38 ++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,15 +870,25 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
 
-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else  /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }
 
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
@@ -890,6 +900,18 @@ static inline void dec_load(struct rq *r
update_load_sub(&rq->load, p->se.load.weight);
 }
 
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#include "sched_fair.c"
+#include "sched_rt.c"
+#ifdef CONFIG_SCHED_DEBUG
+# include "sched_debug.c"
+#endif
+
+#define sched_class_highest (&rt_sched_class)
+
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /*
+* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq 
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /*
+* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (decload)
+   dec_cpu_load(rq, topse->load.weight);
 }
 
 /*
Index: current/kernel/sched_rt.c
===================================================================
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@ static void enqueue_task_rt(struct rq *r
 
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bi

[Patch 2/5] sched: minor fixes for group scheduler

2007-11-26 Thread Srivatsa Vaddagiri
Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
  changing shares of a task group. Use the same mutex when printing cfs_rq
  stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
  walking task group list)


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   34 ++
 kernel/sched_fair.c |4 +++-
 2 files changed, 29 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
-   /* spinlock to serialize modification to shares */
-   spinlock_t lock;
struct rcu_head rcu;
 };
 
@@ -182,6 +180,12 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
 
+/*
+ * task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -222,9 +226,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
 }
 
+static inline void lock_task_group_list(void)
+{
+   mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+   mutex_unlock(&task_group_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6747,7 +6763,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
-   spin_lock_init(&init_task_group.lock);
 #endif
 
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6987,14 +7002,15 @@ struct task_group *sched_create_group(vo
se->parent = NULL;
}
 
+   tg->shares = NICE_0_LOAD;
+
+   lock_task_group_list();
for_each_possible_cpu(i) {
rq = cpu_rq(i);
cfs_rq = tg->cfs_rq[i];
list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}
-
-   tg->shares = NICE_0_LOAD;
-   spin_lock_init(&tg->lock);
+   unlock_task_group_list();
 
return tg;
 
@@ -7040,10 +7056,12 @@ void sched_destroy_group(struct task_gro
struct cfs_rq *cfs_rq = NULL;
int i;
 
+   lock_task_group_list();
for_each_possible_cpu(i) {
cfs_rq = tg->cfs_rq[i];
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}
+   unlock_task_group_list();
 
BUG_ON(!cfs_rq);
 
@@ -7117,7 +7135,7 @@ int sched_group_set_shares(struct task_g
 {
int i;
 
-   spin_lock(&tg->lock);
+   lock_task_group_list();
if (tg->shares == shares)
goto done;
 
@@ -7126,7 +7144,7 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);
 
 done:
-   spin_unlock(&tg->lock);
+   unlock_task_group_list();
return 0;
 }
 
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
-   list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+   list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline int
@@ -1126,7 +1126,9 @@ static void print_cfs_stats(struct seq_f
 #ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
 #endif
+   lock_task_group_list();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+   unlock_task_group_list();
 }
 #endif

-- 
Regards,
vatsa

