Re: [PATCH v1] virtio_pmem: populate numa information

2022-11-16 Thread Pankaj Gupta
> > > > > > > > Compute the numa information for a virtio_pmem device from the 
> > > > > > > > memory
> > > > > > > > range of the device. Previously, the target_node was always 0 
> > > > > > > > since
> > > > > > > > the ndr_desc.target_node field was never explicitly set. The 
> > > > > > > > code for
> > > > > > > > computing the numa node is taken from cxl_pmem_region_probe in
> > > > > > > > drivers/cxl/pmem.c.
> > > > > > > >
> > > > > > > > Signed-off-by: Michael Sammler 
> > >
> > > Tested-by: Mina Almasry 
> > >
> > > I don't have much expertise on this driver, but with the help of this
> > > patch I was able to get memory tiering [1] emulation going on qemu. As
> > > far as I know there is no alternative to this emulation, and so I
> > > would love to see this or equivalent merged, if possible.
> > >
> > > This is what I have going to get memory tiering emulation:
> > >
> > > In qemu, added these configs:
> > >   -object 
> > > memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G
> > > \
> > >   -smp 2,sockets=2,maxcpus=2  \
> > >   -numa node,nodeid=0,memdev=m0 \
> > >   -numa node,nodeid=1,memdev=m1 \
> > >   -numa node,nodeid=2,memdev=m2,initiator=0 \
> > >   -numa node,nodeid=3,initiator=0 \
> > >   -device virtio-pmem-pci,memdev=m4,id=nvdimm1 \
> > >
> > > On boot, ran these commands:
> > > ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> 
> > > /dev/null
> > > echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > > echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> > > for i in `ls /sys/devices/system/memory/`; do
> > >   state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
> > >   if [ "$state" == "offline" ]; then
> > > echo online_movable > "/sys/devices/system/memory/$i/state"
> > >   fi
> > > done
> >
> > Nice to see the way to handle the virtio-pmem device memory through the
> > kmem driver and online the corresponding memory blocks to 'zone_movable'.
> >
> > This also opens a way to use this memory range directly, irrespective of
> > the attached block device. Of course there won't be any persistent data
> > guarantee, but it is a good way to simulate memory tiering inside the
> > guest, as demonstrated below.
> > >
> > > Without this CL, I see the memory onlined in node 0 always, and is not
> > > a separate memory tier. With this CL and qemu configs, the memory is
> > > onlined in node 3 and is set as a separate memory tier, which enables
> > > qemu-based development:
> > >
> > > ==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
> > > 3
> > > ==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
> > > 0-2
> > >
> > > AFAIK there is no alternative to enabling memory tiering emulation in
> > > qemu, and would love to see this or equivalent merged, if possible.
> >
> > Just wondering if the QEMU vNVDIMM device can also achieve this?
> >
>
> I spent a few minutes on this. Please note I'm really not familiar
> with these drivers, but as far as I can tell the qemu vNVDIMM device
> has the same problem and needs a fix similar to what Michael
> did here. What I did with the vNVDIMM qemu device:
>
> - Added these qemu configs:
>   -object 
> memory-backend-file,id=m4,share=on,mem-path=./hello,size=2G,readonly=off
> \
>   -device nvdimm,id=nvdimm1,memdev=m4,unarmed=off \
>
> - Ran the same commands in my previous email (they seem to apply to
> the vNVDIMM device without modification):
> ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
> echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> for i in `ls /sys/devices/system/memory/`; do
>   state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
>   if [ "$state" == "offline" ]; then
> echo online_movable > "/sys/devices/system/memory/$i/state"
>   fi
> done
>
> I see the memory from the vNVDIMM device onlined on node 0, and it is
> not detected as a separate memory tier. I suspect that driver needs a
> similar fix to this one.

Thanks for trying. It seems the vNVDIMM device already has an option to
provide the target node [1].

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg827765.html



Re: [PATCH v1] virtio_pmem: populate numa information

2022-11-13 Thread Pankaj Gupta
> > Pankaj Gupta wrote:
> > > > > > Compute the numa information for a virtio_pmem device from the 
> > > > > > memory
> > > > > > range of the device. Previously, the target_node was always 0 since
> > > > > > the ndr_desc.target_node field was never explicitly set. The code 
> > > > > > for
> > > > > > computing the numa node is taken from cxl_pmem_region_probe in
> > > > > > drivers/cxl/pmem.c.
> > > > > >
> > > > > > Signed-off-by: Michael Sammler 
>
> Tested-by: Mina Almasry 
>
> I don't have much expertise on this driver, but with the help of this
> patch I was able to get memory tiering [1] emulation going on qemu. As
> far as I know there is no alternative to this emulation, and so I
> would love to see this or equivalent merged, if possible.
>
> This is what I have going to get memory tiering emulation:
>
> In qemu, added these configs:
>   -object 
> memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G
> \
>   -smp 2,sockets=2,maxcpus=2  \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1 \
>   -numa node,nodeid=2,memdev=m2,initiator=0 \
>   -numa node,nodeid=3,initiator=0 \
>   -device virtio-pmem-pci,memdev=m4,id=nvdimm1 \
>
> On boot, ran these commands:
> ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
> echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> for i in `ls /sys/devices/system/memory/`; do
>   state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
>   if [ "$state" == "offline" ]; then
> echo online_movable > "/sys/devices/system/memory/$i/state"
>   fi
> done

Nice to see the way to handle the virtio-pmem device memory through the kmem
driver and online the corresponding memory blocks to 'zone_movable'.

This also opens a way to use this memory range directly, irrespective of the
attached block device. Of course there won't be any persistent data guarantee,
but it is a good way to simulate memory tiering inside the guest, as
demonstrated below.
>
> Without this CL, I see the memory onlined in node 0 always, and is not
> a separate memory tier. With this CL and qemu configs, the memory is
> onlined in node 3 and is set as a separate memory tier, which enables
> qemu-based development:
>
> ==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
> 3
> ==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
> 0-2
>
> AFAIK there is no alternative to enabling memory tiering emulation in
> qemu, and would love to see this or equivalent merged, if possible.

Just wondering if the QEMU vNVDIMM device can also achieve this?

In any case, this patch is useful. So,
Reviewed-by: Pankaj Gupta 
>
>
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
>
> > > > > > ---
> > > > > >  drivers/nvdimm/virtio_pmem.c | 11 +--
> > > > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/nvdimm/virtio_pmem.c 
> > > > > > b/drivers/nvdimm/virtio_pmem.c
> > > > > > index 20da455d2ef6..a92eb172f0e7 100644
> > > > > > --- a/drivers/nvdimm/virtio_pmem.c
> > > > > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > > > > @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
> > > > > >  static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > > >  {
> > > > > > struct nd_region_desc ndr_desc = {};
> > > > > > -   int nid = dev_to_node(&vdev->dev);
> > > > > > struct nd_region *nd_region;
> > > > > > struct virtio_pmem *vpmem;
> > > > > > struct resource res;
> > > > > > @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct 
> > > > > > virtio_device *vdev)
> > > > > > dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> > > > > >
> > > > > > ndr_desc.res = &res;
> > > > > > -   ndr_desc.numa_node = nid;
> > > > > > +
> > > > > > +   ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> > > > > > +   ndr_desc.target_node = phys_to_target_node(res.start);
> >

Re: [PATCH v1] virtio_pmem: populate numa information

2022-10-26 Thread Pankaj Gupta
> > > Compute the numa information for a virtio_pmem device from the memory
> > > range of the device. Previously, the target_node was always 0 since
> > > the ndr_desc.target_node field was never explicitly set. The code for
> > > computing the numa node is taken from cxl_pmem_region_probe in
> > > drivers/cxl/pmem.c.
> > >
> > > Signed-off-by: Michael Sammler 
> > > ---
> > >  drivers/nvdimm/virtio_pmem.c | 11 +--
> > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > index 20da455d2ef6..a92eb172f0e7 100644
> > > --- a/drivers/nvdimm/virtio_pmem.c
> > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
> > >  static int virtio_pmem_probe(struct virtio_device *vdev)
> > >  {
> > > struct nd_region_desc ndr_desc = {};
> > > -   int nid = dev_to_node(&vdev->dev);
> > > struct nd_region *nd_region;
> > > struct virtio_pmem *vpmem;
> > > struct resource res;
> > > @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device 
> > > *vdev)
> > > dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> > >
> > > ndr_desc.res = &res;
> > > -   ndr_desc.numa_node = nid;
> > > +
> > > +   ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> > > +   ndr_desc.target_node = phys_to_target_node(res.start);
> > > +   if (ndr_desc.target_node == NUMA_NO_NODE) {
> > > +   ndr_desc.target_node = ndr_desc.numa_node;
> > > +   dev_dbg(&vdev->dev, "changing target node from %d to %d",
> > > +   NUMA_NO_NODE, ndr_desc.target_node);
> > > +   }
> >
> > As this memory later gets hotplugged using "devm_memremap_pages", I don't
> > see whether 'target_node' is used for the fsdax case.
> >
> > It seems to me "target_node" is used mainly for the volatile range above
> > persistent memory (e.g. the kmem driver?).
> >
> I am not sure if 'target_node' is used in the fsdax case, but it is
> indeed used by the devdax/kmem driver when hotplugging the memory (see
> 'dev_dax_kmem_probe' and '__dax_pmem_probe').

Yes, but not currently for FS_DAX iiuc.

Thanks,
Pankaj
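
For context on how the devdax/kmem path consumes 'target_node', here is a
simplified sketch of the relevant probe logic. It paraphrases
drivers/dax/kmem.c for illustration and is not the exact upstream code; the
function name and the dax_kmem_range() helper are assumptions made for the
sketch, and the code would live inside the dax driver with its private headers.

/*
 * Simplified sketch (not exact upstream code): the kmem driver takes the
 * dax region's target_node and hotplugs the memory there, which is why a
 * target_node that silently defaults to 0 merges the range into node 0.
 */
static int dev_dax_kmem_probe_sketch(struct dev_dax *dev_dax)
{
	struct range range = dax_kmem_range(dev_dax);	/* assumed helper */
	int numa_node = dev_dax->target_node;

	if (numa_node < 0) {
		dev_warn(&dev_dax->dev,
			 "rejecting DAX region with invalid node: %d\n",
			 numa_node);
		return -EINVAL;
	}

	/* The range comes up as ordinary hotpluggable memory on numa_node. */
	return add_memory_driver_managed(numa_node, range.start,
					 range_len(&range),
					 "System RAM (kmem)", MHP_NONE);
}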



Re: [PATCH v1] virtio_pmem: populate numa information

2022-10-21 Thread Pankaj Gupta
> Compute the numa information for a virtio_pmem device from the memory
> range of the device. Previously, the target_node was always 0 since
> the ndr_desc.target_node field was never explicitly set. The code for
> computing the numa node is taken from cxl_pmem_region_probe in
> drivers/cxl/pmem.c.
>
> Signed-off-by: Michael Sammler 
> ---
>  drivers/nvdimm/virtio_pmem.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index 20da455d2ef6..a92eb172f0e7 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
>  static int virtio_pmem_probe(struct virtio_device *vdev)
>  {
> struct nd_region_desc ndr_desc = {};
> -   int nid = dev_to_node(&vdev->dev);
> struct nd_region *nd_region;
> struct virtio_pmem *vpmem;
> struct resource res;
> @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
>
> ndr_desc.res = &res;
> -   ndr_desc.numa_node = nid;
> +
> +   ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> +   ndr_desc.target_node = phys_to_target_node(res.start);
> +   if (ndr_desc.target_node == NUMA_NO_NODE) {
> +   ndr_desc.target_node = ndr_desc.numa_node;
> +   dev_dbg(&vdev->dev, "changing target node from %d to %d",
> +   NUMA_NO_NODE, ndr_desc.target_node);
> +   }

As this memory later gets hotplugged using "devm_memremap_pages", I don't
see whether 'target_node' is used for the fsdax case.

It seems to me "target_node" is used mainly for the volatile range above
persistent memory (e.g. the kmem driver?).

Thanks,
Pankaj

> +
> ndr_desc.flush = async_pmem_flush;
> ndr_desc.provider_data = vdev;
> set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> --
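
The split between 'numa_node' and 'target_node' in the hunk above is easy to
miss, so a short summary follows. This is my reading of the generic MM
helpers, not text taken from the patch.

/*
 * memory_add_physaddr_to_nid(res.start)
 *	Best-effort mapping of a physical address to an online node; used
 *	for ndr_desc.numa_node so the region's own allocations have a sane
 *	home node at probe time.
 *
 * phys_to_target_node(res.start)
 *	Node that the (virtual) firmware description associates with the
 *	range, which may still be offline or memoryless when the device
 *	probes; used for ndr_desc.target_node so dax/kmem can later online
 *	the memory into its own node. Returns NUMA_NO_NODE when no such
 *	association exists, hence the fallback to numa_node above.
 */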



Re: [RFC v3 1/2] virtio-pmem: Async virtio-pmem flush

2022-02-16 Thread Pankaj Gupta
> >
> > > >
> > > > Enable asynchronous flush for virtio pmem using work queue. Also,
> > > > coalesce the flush requests when a flush is already in process.
> > > > This functionality is copied from md/RAID code.
> > > >
> > > > When a flush is already in process, new flush requests wait till
> > > > previous flush completes in another context (work queue). For all
> > > > the requests come between ongoing flush and new flush start time, only
> > > > single flush executes, thus adhers to flush coalscing logic. This is
> > >
> > > s/adhers/adheres/
> > >
> > > s/coalscing/coalescing/
> > >
> > > > important for maintaining the flush request order with request 
> > > > coalscing.
> > >
> > > s/coalscing/coalescing/
> >
> > o.k. Sorry for the spelling mistakes.
> >
> > >
> > > >
> > > > Signed-off-by: Pankaj Gupta 
> > > > ---
> > > >  drivers/nvdimm/nd_virtio.c   | 74 +++-
> > > >  drivers/nvdimm/virtio_pmem.c | 10 +
> > > >  drivers/nvdimm/virtio_pmem.h | 16 
> > > >  3 files changed, 83 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > > > index 10351d5b49fa..179ea7a73338 100644
> > > > --- a/drivers/nvdimm/nd_virtio.c
> > > > +++ b/drivers/nvdimm/nd_virtio.c
> > > > @@ -100,26 +100,66 @@ static int virtio_pmem_flush(struct nd_region 
> > > > *nd_region)
> > > >  /* The asynchronous flush callback function */
> > > >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > > >  {
> > > > -   /*
> > > > -* Create child bio for asynchronous flush and chain with
> > > > -* parent bio. Otherwise directly call nd_region flush.
> > > > +   /* queue asynchronous flush and coalesce the flush requests */
> > > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > > +   struct virtio_pmem *vpmem  = vdev->priv;
> > > > +   ktime_t req_start = ktime_get_boottime();
> > > > +   int ret = -EINPROGRESS;
> > > > +
> > > > +   spin_lock_irq(&vpmem->lock);
> > >
> > > Why a new lock and not continue to use ->pmem_lock?
> >
> > This spinlock is to protect entry into 'wait_event_lock_irq',
> > and the other spinlock is to protect the virtio queue data.
>
> Understood, but md shares the mddev->lock for both purposes, so I
> would ask that you either document what motivates the locking split,
> or just reuse the lock until a strong reason to split them arises.

O.k. Will check again whether we can use the same lock, or document it.

Thanks,
Pankaj
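
If the two locks are kept, the documentation Dan asks for could be as small
as a pair of comments on the fields in virtio_pmem.h. The sketch below is
illustrative only; the field names come from the patch, the structure name
and comment wording are mine.

#include <linux/spinlock.h>

struct virtio_pmem_locking_sketch {
	/* Serializes access to the request virtqueue and req_list. */
	spinlock_t pmem_lock;

	/*
	 * Protects the flush-coalescing state (flush_bio, prev_flush_start)
	 * and is the lock handed to wait_event_lock_irq() in
	 * async_pmem_flush(); kept separate from pmem_lock so flush waiters
	 * do not contend with virtqueue completion handling.
	 */
	spinlock_t lock;
};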



Re: [RFC v3 2/2] pmem: enable pmem_submit_bio for asynchronous flush

2022-02-16 Thread Pankaj Gupta
> > > > Return from "pmem_submit_bio" when asynchronous flush is
> > > > still in progress in other context.
> > > >
> > > > Signed-off-by: Pankaj Gupta 
> > > > ---
> > > >  drivers/nvdimm/pmem.c| 15 ---
> > > >  drivers/nvdimm/region_devs.c |  4 +++-
> > > >  2 files changed, 15 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > index fe7ece1534e1..f20e30277a68 100644
> > > > --- a/drivers/nvdimm/pmem.c
> > > > +++ b/drivers/nvdimm/pmem.c
> > > > @@ -201,8 +201,12 @@ static void pmem_submit_bio(struct bio *bio)
> > > > struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
> > > > struct nd_region *nd_region = to_region(pmem);
> > > >
> > > > -   if (bio->bi_opf & REQ_PREFLUSH)
> > > > +   if (bio->bi_opf & REQ_PREFLUSH) {
> > > > ret = nvdimm_flush(nd_region, bio);
> > > > +   /* asynchronous flush completes in other context */
> > >
> > > I think a negative error code is a confusing way to capture the case
> > > of "bio successfully coalesced to previously pending flush request.
> > > Perhaps reserve negative codes for failure, 0 for synchronously
> > > completed, and > 0 for coalesced flush request.
> >
> > Yes. I implemented it this way previously; will revert to that. Thanks!
> >
> > >
> > > > +   if (ret == -EINPROGRESS)
> > > > +   return;
> > > > +   }
> > > >
> > > > do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
> > > > if (do_acct)
> > > > @@ -222,13 +226,18 @@ static void pmem_submit_bio(struct bio *bio)
> > > > if (do_acct)
> > > > bio_end_io_acct(bio, start);
> > > >
> > > > -   if (bio->bi_opf & REQ_FUA)
> > > > +   if (bio->bi_opf & REQ_FUA) {
> > > > ret = nvdimm_flush(nd_region, bio);
> > > > +   /* asynchronous flush completes in other context */
> > > > +   if (ret == -EINPROGRESS)
> > > > +   return;
> > > > +   }
> > > >
> > > > if (ret)
> > > > bio->bi_status = errno_to_blk_status(ret);
> > > >
> > > > -   bio_endio(bio);
> > > > +   if (bio)
> > > > +   bio_endio(bio);
> > > >  }
> > > >
> > > >  static int pmem_rw_page(struct block_device *bdev, sector_t sector,
> > > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > > > index 9ccf3d608799..8512d2eaed4e 100644
> > > > --- a/drivers/nvdimm/region_devs.c
> > > > +++ b/drivers/nvdimm/region_devs.c
> > > > @@ -1190,7 +1190,9 @@ int nvdimm_flush(struct nd_region *nd_region, 
> > > > struct bio *bio)
> > > > if (!nd_region->flush)
> > > > rc = generic_nvdimm_flush(nd_region);
> > > > else {
> > > > -   if (nd_region->flush(nd_region, bio))
> > > > +   rc = nd_region->flush(nd_region, bio);
> > > > +   /* ongoing flush in other context */
> > > > +   if (rc && rc != -EINPROGRESS)
> > > > rc = -EIO;
> > >
> > > Why change this to -EIO vs just let the error code through untranslated?
> >
> > The reason was to return a generic error code instead of exposing host-side
> > return codes to the guest?
>
> Ok, maybe a comment to indicate the need to avoid exposing these error
> codes to a guest so someone does not ask the same question in the
> future?

Sure.

Thanks,
Pankaj
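
Dan's suggested convention (negative for failure, 0 for a synchronously
completed flush, positive for a coalesced one) would make the caller read
roughly as below. This is a sketch of the proposal, not code that was merged,
and the helper name is invented for illustration.

#include <linux/blkdev.h>
#include <linux/libnvdimm.h>

/* nvdimm_flush(): < 0 failure, 0 completed synchronously, > 0 coalesced
 * into an in-flight asynchronous flush (the flush work ends the bio). */
static void pmem_flush_and_complete_sketch(struct nd_region *nd_region,
					   struct bio *bio)
{
	int ret = nvdimm_flush(nd_region, bio);

	if (ret > 0)
		return;		/* coalesced: the flush work will end the bio */

	if (ret < 0)
		bio->bi_status = errno_to_blk_status(ret);

	bio_endio(bio);
}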



Re: [RFC v3 1/2] virtio-pmem: Async virtio-pmem flush

2022-02-16 Thread Pankaj Gupta
> >
> > Enable asynchronous flush for virtio pmem using work queue. Also,
> > coalesce the flush requests when a flush is already in process.
> > This functionality is copied from md/RAID code.
> >
> > When a flush is already in process, new flush requests wait till
> > previous flush completes in another context (work queue). For all
> > the requests come between ongoing flush and new flush start time, only
> > single flush executes, thus adhers to flush coalscing logic. This is
>
> s/adhers/adheres/
>
> s/coalscing/coalescing/
>
> > important for maintaining the flush request order with request coalscing.
>
> s/coalscing/coalescing/

o.k. Sorry for the spelling mistakes.

>
> >
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  drivers/nvdimm/nd_virtio.c   | 74 +++-
> >  drivers/nvdimm/virtio_pmem.c | 10 +
> >  drivers/nvdimm/virtio_pmem.h | 16 
> >  3 files changed, 83 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > index 10351d5b49fa..179ea7a73338 100644
> > --- a/drivers/nvdimm/nd_virtio.c
> > +++ b/drivers/nvdimm/nd_virtio.c
> > @@ -100,26 +100,66 @@ static int virtio_pmem_flush(struct nd_region 
> > *nd_region)
> >  /* The asynchronous flush callback function */
> >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> >  {
> > -   /*
> > -* Create child bio for asynchronous flush and chain with
> > -* parent bio. Otherwise directly call nd_region flush.
> > +   /* queue asynchronous flush and coalesce the flush requests */
> > +   struct virtio_device *vdev = nd_region->provider_data;
> > +   struct virtio_pmem *vpmem  = vdev->priv;
> > +   ktime_t req_start = ktime_get_boottime();
> > +   int ret = -EINPROGRESS;
> > +
> > +   spin_lock_irq(&vpmem->lock);
>
> Why a new lock and not continue to use ->pmem_lock?

This spinlock is to protect entry into 'wait_event_lock_irq',
and the other spinlock is to protect the virtio queue data.
>
> Have you tested this with CONFIG_PROVE_LOCKING?
No, I only ran xfs tests and some of my unit test programs.
Will enable and test with CONFIG_PROVE_LOCKING.

>
> Along those lines do you have a selftest that can be added to the
> kernel as well so that 0day or other bots could offer early warnings
> on regressions?

Will try to add one.

Thank you Dan for the feedback!

Best regards,
Pankaj
>
> > +   /* flush requests wait until ongoing flush completes,
> > +* hence coalescing all the pending requests.
> >  */
> > -   if (bio && bio->bi_iter.bi_sector != -1) {
> > -   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > -
> > -   if (!child)
> > -   return -ENOMEM;
> > -   bio_copy_dev(child, bio);
> > -   child->bi_opf = REQ_PREFLUSH;
> > -   child->bi_iter.bi_sector = -1;
> > -   bio_chain(child, bio);
> > -   submit_bio(child);
> > -   return 0;
> > +   wait_event_lock_irq(vpmem->sb_wait,
> > +   !vpmem->flush_bio ||
> > +   ktime_before(req_start, 
> > vpmem->prev_flush_start),
> > +   vpmem->lock);
> > +   /* new request after previous flush is completed */
> > +   if (ktime_after(req_start, vpmem->prev_flush_start)) {
> > +   WARN_ON(vpmem->flush_bio);
> > +   vpmem->flush_bio = bio;
> > +   bio = NULL;
> > +   }
> > +   spin_unlock_irq(&vpmem->lock);
> > +
> > +   if (!bio)
> > +   queue_work(vpmem->pmem_wq, &vpmem->flush_work);
> > +   else {
> > +   /* flush completed in other context while we waited */
> > +   if (bio && (bio->bi_opf & REQ_PREFLUSH))
> > +   bio->bi_opf &= ~REQ_PREFLUSH;
> > +   else if (bio && (bio->bi_opf & REQ_FUA))
> > +   bio->bi_opf &= ~REQ_FUA;
> > +
> > +   ret = vpmem->prev_flush_err;
> > }
> > -   if (virtio_pmem_flush(nd_region))
> > -   return -EIO;
> >
> > -   return 0;
> > +   return ret;
> >  };
> >  EXPORT_SYMBOL_GPL(async_pmem_flush);
> > +
> > +void submit_async_flush(struct work_struct *ws)
>
> This name is too generic to be exported 

Re: [RFC v3 2/2] pmem: enable pmem_submit_bio for asynchronous flush

2022-02-16 Thread Pankaj Gupta
> >
> > Return from "pmem_submit_bio" when asynchronous flush is
> > still in progress in other context.
> >
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  drivers/nvdimm/pmem.c| 15 ---
> >  drivers/nvdimm/region_devs.c |  4 +++-
> >  2 files changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index fe7ece1534e1..f20e30277a68 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -201,8 +201,12 @@ static void pmem_submit_bio(struct bio *bio)
> > struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
> > struct nd_region *nd_region = to_region(pmem);
> >
> > -   if (bio->bi_opf & REQ_PREFLUSH)
> > +   if (bio->bi_opf & REQ_PREFLUSH) {
> > ret = nvdimm_flush(nd_region, bio);
> > +   /* asynchronous flush completes in other context */
>
> I think a negative error code is a confusing way to capture the case
> of "bio successfully coalesced to previously pending flush request.
> Perhaps reserve negative codes for failure, 0 for synchronously
> completed, and > 0 for coalesced flush request.

Yes. I implemented it this way previously; will revert to that. Thanks!

>
> > +   if (ret == -EINPROGRESS)
> > +   return;
> > +   }
> >
> > do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
> > if (do_acct)
> > @@ -222,13 +226,18 @@ static void pmem_submit_bio(struct bio *bio)
> > if (do_acct)
> > bio_end_io_acct(bio, start);
> >
> > -   if (bio->bi_opf & REQ_FUA)
> > +   if (bio->bi_opf & REQ_FUA) {
> > ret = nvdimm_flush(nd_region, bio);
> > +   /* asynchronous flush completes in other context */
> > +   if (ret == -EINPROGRESS)
> > +   return;
> > +   }
> >
> > if (ret)
> > bio->bi_status = errno_to_blk_status(ret);
> >
> > -   bio_endio(bio);
> > +   if (bio)
> > +   bio_endio(bio);
> >  }
> >
> >  static int pmem_rw_page(struct block_device *bdev, sector_t sector,
> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > index 9ccf3d608799..8512d2eaed4e 100644
> > --- a/drivers/nvdimm/region_devs.c
> > +++ b/drivers/nvdimm/region_devs.c
> > @@ -1190,7 +1190,9 @@ int nvdimm_flush(struct nd_region *nd_region, struct 
> > bio *bio)
> > if (!nd_region->flush)
> > rc = generic_nvdimm_flush(nd_region);
> > else {
> > -   if (nd_region->flush(nd_region, bio))
> > +   rc = nd_region->flush(nd_region, bio);
> > +   /* ongoing flush in other context */
> > +   if (rc && rc != -EINPROGRESS)
> > rc = -EIO;
>
> Why change this to -EIO vs just let the error code through untranslated?

The reason was to return a generic error code instead of exposing host-side
return codes to the guest?

Thanks!
Pankaj
>
> > }
> >
> > --
> > 2.25.1
> >
> >



Re: [RFC v3 0/2] virtio-pmem: Asynchronous flush

2022-01-28 Thread Pankaj Gupta
ping

>  Jeff reported preflush order issue with the existing implementation
>  of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush
>  for virtio pmem using work queue as done in md/RAID. This patch series
>  intends to solve the preflush ordering issue and makes the flush asynchronous
>  for the submitting thread. Also, adds the flush coalescing logic.
>
>  Submitting this RFC v3 series for review. Thank You!
>
>  RFC v2 -> RFC v3
>  - Improve commit log message - patch1.
>  - Improve return error handling for Async flush.
>  - declare'INIT_WORK' only once.
>  - More testing and bug fix.
>
>  [1] https://marc.info/?l=linux-kernel=157446316409937=2
>
> Pankaj Gupta (2):
>   virtio-pmem: Async virtio-pmem flush
>   pmem: enable pmem_submit_bio for asynchronous flush
>
>  drivers/nvdimm/nd_virtio.c   | 74 +++-
>  drivers/nvdimm/pmem.c| 15 ++--
>  drivers/nvdimm/region_devs.c |  4 +-
>  drivers/nvdimm/virtio_pmem.c | 10 +
>  drivers/nvdimm/virtio_pmem.h | 16 
>  5 files changed, 98 insertions(+), 21 deletions(-)
>
> --
> 2.25.1
>



[RFC v3 2/2] pmem: enable pmem_submit_bio for asynchronous flush

2022-01-11 Thread Pankaj Gupta
Return from "pmem_submit_bio" when asynchronous flush is
still in progress in other context.

Signed-off-by: Pankaj Gupta 
---
 drivers/nvdimm/pmem.c| 15 ---
 drivers/nvdimm/region_devs.c |  4 +++-
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fe7ece1534e1..f20e30277a68 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -201,8 +201,12 @@ static void pmem_submit_bio(struct bio *bio)
struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
struct nd_region *nd_region = to_region(pmem);
 
-   if (bio->bi_opf & REQ_PREFLUSH)
+   if (bio->bi_opf & REQ_PREFLUSH) {
ret = nvdimm_flush(nd_region, bio);
+   /* asynchronous flush completes in other context */
+   if (ret == -EINPROGRESS)
+   return;
+   }
 
do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
if (do_acct)
@@ -222,13 +226,18 @@ static void pmem_submit_bio(struct bio *bio)
if (do_acct)
bio_end_io_acct(bio, start);
 
-   if (bio->bi_opf & REQ_FUA)
+   if (bio->bi_opf & REQ_FUA) {
ret = nvdimm_flush(nd_region, bio);
+   /* asynchronous flush completes in other context */
+   if (ret == -EINPROGRESS)
+   return;
+   }
 
if (ret)
bio->bi_status = errno_to_blk_status(ret);
 
-   bio_endio(bio);
+   if (bio)
+   bio_endio(bio);
 }
 
 static int pmem_rw_page(struct block_device *bdev, sector_t sector,
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 9ccf3d608799..8512d2eaed4e 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1190,7 +1190,9 @@ int nvdimm_flush(struct nd_region *nd_region, struct bio 
*bio)
if (!nd_region->flush)
rc = generic_nvdimm_flush(nd_region);
else {
-   if (nd_region->flush(nd_region, bio))
+   rc = nd_region->flush(nd_region, bio);
+   /* ongoing flush in other context */
+   if (rc && rc != -EINPROGRESS)
rc = -EIO;
}
 
-- 
2.25.1




[RFC v3 1/2] virtio-pmem: Async virtio-pmem flush

2022-01-11 Thread Pankaj Gupta
Enable asynchronous flush for virtio pmem using work queue. Also,
coalesce the flush requests when a flush is already in process.
This functionality is copied from md/RAID code.

When a flush is already in process, new flush requests wait till
previous flush completes in another context (work queue). For all
the requests come between ongoing flush and new flush start time, only
single flush executes, thus adhers to flush coalscing logic. This is
important for maintaining the flush request order with request coalscing.

Signed-off-by: Pankaj Gupta 
---
 drivers/nvdimm/nd_virtio.c   | 74 +++-
 drivers/nvdimm/virtio_pmem.c | 10 +
 drivers/nvdimm/virtio_pmem.h | 16 
 3 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 10351d5b49fa..179ea7a73338 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -100,26 +100,66 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 /* The asynchronous flush callback function */
 int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 {
-   /*
-* Create child bio for asynchronous flush and chain with
-* parent bio. Otherwise directly call nd_region flush.
+   /* queue asynchronous flush and coalesce the flush requests */
+   struct virtio_device *vdev = nd_region->provider_data;
+   struct virtio_pmem *vpmem  = vdev->priv;
+   ktime_t req_start = ktime_get_boottime();
+   int ret = -EINPROGRESS;
+
+   spin_lock_irq(&vpmem->lock);
+   /* flush requests wait until ongoing flush completes,
+* hence coalescing all the pending requests.
 */
-   if (bio && bio->bi_iter.bi_sector != -1) {
-   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
-
-   if (!child)
-   return -ENOMEM;
-   bio_copy_dev(child, bio);
-   child->bi_opf = REQ_PREFLUSH;
-   child->bi_iter.bi_sector = -1;
-   bio_chain(child, bio);
-   submit_bio(child);
-   return 0;
+   wait_event_lock_irq(vpmem->sb_wait,
+   !vpmem->flush_bio ||
+   ktime_before(req_start, vpmem->prev_flush_start),
+   vpmem->lock);
+   /* new request after previous flush is completed */
+   if (ktime_after(req_start, vpmem->prev_flush_start)) {
+   WARN_ON(vpmem->flush_bio);
+   vpmem->flush_bio = bio;
+   bio = NULL;
+   }
+   spin_unlock_irq(&vpmem->lock);
+
+   if (!bio)
+   queue_work(vpmem->pmem_wq, &vpmem->flush_work);
+   else {
+   /* flush completed in other context while we waited */
+   if (bio && (bio->bi_opf & REQ_PREFLUSH))
+   bio->bi_opf &= ~REQ_PREFLUSH;
+   else if (bio && (bio->bi_opf & REQ_FUA))
+   bio->bi_opf &= ~REQ_FUA;
+
+   ret = vpmem->prev_flush_err;
}
-   if (virtio_pmem_flush(nd_region))
-   return -EIO;
 
-   return 0;
+   return ret;
 };
 EXPORT_SYMBOL_GPL(async_pmem_flush);
+
+void submit_async_flush(struct work_struct *ws)
+{
+   struct virtio_pmem *vpmem = container_of(ws, struct virtio_pmem, 
flush_work);
+   struct bio *bio = vpmem->flush_bio;
+
+   vpmem->start_flush = ktime_get_boottime();
+   vpmem->prev_flush_err = virtio_pmem_flush(vpmem->nd_region);
+   vpmem->prev_flush_start = vpmem->start_flush;
+   vpmem->flush_bio = NULL;
+   wake_up(&vpmem->sb_wait);
+
+   if (vpmem->prev_flush_err)
+   bio->bi_status = errno_to_blk_status(-EIO);
+
+   /* Submit parent bio only for PREFLUSH */
+   if (bio && (bio->bi_opf & REQ_PREFLUSH)) {
+   bio->bi_opf &= ~REQ_PREFLUSH;
+   submit_bio(bio);
+   } else if (bio && (bio->bi_opf & REQ_FUA)) {
+   bio->bi_opf &= ~REQ_FUA;
+   bio_endio(bio);
+   }
+}
+EXPORT_SYMBOL_GPL(submit_async_flush);
 MODULE_LICENSE("GPL");
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index 726c7354d465..75ed9b7ddea1 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -24,6 +24,7 @@ static int init_vq(struct virtio_pmem *vpmem)
return PTR_ERR(vpmem->req_vq);
 
spin_lock_init(&vpmem->pmem_lock);
+   spin_lock_init(&vpmem->lock);
INIT_LIST_HEAD(&vpmem->req_list);
 
return 0;
@@ -57,7 +58,14 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
dev_err(&vdev->dev, "failed to initialize virtio pmem vq's\n");
goto out_err;
}
+   vpmem->pmem_wq = alloc_workqueue("vpm

[RFC v3 0/2] virtio-pmem: Asynchronous flush

2022-01-11 Thread Pankaj Gupta
 Jeff reported preflush order issue with the existing implementation
 of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush
 for virtio pmem using work queue as done in md/RAID. This patch series
 intends to solve the preflush ordering issue and makes the flush asynchronous
 for the submitting thread. Also, adds the flush coalescing logic.

 Submitting this RFC v3 series for review. Thank You!

 RFC v2 -> RFC v3
 - Improve commit log message - patch1.
 - Improve return error handling for Async flush.
 - declare'INIT_WORK' only once.
 - More testing and bug fix.

 [1] https://marc.info/?l=linux-kernel=157446316409937=2

Pankaj Gupta (2):
  virtio-pmem: Async virtio-pmem flush
  pmem: enable pmem_submit_bio for asynchronous flush

 drivers/nvdimm/nd_virtio.c   | 74 +++-
 drivers/nvdimm/pmem.c| 15 ++--
 drivers/nvdimm/region_devs.c |  4 +-
 drivers/nvdimm/virtio_pmem.c | 10 +
 drivers/nvdimm/virtio_pmem.h | 16 
 5 files changed, 98 insertions(+), 21 deletions(-)

-- 
2.25.1




[PATCH] virtio-pmem: add myself as virtio-pmem maintainer

2021-10-16 Thread Pankaj Gupta
From: Pankaj Gupta 

Adding myself as virtio-pmem maintainer and also adding virtualization
mailing list entry for virtio specific bits. Helps to get notified for
appropriate bug fixes & enhancements.

Signed-off-by: Pankaj Gupta 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1bd721478800..6a1ced092cfa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19941,6 +19941,13 @@ S: Maintained
 F: drivers/i2c/busses/i2c-virtio.c
 F: include/uapi/linux/virtio_i2c.h
 
+VIRTIO PMEM DRIVER
+M: Pankaj Gupta 
+L: virtualizat...@lists.linux-foundation.org
+S: Maintained
+F: drivers/nvdimm/virtio_pmem.c
+F: drivers/nvdimm/nd_virtio.c
+
 VIRTUAL BOX GUEST DEVICE DRIVER
 M: Hans de Goede 
 M: Arnd Bergmann 
-- 
2.25.1




Re: [RFC v2 0/2] virtio-pmem: Asynchronous flush

2021-10-16 Thread Pankaj Gupta
Friendly ping!

Thanks,
Pankaj

On Thu, 19 Aug 2021 at 13:08, Pankaj Gupta  wrote:
>
> Gentle ping.
>
> >
> >  Jeff reported preflush order issue with the existing implementation
> >  of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush
> >  for virtio pmem using work queue as done in md/RAID. This patch series
> >  intends to solve the preflush ordering issue and also makes the flush
> >  asynchronous for the submitting thread.
> >
> >  Submitting this patch series for review. Sorry, It took me long time to
> >  come back to this due to some personal reasons.
> >
> >  RFC v1 -> RFC v2
> >  - More testing and bug fix.
> >
> >  [1] https://marc.info/?l=linux-kernel=157446316409937=2
> >
> > Pankaj Gupta (2):
> >   virtio-pmem: Async virtio-pmem flush
> >   pmem: enable pmem_submit_bio for asynchronous flush
> >
> >  drivers/nvdimm/nd_virtio.c   | 72 
> >  drivers/nvdimm/pmem.c| 17 ++---
> >  drivers/nvdimm/virtio_pmem.c | 10 -
> >  drivers/nvdimm/virtio_pmem.h | 14 +++
> >  4 files changed, 91 insertions(+), 22 deletions(-)
> >
> > --
> > 2.25.1
> >



Re: [RFC v2 1/2] virtio-pmem: Async virtio-pmem flush

2021-08-27 Thread Pankaj Gupta
> > > > > > Implement asynchronous flush for virtio pmem using work queue
> > > > > > to solve the preflush ordering issue. Also, coalesce the flush
> > > > > > requests when a flush is already in process.
> > > > > >
> > > > > > Signed-off-by: Pankaj Gupta 
> > > > > > ---
> > > > > >  drivers/nvdimm/nd_virtio.c   | 72 
> > > > > > 
> > > > > >  drivers/nvdimm/virtio_pmem.c | 10 -
> > > > > >  drivers/nvdimm/virtio_pmem.h | 14 +++
> > > > > >  3 files changed, 79 insertions(+), 17 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > > > > > index 10351d5b49fa..61b655b583be 100644
> > > > > > --- a/drivers/nvdimm/nd_virtio.c
> > > > > > +++ b/drivers/nvdimm/nd_virtio.c
> > > > > > @@ -97,29 +97,69 @@ static int virtio_pmem_flush(struct nd_region 
> > > > > > *nd_region)
> > > > > > return err;
> > > > > >  };
> > > > > >
> > > > > > +static void submit_async_flush(struct work_struct *ws);
> > > > > > +
> > > > > >  /* The asynchronous flush callback function */
> > > > > >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > > > > >  {
> > > > > > -   /*
> > > > > > -* Create child bio for asynchronous flush and chain with
> > > > > > -* parent bio. Otherwise directly call nd_region flush.
> > > > > > +   /* queue asynchronous flush and coalesce the flush requests 
> > > > > > */
> > > > > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > > > > +   struct virtio_pmem *vpmem  = vdev->priv;
> > > > > > +   ktime_t req_start = ktime_get_boottime();
> > > > > > +
> > > > > > +   spin_lock_irq(&vpmem->lock);
> > > > > > +   /* flush requests wait until ongoing flush completes,
> > > > > > +* hence coalescing all the pending requests.
> > > > > >  */
> > > > > > -   if (bio && bio->bi_iter.bi_sector != -1) {
> > > > > > -   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > > > > -
> > > > > > -   if (!child)
> > > > > > -   return -ENOMEM;
> > > > > > -   bio_copy_dev(child, bio);
> > > > > > -   child->bi_opf = REQ_PREFLUSH;
> > > > > > -   child->bi_iter.bi_sector = -1;
> > > > > > -   bio_chain(child, bio);
> > > > > > -   submit_bio(child);
> > > > > > -   return 0;
> > > > > > +   wait_event_lock_irq(vpmem->sb_wait,
> > > > > > +   !vpmem->flush_bio ||
> > > > > > +   ktime_before(req_start, 
> > > > > > vpmem->prev_flush_start),
> > > > > > +   vpmem->lock);
> > > > > > +   /* new request after previous flush is completed */
> > > > > > +   if (ktime_after(req_start, vpmem->prev_flush_start)) {
> > > > > > +   WARN_ON(vpmem->flush_bio);
> > > > > > +   vpmem->flush_bio = bio;
> > > > > > +   bio = NULL;
> > > > > > +   }
> > > > >
> > > > > Why the dance with ->prev_flush_start vs just calling queue_work()
> > > > > again. queue_work() is naturally coalescing in that if the last work
> > > > > request has not started execution another queue attempt will be
> > > > > dropped.
> > > >
> > > > How parent flush request will know when corresponding flush is 
> > > > completed?
> > >
> > > The eventual bio_endio() is what signals upper layers that the flush
> > > completed...
> > >
> > >
> > > Hold on... it's been so long that I forgot that you are copying
> > > md_flush_request() here. It would help immensely if that was mentioned
> > > in the changelog and at a minimum have a comment in the code that this
> > > was copied from md. In fact it would be extra helpful if you
> >
> > My bad. I only mentioned this in the cover letter.
>
> Yeah, sorry about that. Having come back to this after so long I just
> decided to jump straight into the patches, but even if I had read that
> cover I still would have given the feedback that md_flush_request()
> heritage should also be noted with a comment in the code.

Sure.

Thanks,
Pankaj
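
The comment Dan asks for could be as small as a short block above
async_pmem_flush() noting the md heritage. Example wording (illustrative,
not from any posted patch):

/*
 * Flush coalescing below is modeled on md_flush_request() in
 * drivers/md/md.c: if a flush is already in flight, later REQ_PREFLUSH
 * bios wait for it and are completed by the same work item instead of
 * issuing another device flush.
 */
int async_pmem_flush(struct nd_region *nd_region, struct bio *bio);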



Re: [RFC v2 1/2] virtio-pmem: Async virtio-pmem flush

2021-08-25 Thread Pankaj Gupta
> > Hi Dan,
> >
> > Thank you for the review. Please see my reply inline.
> >
> > > > Implement asynchronous flush for virtio pmem using work queue
> > > > to solve the preflush ordering issue. Also, coalesce the flush
> > > > requests when a flush is already in process.
> > > >
> > > > Signed-off-by: Pankaj Gupta 
> > > > ---
> > > >  drivers/nvdimm/nd_virtio.c   | 72 
> > > >  drivers/nvdimm/virtio_pmem.c | 10 -
> > > >  drivers/nvdimm/virtio_pmem.h | 14 +++
> > > >  3 files changed, 79 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > > > index 10351d5b49fa..61b655b583be 100644
> > > > --- a/drivers/nvdimm/nd_virtio.c
> > > > +++ b/drivers/nvdimm/nd_virtio.c
> > > > @@ -97,29 +97,69 @@ static int virtio_pmem_flush(struct nd_region 
> > > > *nd_region)
> > > > return err;
> > > >  };
> > > >
> > > > +static void submit_async_flush(struct work_struct *ws);
> > > > +
> > > >  /* The asynchronous flush callback function */
> > > >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > > >  {
> > > > -   /*
> > > > -* Create child bio for asynchronous flush and chain with
> > > > -* parent bio. Otherwise directly call nd_region flush.
> > > > +   /* queue asynchronous flush and coalesce the flush requests */
> > > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > > +   struct virtio_pmem *vpmem  = vdev->priv;
> > > > +   ktime_t req_start = ktime_get_boottime();
> > > > +
> > > > +   spin_lock_irq(&vpmem->lock);
> > > > +   /* flush requests wait until ongoing flush completes,
> > > > +* hence coalescing all the pending requests.
> > > >  */
> > > > -   if (bio && bio->bi_iter.bi_sector != -1) {
> > > > -   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > > -
> > > > -   if (!child)
> > > > -   return -ENOMEM;
> > > > -   bio_copy_dev(child, bio);
> > > > -   child->bi_opf = REQ_PREFLUSH;
> > > > -   child->bi_iter.bi_sector = -1;
> > > > -   bio_chain(child, bio);
> > > > -   submit_bio(child);
> > > > -   return 0;
> > > > +   wait_event_lock_irq(vpmem->sb_wait,
> > > > +   !vpmem->flush_bio ||
> > > > +   ktime_before(req_start, 
> > > > vpmem->prev_flush_start),
> > > > +   vpmem->lock);
> > > > +   /* new request after previous flush is completed */
> > > > +   if (ktime_after(req_start, vpmem->prev_flush_start)) {
> > > > +   WARN_ON(vpmem->flush_bio);
> > > > +   vpmem->flush_bio = bio;
> > > > +   bio = NULL;
> > > > +   }
> > >
> > > Why the dance with ->prev_flush_start vs just calling queue_work()
> > > again. queue_work() is naturally coalescing in that if the last work
> > > request has not started execution another queue attempt will be
> > > dropped.
> >
> > How parent flush request will know when corresponding flush is completed?
>
> The eventual bio_endio() is what signals upper layers that the flush
> completed...
>
>
> Hold on... it's been so long that I forgot that you are copying
> md_flush_request() here. It would help immensely if that was mentioned
> in the changelog and at a minimum have a comment in the code that this
> was copied from md. In fact it would be extra helpful if you

My bad. I only mentioned this in the cover letter.

> refactored a common helper that bio based block drivers could share
> for implementing flush handling, but that can come later.

Sure.
>
> Let me go re-review this with respect to whether the md case is fully
> applicable here.

o.k.

Best regards,
Pankaj



Re: [RFC v2 1/2] virtio-pmem: Async virtio-pmem flush

2021-08-25 Thread Pankaj Gupta
Hi Dan,

Thank you for the review. Please see my reply inline.

> > Implement asynchronous flush for virtio pmem using work queue
> > to solve the preflush ordering issue. Also, coalesce the flush
> > requests when a flush is already in process.
> >
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  drivers/nvdimm/nd_virtio.c   | 72 
> >  drivers/nvdimm/virtio_pmem.c | 10 -
> >  drivers/nvdimm/virtio_pmem.h | 14 +++
> >  3 files changed, 79 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > index 10351d5b49fa..61b655b583be 100644
> > --- a/drivers/nvdimm/nd_virtio.c
> > +++ b/drivers/nvdimm/nd_virtio.c
> > @@ -97,29 +97,69 @@ static int virtio_pmem_flush(struct nd_region 
> > *nd_region)
> > return err;
> >  };
> >
> > +static void submit_async_flush(struct work_struct *ws);
> > +
> >  /* The asynchronous flush callback function */
> >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> >  {
> > -   /*
> > -* Create child bio for asynchronous flush and chain with
> > -* parent bio. Otherwise directly call nd_region flush.
> > +   /* queue asynchronous flush and coalesce the flush requests */
> > +   struct virtio_device *vdev = nd_region->provider_data;
> > +   struct virtio_pmem *vpmem  = vdev->priv;
> > +   ktime_t req_start = ktime_get_boottime();
> > +
> > +   spin_lock_irq(&vpmem->lock);
> > +   /* flush requests wait until ongoing flush completes,
> > +* hence coalescing all the pending requests.
> >  */
> > -   if (bio && bio->bi_iter.bi_sector != -1) {
> > -   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > -
> > -   if (!child)
> > -   return -ENOMEM;
> > -   bio_copy_dev(child, bio);
> > -   child->bi_opf = REQ_PREFLUSH;
> > -   child->bi_iter.bi_sector = -1;
> > -   bio_chain(child, bio);
> > -   submit_bio(child);
> > -   return 0;
> > +   wait_event_lock_irq(vpmem->sb_wait,
> > +   !vpmem->flush_bio ||
> > +   ktime_before(req_start, 
> > vpmem->prev_flush_start),
> > +   vpmem->lock);
> > +   /* new request after previous flush is completed */
> > +   if (ktime_after(req_start, vpmem->prev_flush_start)) {
> > +   WARN_ON(vpmem->flush_bio);
> > +   vpmem->flush_bio = bio;
> > +   bio = NULL;
> > +   }
>
> Why the dance with ->prev_flush_start vs just calling queue_work()
> again. queue_work() is naturally coalescing in that if the last work
> request has not started execution another queue attempt will be
> dropped.

How parent flush request will know when corresponding flush is completed?

>
> > > +   spin_unlock_irq(&vpmem->lock);
> > +
> > +   if (!bio) {
> > +   INIT_WORK(&vpmem->flush_work, submit_async_flush);
>
> I expect this only needs to be initialized once at driver init time.

yes, will fix this.
>
> > +   queue_work(vpmem->pmem_wq, &vpmem->flush_work);
> > +   return 1;
> > +   }
> > +
> > +   /* flush completed in other context while we waited */
> > +   if (bio && (bio->bi_opf & REQ_PREFLUSH)) {
> > +   bio->bi_opf &= ~REQ_PREFLUSH;
> > +   submit_bio(bio);
> > +   } else if (bio && (bio->bi_opf & REQ_FUA)) {
> > +   bio->bi_opf &= ~REQ_FUA;
> > +   bio_endio(bio);
>
> It's not clear to me how this happens, shouldn't all flush completions
> be driven from the work completion?

> Requests should progress after being notified by the ongoing flush
> completion event.

>
> > }
> > -   if (virtio_pmem_flush(nd_region))
> > -   return -EIO;
> >
> > return 0;
> >  };
> >  EXPORT_SYMBOL_GPL(async_pmem_flush);
> > +
> > +static void submit_async_flush(struct work_struct *ws)
> > +{
> > +   struct virtio_pmem *vpmem = container_of(ws, struct virtio_pmem, 
> > flush_work);
> > +   struct bio *bio = vpmem->flush_bio;
> > +
> > +   vpmem->start_flush = ktime_get_boottime();
> > +   bio->bi_status = 
> > errno_to_blk_status(virtio_pmem_flush(vpmem->nd_region));

Re: [RFC 0/2] virtio-pmem: Asynchronous flush

2021-03-11 Thread Pankaj Gupta
Hi David,

> >   Jeff reported preflush order issue with the existing implementation
> >   of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush
> >   for virtio pmem using work queue as done in md/RAID. This patch series
> >   intends to solve the preflush ordering issue and also makes the flush
> >   asynchronous from the submitting thread POV.
> >
> >   Submitting this patch series for feedback; it is a WIP. I have
> >   done basic testing and am currently doing more testing.
> >
> > Pankaj Gupta (2):
> >pmem: make nvdimm_flush asynchronous
> >virtio_pmem: Async virtio-pmem flush
> >
> >   drivers/nvdimm/nd_virtio.c   | 66 ++--
> >   drivers/nvdimm/pmem.c| 15 
> >   drivers/nvdimm/region_devs.c |  3 +-
> >   drivers/nvdimm/virtio_pmem.c |  9 +
> >   drivers/nvdimm/virtio_pmem.h | 12 +++
> >   5 files changed, 78 insertions(+), 27 deletions(-)
> >
> > [1] https://marc.info/?l=linux-kernel=157446316409937=2
> >
>
> Just wondering, was there any follow up of this or are we still waiting
> for feedback? :)

Thank you for bringing this up.

My apologies, I could not follow up on this. I have another version in my
local tree but could not post it, as I was not sure I had solved the problem
correctly. I will clean it up and post it for feedback as soon as I can.

P.S: Due to serious personal/family health issues I am not able to
devote much time
on this with other professional commitments. I feel bad that I have
this unfinished task.
Just in last one year things have not been stable for me & my family
and still not getting :(

Best regards,
Pankaj


Re: [PATCH v1] mm/memory_hotplug: MEMHP_MERGE_RESOURCE -> MHP_MERGE_RESOURCE

2021-02-02 Thread Pankaj Gupta
> Let's make "MEMHP_MERGE_RESOURCE" consistent with "MHP_NONE", "mhp_t" and
> "mhp_flags". As discussed recently [1], "mhp" is our internal
> acronym for memory hotplug now.
>
> [1] 
> https://lore.kernel.org/linux-mm/c37de2d0-28a1-4f7d-f944-cfd7d81c3...@redhat.com/
>
> Cc: Andrew Morton 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Wei Liu 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: Pankaj Gupta 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Anshuman Khandual 
> Cc: Wei Yang 
> Cc: linux-hyp...@vger.kernel.org
> Cc: virtualizat...@lists.linux-foundation.org
> Cc: xen-de...@lists.xenproject.org
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/hv/hv_balloon.c| 2 +-
>  drivers/virtio/virtio_mem.c| 2 +-
>  drivers/xen/balloon.c  | 2 +-
>  include/linux/memory_hotplug.h | 2 +-
>  mm/memory_hotplug.c| 2 +-
>  5 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index 8c471823a5af..2f776d78e3c1 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -726,7 +726,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned 
> long size,
>
> nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
> ret = add_memory(nid, PFN_PHYS((start_pfn)),
> -   (HA_CHUNK << PAGE_SHIFT), 
> MEMHP_MERGE_RESOURCE);
> +   (HA_CHUNK << PAGE_SHIFT), MHP_MERGE_RESOURCE);
>
> if (ret) {
> pr_err("hot_add memory failed error is %d\n", ret);
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 85a272c9978e..148bea39b09a 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -623,7 +623,7 @@ static int virtio_mem_add_memory(struct virtio_mem *vm, 
> uint64_t addr,
> /* Memory might get onlined immediately. */
> atomic64_add(size, &vm->offline_size);
> rc = add_memory_driver_managed(vm->nid, addr, size, vm->resource_name,
> -  MEMHP_MERGE_RESOURCE);
> +  MHP_MERGE_RESOURCE);
> if (rc) {
> atomic64_sub(size, &vm->offline_size);
> dev_warn(&vm->vdev->dev, "adding memory failed: %d\n", rc);
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index b57b2067ecbf..671c71245a7b 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -331,7 +331,7 @@ static enum bp_state reserve_additional_memory(void)
> mutex_unlock(&balloon_mutex);
> /* add_memory_resource() requires the device_hotplug lock */
> lock_device_hotplug();
> -   rc = add_memory_resource(nid, resource, MEMHP_MERGE_RESOURCE);
> +   rc = add_memory_resource(nid, resource, MHP_MERGE_RESOURCE);
> unlock_device_hotplug();
> mutex_lock(&balloon_mutex);
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 3d99de0db2dd..4b834f5d032e 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -53,7 +53,7 @@ typedef int __bitwise mhp_t;
>   * with this flag set, the resource pointer must no longer be used as it
>   * might be stale, or the resource might have changed.
>   */
> -#define MEMHP_MERGE_RESOURCE   ((__force mhp_t)BIT(0))
> +#define MHP_MERGE_RESOURCE ((__force mhp_t)BIT(0))
>
>  /*
>   * Extended parameters for memory hotplug:
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 710e469fb3a1..ae497e3ff77c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1153,7 +1153,7 @@ int __ref add_memory_resource(int nid, struct resource 
> *res, mhp_t mhp_flags)
>  * In case we're allowed to merge the resource, flag it and trigger
>  * merging now that adding succeeded.
>  */
> -   if (mhp_flags & MEMHP_MERGE_RESOURCE)
> +   if (mhp_flags & MHP_MERGE_RESOURCE)
> merge_system_ram_resource(res);
>
> /* online pages if requested */

 Reviewed-by: Pankaj Gupta 


Re: [PATCH v2 1/1] vhost scsi: alloc vhost_scsi with kvzalloc() to avoid delay

2021-02-02 Thread Pankaj Gupta
> The size of 'struct vhost_scsi' is order-10 (~2.3MB). kzalloc() may take a
> long time compacting memory pages by retrying multiple times when there is
> a lack of high-order pages. As a result, there is latency to create a VM
> (with vhost-scsi) or to hotadd vhost-scsi-based storage.
>
> The prior commit 595cb754983d ("vhost/scsi: use vmalloc for order-10
> allocation") prefers to fallback only when really needed, while this patch
> allocates with kvzalloc() with __GFP_NORETRY implicitly set to avoid
> retrying memory pages compact for multiple times.
>
> The __GFP_NORETRY is implicitly set if the size to allocate is more than
> PAGE_SIZE and when __GFP_RETRY_MAYFAIL is not explicitly set.
>
> Cc: Aruna Ramakrishna 
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---
> Changed since v1:
>   - To combine kzalloc() and vzalloc() as kvzalloc()
> (suggested by Jason Wang)
>
>  drivers/vhost/scsi.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4ce9f00ae10e..5de21ad4bd05 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1814,12 +1814,9 @@ static int vhost_scsi_open(struct inode *inode, struct 
> file *f)
> struct vhost_virtqueue **vqs;
> int r = -ENOMEM, i;
>
> -   vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | 
> __GFP_RETRY_MAYFAIL);
> -   if (!vs) {
> -   vs = vzalloc(sizeof(*vs));
> -   if (!vs)
> -   goto err_vs;
> -   }
> +   vs = kvzalloc(sizeof(*vs), GFP_KERNEL);
> +   if (!vs)
> +   goto err_vs;
>
> vqs = kmalloc_array(VHOST_SCSI_MAX_VQ, sizeof(*vqs), GFP_KERNEL);
> if (!vqs)

 Acked-by: Pankaj Gupta 
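
One property worth keeping in mind with this change (my note, not part of the
patch): kvzalloc() may hand back either kmalloc- or vmalloc-backed memory, so
the release path must use kvfree() (or an explicit is_vmalloc_addr() check)
rather than a plain kfree(). Function names below are invented for the sketch.

#include <linux/mm.h>
#include <linux/slab.h>

static void *vhost_scsi_alloc_sketch(size_t size)
{
	/* May return kmalloc- or vmalloc-backed memory depending on size. */
	return kvzalloc(size, GFP_KERNEL);
}

static void vhost_scsi_free_sketch(void *vs)
{
	kvfree(vs);	/* correct for both backings; plain kfree() is not */
}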


Re: [PATCH v4 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving

2020-12-10 Thread Pankaj Gupta
> The unit of NR_ANON_THPS is HPAGE_PMD_NR already. So it should inc/dec
> by one rather than nr_pages.
>
> Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
> Signed-off-by: Muchun Song 
> Acked-by: Michal Hocko 
> Acked-by: Johannes Weiner 
> Reviewed-by: Roman Gushchin 
> ---
>  mm/memcontrol.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b80328f52fb4..8818bf64d6fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5653,10 +5653,8 @@ static int mem_cgroup_move_account(struct page *page,
> __mod_lruvec_state(from_vec, NR_ANON_MAPPED, 
> -nr_pages);
> __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> if (PageTransHuge(page)) {
> -   __mod_lruvec_state(from_vec, NR_ANON_THPS,
> -  -nr_pages);
> -   __mod_lruvec_state(to_vec, NR_ANON_THPS,
> -  nr_pages);
> +   __dec_lruvec_state(from_vec, NR_ANON_THPS);
> +   __inc_lruvec_state(to_vec, NR_ANON_THPS);
> }
>
> }

Acked-by: Pankaj Gupta 


Re: [PATCH v2 1/8] mm/gup: perform check_dax_vmas only when FS_DAX is enabled

2020-12-09 Thread Pankaj Gupta
> There is no need to check_dax_vmas() and run through the npage loop of
> pinned pages if FS_DAX is not enabled.
>
> Add a stub check_dax_vmas() function for no-FS_DAX case.
>
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: John Hubbard 
> ---
>  mm/gup.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609..cdb8b9eeb016 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1568,6 +1568,7 @@ struct page *get_dump_page(unsigned long addr)
>  #endif /* CONFIG_ELF_CORE */
>
>  #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
> +#ifdef CONFIG_FS_DAX
>  static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
>  {
> long i;
> @@ -1586,6 +1587,12 @@ static bool check_dax_vmas(struct vm_area_struct 
> **vmas, long nr_pages)
> }
> return false;
>  }
> +#else
> +static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> +{
> +   return false;
> +}
> +#endif
>
>  #ifdef CONFIG_CMA
>  static long check_and_migrate_cma_pages(struct mm_struct *mm,

Reviewed-by: Pankaj Gupta 
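
An alternative to the explicit #ifdef stub (a sketch of another option, not
what the posted patch does) is to rely on IS_ENABLED() and let dead-code
elimination drop the loop when CONFIG_FS_DAX is off. The simplified body
below skips the repeated-vma optimisation of the original.

#include <linux/fs.h>
#include <linux/mm.h>

static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
{
	long i;

	if (!IS_ENABLED(CONFIG_FS_DAX))
		return false;

	for (i = 0; i < nr_pages; i++)
		if (vma_is_fsdax(vmas[i]))
			return true;

	return false;
}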


Re: [PATCH v3 7/7] mm: memcontrol: make the slab calculation consistent

2020-12-07 Thread Pankaj Gupta
> Although the ratio of the slab is one, we also should read the ratio
> from the related memory_stats instead of hard-coding. And the local
> variable of size is already the value of slab_unreclaimable. So we
> do not need to read again.
>
> We can drop the ratio in struct memory_stat. This can make the code
> clean and simple. And get rid of the awkward mix of static and runtime
> initialization of the memory_stats table.
>
> Signed-off-by: Muchun Song 
> ---
>  mm/memcontrol.c | 112 
> 
>  1 file changed, 73 insertions(+), 39 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a40797a27f87..841ea37cc123 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1511,49 +1511,78 @@ static bool mem_cgroup_wait_acct_move(struct 
> mem_cgroup *memcg)
>
>  struct memory_stat {
> const char *name;
> -   unsigned int ratio;
> unsigned int idx;
>  };
>
>  static const struct memory_stat memory_stats[] = {
> -   { "anon", PAGE_SIZE, NR_ANON_MAPPED },
> -   { "file", PAGE_SIZE, NR_FILE_PAGES },
> -   { "kernel_stack", 1024, NR_KERNEL_STACK_KB },
> -   { "pagetables", PAGE_SIZE, NR_PAGETABLE },
> -   { "percpu", 1, MEMCG_PERCPU_B },
> -   { "sock", PAGE_SIZE, MEMCG_SOCK },
> -   { "shmem", PAGE_SIZE, NR_SHMEM },
> -   { "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
> -   { "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
> -   { "file_writeback", PAGE_SIZE, NR_WRITEBACK },
> +   { "anon",   NR_ANON_MAPPED  },
> +   { "file",   NR_FILE_PAGES   },
> +   { "kernel_stack",   NR_KERNEL_STACK_KB  },
> +   { "pagetables", NR_PAGETABLE},
> +   { "percpu", MEMCG_PERCPU_B  },
> +   { "sock",   MEMCG_SOCK  },
> +   { "shmem",  NR_SHMEM},
> +   { "file_mapped",NR_FILE_MAPPED  },
> +   { "file_dirty", NR_FILE_DIRTY   },
> +   { "file_writeback", NR_WRITEBACK},
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -   { "anon_thp", PAGE_SIZE, NR_ANON_THPS },
> -   { "file_thp", PAGE_SIZE, NR_FILE_THPS },
> -   { "shmem_thp", PAGE_SIZE, NR_SHMEM_THPS },
> +   { "anon_thp",   NR_ANON_THPS},
> +   { "file_thp",   NR_FILE_THPS},
> +   { "shmem_thp",  NR_SHMEM_THPS   },
>  #endif
> -   { "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
> -   { "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
> -   { "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
> -   { "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
> -   { "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
> -
> -   /*
> -* Note: The slab_reclaimable and slab_unreclaimable must be
> -* together and slab_reclaimable must be in front.
> -*/
> -   { "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
> -   { "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
> +   { "inactive_anon",  NR_INACTIVE_ANON},
> +   { "active_anon",NR_ACTIVE_ANON  },
> +   { "inactive_file",  NR_INACTIVE_FILE},
> +   { "active_file",NR_ACTIVE_FILE  },
> +   { "unevictable",NR_UNEVICTABLE  },
> +   { "slab_reclaimable",   NR_SLAB_RECLAIMABLE_B   },
> +   { "slab_unreclaimable", NR_SLAB_UNRECLAIMABLE_B },
>
> /* The memory events */
> -   { "workingset_refault_anon", 1, WORKINGSET_REFAULT_ANON },
> -   { "workingset_refault_file", 1, WORKINGSET_REFAULT_FILE },
> -   { "workingset_activate_anon", 1, WORKINGSET_ACTIVATE_ANON },
> -   { "workingset_activate_file", 1, WORKINGSET_ACTIVATE_FILE },
> -   { "workingset_restore_anon", 1, WORKINGSET_RESTORE_ANON },
> -   { "workingset_restore_file", 1, WORKINGSET_RESTORE_FILE },
> -   { "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM },
> +   { "workingset_refault_anon",WORKINGSET_REFAULT_ANON },
> +   { "workingset_refault_file",WORKINGSET_REFAULT_FILE },
> +   { "workingset_activate_anon",   WORKINGSET_ACTIVATE_ANON},
> +   { "workingset_activate_file",   WORKINGSET_ACTIVATE_FILE},
> +   { "workingset_restore_anon",WORKINGSET_RESTORE_ANON },
> +   { "workingset_restore_file",WORKINGSET_RESTORE_FILE },
> +   { "workingset_nodereclaim", WORKINGSET_NODERECLAIM  },
>  };
>
> +/* Translate stat items to the correct unit for memory.stat output */
> +static int 
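
(The quoted patch is truncated here by the archive. For orientation only, a rough sketch of the kind of unit-translation helper the changelog describes; the function name and exact case list are my assumptions, not necessarily the merged code:)

/* Hypothetical sketch: map a stat item to the unit (bytes per counted
 * item) used for memory.stat output, replacing the per-entry ->ratio.
 * Byte-granular counters and events use 1, kernel stacks are in KiB,
 * everything else is in pages.
 */
static int memcg_page_state_unit(int item)
{
	switch (item) {
	case MEMCG_PERCPU_B:
	case NR_SLAB_RECLAIMABLE_B:
	case NR_SLAB_UNRECLAIMABLE_B:
	case WORKINGSET_REFAULT_ANON:
	case WORKINGSET_REFAULT_FILE:
	case WORKINGSET_ACTIVATE_ANON:
	case WORKINGSET_ACTIVATE_FILE:
	case WORKINGSET_RESTORE_ANON:
	case WORKINGSET_RESTORE_FILE:
	case WORKINGSET_NODERECLAIM:
		return 1;
	case NR_KERNEL_STACK_KB:
		return SZ_1K;
	default:
		return PAGE_SIZE;
	}
}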

Re: [PATCH] mm/shmem.c: make shmem_mapping() inline

2020-11-13 Thread Pankaj Gupta
> inline the shmem_mapping(), and use shmem_mapping()
> instead of 'inode->i_mapping->a_ops == &shmem_aops'
> in shmem_evict_inode().
>
> Signed-off-by: Hui Su 
> ---
>  include/linux/shmem_fs.h | 2 +-
>  mm/shmem.c   | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index a5a5d1d4d7b1..154a16fe7fd5 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -67,7 +67,7 @@ extern unsigned long shmem_get_unmapped_area(struct file *, 
> unsigned long addr,
> unsigned long len, unsigned long pgoff, unsigned long flags);
>  extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
>  #ifdef CONFIG_SHMEM
> -extern bool shmem_mapping(struct address_space *mapping);
> +extern inline bool shmem_mapping(struct address_space *mapping);
>  #else
>  static inline bool shmem_mapping(struct address_space *mapping)
>  {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 537c137698f8..7395d8e8226a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1152,7 +1152,7 @@ static void shmem_evict_inode(struct inode *inode)
> struct shmem_inode_info *info = SHMEM_I(inode);
> struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>
> -   if (inode->i_mapping->a_ops == &shmem_aops) {
> +   if (shmem_mapping(inode->i_mapping)) {
> shmem_unacct_size(info->flags, inode->i_size);
> inode->i_size = 0;
> shmem_truncate_range(inode, 0, (loff_t)-1);
> @@ -2352,7 +2352,7 @@ static struct inode *shmem_get_inode(struct super_block 
> *sb, const struct inode
> return inode;
>  }
>
> -bool shmem_mapping(struct address_space *mapping)
> +inline bool shmem_mapping(struct address_space *mapping)
>  {
> return mapping->a_ops == &shmem_aops;
>  }

Reviewed-by: Pankaj Gupta 


Re: [PATCH] mm/page_counter: use page_counter_read in page_counter_set_max

2020-11-13 Thread Pankaj Gupta
> use page_counter_read() in page_counter_set_max().
>
> Signed-off-by: Hui Su 
> ---
>  mm/page_counter.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b24a60b28bb0..c6860f51b6c6 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -183,14 +183,14 @@ int page_counter_set_max(struct page_counter *counter, 
> unsigned long nr_pages)
>  * the limit, so if it sees the old limit, we see the
>  * modified counter and retry.
>  */
> -   usage = atomic_long_read(&counter->usage);
> +   usage = page_counter_read(counter);
>
> if (usage > nr_pages)
> return -EBUSY;
>
> old = xchg(&counter->max, nr_pages);
>
> -   if (atomic_long_read(&counter->usage) <= usage)
> +   if (page_counter_read(counter) <= usage)
>     return 0;
>
> counter->max = old;

Reviewed-by: Pankaj Gupta 


Re: [PATCH v3 3/7] mm, page_alloc: remove setup_pageset()

2020-11-11 Thread Pankaj Gupta
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
>
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to the proper
> values.
>
> No functional change.
>
> Signed-off-by: Vlastimil Babka 
> Reviewed-by: David Hildenbrand 
> Reviewed-by: Oscar Salvador 
> Acked-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2fa432762908..5a8ec7d94884 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5911,7 +5911,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p);
> +static void pageset_init(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>
> @@ -5979,7 +5979,7 @@ build_all_zonelists_init(void)
>  * (a chicken-egg dilemma).
>  */
> for_each_possible_cpu(cpu)
> -   setup_pageset(&per_cpu(boot_pageset, cpu));
> +   pageset_init(&per_cpu(boot_pageset, cpu));
>
> mminit_verify_zonelist();
> cpuset_init_current_mems_allowed();
> @@ -6298,12 +6298,15 @@ static void pageset_init(struct per_cpu_pageset *p)
> pcp = &p->pcp;
> for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
> INIT_LIST_HEAD(&pcp->lists[migratetype]);
> -}
>
> -static void setup_pageset(struct per_cpu_pageset *p)
> -{
> -   pageset_init(p);
> -   pageset_update(&p->pcp, 0, 1);
> +   /*
> +* Set batch and high values safe for a boot pageset. A true percpu
> +* pageset's initialization will update them subsequently. Here we 
> don't
> +* need to be as careful as pageset_update() as nobody can access the
> +* pageset yet.
> +*/
> +   pcp->high = 0;
> +   pcp->batch = 1;
>  }

 Acked-by: Pankaj Gupta 


Re: [PATCH v3 2/7] mm, page_alloc: calculate pageset high and batch once per zone

2020-11-11 Thread Pankaj Gupta
> We currently call pageset_set_high_and_batch() for each possible cpu, which
> repeats the same calculations of high and batch values.
>
> Instead call the function just once per zone, and make it apply the calculated
> values to all per-cpu pagesets of the zone.
>
> This also allows removing the zone_pageset_init() and __zone_pcp_update()
> wrappers.
>
> No functional change.
>
> Signed-off-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 
> Acked-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 42 ++
>  1 file changed, 18 insertions(+), 24 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3f1c57344e73..2fa432762908 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6307,13 +6307,14 @@ static void setup_pageset(struct per_cpu_pageset *p)
>  }
>
>  /*
> - * Calculate and set new high and batch values for given per-cpu pageset of a
> + * Calculate and set new high and batch values for all per-cpu pagesets of a
>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>   */
> -static void pageset_set_high_and_batch(struct zone *zone,
> -  struct per_cpu_pageset *p)
> +static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
> unsigned long new_high, new_batch;
> +   struct per_cpu_pageset *p;
> +   int cpu;
>
> if (percpu_pagelist_fraction) {
> new_high = zone_managed_pages(zone) / 
> percpu_pagelist_fraction;
> @@ -6325,23 +6326,25 @@ static void pageset_set_high_and_batch(struct zone 
> *zone,
> new_high = 6 * new_batch;
> new_batch = max(1UL, 1 * new_batch);
> }
> -   pageset_update(&p->pcp, new_high, new_batch);
> -}
> -
> -static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -{
> -   struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
>
> -   pageset_init(pcp);
> -   pageset_set_high_and_batch(zone, pcp);
> +   for_each_possible_cpu(cpu) {
> +   p = per_cpu_ptr(zone->pageset, cpu);
> +   pageset_update(&p->pcp, new_high, new_batch);
> +   }
>  }
>
>  void __meminit setup_zone_pageset(struct zone *zone)
>  {
> +   struct per_cpu_pageset *p;
> int cpu;
> +
> zone->pageset = alloc_percpu(struct per_cpu_pageset);
> -   for_each_possible_cpu(cpu)
> -   zone_pageset_init(zone, cpu);
> +   for_each_possible_cpu(cpu) {
> +   p = per_cpu_ptr(zone->pageset, cpu);
> +   pageset_init(p);
> +   }
> +
> +   zone_set_pageset_high_and_batch(zone);
>  }
>
>  /*
> @@ -8080,15 +8083,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct 
> ctl_table *table, int write,
> return 0;
>  }
>
> -static void __zone_pcp_update(struct zone *zone)
> -{
> -   unsigned int cpu;
> -
> -   for_each_possible_cpu(cpu)
> -   pageset_set_high_and_batch(zone,
> -   per_cpu_ptr(zone->pageset, cpu));
> -}
> -
>  /*
>   * percpu_pagelist_fraction - changes the pcp->high for each zone on each
>   * cpu.  It is the fraction of total pages in each zone that a hot per cpu
> @@ -8121,7 +8115,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct 
> ctl_table *table, int write,
> goto out;
>
> for_each_populated_zone(zone)
> -   __zone_pcp_update(zone);
> +   zone_set_pageset_high_and_batch(zone);
>  out:
> mutex_unlock(&pcp_batch_high_lock);
> return ret;
> @@ -8746,7 +8740,7 @@ EXPORT_SYMBOL(free_contig_range);
>  void __meminit zone_pcp_update(struct zone *zone)
>  {
> mutex_lock(&pcp_batch_high_lock);
> -   __zone_pcp_update(zone);
> +   zone_set_pageset_high_and_batch(zone);
> mutex_unlock(&pcp_batch_high_lock);
>  }

 Acked-by: Pankaj Gupta 


Re: [PATCH v3 1/7] mm, page_alloc: clean up pageset high and batch update

2020-11-11 Thread Pankaj Gupta
> The updates to pcplists' high and batch values are handled by multiple
> functions that make the calculations hard to follow. Consolidate everything
> to pageset_set_high_and_batch() and remove pageset_set_batch() and
> pageset_set_high() wrappers.
>
> The only special case using one of the removed wrappers was:
> build_all_zonelists_init()
>   setup_pageset()
> pageset_set_batch()
> which was hardcoding batch as 0, so we can just open-code a call to
> pageset_update() with constant parameters instead.
>
> No functional change.
>
> Signed-off-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 
> Acked-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 49 -
>  1 file changed, 20 insertions(+), 29 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7560240fcf7d..3f1c57344e73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5911,7 +5911,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static void setup_pageset(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>
> @@ -5979,7 +5979,7 @@ build_all_zonelists_init(void)
>  * (a chicken-egg dilemma).
>  */
> for_each_possible_cpu(cpu)
> -   setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> +   setup_pageset(&per_cpu(boot_pageset, cpu));
>
> mminit_verify_zonelist();
> cpuset_init_current_mems_allowed();
> @@ -6288,12 +6288,6 @@ static void pageset_update(struct per_cpu_pages *pcp, 
> unsigned long high,
> pcp->batch = batch;
>  }
>
> -/* a companion to pageset_set_high() */
> -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> -{
> -   pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> -}
> -
>  static void pageset_init(struct per_cpu_pageset *p)
>  {
> struct per_cpu_pages *pcp;
> @@ -6306,35 +6300,32 @@ static void pageset_init(struct per_cpu_pageset *p)
> INIT_LIST_HEAD(&pcp->lists[migratetype]);
>  }
>
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> +static void setup_pageset(struct per_cpu_pageset *p)
>  {
> pageset_init(p);
> -   pageset_set_batch(p, batch);
> +   pageset_update(&p->pcp, 0, 1);
>  }
>
>  /*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> + * Calculate and set new high and batch values for given per-cpu pageset of a
> + * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>   */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> -   unsigned long high)
> -{
> -   unsigned long batch = max(1UL, high / 4);
> -   if ((high / 4) > (PAGE_SHIFT * 8))
> -   batch = PAGE_SHIFT * 8;
> -
> -   pageset_update(&p->pcp, high, batch);
> -}
> -
>  static void pageset_set_high_and_batch(struct zone *zone,
> -  struct per_cpu_pageset *pcp)
> +  struct per_cpu_pageset *p)
>  {
> -   if (percpu_pagelist_fraction)
> -   pageset_set_high(pcp,
> -   (zone_managed_pages(zone) /
> -   percpu_pagelist_fraction));
> -   else
> -   pageset_set_batch(pcp, zone_batchsize(zone));
> +   unsigned long new_high, new_batch;
> +
> +   if (percpu_pagelist_fraction) {
> +   new_high = zone_managed_pages(zone) / 
> percpu_pagelist_fraction;
> +   new_batch = max(1UL, new_high / 4);
> +   if ((new_high / 4) > (PAGE_SHIFT * 8))
> +   new_batch = PAGE_SHIFT * 8;
> +   } else {
> +   new_batch = zone_batchsize(zone);
> +   new_high = 6 * new_batch;
> +   new_batch = max(1UL, 1 * new_batch);
> +   }
> +   pageset_update(&p->pcp, new_high, new_batch);
>  }
>
>  static void __meminit zone_pageset_init(struct zone *zone, int cpu)
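
A quick worked example of the consolidated calculation (my numbers, assuming 4 KiB pages so PAGE_SHIFT == 12, and a zone with 262144 managed pages):

/* percpu_pagelist_fraction == 8:
 *   new_high  = 262144 / 8  = 32768 pages kept on the pcplist
 *   new_high / 4            = 8192, which exceeds PAGE_SHIFT * 8 (96),
 *   so new_batch is clamped = 96 pages per refill/drain batch
 * percpu_pagelist_fraction == 0 falls back to zone_batchsize(zone) with
 * new_high = 6 * new_batch, exactly as the removed pageset_set_batch() did.
 */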

Looks good to me.
Acked-by: Pankaj Gupta 


Re: [PATCH 1/3] md: improve variable names in md_flush_request()

2020-11-10 Thread Pankaj Gupta
Hi Paul,

> >   This patch improves readability by using better variable names
> >   in flush request coalescing logic.
>
> Please do not indent the commit message.

o.k

>
> > Signed-off-by: Pankaj Gupta 
> > ---
> >   drivers/md/md.c | 8 
> >   drivers/md/md.h | 6 +++---
> >   2 files changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 98bac4f304ae..167c80f98533 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -639,7 +639,7 @@ static void md_submit_flush_data(struct work_struct *ws)
> >* could wait for this and below md_handle_request could wait for 
> > those
> >* bios because of suspend check
> >*/
> > - mddev->last_flush = mddev->start_flush;
> > + mddev->prev_flush_start = mddev->start_flush;
> >   mddev->flush_bio = NULL;
> >   wake_up(&mddev->sb_wait);
> >
> > @@ -660,13 +660,13 @@ static void md_submit_flush_data(struct work_struct 
> > *ws)
> >*/
> >   bool md_flush_request(struct mddev *mddev, struct bio *bio)
> >   {
> > - ktime_t start = ktime_get_boottime();
> > + ktime_t req_start = ktime_get_boottime();
> >   spin_lock_irq(&mddev->lock);
> >   wait_event_lock_irq(mddev->sb_wait,
> >   !mddev->flush_bio ||
> > - ktime_after(mddev->last_flush, start),
> > + ktime_after(mddev->prev_flush_start, req_start),
> >   mddev->lock);
> > - if (!ktime_after(mddev->last_flush, start)) {
> > + if (!ktime_after(mddev->prev_flush_start, req_start)) {
> >   WARN_ON(mddev->flush_bio);
> >   mddev->flush_bio = bio;
> >   bio = NULL;
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index ccfb69868c2e..2292c847f9dd 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -495,9 +495,9 @@ struct mddev {
> >*/
> >   struct bio *flush_bio;
> >   atomic_t flush_pending;
> > - ktime_t start_flush, last_flush; /* last_flush is when the last 
> > completed
> > -   * flush was started.
> > -   */
> > + ktime_t start_flush, prev_flush_start; /* prev_flush_start is when 
> > the previous completed
> > + * flush was started.
> > + */
>
> With the new variable name, the comment could even be removed. ;-)
>
> >   struct work_struct flush_work;
> >   struct work_struct event_work;  /* used by dm to report failure event 
> > */
> >   mempool_t *serial_info_pool;
>
> Reviewed-by: Paul Menzel 

Thanks,
Pankaj
>
>
> Kind regards,
>
> Paul


[PATCH 1/3] md: improve variable names in md_flush_request()

2020-11-10 Thread Pankaj Gupta
From: Pankaj Gupta 

 This patch improves readability by using better variable names
 in flush request coalescing logic.

Signed-off-by: Pankaj Gupta 
---
 drivers/md/md.c | 8 
 drivers/md/md.h | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..167c80f98533 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -639,7 +639,7 @@ static void md_submit_flush_data(struct work_struct *ws)
 * could wait for this and below md_handle_request could wait for those
 * bios because of suspend check
 */
-   mddev->last_flush = mddev->start_flush;
+   mddev->prev_flush_start = mddev->start_flush;
mddev->flush_bio = NULL;
wake_up(&mddev->sb_wait);
 
@@ -660,13 +660,13 @@ static void md_submit_flush_data(struct work_struct *ws)
  */
 bool md_flush_request(struct mddev *mddev, struct bio *bio)
 {
-   ktime_t start = ktime_get_boottime();
+   ktime_t req_start = ktime_get_boottime();
spin_lock_irq(&mddev->lock);
wait_event_lock_irq(mddev->sb_wait,
!mddev->flush_bio ||
-   ktime_after(mddev->last_flush, start),
+   ktime_after(mddev->prev_flush_start, req_start),
mddev->lock);
-   if (!ktime_after(mddev->last_flush, start)) {
+   if (!ktime_after(mddev->prev_flush_start, req_start)) {
WARN_ON(mddev->flush_bio);
mddev->flush_bio = bio;
bio = NULL;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ccfb69868c2e..2292c847f9dd 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -495,9 +495,9 @@ struct mddev {
 */
struct bio *flush_bio;
atomic_t flush_pending;
-   ktime_t start_flush, last_flush; /* last_flush is when the last 
completed
- * flush was started.
- */
+   ktime_t start_flush, prev_flush_start; /* prev_flush_start is when the 
previous completed
+   * flush was started.
+   */
struct work_struct flush_work;
struct work_struct event_work;  /* used by dm to report failure event */
mempool_t *serial_info_pool;
-- 
2.20.1



[PATCH 2/3] md: add comments in md_flush_request()

2020-11-10 Thread Pankaj Gupta
From: Pankaj Gupta 

 Request coalescing logic depends on the flush time being
 updated in another context. This patch adds comments to
 make the code flow easier to follow.

Signed-off-by: Pankaj Gupta 
---
 drivers/md/md.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 167c80f98533..a330e61876e0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -662,10 +662,14 @@ bool md_flush_request(struct mddev *mddev, struct bio 
*bio)
 {
ktime_t req_start = ktime_get_boottime();
spin_lock_irq(&mddev->lock);
+   /* flush requests wait until ongoing flush completes,
+* hence coalescing all the pending requests.
+*/
wait_event_lock_irq(mddev->sb_wait,
!mddev->flush_bio ||
ktime_after(mddev->prev_flush_start, req_start),
mddev->lock);
+   /* new request after previous flush is completed */
if (!ktime_after(mddev->prev_flush_start, req_start)) {
WARN_ON(mddev->flush_bio);
mddev->flush_bio = bio;
-- 
2.20.1
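
To make the flow these comments describe concrete, here is a compact model of the coalescing (my own summary, not driver code; names follow the patches):

/*
 * req_start        = when this flush request arrived
 * prev_flush_start = when the most recently *completed* flush started
 *
 * md_flush_request() waits until either no flush is in progress
 * (!mddev->flush_bio) or a flush that started after req_start has
 * completed.  In the latter case everything this request needed
 * flushed is already on stable media, so the bio can be finished
 * without issuing another flush; otherwise this bio becomes the new
 * mddev->flush_bio and a fresh flush is submitted.
 */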



[PATCH 0/3] md: code cleanups

2020-11-10 Thread Pankaj Gupta
From: Pankaj Gupta 

This patch series does some cleanups during my attempt to understand
the code. 

Pankaj Gupta (3):
  md: improve variable names in md_flush_request()
  md: add comments in md_flush_request()
  md: use current request time as base for ktime comparisons

 drivers/md/md.c | 12 
 drivers/md/md.h |  6 +++---
 2 files changed, 11 insertions(+), 7 deletions(-)

-- 
2.20.1



[PATCH 3/3] md: use current request time as base for ktime comparisons

2020-11-10 Thread Pankaj Gupta
From: Pankaj Gupta 

Request coalescing logic uses 'prev_flush_start' as the base for
comparing the current request start time. 'prev_flush_start' is
updated in another context.

This patch changes the base of the ktime comparison to
'req_start' for better code readability.

Signed-off-by: Pankaj Gupta 
---
 drivers/md/md.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a330e61876e0..217b1e3312cb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -667,10 +667,10 @@ bool md_flush_request(struct mddev *mddev, struct bio 
*bio)
 */
wait_event_lock_irq(mddev->sb_wait,
!mddev->flush_bio ||
-   ktime_after(mddev->prev_flush_start, req_start),
+   ktime_before(req_start, mddev->prev_flush_start),
mddev->lock);
/* new request after previous flush is completed */
-   if (!ktime_after(mddev->prev_flush_start, req_start)) {
+   if (ktime_after(req_start, mddev->prev_flush_start)) {
WARN_ON(mddev->flush_bio);
mddev->flush_bio = bio;
bio = NULL;
-- 
2.20.1



Re: [PATCH] KVM: x86: use positive error values for msr emulation that causes #GP

2020-11-05 Thread Pankaj Gupta
> > This looks good to me. This should solve "-EPERM" return by "__kvm_set_msr" 
> > .
> >
> > A question I have, In the case of "kvm_emulate_rdmsr()",  for "r" we
> > are injecting #GP.
> > Is there any possibility of this check to be hit and still result in #GP?
>
> When I wrote this patch series I assumed that msr reads usually don't have
> side effects so they shouldn't fail, and fixed only the msr write code path
> to deal with negative errors. Now that you put this in this light,
> I do think that you are right and I should have added code for both msr reads 
> and writes
> especially to catch cases in which negative errors are returned by mistake
> like this one (my mistake in this case since my patch series was merged
> after the userspace msrs patch series).
>
> What do you think?
>
> I can prepare a separate patch for this, which should go to the next
> kernel version since this doesn't fix a regression.

A patch on top should be okay, I think.

Thanks,
Pankaj


Re: [PATCH] KVM: x86: use positive error values for msr emulation that causes #GP

2020-11-04 Thread Pankaj Gupta
NKNOWN;
> -   case -EPERM:
> +   case KVM_MSR_RET_FILTERED:
> return KVM_MSR_EXIT_REASON_FILTER;
> default:
> return KVM_MSR_EXIT_REASON_INVAL;
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 3900ab0c6004d..e7ca622a468f5 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -376,7 +376,13 @@ int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int 
> r,
>  int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva);
>  bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
>
> -#define  KVM_MSR_RET_INVALID  2
> +/*
> + * Internal error codes that are used to indicate that MSR emulation 
> encountered
> + * an error that should result in #GP in the guest, unless userspace
> + * handles it.
> + */
> +#define  KVM_MSR_RET_INVALID   2   /* in-kernel MSR emulation #GP 
> condition */
> +#define  KVM_MSR_RET_FILTERED  3   /* #GP due to userspace MSR filter */
>
>  #define __cr4_reserved_bits(__cpu_has, __c) \
>  ({  \

This looks good to me. This should solve the "-EPERM" return by "__kvm_set_msr".

A question I have: in the case of "kvm_emulate_rdmsr()", for "r" we
are injecting a #GP.
Is there any possibility of this check being hit and still resulting in a #GP?

int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu)
{
	...
	r = kvm_get_msr(vcpu, ecx, &data);

	/* MSR read failed? See if we should ask user space */
	if (r && kvm_get_msr_user_space(vcpu, ecx, r)) {
		/* Bounce to user space */
		return 0;
	}

	/* MSR read failed? Inject a #GP */
	if (r) {
		trace_kvm_msr_read_ex(ecx);
		kvm_inject_gp(vcpu, 0);
		return 1;
	}
	...
}

Apart from the question above, feel free to add:
Reviewed-by: Pankaj Gupta 
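
For anyone else reading along, a small sketch of the convention the patch establishes (my illustration; only the KVM_MSR_RET_* constants come from the patch, the helper below is hypothetical):

/* Negative values are ordinary -errno results from the emulation path
 * and must not be mistaken for MSR outcomes; positive KVM_MSR_RET_*
 * codes mean "this access should #GP unless a userspace exit handles it".
 */
static bool msr_access_should_gp(int r)
{
	if (r < 0)	/* e.g. -ENOMEM: internal error, not a guest-visible #GP */
		return false;
	return r == KVM_MSR_RET_INVALID || r == KVM_MSR_RET_FILTERED;
}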


Re: [PATCH v2] mm/list_lru: optimize condition of exiting the loop

2020-11-02 Thread Pankaj Gupta
> In list_lru_walk(), nr_to_walk type is 'unsigned long',
> so nr_to_walk won't be '< 0'.
>
> In list_lru_walk_node(), nr_to_walk type is 'unsigned long',
> so *nr_to_walk won't be '< 0' too.
>
> We can use '!nr_to_walk' instead of 'nr_to_walk <= 0', which
> is more precise.
>
> Signed-off-by: Hui Su 
> ---
>  include/linux/list_lru.h | 2 +-
>  mm/list_lru.c| 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 9dcaa3e582c9..b7bc4a2636b9 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -214,7 +214,7 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb 
> isolate,
> for_each_node_state(nid, N_NORMAL_MEMORY) {
> isolated += list_lru_walk_node(lru, nid, isolate,
>cb_arg, &nr_to_walk);
> -   if (nr_to_walk <= 0)
> +   if (!nr_to_walk)
> break;
> }
> return isolated;
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 5aa6e44bc2ae..35be4de9fd77 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -294,7 +294,7 @@ unsigned long list_lru_walk_node(struct list_lru *lru, 
> int nid,
>
> isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
>   nr_to_walk);
> -   if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> +   if (*nr_to_walk && list_lru_memcg_aware(lru)) {
> for_each_memcg_cache_index(memcg_idx) {
> struct list_lru_node *nlru = &lru->node[nid];
>
> @@ -304,7 +304,7 @@ unsigned long list_lru_walk_node(struct list_lru *lru, 
> int nid,
> nr_to_walk);
> spin_unlock(&nlru->lock);
>
> -   if (*nr_to_walk <= 0)
> +   if (!*nr_to_walk)
> break;
> }
> }

Acked-by: Pankaj Gupta 
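
A tiny standalone illustration of why the '<= 0' form adds nothing for unsigned types (my example, not from the patch):

#include <stdio.h>

int main(void)
{
	unsigned long nr_to_walk = 0;

	/* For an unsigned type these three tests are equivalent; "<= 0"
	 * merely hints at a negative value that can never occur.
	 */
	printf("%d %d %d\n", nr_to_walk <= 0, nr_to_walk == 0, !nr_to_walk);
	return 0;
}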


Re: [PATCH v2 5/5] mm/memory_hotplug: update comment regarding zone shuffling

2020-10-21 Thread Pankaj Gupta
> As we no longer shuffle via generic_online_page() and when undoing
> isolation, we can simplify the comment.
>
> We now effectively shuffle only once (properly) when onlining new
> memory.
>
> Reviewed-by: Wei Yang 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 03a00cb68bf7..b44d4c7ba73b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -858,13 +858,10 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages,
> undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
>
> /*
> -* When exposing larger, physically contiguous memory areas to the
> -* buddy, shuffling in the buddy (when freeing onlined pages, putting
> -* them either to the head or the tail of the freelist) is only 
> helpful
> -* for maintaining the shuffle, but not for creating the initial
> -* shuffle. Shuffle the whole zone to make sure the just onlined pages
> -* are properly distributed across the whole freelist. Make sure to
> -* shuffle once pageblocks are no longer isolated.
> +* Freshly onlined pages aren't shuffled (e.g., all pages are placed 
> to
> +* the tail of the freelist when undoing isolation). Shuffle the whole
> +* zone to make sure the just onlined pages are properly distributed
> +    * across the whole freelist - to create an initial shuffle.
>  */
> shuffle_zone(zone);
>

Acked-by: Pankaj Gupta 


Re: [PATCH v1 18/29] virtio-mem: factor out calculation of the bit number within the sb_states bitmap

2020-10-20 Thread Pankaj Gupta
> The calculation is already complicated enough, let's limit it to one
> location.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 2cc497ad8298..73ff6e9ba839 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -327,6 +327,16 @@ static int 
> virtio_mem_sbm_mb_states_prepare_next_mb(struct virtio_mem *vm)
>  _mb_id--) \
> if (virtio_mem_sbm_get_mb_state(_vm, _mb_id) == _state)
>
> +/*
> + * Calculate the bit number in the sb_states bitmap for the given subblock
> + * inside the given memory block.
> + */
> +static int virtio_mem_sbm_sb_state_bit_nr(struct virtio_mem *vm,
> + unsigned long mb_id, int sb_id)
> +{
> +   return (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
> +}
> +
>  /*
>   * Mark all selected subblocks plugged.
>   *
> @@ -336,7 +346,7 @@ static void virtio_mem_sbm_set_sb_plugged(struct 
> virtio_mem *vm,
>   unsigned long mb_id, int sb_id,
>   int count)
>  {
> -   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
> +   const int bit = virtio_mem_sbm_sb_state_bit_nr(vm, mb_id, sb_id);
>
> __bitmap_set(vm->sbm.sb_states, bit, count);
>  }
> @@ -350,7 +360,7 @@ static void virtio_mem_sbm_set_sb_unplugged(struct 
> virtio_mem *vm,
> unsigned long mb_id, int sb_id,
> int count)
>  {
> -   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
> +   const int bit = virtio_mem_sbm_sb_state_bit_nr(vm, mb_id, sb_id);
>
> __bitmap_clear(vm->sbm.sb_states, bit, count);
>  }
> @@ -362,7 +372,7 @@ static bool virtio_mem_sbm_test_sb_plugged(struct 
> virtio_mem *vm,
>unsigned long mb_id, int sb_id,
>int count)
>  {
> -   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
> +   const int bit = virtio_mem_sbm_sb_state_bit_nr(vm, mb_id, sb_id);
>
> if (count == 1)
> return test_bit(bit, vm->sbm.sb_states);
> @@ -379,7 +389,7 @@ static bool virtio_mem_sbm_test_sb_unplugged(struct 
> virtio_mem *vm,
>  unsigned long mb_id, int sb_id,
>  int count)
>  {
> -   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
> +   const int bit = virtio_mem_sbm_sb_state_bit_nr(vm, mb_id, sb_id);
>
> /* TODO: Helper similar to bitmap_set() */
> return find_next_bit(vm->sbm.sb_states, bit + count, bit) >=
> @@ -393,7 +403,7 @@ static bool virtio_mem_sbm_test_sb_unplugged(struct 
> virtio_mem *vm,
>  static int virtio_mem_sbm_first_unplugged_sb(struct virtio_mem *vm,
> unsigned long mb_id)
>  {
> -   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb;
> +   const int bit = virtio_mem_sbm_sb_state_bit_nr(vm, mb_id, 0);
>
> return find_next_zero_bit(vm->sbm.sb_states,
>   bit + vm->nb_sb_per_mb, bit) - bit;

Agree, there are a lot of *b things; good to clean up as much as possible.

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 17/29] virito-mem: subblock states are specific to Sub Block Mode (SBM)

2020-10-20 Thread Pankaj Gupta
> Let's rename and move accordingly. While at it, rename sb_bitmap to
> "sb_states".
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 118 +++-
>  1 file changed, 62 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index e76d6f769aa5..2cc497ad8298 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -137,17 +137,23 @@ struct virtio_mem {
>  * memory in one 4 KiB page.
>  */
> uint8_t *mb_states;
> -   } sbm;
>
> -   /*
> -* $nb_sb_per_mb bit per memory block. Handled similar to 
> sbm.mb_states.
> -*
> -* With 4MB subblocks, we manage 128GB of memory in one page.
> -*/
> -   unsigned long *sb_bitmap;
> +   /*
> +* Bitmap: one bit per subblock. Allocated similar to
> +* sbm.mb_states.
> +*
> +* A set bit means the corresponding subblock is plugged,
> +* otherwise it's unplugged.
> +*
> +* With 4 MiB subblocks, we manage 128 GiB of memory in one
> +* 4 KiB page.
> +*/
> +   unsigned long *sb_states;
> +   } sbm;
>
> /*
> -* Mutex that protects the sbm.mb_count, sbm.mb_states, and sb_bitmap.
> +* Mutex that protects the sbm.mb_count, sbm.mb_states, and
> +* sbm.sb_states.
>  *
>  * When this lock is held the pointers can't change, ONLINE and
>  * OFFLINE blocks can't change the state and no subblocks will get
> @@ -326,13 +332,13 @@ static int 
> virtio_mem_sbm_mb_states_prepare_next_mb(struct virtio_mem *vm)
>   *
>   * Will not modify the state of the memory block.
>   */
> -static void virtio_mem_mb_set_sb_plugged(struct virtio_mem *vm,
> -unsigned long mb_id, int sb_id,
> -int count)
> +static void virtio_mem_sbm_set_sb_plugged(struct virtio_mem *vm,
> + unsigned long mb_id, int sb_id,
> + int count)
>  {
> const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
>
> -   __bitmap_set(vm->sb_bitmap, bit, count);
> +   __bitmap_set(vm->sbm.sb_states, bit, count);
>  }
>
>  /*
> @@ -340,86 +346,87 @@ static void virtio_mem_mb_set_sb_plugged(struct 
> virtio_mem *vm,
>   *
>   * Will not modify the state of the memory block.
>   */
> -static void virtio_mem_mb_set_sb_unplugged(struct virtio_mem *vm,
> -  unsigned long mb_id, int sb_id,
> -  int count)
> +static void virtio_mem_sbm_set_sb_unplugged(struct virtio_mem *vm,
> +   unsigned long mb_id, int sb_id,
> +   int count)
>  {
> const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
>
> -   __bitmap_clear(vm->sb_bitmap, bit, count);
> +   __bitmap_clear(vm->sbm.sb_states, bit, count);
>  }
>
>  /*
>   * Test if all selected subblocks are plugged.
>   */
> -static bool virtio_mem_mb_test_sb_plugged(struct virtio_mem *vm,
> - unsigned long mb_id, int sb_id,
> - int count)
> +static bool virtio_mem_sbm_test_sb_plugged(struct virtio_mem *vm,
> +  unsigned long mb_id, int sb_id,
> +  int count)
>  {
> const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
>
> if (count == 1)
> -   return test_bit(bit, vm->sb_bitmap);
> +   return test_bit(bit, vm->sbm.sb_states);
>
> /* TODO: Helper similar to bitmap_set() */
> -   return find_next_zero_bit(vm->sb_bitmap, bit + count, bit) >=
> +   return find_next_zero_bit(vm->sbm.sb_states, bit + count, bit) >=
>bit + count;
>  }
>
>  /*
>   * Test if all selected subblocks are unplugged.
>   */
> -static bool virtio_mem_mb_test_sb_unplugged(struct virtio_mem *vm,
> -   unsigned long mb_id, int sb_id,
> -   int count)
> +static bool virtio_mem_sbm_test_sb_unplugged(struct virtio_mem *vm,
> +   

Re: [PATCH v1 16/29] virtio-mem: memory block states are specific to Sub Block Mode (SBM)

2020-10-20 Thread Pankaj Gupta
> let's use a new "sbm" sub-struct to hold SBM-specific state and rename +
> move applicable definitions, functions, and variables (related to
> memory block states).
>
> While at it:
> - Drop the "_STATE" part from memory block states
> - Rename "nb_mb_state" to "mb_count"
> - "set_mb_state" / "get_mb_state" vs. "mb_set_state" / "mb_get_state"
> - Don't use lengthy "enum virtio_mem_smb_mb_state", simply use "uint8_t"
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 215 ++--
>  1 file changed, 109 insertions(+), 106 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index fd8685673fe4..e76d6f769aa5 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -42,20 +42,23 @@ MODULE_PARM_DESC(unplug_online, "Try to unplug online 
> memory");
>   * onlined to the same zone - virtio-mem relies on this behavior.
>   */
>
> -enum virtio_mem_mb_state {
> +/*
> + * State of a Linux memory block in SBM.
> + */
> +enum virtio_mem_sbm_mb_state {
> /* Unplugged, not added to Linux. Can be reused later. */
> -   VIRTIO_MEM_MB_STATE_UNUSED = 0,
> +   VIRTIO_MEM_SBM_MB_UNUSED = 0,
> /* (Partially) plugged, not added to Linux. Error on add_memory(). */
> -   VIRTIO_MEM_MB_STATE_PLUGGED,
> +   VIRTIO_MEM_SBM_MB_PLUGGED,
> /* Fully plugged, fully added to Linux, offline. */
> -   VIRTIO_MEM_MB_STATE_OFFLINE,
> +   VIRTIO_MEM_SBM_MB_OFFLINE,
> /* Partially plugged, fully added to Linux, offline. */
> -   VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL,
> +   VIRTIO_MEM_SBM_MB_OFFLINE_PARTIAL,
> /* Fully plugged, fully added to Linux, online. */
> -   VIRTIO_MEM_MB_STATE_ONLINE,
> +   VIRTIO_MEM_SBM_MB_ONLINE,
> /* Partially plugged, fully added to Linux, online. */
> -   VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL,
> -   VIRTIO_MEM_MB_STATE_COUNT
> +   VIRTIO_MEM_SBM_MB_ONLINE_PARTIAL,
> +   VIRTIO_MEM_SBM_MB_COUNT
>  };
>
>  struct virtio_mem {
> @@ -113,9 +116,6 @@ struct virtio_mem {
>  */
> const char *resource_name;
>
> -   /* Summary of all memory block states. */
> -   unsigned long nb_mb_state[VIRTIO_MEM_MB_STATE_COUNT];
> -
> /*
>  * We don't want to add too much memory if it's not getting onlined,
>  * to avoid running OOM. Besides this threshold, we allow to have at
> @@ -125,27 +125,29 @@ struct virtio_mem {
> atomic64_t offline_size;
> uint64_t offline_threshold;
>
> -   /*
> -* One byte state per memory block.
> -*
> -* Allocated via vmalloc(). When preparing new blocks, resized
> -* (alloc+copy+free) when needed (crossing pages with the next mb).
> -* (when crossing pages).
> -*
> -* With 128MB memory blocks, we have states for 512GB of memory in one
> -* page.
> -*/
> -   uint8_t *mb_state;
> +   struct {
> +   /* Summary of all memory block states. */
> +   unsigned long mb_count[VIRTIO_MEM_SBM_MB_COUNT];
> +
> +   /*
> +* One byte state per memory block. Allocated via vmalloc().
> +* Resized (alloc+copy+free) on demand.
> +*
> +* With 128 MiB memory blocks, we have states for 512 GiB of
> +* memory in one 4 KiB page.
> +*/
> +   uint8_t *mb_states;
> +   } sbm;
>
> /*
> -* $nb_sb_per_mb bit per memory block. Handled similar to mb_state.
> +* $nb_sb_per_mb bit per memory block. Handled similar to 
> sbm.mb_states.
>  *
>  * With 4MB subblocks, we manage 128GB of memory in one page.
>  */
> unsigned long *sb_bitmap;
>
> /*
> -* Mutex that protects the nb_mb_state, mb_state, and sb_bitmap.
> +* Mutex that protects the sbm.mb_count, sbm.mb_states, and sb_bitmap.
>  *
>  * When this lock is held the pointers can't change, ONLINE and
>  * OFFLINE blocks can't change the state and no subblocks will get
> @@ -254,70 +256,70 @@ static unsigned long virtio_mem_phys_to_sb_id(struct 
> virtio_mem *vm,
>  /*
>   * Set the state of a memory block, taking care of the state counter.
>   */
> -static void virtio_mem_mb_set_state(struct virtio_mem *vm, unsigned long 
> mb_id,
> -   

Re: [PATCH v1 15/29] virito-mem: document Sub Block Mode (SBM)

2020-10-20 Thread Pankaj Gupta
> > Let's add some documentation for the current mode - Sub Block Mode (SBM) -
> > to prepare for a new mode - Big Block Mode (BBM).
> >
> > Follow-up patches will properly factor out the existing Sub Block Mode
> > (SBM) and implement Device Block Mode (DBM).
>
> s/Device Block Mode (DBM)/Big Block Mode (BBM)/
>

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 12/29] virtio-mem: factor out fake-offlining into virtio_mem_fake_offline()

2020-10-20 Thread Pankaj Gupta
> ... which now matches virtio_mem_fake_online(). We'll reuse this
> functionality soon.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 34 --
>  1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 00d1cfca4713..d132bc54ef57 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -832,6 +832,27 @@ static void virtio_mem_fake_online(unsigned long pfn, 
> unsigned long nr_pages)
> }
>  }
>
> +/*
> + * Try to allocate a range, marking pages fake-offline, effectively
> + * fake-offlining them.
> + */
> +static int virtio_mem_fake_offline(unsigned long pfn, unsigned long nr_pages)
> +{
> +   int rc;
> +
> +   rc = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE,
> +   GFP_KERNEL);
> +   if (rc == -ENOMEM)
> +   /* whoops, out of memory */
> +   return rc;
> +   if (rc)
> +   return -EBUSY;
> +
> +   virtio_mem_set_fake_offline(pfn, nr_pages, true);
> +   adjust_managed_page_count(pfn_to_page(pfn), -nr_pages);
> +   return 0;
> +}
> +
>  static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
>  {
> const unsigned long addr = page_to_phys(page);
> @@ -1335,17 +1356,10 @@ static int virtio_mem_mb_unplug_sb_online(struct 
> virtio_mem *vm,
>
> start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
>  sb_id * vm->subblock_size);
> -   rc = alloc_contig_range(start_pfn, start_pfn + nr_pages,
> -   MIGRATE_MOVABLE, GFP_KERNEL);
> -   if (rc == -ENOMEM)
> -   /* whoops, out of memory */
> -   return rc;
> -   if (rc)
> -   return -EBUSY;
>
> -   /* Mark it as fake-offline before unplugging it */
> -   virtio_mem_set_fake_offline(start_pfn, nr_pages, true);
> -   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> +   rc = virtio_mem_fake_offline(start_pfn, nr_pages);
> +   if (rc)
> +   return rc;
>
> /* Try to unplug the allocated memory */
> rc = virtio_mem_mb_unplug_sb(vm, mb_id, sb_id, count);

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 07/29] virtio-mem: generalize virtio_mem_overlaps_range()

2020-10-20 Thread Pankaj Gupta
> Avoid using memory block ids. While at it, use uint64_t for
> address/size.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 10 +++---
>  1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 821143db14fe..37a0e338ae4a 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -489,14 +489,10 @@ static int virtio_mem_translate_node_id(struct 
> virtio_mem *vm, uint16_t node_id)
>   * Test if a virtio-mem device overlaps with the given range. Can be called
>   * from (notifier) callbacks lockless.
>   */
> -static bool virtio_mem_overlaps_range(struct virtio_mem *vm,
> - unsigned long start, unsigned long size)
> +static bool virtio_mem_overlaps_range(struct virtio_mem *vm, uint64_t start,
> + uint64_t size)
>  {
> -   unsigned long dev_start = virtio_mem_mb_id_to_phys(vm->first_mb_id);
> -   unsigned long dev_end = virtio_mem_mb_id_to_phys(vm->last_mb_id) +
> -   memory_block_size_bytes();
> -
> -   return start < dev_end && dev_start < start + size;
> +   return start < vm->addr + vm->region_size && vm->addr < start + size;
>  }

Reviewed-by: Pankaj Gupta 


Re: [PATCH v2] mm/compaction: Rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()

2020-10-19 Thread Pankaj Gupta
> From: Yanfei Xu 
>
> There are two 'start_pfn' declared in compact_zone() which have
> different meaning. Rename the second one to 'iteration_start_pfn'
> to prevent trace_mm_compaction_end() from tracing an undesirable
> value.
>
> BTW, remove an useless semicolon.
>
> Acked-by: David Hildenbrand 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Yanfei Xu 
> ---
> v1->v2:
> Rename 'start_pfn' to 'iteration_start_pfn' and change commit messages.
>
>  mm/compaction.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 176dcded298e..ccd27c739fd6 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2272,7 +2272,7 @@ compact_zone(struct compact_control *cc, struct 
> capture_control *capc)
>
> while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
> int err;
> -   unsigned long start_pfn = cc->migrate_pfn;
> +   unsigned long iteration_start_pfn = cc->migrate_pfn;
>
> /*
>  * Avoid multiple rescans which can happen if a page cannot be
> @@ -2284,7 +2284,7 @@ compact_zone(struct compact_control *cc, struct 
> capture_control *capc)
>  */
> cc->rescan = false;
> if (pageblock_start_pfn(last_migrated_pfn) ==
> -   pageblock_start_pfn(start_pfn)) {
> +   pageblock_start_pfn(iteration_start_pfn)) {
> cc->rescan = true;
> }
>
> @@ -2308,8 +2308,7 @@ compact_zone(struct compact_control *cc, struct 
> capture_control *capc)
> goto check_drain;
> case ISOLATE_SUCCESS:
> update_cached = false;
> -   last_migrated_pfn = start_pfn;
> -   ;
> +   last_migrated_pfn = iteration_start_pfn;
> }
>
> err = migrate_pages(&cc->migratepages, compaction_alloc,

Improves readability.
Acked-by: Pankaj Gupta 


Re: [PATCH v1 08/29] virtio-mem: drop last_mb_id

2020-10-15 Thread Pankaj Gupta
> No longer used, let's drop it.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 4 
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 37a0e338ae4a..5c93f8a65eba 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -84,8 +84,6 @@ struct virtio_mem {
>
> /* Id of the first memory block of this device. */
> unsigned long first_mb_id;
> -   /* Id of the last memory block of this device. */
> -   unsigned long last_mb_id;
> /* Id of the last usable memory block of this device. */
> unsigned long last_usable_mb_id;
> /* Id of the next memory bock to prepare when needed. */
> @@ -1689,8 +1687,6 @@ static int virtio_mem_init(struct virtio_mem *vm)
> vm->first_mb_id = virtio_mem_phys_to_mb_id(vm->addr - 1 +
>memory_block_size_bytes());
> vm->next_mb_id = vm->first_mb_id;
> -   vm->last_mb_id = virtio_mem_phys_to_mb_id(vm->addr +
> -vm->region_size) - 1;
>
> dev_info(>vdev->dev, "start address: 0x%llx", vm->addr);
> dev_info(>vdev->dev, "region size: 0x%llx", vm->region_size);

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 11/29] virtio-mem: use "unsigned long" for nr_pages when fake onlining/offlining

2020-10-15 Thread Pankaj Gupta
> No harm done, but let's be consistent.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index cb2e8f254650..00d1cfca4713 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -766,7 +766,7 @@ static int virtio_mem_memory_notifier_cb(struct 
> notifier_block *nb,
>   * (via generic_online_page()) using PageDirty().
>   */
>  static void virtio_mem_set_fake_offline(unsigned long pfn,
> -   unsigned int nr_pages, bool onlined)
> +   unsigned long nr_pages, bool onlined)
>  {
> for (; nr_pages--; pfn++) {
> struct page *page = pfn_to_page(pfn);
> @@ -785,7 +785,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>   * (via generic_online_page()), clear PageDirty().
>   */
>  static void virtio_mem_clear_fake_offline(unsigned long pfn,
> - unsigned int nr_pages, bool onlined)
> + unsigned long nr_pages, bool 
> onlined)
>  {
> for (; nr_pages--; pfn++) {
> struct page *page = pfn_to_page(pfn);
> @@ -800,10 +800,10 @@ static void virtio_mem_clear_fake_offline(unsigned long 
> pfn,
>   * Release a range of fake-offline pages to the buddy, effectively
>   * fake-onlining them.
>   */
> -static void virtio_mem_fake_online(unsigned long pfn, unsigned int nr_pages)
> +static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
>  {
> const unsigned long max_nr_pages = MAX_ORDER_NR_PAGES;
> -   int i;
> +   unsigned long i;
>
> /*
>  * We are always called at least with MAX_ORDER_NR_PAGES

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 06/29] virtio-mem: generalize virtio_mem_owned_mb()

2020-10-15 Thread Pankaj Gupta
> Avoid using memory block ids. Rename it to virtio_mem_contains_range().
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 6bbd1cfd10d3..821143db14fe 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -500,12 +500,13 @@ static bool virtio_mem_overlaps_range(struct virtio_mem 
> *vm,
>  }
>
>  /*
> - * Test if a virtio-mem device owns a memory block. Can be called from
> + * Test if a virtio-mem device contains a given range. Can be called from
>   * (notifier) callbacks lockless.
>   */
> -static bool virtio_mem_owned_mb(struct virtio_mem *vm, unsigned long mb_id)
> +static bool virtio_mem_contains_range(struct virtio_mem *vm, uint64_t start,
> + uint64_t size)
>  {
> -   return mb_id >= vm->first_mb_id && mb_id <= vm->last_mb_id;
> +   return start >= vm->addr && start + size <= vm->addr + 
> vm->region_size;
>  }
>
>  static int virtio_mem_notify_going_online(struct virtio_mem *vm,
> @@ -800,7 +801,7 @@ static void virtio_mem_online_page_cb(struct page *page, 
> unsigned int order)
>  */
> rcu_read_lock();
> list_for_each_entry_rcu(vm, &virtio_mem_devices, next) {
> -   if (!virtio_mem_owned_mb(vm, mb_id))
> +   if (!virtio_mem_contains_range(vm, addr, PFN_PHYS(1 << 
> order)))
> continue;
>
> sb_id = virtio_mem_phys_to_sb_id(vm, addr);

Looks good.
Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 02/29] virtio-mem: simplify calculation in virtio_mem_mb_state_prepare_next_mb()

2020-10-15 Thread Pankaj Gupta
> We actually need one byte less (next_mb_id is exclusive, first_mb_id is
> inclusive). Simplify.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index a1f5bf7a571a..670b3faf412d 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -257,8 +257,8 @@ static enum virtio_mem_mb_state 
> virtio_mem_mb_get_state(struct virtio_mem *vm,
>   */
>  static int virtio_mem_mb_state_prepare_next_mb(struct virtio_mem *vm)
>  {
> -   unsigned long old_bytes = vm->next_mb_id - vm->first_mb_id + 1;
> -   unsigned long new_bytes = vm->next_mb_id - vm->first_mb_id + 2;
> +   unsigned long old_bytes = vm->next_mb_id - vm->first_mb_id;
> +   unsigned long new_bytes = old_bytes + 1;

Maybe we can avoid the new_bytes & old_bytes variables and instead use a
single variable, which can later be used with PFN_UP/PFN_DOWN (rough sketch below).

> int old_pages = PFN_UP(old_bytes);
> int new_pages = PFN_UP(new_bytes);
> uint8_t *new_mb_state;
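
Something along these lines is what I mean (rough sketch only, not compile-tested):

/* Derive both page counts from a single byte count instead of keeping
 * separate old_bytes/new_bytes variables.
 */
static int virtio_mem_mb_state_prepare_next_mb(struct virtio_mem *vm)
{
	const unsigned long nb_bytes = vm->next_mb_id - vm->first_mb_id;
	const int old_pages = PFN_UP(nb_bytes);
	const int new_pages = PFN_UP(nb_bytes + 1);
	uint8_t *new_mb_state;

	/* ... rest as in the patch: only reallocate when new_pages != old_pages ... */
}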


Re: [PATCH v1 01/29] virtio-mem: determine nid only once using memory_add_physaddr_to_nid()

2020-10-15 Thread Pankaj Gupta
> Let's determine the target nid only once in case we have none specified -
> usually, we'll end up with node 0 either way.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 28 +++-
>  1 file changed, 11 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index ba4de598f663..a1f5bf7a571a 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -70,7 +70,7 @@ struct virtio_mem {
>
> /* The device block size (for communicating with the device). */
> uint64_t device_block_size;
> -   /* The translated node id. NUMA_NO_NODE in case not specified. */
> +   /* The determined node id for all memory of the device. */
> int nid;
> /* Physical start address of the memory region. */
> uint64_t addr;
> @@ -406,10 +406,6 @@ static int virtio_mem_sb_bitmap_prepare_next_mb(struct 
> virtio_mem *vm)
>  static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id)
>  {
> const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
> -   int nid = vm->nid;
> -
> -   if (nid == NUMA_NO_NODE)
> -   nid = memory_add_physaddr_to_nid(addr);
>
> /*
>  * When force-unloading the driver and we still have memory added to
> @@ -423,7 +419,8 @@ static int virtio_mem_mb_add(struct virtio_mem *vm, 
> unsigned long mb_id)
> }
>
> dev_dbg(>vdev->dev, "adding memory block: %lu\n", mb_id);
> -   return add_memory_driver_managed(nid, addr, memory_block_size_bytes(),
> +   return add_memory_driver_managed(vm->nid, addr,
> +memory_block_size_bytes(),
>  vm->resource_name,
>  MEMHP_MERGE_RESOURCE);
>  }
> @@ -440,13 +437,9 @@ static int virtio_mem_mb_add(struct virtio_mem *vm, 
> unsigned long mb_id)
>  static int virtio_mem_mb_remove(struct virtio_mem *vm, unsigned long mb_id)
>  {
> const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
> -   int nid = vm->nid;
> -
> -   if (nid == NUMA_NO_NODE)
> -   nid = memory_add_physaddr_to_nid(addr);
>
> dev_dbg(>vdev->dev, "removing memory block: %lu\n", mb_id);
> -   return remove_memory(nid, addr, memory_block_size_bytes());
> +   return remove_memory(vm->nid, addr, memory_block_size_bytes());
>  }
>
>  /*
> @@ -461,14 +454,11 @@ static int virtio_mem_mb_offline_and_remove(struct 
> virtio_mem *vm,
> unsigned long mb_id)
>  {
> const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
> -   int nid = vm->nid;
> -
> -   if (nid == NUMA_NO_NODE)
> -   nid = memory_add_physaddr_to_nid(addr);
>
> dev_dbg(>vdev->dev, "offlining and removing memory block: %lu\n",
> mb_id);
> -   return offline_and_remove_memory(nid, addr, 
> memory_block_size_bytes());
> +   return offline_and_remove_memory(vm->nid, addr,
> +memory_block_size_bytes());
>  }
>
>  /*
> @@ -1659,6 +1649,10 @@ static int virtio_mem_init(struct virtio_mem *vm)
> virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
> &vm->region_size);
>
> +   /* Determine the nid for the device based on the lowest address. */
> +   if (vm->nid == NUMA_NO_NODE)
> +   vm->nid = memory_add_physaddr_to_nid(vm->addr);
> +
> /*
>  * We always hotplug memory in memory block granularity. This way,
>  * we have to wait for exactly one memory block to online.
> @@ -1707,7 +1701,7 @@ static int virtio_mem_init(struct virtio_mem *vm)
>  memory_block_size_bytes());
> dev_info(>vdev->dev, "subblock size: 0x%llx",
>  (unsigned long long)vm->subblock_size);
> -   if (vm->nid != NUMA_NO_NODE)
> +   if (vm->nid != NUMA_NO_NODE && IS_ENABLED(CONFIG_NUMA))
> dev_info(>vdev->dev, "nid: %d", vm->nid);
>
> return 0;

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 04/29] virtio-mem: drop rc2 in virtio_mem_mb_plug_and_add()

2020-10-12 Thread Pankaj Gupta
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 78c2fbcddcf8..b3eebac7191f 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1072,7 +1072,7 @@ static int virtio_mem_mb_plug_and_add(struct virtio_mem 
> *vm,
>   uint64_t *nb_sb)
>  {
> const int count = min_t(int, *nb_sb, vm->nb_sb_per_mb);
> -   int rc, rc2;
> +   int rc;
>
> if (WARN_ON_ONCE(!count))
> return -EINVAL;
> @@ -1103,13 +1103,12 @@ static int virtio_mem_mb_plug_and_add(struct 
> virtio_mem *vm,
>
> dev_err(&vm->vdev->dev,
> "adding memory block %lu failed with %d\n", mb_id, 
> rc);
> -   rc2 = virtio_mem_mb_unplug_sb(vm, mb_id, 0, count);
>
> /*
>  * TODO: Linux MM does not properly clean up yet in all cases
>  * where adding of memory failed - especially on -ENOMEM.
>  */
> -   if (rc2)
> +   if (virtio_mem_mb_unplug_sb(vm, mb_id, 0, count))
> new_state = VIRTIO_MEM_MB_STATE_PLUGGED;
> virtio_mem_mb_set_state(vm, mb_id, new_state);
> return rc;

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 3/5] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

2020-09-28 Thread Pankaj Gupta
return 0;
>
> return move_freepages(zone, start_page, end_page, migratetype,
> -   num_movable);
> + to_tail, num_movable);
>  }
>
>  static void change_pageblock_range(struct page *pageblock_page,
> @@ -2526,8 +2538,8 @@ static void steal_suitable_fallback(struct zone *zone, 
> struct page *page,
> if (!whole_block)
> goto single_page;
>
> -   free_pages = move_freepages_block(zone, page, start_type,
> -   &free_pages);
> +   free_pages = move_freepages_block(zone, page, start_type, false,
> +  &free_pages);
> /*
>  * Determine how many pages are compatible with our allocation.
>  * For movable allocation, it's the number of movable pages which
> @@ -2635,7 +2647,8 @@ static void reserve_highatomic_pageblock(struct page 
> *page, struct zone *zone,
> && !is_migrate_cma(mt)) {
> zone->nr_reserved_highatomic += pageblock_nr_pages;
> set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> -   move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
> +   move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, false,
> +NULL);
> }
>
>  out_unlock:
> @@ -2711,7 +2724,7 @@ static bool unreserve_highatomic_pageblock(const struct 
> alloc_context *ac,
>  */
> set_pageblock_migratetype(page, ac->migratetype);
> ret = move_freepages_block(zone, page, 
> ac->migratetype,
> -   NULL);
> +  false, NULL);
> if (ret) {
> spin_unlock_irqrestore(&zone->lock, flags);
> return ret;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..de44e1329706 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -45,7 +45,7 @@ static int set_migratetype_isolate(struct page *page, int 
> migratetype, int isol_
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> zone->nr_isolate_pageblock++;
> nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
> -   NULL);
> +   false, NULL);
>
> __mod_zone_freepage_state(zone, -nr_pages, mt);
> spin_unlock_irqrestore(>lock, flags);
> @@ -83,7 +83,7 @@ static void unset_migratetype_isolate(struct page *page, 
> unsigned migratetype)
>  * Because freepage with more than pageblock_order on isolated
>  * pageblock is restricted to merge due to freepage counting problem,
>  * it is possible that there is free buddy page.
> -* move_freepages_block() doesn't care of merge so we need other
> +* move_freepages_block() don't care about merging, so we need another
>  * approach in order to merge them. Isolation and free will make
>  * these pages to be merged.
>  */
> @@ -106,9 +106,15 @@ static void unset_migratetype_isolate(struct page *page, 
> unsigned migratetype)
>  * If we isolate freepage with more than pageblock_order, there
>  * should be no freepage in the range, so we could avoid costly
>  * pageblock scanning for freepage moving.
> +*
> +* We didn't actually touch any of the isolated pages, so place them
> +* to the tail of the freelist. This is an optimization for memory
> +* onlining - just onlined memory won't immediately be considered for
> +* allocation.
>  */
> if (!isolated_page) {
> -   nr_pages = move_freepages_block(zone, page, migratetype, 
> NULL);
> +   nr_pages = move_freepages_block(zone, page, migratetype, true,
> +   NULL);
> __mod_zone_freepage_state(zone, nr_pages, migratetype);
> }
> set_pageblock_migratetype(page, migratetype);

Acked-by: Pankaj Gupta 


Re: [PATCH v1 2/5] mm/page_alloc: place pages to tail in __putback_isolated_page()

2020-09-28 Thread Pankaj Gupta
> __putback_isolated_page() already documents that pages will be placed to
> the tail of the freelist - this is, however, not the case for
> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
> the case for all existing users.
>
> This change affects two users:
> - free page reporting
> - page isolation, when undoing the isolation (including memory onlining).
>
> This behavior is desirable for pages that haven't really been touched
> lately, so exactly the two users that don't actually read/write page
> content, but rather move untouched pages.
>
> The new behavior is especially desirable for memory onlining, where we
> allow allocation of newly onlined pages via undo_isolate_page_range()
> in online_pages(). Right now, we always place them to the head of the
> free list, resulting in undesirable behavior: Assume we add
> individual memory chunks via add_memory() and online them right away to
> the NORMAL zone. We create a dependency chain of unmovable allocations
> e.g., via the memmap. The memmap of the next chunk will be placed onto
> previous chunks - if the last block cannot get offlined+removed, all
> dependent ones cannot get offlined+removed. While this can already be
> observed with individual DIMMs, it's more of an issue for virtio-mem
> (and I suspect also ppc DLPAR).
>
> Document that this should only be used for optimizations, and no code
> should rely on this for correctness (if the order of freepage lists
> ever changes).
>
> We won't care about page shuffling: memory onlining already properly
> shuffles after onlining. free page reporting doesn't care about
> physically contiguous ranges, and there are already cases where page
> isolation will simply move (physically close) free pages to (currently)
> the head of the freelists via move_freepages_block() instead of
> shuffling. If this becomes ever relevant, we should shuffle the whole
> zone when undoing isolation of larger ranges, and after
> free_contig_range().
>
> Reviewed-by: Alexander Duyck 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Scott Cheloha 
> Cc: Michael Ellerman 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/page_alloc.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index daab90e960fe..9e3ed4a6f69a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -89,6 +89,18 @@ typedef int __bitwise fop_t;
>   */
>  #define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>
> +/*
> + * Place the (possibly merged) page to the tail of the freelist. Will ignore
> + * page shuffling (relevant code - e.g., memory onlining - is expected to
> + * shuffle the whole zone).
> + *
> + * Note: No code should rely onto this flag for correctness - it's purely
> + *   to allow for optimizations when handing back either fresh pages
> + *   (memory onlining) or untouched pages (page isolation, free page
> + *   reporting).
> + */
> +#define FOP_TO_TAIL    ((__force fop_t)BIT(1))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION   (8)
> @@ -1038,7 +1050,9 @@ static inline void __free_one_page(struct page *page, 
> unsigned long pfn,
>  done_merging:
> set_page_order(page, order);
>
> -   if (is_shuffle_order(order))
> +   if (fop_flags & FOP_TO_TAIL)
> +   to_tail = true;
> +   else if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
> else
> to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
> @@ -3300,7 +3314,7 @@ void __putback_isolated_page(struct page *page, 
> unsigned int order, int mt)
>
> /* Return isolated page to tail of freelist. */
> __free_one_page(page, page_to_pfn(page), zone, order, mt,
> -   FOP_SKIP_REPORT_NOTIFY);
> +   FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
>  }

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 4/5] mm/page_alloc: place pages to tail in __free_pages_core()

2020-09-28 Thread Pankaj Gupta
1 @@ void __free_pages_core(struct page *page, unsigned int 
> order)
> struct page *p = page;
> unsigned int loop;
>
> +   /*
> +* When initializing the memmap, init_single_page() sets the refcount
> +* of all pages to 1 ("allocated"/"not free"). We have to set the
> +* refcount of all involved pages to 0.
> +*/
> prefetchw(p);
> for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
> prefetchw(p + 1);
> @@ -1539,8 +1545,12 @@ void __free_pages_core(struct page *page, unsigned int 
> order)
> set_page_count(p, 0);
>
> atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> -   set_page_refcounted(page);
> -   __free_pages(page, order);
> +
> +   /*
> +* Bypass PCP and place fresh pages right to the tail, primarily
> +* relevant for memory onlining.
> +*/
> +   __free_pages_ok(page, order, FOP_TO_TAIL);
>  }
>
>  #ifdef CONFIG_NEED_MULTIPLE_NODES
> @@ -3171,7 +3181,8 @@ static void free_unref_page_commit(struct page *page, 
> unsigned long pfn)
>  */
> if (migratetype >= MIGRATE_PCPTYPES) {
> if (unlikely(is_migrate_isolate(migratetype))) {
> -   free_one_page(zone, page, pfn, 0, migratetype);
> +   free_one_page(zone, page, pfn, 0, migratetype,
> +         FOP_NONE);
> return;
> }
> migratetype = MIGRATE_MOVABLE;
> @@ -5063,7 +5074,7 @@ static inline void free_the_page(struct page *page, 
> unsigned int order)
> if (order == 0) /* Via pcp? */
> free_unref_page(page);
> else
> -   __free_pages_ok(page, order);
> +   __free_pages_ok(page, order, FOP_NONE);
>  }
>
>  void __free_pages(struct page *page, unsigned int order)

Acked-by: Pankaj Gupta 


Re: [PATCH v1 1/5] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

2020-09-28 Thread Pankaj Gupta
> Let's prepare for additional flags and avoid long parameter lists of bools.
> Follow-up patches will also make use of the flags in __free_pages_ok(),
> however, I wasn't able to come up with a better name for the type - should
> be good enough for internal purposes.
>
> Reviewed-by: Alexander Duyck 
> Reviewed-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/page_alloc.c | 28 
>  1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df90e3654f97..daab90e960fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -77,6 +77,18 @@
>  #include "shuffle.h"
>  #include "page_reporting.h"
>
> +/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
> +typedef int __bitwise fop_t;
> +
> +/* No special request */
> +#define FOP_NONE   ((__force fop_t)0)
> +
> +/*
> + * Skip free page reporting notification for the (possibly merged) page. 
> (will
> + * *not* mark the page reported, only skip the notification).
> + */
> +#define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION   (8)
> @@ -948,10 +960,9 @@ buddy_merge_likely(unsigned long pfn, unsigned long 
> buddy_pfn,
>   * -- nyc
>   */
>
> -static inline void __free_one_page(struct page *page,
> -   unsigned long pfn,
> -   struct zone *zone, unsigned int order,
> -   int migratetype, bool report)
> +static inline void __free_one_page(struct page *page, unsigned long pfn,
> +  struct zone *zone, unsigned int order,
> +  int migratetype, fop_t fop_flags)
>  {
> struct capture_control *capc = task_capc(zone);
> unsigned long buddy_pfn;
> @@ -1038,7 +1049,7 @@ static inline void __free_one_page(struct page *page,
> add_to_free_list(page, zone, order, migratetype);
>
> /* Notify page reporting subsystem of freed page */
> -   if (report)
> +   if (!(fop_flags & FOP_SKIP_REPORT_NOTIFY))
> page_reporting_notify_free(order);
>  }
>
> @@ -1379,7 +1390,7 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
> -   __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
> +   __free_one_page(page, page_to_pfn(page), zone, 0, mt, 
> FOP_NONE);
> trace_mm_page_pcpu_drain(page, 0, mt);
> }
> spin_unlock(&zone->lock);
> @@ -1395,7 +1406,7 @@ static void free_one_page(struct zone *zone,
> is_migrate_isolate(migratetype))) {
> migratetype = get_pfnblock_migratetype(page, pfn);
> }
> -   __free_one_page(page, pfn, zone, order, migratetype, true);
> +   __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
> spin_unlock(&zone->lock);
>  }
>
> @@ -3288,7 +3299,8 @@ void __putback_isolated_page(struct page *page, 
> unsigned int order, int mt)
> lockdep_assert_held(&zone->lock);
>
> /* Return isolated page to tail of freelist. */
> -   __free_one_page(page, page_to_pfn(page), zone, order, mt, false);
> +   __free_one_page(page, page_to_pfn(page), zone, order, mt,
> +   FOP_SKIP_REPORT_NOTIFY);
>  }

Reviewed-by: Pankaj Gupta 


Re: [PATCH] kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()

2020-09-17 Thread Pankaj Gupta
> "mem" in the name already indicates the root, similar to
> release_mem_region() and devm_request_mem_region(). Make it implicit.
> The only single caller always passes iomem_resource, other parents are
> not applicable.
>
> Suggested-by: Wei Yang 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Dan Williams 
> Cc: Jason Gunthorpe 
> Cc: Kees Cook 
> Cc: Ard Biesheuvel 
> Cc: Pankaj Gupta 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>
> Based on next-20200915. Follow up on
> "[PATCH v4 0/8] selective merging of system ram resources" [1]
> That's in next-20200915. As noted during review of v2 by Wei [2].
>
> [1] https://lkml.kernel.org/r/20200911103459.10306-1-da...@redhat.com
> [2] https://lkml.kernel.org/r/20200915021012.GC2007@L-31X9LVDL-1304.local
>
> ---
>  include/linux/ioport.h | 3 +--
>  kernel/resource.c  | 5 ++---
>  mm/memory_hotplug.c| 2 +-
>  3 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 7e61389dcb01..5135d4b86cd6 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -251,8 +251,7 @@ extern struct resource * __request_region(struct resource 
> *,
>  extern void __release_region(struct resource *, resource_size_t,
> resource_size_t);
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -extern void release_mem_region_adjustable(struct resource *, resource_size_t,
> - resource_size_t);
> +extern void release_mem_region_adjustable(resource_size_t, resource_size_t);
>  #endif
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  extern void merge_system_ram_resource(struct resource *res);
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 7a91b935f4c2..ca2a666e4317 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1240,7 +1240,6 @@ EXPORT_SYMBOL(__release_region);
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  /**
>   * release_mem_region_adjustable - release a previously reserved memory 
> region
> - * @parent: parent resource descriptor
>   * @start: resource start address
>   * @size: resource region size
>   *
> @@ -1258,9 +1257,9 @@ EXPORT_SYMBOL(__release_region);
>   *   assumes that all children remain in the lower address entry for
>   *   simplicity.  Enhance this logic when necessary.
>   */
> -void release_mem_region_adjustable(struct resource *parent,
> -  resource_size_t start, resource_size_t 
> size)
> +void release_mem_region_adjustable(resource_size_t start, resource_size_t 
> size)
>  {
> +   struct resource *parent = &iomem_resource;
> struct resource *new_res = NULL;
> bool alloc_nofail = false;
> struct resource **p;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 553c718226b3..7c5e4744ac51 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1764,7 +1764,7 @@ static int __ref try_remove_memory(int nid, u64 start, 
> u64 size)
> memblock_remove(start, size);
> }
>
> -   release_mem_region_adjustable(&iomem_resource, start, size);
> +   release_mem_region_adjustable(start, size);
>
> try_offline_node(nid);
>

Reviewed-by: Pankaj Gupta 


Re: [PATCH v3 5/7] virtio-mem: try to merge system ram resources

2020-09-10 Thread Pankaj Gupta
Reviewed-by: Pankaj Gupta 


Re: [PATCH v3 4/7] mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources

2020-09-10 Thread Pankaj Gupta
Looks good to me.

Reviewed-by: Pankaj Gupta 


Re: [PATCH v3 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends

2020-09-10 Thread Pankaj Gupta
> We soon want to pass flags, e.g., to mark added System RAM resources
> mergeable. Prepare for that.
>
> This patch is based on a similar patch by Oscar Salvador:
>
> https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de
>
> Acked-by: Wei Liu 
> Reviewed-by: Juergen Gross  # Xen related part
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Dan Williams 
> Cc: Jason Gunthorpe 
> Cc: Pankaj Gupta 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Greg Kroah-Hartman 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Wei Liu 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> Cc: David Hildenbrand 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: "Oliver O'Halloran" 
> Cc: Pingfan Liu 
> Cc: Nathan Lynch 
> Cc: Libor Pechacek 
> Cc: Anton Blanchard 
> Cc: Leonardo Bras 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-nvd...@lists.01.org
> Cc: linux-hyp...@vger.kernel.org
> Cc: linux-s...@vger.kernel.org
> Cc: virtualizat...@lists.linux-foundation.org
> Cc: xen-de...@lists.xenproject.org
> Signed-off-by: David Hildenbrand 
> ---
>  arch/powerpc/platforms/powernv/memtrace.c   |  2 +-
>  arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
>  drivers/acpi/acpi_memhotplug.c  |  3 ++-
>  drivers/base/memory.c   |  3 ++-
>  drivers/dax/kmem.c  |  2 +-
>  drivers/hv/hv_balloon.c |  2 +-
>  drivers/s390/char/sclp_cmd.c|  2 +-
>  drivers/virtio/virtio_mem.c |  2 +-
>  drivers/xen/balloon.c   |  2 +-
>  include/linux/memory_hotplug.h  | 16 
>  mm/memory_hotplug.c | 14 +++---
>  11 files changed, 30 insertions(+), 20 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
> b/arch/powerpc/platforms/powernv/memtrace.c
> index 13b369d2cc454..6828108486f83 100644
> --- a/arch/powerpc/platforms/powernv/memtrace.c
> +++ b/arch/powerpc/platforms/powernv/memtrace.c
> @@ -224,7 +224,7 @@ static int memtrace_online(void)
> ent->mem = 0;
> }
>
> -   if (add_memory(ent->nid, ent->start, ent->size)) {
> +   if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
> pr_err("Failed to add trace memory to node %d\n",
> ent->nid);
> ret += 1;
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 0ea976d1cac47..e1c9fa0d730f5 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -615,7 +615,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> nid = memory_add_physaddr_to_nid(lmb->base_addr);
>
> /* Add the memory */
> -   rc = __add_memory(nid, lmb->base_addr, block_sz);
> +   rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
> if (rc) {
> invalidate_lmb_associativity_index(lmb);
> return rc;
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index e294f44a78504..2067c3bc55763 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -207,7 +207,8 @@ static int acpi_memory_enable_device(struct 
> acpi_memory_device *mem_device)
> if (node < 0)
> node = memory_add_physaddr_to_nid(info->start_addr);
>
> -   result = __add_memory(node, info->start_addr, info->length);
> +   result = __add_memory(node, info->start_addr, info->length,
> + MHP_NONE);
>
> /*
>  * If the memory block has been used by the kernel, 
> add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 4db3c660de831..b4c297dd04755 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -432,7 +432,8 @@ static ssize_t probe_store(struct device *dev, struct 
> device_attribute *attr,
>
> nid = memory_add_physaddr_to_nid(phys_addr);
> ret = __add_memory(nid, phys_addr,

Re: [PATCH] Fix khugepaged's request size in collapse_file()

2020-09-02 Thread Pankaj Gupta
> collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to be
> read ahead to page_cache_sync_readahead().  It seems this was expressed as a
> number of bytes rather than a number of pages.
>
> Fix it to use the number of pages to the end of the window instead.
>
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Signed-off-by: David Howells 
> Acked-by: Song Liu 
> Reviewed-by: Matthew Wilcox (Oracle) 
> ---
>
>  mm/khugepaged.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6d199c353281..f2d243077b74 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1706,7 +1706,7 @@ static void collapse_file(struct mm_struct *mm,
> xas_unlock_irq(&xas);
> page_cache_sync_readahead(mapping, 
> &file->f_ra,
>   file, index,
> - PAGE_SIZE);
> + end - index);
> /* drain pagevecs to help isolate_lru_page() 
> */
> lru_add_drain();
> page = find_lock_page(mapping, index);
>

Acked-by: Pankaj Gupta 


Re: [PATCH v5 2/3] memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC

2020-09-01 Thread Pankaj Gupta
> This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX also
> being used by non DAX devices.
>
> No functional change intended.
>
> Signed-off-by: Roger Pau Monné 
> Reviewed-by: Ira Weiny 
> Acked-by: Andrew Morton 
> ---
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Andrew Morton 
> Cc: Jason Gunthorpe 
> Cc: Ira Weiny 
> Cc: "Aneesh Kumar K.V" 
> Cc: Johannes Thumshirn 
> Cc: Logan Gunthorpe 
> Cc: Juergen Gross 
> Cc: linux-nvd...@lists.01.org
> Cc: xen-de...@lists.xenproject.org
> Cc: linux...@kvack.org
> ---
>  drivers/dax/device.c | 2 +-
>  include/linux/memremap.h | 9 -
>  mm/memremap.c| 2 +-
>  3 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 4c0af2eb7e19..1e89513f3c59 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -429,7 +429,7 @@ int dev_dax_probe(struct device *dev)
> return -EBUSY;
> }
>
> -   dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
> +   dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC;
> addr = devm_memremap_pages(dev, &dev_dax->pgmap);
> if (IS_ERR(addr))
> return PTR_ERR(addr);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 5f5b2df06e61..e5862746751b 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -46,11 +46,10 @@ struct vmem_altmap {
>   * wakeup is used to coordinate physical address space management (ex:
>   * fs truncate/hole punch) vs pinned pages (ex: device dma).
>   *
> - * MEMORY_DEVICE_DEVDAX:
> + * MEMORY_DEVICE_GENERIC:
>   * Host memory that has similar access semantics as System RAM i.e. DMA
> - * coherent and supports page pinning. In contrast to
> - * MEMORY_DEVICE_FS_DAX, this memory is access via a device-dax
> - * character device.
> + * coherent and supports page pinning. This is for example used by DAX 
> devices
> + * that expose memory using a character device.
>   *
>   * MEMORY_DEVICE_PCI_P2PDMA:
>   * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
> @@ -60,7 +59,7 @@ enum memory_type {
> /* 0 is reserved to catch uninitialized type fields */
> MEMORY_DEVICE_PRIVATE = 1,
> MEMORY_DEVICE_FS_DAX,
> -   MEMORY_DEVICE_DEVDAX,
> +   MEMORY_DEVICE_GENERIC,
> MEMORY_DEVICE_PCI_P2PDMA,
>  };
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03e38b7a38f1..006dace60b1a 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -216,7 +216,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
> return ERR_PTR(-EINVAL);
>     }
> break;
> -   case MEMORY_DEVICE_DEVDAX:
> +   case MEMORY_DEVICE_GENERIC:
> need_devmap_managed = false;
> break;
> case MEMORY_DEVICE_PCI_P2PDMA:

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 03/11] mm/memory_hotplug: simplify checking if all pages are isolated in offline_pages()

2020-08-31 Thread Pankaj Gupta
> We make sure that we cannot have any memory holes right at the beginning
> of offline_pages(). We no longer need walk_system_ram_range() and can
> call test_pages_isolated() directly.
>
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 16 +---
>  1 file changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6856702af68d9..f64478349148d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1384,17 +1384,6 @@ offline_isolated_pages_cb(unsigned long start, 
> unsigned long nr_pages,
> return 0;
>  }
>
> -/*
> - * Check all pages in range, recorded as memory resource, are isolated.
> - */
> -static int
> -check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
> -   void *data)
> -{
> -   return test_pages_isolated(start_pfn, start_pfn + nr_pages,
> -  MEMORY_OFFLINE);
> -}
> -
>  static int __init cmdline_parse_movable_node(char *p)
>  {
> movable_node_enabled = true;
> @@ -1579,10 +1568,7 @@ int __ref offline_pages(unsigned long start_pfn, 
> unsigned long nr_pages)
> reason = "failure to dissolve huge pages";
> goto failed_removal_isolated;
> }
> -   /* check again */
> -   ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> -   NULL, check_pages_isolated_cb);
> -   } while (ret);
> +   } while (test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE));
>
> /* Ok, all of our target is isolated.
>We cannot do rollback at this point. */
> --
> 2.26.2

Acked-by: Pankaj Gupta 


Re: [PATCH v2 01/10] mm/memory_hotplug: inline __offline_pages() into offline_pages()

2020-08-31 Thread Pankaj Gupta
> There is only a single user, offline_pages(). Let's inline, to make
> it look more similar to online_pages().
>
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 16 +---
>  1 file changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7f62d69660e06..c781d386d87f9 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1473,11 +1473,10 @@ static int count_system_ram_pages_cb(unsigned long 
> start_pfn,
> return 0;
>  }
>
> -static int __ref __offline_pages(unsigned long start_pfn,
> - unsigned long end_pfn)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
> -   unsigned long pfn, nr_pages = 0;
> -   unsigned long offlined_pages = 0;
> +   const unsigned long end_pfn = start_pfn + nr_pages;
> +   unsigned long pfn, system_ram_pages = 0, offlined_pages = 0;
> int ret, node, nr_isolate_pageblock;
> unsigned long flags;
> struct zone *zone;
> @@ -1494,9 +1493,9 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
>  * memory holes PG_reserved, don't need pfn_valid() checks, and can
>  * avoid using walk_system_ram_range() later.
>  */
> -   walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
> +   walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages,
>   count_system_ram_pages_cb);
> -   if (nr_pages != end_pfn - start_pfn) {
> +   if (system_ram_pages != nr_pages) {
> ret = -EINVAL;
> reason = "memory holes";
> goto failed_removal;
> @@ -1631,11 +1630,6 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
> return ret;
>  }
>
> -int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> -{
> -   return __offline_pages(start_pfn, start_pfn + nr_pages);
> -}
> -
>  static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
>  {
> int ret = !is_memblock_offlined(mem);

Reviewed-by: Pankaj Gupta 


Re: [PATCH] mm/memory-failure: Fix return wrong value when isolate page fail

2020-08-31 Thread Pankaj Gupta
> When we fail to isolate the page, we should not return 0, because we do not
> set page HWPoison on any page.
>
> Signed-off-by: Muchun Song 
> ---
>  mm/memory-failure.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 696505f56910..4eb3c42ffe35 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1850,6 +1850,7 @@ static int __soft_offline_page(struct page *page)
> } else {
> pr_info("soft offline: %#lx: %s isolation failed: %d, page 
> count %d, type %lx (%pGp)\n",
> pfn, msg_page[huge], ret, page_count(page), 
> page->flags, &page->flags);
> +   ret = -EBUSY;
> }
> return ret;
>  }
> --

Acked-by: Pankaj Gupta 
> 2.11.0
>
>


Re: [PATCH v1 2/5] kernel/resource: merge_system_ram_resources() to merge resources after hotplug

2020-08-31 Thread Pankaj Gupta
> Some add_memory*() users add memory in small, contiguous memory blocks.
> Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
>
> This can quickly result in a lot of memory resources, whereby the actual
> resource boundaries are not of interest (e.g., it might be relevant for
> DIMMs, exposed via /proc/iomem to user space). We really want to merge
> added resources in this scenario where possible.
>
> Let's provide an interface to trigger merging of applicable child
> resources. It will be, for example, used by virtio-mem to trigger
> merging of system ram resources it added to its resource container, but
> also by XEN and Hyper-V to trigger merging of system ram resources in
> iomem_resource.
>
> Note: We really want to merge after the whole operation succeeded, not
> directly when adding a resource to the resource tree (it would break
> add_memory_resource() and require splitting resources again when the
> operation failed - e.g., due to -ENOMEM).
>
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Dan Williams 
> Cc: Jason Gunthorpe 
> Cc: Kees Cook 
> Cc: Ard Biesheuvel 
> Cc: Thomas Gleixner 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Wei Liu 
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: Roger Pau Monné 
> Cc: Julien Grall 
> Cc: Pankaj Gupta 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/ioport.h |  3 +++
>  kernel/resource.c  | 52 ++
>  2 files changed, 55 insertions(+)
>
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 52a91f5fa1a36..3bb0020cd6ddc 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -251,6 +251,9 @@ extern void __release_region(struct resource *, 
> resource_size_t,
>  extern void release_mem_region_adjustable(struct resource *, resource_size_t,
>   resource_size_t);
>  #endif
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +extern void merge_system_ram_resources(struct resource *res);
> +#endif
>
>  /* Wrappers for managed devices */
>  struct device;
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 1dcef5d53d76e..b4e0963edadd2 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1360,6 +1360,58 @@ void release_mem_region_adjustable(struct resource 
> *parent,
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static bool system_ram_resources_mergeable(struct resource *r1,
> +  struct resource *r2)
> +{
> +   return r1->flags == r2->flags && r1->end + 1 == r2->start &&
> +  r1->name == r2->name && r1->desc == r2->desc &&
> +  !r1->child && !r2->child;
> +}
> +
> +/*
> + * merge_system_ram_resources - try to merge contiguous system ram resources
> + * @parent: parent resource descriptor
> + *
> + * This interface is intended for memory hotplug, whereby lots of contiguous
> + * system ram resources are added (e.g., via add_memory*()) by a driver, and
> + * the actual resource boundaries are not of interest (e.g., it might be
> + * relevant for DIMMs). Only immediate child resources that are busy and
> + * don't have any children are considered. All applicable child resources
> + * must be immutable during the request.
> + *
> + * Note:
> + * - The caller has to make sure that no pointers to resources that might
> + *   get merged are held anymore. Callers should only trigger merging of 
> child
> + *   resources when they are the only one adding system ram resources to the
> + *   parent (besides during boot).
> + * - release_mem_region_adjustable() will split on demand on memory hotunplug
> + */
> +void merge_system_ram_resources(struct resource *parent)
> +{
> +   const unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> +   struct resource *cur, *next;
> +
> +   write_lock(&resource_lock);
> +
> +   cur = parent->child;
> +   while (cur && cur->sibling) {
> +   next = cur->sibling;
> +   if ((cur->flags & flags) == flags &&

Maybe this can be changed to:
!(cur->flags & ~flags)

> +   system_ram_resources_mergeable(cur, next)) {
> +   cur->end = next->end;
> +   cur->sibling = next->sibling;
> +   free_resource(next);
> +   next = cur->sibling;
> +   }
> +   cur = next;
> +   }
> +
> +   write_unlock(&resource_lock);
> +}
> +EXPORT_SYMBOL(merge_system_ram_resources);
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
>  /*
>   * Managed region resource
>   */
> --
> 2.26.2
>
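
To illustrate the inline "!(cur->flags & ~flags)" suggestion above: the two bit
tests are not interchangeable. A small standalone sketch (the flag values are
made up, not the kernel's IORESOURCE_* constants):

#include <stdio.h>

#define F_SYSRAM 0x1
#define F_BUSY   0x2
#define F_OTHER  0x4

int main(void)
{
        unsigned long mask = F_SYSRAM | F_BUSY;
        unsigned long cur  = F_SYSRAM | F_BUSY | F_OTHER;

        printf("%d %d\n", (cur & mask) == mask, !(cur & ~mask));  /* 1 0 */

        cur = F_SYSRAM;                     /* BUSY missing, nothing extra */
        printf("%d %d\n", (cur & mask) == mask, !(cur & ~mask));  /* 0 1 */
        return 0;
}

The patch's check requires both SYSTEM_RAM and BUSY to be set (extra bits are
tolerated), while the suggested form only forbids bits outside the mask, so the
two tests accept different resources.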


Re: [PATCH v3 6/7] libnvdimm: make sure EXPORT_SYMBOL_GPL(nvdimm_flush) close to its function

2020-08-20 Thread Pankaj Gupta
> Move EXPORT_SYMBOL_GPL(nvdimm_flush) close to nvdimm_flush(), currently
> it's near to generic_nvdimm_flush().
>
> Signed-off-by: Zhen Lei 
> ---
>  drivers/nvdimm/region_devs.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 49be115c9189eff..909bf03edb132cf 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1183,6 +1183,8 @@ int nvdimm_flush(struct nd_region *nd_region, struct 
> bio *bio)
>
> return rc;
>  }
> +EXPORT_SYMBOL_GPL(nvdimm_flush);
> +
>  /**
>   * nvdimm_flush - flush any posted write queues between the cpu and pmem 
> media
>   * @nd_region: blk or interleaved pmem region
> @@ -1214,7 +1216,6 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
>
> return 0;
>  }
> -EXPORT_SYMBOL_GPL(nvdimm_flush);
>
>  /**
>   * nvdimm_has_flush - determine write flushing requirements
> --

Reviewed-by: Pankaj Gupta 

> 1.8.3
>


Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-07-29 Thread Pankaj Gupta
ct kvm_vcpu *vcpu, bool 
> prefault, gfn_t gfn,
> if (!async)
> return false; /* *pfn has correct page already */
>
> -   if (!prefault && kvm_can_do_async_pf(vcpu)) {
> +   if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) {
> trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
> if (kvm_find_async_pf_gfn(vcpu, gfn)) {
> trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 88c593f83b28..c1f5094d6e53 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct 
> kvm_vcpu *vcpu)
> vcpu->arch.apf.gfns[i] = ~0;
>  }
>
> +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +   int i;
> +   for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> +   vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> +}
> +
>  static void kvm_on_user_return(struct user_return_notifier *urn)
>  {
> unsigned slot;
> @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
>
> kvm_async_pf_hash_reset(vcpu);
> +   kvm_error_gfn_hash_reset(vcpu);
> kvm_pmu_init(vcpu);
>
> vcpu->arch.pending_external_vector = -1;
> @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
> init_event)
>
> kvm_clear_async_pf_completion_queue(vcpu);
> kvm_async_pf_hash_reset(vcpu);
> +   kvm_error_gfn_hash_reset(vcpu);
> vcpu->arch.apf.halted = false;
>
> if (kvm_mpx_supported()) {
> @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned 
> long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>
> +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> +{
> +   BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> +
> +   return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> +}
> +
> +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +   u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +   /*
> +* Overwrite the previous gfn. This is just a hint to do
> +* sync page fault.
> +*/
> +   vcpu->arch.apf.error_gfns[key] = gfn;
> +}
> +
> +/* Returns true if gfn was found in hash table, false otherwise */
> +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +   u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +   if (vcpu->arch.apf.error_gfns[key] != gfn)
> +   return 0;
> +
> +   vcpu->arch.apf.error_gfns[key] = GFN_INVALID;
> +   return true;
> +}
> +
>  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf 
> *work)
>  {
> int r;
> @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, 
> struct kvm_async_pf *work)
>   work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
> return;
>
> -   kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +   r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +   if (r < 0)
> +   kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));
>  }
>
>  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu 
> *vcpu)
> return true;
>  }
>
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
> if (unlikely(!lapic_in_kernel(vcpu) ||
>  kvm_event_needs_reinjection(vcpu) ||
> @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  * If interrupts are off we cannot even use an artificial
>  * halt state.
>  */
> -   return kvm_arch_interrupt_allowed(vcpu);
> +   if (!kvm_arch_interrupt_allowed(vcpu))
> +   return false;
> +
> +   /* Found gfn in error gfn cache. Force sync fault */
> +   if (kvm_find_and_remove_error_gfn(vcpu, gfn))
> +   return false;
> +
> +   return true;
>  }
>
>  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 68e84cf42a3f..677bb8269cd3 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
>  typedef u64            gpa_t;
>  typedef u64            gfn_t;
>
>  #define GPA_INVALID    (~(gpa_t)0)
> +#define GFN_INVALID    (~(gfn_t)0)
>
>  typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
> --
> 2.25.4

This patch looks good to me.

Reviewed-by: Pankaj Gupta 

>


Re: [PATCH] dax: Fix wrong error-number passed into xas_set_err()

2020-07-29 Thread Pankaj Gupta
> The error-number passed into xas_set_err() should be negative. Otherwise,
> the xas_error() will return 0, and grab_mapping_entry() will return the
> found entry instead of a SIGBUS error when the entry is not a value.
> And then, the subsequent code path would be wrong.
>
> Signed-off-by: Hao Li 
> ---
>  fs/dax.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 11b16729b86f..acac675fe7a6 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -488,7 +488,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
> if (dax_is_conflict(entry))
> goto fallback;
> if (!xa_is_value(entry)) {
> -   xas_set_err(xas, EIO);
> +   xas_set_err(xas, -EIO);
> goto out_unlock;
> }
>
> --
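
A toy sketch of the negative-errno convention the fix relies on (this is not
the actual XArray implementation, just the general idea): an error stored as a
positive number is indistinguishable from "no error" to code that only treats
negative values as errors.

#include <stdio.h>

#define MY_EIO 5    /* local stand-in so <errno.h> is not needed */

static int saved_err;

static void set_err(int err)  { saved_err = err; }
static int  check_err(void)   { return saved_err < 0 ? saved_err : 0; }

int main(void)
{
        set_err(MY_EIO);                   /* what the buggy call did  */
        printf("%d\n", check_err());       /* 0 - the error is lost    */

        set_err(-MY_EIO);                  /* what the patch does      */
        printf("%d\n", check_err());       /* -5 - propagated as -EIO  */
        return 0;
}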

Reviewed-by: Pankaj Gupta 

> 2.28.0
>
>


Re: [PATCH v1 4/6] mm/page_isolation: cleanup set_migratetype_isolate()

2020-07-29 Thread Pankaj Gupta
> Let's clean it up a bit, simplifying error handling and getting rid of
> the label.
>
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Michael S. Tsirkin 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/page_isolation.c | 18 +++---
>  1 file changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 02a01bff6b219..5f869bef23fa4 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -17,12 +17,9 @@
>
>  static int set_migratetype_isolate(struct page *page, int migratetype, int 
> isol_flags)
>  {
> -   struct page *unmovable = NULL;
> -   struct zone *zone;
> +   struct zone *zone = page_zone(page);
> +   struct page *unmovable;
> unsigned long flags;
> -   int ret = -EBUSY;
> -
> -   zone = page_zone(page);
>
> spin_lock_irqsave(&zone->lock, flags);
>
> @@ -51,21 +48,20 @@ static int set_migratetype_isolate(struct page *page, int 
> migratetype, int isol_
> NULL);
>
> __mod_zone_freepage_state(zone, -nr_pages, mt);
> -   ret = 0;
> +   spin_unlock_irqrestore(&zone->lock, flags);
> +   drain_all_pages(zone);
> +   return 0;
> }
>
> -out:
> spin_unlock_irqrestore(&zone->lock, flags);
> -   if (!ret) {
> -   drain_all_pages(zone);
> -   } else if ((isol_flags & REPORT_FAILURE) && unmovable)
> +   if (isol_flags & REPORT_FAILURE)
> /*
>  * printk() with zone->lock held will likely trigger a
>  * lockdep splat, so defer it here.
>  */
> dump_page(unmovable, "unmovable page");
>
> -   return ret;
> +   return -EBUSY;
>  }
>
>  static void unset_migratetype_isolate(struct page *page, unsigned 
> migratetype)
> --

This cleanup looks good to me.

Reviewed-by: Pankaj Gupta 

> 2.26.2
>
>


Re: [PATCH v1 2/6] mm/page_isolation: don't dump_page(NULL) in set_migratetype_isolate()

2020-07-29 Thread Pankaj Gupta
> Right now, if we have two isolations racing, we might trigger the
> WARN_ON_ONCE() and to dump_page(NULL), dereferencing NULL. Let's just
> return directly.
>
> In the future, we might want to report -EAGAIN to the caller instead, as
> this could indicate a temporary isolation failure only.
>
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Michael S. Tsirkin 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/page_isolation.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index f6d07c5f0d34d..553b49a34cf71 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -29,10 +29,12 @@ static int set_migratetype_isolate(struct page *page, int 
> migratetype, int isol_
> /*
>  * We assume the caller intended to SET migrate type to isolate.
>  * If it is already set, then someone else must have raced and
> -* set it before us.  Return -EBUSY
> +* set it before us.
>  */
> -   if (is_migrate_isolate_page(page))
> -   goto out;
> +   if (is_migrate_isolate_page(page)) {
> +   spin_unlock_irqrestore(&zone->lock, flags);
> +   return -EBUSY;
> +   }
>
>     /*
>  * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
> --

Reviewed-by: Pankaj Gupta 

> 2.26.2
>
>


Re: [PATCH v3 3/6] sh/mm: use default dummy memory_add_physaddr_to_nid()

2020-07-20 Thread Pankaj Gupta
> After making default memory_add_physaddr_to_nid in mm/memory_hotplug,
> there is no use to define a similar one in arch specific directory.
>
> Signed-off-by: Jia He 
> ---
>  arch/sh/mm/init.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index a70ba0fdd0b3..f75932ba87a6 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -430,15 +430,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
> return ret;
>  }
>
> -#ifdef CONFIG_NUMA
> -int memory_add_physaddr_to_nid(u64 addr)
> -{
> -   /* Node 0 for now.. */
> -   return 0;
> -}
> -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> -#endif
> -
>  void arch_remove_memory(int nid, u64 start, u64 size,
>     struct vmem_altmap *altmap)
>  {

Reviewed-by: Pankaj Gupta 


Re: [PATCH] mm/hugetlb: hide nr_nodes in the internal of for_each_node_mask_to_[alloc|free]

2020-07-14 Thread Pankaj Gupta
1 && delta != 1);
>
> if (delta < 0) {
> -   for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) 
> {
> +   for_each_node_mask_to_alloc(h, node, nodes_allowed) {
> if (h->surplus_huge_pages_node[node])
> goto found;
> }
>     } else {
> -   for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
> +   for_each_node_mask_to_free(h, node, nodes_allowed) {
> if (h->surplus_huge_pages_node[node] <
> h->nr_huge_pages_node[node])
> goto found;
> --
> 2.20.1 (Apple Git-117)

Acked-by: Pankaj Gupta 

>
>


Re: [PATCH v1 2/2] mm/page_alloc: drop nr_free_pagecache_pages()

2020-06-21 Thread Pankaj Gupta
> nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and
> the name does not really help to understand what's going on. Let's inline
> it instead and add a comment.
>
> Cc: Andrew Morton 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Minchan Kim 
> Cc: Huang Ying 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/swap.h |  1 -
>  mm/page_alloc.c  | 16 ++--
>  2 files changed, 2 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 124261acd5d0a..9bde6c6b2c045 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -327,7 +327,6 @@ void workingset_update_node(struct xa_node *node);
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalreserve_pages;
>  extern unsigned long nr_free_buffer_pages(void);
> -extern unsigned long nr_free_pagecache_pages(void);
>
>  /* Definition of global_zone_page_state not available yet */
>  #define nr_free_pages() global_zone_page_state(NR_FREE_PAGES)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7b0dde69748c1..c38903d1b3b4d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5177,19 +5177,6 @@ unsigned long nr_free_buffer_pages(void)
>  }
>  EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
>
> -/**
> - * nr_free_pagecache_pages - count number of pages beyond high watermark
> - *
> - * nr_free_pagecache_pages() counts the number of pages which are beyond the
> - * high watermark within all zones.
> - *
> - * Return: number of pages beyond high watermark within all zones.
> - */
> -unsigned long nr_free_pagecache_pages(void)
> -{
> -   return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
> -}
> -
>  static inline void show_node(struct zone *zone)
>  {
> if (IS_ENABLED(CONFIG_NUMA))
> @@ -5911,7 +5898,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> __build_all_zonelists(pgdat);
> /* cpuset refresh routine should be here */
> }
> -   vm_total_pages = nr_free_pagecache_pages();
> +   /* Get the number of free pages beyond high watermark in all zones. */
> +   vm_total_pages = nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
> /*
>  * Disable grouping by mobility if the number of pages in the
>  * system is too low to allow the mechanism to work. It would be

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1 1/2] mm: drop vm_total_pages

2020-06-21 Thread Pankaj Gupta
> The global variable "vm_total_pages" is a relict from older days. There
> is only a single user that reads the variable - build_all_zonelists() -
> and the first thing it does is updating it. Use a local variable in
> build_all_zonelists() instead and drop the local variable.
>
> Cc: Andrew Morton 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Minchan Kim 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/swap.h | 1 -
>  mm/memory_hotplug.c  | 3 ---
>  mm/page-writeback.c  | 6 ++
>  mm/page_alloc.c  | 2 ++
>  mm/vmscan.c  | 5 -
>  5 files changed, 4 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4c5974bb9ba94..124261acd5d0a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -371,7 +371,6 @@ extern unsigned long mem_cgroup_shrink_node(struct 
> mem_cgroup *mem,
>  extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
> -extern unsigned long vm_total_pages;
>
>  extern unsigned long reclaim_pages(struct list_head *page_list);
>  #ifdef CONFIG_NUMA
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9b34e03e730a4..d682781cce48d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -835,8 +835,6 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages,
> kswapd_run(nid);
> kcompactd_run(nid);
>
> -   vm_total_pages = nr_free_pagecache_pages();
> -
> writeback_set_ratelimit();
>
> memory_notify(MEM_ONLINE, &arg);
> @@ -1586,7 +1584,6 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
> kcompactd_stop(node);
> }
>
> -   vm_total_pages = nr_free_pagecache_pages();
> writeback_set_ratelimit();
>
> memory_notify(MEM_OFFLINE, &arg);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 28b3e7a675657..4e4ddd67b71e5 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2076,13 +2076,11 @@ static int page_writeback_cpu_online(unsigned int cpu)
>   * Called early on to tune the page writeback dirty limits.
>   *
>   * We used to scale dirty pages according to how total memory
> - * related to pages that could be allocated for buffers (by
> - * comparing nr_free_buffer_pages() to vm_total_pages.
> + * related to pages that could be allocated for buffers.
>   *
>   * However, that was when we used "dirty_ratio" to scale with
>   * all memory, and we don't do that any more. "dirty_ratio"
> - * is now applied to total non-HIGHPAGE memory (by subtracting
> - * totalhigh_pages from vm_total_pages), and as such we can't
> + * is now applied to total non-HIGHPAGE memory, and as such we can't
>   * get into the old insane situation any more where we had
>   * large amounts of dirty pages compared to a small amount of
>   * non-HIGHMEM memory.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0c435b2ed665c..7b0dde69748c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5903,6 +5903,8 @@ build_all_zonelists_init(void)
>   */
>  void __ref build_all_zonelists(pg_data_t *pgdat)
>  {
> +   unsigned long vm_total_pages;
> +
> if (system_state == SYSTEM_BOOTING) {
> build_all_zonelists_init();
> } else {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6d84326bdf2d..0010859747df2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -170,11 +170,6 @@ struct scan_control {
>   * From 0 .. 200.  Higher means more swappy.
>   */
>  int vm_swappiness = 60;
> -/*
> - * The total number of pages which are beyond the high watermark within all
> - * zones.
> - */
> -unsigned long vm_total_pages;
>
>  static void set_task_reclaim_state(struct task_struct *task,
>struct reclaim_state *rs)

Reviewed-by: Pankaj Gupta 


Re: [PATCH v1] MAINTAINERS: add URL for virtio-mem

2020-06-17 Thread Pankaj Gupta
> Let's add the status/info page, which is still under construction, however,
> already contains valuable documentation/information.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 301330e02bca7..3470c5bfbdc7b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18234,6 +18234,7 @@ VIRTIO MEM DRIVER
>  M: David Hildenbrand 
>  L: virtualizat...@lists.linux-foundation.org
>  S: Maintained
> +W: https://virtio-mem.gitlab.io/
>  F: drivers/virtio/virtio_mem.c
>  F: include/uapi/linux/virtio_mem.h

Acked-by: Pankaj Gupta 


Re: [PATCH v1] virtio-mem: add memory via add_memory_driver_managed()

2020-06-11 Thread Pankaj Gupta
> Virtio-mem managed memory is always detected and added by the virtio-mem
> driver, never using something like the firmware-provided memory map.
> This is the case after an ordinary system reboot, and has to be guaranteed
> after kexec. Especially, virtio-mem added memory resources can contain
> inaccessible parts ("unblocked memory blocks"), blindly forwarding them
> to a kexec kernel is dangerous, as unplugged memory will get accessed
> (esp. written).
>
> Let's use the new way of adding special driver-managed memory introduced
> in commit 75ac4c58bc0d ("mm/memory_hotplug: introduce
> add_memory_driver_managed()").

Is this commit id correct?
>
> This will result in no entries in /sys/firmware/memmap ("raw firmware-
> provided memory map"), the memory resource will be flagged
> IORESOURCE_MEM_DRIVER_MANAGED (esp., kexec_file_load() will not place
> kexec images on this memory), and it is exposed as "System RAM
> (virtio_mem)" in /proc/iomem, so esp. kexec-tools can properly handle it.
>
> Example /proc/iomem before this change:
>   [...]
>   140000000-333ffffff : virtio0
>     140000000-147ffffff : System RAM
>   334000000-533ffffff : virtio1
>     338000000-33fffffff : System RAM
>     340000000-347ffffff : System RAM
>     348000000-34fffffff : System RAM
>   [...]
>
> Example /proc/iomem after this change:
>   [...]
>   140000000-333ffffff : virtio0
>     140000000-147ffffff : System RAM (virtio_mem)
>   334000000-533ffffff : virtio1
>     338000000-33fffffff : System RAM (virtio_mem)
>     340000000-347ffffff : System RAM (virtio_mem)
>     348000000-34fffffff : System RAM (virtio_mem)
>   [...]
>
> Cc: "Michael S. Tsirkin" 
> Cc: Pankaj Gupta 
> Cc: teawater 
> Signed-off-by: David Hildenbrand 
> ---
>
> Based on latest Linus' tree (and not a tag) because
> - virtio-mem has just been merged via the vhost tree
> - add_memory_driver_managed() has been merged a week ago via the -mm tree
>
> I'd like to have this patch in 5.8, with the initial merge of virtio-mem
> if possible (so the user space representation of virtio-mem added memory
> resources won't change anymore).
>
> ---
>  drivers/virtio/virtio_mem.c | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 50c689f250450..d2eab3558a9e1 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -101,6 +101,11 @@ struct virtio_mem {
>
> /* The parent resource for all memory added via this device. */
> struct resource *parent_resource;
> +   /*
> +* Copy of "System RAM (virtio_mem)" to be used for
> +* add_memory_driver_managed().
> +*/
> +   const char *resource_name;
>
> /* Summary of all memory block states. */
> unsigned long nb_mb_state[VIRTIO_MEM_MB_STATE_COUNT];
> @@ -414,8 +419,20 @@ static int virtio_mem_mb_add(struct virtio_mem *vm, 
> unsigned long mb_id)
> if (nid == NUMA_NO_NODE)
> nid = memory_add_physaddr_to_nid(addr);
>
> +   /*
> +* When force-unloading the driver and we still have memory added to
> +* Linux, the resource name has to stay.
> +*/
> +   if (!vm->resource_name) {
> +   vm->resource_name = kstrdup_const("System RAM (virtio_mem)",
> + GFP_KERNEL);
> +   if (!vm->resource_name)
> +   return -ENOMEM;
> +   }
> +
> dev_dbg(&vm->vdev->dev, "adding memory block: %lu\n", mb_id);
> -   return add_memory(nid, addr, memory_block_size_bytes());
> +   return add_memory_driver_managed(nid, addr, memory_block_size_bytes(),
> +vm->resource_name);
>  }
>
>  /*
> @@ -1890,10 +1907,12 @@ static void virtio_mem_remove(struct virtio_device 
> *vdev)
> vm->nb_mb_state[VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL] ||
> vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE] ||
> vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL] ||
> -   vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE])
> +       vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE]) {
> dev_warn(&vdev->dev, "device still has system memory 
> added\n");
> -   else
> +   } else {
> virtio_mem_delete_resource(vm);
> +   kfree_const(vm->resource_name);
> +   }
>
> /* remove all tracking data - no locking needed */
> vfree(vm->mb_state);

Looks good to me.
Reviewed-by: Pankaj Gupta 


Re: [PATCH] virtio-mem: drop unnecessary initialization

2020-06-08 Thread Pankaj Gupta
> rc is initialized to -EINVAL but that's never used. Drop it.
>
> Fixes: 5f1f79bbc9e2 ("virtio-mem: Paravirtualized memory hotplug")
> Reported-by: kernel test robot 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/virtio/virtio_mem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index f658fe9149be..2f357142ea5e 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1768,7 +1768,7 @@ static void virtio_mem_delete_resource(struct 
> virtio_mem *vm)
>  static int virtio_mem_probe(struct virtio_device *vdev)
>  {
> struct virtio_mem *vm;
> -   int rc = -EINVAL;
> +   int rc;
>
> BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24);
>     BUILD_BUG_ON(sizeof(struct virtio_mem_resp) != 10);

Reviewed-by: Pankaj Gupta 


Re: [PATCH 1/2] mm/page_idle.c: Skip offline pages

2020-06-05 Thread Pankaj Gupta
> From: SeongJae Park 
>
> 'Idle page tracking' users can pass random pfn that might be mapped to
> an offline page.  To avoid accessing such pages, this commit modifies
> the 'page_idle_get_page()' to use 'pfn_to_online_page()' instead of
> 'pfn_valid()' and 'pfn_to_page()' combination, so that the pfn mapped to
> an offline page can be skipped.
>
> Signed-off-by: SeongJae Park 
> Reported-by: David Hildenbrand 
> ---
>  mm/page_idle.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 295512465065..057c61df12db 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -30,13 +31,9 @@
>   */
>  static struct page *page_idle_get_page(unsigned long pfn)
>  {
> -   struct page *page;
> +   struct page *page = pfn_to_online_page(pfn);
> pg_data_t *pgdat;
>
> -   if (!pfn_valid(pfn))
> -   return NULL;
> -
> -   page = pfn_to_page(pfn);
> if (!page || !PageLRU(page) ||
> !get_page_unless_zero(page))
> return NULL;

Reviewed-by: Pankaj Gupta 
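
For context, pfn_to_online_page() folds the pfn_valid() check and the
section-online check into one helper; a simplified conceptual sketch
(the real helper is a macro in include/linux/memory_hotplug.h, and the
name used here is made up):

static inline struct page *pfn_to_online_page_sketch(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	/* Only hand back a struct page when the pfn is valid and the
	 * memory section containing it is online. */
	if (nr < NR_MEM_SECTIONS && online_section_nr(nr) && pfn_valid(pfn))
		return pfn_to_page(pfn);
	return NULL;
}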


Re: [PATCH] drivers/dax/bus: Use kobj_to_dev() API

2020-06-02 Thread Pankaj Gupta
> Use kobj_to_dev() API instead of container_of().
>
> Signed-off-by: Shuai He 
> ---
>  drivers/dax/bus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index df238c8..24625d2 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -331,7 +331,7 @@ static DEVICE_ATTR_RO(numa_node);
>
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
> -   struct device *dev = container_of(kobj, struct device, kobj);
> +   struct device *dev = kobj_to_dev(kobj);
> struct dev_dax *dev_dax = to_dev_dax(dev);
>
> if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 
> 0)

Reviewed-by: Pankaj Gupta 
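
For reference, kobj_to_dev() is just a type-safe wrapper around the same
container_of() call, roughly:

static inline struct device *kobj_to_dev(struct kobject *kobj)
{
	return container_of(kobj, struct device, kobj);
}

So the change is purely cosmetic, but it reads better and is harder to get
wrong.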


Re: [PATCH v2 2/2] vhost: convert get_user_pages() --> pin_user_pages()

2020-06-02 Thread Pankaj Gupta
> This code was using get_user_pages*(), in approximately a "Case 5"
> scenario (accessing the data within a page), using the categorization
> from [1]. That means that it's time to convert the get_user_pages*() +
> put_page() calls to pin_user_pages*() + unpin_user_pages() calls.
>
> There is some helpful background in [2]: basically, this is a small
> part of fixing a long-standing disconnect between pinning pages, and
> file systems' use of those pages.
>
> [1] Documentation/core-api/pin_user_pages.rst
>
> [2] "Explicit pinning of user-space pages":
> https://lwn.net/Articles/807108/
>
> Cc: Michael S. Tsirkin 
> Cc: Jason Wang 
> Cc: k...@vger.kernel.org
> Cc: virtualizat...@lists.linux-foundation.org
> Cc: net...@vger.kernel.org
> Signed-off-by: John Hubbard 
> ---
>  drivers/vhost/vhost.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 21a59b598ed8..596132a96cd5 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1762,15 +1762,14 @@ static int set_bit_to_user(int nr, void __user *addr)
> int bit = nr + (log % PAGE_SIZE) * 8;
> int r;
>
> -   r = get_user_pages_fast(log, 1, FOLL_WRITE, &page);
> +   r = pin_user_pages_fast(log, 1, FOLL_WRITE, &page);
> if (r < 0)
> return r;
> BUG_ON(r != 1);
> base = kmap_atomic(page);
> set_bit(bit, base);
> kunmap_atomic(base);
> -   set_page_dirty_lock(page);
> -   put_page(page);
> +   unpin_user_pages_dirty_lock(&page, 1, true);
> return 0;
>  }

Acked-by: Pankaj Gupta 
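
The same before/after shape applies beyond vhost; a minimal sketch of the
conversion pattern (illustrative only, the helper name is made up):

/* Pin one user page, modify it, then unpin it and mark it dirty in a single
 * call; this replaces the old get_user_pages_fast() + set_page_dirty_lock()
 * + put_page() sequence. */
static int touch_one_user_page(unsigned long uaddr)
{
	struct page *page;
	void *kaddr;
	int r;

	r = pin_user_pages_fast(uaddr, 1, FOLL_WRITE, &page);
	if (r < 0)
		return r;

	kaddr = kmap_atomic(page);
	/* ... write to the page contents here ... */
	kunmap_atomic(kaddr);

	unpin_user_pages_dirty_lock(&page, 1, true);
	return 0;
}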


Re: [PATCH 1/2] mm/gup: introduce pin_user_pages_locked()

2020-05-31 Thread Pankaj Gupta
Acked-by: Pankaj Gupta 

On Thu, 28 May 2020 at 00:32, John Hubbard  wrote:
>
> Introduce pin_user_pages_locked(), which is nearly identical to
> get_user_pages_locked() except that it sets FOLL_PIN and rejects
> FOLL_GET.
>
> Signed-off-by: John Hubbard 
> ---
>  include/linux/mm.h |  2 ++
>  mm/gup.c   | 30 ++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 98be7289d7e9..d16951087c93 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1700,6 +1700,8 @@ long pin_user_pages(unsigned long start, unsigned long 
> nr_pages,
> struct vm_area_struct **vmas);
>  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> unsigned int gup_flags, struct page **pages, int *locked);
> +long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
> +   unsigned int gup_flags, struct page **pages, int *locked);
>  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> struct page **pages, unsigned int gup_flags);
>  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> diff --git a/mm/gup.c b/mm/gup.c
> index 2f0a0b497c23..17418a949067 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2992,3 +2992,33 @@ long pin_user_pages_unlocked(unsigned long start, 
> unsigned long nr_pages,
> return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
>  }
>  EXPORT_SYMBOL(pin_user_pages_unlocked);
> +
> +/*
> + * pin_user_pages_locked() is the FOLL_PIN variant of 
> get_user_pages_locked().
> + * Behavior is the same, except that this one sets FOLL_PIN and rejects
> + * FOLL_GET.
> + */
> +long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
> +  unsigned int gup_flags, struct page **pages,
> +  int *locked)
> +{
> +   /*
> +* FIXME: Current FOLL_LONGTERM behavior is incompatible with
> +* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> +* vmas.  As there are no users of this flag in this call we simply
> +* disallow this option for now.
> +*/
> +   if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> +   return -EINVAL;
> +
> +   /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +   if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +   return -EINVAL;
> +
> +   gup_flags |= FOLL_PIN;
> +   return __get_user_pages_locked(current, current->mm, start, nr_pages,
> +  pages, NULL, locked,
> +  gup_flags | FOLL_TOUCH);
> +}
> +EXPORT_SYMBOL(pin_user_pages_locked);
> +
> --
> 2.26.2
>
>


Re: [PATCH] mm/gup: documentation fix for pin_user_pages*() APIs

2020-05-31 Thread Pankaj Gupta
Acked-by: Pankaj Gupta 


Re: [PATCH V2] mm, memory_failure: don't send BUS_MCEERR_AO for action required error

2020-05-30 Thread Pankaj Gupta
> Some processes don't want to be killed early, but in the "Action Required"
> case, those may also be killed by BUS_MCEERR_AO when sharing memory
> with others that are accessing the failed memory.
> And sending SIGBUS with BUS_MCEERR_AO for an action required error is
> strange, so ignore the non-current processes here.
>
> Suggested-by: Naoya Horiguchi 
> Signed-off-by: Wetp Zhang 
> ---
>  mm/memory-failure.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a96364be8ab4..dd3862fcf2e9 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -210,14 +210,17 @@ static int kill_proc(struct to_kill *tk, unsigned long 
> pfn, int flags)
>  {
> struct task_struct *t = tk->tsk;
> short addr_lsb = tk->size_shift;
> -   int ret;
> +   int ret = 0;
>
> -   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware 
> memory corruption\n",
> -   pfn, t->comm, t->pid);
> +   if ((t->mm == current->mm) || !(flags & MF_ACTION_REQUIRED))
> +   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
> hardware memory corruption\n",
> +   pfn, t->comm, t->pid);

Maybe we can generalize the message condition for better readability.
I thought about it a bit but did not come up with a better alternative.
>
> -   if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> -   ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
> -  addr_lsb);
> +   if (flags & MF_ACTION_REQUIRED) {
> +   if (t->mm == current->mm)
> +   ret = force_sig_mceerr(BUS_MCEERR_AR,
> +(void __user *)tk->addr, addr_lsb);
> +       /* send no signal to non-current processes */
> } else {
> /*
>  * Don't use force here, it's convenient if the signal
> --

Looks good to me.
Acked-by: Pankaj Gupta 


Re: [PATCH v4 11/15] virtio-mem: Add parent resource for all added "System RAM"

2020-05-07 Thread Pankaj Gupta
> Let's add a parent resource, named after the virtio device (inspired by
> drivers/dax/kmem.c). This allows user space to identify which memory
> belongs to which virtio-mem device.
>
> With this change and two virtio-mem devices:
> :/# cat /proc/iomem
> -0fff : Reserved
> 1000-0009fbff : System RAM
> [...]
> 14000-333ff : virtio0
>   14000-147ff : System RAM
>   14800-14fff : System RAM
>   15000-157ff : System RAM
> [...]
> 33400-3033ff : virtio1
>   33800-33fff : System RAM
>   34000-347ff : System RAM
>   34800-34fff : System RAM
>     [...]
>
> Cc: "Michael S. Tsirkin" 
> Cc: Pankaj Gupta 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/virtio/virtio_mem.c | 52 -
>  1 file changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index eb4c16d634e0..80cdb9e6b3c4 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -99,6 +99,9 @@ struct virtio_mem {
> /* Id of the next memory bock to prepare when needed. */
> unsigned long next_mb_id;
>
> +   /* The parent resource for all memory added via this device. */
> +   struct resource *parent_resource;
> +
> /* Summary of all memory block states. */
> unsigned long nb_mb_state[VIRTIO_MEM_MB_STATE_COUNT];
>  #define VIRTIO_MEM_NB_OFFLINE_THRESHOLD10
> @@ -1741,6 +1744,44 @@ static int virtio_mem_init(struct virtio_mem *vm)
> return 0;
>  }
>
> +static int virtio_mem_create_resource(struct virtio_mem *vm)
> +{
> +   /*
> +* When force-unloading the driver and removing the device, we
> +* could have a garbage pointer. Duplicate the string.
> +*/
> +   const char *name = kstrdup(dev_name(&vm->vdev->dev), GFP_KERNEL);
> +
> +   if (!name)
> +   return -ENOMEM;
> +
> +   vm->parent_resource = __request_mem_region(vm->addr, vm->region_size,
> +  name, 
> IORESOURCE_SYSTEM_RAM);
> +   if (!vm->parent_resource) {
> +   kfree(name);
> +   dev_warn(&vm->vdev->dev, "could not reserve device region\n");
> +   return -EBUSY;
> +   }
> +
> +   /* The memory is not actually busy - make add_memory() work. */
> +   vm->parent_resource->flags &= ~IORESOURCE_BUSY;
> +   return 0;
> +}
> +
> +static void virtio_mem_delete_resource(struct virtio_mem *vm)
> +{
> +   const char *name;
> +
> +   if (!vm->parent_resource)
> +   return;
> +
> +   name = vm->parent_resource->name;
> +   release_resource(vm->parent_resource);
> +   kfree(vm->parent_resource);
> +   kfree(name);
> +   vm->parent_resource = NULL;
> +}
> +
>  static int virtio_mem_probe(struct virtio_device *vdev)
>  {
> struct virtio_mem *vm;
> @@ -1770,11 +1811,16 @@ static int virtio_mem_probe(struct virtio_device 
> *vdev)
> if (rc)
> goto out_del_vq;
>
> +   /* create the parent resource for all memory */
> +   rc = virtio_mem_create_resource(vm);
> +   if (rc)
> +   goto out_del_vq;
> +
> /* register callbacks */
> vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
> rc = register_memory_notifier(&vm->memory_notifier);
> if (rc)
> -   goto out_del_vq;
> +   goto out_del_resource;
> rc = register_virtio_mem_device(vm);
> if (rc)
> goto out_unreg_mem;
> @@ -1788,6 +1834,8 @@ static int virtio_mem_probe(struct virtio_device *vdev)
> return 0;
>  out_unreg_mem:
> unregister_memory_notifier(&vm->memory_notifier);
> +out_del_resource:
> +   virtio_mem_delete_resource(vm);
>  out_del_vq:
> vdev->config->del_vqs(vdev);
>  out_free_vm:
> @@ -1848,6 +1896,8 @@ static void virtio_mem_remove(struct virtio_device 
> *vdev)
> vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL] ||
> vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE])
> dev_warn(&vdev->dev, "device still has system memory 
> added\n");
> +   else
> +   virtio_mem_delete_resource(vm);
>
> /* remove all tracking data - no locking needed */
> vfree(vm->mb_state);
> --

Reviewed-by: Pankaj Gupta 


Re: [PATCH v3 1/3] mm/memory_hotplug: Introduce add_memory_device_managed()

2020-05-06 Thread Pankaj Gupta
Looks good to me.

Acked-by: Pankaj Gupta 


Re: [PATCH v3 3/3] device-dax: Add memory via add_memory_driver_managed()

2020-05-06 Thread Pankaj Gupta
> Currently, when adding memory, we create entries in /sys/firmware/memmap/
> as "System RAM". This will lead to kexec-tools to add that memory to the
> fixed-up initial memmap for a kexec kernel (loaded via kexec_load()). The
> memory will be considered initial System RAM by the kexec'd kernel and
> can no longer be reconfigured. This is not what happens during a real
> reboot.
>
> Let's add our memory via add_memory_driver_managed() now, so we won't
> create entries in /sys/firmware/memmap/ and indicate the memory as
> "System RAM (kmem)" in /proc/iomem. This allows everybody (especially
> kexec-tools) to identify that this memory is special and has to be treated
> differently than ordinary (hotplugged) System RAM.
>
> Before configuring the namespace:
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-33fff : namespace0.0
> 328000-32 : PCI Bus :00
>
> After configuring the namespace:
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   14820-33fff : dax0.0
> 328000-32 : PCI Bus :00
>
> After loading kmem before this change:
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   15000-33fff : dax0.0
> 15000-33fff : System RAM
> 328000-32 : PCI Bus :00
>
> After loading kmem after this change:
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   15000-33fff : dax0.0
> 15000-33fff : System RAM (kmem)
> 328000-32 : PCI Bus :00
>
> After a proper reboot:
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   14820-33fff : dax0.0
> 328000-32 : PCI Bus :00
>
> Within the kexec kernel before this change:
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   15000-33fff : System RAM
> 328000-32 : PCI Bus :00
>
> Within the kexec kernel after this change:
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   14820-33fff : dax0.0
> 328000-32 : PCI Bus :00
>
> /sys/firmware/memmap/ before this change:
> -0009fc00 (System RAM)
> 0009fc00-000a (Reserved)
> 000f-0010 (Reserved)
> 0010-bffdf000 (System RAM)
> bffdf000-c000 (Reserved)
> feffc000-ff00 (Reserved)
> fffc-0001 (Reserved)
> 0001-00014000 (System RAM)
> 00015000-00034000 (System RAM)
>
> /sys/firmware/memmap/ after a proper reboot:
> -0009fc00 (System RAM)
> 0009fc00-000a (Reserved)
> 000f-0010 (Reserved)
> 0010-bffdf000 (System RAM)
> bffdf000-c000 (Reserved)
> feffc000-ff00 (Reserved)
> fffc-0001 (Reserved)
> 0001-00014000 (System RAM)
>
> /sys/firmware/memmap/ after this change:
> -0009fc00 (System RAM)
> 0009fc00-000a (Reserved)
> 000f-0010 (Reserved)
> 0010-bffdf000 (System RAM)
> bffdf000-c000 (Reserved)
> feffc000-ff00 (Reserved)
> fffc-0001 (Reserved)
> 0001-00014000 (System RAM)
>
> kexec-tools already seem to basically ignore any System RAM that's not
> on top level when searching for areas to place kexec images - but also
> for determining crash areas to dump via kdump. Changing the resource name
> won't have an impact.
>
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Pankaj Gu

Re: [PATCH v7 3/6] mm: Use zone and order instead of free area in free_list manipulators

2019-09-06 Thread Pankaj Gupta
>   unsigned int max_order;
>   struct page *buddy;
>  
> @@ -958,7 +957,7 @@ static inline void __free_one_page(struct page *page,
>   if (page_is_guard(buddy))
>   clear_page_guard(zone, buddy, order, migratetype);
>   else
> - del_page_from_free_area(buddy, &zone->free_area[order]);
> + del_page_from_free_list(buddy, zone, order);
>   combined_pfn = buddy_pfn & pfn;
>   page = page + (combined_pfn - pfn);
>   pfn = combined_pfn;
> @@ -992,12 +991,11 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>   set_page_order(page, order);
>  
> - area = &zone->free_area[order];
>   if (is_shuffle_order(order) ? shuffle_pick_tail() :
>   buddy_merge_likely(pfn, buddy_pfn, page, order))
> - add_to_free_area_tail(page, area, migratetype);
> + add_to_free_list_tail(page, zone, order, migratetype);
>   else
> - add_to_free_area(page, area, migratetype);
> + add_to_free_list(page, zone, order, migratetype);
>  }
>  
>  /*
> @@ -2001,13 +1999,11 @@ void __init init_cma_reserved_pageblock(struct page
> *page)
>   * -- nyc
>   */
>  static inline void expand(struct zone *zone, struct page *page,
> - int low, int high, struct free_area *area,
> - int migratetype)
> + int low, int high, int migratetype)
>  {
>   unsigned long size = 1 << high;
>  
>   while (high > low) {
> - area--;
>   high--;
>   size >>= 1;
>   VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
> @@ -2021,7 +2017,7 @@ static inline void expand(struct zone *zone, struct
> page *page,
>   if (set_page_guard(zone, &page[size], high, migratetype))
>   continue;
>  
> - add_to_free_area(&page[size], area, migratetype);
> + add_to_free_list(&page[size], zone, high, migratetype);
>   set_page_order(&page[size], high);
>   }
>  }
> @@ -2179,8 +2175,8 @@ struct page *__rmqueue_smallest(struct zone *zone,
> unsigned int order,
>   page = get_page_from_free_area(area, migratetype);
>   if (!page)
>   continue;
> - del_page_from_free_area(page, area);
> - expand(zone, page, order, current_order, area, migratetype);
> + del_page_from_free_list(page, zone, current_order);
> + expand(zone, page, order, current_order, migratetype);
>   set_pcppage_migratetype(page, migratetype);
>   return page;
>   }
> @@ -2188,7 +2184,6 @@ struct page *__rmqueue_smallest(struct zone *zone,
> unsigned int order,
>   return NULL;
>  }
>  
> -
>  /*
>   * This array describes the order lists are fallen back to when
>   * the free lists for the desirable migrate type are depleted
> @@ -2254,7 +2249,7 @@ static int move_freepages(struct zone *zone,
>   VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>  
>   order = page_order(page);
> - move_to_free_area(page, &zone->free_area[order], migratetype);
> + move_to_free_list(page, zone, order, migratetype);
>   page += 1 << order;
>   pages_moved += 1 << order;
>   }
> @@ -2370,7 +2365,6 @@ static void steal_suitable_fallback(struct zone *zone,
> struct page *page,
>   unsigned int alloc_flags, int start_type, bool whole_block)
>  {
>   unsigned int current_order = page_order(page);
> - struct free_area *area;
>   int free_pages, movable_pages, alike_pages;
>   int old_block_type;
>  
> @@ -2441,8 +2435,7 @@ static void steal_suitable_fallback(struct zone *zone,
> struct page *page,
>   return;
>  
>  single_page:
> - area = >free_area[current_order];
> - move_to_free_area(page, area, start_type);
> + move_to_free_list(page, zone, current_order, start_type);
>  }
>  
>  /*
> @@ -3113,7 +3106,6 @@ void split_page(struct page *page, unsigned int order)
>  
>  int __isolate_free_page(struct page *page, unsigned int order)
>  {
> - struct free_area *area = &page_zone(page)->free_area[order];
>   unsigned long watermark;
>   struct zone *zone;
>   int mt;
> @@ -3139,7 +3131,7 @@ int __isolate_free_page(struct page *page, unsigned int
> order)
>  
>   /* Remove page from free list */
>  
> - del_page_from_free_area(page, area);
> + del_page_from_free_list(page, zone, order);
>  
>   /*
>* Set the pageblock if the isolated page is at least half of a
> @@ -8560,7 +8552,7 @@ void zone_pcp_reset(struct zone *zone)
>   pr_info("remove from free list %lx %d %lx\n",
>   pfn, 1 << order, end_pfn);
>  #endif
> - del_page_from_free_area(page, &zone->free_area[order]);
> + del_page_from_free_list(page, zone, order);
>   for (i = 0; i < (1 << order); i++)
>   SetPageReserved((page+i));
>   pfn += (1 << order);
> 
> 

Reviewed-by: Pankaj Gupta 

> 
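
For readers not following the whole series: the new helpers derive the
free_area from the zone and order internally, so callers no longer pass a
struct free_area around. A rough sketch of the resulting shape (simplified,
not necessarily the exact final code):

static inline void add_to_free_list(struct page *page, struct zone *zone,
				    unsigned int order, int migratetype)
{
	struct free_area *area = &zone->free_area[order];

	list_add(&page->lru, &area->free_list[migratetype]);
	area->nr_free++;
}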


Re: [PATCH v6 0/6] mm / virtio: Provide support for unused page reporting

2019-08-22 Thread Pankaj Gupta


> On Thu, 2019-08-22 at 06:43 -0400, Pankaj Gupta wrote:
> > > This series provides an asynchronous means of reporting to a hypervisor
> > > that a guest page is no longer in use and can have the data associated
> > > with it dropped. To do this I have implemented functionality that allows
> > > for what I am referring to as unused page reporting
> > > 
> > > The functionality for this is fairly simple. When enabled it will
> > > allocate
> > > statistics to track the number of reported pages in a given free area.
> > > When the number of free pages exceeds this value plus a high water value,
> > > currently 32, it will begin performing page reporting which consists of
> > > pulling pages off of free list and placing them into a scatter list. The
> > > scatterlist is then given to the page reporting device and it will
> > > perform
> > > the required action to make the pages "reported", in the case of
> > > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > > and as such they are forced out of the guest. After this they are placed
> > > back on the free list, and an additional bit is added if they are not
> > > merged indicating that they are a reported buddy page instead of a
> > > standard buddy page. The cycle then repeats with additional non-reported
> > > pages being pulled until the free areas all consist of reported pages.
> > > 
> > > I am leaving a number of things hard-coded such as limiting the lowest
> > > order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> > > determine what the limit is on how many pages it wants to allocate to
> > > process the hints. The upper limit for this is based on the size of the
> > > queue used to store the scattergather list.
> > > 
> > > My primary testing has just been to verify the memory is being freed
> > > after
> > > allocation by running memhog 40g on a 40g guest and watching the total
> > > free memory via /proc/meminfo on the host. With this I have verified most
> > > of the memory is freed after each iteration.
> > 
> > I tried to go through the entire patch series. I can see you reported a
> > -3.27% drop from the baseline. Is it because of re-faulting the pages after
> > the host has freed them? Can we avoid freeing all the pages from the guest
> > free_area and keep some pages (maybe of mixed order), so that the next
> > allocation is done from the guest itself rather than faulting to the host?
> > This will work with real workloads where allocation and deallocation
> > happen at regular intervals.
> > 
> > This can be further optimized based on other factors like host memory
> > pressure etc.
> > 
> > Thanks,
> > Pankaj
> 
> When I originally started implementing and testing this code I was seeing
> less than a 1% regression. I didn't feel like that was really an accurate
> result since it wasn't putting much stress on the changed code so I have
> modified my tests and kernel so that I have memory shuffling and THP
> enabled. In addition I have gone out of my way to lock things down to a
> single NUMA node on my host system as the code I had would sometimes
> perform better than baseline when running the test due to the fact that
> memory was being freed back to the host and then reallocated which
> actually allowed for better NUMA locality.
> 
> The general idea was I wanted to know what the worst case penalty would be
> for running this code, and it turns out most of that is just the cost of
> faulting back in the pages. By enabling memory shuffling I am forcing the
> memory to churn as pages are added to both the head and tail of the
> free_list. The test itself was modified so that it didn't allocate order 0
> pages and instead was allocating transparent huge pages so the effects
> were as visible as possible. Without that the page faulting overhead would
> mostly fall into the noise of having to allocate the memory as order 0
> pages, that is what I had essentially seen earlier when I was running the
> stock page_fault1 test.

Right. I think the reason is that this test is allocating THPs in the guest,
while on the host side you are still using order-0 pages, I assume?

> 
> This code does no hinting on anything smaller than either MAX_ORDER - 1 or
> HUGETLB_PAGE_ORDER pages, and it only starts when there are at least 32 of
> them available to hint on. This results in us not starting to perform the
> hinting until there is 64MB to 128MB of memory sitting in the higher order
> regions of the zone.

o.k

> 
> The hinting itself stops as soon as w

Re: [PATCH v6 0/6] mm / virtio: Provide support for unused page reporting

2019-08-22 Thread Pankaj Gupta


> 
> This series provides an asynchronous means of reporting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as unused page reporting
> 
> The functionality for this is fairly simple. When enabled it will allocate
> statistics to track the number of reported pages in a given free area.
> When the number of free pages exceeds this value plus a high water value,
> currently 32, it will begin performing page reporting which consists of
> pulling pages off of free list and placing them into a scatter list. The
> scatterlist is then given to the page reporting device and it will perform
> the required action to make the pages "reported", in the case of
> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> and as such they are forced out of the guest. After this they are placed
> back on the free list, and an additional bit is added if they are not
> merged indicating that they are a reported buddy page instead of a
> standard buddy page. The cycle then repeats with additional non-reported
> pages being pulled until the free areas all consist of reported pages.
> 
> I am leaving a number of things hard-coded such as limiting the lowest
> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> determine what the limit is on how many pages it wants to allocate to
> process the hints. The upper limit for this is based on the size of the
> queue used to store the scattergather list.
> 
> My primary testing has just been to verify the memory is being freed after
> allocation by running memhog 40g on a 40g guest and watching the total
> free memory via /proc/meminfo on the host. With this I have verified most
> of the memory is freed after each iteration. 

I tried to go through the entire patch series. I can see you reported a
-3.27% drop from the baseline. Is it because of re-faulting the pages after
the host has freed them? Can we avoid freeing all the pages from the guest
free_area and keep some pages (maybe of mixed order), so that the next
allocation is done from the guest itself rather than faulting to the host?
This will work with real workloads where allocation and deallocation happen
at regular intervals.

This can be further optimized based on other factors like host memory pressure 
etc.

Thanks,
Pankaj   

As far as performance I have
> been mainly focusing on the will-it-scale/page_fault1 test running with
> 16 vcpus. I have modified it to use Transparent Huge Pages. With this I
> see almost no difference, -0.08%, with the patches applied and the feature
> disabled. I see a regression of -0.86% with the feature enabled, but the
> madvise disabled in the hypervisor due to a device being assigned. With
> the feature fully enabled I see a regression of -3.27% versus the baseline
> without these patches applied. In my testing I found that most of the
> overhead was due to the page zeroing that comes as a result of the pages
> having to be faulted back into the guest.
> 
> One side effect of these patches is that the guest becomes much more
> resilient in terms of NUMA locality. With the pages being freed and then
> reallocated when used it allows for the pages to be much closer to the
> active thread, and as a result there can be situations where this patch
> set will out-perform the stock kernel when the guest memory is not local
> to the guest vCPUs. To avoid that in my testing I set the affinity of all
> the vCPUs and QEMU instance to the same node.
> 
> Changes from the RFC:
> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> Moved aeration requested flag out of aerator and into zone->flags.
> Moved boundary out of free_area and into local variables for aeration.
> Moved aeration cycle out of interrupt and into workqueue.
> Left nr_free as total pages instead of splitting it between raw and aerated.
> Combined size and physical address values in virtio ring into one 64b value.
> 
> Changes from v1:
> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> Dropped "waste page treatment" in favor of "page hinting"
> Renamed files and functions from "aeration" to "page_hinting"
> Moved from page->lru list to scatterlist
> Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
> Virtio now uses scatterlist directly instead of intermediate array
> Moved stats out of free_area, now in separate area and pointed to from zone
> Merged patch 5 into patch 4 to improve review-ability
> Updated various code comments throughout
> 
> Changes from v2:
> https://lore.kernel.org/lkml/20190724165158.6685.87228.stgit@localhost.localdomain/
> Dropped "page hinting" in favor of "page reporting"
> Renamed files from "hinting" to "reporting"
> Replaced "Hinted" page type with "Reported" page flag
> Added support for page poisoning while hinting is active
> Add QEMU patch 
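
To make the "madvised as MADV_DONTNEED" step above concrete: on the host
side the hypervisor essentially drops the backing for the reported guest
range, so the guest re-faults zeroed pages if it ever touches that memory
again. A minimal userspace sketch (assumptions: the hypervisor maps guest
RAM at some host virtual address hva, and the function name is made up):

#include <sys/mman.h>

/* Drop the backing pages for a reported guest range (illustration only;
 * QEMU's real free-page-reporting path is more involved). */
static int discard_reported_range(void *hva, size_t len)
{
	return madvise(hva, len, MADV_DONTNEED);
}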

Re: [PATCH] libnvdimm: change disk name of virtio pmem disk

2019-08-16 Thread Pankaj Gupta


> >
> > This patch adds prefix 'v' in disk name for virtio pmem.
> > This differentiates virtio-pmem disks from the pmem disks.
> 
> I don't think the small matter that this device does not support
> MAP_SYNC warrants a separate naming scheme.  That said I do think we
> need to export this attribute in sysfs, likely at the region level,
> and then display that information in ndctl. This is distinct from the
> btt case where it is operating a different data consistency contract
> than baseline pmem.

OK. I will look into adding the information in sysfs and displaying it using ndctl.
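
Something roughly along these lines, perhaps (a purely hypothetical sketch;
the attribute name and its hookup into the region attribute group are made
up):

/* Hypothetical region-level sysfs attribute exposing whether the region
 * provides synchronous-DAX semantics; illustration only. */
static ssize_t sync_dax_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	struct nd_region *nd_region = to_nd_region(dev);

	return sprintf(buf, "%d\n", is_nvdimm_sync(nd_region));
}
static DEVICE_ATTR_RO(sync_dax);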

Thanks,
Pankaj

> >
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  drivers/nvdimm/namespace_devs.c | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvdimm/namespace_devs.c
> > b/drivers/nvdimm/namespace_devs.c
> > index a16e52251a30..8e5d29266fb0 100644
> > --- a/drivers/nvdimm/namespace_devs.c
> > +++ b/drivers/nvdimm/namespace_devs.c
> > @@ -182,8 +182,12 @@ const char *nvdimm_namespace_disk_name(struct
> > nd_namespace_common *ndns,
> > char *name)
> >  {
> > struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
> > +   const char *prefix = "";
> > const char *suffix = NULL;
> >
> > +   if (!is_nvdimm_sync(nd_region))
> > +   prefix = "v";
> > +
> > if (ndns->claim && is_nd_btt(ndns->claim))
> > suffix = "s";
> >
> > @@ -201,7 +205,7 @@ const char *nvdimm_namespace_disk_name(struct
> > nd_namespace_common *ndns,
> > sprintf(name, "pmem%d.%d%s", nd_region->id, nsidx,
> > suffix ? suffix : "");
> > else
> > -   sprintf(name, "pmem%d%s", nd_region->id,
> > +   sprintf(name, "%spmem%d%s", prefix, nd_region->id,
> > suffix ? suffix : "");
> > } else if (is_namespace_blk(&ndns->dev)) {
> > struct nd_namespace_blk *nsblk;
> > --
> > 2.20.1
> >
> 


[PATCH v4 0/2] virtio_console: fix replug of virtio console port

2019-08-13 Thread Pankaj Gupta
This patch series fixes the issue with unplug/replug of a port in virtio
console driver which fails with an error "Error allocating inbufs\n".

Patch 1 updates the next avail index for packed ring code.
Patch 2 makes use of 'virtqueue_detach_unused_buf' function to detach
the unused buffers during port hotunplug time.

Tested the packed ring code with the qemu virtio 1.1 device code posted
here [1]. Also, sent a patch to document the behavior in virtio spec as
suggested by Michael.

Changes from v3
- Swap patch 1 with patch 2  - [Michael]
- Added acked-by tag by Jason in patch 1
- Add reference to spec change
Changes from v2
- Better change log in patch2 - [Greg]
Changes from v1[2]
- Make virtio packed ring code compatible with split ring - [Michael]

[1]  https://marc.info/?l=qemu-devel=156471883703948=2
[2]  https://lkml.org/lkml/2019/3/4/517
[3]  https://lists.oasis-open.org/archives/virtio-dev/201908/msg00055.html

Pankaj Gupta (2):
  virtio: decrement avail idx with buffer detach for packed ring
  virtio_console: free unused buffers with port delete

 char/virtio_console.c |   14 +++---
 virtio/virtio_ring.c  |6 ++
 2 files changed, 17 insertions(+), 3 deletions(-)
-- 
2.21.0



[PATCH v4 2/2] virtio_console: free unused buffers with port delete

2019-08-13 Thread Pankaj Gupta
The commit a7a69ec0d8e4 ("virtio_console: free buffers after reset")
deferred detaching of unused buffers to virtio device unplug time.
This causes unplug/replug of a single port in the virtio device to fail
with the error "Error allocating inbufs\n", since we don't free the
unused buffers attached to the port. Re-plugging the same port tries to
allocate new buffers in the virtqueue and results in this error if the
queue is full. This patch reverts that commit by removing the unused
buffers in the vqs when we unplug the port.

Reported-by: Xiaohui Li 
Cc: sta...@vger.kernel.org
Signed-off-by: Pankaj Gupta 
---
 drivers/char/virtio_console.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 7270e7b69262..e8be82f1bae9 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1494,15 +1494,25 @@ static void remove_port(struct kref *kref)
kfree(port);
 }
 
+static void remove_unused_bufs(struct virtqueue *vq)
+{
+   struct port_buffer *buf;
+
+   while ((buf = virtqueue_detach_unused_buf(vq)))
+   free_buf(buf, true);
+}
+
 static void remove_port_data(struct port *port)
 {
spin_lock_irq(&port->inbuf_lock);
/* Remove unused data this port might have received. */
discard_port_data(port);
+   remove_unused_bufs(port->in_vq);
spin_unlock_irq(&port->inbuf_lock);
 
spin_lock_irq(&port->outvq_lock);
reclaim_consumed_buffers(port);
+   remove_unused_bufs(port->out_vq);
spin_unlock_irq(&port->outvq_lock);
 }
 
@@ -1938,11 +1948,9 @@ static void remove_vqs(struct ports_device *portdev)
struct virtqueue *vq;
 
virtio_device_for_each_vq(portdev->vdev, vq) {
-   struct port_buffer *buf;
 
flush_bufs(vq, true);
-   while ((buf = virtqueue_detach_unused_buf(vq)))
-   free_buf(buf, true);
+   remove_unused_bufs(vq);
}
portdev->vdev->config->del_vqs(portdev->vdev);
kfree(portdev->in_vqs);
-- 
2.21.0



[PATCH v4 1/2] virtio: decrement avail idx with buffer detach for packed ring

2019-08-13 Thread Pankaj Gupta
This patch decrements 'next_avail_idx' count when detaching a buffer
from vq for packed ring code. Split ring code already does this in
virtqueue_detach_unused_buf_split function. This updates the
'next_avail_idx' to the previous correct index after an unused buffer
is detached from the vq.

Acked-by: Jason Wang 
Signed-off-by: Pankaj Gupta 
---
 drivers/virtio/virtio_ring.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c8be1c4f5b55..7c69181113e2 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1537,6 +1537,12 @@ static void *virtqueue_detach_unused_buf_packed(struct 
virtqueue *_vq)
/* detach_buf clears data, so grab it now. */
buf = vq->packed.desc_state[i].data;
detach_buf_packed(vq, i, NULL);
+   vq->packed.next_avail_idx--;
+   if (vq->packed.next_avail_idx < 0) {
+   vq->packed.next_avail_idx = vq->packed.vring.num - 1;
+   vq->packed.avail_wrap_counter ^= 1;
+   }
+
END_USE(vq);
return buf;
}
-- 
2.20.1
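
As a side note, the wrap handling above is the standard packed-ring index
arithmetic; a tiny standalone illustration of stepping an index backwards
across the ring boundary (toy userspace code, not driver code):

#include <stdio.h>

struct toy_ring { int next_avail_idx; int num; unsigned char wrap_counter; };

/* Decrementing past 0 wraps to num - 1 and flips the wrap counter,
 * mirroring the driver change above. */
static void step_back(struct toy_ring *r)
{
	if (--r->next_avail_idx < 0) {
		r->next_avail_idx = r->num - 1;
		r->wrap_counter ^= 1;
	}
}

int main(void)
{
	struct toy_ring r = { .next_avail_idx = 0, .num = 256, .wrap_counter = 1 };

	step_back(&r);
	printf("idx=%d wrap=%u\n", r.next_avail_idx, r.wrap_counter); /* idx=255 wrap=0 */
	return 0;
}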



Re: [PATCH] libnvdimm: change disk name of virtio pmem disk

2019-08-12 Thread Pankaj Gupta


Ping.

> 
> This patch adds prefix 'v' in disk name for virtio pmem.
> This differentiates virtio-pmem disks from the pmem disks.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/nvdimm/namespace_devs.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvdimm/namespace_devs.c
> b/drivers/nvdimm/namespace_devs.c
> index a16e52251a30..8e5d29266fb0 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -182,8 +182,12 @@ const char *nvdimm_namespace_disk_name(struct
> nd_namespace_common *ndns,
>   char *name)
>  {
>   struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
> + const char *prefix = "";
>   const char *suffix = NULL;
>  
> + if (!is_nvdimm_sync(nd_region))
> + prefix = "v";
> +
>   if (ndns->claim && is_nd_btt(ndns->claim))
>   suffix = "s";
>  
> @@ -201,7 +205,7 @@ const char *nvdimm_namespace_disk_name(struct
> nd_namespace_common *ndns,
>   sprintf(name, "pmem%d.%d%s", nd_region->id, nsidx,
>   suffix ? suffix : "");
>   else
> - sprintf(name, "pmem%d%s", nd_region->id,
> + sprintf(name, "%spmem%d%s", prefix, nd_region->id,
>   suffix ? suffix : "");
>   } else if (is_namespace_blk(&ndns->dev)) {
>   struct nd_namespace_blk *nsblk;
> --
> 2.20.1
> 
> 


Re: [PATCH v3 2/2] virtio: decrement avail idx with buffer detach for packed ring

2019-08-11 Thread Pankaj Gupta


> 
> On 2019/8/9 下午2:48, Pankaj Gupta wrote:
> > This patch decrements 'next_avail_idx' count when detaching a buffer
> > from vq for packed ring code. Split ring code already does this in
> > virtqueue_detach_unused_buf_split function. This updates the
> > 'next_avail_idx' to the previous correct index after an unused buffer
> > is detached from the vq.
> >
> > Signed-off-by: Pankaj Gupta 
> > ---
> >   drivers/virtio/virtio_ring.c | 6 ++
> >   1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index c8be1c4f5b55..7c69181113e2 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1537,6 +1537,12 @@ static void
> > *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
> > /* detach_buf clears data, so grab it now. */
> > buf = vq->packed.desc_state[i].data;
> > detach_buf_packed(vq, i, NULL);
> > +   vq->packed.next_avail_idx--;
> > +   if (vq->packed.next_avail_idx < 0) {
> > +   vq->packed.next_avail_idx = vq->packed.vring.num - 1;
> > +   vq->packed.avail_wrap_counter ^= 1;
> > +   }
> > +
> > END_USE(vq);
> > return buf;
> > }
> 
> 
> Acked-by: Jason Wang 

Thank you, Jason.

Best regards,
Pankaj

> 
> 
> 


Re: [PATCH v3 1/2] virtio_console: free unused buffers with port delete

2019-08-11 Thread Pankaj Gupta


> 
> On Fri, Aug 09, 2019 at 12:18:46PM +0530, Pankaj Gupta wrote:
> > The commit a7a69ec0d8e4 ("virtio_console: free buffers after reset")
> > deferred detaching of unused buffers to virtio device unplug time.
> > This causes unplug/replug of a single port in the virtio device to fail
> > with the error "Error allocating inbufs\n", since we don't free the
> > unused buffers attached to the port. Re-plugging the same port tries to
> > allocate new buffers in the virtqueue and results in this error if the
> > queue is full.
> 
> So why not reuse the buffers that are already there in this case?
> Seems quite possible.

I took this approach because reusing the buffers would involve tweaking
existing core functionality like managing the virtqueue indexes.
Compared to that, deleting the buffers while hot-unplugging the port is
simple and was working fine before. It seems logically correct as well.

I agree we need a spec change for this.

> 
> > This patch removes the unused buffers in the vqs when we unplug the port.
> > This is the best we can do, as we cannot call device_reset because the
> > virtio device is still active.
> > 
> > Reported-by: Xiaohui Li 
> > Fixes: a7a69ec0d8e4 ("virtio_console: free buffers after reset")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Pankaj Gupta 
> 
> This is really a revert of a7a69ec0d8e4, just tagged confusingly.
> 
> And the original is also supposed to be a bugfix.
> So how will the original bug be fixed?

Yes, even I was confused while adding this tag.
Should I remove the 'Fixes' tag completely for this patch,
because it's a revert to the original behavior which is also a bugfix?

> 
> "this is the best we can do" is rarely the case.
> 
> I am not necessarily against the revert. But if we go that way then what
> we need to do is specify the behaviour in the spec, since strict spec
> compliance is exactly what the original patch was addressing.

Agree.

> 
> In particular, we'd document that console has a special property that
> when port is detached virtqueue is considered stopped, device must not
> use any buffers, and it is legal to take buffers out of the device.

Yes. This documents the exact scenario. Thanks.
You want me to send a patch for the spec change?

Best regards,
Pankaj
 
> 
> 
> 
> > ---
> >  drivers/char/virtio_console.c | 14 +++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > index 7270e7b69262..e8be82f1bae9 100644
> > --- a/drivers/char/virtio_console.c
> > +++ b/drivers/char/virtio_console.c
> > @@ -1494,15 +1494,25 @@ static void remove_port(struct kref *kref)
> >  kfree(port);
> >  }
> >  
> > +static void remove_unused_bufs(struct virtqueue *vq)
> > +{
> > +struct port_buffer *buf;
> > +
> > +while ((buf = virtqueue_detach_unused_buf(vq)))
> > +free_buf(buf, true);
> > +}
> > +
> >  static void remove_port_data(struct port *port)
> >  {
> >  spin_lock_irq(&port->inbuf_lock);
> >  /* Remove unused data this port might have received. */
> >  discard_port_data(port);
> > +remove_unused_bufs(port->in_vq);
> >  spin_unlock_irq(&port->inbuf_lock);
> >  
> >  spin_lock_irq(&port->outvq_lock);
> >  reclaim_consumed_buffers(port);
> > +remove_unused_bufs(port->out_vq);
> >  spin_unlock_irq(&port->outvq_lock);
> >  }
> >  
> > @@ -1938,11 +1948,9 @@ static void remove_vqs(struct ports_device *portdev)
> >  struct virtqueue *vq;
> >  
> >  virtio_device_for_each_vq(portdev->vdev, vq) {
> > -struct port_buffer *buf;
> >  
> >  flush_bufs(vq, true);
> > -while ((buf = virtqueue_detach_unused_buf(vq)))
> > -free_buf(buf, true);
> > +remove_unused_bufs(vq);
> >  }
> >  portdev->vdev->config->del_vqs(portdev->vdev);
> >  kfree(portdev->in_vqs);
> > --
> > 2.21.0
> 


Re: [PATCH v3 2/2] virtio: decrement avail idx with buffer detach for packed ring

2019-08-11 Thread Pankaj Gupta


> On Fri, Aug 09, 2019 at 12:18:47PM +0530, Pankaj Gupta wrote:
> > This patch decrements 'next_avail_idx' count when detaching a buffer
> > from vq for packed ring code. Split ring code already does this in
> > virtqueue_detach_unused_buf_split function. This updates the
> > 'next_avail_idx' to the previous correct index after an unused buffer
> > is detached from the vq.
> > 
> > Signed-off-by: Pankaj Gupta 
> 
> I would make this patch 1, not patch 2, otherwise
> patch 1 corrupts the ring.

Sure. Will post this patch as patch 1 in next version.

Thanks,
Pankaj 

> 
> 
> > ---
> >  drivers/virtio/virtio_ring.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index c8be1c4f5b55..7c69181113e2 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1537,6 +1537,12 @@ static void
> > *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
> > /* detach_buf clears data, so grab it now. */
> > buf = vq->packed.desc_state[i].data;
> > detach_buf_packed(vq, i, NULL);
> > +   vq->packed.next_avail_idx--;
> > +   if (vq->packed.next_avail_idx < 0) {
> > +   vq->packed.next_avail_idx = vq->packed.vring.num - 1;
> > +   vq->packed.avail_wrap_counter ^= 1;
> > +   }
> > +
> > END_USE(vq);
> > return buf;
> > }
> > --
> > 2.20.1
> 


[PATCH v3 2/2] virtio: decrement avail idx with buffer detach for packed ring

2019-08-09 Thread Pankaj Gupta
This patch decrements 'next_avail_idx' count when detaching a buffer
from vq for packed ring code. Split ring code already does this in
virtqueue_detach_unused_buf_split function. This updates the
'next_avail_idx' to the previous correct index after an unused buffer
is detached from the vq.

Signed-off-by: Pankaj Gupta 
---
 drivers/virtio/virtio_ring.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c8be1c4f5b55..7c69181113e2 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1537,6 +1537,12 @@ static void *virtqueue_detach_unused_buf_packed(struct 
virtqueue *_vq)
/* detach_buf clears data, so grab it now. */
buf = vq->packed.desc_state[i].data;
detach_buf_packed(vq, i, NULL);
+   vq->packed.next_avail_idx--;
+   if (vq->packed.next_avail_idx < 0) {
+   vq->packed.next_avail_idx = vq->packed.vring.num - 1;
+   vq->packed.avail_wrap_counter ^= 1;
+   }
+
END_USE(vq);
return buf;
}
-- 
2.20.1



[PATCH v3 1/2] virtio_console: free unused buffers with port delete

2019-08-09 Thread Pankaj Gupta
The commit a7a69ec0d8e4 ("virtio_console: free buffers after reset")
deferred detaching of unused buffers to virtio device unplug time.
This causes unplug/replug of a single port in the virtio device to fail
with the error "Error allocating inbufs\n", since we don't free the
unused buffers attached to the port. Re-plugging the same port tries to
allocate new buffers in the virtqueue and results in this error if the
queue is full. This patch removes the unused buffers in the vqs when we
unplug the port. This is the best we can do, as we cannot call
device_reset because the virtio device is still active.

Reported-by: Xiaohui Li 
Fixes: a7a69ec0d8e4 ("virtio_console: free buffers after reset")
Cc: sta...@vger.kernel.org
Signed-off-by: Pankaj Gupta 
---
 drivers/char/virtio_console.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 7270e7b69262..e8be82f1bae9 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1494,15 +1494,25 @@ static void remove_port(struct kref *kref)
kfree(port);
 }
 
+static void remove_unused_bufs(struct virtqueue *vq)
+{
+   struct port_buffer *buf;
+
+   while ((buf = virtqueue_detach_unused_buf(vq)))
+   free_buf(buf, true);
+}
+
 static void remove_port_data(struct port *port)
 {
spin_lock_irq(&port->inbuf_lock);
/* Remove unused data this port might have received. */
discard_port_data(port);
+   remove_unused_bufs(port->in_vq);
spin_unlock_irq(&port->inbuf_lock);
 
spin_lock_irq(&port->outvq_lock);
reclaim_consumed_buffers(port);
+   remove_unused_bufs(port->out_vq);
spin_unlock_irq(&port->outvq_lock);
 }
 
@@ -1938,11 +1948,9 @@ static void remove_vqs(struct ports_device *portdev)
struct virtqueue *vq;
 
virtio_device_for_each_vq(portdev->vdev, vq) {
-   struct port_buffer *buf;
 
flush_bufs(vq, true);
-   while ((buf = virtqueue_detach_unused_buf(vq)))
-   free_buf(buf, true);
+   remove_unused_bufs(vq);
}
portdev->vdev->config->del_vqs(portdev->vdev);
kfree(portdev->in_vqs);
-- 
2.21.0



[PATCH v3 0/2] virtio_console: fix replug of virtio console port

2019-08-09 Thread Pankaj Gupta
This patch series fixes the issue with unplug/replug of a port in virtio
console driver which fails with an error "Error allocating inbufs\n".
Patch 1 makes use of 'virtqueue_detach_unused_buf' function to detach
the unused buffers during port hotunplug time.
Patch 2 updates the next avail index for packed ring code.
Tested the packed ring code with the qemu virtio 1.1 device code posted
here [1].

Changes from v2
 Better change log in patch2 - [Greg]
Changes from v1[2]
 Make virtio packed ring code compatible with split ring - [Michael]

[1]  https://marc.info/?l=qemu-devel=156471883703948=2
[2]  https://lkml.org/lkml/2019/3/4/517

Pankaj Gupta (2):
  virtio_console: free unused buffers with port delete
  virtio: decrement avail idx with buffer detach for packed ring

 char/virtio_console.c |   14 +++---
 virtio/virtio_ring.c  |6 ++
 2 files changed, 17 insertions(+), 3 deletions(-)
-- 
2.21.0



Re: [PATCH v2 2/2] virtio_ring: packed ring: fix virtqueue_detach_unused_buf

2019-08-08 Thread Pankaj Gupta


> On Thu, Aug 08, 2019 at 08:28:46AM -0400, Pankaj Gupta wrote:
> > 
> > 
> > > > This patch makes packed ring code compatible with split ring in
> > > > function
> > > > 'virtqueue_detach_unused_buf_*'.
> > > 
> > > What does that mean?  What does this "fix"?
> > 
> > Patch 1 frees the buffers when a port is unplugged from the virtio
> > console device. It does this with the help of the
> > 'virtqueue_detach_unused_buf_split/packed' functions. For the split
> > ring case, the corresponding function decrements the avail ring index.
> > For the packed ring code, this functionality is not available, so this
> > patch adds the required support and hence helps to remove the unused
> > buffers completely.
> 
> Explain all of this in great detail in the changelog comment.  What you
> have in there today does not make any sense.

Sure. I will try to put it in clearer words.

Thanks,
Pankaj

> 
> thanks,
> 
> greg k-h
> 

