Re: [PATCH v2 1/2] dt-bindings: virtio: mmio: add optional wakeup-source property

2022-06-09 Thread Michael S. Tsirkin
On Thu, May 19, 2022 at 03:23:02PM +0800, Minghao Xue wrote:
> On Tue, Mar 29, 2022 at 09:59:31AM +0200, Krzysztof Kozlowski wrote:
> > On 29/03/2022 09:46, Minghao Xue wrote:
> > > On Mon, Mar 28, 2022 at 04:42:59PM -0400, Michael S. Tsirkin wrote:
> > >> On Fri, Mar 25, 2022 at 09:59:45AM +0800, Minghao Xue wrote:
> > >>> Some systems want to set the interrupt of virtio_mmio device
> > >>> as a wakeup source. On such systems, we'll use the existence
> > >>> of the "wakeup-source" property as a signal of requirement.
> > >>>
> > >>> Signed-off-by: Minghao Xue 
> > >>
> > >> I don't have enough of a clue about dt to review this.
> > >> Pls get some acks from people with DT expertise.
> > >>
> > > Hi Michael,
> > > I had a discussion with Krzysztof on the first version of the patch,
> > > and we've reached agreement. 
> > > 
> > 
> > I thought I reviewed this and provided an ack, but apparently I did not.
> > Sorry for late response.
> > 
> > Reviewed-by: Krzysztof Kozlowski 
> > 
> > Best regards,
> > Krzysztof
> 
> Hi Michael and Jason,
> As this patch has been reviewed by Krzysztof, would you help upstream
> these two patches? Is there any progress on it?
> 
> Regards,
> Minghao

Hi!
Sorry about the delay - the issue with the patchset is that it was not
threaded correctly and so cannot be handled properly by the
automated scripts. Can you please repost with proper threading,
preferably with a cover letter?

-- 
MST
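
For readers unfamiliar with the binding, a minimal sketch of what such a
node might look like (the address, size, and interrupt specifier below are
invented placeholders, not taken from the patch):

```dts
/* Hypothetical example node; reg and interrupts values are placeholders. */
virtio@a000000 {
        compatible = "virtio,mmio";
        reg = <0x0a000000 0x200>;
        interrupts = <0 16 1>;
        /* Boolean property: its presence asks the OS to treat the
         * device's interrupt as a wakeup source. */
        wakeup-source;
};
```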

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 12/19] btrfs: Convert btrfs_migratepage to migrate_folio

2022-06-09 Thread David Sterba
On Thu, Jun 09, 2022 at 06:40:28PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 09, 2022 at 06:33:23PM +0200, David Sterba wrote:
> > On Wed, Jun 08, 2022 at 04:02:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > > Use filemap_migrate_folio() to do the bulk of the work, and then copy
> > > the ordered flag across if needed.
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > Reviewed-by: Christoph Hellwig 
> > 
> > Acked-by: David Sterba 
> > 
> > > +static int btrfs_migrate_folio(struct address_space *mapping,
> > > +  struct folio *dst, struct folio *src,
> > >enum migrate_mode mode)
> > >  {
> > > - int ret;
> > > + int ret = filemap_migrate_folio(mapping, dst, src, mode);
> > >  
> > > - ret = migrate_page_move_mapping(mapping, newpage, page, 0);
> > >   if (ret != MIGRATEPAGE_SUCCESS)
> > >   return ret;
> > >  
> > > - if (page_has_private(page))
> > > - attach_page_private(newpage, detach_page_private(page));
> > 
> > If I'm reading it correctly, the private pointer does not need to be set
> > like that anymore because it's done somewhere during the
> > filemap_migrate_folio() call.
> 
> That's correct.  Everything except moving the ordered flag across is
> done for you, and I'm kind of tempted to modify folio_migrate_flags()
> to copy the ordered flag across as well.  Then you could just use
> filemap_migrate_folio() directly.

Either way works for me. If it would mean an unsafe change in folios
or complicate other code, I'm fine with the migration callback that
does additional work for btrfs and could be changed later.


Re: [PATCH v2 12/19] btrfs: Convert btrfs_migratepage to migrate_folio

2022-06-09 Thread Matthew Wilcox
On Thu, Jun 09, 2022 at 06:33:23PM +0200, David Sterba wrote:
> On Wed, Jun 08, 2022 at 04:02:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > Use filemap_migrate_folio() to do the bulk of the work, and then copy
> > the ordered flag across if needed.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > Reviewed-by: Christoph Hellwig 
> 
> Acked-by: David Sterba 
> 
> > +static int btrfs_migrate_folio(struct address_space *mapping,
> > +struct folio *dst, struct folio *src,
> >  enum migrate_mode mode)
> >  {
> > -   int ret;
> > +   int ret = filemap_migrate_folio(mapping, dst, src, mode);
> >  
> > -   ret = migrate_page_move_mapping(mapping, newpage, page, 0);
> > if (ret != MIGRATEPAGE_SUCCESS)
> > return ret;
> >  
> > -   if (page_has_private(page))
> > -   attach_page_private(newpage, detach_page_private(page));
> 
> If I'm reading it correctly, the private pointer does not need to be set
> like that anymore because it's done somewhere during the
> filemap_migrate_folio() call.

That's correct.  Everything except moving the ordered flag across is
done for you, and I'm kind of tempted to modify folio_migrate_flags()
to copy the ordered flag across as well.  Then you could just use
filemap_migrate_folio() directly.

> > -
> > -   if (PageOrdered(page)) {
> > -   ClearPageOrdered(page);
> > -   SetPageOrdered(newpage);
> > +   if (folio_test_ordered(src)) {
> > +   folio_clear_ordered(src);
> > +   folio_set_ordered(dst);
> > }


Re: [PATCH v2 12/19] btrfs: Convert btrfs_migratepage to migrate_folio

2022-06-09 Thread David Sterba
On Wed, Jun 08, 2022 at 04:02:42PM +0100, Matthew Wilcox (Oracle) wrote:
> Use filemap_migrate_folio() to do the bulk of the work, and then copy
> the ordered flag across if needed.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

Acked-by: David Sterba 

> +static int btrfs_migrate_folio(struct address_space *mapping,
> +  struct folio *dst, struct folio *src,
>enum migrate_mode mode)
>  {
> - int ret;
> + int ret = filemap_migrate_folio(mapping, dst, src, mode);
>  
> - ret = migrate_page_move_mapping(mapping, newpage, page, 0);
>   if (ret != MIGRATEPAGE_SUCCESS)
>   return ret;
>  
> - if (page_has_private(page))
> - attach_page_private(newpage, detach_page_private(page));

If I'm reading it correctly, the private pointer does not need to be set
like that anymore because it's done somewhere during the
filemap_migrate_folio() call.

> -
> - if (PageOrdered(page)) {
> - ClearPageOrdered(page);
> - SetPageOrdered(newpage);
> + if (folio_test_ordered(src)) {
> + folio_clear_ordered(src);
> + folio_set_ordered(dst);
>   }


Re: [PATCH v2 10/19] mm/migrate: Convert migrate_page() to migrate_folio()

2022-06-09 Thread David Sterba
On Wed, Jun 08, 2022 at 04:02:40PM +0100, Matthew Wilcox (Oracle) wrote:
> Convert all callers to pass a folio.  Most have the folio
> already available.  Switch all users from aops->migratepage to
> aops->migrate_folio.  Also turn the documentation into kerneldoc.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  4 +--

For

>  fs/btrfs/disk-io.c  |  2 +-

Acked-by: David Sterba 


Re: [PATCH v2 08/19] btrfs: Convert btree_migratepage to migrate_folio

2022-06-09 Thread David Sterba
On Wed, Jun 08, 2022 at 04:02:38PM +0100, Matthew Wilcox (Oracle) wrote:
> Use a folio throughout this function.  migrate_page() will be converted
> later.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

Acked-by: David Sterba 


ICITS'2023 | Cusco, Peru | Deadline: September 4

2022-06-09 Thread ICITS-2023

ICITS'23 - The 6th International Conference on Information Technology & Systems
Cusco, Peru, 8 - 10 February 2023
http://icits.me 


Scope

ICITS'23 - The 6th International Conference on Information Technology & 
Systems, to be held at Universidad Nacional de San Antonio Abad del Cusco, in 
Cusco, Peru, between the 8th and the 10th of February 2023, is an international 
forum for researchers and practitioners to present and discuss the most recent 
innovations, trends, results, experiences and concerns in the various 
perspectives of Information Technology & Systems.

We are pleased to invite you to submit your papers to ICITS'23. They can be 
written in English, Spanish or Portuguese. All submissions will be reviewed on 
the basis of relevance, originality, importance and clarity.



Topics

Submitted papers should be related to one or more of the main themes proposed 
for the Conference:

A) Information and Knowledge Management (IKM);

B) Organizational Models and Information Systems (OMIS);

C) Software and Systems Modeling (SSM);

D) Software Systems, Architectures, Applications and Tools (SSAAT);

E) Multimedia Systems and Applications (MSA);

F) Computer Networks, Mobility and Pervasive Systems (CNMPS);

G) Intelligent and Decision Support Systems (IDSS);

H) Big Data Analytics and Applications (BDAA);

I) Human-Computer Interaction (HCI);

J) Ethics, Computers and Security (ECS)

K) Health Informatics (HIS);

L) Information Technologies in Education (ITE);

M) Media, Applied Technology and Communication (MATC).



Submission and Decision

Submitted papers written in English (up to a 10-page limit) must comply with the 
format of the Lecture Notes in Networks and Systems series (see the Instructions 
for Authors on the Springer website), must not have been published before, must 
not be under review for any other conference or publication, and must not 
include any information leading to the authors’ identification. Therefore, the 
authors’ names and affiliations should not be included in the version for 
evaluation by the Scientific Committee. This information should only be included 
in the camera-ready version, saved in Word or LaTeX format and also in PDF 
format. These files must be accompanied by the filled-out Consent to Publish 
form, in a ZIP file, and uploaded to the conference management system.

Submitted papers written in Spanish or Portuguese (up to a 15-page limit) must 
comply with the format of RISTI - Revista Ibérica de Sistemas e Tecnologias de 
Informação (download the instructions/template for authors in Spanish or 
Portuguese), must not have been published before, must not be under review for 
any other conference or publication, and must not include any information 
leading to the authors’ identification. Therefore, the authors’ names and 
affiliations should not be included in the version for evaluation by the 
Scientific Committee. This information should only be included in the 
camera-ready version, saved in Word format. These files must be uploaded to the 
conference management system in a ZIP file.

All papers will be subjected to a “double-blind review” by at least two members 
of the Scientific Committee.

Based on the Scientific Committee evaluation, a paper can be rejected or 
accepted by the Conference Chairs. In the latter case, it can be accepted as a 
paper or a poster.

The authors of papers accepted as posters must build and print a poster to be 
exhibited during the Conference. This poster must follow an A1 or A2 vertical 
format. The Conference can include Work Sessions where these posters are 
presented and orally discussed, with a 7-minute limit per poster.

The authors of accepted papers will have 15 minutes to present their work in a 
Conference Work Session; approximately 5 minutes of discussion will follow each 
presentation.



Publication and Indexing

Papers accepted as posters are not published; they are only exhibited, 
presented and discussed during the conference.

To ensure that a paper accepted as a paper is published, at least one of the 
authors must be fully registered by the 4th of November 2022, and the paper 
must comply with the suggested layout and page limit. Additionally, all 
recommended changes must be addressed by the authors before they submit the 
camera-ready version.

No more than one paper per registration will be published. An extra fee must be 
paid for publication of additional papers, with a maximum of one additional 
paper per registration. One registration permits only the participation of 

Re: [PATCH v2 03/19] fs: Add aops->migrate_folio

2022-06-09 Thread Matthew Wilcox
On Thu, Jun 09, 2022 at 02:50:20PM +0200, David Hildenbrand wrote:
> On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> > diff --git a/Documentation/filesystems/locking.rst 
> > b/Documentation/filesystems/locking.rst
> > index c0fe711f14d3..3d28b23676bd 100644
> > --- a/Documentation/filesystems/locking.rst
> > +++ b/Documentation/filesystems/locking.rst
> > @@ -253,7 +253,8 @@ prototypes::
> > void (*free_folio)(struct folio *);
> > int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
> > bool (*isolate_page) (struct page *, isolate_mode_t);
> > -   int (*migratepage)(struct address_space *, struct page *, struct page 
> > *);
> > +   int (*migrate_folio)(struct address_space *, struct folio *dst,
> > +   struct folio *src, enum migrate_mode);
> > void (*putback_page) (struct page *);
> 
> isolate_page/putback_page are leftovers from the previous patch, no?

Argh, right, I completely forgot I needed to update the documentation in
that patch.

> > +++ b/Documentation/vm/page_migration.rst
> > @@ -181,22 +181,23 @@ which are function pointers of struct 
> > address_space_operations.
> > Once page is successfully isolated, VM uses page.lru fields so driver
> > shouldn't expect to preserve values in those fields.
> >  
> > -2. ``int (*migratepage) (struct address_space *mapping,``
> > -|  ``struct page *newpage, struct page *oldpage, enum migrate_mode);``
> > -
> > -   After isolation, VM calls migratepage() of driver with the isolated 
> > page.
> > -   The function of migratepage() is to move the contents of the old page 
> > to the
> > -   new page
> > -   and set up fields of struct page newpage. Keep in mind that you should
> > -   indicate to the VM the oldpage is no longer movable via 
> > __ClearPageMovable()
> > -   under page_lock if you migrated the oldpage successfully and returned
> > -   MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, 
> > driver
> > -   can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short 
> > time
> > -   because VM interprets -EAGAIN as "temporary migration failure". On 
> > returning
> > -   any error except -EAGAIN, VM will give up the page migration without
> > -   retrying.
> > -
> > -   Driver shouldn't touch the page.lru field while in the migratepage() 
> > function.
> > +2. ``int (*migrate_folio) (struct address_space *mapping,``
> > +|  ``struct folio *dst, struct folio *src, enum migrate_mode);``
> > +
> > +   After isolation, VM calls the driver's migrate_folio() with the
> > +   isolated folio.  The purpose of migrate_folio() is to move the contents
> > +   of the source folio to the destination folio and set up the fields
> > +   of destination folio.  Keep in mind that you should indicate to the
> > +   VM the source folio is no longer movable via __ClearPageMovable()
> > +   under folio if you migrated the source successfully and returned
> > +   MIGRATEPAGE_SUCCESS.  If driver cannot migrate the folio at the
> > +   moment, driver can return -EAGAIN. On -EAGAIN, VM will retry folio
> > +   migration in a short time because VM interprets -EAGAIN as "temporary
> > +   migration failure".  On returning any error except -EAGAIN, VM will
> > +   give up the folio migration without retrying.
> > +
> > +   Driver shouldn't touch the folio.lru field while in the migrate_folio()
> > +   function.
> >  
> >  3. ``void (*putback_page)(struct page *);``
> 
> Hmm, here it's a bit more complicated now, because we essentially have
> two paths: LRU+migrate_folio or !LRU+movable_ops
> (isolate/migrate/putback page)

Oh ... actually, this is just documenting the driver side of things.
I don't really like how it's written.  Here, have some rewritten
documentation (which is now part of the previous patch):

+++ b/Documentation/vm/page_migration.rst
@@ -152,110 +152,15 @@ Steps:
 Non-LRU page migration
======================

-Although migration originally aimed for reducing the latency of memory accesses
-for NUMA, compaction also uses migration to create high-order pages.
+Although migration originally aimed for reducing the latency of memory
+accesses for NUMA, compaction also uses migration to create high-order
+pages.  For compaction purposes, it is also useful to be able to move
+non-LRU pages, such as zsmalloc and virtio-balloon pages.

-Current problem of the implementation is that it is designed to migrate only
-*LRU* pages. However, there are potential non-LRU pages which can be migrated
-in drivers, for example, zsmalloc, virtio-balloon pages.
-
-For virtio-balloon pages, some parts of migration code path have been hooked
-up and added virtio-balloon specific functions to intercept migration logics.
-It's too specific to a driver so other drivers who want to make their pages
-movable would have to add their own specific hooks in the migration path.
-
-To overcome the problem, VM supports non-LRU page migration which provides
-generic functions for non-LRU movable pages without driver

Re: [PATCH] fuse: allow skipping abort interface for virtiofs

2022-06-09 Thread Vivek Goyal
On Wed, Jun 08, 2022 at 09:57:51PM +0800, Yongji Xie wrote:
> On Wed, Jun 8, 2022 at 8:44 PM Vivek Goyal  wrote:
> >
> > On Wed, Jun 08, 2022 at 04:42:46PM +0800, Yongji Xie wrote:
> > > On Wed, Jun 8, 2022 at 3:34 AM Vivek Goyal  wrote:
> > > >
> > > > On Tue, Jun 07, 2022 at 07:05:04PM +0800, Xie Yongji wrote:
> > > > > The commit 15c8e72e88e0 ("fuse: allow skipping control
> > > > > interface and forced unmount") tries to remove the control
> > > > > interface for virtio-fs since it does not support aborting
> > > > > requests which are being processed. But it doesn't work now.
> > > >
> > > > Aha.., so "no_control" basically has no effect? I was looking at
> > > > the code and did not find anybody using "no_control", and I was
> > > > wondering who was making use of the "no_control" variable.
> > > >
> > > > I mounted virtiofs and noticed a directory named "40" showed up
> > > > under /sys/fs/fuse/connections/. That must be belonging to
> > > > virtiofs instance, I am assuming.
> > > >
> > >
> > > I think so.
> > >
> > > > BTW, if there are multiple fuse connections, how will one figure
> > > > out which directory belongs to which instance? Because without knowing
> > > > that, one will be shooting in the dark while trying to read/write any
> > > > of the control files.
> > > >
> > >
> > > We can use "stat $mountpoint" to get the device minor ID which is the
> > > name of the corresponding control directory.
> > >
> > > > So I think a separate patch should be sent which just gets rid of
> > > > "no_control", saying nobody uses it.
> > > >
> > >
> > > OK.
> > >
> > > > >
> > > > > This commit fixes the bug, but only remove the abort interface
> > > > > instead since other interfaces should be useful.
> > > >
> > > > Hmm.., so writing to the "abort" file is bad, as it ultimately does:
> > > >
> > > > fc->connected = 0;
> > > >
> > >
> > > Another problem is that it might trigger UAF since
> > > virtio_fs_request_complete() doesn't know the requests are aborted.
> > >
> > > > So getting rid of this file until we support aborting the pending
> > > > requests properly makes sense.
> > > >
> > > > I think this probably should be a separate patch which explains
> > > > why adding "no_abort_control" is a good idea.
> > > >
> > >
> > > OK.
> >
> > BTW, which particular knob are you finding useful in the control filesystem
> > for virtiofs? As you mentioned, "abort" we will not implement. "waiting"
> > might not have much significance either, because requests are handed
> > over to virtiofs immediately, and if they can't be sent to the server
> > (because the virtqueue is full) they are queued internally and the fuse
> > layer will not know about them.
> >
> 
> Couldn't it be used to check the in-flight I/O for virtiofs?

Actually I might be wrong. It probably should work. Looking at the
implementation:

fuse_conn_waiting_read() looks at fc->num_waiting to figure out
how many requests are in flight.

And either fuse_get_req()/fuse_simple_request() will bump up the
fc->num_waiting count and fuse_put_request() will drop that count
once the request completes. And this seems to be independent of
virtiofs.

So looks like it should work even with virtiofs. Please give it a try.

> 
> > That leaves us with "congestion_threshold" and "max_background".
> > max_background seems to control how many background requests can be
> > submitted at a time. That probably can be useful if the server is overwhelmed
> > and we want to slow down the client a bit.
> >
> > Not sure about congestion threshold.
> >
> > So have you found some knob useful for your use case?
> >
> 
> Since it doesn't do harm to the system, I think it would be better to
> just keep it as it is. Maybe some fuse users can make use of it.

I guess fair enough. I don't mind creating the "control" filesystem for
virtiofs. Either we don't create the "abort" file, or maybe writing to
the file returns an error. I guess both solutions should work.

Thanks
Vivek



Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-09 Thread Petr Mladek via Virtualization
On Thu 2022-06-09 12:02:04, Peter Zijlstra wrote:
> On Thu, Jun 09, 2022 at 11:16:46AM +0200, Petr Mladek wrote:
> > On Wed 2022-06-08 16:27:47, Peter Zijlstra wrote:
> > > The problem, per commit fc98c3c8c9dc ("printk: use rcuidle console
> > > tracepoint"), was printk usage from the cpuidle path where RCU was
> > > already disabled.
> > > 
> > Does this "prevent" calling printk() in a safe way in code with
> > RCU disabled?
> 
> On x86_64, yes. Other architectures, less so.
> 
> Specifically, the objtool noinstr validation pass will warn at build
> time (DEBUG_ENTRY=y) if any noinstr/cpuidle code does a call to
> non-vetted code like printk().
> 
> At the same time; there's a few hacks that allow WARN to work, but
> mostly if you hit WARN in entry/noinstr you get to keep the pieces in
> any case.
> 
> On other architecture we'll need to rely on runtime coverage with
> PROVE_RCU. That is, if a splat like in the above mentioned commit
> happens again, we'll need to fix it by adjusting the callchain, not by
> mucking about with RCU state.

Makes sense. Feel free to use for this patch:

Acked-by: Petr Mladek 

> > Therefore if this patch allows removing some tricky tracing
> > code then it might be worth it. But if the trace_console_rcuidle()
> > variant is still going to be available then I would keep using it.
> 
> My ultimate goal is to delete trace_.*_rcuidle() and RCU_NONIDLE()
> entirely. We're close, but not quite there yet.

I keep my fingers crossed.

Best Regards,
Petr


Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-09 Thread Petr Mladek via Virtualization
On Thu 2022-06-09 20:30:58, Sergey Senozhatsky wrote:
> My emails are getting rejected... Let me try web-interface

Bad day for mail sending. I have problems as well ;-)

> Kudos to Petr for the questions and thanks to PeterZ for the answers.
> 
> On Thu, Jun 9, 2022 at 7:02 PM Peter Zijlstra  wrote:
> > This is the tracepoint used to spool all of printk into ftrace, I
> > suspect there's users, but I haven't used it myself.
> 
> I'm somewhat curious whether we can actually remove that trace event.

Good question.

Well, I think that it might be useful. It allows seeing trace and
printk messages together.

It was ugly when it was in the console code. The new location
in vprintk_store() even allows it to be sorted "correctly" (by
timestamp) against other tracing messages.

Best Regards,
Petr


Re: [PATCH v2 03/19] fs: Add aops->migrate_folio

2022-06-09 Thread David Hildenbrand
On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> Provide a folio-based replacement for aops->migratepage.  Update the
> documentation to document migrate_folio instead of migratepage.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  Documentation/filesystems/locking.rst |  5 ++--
>  Documentation/filesystems/vfs.rst | 13 ++-
>  Documentation/vm/page_migration.rst   | 33 ++-
>  include/linux/fs.h|  4 +++-
>  mm/compaction.c   |  4 +++-
>  mm/migrate.c  | 11 +
>  6 files changed, 40 insertions(+), 30 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst 
> b/Documentation/filesystems/locking.rst
> index c0fe711f14d3..3d28b23676bd 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -253,7 +253,8 @@ prototypes::
>   void (*free_folio)(struct folio *);
>   int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>   bool (*isolate_page) (struct page *, isolate_mode_t);
> - int (*migratepage)(struct address_space *, struct page *, struct page 
> *);
> + int (*migrate_folio)(struct address_space *, struct folio *dst,
> + struct folio *src, enum migrate_mode);
>   void (*putback_page) (struct page *);

isolate_page/putback_page are leftovers from the previous patch, no?

>   int (*launder_folio)(struct folio *);
>   bool (*is_partially_uptodate)(struct folio *, size_t from, size_t 
> count);
> @@ -281,7 +282,7 @@ release_folio:yes
>  free_folio:  yes
>  direct_IO:
>  isolate_page:yes
> -migratepage: yes (both)
> +migrate_folio:   yes (both)
>  putback_page:yes

Ditto.

>  launder_folio:   yes
>  is_partially_uptodate:   yes
> diff --git a/Documentation/filesystems/vfs.rst 
> b/Documentation/filesystems/vfs.rst
> index a08c652467d7..3ae1b039b03f 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -740,7 +740,8 @@ cache in your filesystem.  The following members are 
> defined:
>   /* isolate a page for migration */
>   bool (*isolate_page) (struct page *, isolate_mode_t);
>   /* migrate the contents of a page to the specified target */
> - int (*migratepage) (struct page *, struct page *);
> + int (*migrate_folio)(struct mapping *, struct folio *dst,
> + struct folio *src, enum migrate_mode);
>   /* put migration-failed page back to right list */
>   void (*putback_page) (struct page *);

Ditto.

>   int (*launder_folio) (struct folio *);
> @@ -935,12 +936,12 @@ cache in your filesystem.  The following members are 
> defined:
>   is successfully isolated, VM marks the page as PG_isolated via
>   __SetPageIsolated.
>  
> -``migrate_page``
> +``migrate_folio``
>   This is used to compact the physical memory usage.  If the VM
> - wants to relocate a page (maybe off a memory card that is
> - signalling imminent failure) it will pass a new page and an old
> - page to this function.  migrate_page should transfer any private
> - data across and update any references that it has to the page.
> + wants to relocate a folio (maybe from a memory device that is
> + signalling imminent failure) it will pass a new folio and an old
> + folio to this function.  migrate_folio should transfer any private
> + data across and update any references that it has to the folio.
>  
>  ``putback_page``
>   Called by the VM when isolated page's migration fails.

Ditto.

> diff --git a/Documentation/vm/page_migration.rst 
> b/Documentation/vm/page_migration.rst
> index 8c5cb8147e55..e0f73ddfabb1 100644
> --- a/Documentation/vm/page_migration.rst
> +++ b/Documentation/vm/page_migration.rst
> @@ -181,22 +181,23 @@ which are function pointers of struct 
> address_space_operations.
> Once page is successfully isolated, VM uses page.lru fields so driver
> shouldn't expect to preserve values in those fields.
>  
> -2. ``int (*migratepage) (struct address_space *mapping,``
> -|``struct page *newpage, struct page *oldpage, enum migrate_mode);``
> -
> -   After isolation, VM calls migratepage() of driver with the isolated page.
> -   The function of migratepage() is to move the contents of the old page to 
> the
> -   new page
> -   and set up fields of struct page newpage. Keep in mind that you should
> -   indicate to the VM the oldpage is no longer movable via 
> __ClearPageMovable()
> -   under page_lock if you migrated the oldpage successfully and returned
> -   MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, 
> driver
> -   can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short 
> time
> -   because VM interprets -EAGAIN as "temporary migration failure". On returning
> -   any error except -EAGAIN, VM will give up the page migration without
> -   retrying.

Re: [PATCH v2 01/19] secretmem: Remove isolate_page

2022-06-09 Thread David Hildenbrand
On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> The isolate_page operation is never called for filesystems, only
> for device drivers which call SetPageMovable.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/secretmem.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 206ed6b40c1d..1c7f1775b56e 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -133,11 +133,6 @@ static const struct file_operations secretmem_fops = {
>   .mmap   = secretmem_mmap,
>  };
>  
> -static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
> -{
> - return false;
> -}
> -
>  static int secretmem_migratepage(struct address_space *mapping,
>struct page *newpage, struct page *page,
>enum migrate_mode mode)
> @@ -155,7 +150,6 @@ const struct address_space_operations secretmem_aops = {
>   .dirty_folio= noop_dirty_folio,
>   .free_folio = secretmem_free_folio,
>   .migratepage= secretmem_migratepage,
> - .isolate_page   = secretmem_isolate_page,
>  };
>  
>  static int secretmem_setattr(struct user_namespace *mnt_userns,

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 02/19] mm: Convert all PageMovable users to movable_operations

2022-06-09 Thread David Hildenbrand
On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> These drivers are rather uncomfortably hammered into the
> address_space_operations hole.  They aren't filesystems and don't behave
> like filesystems.  They just need their own movable_operations structure,
> which we can point to directly from page->mapping.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  arch/powerpc/platforms/pseries/cmm.c |  60 +---
>  drivers/misc/vmw_balloon.c   |  61 +---
>  drivers/virtio/virtio_balloon.c  |  47 +---
>  include/linux/balloon_compaction.h   |   6 +-
>  include/linux/fs.h   |   2 -
>  include/linux/migrate.h  |  26 +--
>  include/linux/page-flags.h   |   2 +-
>  include/uapi/linux/magic.h   |   4 --
>  mm/balloon_compaction.c  |  10 ++-
>  mm/compaction.c  |  29 
>  mm/migrate.c |  24 +++
>  mm/util.c|   4 +-
>  mm/z3fold.c  |  82 +++--
>  mm/zsmalloc.c| 102 ++-
>  14 files changed, 94 insertions(+), 365 deletions(-)

You probably should have cc'ed the relevant maintainers (including me :P ).

For everything except z3fold.c and zsmalloc.c,

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-09 Thread Petr Mladek via Virtualization
Sending again. The previous attempt was rejected by several
recipients. It was caused by mail server changes on my side.

I am sorry for spamming those who already got the 1st mail.

On Wed 2022-06-08 16:27:47, Peter Zijlstra wrote:
> The problem, per commit fc98c3c8c9dc ("printk: use rcuidle console
> tracepoint"), was printk usage from the cpuidle path where RCU was
> already disabled.
> 
> Per the patches earlier in this series, this is no longer the case.

My understanding is that this series greatly reduces the amount
of code called with RCU disabled. As a result, the particular printk()
call mentioned by commit fc98c3c8c9dc ("printk: use rcuidle console
tracepoint") is now called with RCU enabled. Hence this particular
problem is now fixed in a better way.

But is this true in general?
Does this "prevent" calling printk() in a safe way in code with
RCU disabled?

I am not sure if anyone cares. printk() is a best-effort
functionality because of the console code anyway. Also, I wonder
if anyone uses this trace_console().

Therefore, if this patch allows removing some tricky tracing
code, then it might be worth it. But if the trace_console_rcuidle()
variant is still going to be available, then I would keep using it.

Best Regards,
Petr

> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/printk/printk.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2238,7 +2238,7 @@ static u16 printk_sprint(char *text, u16
>   }
>   }
>  
> - trace_console_rcuidle(text, text_len);
> + trace_console(text, text_len);
>  
>   return text_len;
>  }
> 


Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-09 Thread Peter Zijlstra
On Thu, Jun 09, 2022 at 11:16:46AM +0200, Petr Mladek wrote:
> On Wed 2022-06-08 16:27:47, Peter Zijlstra wrote:
> > The problem, per commit fc98c3c8c9dc ("printk: use rcuidle console
> > tracepoint"), was printk usage from the cpuidle path where RCU was
> > already disabled.
> > 
> > Per the patches earlier in this series, this is no longer the case.
> 
> My understanding is that this series greatly reduces the amount
> of code called with RCU disabled. As a result, the particular printk()
> call mentioned by commit fc98c3c8c9dc ("printk: use rcuidle console
> tracepoint") is now called with RCU enabled. Hence this particular
> problem is now fixed in a better way.
> 
> But is this true in general?
> Does this "prevent" calling printk() in a safe way in code with
> RCU disabled?

On x86_64, yes. Other architectures, less so.

Specifically, the objtool noinstr validation pass will warn at build
time (DEBUG_ENTRY=y) if any noinstr/cpuidle code does a call to
non-vetted code like printk().

At the same time, there are a few hacks that allow WARN to work, but
mostly if you hit WARN in entry/noinstr you get to keep the pieces in
any case.

On other architecture we'll need to rely on runtime coverage with
PROVE_RCU. That is, if a splat like in the above mentioned commit
happens again, we'll need to fix it by adjusting the callchain, not by
mucking about with RCU state.

> I am not sure if anyone cares. printk() is a best-effort
> functionality because of the console code anyway. Also, I wonder
> if anyone uses this trace_console().

This is the tracepoint used to spool all of printk into ftrace. I
suspect there are users, but I haven't used it myself.

> Therefore, if this patch allows removing some tricky tracing
> code, then it might be worth it. But if the trace_console_rcuidle()
> variant is still going to be available, then I would keep using it.

My ultimate goal is to delete trace_.*_rcuidle() and RCU_NONIDLE()
entirely. We're close, but not quite there yet.


Re: [PATCH 33/36] cpuidle,omap3: Use WFI for omap3_pm_idle()

2022-06-09 Thread Peter Zijlstra
On Thu, Jun 09, 2022 at 10:39:22AM +0300, Tony Lindgren wrote:
> * Arnd Bergmann  [220608 18:18]:
> > On Wed, Jun 8, 2022 at 4:27 PM Peter Zijlstra  wrote:
> > >
> > > arch_cpu_idle() is a very simple idle interface and exposes only a
> > > single idle state and is expected to not require RCU and not do any
> > > tracing/instrumentation.
> > >
> > > As such, omap_sram_idle() is not a valid implementation. Replace it
> > > with the simple (shallow) omap3_do_wfi() call. Leaving the more
> > > complicated idle states for the cpuidle driver.
> 
> Agreed it makes sense to limit deeper idle states to cpuidle. Hopefully
there is some informative splat for attempting to use arch_cpu_idle()
> for deeper idle states :)

The arch_cpu_idle() interface doesn't allow one to express a desire for
deeper states. I'm not sure how anyone could even attempt this.

But given what OMAP needs to go deeper, this would involve things that
require RCU; combine that with the follow-up patches that rip out all
the trace_.*_rcuidle() hackery from the power and clock domain code,
and PROVE_RCU should scream if anybody were to attempt it.


Re: [PATCH 33/36] cpuidle,omap3: Use WFI for omap3_pm_idle()

2022-06-09 Thread Peter Zijlstra
On Wed, Jun 08, 2022 at 06:28:33PM +0200, Arnd Bergmann wrote:
> On Wed, Jun 8, 2022 at 4:27 PM Peter Zijlstra  wrote:
> >
> > arch_cpu_idle() is a very simple idle interface and exposes only a
> > single idle state and is expected to not require RCU and not do any
> > tracing/instrumentation.
> >
> > As such, omap_sram_idle() is not a valid implementation. Replace it
> > with the simple (shallow) omap3_do_wfi() call. Leaving the more
> > complicated idle states for the cpuidle driver.
> >
> > Signed-off-by: Peter Zijlstra (Intel) 
> 
> I see similar code in omap2:
> 
> omap2_pm_idle()
>  -> omap2_enter_full_retention()
>  -> omap2_sram_suspend()
> 
> Is that code path safe to use without RCU or does it need a similar change?

It needs a similar change; clearly I was running on fumes not to have
found that when grepping around the omap code :/




Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive

2022-06-09 Thread Stefano Garzarella

Hi Arseniy,
I left some comments in the patches, and I'm adding something also here:

On Fri, Jun 03, 2022 at 05:27:56AM +, Arseniy Krasnov wrote:

 INTRODUCTION

Hello, this is an experimental implementation of virtio vsock zerocopy
receive. It was inspired by the TCP zerocopy receive by Eric Dumazet. This API
uses the same idea: call 'mmap()' on the socket's descriptor, then every
'getsockopt()' fills the provided vma area with pages of virtio RX buffers.
After the received data has been processed by the user, the pages must be freed
by a 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't call
'madvise()', the next 'getsockopt()' will fail).


If it is not too time-consuming, can we have a table/list to compare this
with the TCP zerocopy?




DETAILS

Here is exactly how the mapping looks: the first page of the mapping
contains an array of trimmed virtio vsock packet headers (each contains only
the length of the data on the corresponding page and a 'flags' field):

struct virtio_vsock_usr_hdr {
uint32_t length;
uint32_t flags;
uint32_t copy_len;
};

The 'length' field lets the user know the exact size of the payload within each
sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags (such
as message bounds or record bounds). The 'copy_len' field is described below in
the 'v1->v2' part. All other pages are data pages from the RX queue.

Page 0  Page 1  Page N

[ hdr1 .. hdrN ][ data ] .. [ data ]
  ||   ^   ^
  ||   |   |
  |*---*
  ||
  ||
  **

Of course, a single header could represent an array of pages (when the
packet's buffer is bigger than one page). So here is an example of a detailed
mapping layout for some set of packets. Let's consider that we have the
following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages
0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is large
enough).


In order to have a "userspace polling-friendly approach" and reduce the
number of syscalls, can we allow, for example, the userspace to mmap at
least the first header before packets arrive?
Then the userspace can poll a flag or other fields in the header to
understand that there are new packets.


That would be cool, but in the meantime it would be nice to behave 
similarly to TCP, which is why the comparison table I mentioned earlier 
would be useful.




Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
Page 1: [ 56 ]
Page 2: [ 4096 ]
Page 3: [ 4096 ]
Page 4: [ 4096 ]
Page 5: [ 8 ]

Page 0 contains only the array of headers:
'hdr0' has 56 in its length field.
'hdr1' has 4096 in its length field.
'hdr2' has 8200 in its length field.
'hdr3' has 0 in its length field (this is the end-of-data marker).

Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
Page 2 corresponds to 'hdr1' and filled with data.
Page 3 corresponds to 'hdr2' and filled with data.
Page 4 corresponds to 'hdr2' and filled with data.
Page 5 corresponds to 'hdr2' and has only 8 bytes of data.

This patchset also changes the packet allocation scheme: today's
implementation uses only 'kmalloc()' to create the data buffer. A problem
happens when we try to map such buffers into a user's vma - the kernel forbids
mapping slab pages into a user's vma (as pages of "not large" 'kmalloc()'
allocations are marked with the PageSlab flag, and "not large" can still be
bigger than one page). So, to avoid this, data buffers are now allocated using
the 'alloc_pages()' call.

  TESTS

This patchset updates the 'vsock_test' utility: two tests for the new
feature were added. The first test covers invalid cases; the second checks the
valid transmission case.

   BENCHMARKING

For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
client/server mode. When a client connects to the server, the server starts
sending an exact amount of data to the client (the amount is set as an input
argument). The client reads the data and waits for the next portion of it. The
client works in two modes: copy and zero-copy. In copy mode the client uses the
'read()' call, while in zerocopy mode the sequence 'mmap()'/'getsockopt()'/
'madvise()' is used. A smaller transmission time is better. For the server, we
can set the size of the tx buffer, and for the client we can set the size of
the rx buffer or the rx mapping size (in zerocopy mode). Usage of this utility
is quite simple:

For client mode:

./rx_zerocopy --mode client [--zerocopy] [--rx]

For server mode:

./rx_zerocopy --mode server [--mb] [--tx]

[--mb] sets number of megabytes to transfer.
[--rx] sets size of receive buffer/mapping in pages.
[--tx] sets size of transmit buffer in pages.

I checked the transmission of 4000 MB of data. Here are some results:


Re: [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback

2022-06-09 Thread Stefano Garzarella

On Fri, Jun 03, 2022 at 05:37:48AM +, Arseniy Krasnov wrote:

This adds a transport callback which processes the rx
queue of a socket and, instead of copying data to the
user-provided buffer, inserts the data pages of
each packet into the user's vm area.

Signed-off-by: Arseniy Krasnov 
---
include/linux/virtio_vsock.h|   4 +
include/uapi/linux/virtio_vsock.h   |   6 +
net/vmw_vsock/virtio_transport_common.c | 208 +++-
3 files changed, 215 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index d02cb7aa922f..47a68a2ea838 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -51,6 +51,7 @@ struct virtio_vsock_pkt {
bool reply;
bool tap_delivered;
bool slab_buf;
+   bool split;
};

struct virtio_vsock_pkt_info {
@@ -131,6 +132,9 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr);
bool virtio_transport_dgram_allow(u32 cid, u32 port);

+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+ struct vm_area_struct *vma,
+ unsigned long addr);
int virtio_transport_connect(struct vsock_sock *vsk);

int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
diff --git a/include/uapi/linux/virtio_vsock.h 
b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..6775c6c44b5b 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -66,6 +66,12 @@ struct virtio_vsock_hdr {
__le32  fwd_cnt;
} __attribute__((packed));

+struct virtio_vsock_usr_hdr {
+   u32 flags;
+   u32 len;
+   u32 copy_len;
+} __attribute__((packed));
+
enum virtio_vsock_type {
VIRTIO_VSOCK_TYPE_STREAM = 1,
VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 278567f748f2..3a3e84176c75 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -12,6 +12,7 @@
#include 
#include 
#include 
+#include 
#include 

#include 
@@ -347,6 +348,196 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
return err;
}

+#define MAX_PAGES_TO_MAP 256
+
+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+   struct virtio_vsock_sock *vvs = vsk->trans;
+   struct virtio_vsock_usr_hdr *usr_hdr_buffer;
+   unsigned long max_pages_to_insert;
+   unsigned long tmp_pages_inserted;
+   unsigned long pages_to_insert;
+   struct page *usr_hdr_page;
+   unsigned long vma_size;
+   struct page **pages;
+   int max_vma_pages;
+   int max_usr_hdrs;
+   int res;
+   int err;
+   int i;
+
+   /* Only use VMA from first page. */
+   if (vma->vm_start != addr)
+   return -EFAULT;
+
+   vma_size = vma->vm_end - vma->vm_start;
+
+   /* Too small vma(at least one page for headers
+* and one page for data).
+*/
+   if (vma_size < 2 * PAGE_SIZE)
+   return -EFAULT;
+
+   /* Page for meta data. */
+   usr_hdr_page = alloc_page(GFP_KERNEL);


I think all these checks should be done in af_vsock.c.

It would be nice to avoid having every transport reimplement the same
thing, and especially to ensure that all transports have the same behavior.


If you can, it would be nice to have the transports return an array of
pages to map, and let af_vsock handle it and the usr_hdr_page.


Do you think it's doable?


+
+   if (!usr_hdr_page)
+   return -EFAULT;
+
+   pages = kmalloc_array(MAX_PAGES_TO_MAP, sizeof(pages[0]), GFP_KERNEL);
+
+   if (!pages)
+   return -EFAULT;
+
+   pages[pages_to_insert++] = usr_hdr_page;
+
+   usr_hdr_buffer = page_to_virt(usr_hdr_page);
+
+   err = 0;
+
+   /* As we use first page for headers, so total number of
+* pages for user is min between number of headers in
+* first page and size of vma(in pages, except first page).
+*/
+   max_usr_hdrs = PAGE_SIZE / sizeof(*usr_hdr_buffer);
+   max_vma_pages = (vma_size / PAGE_SIZE) - 1;
+   max_pages_to_insert = min(max_usr_hdrs, max_vma_pages);
+
+   if (max_pages_to_insert > MAX_PAGES_TO_MAP)
+   max_pages_to_insert = MAX_PAGES_TO_MAP;
+
+   spin_lock_bh(&vvs->rx_lock);
+
+   while (!list_empty(&vvs->rx_queue) &&
+  pages_to_insert < max_pages_to_insert) {
+   struct virtio_vsock_pkt *pkt;
+   ssize_t rest_data_bytes;
+   size_t moved_data_bytes;
+   unsigned long pg_offs;
+
+   pkt = list_first_entry(&vvs->rx_queue,
+  struct virtio_vsock_pkt, list);
+
+   /* Buffer was allo

Re: [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic

2022-06-09 Thread Stefano Garzarella

On Fri, Jun 03, 2022 at 05:35:48AM +, Arseniy Krasnov wrote:

This:
1) Adds a callback for the 'mmap()' call on a socket. It checks the vm
  area flags and sets the vm area ops.
2) Adds a special 'getsockopt()' case which calls the transport
  zerocopy callback. The input argument is the vm area address.
3) Adds 'getsockopt()/setsockopt()' for switching rx
  zerocopy mode on/off.

Signed-off-by: Arseniy Krasnov 
---
include/net/af_vsock.h  |   7 +++
include/uapi/linux/vm_sockets.h |   3 +
net/vmw_vsock/af_vsock.c| 100 
3 files changed, 110 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f742e50207fb..f15f84c648ff 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -135,6 +135,13 @@ struct vsock_transport {
bool (*stream_is_active)(struct vsock_sock *);
bool (*stream_allow)(u32 cid, u32 port);

+   int (*rx_zerocopy_set)(struct vsock_sock *vsk,
+  bool enable);
+   int (*rx_zerocopy_get)(struct vsock_sock *vsk);
+   int (*zerocopy_dequeue)(struct vsock_sock *vsk,
+   struct vm_area_struct *vma,
+   unsigned long addr);
+
/* SEQ_PACKET. */
ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
 int flags);
diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index c60ca33eac59..d1f792bed1a7 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -83,6 +83,9 @@

#define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8

+#define SO_VM_SOCKETS_MAP_RX 9
+#define SO_VM_SOCKETS_ZEROCOPY 10
+
#if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
#define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f04abf662ec6..10061ef21730 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1644,6 +1644,17 @@ static int vsock_connectible_setsockopt(struct socket 
*sock,
}
break;
}
+   case SO_VM_SOCKETS_ZEROCOPY: {
+   if (!transport || !transport->rx_zerocopy_set) {
+   err = -EOPNOTSUPP;
+   } else {
+   COPY_IN(val);
+
+   if (transport->rx_zerocopy_set(vsk, val))
+   err = -EINVAL;
+   }
+   break;
+   }

default:
err = -ENOPROTOOPT;
@@ -1657,6 +1668,48 @@ static int vsock_connectible_setsockopt(struct socket 
*sock,
return err;
}

+static const struct vm_operations_struct afvsock_vm_ops = {
+};
+
+static int vsock_recv_zerocopy(struct socket *sock,
+  unsigned long address)
+{
+   struct sock *sk = sock->sk;
+   struct vsock_sock *vsk = vsock_sk(sk);
+   struct vm_area_struct *vma;
+   const struct vsock_transport *transport;
+   int res;
+
+   transport = vsk->transport;
+
+   if (!transport->rx_zerocopy_get)
+   return -EOPNOTSUPP;
+
+   if (!transport->rx_zerocopy_get(vsk))
+   return -EOPNOTSUPP;


Maybe we can merge these into:
if (!transport->rx_zerocopy_get ||
!transport->rx_zerocopy_get(vsk))
return -EOPNOTSUPP;


+
+   if (!transport->zerocopy_dequeue)
+   return -EOPNOTSUPP;
+
+   lock_sock(sk);
+   mmap_write_lock(current->mm);


So, multiple threads using different sockets are serialized if they use 
zero-copy?


IIUC this is necessary because the callback calls vm_insert_page().

At this point I think it's better not to do this in every transport, but
to have the callback return an array of pages to map, and map them here,
trying to keep the critical section protected by mmap_write_lock() as
small as possible.



+
+   vma = vma_lookup(current->mm, address);
+
+   if (!vma || vma->vm_ops != &afvsock_vm_ops) {
+   mmap_write_unlock(current->mm);
+   release_sock(sk);
+   return -EINVAL;
+   }
+
+   res = transport->zerocopy_dequeue(vsk, vma, address);
+
+   mmap_write_unlock(current->mm);
+   release_sock(sk);
+
+   return res;
+}
+
static int vsock_connectible_getsockopt(struct socket *sock,
int level, int optname,
char __user *optval,
@@ -1701,6 +1754,39 @@ static int vsock_connectible_getsockopt(struct socket 
*sock,
lv = sock_get_timeout(vsk->connect_timeout, &v,
  optname == 
SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
break;
+   case SO_VM_SOCKETS_ZEROCOPY: {
+   const struct vsock_transport *transport;
+   int res;
+
+   transport = vsk->transport;
+
+   if (!tran

Re: [RFC PATCH v2 2/8] vhost/vsock: rework packet allocation logic

2022-06-09 Thread Stefano Garzarella

On Fri, Jun 03, 2022 at 05:33:04AM +, Arseniy Krasnov wrote:

For packets received from the virtio RX queue, use the buddy
allocator instead of 'kmalloc()' to be able to insert
such pages into a user-provided vma. The single call to
'copy_from_iter()' is replaced with a per-page loop.

Signed-off-by: Arseniy Krasnov 
---
drivers/vhost/vsock.c | 81 ---
1 file changed, 69 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e6c9d41db1de..0dc2229f18f7 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -58,6 +58,7 @@ struct vhost_vsock {

u32 guest_cid;
bool seqpacket_allow;
+   bool zerocopy_rx_on;


This is per-device, so a single socket can change the behaviour of all 
the sockets of this device.


Can we do something better?

Maybe we can allocate the header, copy it, find the socket and check if 
zero-copy is enabled or not for that socket.


Of course we should change or extend virtio_transport_recv_pkt() to
avoid finding the socket again.




};

static u32 vhost_transport_get_local_cid(void)
@@ -357,6 +358,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
  unsigned int out, unsigned int in)
{
struct virtio_vsock_pkt *pkt;
+   struct vhost_vsock *vsock;
struct iov_iter iov_iter;
size_t nbytes;
size_t len;
@@ -393,20 +395,75 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
return NULL;
}

-   pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
-   if (!pkt->buf) {
-   kfree(pkt);
-   return NULL;
-   }
-
pkt->buf_len = pkt->len;
+   vsock = container_of(vq->dev, struct vhost_vsock, dev);

-   nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
-   if (nbytes != pkt->len) {
-   vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
-  pkt->len, nbytes);
-   virtio_transport_free_pkt(pkt);
-   return NULL;
+   if (!vsock->zerocopy_rx_on) {
+   pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
+
+   if (!pkt->buf) {
+   kfree(pkt);
+   return NULL;
+   }
+
+   pkt->slab_buf = true;
+   nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
+   if (nbytes != pkt->len) {
+   vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
+   pkt->len, nbytes);
+   virtio_transport_free_pkt(pkt);
+   return NULL;
+   }
+   } else {
+   struct page *buf_page;
+   ssize_t pkt_len;
+   int page_idx;
+
+   /* This creates memory overrun, as we allocate
+* at least one page for each packet.
+*/
+   buf_page = alloc_pages(GFP_KERNEL, get_order(pkt->len));
+
+   if (buf_page == NULL) {
+   kfree(pkt);
+   return NULL;
+   }
+
+   pkt->buf = page_to_virt(buf_page);
+
+   page_idx = 0;
+   pkt_len = pkt->len;
+
+   /* As allocated pages are not mapped, process
+* pages one by one.
+*/
+   while (pkt_len > 0) {
+   void *mapped;
+   size_t to_copy;
+
+   mapped = kmap(buf_page + page_idx);
+
+   if (mapped == NULL) {
+   virtio_transport_free_pkt(pkt);
+   return NULL;
+   }
+
+   to_copy = min(pkt_len, ((ssize_t)PAGE_SIZE));
+
+   nbytes = copy_from_iter(mapped, to_copy, &iov_iter);
+   if (nbytes != to_copy) {
+   vq_err(vq, "Expected %zu byte payload, got %zu 
bytes\n",
+  to_copy, nbytes);
+   kunmap(mapped);
+   virtio_transport_free_pkt(pkt);
+   return NULL;
+   }
+
+   kunmap(mapped);
+
+   pkt_len -= to_copy;
+   page_idx++;
+   }
}

return pkt;
--
2.25.1




Re: [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic

2022-06-09 Thread Stefano Garzarella

On Fri, Jun 03, 2022 at 05:31:00AM +, Arseniy Krasnov wrote:

To support zerocopy receive, the packet's buffer allocation
is changed: for buffers which could be mapped into a user's
vma we can't use 'kmalloc()' (as the kernel forbids mapping
slab pages into a user's vma), so the raw buddy allocator is
now called. But for tx packets (such packets won't be mapped
to the user), the previous 'kmalloc()' way is still used, with
a special flag in the packet's structure which allows us to
distinguish between 'kmalloc()' and raw-page buffers.

Signed-off-by: Arseniy Krasnov 
---
include/linux/virtio_vsock.h| 1 +
net/vmw_vsock/virtio_transport.c| 8 ++--
net/vmw_vsock/virtio_transport_common.c | 9 -
3 files changed, 15 insertions(+), 3 deletions(-)


Each patch should, as much as possible, avoid breaking
bisectability, and here we are not touching vhost_vsock_alloc_pkt(), which
uses kmalloc to allocate the buffer.


I see you updated it in the next patch, and that should be fine, but here
you should set slab_buf in vhost_vsock_alloc_pkt(), or you can merge the
two patches.




diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 35d7eedb5e8e..d02cb7aa922f 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
u32 off;
bool reply;
bool tap_delivered;
+   bool slab_buf;
};

struct virtio_vsock_pkt_info {
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index ad64f403536a..19909c1e9ba3 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock 
*vsock)
vq = vsock->vqs[VSOCK_VQ_RX];

do {
+   struct page *buf_page;
+
pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
if (!pkt)
break;

-   pkt->buf = kmalloc(buf_len, GFP_KERNEL);
-   if (!pkt->buf) {
+   buf_page = alloc_page(GFP_KERNEL);
+
+   if (!buf_page) {
virtio_transport_free_pkt(pkt);
break;
}

+   pkt->buf = page_to_virt(buf_page);
pkt->buf_len = buf_len;
pkt->len = buf_len;

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index ec2c2afbf0d0..278567f748f2 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
if (!pkt->buf)
goto out_pkt;

+   pkt->slab_buf = true;
pkt->buf_len = len;

err = memcpy_from_msg(pkt->buf, info->msg, len);
@@ -1342,7 +1343,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);

void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
{
-   kfree(pkt->buf);
+   if (pkt->buf_len) {
+   if (pkt->slab_buf)
+   kfree(pkt->buf);
+   else
+   free_pages(buf, get_order(pkt->buf_len));
+   }
+
kfree(pkt);
}
EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
--
2.25.1




Re: [PATCH] vdpa: allow vdpa dev_del management operation to return failure

2022-06-09 Thread Jason Wang
On Wed, Jun 8, 2022 at 6:43 PM Parav Pandit  wrote:
>
>
> > From: Dawar, Gautam 
> > Sent: Wednesday, June 8, 2022 6:30 AM
> > To: Jason Wang 
> > Cc: netdev ; linux-net-drivers (AMD-Xilinx)  > net-driv...@amd.com>; Anand, Harpreet ;
> > Michael S. Tsirkin ; Zhu Lingshan
> > ; Xie Yongji ; Eli
> > Cohen ; Parav Pandit ; Si-Wei Liu  > wei@oracle.com>; Stefano Garzarella ; Wan
> > Jiabing ; Dan Carpenter
> > ; virtualization  > foundation.org>; linux-kernel 
> > Subject: RE: [PATCH] vdpa: allow vdpa dev_del management operation to
> > return failure
> >
> > [AMD Official Use Only - General]
> >
> > Hi Gautam:
> > [GD>>] Hi Jason,
> >
> > On Fri, Jun 3, 2022 at 6:34 PM Gautam Dawar 
> > wrote:
> > >
> > > Currently, the vdpa_nl_cmd_dev_del_set_doit() implementation allows
> > > returning a value to depict the operation status but the return type
> > > of dev_del() callback is void. So, any error while deleting the vdpa
> > > device in the vdpa parent driver can't be returned to the management
> > > layer.
> >
> > I wonder under which condition we can hit an error in dev_del()?
> > [GD>>] In the AMD-Xilinx vDPA driver, on receiving vdpa device deletion
> > request, I try to identify if the vdpa device is in use by any virtio-net 
> > driver
> > (through any vdpa bus driver) by looking at the vdpa device status value. In
> > case the vdpa device status is >= VIRTIO_CONFIG_S_DRIVER, -EBUSY is
> > returned.
> > This is to avoid side-effects as noted in
> > https://bugzilla.kernel.org/show_bug.cgi?id=213179 caused by deleting the
> > vdpa device when it is being used.
> > >
> User should be able to delete the device anytime.

It requires a poll event to user space and then Qemu can release the
vhost-vDPA device. This is how VFIO works. We probably need to
implement something like this.

But notice that, in the worst case, userspace may not respond to this
event, so there's nothing more the kernel can do except wait.

We need to consider something different. I used to have an idea to
couple vhost-vdpa loosely with vDPA using SRCU/RCU. We might
consider implementing that.

> Upper layers that are unable to perform the teardown sequence should be fixed.

I think we probably don't need to bother with failing dev_del().
We can consider fixing/working around the waiting first.

Thanks
