BUSINESS PROPOSAL

2021-04-18 Thread Zanina Ivanovna



--
Hello,

Can I write to you here? I have an urgent proposal for you.

Regards,

General Director,
Zanina Ivanovna Lyubova
LLC "PUBLISHING HOUSE" OIL AND GAS ™
119991, Moscow, Leninskiy Avenue 65, bldg 4, office 1706.
E-mail: i...@pliable-oil-gas.ru
Skype: zaninaivanovna
OGRN: 5067746402610 / TIN:7736545221



Re: Proposal With Jeffery & Schulman

2021-04-15 Thread Peter Schulman
-- 
Hello,

I sent you an email before about an inheritance without hearing back from
you. Let me reintroduce my email to you, and please reply
immediately.

My name is Mr. Peter Schulman, the principal attorney at Jeffery &
Schulman Law Office based in Canada. I am contacting you because of my
deceased client, who died leaving some funds with a bank in Europe in
the sum of 9,850,000 USD. I want to name you as an heir/beneficiary
to the fund.

You don't have to worry; this transaction will be done legally without
any problem. If you are interested, I can give you more information
about my proposal as soon as I get your reply.

Best Regards,

Mr. Peter Schulman
Senior Partner | Project Manager | Litigation

Jeffery & Schulman Law Office
1230 Bay Street
Suite 810 Toronto
Ontario, M5R 2A7


Re: Proposal of improvement for DMA - direct passing of hugepages to the SG list

2021-04-14 Thread Tom Rix

On 4/14/21 4:58 AM, w...@ise.pw.edu.pl wrote:

Hi,

I'm working on both DMA engine implementations in FPGA and their Linux
drivers.
Now I need to create an engine that takes a hugepages-backed buffer allocated
by the user-space application and passes it to the device.
My current solution:

https://forums.xilinx.com/t5/Embedded-Linux/How-to-pass-efficiently-the-hugepages-backed-buffer-to-the-BM/m-p/1229340/highlight/true#M49777
or https://stackoverflow.com/a/67065962/1735409

uses the get_user_pages_fast function to create the kernel mapping,
and then uses sg_alloc_table_from_pages to build an sg_table for it.

I have verified that the created sg_table has one entry for each individual
hugepage (so I can efficiently map it for my SG-capable DMA device).

The disadvantage of that solution is that I need to create and keep a huge set
of standard-size pages. Because a "struct page" occupies between 56 and 80
bytes, I get an overhead of up to 80/4096, which is approx. 2%.

The open question is whether I can free those pages as soon as the sg_table
is created. (As far as I know, the hugepages are locked automatically.)
Of course, it is unclear what happens if the application crashes and its mmapped
hugepage-based buffer gets freed. Will the driver be notified about the closing
of the file so that it can disable DMA before that memory can be taken for other
purposes?

To be sure that this doesn't happen, it may be good to keep those pages locked
in the kernel as well.
The big improvement would be if we could have a get_user_hugepages_fast
function. The user should be allowed to create a smaller number of page structs.
The function should check whether the requested region really consists of hugepages
and return an error if it doesn't.

Another important fact is that we don't need a kernel mapping for those pages
at all. So it would be good to have yet another function:
sg_alloc_table_from_user_pages
which should take an additional "flag" argument enabling the user to decide
whether the area used consists of normal pages or of hugepages.

The function should return an error if the flag does not match the
properties of the region. Of course, the function should also lock the pages,
and sg_free_table should unlock them (protecting against the danger
of an application crash that I described above).

As a temporary workaround, is it possible to "manually" walk the pages
backing the application-delivered buffer, verify that they are hugepages,
lock them and create the sg_table?
What functions/macros should be used for that, to ensure that the implemented
solution is portable and keeps working in a reasonable number of future versions
of the Linux kernel?

I'd appreciate comments, suggestions and consideration of the above proposal.

With best regards,
Wojtek


Do you have a trial patch you could RFC?

Are you aware of the Xilinx QDMA patchset?

https://www.xilinx.com/support/answers/71453.html

https://github.com/Xilinx/dma_ip_drivers

Maybe its kernel or DPDK driver has what you need.

Tom





Proposal of improvement for DMA - direct passing of hugepages to the SG list

2021-04-14 Thread wzab

Hi,

I'm working on both DMA engine implementations in FPGA and their Linux
drivers.
Now I need to create an engine that takes a hugepages-backed buffer allocated
by the user-space application and passes it to the device.
My current solution:

https://forums.xilinx.com/t5/Embedded-Linux/How-to-pass-efficiently-the-hugepages-backed-buffer-to-the-BM/m-p/1229340/highlight/true#M49777
or https://stackoverflow.com/a/67065962/1735409

uses the get_user_pages_fast function to create the kernel mapping,
and then uses sg_alloc_table_from_pages to build an sg_table for it.

I have verified that the created sg_table has one entry for each individual
hugepage (so I can efficiently map it for my SG-capable DMA device).
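
For illustration, here is a minimal sketch of that approach (a sketch only;
the function name, error handling and nr_pages computation are examples,
not taken from the actual driver):

---
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Pin a user buffer and build an sg_table from its pages. 'uaddr' and
 * 'len' come from the user-space application (e.g. via an ioctl). */
static int map_user_buffer(unsigned long uaddr, size_t len,
                           struct sg_table *sgt, struct page ***pages_out)
{
        unsigned long first = uaddr >> PAGE_SHIFT;
        unsigned long last = (uaddr + len - 1) >> PAGE_SHIFT;
        int nr_pages = last - first + 1;
        struct page **pages;
        int pinned, ret;

        pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* One struct page pointer per standard-size page, even for hugepages. */
        pinned = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
        if (pinned < 0) {
                ret = pinned;
                goto err_free;
        }
        if (pinned != nr_pages) {
                ret = -EFAULT;
                goto err_unpin;
        }

        /* Physically contiguous pages are merged into one entry, so each
         * hugepage ends up as a single sg entry. */
        ret = sg_alloc_table_from_pages(sgt, pages, nr_pages,
                                        offset_in_page(uaddr), len, GFP_KERNEL);
        if (ret)
                goto err_unpin;

        *pages_out = pages;
        return 0;

err_unpin:
        while (pinned--)
                put_page(pages[pinned]);
err_free:
        kvfree(pages);
        return ret;
}
---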

The disadvantage of that solution is that I need to create and keep a huge set
of standard-size pages. Because a "struct page" occupies between 56 and 80
bytes, I get an overhead of up to 80/4096, which is approx. 2%.

The open question is whether I can free those pages as soon as the sg_table
is created. (As far as I know, the hugepages are locked automatically.)
Of course, it is unclear what happens if the application crashes and its mmapped
hugepage-based buffer gets freed. Will the driver be notified about the closing
of the file so that it can disable DMA before that memory can be taken for other
purposes?

To be sure that this doesn't happen, it may be good to keep those pages locked
in the kernel as well.
The big improvement would be if we could have a get_user_hugepages_fast
function. The user should be allowed to create a smaller number of page structs.
The function should check whether the requested region really consists of hugepages
and return an error if it doesn't.

Another important fact is that we don't need a kernel mapping for those pages
at all. So it would be good to have yet another function:
sg_alloc_table_from_user_pages
which should take an additional "flag" argument enabling the user to decide
whether the area used consists of normal pages or of hugepages.

The function should return an error if the flag does not match the
properties of the region. Of course, the function should also lock the pages,
and sg_free_table should unlock them (protecting against the danger
of an application crash that I described above).
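
A possible shape of the two proposed helpers, purely as a sketch of the
idea (neither function exists in the kernel; the names come from this
proposal and the exact signatures are assumptions):

---
/* Hypothetical: like get_user_pages_fast(), but fills in one struct page
 * pointer per hugepage and fails if the region is not hugepage-backed. */
int get_user_hugepages_fast(unsigned long start, int nr_hugepages,
                            unsigned int gup_flags, struct page **pages);

/* Hypothetical: builds (and pins) an sg_table directly from a user buffer.
 * 'flags' would let the caller state whether the region must consist of
 * normal pages or of hugepages; a mismatch returns an error. The pages
 * would stay locked until sg_free_table() releases them. */
int sg_alloc_table_from_user_pages(struct sg_table *sgt,
                                   unsigned long uaddr, size_t len,
                                   unsigned int flags, gfp_t gfp_mask);
---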

As a temporary workaround, is it possible to "manually" walk the pages
backing the application-delivered buffer, verify that they are hugepages,
lock them and create the sg_table?
What functions/macros should be used for that, to ensure that the implemented
solution is portable and keeps working in a reasonable number of future versions
of the Linux kernel?

I'd appreciate comments, suggestions and consideration of the above proposal.
With best regards,
Wojtek

--
Wojciech M Zabolotny, PhD, DSc
Institute of Electronic Systems
Faculty of Electronics and Information Technology
Warsaw University of Technology


BUSINESS PROPOSAL

2021-04-14 Thread Zanina Ivanovna



--
Hello,

Can I write to you here? I have an urgent proposal.

Regards,

General Director,
Zanina Ivanovna Lyubova
LLC "PUBLISHING HOUSE" OIL AND GAS ™
119991, Moscow, Leninskiy Avenue 65, bldg 4, office 1706.
E-mail: i...@pliable-oil-gas.ru
Skype: zaninaivanovna
OGRN: 5067746402610 / TIN:7736545221


Loan Proposal

2021-04-08 Thread Amir Saeed
Greetings,

I am the investment officer of a UAE-based investment company that is ready to
fund projects outside the UAE, in the form of debt finance. We grant loans to both
corporate and private entities at a low interest rate of 3% per annum.

Thanks
investment officer


Re: Proposal With Jeffery & Schulman

2021-03-25 Thread Peter Schulman
Hello,

I sent you an email before about an inheritance without hearing back from
you. Let me reintroduce my email to you, and please reply
immediately.

My name is Mr. Peter Schulman, the principal attorney at Jeffery &
Schulman Law Office based in Canada. I am contacting you because of my
deceased client, who died leaving some funds with a bank in Europe in
the sum of 9,850,000 USD. I want to name you as an heir/beneficiary
to the fund.

You don't have to worry; this transaction will be done legally without
any problem. If you are interested, I can give you more information
about my proposal as soon as I get your reply.

Best Regards,

Mr. Peter Schulman
Senior Partner | Project Manager | Litigation

Jeffery & Schulman Law Office
1230 Bay Street
Suite 810 Toronto
Ontario, M5R 2A7


CF.- HELLO I HAVE A FINANCIAL PROPOSAL FOR YOU

2020-12-04 Thread Matthew ADE



Business Funding Proposal

2020-12-02 Thread Per Lessen

--
Good Day,

I am contacting you concerning funding of your business project. I am 
Dr. Per Lessen
and I work as a financial consultant and adviser to Global Asset 
Management company LLC
and some High Net worth individuals from the MENA region and European 
Union.


Basically, my principals are most interested in partnership with business ventures such
as Medical and Health care projects, Real estate projects, mining projects,
agricultural projects, renewable energy projects, start-up projects and
business expansions / Loan with lower rate.

I MUST consider you a privileged entity who shall be privy to many
sourceable CLEAN fund meant for foreign investment and you are able to
access up to One Billion Euro or more if your project is viable.

I will wish that you send a copy of your highlighted projects with
elucidated BUSINESS PLANS or EXECUTIVE SUMMARY which I shall propagate for
its immediate approval for a minimal ROI return on investment which you
shall have sole mandate after review and consideration.

There will be the need for a face to face meeting with the investor or His representatives
and have the paperwork perfected and signed.

Kindly reply as soon as possible if you have any viable project that you
will want us to be involved as indicated above.

Thanks
Dr. Per Lessen


Re: [Cocci] Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-20 Thread Alexandru Ardelean
On Fri, Nov 20, 2020 at 3:16 PM Lars-Peter Clausen  wrote:
>
> On 11/20/20 12:54 PM, Alexandru Ardelean wrote:
> > On Fri, Nov 20, 2020 at 12:47 PM Julia Lawall  wrote:
> >>
> >>
> >> On Thu, 19 Nov 2020, Joe Perches wrote:
> >>
> >>> On Thu, 2020-11-19 at 17:16 +0200, Andy Shevchenko wrote:
>  On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
>   wrote:
> > Hey,
> >
> > So, I stumbled on a new check that could be added to checkpatch.
> > Since it's in Perl, I'm reluctant to try it.
> >
> > Seems many drivers got to a point where they now call (let's say)
> > spi_set_drvdata(), but never access that information via
> > spi_get_drvdata().
> > Reasons for this seem to be:
> > 1. They got converted to device-managed functions and there is no
> > longer a remove hook to require the _get_drvdata() access
> > 2. They look like they were copied from a driver that had a
> > _set_drvdata() and when the code got finalized, the _set_drvdata() was
> > omitted
> >
> > There are a few false positives that I can notice at a quick look,
> > like the data being set via some xxx_set_drvdata() and retrieved via a
> > dev_get_drvdata().
>  I can say quite a few. And this makes a difference.
>  So, basically all drivers that are using PM callbacks would rather use
>  dev_get_drvdata() rather than bus specific.
> 
> > I think checkpatch reporting these as well would be acceptable simply
> > from a reviewability perspective.
> >
> > I did a shell script to quickly check these. See below.
> > It's pretty badly written but it is enough for me to gather a list.
> > And I wrote it in 5 minutes :P
> > I initially noticed this in some IIO drivers, and then I suspected
> > that this may be more widespread.
>  It seems more suitable for coccinelle.
> >>> To me as well.
> >> To me as well, since it seems to involve nonlocal information.
> >>
> >> I'm not sure to understand the original shell script. Is there
> >> something interesting about pci_set_drvdata?
> > Ah, it's a stupid script I wrote in 5 minutes, so I did not bother to
> > make things smart.
> > In the text-matching I did in shell, there are some entries that come
> > from comments and docs.
> > It's only about 3-4 entries, so I just did a visual/manual ignore.
> >
> > In essence:
> > The script searches for all strings that contain _set_drvdata.
> > The separators are whitespace.
> > It creates a list of all  _set_drvdata functions.
> > For each _set_drvdata function:
> >  It checks all files that have a _set_drvdata entry, but no
> > _get_drvdata
> >
> > I piped this output into a file and started manually checking the drivers.
> > There is one [I forget which function] that is _set_drvdata() but
> > equivalent is _drvdata()
> >
> > As Andy said, some precautions must be taken in places where
> > _set_drvdata() is called but dev_get_drvdata() is used.
> > Cases like PM suspend/resume calls.
> > And there may be some cases outside this context.
> >
> Doing something like this with coccinelle is fairly easy.
>
> But I'd be very cautious about putting such a script into the kernel. It
> will result in too many false positive drive-by patches. Such a script
> will not detect cases such as:

Yeah, it would probably be a good idea to start with a few simple
checks, then scale it.
If we go for the existing _set_drvdata() / _get_drvdata() pair checks,
there is a risk of breaking things.

>
>   * Driver is split over multiple files. One file does
> ..._set_drvdata(), another does ..._get_drvdata().
>
>   * Framework uses drvdata to exchange data with the driver. E.g. the driver
> is expected to call set_drvdata() and then the framework uses
> get_drvdata() to retrieve the data. This is not a very good pattern, but
> there are some places in the kernel where this is used. I believe for
> example V4L2 uses this.
>
> - Lars
>


Re: [Cocci] Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-20 Thread Lars-Peter Clausen

On 11/20/20 12:54 PM, Alexandru Ardelean wrote:

On Fri, Nov 20, 2020 at 12:47 PM Julia Lawall  wrote:



On Thu, 19 Nov 2020, Joe Perches wrote:


On Thu, 2020-11-19 at 17:16 +0200, Andy Shevchenko wrote:

On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
 wrote:

Hey,

So, I stumbled on a new check that could be added to checkpatch.
Since it's in Perl, I'm reluctant to try it.

Seems many drivers got to a point where they now call (let's say)
spi_set_drvdata(), but never access that information via
spi_get_drvdata().
Reasons for this seem to be:
1. They got converted to device-managed functions and there is no
longer a remove hook to require the _get_drvdata() access
2. They look like they were copied from a driver that had a
_set_drvdata() and when the code got finalized, the _set_drvdata() was
omitted

There are a few false positives that I can notice at a quick look,
like the data being set via some xxx_set_drvdata() and retrieved via a
dev_get_drvdata().

I can say quite a few. And this makes a difference.
So, basically all drivers that are using PM callbacks would rather use
dev_get_drvdata() rather than bus specific.


I think checkpatch reporting these as well would be acceptable simply
from a reviewability perspective.

I did a shell script to quickly check these. See below.
It's pretty badly written but it is enough for me to gather a list.
And I wrote it in 5 minutes :P
I initially noticed this in some IIO drivers, and then I suspected
that this may be more widespread.

It seems more suitable for coccinelle.

To me as well.

To me as well, since it seems to involve nonlocal information.

I'm not sure to understand the original shell script. Is there
something interesting about pci_set_drvdata?

Ah, it's a stupid script I wrote in 5 minutes, so I did not bother to
make things smart.
In the text-matching I did in shell, there are some entries that come
from comments and docs.
It's only about 3-4 entries, so I just did a visual/manual ignore.

In essence:
The script searches for all strings that contain _set_drvdata.
The separators are whitespace.
It creates a list of all  _set_drvdata functions.
For each _set_drvdata function:
 It checks all files that have a _set_drvdata entry, but no
_get_drvdata

I piped this output into a file and started manually checking the drivers.
There is one [I forget which function] that is _set_drvdata() but
equivalent is _drvdata()

As Andy said, some precautions must be taken in places where
_set_drvdata() is called but dev_get_drvdata() is used.
Cases like PM suspend/resume calls.
And there may be some cases outside this context.


Doing something like this with coccinelle is fairly easy.

But I'd be very cautious about putting such a script into the kernel. It 
will result in too many false positive drive-by patches. Such a script 
will not detect cases such as:


 * Driver is split over multiple files. One file does 
..._set_drvdata(), another does ..._get_drvdata().


 * Framework uses drvdata to exchange data with the driver. E.g. the driver
is expected to call set_drvdata() and then the framework uses
get_drvdata() to retrieve the data. This is not a very good pattern, but
there are some places in the kernel where this is used. I believe for
example V4L2 uses this.


- Lars



Re: [Cocci] Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-20 Thread Alexandru Ardelean
On Fri, Nov 20, 2020 at 1:57 PM Julia Lawall  wrote:
>
>
>
> On Fri, 20 Nov 2020, Alexandru Ardelean wrote:
>
> > On Fri, Nov 20, 2020 at 12:47 PM Julia Lawall  wrote:
> > >
> > >
> > >
> > > On Thu, 19 Nov 2020, Joe Perches wrote:
> > >
> > > > On Thu, 2020-11-19 at 17:16 +0200, Andy Shevchenko wrote:
> > > > > On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
> > > > >  wrote:
> > > > > >
> > > > > > Hey,
> > > > > >
> > > > > > So, I stumbled on a new check that could be added to checkpatch.
> > > > > > Since it's in Perl, I'm reluctant to try it.
> > > > > >
> > > > > > Seems many drivers got to a point where they now call (let's say)
> > > > > > spi_set_drvdata(), but never access that information via
> > > > > > spi_get_drvdata().
> > > > > > Reasons for this seem to be:
> > > > > > 1. They got converted to device-managed functions and there is no
> > > > > > longer a remove hook to require the _get_drvdata() access
> > > > > > 2. They look like they were copied from a driver that had a
> > > > > > _set_drvdata() and when the code got finalized, the _set_drvdata() 
> > > > > > was
> > > > > > omitted
> > > > > >
> > > > > > There are a few false positives that I can notice at a quick look,
> > > > > > like the data being set via some xxx_set_drvdata() and retrieved 
> > > > > > via a
> > > > > > dev_get_drvdata().
> > > > >
> > > > > I can say quite a few. And this makes a difference.
> > > > > So, basically all drivers that are using PM callbacks would rather use
> > > > > dev_get_drvdata() rather than bus specific.
> > > > >
> > > > > > I think checkpatch reporting these as well would be acceptable 
> > > > > > simply
> > > > > > from a reviewability perspective.
> > > > > >
> > > > > > I did a shell script to quickly check these. See below.
> > > > > > It's pretty badly written but it is enough for me to gather a list.
> > > > > > And I wrote it in 5 minutes :P
> > > > > > I initially noticed this in some IIO drivers, and then I suspected
> > > > > > that this may be more widespread.
> > > > >
> > > > > It seems more suitable for coccinelle.
> > > >
> > > > To me as well.
> > >
> > > To me as well, since it seems to involve nonlocal information.
> > >
> > > I'm not sure to understand the original shell script. Is there
> > > something interesting about pci_set_drvdata?
> >
> > Ah, it's a stupid script I wrote in 5 minutes, so I did not bother to
> > make things smart.
> > In the text-matching I did in shell, there are some entries that come
> > from comments and docs.
> > It's only about 3-4 entries, so I just did a visual/manual ignore.
> >
> > In essence:
> > The script searches for all strings that contain _set_drvdata.
> > The separators are whitespace.
> > It creates a list of all  _set_drvdata functions.
> > For each _set_drvdata function:
> > It checks all files that have a _set_drvdata entry, but no
> > _get_drvdata
>
> OK, but I have the impression that you want to ignore pci_set_drvdata for
> some reason?  Or did I misunderstand?

Yes. See the barely visible double quote :P
'  "pci_set_drvdata   '
Apologies for the confusion.

if [ "$fn" == '"pci_set_drvdata' ] ; then
    continue
fi


>
> julia
>
> >
> > I piped this output into a file and started manually checking the drivers.
> > There is one [I forget which function] that is _set_drvdata() but
> > equivalent is _drvdata()
> >
> > As Andy said, some precautions must be taken in places where
> > _set_drvdata() is called but dev_get_drvdata() is used.
> > Cases like PM suspend/resume calls.
> > And there may be some cases outside this context.
> >
> >
> > >
> > > julia
> >


Re: [Cocci] Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-20 Thread Julia Lawall



On Fri, 20 Nov 2020, Alexandru Ardelean wrote:

> On Fri, Nov 20, 2020 at 12:47 PM Julia Lawall  wrote:
> >
> >
> >
> > On Thu, 19 Nov 2020, Joe Perches wrote:
> >
> > > On Thu, 2020-11-19 at 17:16 +0200, Andy Shevchenko wrote:
> > > > On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
> > > >  wrote:
> > > > >
> > > > > Hey,
> > > > >
> > > > > So, I stumbled on a new check that could be added to checkpatch.
> > > > > Since it's in Perl, I'm reluctant to try it.
> > > > >
> > > > > Seems many drivers got to a point where they now call (let's say)
> > > > > spi_set_drvdata(), but never access that information via
> > > > > spi_get_drvdata().
> > > > > Reasons for this seem to be:
> > > > > 1. They got converted to device-managed functions and there is no
> > > > > longer a remove hook to require the _get_drvdata() access
> > > > > 2. They look like they were copied from a driver that had a
> > > > > _set_drvdata() and when the code got finalized, the _set_drvdata() was
> > > > > omitted
> > > > >
> > > > > There are a few false positives that I can notice at a quick look,
> > > > > like the data being set via some xxx_set_drvdata() and retrieved via a
> > > > > dev_get_drvdata().
> > > >
> > > > I can say quite a few. And this makes a difference.
> > > > So, basically all drivers that are using PM callbacks would rather use
> > > > dev_get_drvdata() rather than bus specific.
> > > >
> > > > > I think checkpatch reporting these as well would be acceptable simply
> > > > > from a reviewability perspective.
> > > > >
> > > > > I did a shell script to quickly check these. See below.
> > > > > It's pretty badly written but it is enough for me to gather a list.
> > > > > And I wrote it in 5 minutes :P
> > > > > I initially noticed this in some IIO drivers, and then I suspected
> > > > > that this may be more widespread.
> > > >
> > > > It seems more suitable for coccinelle.
> > >
> > > To me as well.
> >
> > To me as well, since it seems to involve nonlocal information.
> >
> > I'm not sure to understand the original shell script. Is there
> > something interesting about pci_set_drvdata?
>
> Ah, it's a stupid script I wrote in 5 minutes, so I did not bother to
> make things smart.
> In the text-matching I did in shell, there are some entries that come
> from comments and docs.
> It's only about 3-4 entries, so I just did a visual/manual ignore.
>
> In essence:
> The script searches for all strings that contain _set_drvdata.
> The separators are whitespace.
> It creates a list of all  _set_drvdata functions.
> For each _set_drvdata function:
> It checks all files that have a _set_drvdata entry, but no
> _get_drvdata

OK, but I have the impression that you want to ignore pci_set_drvdata for
some reason?  Or did I misunderstand?

julia

>
> I piped this output into a file and started manually checking the drivers.
> There is one [I forget which function] that is _set_drvdata() but
> equivalent is _drvdata()
>
> As Andy said, some precautions must be taken in places where
> _set_drvdata() is called but dev_get_drvdata() is used.
> Cases like PM suspend/resume calls.
> And there may be some cases outside this context.
>
>
> >
> > julia
>


Re: [Cocci] Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-20 Thread Alexandru Ardelean
On Fri, Nov 20, 2020 at 12:47 PM Julia Lawall  wrote:
>
>
>
> On Thu, 19 Nov 2020, Joe Perches wrote:
>
> > On Thu, 2020-11-19 at 17:16 +0200, Andy Shevchenko wrote:
> > > On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
> > >  wrote:
> > > >
> > > > Hey,
> > > >
> > > > So, I stumbled on a new check that could be added to checkpatch.
> > > > Since it's in Perl, I'm reluctant to try it.
> > > >
> > > > Seems many drivers got to a point where they now call (let's say)
> > > > spi_set_drvdata(), but never access that information via
> > > > spi_get_drvdata().
> > > > Reasons for this seem to be:
> > > > 1. They got converted to device-managed functions and there is no
> > > > longer a remove hook to require the _get_drvdata() access
> > > > 2. They look like they were copied from a driver that had a
> > > > _set_drvdata() and when the code got finalized, the _set_drvdata() was
> > > > omitted
> > > >
> > > > There are a few false positives that I can notice at a quick look,
> > > > like the data being set via some xxx_set_drvdata() and retrieved via a
> > > > dev_get_drvdata().
> > >
> > > I can say quite a few. And this makes a difference.
> > > So, basically all drivers that are using PM callbacks would rather use
> > > dev_get_drvdata() rather than bus specific.
> > >
> > > > I think checkpatch reporting these as well would be acceptable simply
> > > > from a reviewability perspective.
> > > >
> > > > I did a shell script to quickly check these. See below.
> > > > It's pretty badly written but it is enough for me to gather a list.
> > > > And I wrote it in 5 minutes :P
> > > > I initially noticed this in some IIO drivers, and then I suspected
> > > > that this may be more widespread.
> > >
> > > It seems more suitable for coccinelle.
> >
> > To me as well.
>
> To me as well, since it seems to involve nonlocal information.
>
> I'm not sure to understand the original shell script. Is there
> something interesting about pci_set_drvdata?

Ah, it's a stupid script I wrote in 5 minutes, so I did not bother to
make things smart.
In the text-matching I did in shell, there are some entries that come
from comments and docs.
It's only about 3-4 entries, so I just did a visual/manual ignore.

In essence:
The script searches for all strings that contain _set_drvdata.
The separators are whitespace.
It creates a list of all  _set_drvdata functions.
For each _set_drvdata function:
It checks all files that have a _set_drvdata entry, but no
_get_drvdata

I piped this output into a file and started manually checking the drivers.
There is one [I forget which function] that is _set_drvdata() but
equivalent is _drvdata()

As Andy said, some precautions must be taken in places where
_set_drvdata() is called but dev_get_drvdata() is used.
Cases like PM suspend/resume calls.
And there may be some cases outside this context.


>
> julia


Re: [Cocci] Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-20 Thread Julia Lawall



On Thu, 19 Nov 2020, Joe Perches wrote:

> On Thu, 2020-11-19 at 17:16 +0200, Andy Shevchenko wrote:
> > On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
> >  wrote:
> > >
> > > Hey,
> > >
> > > So, I stumbled on a new check that could be added to checkpatch.
> > > Since it's in Perl, I'm reluctant to try it.
> > >
> > > Seems many drivers got to a point where they now call (let's say)
> > > spi_set_drvdata(), but never access that information via
> > > spi_get_drvdata().
> > > Reasons for this seem to be:
> > > 1. They got converted to device-managed functions and there is no
> > > longer a remove hook to require the _get_drvdata() access
> > > 2. They look like they were copied from a driver that had a
> > > _set_drvdata() and when the code got finalized, the _set_drvdata() was
> > > omitted
> > >
> > > There are a few false positives that I can notice at a quick look,
> > > like the data being set via some xxx_set_drvdata() and retrieved via a
> > > dev_get_drvdata().
> >
> > I can say quite a few. And this makes a difference.
> > So, basically all drivers that are using PM callbacks would rather use
> > dev_get_drvdata() rather than bus specific.
> >
> > > I think checkpatch reporting these as well would be acceptable simply
> > > from a reviewability perspective.
> > >
> > > I did a shell script to quickly check these. See below.
> > > It's pretty badly written but it is enough for me to gather a list.
> > > And I wrote it in 5 minutes :P
> > > I initially noticed this in some IIO drivers, and then I suspected
> > > that this may be more widespread.
> >
> > It seems more suitable for coccinelle.
>
> To me as well.

To me as well, since it seems to involve nonlocal information.

I'm not sure to understand the original shell script. Is there
something interesting about pci_set_drvdata?

julia


Re: Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-19 Thread Joe Perches
On Thu, 2020-11-19 at 17:16 +0200, Andy Shevchenko wrote:
> On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
>  wrote:
> > 
> > Hey,
> > 
> > So, I stumbled on a new check that could be added to checkpatch.
> > Since it's in Perl, I'm reluctant to try it.
> > 
> > Seems many drivers got to a point where they now call (let's say)
> > spi_set_drvdata(), but never access that information via
> > spi_get_drvdata().
> > Reasons for this seem to be:
> > 1. They got converted to device-managed functions and there is no
> > longer a remove hook to require the _get_drvdata() access
> > 2. They look like they were copied from a driver that had a
> > _set_drvdata() and when the code got finalized, the _set_drvdata() was
> > omitted
> > 
> > There are a few false positives that I can notice at a quick look,
> > like the data being set via some xxx_set_drvdata() and retrieved via a
> > dev_get_drvdata().
> 
> I can say quite a few. And this makes a difference.
> So, basically all drivers that are using PM callbacks would rather use
> dev_get_drvdata() rather than bus specific.
> 
> > I think checkpatch reporting these as well would be acceptable simply
> > from a reviewability perspective.
> > 
> > I did a shell script to quickly check these. See below.
> > It's pretty badly written but it is enough for me to gather a list.
> > And I wrote it in 5 minutes :P
> > I initially noticed this in some IIO drivers, and then I suspected
> > that this may be more widespread.
> 
> It seems more suitable for coccinelle.

To me as well.




Re: Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-19 Thread Andy Shevchenko
On Thu, Nov 19, 2020 at 4:09 PM Alexandru Ardelean
 wrote:
>
> Hey,
>
> So, I stumbled on a new check that could be added to checkpatch.
> Since it's in Perl, I'm reluctant to try it.
>
> Seems many drivers got to a point where they now call (let's say)
> spi_set_drvdata(), but never access that information via
> spi_get_drvdata().
> Reasons for this seem to be:
> 1. They got converted to device-managed functions and there is no
> longer a remove hook to require the _get_drvdata() access
> 2. They look like they were copied from a driver that had a
> _set_drvdata() and when the code got finalized, the _set_drvdata() was
> omitted
>
> There are a few false positives that I can notice at a quick look,
> like the data being set via some xxx_set_drvdata() and retrieved via a
> dev_get_drvdata().

I can say quite a few. And this makes a difference.
So, basically all drivers that are using PM callbacks would rather use
dev_get_drvdata() rather than bus specific.
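
As an illustration of that pattern (a sketch only, not taken from any real
driver): probe stores the data with the bus-specific helper, while the PM
callback retrieves it through dev_get_drvdata(), so a per-file search for
spi_get_drvdata() would flag this driver as a false positive.

---
#include <linux/device.h>
#include <linux/module.h>
#include <linux/pm.h>
#include <linux/slab.h>
#include <linux/spi/spi.h>

struct foo_priv {
        int some_state;
};

static int foo_probe(struct spi_device *spi)
{
        struct foo_priv *priv;

        priv = devm_kzalloc(&spi->dev, sizeof(*priv), GFP_KERNEL);
        if (!priv)
                return -ENOMEM;

        /* Stored through the bus-specific helper ... */
        spi_set_drvdata(spi, priv);
        return 0;
}

static int __maybe_unused foo_suspend(struct device *dev)
{
        /* ... but retrieved through the generic helper, so this file
         * never mentions spi_get_drvdata(). */
        struct foo_priv *priv = dev_get_drvdata(dev);

        priv->some_state = 1;
        return 0;
}
---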

> I think checkpatch reporting these as well would be acceptable simply
> from a reviewability perspective.
>
> I did a shell script to quickly check these. See below.
> It's pretty badly written but it is enough for me to gather a list.
> And I wrote it in 5 minutes :P
> I initially noticed this in some IIO drivers, and then I suspected
> that this may be more widespread.

It seems more suitable for coccinelle.

> The shell script gathers a list of xxx_set_drvdata() functions then
> greps through all files and also checks if there are any matching
> xxx_get_drvdata().
>
> Thanks
> Alex
>
> Shell script:
> ---
> #!/bin/bash
>
> fns1=$(git grep _set_drvdata | cut -d: -f2 | cut -d'(' -f1 | sort -u)
>
> for fn in $fns1 ; do
> if [ "$fn" == "//pci_set_drvdata" ] ; then
> continue
> fi
> if [ "$fn" == '``dev_set_drvdata' ] ; then
> continue
> fi
> if [ "$fn" == '"pci_set_drvdata' ] ; then
> continue
> fi
> if [[ "$fn" == *"_set_drvdata" ]]; then
> fns2="$fns2 $fn"
> fi
> done
>
> fns1=$(echo $fns2 | tr ' ' '\n' | sort -u | tr '\n' ' ')
>
> for fn in $fns1 ; do
> get_fn=$(echo $fn | sed 's/_set_/_get_/g')
>
> echo "Matching $fn - $get_fn"
> for file in $(git grep $fn | cut -d: -f1 | sort -u) ; do
> if ! grep -q $get_fn $file ; then
> echo "   Maybe $file"
> fi
> done
> done
> ---

-- 
With Best Regards,
Andy Shevchenko


Proposal for a new checkpatch check; matching _set_drvdata() & _get_drvdata()

2020-11-19 Thread Alexandru Ardelean
Hey,

So, I stumbled on a new check that could be added to checkpatch.
Since it's in Perl, I'm reluctant to try it.

Seems many drivers got to a point where they now call (let's say)
spi_set_drvdata(), but never access that information via
spi_get_drvdata().
Reasons for this seem to be:
1. They got converted to device-managed functions and there is no
longer a remove hook to require the _get_drvdata() access
2. They look like they were copied from a driver that had a
_set_drvdata() and when the code got finalized, the _set_drvdata() was
omitted

There are a few false positives that I can notice at a quick look,
like the data being set via some xxx_set_drvdata() and retrieved via a
dev_get_drvdata().
I think checkpatch reporting these as well would be acceptable simply
from a reviewability perspective.

I did a shell script to quickly check these. See below.
It's pretty badly written but it is enough for me to gather a list.
And I wrote it in 5 minutes :P
I initially noticed this in some IIO drivers, and then I suspected
that this may be more widespread.

The shell script gathers a list of xxx_set_drvdata() functions then
greps through all files and also checks if there are any matching
xxx_get_drvdata().

Thanks
Alex

Shell script:
---
#!/bin/bash

fns1=$(git grep _set_drvdata | cut -d: -f2 | cut -d'(' -f1 | sort -u)

for fn in $fns1 ; do
    if [ "$fn" == "//pci_set_drvdata" ] ; then
        continue
    fi
    if [ "$fn" == '``dev_set_drvdata' ] ; then
        continue
    fi
    if [ "$fn" == '"pci_set_drvdata' ] ; then
        continue
    fi
    if [[ "$fn" == *"_set_drvdata" ]]; then
        fns2="$fns2 $fn"
    fi
done

fns1=$(echo $fns2 | tr ' ' '\n' | sort -u | tr '\n' ' ')

for fn in $fns1 ; do
    get_fn=$(echo $fn | sed 's/_set_/_get_/g')

    echo "Matching $fn - $get_fn"
    for file in $(git grep $fn | cut -d: -f1 | sort -u) ; do
        if ! grep -q $get_fn $file ; then
            echo "   Maybe $file"
        fi
    done
done
---


Project Funding Proposal

2020-10-18 Thread Gerald Van Luven




--
Good Day,

I am Dr. Gerald Van Luven and I work as a financial consultant and adviser
to Global Asset Management company LLC and some High Net worth individuals
from the MENA region and European Union.

Basically, my principals are interested in investing in projects that are
viable and has capacity of generating a reasonable ROI for the investor.

I MUST consider you a privileged entity who shall be privy to many
sourcable CLEAN fund meant for foreign investment and you are able to
access up to One Billion Euro or more if your project is viable.

I will wish that you send a copy of your highlighted projects with
elucidated BUSINESS PLANS or EXECUTIVE SUMMARY which I shall propagate for
its immediate approval for a minimal ROI return on investment which you
shall have sole mandate after review and consideration.

You should also bear in mind that funding of this nature is not completed
on email communication alone, there will be the need for a face to face
meeting with the investor or His representatives and have the paperwork
perfected and signed.

Kindly reply as soon as possible if you have any viable project that you
will want us to be involved as indicated above.

Thanks

Dr. Gerald Van Luven


Project Funding Proposal

2020-10-15 Thread Gerald Van Luven






--
Good Day,

I am Dr. Gerald Van Luven and I work as a financial consultant and adviser
to Global Asset Management company LLC and some High Net worth individuals
from the MENA region and European Union.

Basically, my principals are interested in investing in projects that are
viable and has capacity of generating a reasonable ROI for the investor.

I MUST consider you a privileged entity who shall be privy to many
sourcable CLEAN fund meant for foreign investment and you are able to
access up to One Billion Euro or more if your project is viable.

I will wish that you send a copy of your highlighted projects with
elucidated BUSINESS PLANS or EXECUTIVE SUMMARY which I shall propagate for
its immediate approval for a minimal ROI return on investment which you
shall have sole mandate after review and consideration.

You should also bear in mind that funding of this nature is not completed
on email communication alone, there will be the need for a face to face
meeting with the investor or His representatives and have the paperwork
perfected and signed.

Kindly reply as soon as possible if you have any viable project that you
will want us to be involved as indicated above.

Thanks

Dr. Gerald Van Luven


Business Investment Proposal

2020-10-15 Thread Gerald Van Luven






--
Good Day,

I am Dr. Gerald Van Luven and I work as a financial consultant and adviser
to Global Asset Management company LLC and some High Net worth individuals
from the MENA region and European Union.

Basically, my principals are interested in investing in projects that are
viable and has capacity of generating a reasonable ROI for the investor.

I MUST consider you a privileged entity who shall be privy to many
sourcable CLEAN fund meant for foreign investment and you are able to
access up to One Billion Euro or more if your project is viable.

I will wish that you send a copy of your highlighted projects with
elucidated BUSINESS PLANS or EXECUTIVE SUMMARY which I shall propagate for
its immediate approval for a minimal ROI return on investment which you
shall have sole mandate after review and consideration.

You should also bear in mind that funding of this nature is not completed
on email communication alone, there will be the need for a face to face
meeting with the investor or His representatives and have the paperwork
perfected and signed.

Kindly reply as soon as possible if you have any viable project that you
will want us to be involved as indicated above.

Thanks

Dr. Gerald Van Luven


Business Funding Proposal

2020-10-13 Thread Per Lessen



Good Day,

I am Dr. Per Lessen and I work as a financial consultant with
Global Asset Management company LLC and some High Net worth
individuals from the MENA region and European Union.

Basically, my principals are interested in investing in projects that
are viable and has capacity of generating a reasonable ROI for the
investor.

I MUST consider you a privileged entity who shall be privy to many
sourcable CLEAN fund meant for foreign investment and you are able to
access up to One Billion Euro or More.

I will wish that you send a copy of your highlighted projects with a
elucidated BUSINESS PLANS and EXECUTIVE SUMMARY which I shall propagate
for its immediate approval for a minimal ROI return on investment which
you shall have sole mandate.


You should also bear in mind that funding of this nature is not
completed on email communication alone, there will be the need for a
face to face meeting with the investor or His representatives and have
the paperwork perfected and signed.

kindly reply as soon as possible if you have any viable project that you
will want us to be involved as indicated above.

Thanks

Dr. Per Lessen



Project Funding Proposal.

2020-10-13 Thread Per Lessen




--
Good Day,

I am Dr. Per Lessen and I work as a financial consultant with
Global Asset Management company LLC and some High Net worth
individuals from the MENA region and European Union.

Basically, my principals are interested in investing in projects that
are viable and has capacity of generating a reasonable ROI for the
investor.

I MUST consider you a privileged entity who shall be privy to many
sourcable CLEAN fund meant for foreign investment and you are able to
access up to One Billion Euro or More.

I will wish that you send a copy of your highlighted projects with a
elucidated BUSINESS PLANS and EXECUTIVE SUMMARY which I shall propagate
for its immediate approval for a minimal ROI return on investment which
you shall have sole mandate.


You should also bear in mind that funding of this nature is not
completed on email communication alone, there will be the need for a
face to face meeting with the investor or His representatives and have
the paperwork perfected and signed.

kindly reply as soon as possible if you have any viable project that you
will want us to be involved as indicated above.

Thanks

Dr. Per Lessen


My business proposal

2020-09-23 Thread Wang Xiu Ying



Good Day,
I am Wang Xiu Ying, the Director for Credit & Marketing at Chong Hing Bank, Hong
Kong, Chong Hing Bank Center, 24 Des Voeux Road Central, Hong Kong. I have a
business proposal of USD $13,991,674. All confirmable documents to back up the
claims will be made available to you prior to your acceptance and as soon as I
receive your return mail.
Best Regards,
Wang Xiu Ying


Re: Yet another ethernet PHY LED control proposal

2020-09-14 Thread Pavel Machek
Hi!

> I have been thinking about another way to implement an ABI for HW control
> of ethernet PHY connected LEDs.
> 
> This proposal is inspired by the fact that for some time there has been a
> movement in the kernel to do transparent HW offloading of things (DSA
> is an example of that).

And it is a good proposal.

> So currently we have the `netdev` trigger. When this is enabled for a
> LED, new files will appear in that LED's sysfs directory:
>   - `device_name` where user is supposed to write interface name
>   - `link` if set to 1, the LED will be ON if the interface is linked
>   - `rx` if set to 1, the LED will blink on receive event
>   - `tx` if set to 1, the LED will blink on transmit event
>   - `interval` specifies duration of the LED blink
> 
> Now what is interesting is that almost all combinations of link/rx/tx
> settings are offloadable to a Marvell PHY! (Not to all LEDs, though...)
> 
> So what if we abandoned the idea of a `hw` trigger, and instead just
> allowed a LED trigger to be offloadable, if that specific LED supports
> it?
> 
> For the HW mode for different speed we can just expand the `link` sysfs
> file ABI, so that if user writes a specific speed to this file, instead
> of just "1", the LED will be on if the interface is linked on that
> specific speed. Or maybe another sysfs file could be used for "light on
> N mbps" setting...
> 
> Afterwards we can figure out other possible modes.
> 
> What do you think?

If this can be implemented (and it probably can), it is the best
solution :-).

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Yet another ethernet PHY LED control proposal

2020-09-11 Thread Marek Behun
Hello,

I have been thinking about another way to implement an ABI for HW control
of ethernet PHY connected LEDs.

This proposal is inspired by the fact that for some time there has been a
movement in the kernel to do transparent HW offloading of things (DSA
is an example of that).

So currently we have the `netdev` trigger. When this is enabled for a
LED, new files will appear in that LED's sysfs directory:
  - `device_name` where user is supposed to write interface name
  - `link` if set to 1, the LED will be ON if the interface is linked
  - `rx` if set to 1, the LED will blink on receive event
  - `tx` if set to 1, the LED will blink on transmit event
  - `interval` specifies duration of the LED blink

Now what is interesting is that almost all combinations of link/rx/tx
settings are offloadable to a Marvell PHY! (Not to all LEDs, though...)

So what if we abandoned the idea of a `hw` trigger, and instead just
allowed a LED trigger to be offloadable, if that specific LED supports
it?

For the HW mode for different speed we can just expand the `link` sysfs
file ABI, so that if user writes a specific speed to this file, instead
of just "1", the LED will be on if the interface is linked on that
specific speed. Or maybe another sysfs file could be used for "light on
N mbps" setting...

Afterwards we can figure out other possible modes.

What do you think?

Marek


Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Joel Fernandes
Hey Dhaval,

On Mon, Aug 24, 2020 at 3:50 PM Dhaval Giani  wrote:
>
> On Fri, Aug 21, 2020 at 8:01 PM Joel Fernandes  wrote:
> >
> > Hello!
> > Core-scheduling aims to make it safe for more than one task that trust
> > each other to share hyperthreads within a CPU core [1]. This results
> > in a performance improvement for workloads that can benefit from using
> > hyperthreading safely while limiting core-sharing when it is not safe.
> >
> > Currently no universally agreed set of interface exists and companies have
> > been hacking up their own interface to make use of the patches. This post
> > aims to list usecases which I got after talking to various people at Google
> > and Oracle. After which actual development of code to add interfaces can 
> > follow.
> >
> > The below text uses the terms cookie and tag interchangeably. Further, 
> > cookie
> > of 0 is assumed to indicate a trusted process - such as kernel threads or
> > system daemons. By default, if nothing is tagged then everything is
> > considered trusted since the scheduler assumes all tasks are a match for 
> > each
> > other.
> >
> > Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
> > int32 is split into 2 parts, the color and the id. The color can only be set
> > by privileged processes and the id can be set by anyone. The CGroup 
> > structure
> > looks like:
> >
> >        A           B
> >       / \        / | \
> >      C   D      E  F  G
> >
> > Here A and B are container CGroups for 2 jobs and are assigned a color by a
> > privileged daemon. The job itself has more sub-CGroups within (for ex, B has
> > E, F and G). When these sub-CGroups are spawned, they inherit the color from
> > the parent. An unprivileged user can then set an id for the sub-CGroup
> > without the knowledge of the privileged daemon if it desires to add further
> > isolation. This setting of id can be an unprivileged operation because the
> > root daemon has already isolated A and B.
> >
> > Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
> > spawns a renderer. A renderer is a sandboxed process and it is assumed it
> > could run arbitrary code (Javascript etc). When a renderer is created, a
> > prctl call is made to tag the renderer. Every thread that is spawned by the
> > renderer is also tagged. Essentially this turns SMT off for the renderer, 
> > but
> > still gives a performance boost due to privileged system threads being able
> > to share a core. The tagging also forbids the renderer from sharing a core
> > with privileged system processes. In the future, we plan to allow threads to
> > share a core as well (especially once we get syscall-isolation upstreamed.
> > Patches were posted recently for the same [2]).
> >
> > Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
> > tagged thus disallowing core sharing between the vCPU thread and any other
> > thread on the system. This is because such VMs may run arbitrary user code
> > and attack both the guest and the host systems sharing the core.
> >
> > Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
> > mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be 
> > allowed
> > to not have to share its parent's CGroup tag. In fact, it should be allowed 
> > to
> > untag the child CGroup if needed thus allowing them to share a core with
> > trusted tasks. Others have had similar requirements.
> >
>
> Just to augment this. This doesn't necessarily need to be cgroup
> based. We do have a need where certain processes want to be tagged
> separately from others, which are in the same cgroup hierarchy. The
> standard mechanism for this is nested cgroups. With a unified
> hierarchy, and with cgroup tagging, I am unsure what this really
> means. Consider
>
> root
> |- A
> |  |- A1
> |  |- A2
>
> If A is tagged, can processes in A1 and A2 share a core? Should they
> share a core? In some cases we might be OK with them sharing cores
> just to get some of the performance back. Core scheduling really needs
> to be limited to just the processes that we want to protect.

Yeah, this is exactly why Vineeth was suggesting a separate FS without
nested hierarchies. The CGroupv2 unified hierarchy may be too
restrictive to pick and choose which nested or non-nested
sub-hierarchies want to share a core if the root of any (sub)hierarchy
is tagged. As mentioned in CGroup v2 documentation, someone thought it
is a good idea to kill the CGroup v1's non-unified flexibility so here
we 

Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread chris hyser




On 8/24/20 4:53 PM, chris hyser wrote:

On 8/21/20 11:01 PM, Joel Fernandes wrote:

Hello!
Core-scheduling aims to make it safe for more than one task that trust
each other to share hyperthreads within a CPU core [1]. This results
in a performance improvement for workloads that can benefit from using
hyperthreading safely while limiting core-sharing when it is not safe.

Currently no universally agreed set of interface exists and companies have
been hacking up their own interface to make use of the patches. This post
aims to list usecases which I got after talking to various people at Google
and Oracle. After which actual development of code to add interfaces can follow.

The below text uses the terms cookie and tag interchangeably. Further, cookie
of 0 is assumed to indicate a trusted process - such as kernel threads or
system daemons. By default, if nothing is tagged then everything is
considered trusted since the scheduler assumes all tasks are a match for each
other.

Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
int32 is split into 2 parts, the color and the id. The color can only be set
by privileged processes and the id can be set by anyone. The CGroup structure
looks like:

       A           B
      / \        / | \
     C   D      E  F  G

Here A and B are container CGroups for 2 jobs and are assigned a color by a
privileged daemon. The job itself has more sub-CGroups within (for ex, B has
E, F and G). When these sub-CGroups are spawned, they inherit the color from
the parent. An unprivileged user can then set an id for the sub-CGroup
without the knowledge of the privileged daemon if it desires to add further
isolation. This setting of id can be an unprivileged operation because the
root daemon has already isolated A and B.

Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
spawns a renderer. A renderer is a sandboxed process and it is assumed it
could run arbitrary code (Javascript etc). When a renderer is created, a
prctl call is made to tag the renderer. Every thread that is spawned by the
renderer is also tagged. Essentially this turns SMT off for the renderer, but
still gives a performance boost due to privileged system threads being able
to share a core. The tagging also forbids the renderer from sharing a core
with privileged system processes. In the future, we plan to allow threads to
share a core as well (especially once we get syscall-isolation upstreamed.
Patches were posted recently for the same [2]).

Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
tagged thus disallowing core sharing between the vCPU thread and any other
thread on the system. This is because such VMs may run arbitrary user code
and attack both the guest and the host systems sharing the core.

Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be allowed
to not have to share its parent's CGroup tag. In fact, it should be allowed to
untag the child CGroup if needed thus allowing them to share a core with
trusted tasks. Others have had similar requirements.

Proposal for tagging

We have to support both CGroup and non-CGroup users. CGroup may be overkill
for some and the CGroup v2 unified hierarchy may be too inflexible.
Regardless, we must support CGroup due to its ease of use and existing users.

For Usecase #1
--
Usecase #1 requires a 2-level tagging mechanism. I propose 2 new files
to the CPU controller:
- tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
   tagged.  (In the kernel, the cookie will be derived from the pointer value
   of a ref-counted cookie object.). If reset, then the CGroup will inherit
   the parent CGroup's cookie if there is one.

- color : The ref-counted object will be aligned say to a 256-byte boundary
   (for example), then the lower 8 bits of the pointer can be used to specify
   color. Together, the pointer with the color will form a cookie used by the
   scheduler.

Note that if 2 CGroups belong to 2 different tagged hierarchies, then setting
their color to be the same does not imply that the 2 groups will share a
core. This is key.  Also, to support usecase #4, we could add a third tag
value -- 2, along with the usual 0 and 1 to suggest that the CGroup can share
a core with cookie-0 tasks (Chris Hyser feel free to add any more comments
here).
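
A minimal sketch of how such a pointer-plus-color cookie could be composed,
only to illustrate the encoding described above (the struct name and helpers
are invented for this sketch; they are not part of the proposal's code):

---
#include <linux/kref.h>
#include <linux/types.h>

/* Hypothetical cookie object; the 256-byte alignment frees the low 8 bits
 * of its address to carry the user-settable color. */
struct core_cookie {
        struct kref ref;
} __aligned(256);

#define CORE_COOKIE_COLOR_MASK  0xffUL

/* Combine the (aligned) object pointer with an 8-bit color. */
static inline unsigned long core_cookie_make(struct core_cookie *ck, u8 color)
{
        return (unsigned long)ck | color;
}

static inline struct core_cookie *core_cookie_obj(unsigned long cookie)
{
        return (struct core_cookie *)(cookie & ~CORE_COOKIE_COLOR_MASK);
}

static inline u8 core_cookie_color(unsigned long cookie)
{
        return cookie & CORE_COOKIE_COLOR_MASK;
}
---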


Let em think about this. This looks like it would support delegation of a cgroup subtree, which I suppose containers are 


s/em/me/



Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread chris hyser

On 8/21/20 11:01 PM, Joel Fernandes wrote:

Hello!
Core-scheduling aims to make it safe for more than one task that trust
each other to share hyperthreads within a CPU core [1]. This results
in a performance improvement for workloads that can benefit from using
hyperthreading safely while limiting core-sharing when it is not safe.

Currently no universally agreed set of interface exists and companies have
been hacking up their own interface to make use of the patches. This post
aims to list usecases which I got after talking to various people at Google
and Oracle. After which actual development of code to add interfaces can follow.

The below text uses the terms cookie and tag interchangeably. Further, cookie
of 0 is assumed to indicate a trusted process - such as kernel threads or
system daemons. By default, if nothing is tagged then everything is
considered trusted since the scheduler assumes all tasks are a match for each
other.

Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
int32 is split into 2 parts, the color and the id. The color can only be set
by privileged processes and the id can be set by anyone. The CGroup structure
looks like:

      A          B
     / \       / \ \
    C   D     E  F  G

Here A and B are container CGroups for 2 jobs and are assigned a color by a
privileged daemon. The job itself has more sub-CGroups within (for ex, B has
E, F and G). When these sub-CGroups are spawned, they inherit the color from
the parent. An unprivileged user can then set an id for the sub-CGroup
without the knowledge of the privileged daemon if it desires to add further
isolation. This setting of id can be an unprivileged operation because the
root daemon has already isolated A and B.

Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
spawns a renderer. A renderer is a sandboxed process and it is assumed it
could run arbitrary code (Javascript etc). When a renderer is created, a
prctl call is made to tag the renderer. Every thread that is spawned by the
renderer is also tagged. Essentially this turns SMT off for the renderer, but
still gives a performance boost due to privileged system threads being able
to share a core. The tagging also forbids the renderer from sharing a core
with privileged system processes. In the future, we plan to allow threads to
share a core as well (especially once we get syscall-isolation upstreamed.
Patches were posted recently for the same [2]).

Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
tagged thus disallowing core sharing between the vCPU thread and any other
thread on the system. This is because such VMs may run arbitrary user code
and attack both the guest and the host systems sharing the core.

Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be allowed
not to share their parent's CGroup tag. In fact, it should be possible to
untag the child CGroup if needed, thus allowing its tasks to share a core with
trusted tasks. Others have had similar requirements.

Proposal for tagging

We have to support both CGroup and non-CGroup users. CGroup may be overkill
for some and the CGroup v2 unified hierarchy may be too inflexible.
Regardless, we must support CGroup due to its ease of use and existing users.

For Usecase #1
--
Usecase #1 requires a 2-level tagging mechanism. I propose 2 new files
to the CPU controller:
- tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
   tagged.  (In the kernel, the cookie will be derived from the pointer value
   of a ref-counted cookie object.). If reset, then the CGroup will inherit
   the parent CGroup's cookie if there is one.

- color : The ref-counted object will be aligned say to a 256-byte boundary
   (for example), then the lower 8 bits of the pointer can be used to specify
   color. Together, the pointer with the color will form a cookie used by the
   scheduler.

Note that if 2 CGroups belong to 2 different tagged hierarchies, then setting
their color to be the same does not imply that the 2 groups will share a
core. This is key.  Also, to support usecase #4, we could add a third tag
value -- 2, along with the usual 0 and 1 to suggest that the CGroup can share
a core with cookie-0 tasks (Chris Hyser feel free to add any more comments
here).


Let em think about this. This looks like it would support delegation of a cgroup subtree, which I suppose containers are 
going to want eventually. That seems to be the advantage over just allowing setting the entire cookie. Anyway, I look 
forward to tomorrow and thanks for putting this together.


-chrish




For Usecase #2
--
We could add an interface that Peter suggested where 2 PIDs A and B want to
share a core. So if A wants to share a core with B, then it issues
prctl(SET_CORE_SHARE, B). ptrace_may_access() can be used to restrict access.
For r

Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Dhaval Giani
On Mon, Aug 24, 2020 at 4:32 AM Vineeth Pillai  wrote:
>
> > Let me know your thoughts and looking forward to a good LPC MC discussion!
> >
>
> Nice write up Joel, thanks for taking time to compile this with great detail!
>
> After going through the details of interface proposal using cgroup v2
> controllers,
> and based on our discussion offline, would like to note down this idea
> about a new
> pseudo filesystem interface for core scheduling.  We could include
> this also for the
> API discussion during core scheduler MC.
>
> coreschedfs: pseudo filesystem interface for Core Scheduling
> --
>
> The basic requirement of core scheduling is simple - we need to group a set
> of tasks into a trust group that can share a core. So we don’t really
> need a nested
> hierarchy for the trust groups. Cgroups v2 follow a unified nested
> hierarchy model
> that causes a considerable confusion if the trusted tasks are in
> different levels of the
> hierarchy and we need to allow them to share the core. Cgroup v2's
> single hierarchy
> model makes it difficult to regroup tasks in different levels of
> nesting for core scheduling.
> As noted in this mail, we could use multi-file approach and other
> interfaces like prctl to
> overcome this limitation.
>
> The idea proposed here to overcome the above limitation is to come up with a 
> new
> pseudo filesystem - “coreschedfs”. This filesystem is basically a flat
> filesystem with
> maximum nesting level of 1. That means, root directory can have
> sub-directories for
> sub-groups, but those sub-directories cannot have more sub-directories
> representing
> trust groups. Root directory is to represent the system wide trust
> group and sub-directories
> represent trusted groups. Each directory including the root directory
> has the following set
> of files/directories:
>
> - cookie_id: User exposed id for a cookie. This can be compared to a
> file descriptor.
>  This could be used in programmatic API to join/leave a group
>
> - properties: This is an interface to specify how child tasks of this
> group should behave.
>   Can be used for specifying future flag requirements as well.
>   Current list of properties include:
>   NEW_COOKIE_FOR_CHILD: All fork() for tasks in this group
> will result in
> creation of a new trust group
>   SAME_COOKIE_FOR_CHILD: All fork() for tasks in this
> group will end up in
>  this same group
>   ROOT_COOKIE_FOR_CHILD: All fork() for tasks in this
> group goes to the root group
>
> - tasks: Lists the tasks in this group. Main interface for adding
> removing tasks in a group
>
> - : A directory per task that is a member of this trust group.
> - /properties: This file is the same as the parent properties file
> but this is to override
> the group setting.
>
> This pseudo filesystem can be mounted anywhere in the root
> filesystem; I propose the default
> to be in “/sys/kernel/coresched”
>
> When coresched is enabled, kernel internally creates the framework for
> this filesystem.
> The filesystem gets mounted to the default location and admin can
> change this if needed.
> All tasks by default are in the root group. The admin or programs can
> then create trusted
> groups on top of this filesystem.
>
> Hooks will be placed in fork() and exit() to make sure that the
> filesystem’s view of tasks is
> up-to-date with the system. Also, APIs manipulating core scheduling
> trusted groups should
> also make sure that the filesystem's view is updated.
>
> Note: The above idea is very similar to cgroups v1. Since there is no
> unified hierarchy
> in cgroup v1, most of the features of coreschedfs could be implemented
> as a cgroup v1
> controller. As no new v1 controllers are allowed, I feel the best
> alternative to have
> a simple API is to come up with a new filesystem - coreschedfs.
>
> The advantages of this approach are:
>
> - Detached from cgroup unified hierarchy and hence the very simple requirement
>of core scheduling can be easily materialized.
> - Admin can have fine-grained control of groups using shell and scripting
> - Can have programmatic access to this using existing APIs like mkdir,rmdir,
>write, read. Or can come up with new APIs using the cookie_id which can 
> wrap
>   the above Linux APIs or use a new system call for core scheduling.
> - Fine grained permission control using linux filesystem permissions and ACLs
>
> Disadvantages are
> - yet another ps

Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Dhaval Giani
On Fri, Aug 21, 2020 at 8:01 PM Joel Fernandes  wrote:
>
> Hello!
> Core-scheduling aims to make it safe for two or more tasks that trust
> each other to share hyperthreads within a CPU core [1]. This results
> in a performance improvement for workloads that can benefit from using
> hyperthreading safely while limiting core-sharing when it is not safe.
>
> Currently no universally agreed set of interfaces exists and companies have
> been hacking up their own interface to make use of the patches. This post
> aims to list usecases which I got after talking to various people at Google
> and Oracle. After which actual development of code to add interfaces can 
> follow.
>
> The below text uses the terms cookie and tag interchangeably. Further, cookie
> of 0 is assumed to indicate a trusted process - such as kernel threads or
> system daemons. By default, if nothing is tagged then everything is
> considered trusted since the scheduler assumes all tasks are a match for each
> other.
>
> Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
> int32 is split into 2 parts, the color and the id. The color can only be set
> by privileged processes and the id can be set by anyone. The CGroup structure
> looks like:
>
>       A          B
>      / \       / \ \
>     C   D     E  F  G
>
> Here A and B are container CGroups for 2 jobs and are assigned a color by a
> privileged daemon. The job itself has more sub-CGroups within (for ex, B has
> E, F and G). When these sub-CGroups are spawned, they inherit the color from
> the parent. An unprivileged user can then set an id for the sub-CGroup
> without the knowledge of the privileged daemon if it desires to add further
> isolation. This setting of id can be an unprivileged operation because the
> root daemon has already isolated A and B.
>
> Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
> spawns a renderer. A renderer is a sandboxed process and it is assumed it
> could run arbitrary code (Javascript etc). When a renderer is created, a
> prctl call is made to tag the renderer. Every thread that is spawned by the
> renderer is also tagged. Essentially this turns SMT off for the renderer, but
> still gives a performance boost due to privileged system threads being able
> to share a core. The tagging also forbids the renderer from sharing a core
> with privileged system processes. In the future, we plan to allow threads to
> share a core as well (especially once we get syscall-isolation upstreamed.
> Patches were posted recently for the same [2]).
>
> Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
> tagged thus disallowing core sharing between the vCPU thread and any other
> thread on the system. This is because such VMs may run arbitrary user code
> and attack both the guest and the host systems sharing the core.
>
> Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
> mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be allowed
> not to share their parent's CGroup tag. In fact, it should be possible to
> untag the child CGroup if needed, thus allowing its tasks to share a core with
> trusted tasks. Others have had similar requirements.
>

Just to augment this. This doesn't necessarily need to be cgroup
based. We do have a need where certain processes want to be tagged
separately from others, which are in the same cgroup hierarchy. The
standard mechanism for this is nested cgroups. With a unified
hierarchy, and with cgroup tagging, I am unsure what this really
means. Consider

root
|- A
   |- A1
   |- A2

If A is tagged, can processes in A1 and A2 share a core? Should they
share a core? In some cases we might be OK with them sharing cores
just to get some of the performance back. Core scheduling really needs
to be limited to just the processes that we want to protect.

> Proposal for tagging
> 
> We have to support both CGroup and non-CGroup users. CGroup may be overkill
> for some and the CGroup v2 unified hierarchy may be too inflexible.
> Regardless, we must support CGroup due to its ease of use and existing users.
>
> For Usecase #1
> --
> Usecase #1 requires a 2-level tagging mechanism. I propose 2 new files
> to the CPU controller:
> - tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
>   tagged.  (In the kernel, the cookie will be derived from the pointer value
>   of a ref-counted cookie object.). If reset, then the CGroup will inherit
>   the parent CGroup's cookie if there is one.
>
> - color : The ref-counted object will be aligned say to a 256-byte boundary
>   (for example), then the lower 8 bits of the pointer can be used to specify
>   color. Together, the poi

Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Vineeth Pillai
> Let me know your thoughts and looking forward to a good LPC MC discussion!
>

Nice write up Joel, thanks for taking time to compile this with great detail!

After going through the details of interface proposal using cgroup v2
controllers,
and based on our discussion offline, would like to note down this idea
about a new
pseudo filesystem interface for core scheduling.  We could include
this also for the
API discussion during core scheduler MC.

coreschedfs: pseudo filesystem interface for Core Scheduling
--

The basic requirement of core scheduling is simple - we need to group a set
of tasks into a trust group that can share a core. So we don’t really
need a nested
hierarchy for the trust groups. Cgroups v2 follow a unified nested
hierarchy model
that causes a considerable confusion if the trusted tasks are in
different levels of the
hierarchy and we need to allow them to share the core. Cgroup v2's
single hierarchy
model makes it difficult to regroup tasks in different levels of
nesting for core scheduling.
As noted in this mail, we could use multi-file approach and other
interfaces like prctl to
overcome this limitation.

The idea proposed here to overcome the above limitation is to come up with a new
pseudo filesystem - “coreschedfs”. This filesystem is basically a flat
filesystem with
maximum nesting level of 1. That means, root directory can have
sub-directories for
sub-groups, but those sub-directories cannot have more sub-directories
representing
trust groups. Root directory is to represent the system wide trust
group and sub-directories
represent trusted groups. Each directory including the root directory
has the following set
of files/directories:

- cookie_id: User exposed id for a cookie. This can be compared to a
file descriptor.
 This could be used in programmatic API to join/leave a group

- properties: This is an interface to specify how child tasks of this
group should behave.
  Can be used for specifying future flag requirements as well.
  Current list of properties include:
  NEW_COOKIE_FOR_CHILD: All fork() for tasks in this group
will result in
creation of a new trust group
  SAME_COOKIE_FOR_CHILD: All fork() for tasks in this
group will end up in
 this same group
  ROOT_COOKIE_FOR_CHILD: All fork() for tasks in this
group goes to the root group

- tasks: Lists the tasks in this group. Main interface for adding
removing tasks in a group

- : A directory per task that is a member of this trust group.
- /properties: This file is the same as the parent properties file
but this is to override
the group setting.

This pseudo filesystem can be mounted anywhere in the root
filesystem; I propose the default
to be in “/sys/kernel/coresched”

When coresched is enabled, kernel internally creates the framework for
this filesystem.
The filesystem gets mounted to the default location and admin can
change this if needed.
All tasks by default are in the root group. The admin or programs can
then create trusted
groups on top of this filesystem.
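
To make the intended workflow concrete, here is a sketch of how an admin tool
could create a trust group and move a task into it, assuming the default mount
point and the file names proposed above (none of this exists in any kernel
today):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[32];

            /* Create a new trust group; the kernel assigns it a cookie_id. */
            if (mkdir("/sys/kernel/coresched/job1", 0755) && errno != EEXIST)
                    perror("mkdir");

            /* Join the group by writing the task's pid to its "tasks" file. */
            int fd = open("/sys/kernel/coresched/job1/tasks", O_WRONLY);
            if (fd >= 0) {
                    int len = snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
                    if (write(fd, buf, len) < 0)
                            perror("write tasks");
                    close(fd);
            } else {
                    perror("open tasks");
            }
            return 0;
    }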

Hooks will be placed in fork() and exit() to make sure that the
filesystem’s view of tasks is
up-to-date with the system. Also, APIs manipulating core scheduling
trusted groups should
also make sure that the filesystem's view is updated.

Note: The above idea is very similar to cgroups v1. Since there is no
unified hierarchy
in cgroup v1, most of the features of coreschedfs could be implemented
as a cgroup v1
controller. As no new v1 controllers are allowed, I feel the best
alternative to have
a simple API is to come up with a new filesystem - coreschedfs.

The advantages of this approach are:

- Detached from cgroup unified hierarchy and hence the very simple requirement
   of core scheduling can be easily materialized.
- Admin can have fine-grained control of groups using shell and scripting
- Can have programmatic access to this using existing APIs like mkdir,rmdir,
   write, read. Or can come up with new APIs using the cookie_id which can wrap
  the above Linux APIs or use a new system call for core scheduling.
- Fine grained permission control using linux filesystem permissions and ACLs

Disadvantages are
- yet another pseudo filesystem.
- very similar to  cgroup v1 and might be re-implementing features
that are already
  provided by cgroups v1.

Use Cases
-

Usecase 1: Google cloud
-

Since we no longer depend on cgroup v2 hierarchies, there will not be
any issue of
nesting and sharing. The main daemon can create trusted groups in the
filesystem and
provide required permissions for the group. Then the init processes
for each job can
be added to respective groups for them to create children tasks as
needed. Multiple
jobs under the same customer which ne

[RFC] Design proposal for upstream core-scheduling interface

2020-08-21 Thread Joel Fernandes
Hello!
Core-scheduling aims to make it safe for two or more tasks that trust
each other to share hyperthreads within a CPU core [1]. This results
in a performance improvement for workloads that can benefit from using
hyperthreading safely while limiting core-sharing when it is not safe.

Currently no universally agreed set of interfaces exists and companies have
been hacking up their own interface to make use of the patches. This post
aims to list usecases which I got after talking to various people at Google
and Oracle. After which actual development of code to add interfaces can follow.

The below text uses the terms cookie and tag interchangeably. Further, cookie
of 0 is assumed to indicate a trusted process - such as kernel threads or
system daemons. By default, if nothing is tagged then everything is
considered trusted since the scheduler assumes all tasks are a match for each
other.

Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
int32 is split into 2 parts, the color and the id. The color can only be set
by privileged processes and the id can be set by anyone. The CGroup structure
looks like:

      A          B
     / \       / \ \
    C   D     E  F  G

Here A and B are container CGroups for 2 jobs and are assigned a color by a
privileged daemon. The job itself has more sub-CGroups within (for ex, B has
E, F and G). When these sub-CGroups are spawned, they inherit the color from
the parent. An unprivileged user can then set an id for the sub-CGroup
without the knowledge of the privileged daemon if it desires to add further
isolation. This setting of id can be an unprivileged operation because the
root daemon has already isolated A and B.

Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
spawns a renderer. A renderer is a sandboxed process and it is assumed it
could run arbitrary code (Javascript etc). When a renderer is created, a
prctl call is made to tag the renderer. Every thread that is spawned by the
renderer is also tagged. Essentially this turns SMT off for the renderer, but
still gives a performance boost due to privileged system threads being able
to share a core. The tagging also forbids the renderer from sharing a core
with privileged system processes. In the future, we plan to allow threads to
share a core as well (especially once we get syscall-isolation upstreamed.
Patches were posted recently for the same [2]).

Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
tagged thus disallowing core sharing between the vCPU thread and any other
thread on the system. This is because such VMs may run arbitrary user code
and attack both the guest and the host systems sharing the core.

Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be allowed
not to share their parent's CGroup tag. In fact, it should be possible to
untag the child CGroup if needed, thus allowing its tasks to share a core with
trusted tasks. Others have had similar requirements.

Proposal for tagging

We have to support both CGroup and non-CGroup users. CGroup may be overkill
for some and the CGroup v2 unified hierarchy may be too inflexible.
Regardless, we must support CGroup due to its ease of use and existing users.

For Usecase #1
--
Usecase #1 requires a 2-level tagging mechanism. I propose 2 new files
to the CPU controller:
- tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
  tagged.  (In the kernel, the cookie will be derived from the pointer value
  of a ref-counted cookie object.). If reset, then the CGroup will inherit
  the parent CGroup's cookie if there is one.

- color : The ref-counted object will be aligned say to a 256-byte boundary
  (for example), then the lower 8 bits of the pointer can be used to specify
  color. Together, the pointer with the color will form a cookie used by the
  scheduler.

Note that if 2 CGroups belong to 2 different tagged hierarchies, then setting
their color to be the same does not imply that the 2 groups will share a
core. This is key.  Also, to support usecase #4, we could add a third tag
value -- 2, along with the usual 0 and 1 to suggest that the CGroup can share
a core with cookie-0 tasks (Chris Hyser feel free to add any more comments
here).
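
As a rough illustration of how a privileged daemon might drive the proposed
files, assuming they are exposed as cpu.tag and cpu.color in the job's CGroup
directory (the names and paths below are only a sketch of this proposal, not
an existing interface):

    /* Minimal sketch: tag CGroup "A" and give it color 3.  The file names
     * and mount point are assumptions based on the proposal above. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *val)
    {
            int fd = open(path, O_WRONLY);
            if (fd < 0)
                    return -1;
            ssize_t ret = write(fd, val, strlen(val));
            close(fd);
            return ret < 0 ? -1 : 0;
    }

    int main(void)
    {
            /* The privileged daemon isolates job A and picks its color. */
            if (write_str("/sys/fs/cgroup/A/cpu.tag", "1") ||
                write_str("/sys/fs/cgroup/A/cpu.color", "3"))
                    perror("core-sched tagging failed");
            return 0;
    }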

For Usecase #2
--
We could add an interface that Peter suggested where 2 PIDs A and B want to
share a core. So if A wants to share a core with B, then it issues
prctl(SET_CORE_SHARE, B). ptrace_may_access() can be used to restrict access.
For renderers though, we likely want to allow a renderer to share a core
exclusively with threads within that renderer and no one else. To support
this, renderer A could simply issue prctl(SET_CORE_SHARE, A).
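
As a concrete sketch, a renderer restricting core sharing to its own threads
might do something like the following (SET_CORE_SHARE is a placeholder value
here; the real prctl command number would only be assigned once an interface
is merged):

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    #define SET_CORE_SHARE 0x1000  /* placeholder, not a real prctl command */

    int main(void)
    {
            /* Renderer A shares a core only with its own threads. */
            if (prctl(SET_CORE_SHARE, (unsigned long)getpid(), 0, 0, 0) == -1)
                    perror("prctl(SET_CORE_SHARE)"); /* expected to fail today */
            return 0;
    }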

For Usecase #3
--
By default, all threads within a process will share a core. This makes the
most sense because threads in a process share the same 

Re: CFS flat runqueue proposal fixes/update

2020-08-21 Thread Dietmar Eggemann
On 20.08.20 22:39, Rik van Riel wrote:
> On Thu, 2020-08-20 at 16:56 +0200, Dietmar Eggemann wrote:

[...]

> The issue happens with a flat runqueue, when t1 goes
> to sleep, but t2 and t3 continue running.
> 
> We need to make sure the vruntime for t2 has not been
> advanced so far into the future that cgroup A is unable
> to get its fair share of CPU while t1 is sleeping.

Ah, these problems are related to the 'CFS flat runqueue' patch-set!
Misunderstanding on my side.

I somehow assumed that you wanted to say that these problems exist in
current mainline and could be solved with the patch-set plus some extra
functionality on top.


Re: CFS flat runqueue proposal fixes/update

2020-08-20 Thread Rik van Riel
On Thu, 2020-08-20 at 16:56 +0200, Dietmar Eggemann wrote:
> Hi Rik,
> 
> On 31/07/2020 09:42, Rik van Riel wrote:
> 
> [...]
> 
> > Lets revisit the hierarchy from above, and assign priorities
> > to the cgroups, with the fixed point one being 1000. Lets
> > say cgroups A, A1, and B have priority 1000, while cgroup
> > A2 has priority 1.
> > 
> >         /\
> >        /  \
> >       A    B
> >      / \    \
> >     A1  A2   t3
> >     /     \
> >    t1      t2
> > 
> > One consequence of this is that when t1, t2, and t3 each
> > get a time slice, the vruntime of tasks t1 and t3 advances
> > at roughly the same speed as the clock time, while the
> > vruntime of task t2 advances 1000x faster.
> > 
> > This is fine if all three tasks continue to be runnable,
> > since t1, t2 and t3 each get their fair share of CPU time.
> > 
> > However, if t1 goes to sleep, t2 is the only thing running
> > inside cgroup A, which has the same priority as cgroup B,
> > and tasks t2 and t3 should be getting the same amount of
> > CPU time.
> > 
> > They eventually will, but not before task t3 has used up
> > enough CPU time to catch up with the enormous vruntime
> > advance that t2 just suffered.
> > 
> > That needs to be fixed, to get near-immediate convergence,
> > and not convergence after some unknown (potentially long)
> > period of time.
> 
> I'm trying to understand this issue in detail ...
> 
> Since t1 and t2 are single tasks in A1 and A2, this taskgroup level
> shouldn't matter for tick preemption after t1 went to sleep?
> 
> check_preempt_tick() is only invoked for 'cfs_rq->nr_running > 1'
> from
> entity_tick().
> 
> IMHO, tick preemption is handled between A and B and since they have
> the
> same cpu.weight (cpu.shares) t2 and t3 get the same time slice after
> t1
> went to sleep.
> 
> I think that here tick preemption happens in the 'if (delta_exec >
> ideal_runtime)' condition w/ delta_exec = curr->sum_exec_runtime -
> curr->prev_sum_exec_runtime.
> 
> Did I miss anything?

The issue happens with a flat runqueue, when t1 goes
to sleep, but t2 and t3 continue running.

We need to make sure the vruntime for t2 has not been
advanced so far into the future that cgroup A is unable
to get its fair share of CPU while t1 is sleeping.

-- 
All Rights Reversed.


signature.asc
Description: This is a digitally signed message part


Re: CFS flat runqueue proposal fixes/update

2020-08-20 Thread Dietmar Eggemann
Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:

[...]

> Lets revisit the hierarchy from above, and assign priorities
> to the cgroups, with the fixed point one being 1000. Lets
> say cgroups A, A1, and B have priority 1000, while cgroup
> A2 has priority 1.
> 
>         /\
>        /  \
>       A    B
>      / \    \
>     A1  A2   t3
>     /     \
>    t1      t2
> 
> One consequence of this is that when t1, t2, and t3 each
> get a time slice, the vruntime of tasks t1 and t3 advances
> at roughly the same speed as the clock time, while the
> vruntime of task t2 advances 1000x faster.
> 
> This is fine if all three tasks continue to be runnable,
> since t1, t2 and t3 each get their fair share of CPU time.
> 
> However, if t1 goes to sleep, t2 is the only thing running
> inside cgroup A, which has the same priority as cgroup B,
> and tasks t2 and t3 should be getting the same amount of
> CPU time.
> 
> They eventually will, but not before task t3 has used up
> enough CPU time to catch up with the enormous vruntime
> advance that t2 just suffered.
> 
> That needs to be fixed, to get near-immediate convergence,
> and not convergence after some unknown (potentially long)
> period of time.

I'm trying to understand this issue in detail ...

Since t1 and t2 are single tasks in A1 and A2, this taskgroup level
shouldn't matter for tick preemption after t1 went to sleep?

check_preempt_tick() is only invoked for 'cfs_rq->nr_running > 1' from
entity_tick().

IMHO, tick preemption is handled between A and B and since they have the
same cpu.weight (cpu.shares) t2 and t3 get the same time slice after t1
went to sleep.

I think that here tick preemption happens in the 'if (delta_exec >
ideal_runtime)' condition w/ delta_exec = curr->sum_exec_runtime -
curr->prev_sum_exec_runtime.

Did I miss anything?

[...]


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-08-12 Thread Shakeel Butt
Hi Michal,

On Tue, Aug 11, 2020 at 10:36 AM Michal Koutný  wrote:
>
> Hi Shakeel.
>
> On Tue, Jul 07, 2020 at 10:02:50AM -0700, Shakeel Butt  
> wrote:
> > > Well, I was talking about memory.low. It is not meant only to protect
> > > from the global reclaim. It can be used for balancing memory reclaim
> > > from _any_ external memory pressure source. So it is somehow related to
> > > the usecase you have mentioned.
> > >
> >
> > For the uswapd use-case, I am not concerned about the external memory
> > pressure source but the application hitting its own memory.high limit
> > and getting throttled.
> FTR, you can transform own memory.high into "external" pressure with a
> hierarchy such as
>
>   limit-group                  memory.high=N+margin  memory.low=0
>   `- latency-sensitive-group   memory.low=N
>   `- regular-group             memory.low=0
>
> Would that ensure the latency targets?
>

My concern was not "whom to reclaim from" but it was "If I use
memory.high for reclaim, processes running in that memcg can hit
memory.high and get throttled". However Roman has reduced the window
where that can happen. Anyways I will send the next version after this
merge window closes.

Shakeel


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-08-11 Thread Michal Koutný
Hi Shakeel.

On Tue, Jul 07, 2020 at 10:02:50AM -0700, Shakeel Butt  
wrote:
> > Well, I was talking about memory.low. It is not meant only to protect
> > from the global reclaim. It can be used for balancing memory reclaim
> > from _any_ external memory pressure source. So it is somehow related to
> > the usecase you have mentioned.
> >
> 
> For the uswapd use-case, I am not concerned about the external memory
> pressure source but the application hitting its own memory.high limit
> and getting throttled.
FTR, you can transform own memory.high into "external" pressure with a
hierarchy such as

  limit-group                  memory.high=N+margin  memory.low=0
  `- latency-sensitive-group   memory.low=N
  `- regular-group             memory.low=0

Would that ensure the latency targets?

Michal


signature.asc
Description: Digital signature


Re: CFS flat runqueue proposal fixes/update

2020-08-07 Thread Rik van Riel
On Fri, 2020-08-07 at 16:14 +0200, Dietmar Eggemann wrote:
> On 31/07/2020 09:42, Rik van Riel wrote:
> > Possible solution
> > ...
> I imagine that I can see what you want to achieve here ;-)
> 
> But it's hard since your v5 RFC
> https://lkml.kernel.org/r/20190906191237.27006-1-r...@surriel.com is
> pretty old by now.

The v5 patches also do not implement this new idea, and
still suffer from the corner cases that Paul Turner
pointed out last year.

> Do you have a version of the patch-set against tip/sched/core? Quite
> a
> lot has changed (runnable load avg replaced by runnable avg, rq->load 
> is
> gone, CFS load balance rework).

Not yet. We got a baby this spring, so I've been busy
with things like milk and diapers, instead of with
code.

I wanted to get this proposal out before Plumbers, so
we could at least talk about it, and maybe find flaws
with this idea before I spend weeks/months implementing it :)

> IMHO it would be extremely helpful to have a current patch-set to
> discuss how these other problems can be covered by patches on top.

Agreed. I hope to get some time to work on that, but
no guarantees about getting it ready before Plumbers :)

-- 
All Rights Reversed.


signature.asc
Description: This is a digitally signed message part


Re: CFS flat runqueue proposal fixes/update

2020-08-07 Thread Dietmar Eggemann
Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:
> Hello,
> 
> last year at Linux Plumbers conference, I presented on my work
> of turning the hierarchical CFS runqueue into a flat runqueue,
> and Paul Turner pointed out some corner cases that could not
> work with my design as it was last year.
> 
> Paul pointed out two corner cases, and I have come up with a
> third one myself, but I believe they all revolve around the
> same issue, admission control, and can all be solved with the
> same solution.

[...]

> Possible solution
> 
> A possible solution to the problems above consists of
> three parts:
> - Instead of accounting all used wall clock CPU time into
>   vruntime immediately, at the task's current hierarchical
>   priority, the vruntime can be accounted for piecemeal, for
>   example in amounts corresponding to "one timeslice at full
>   priority".
> - If there is only one runnable task on the runqueue,
>   all the runtime can be accounted into vruntime in one go.
> - Tasks that cannot account all of their used CPU time
>   into vruntime at once can be removed from the root runqueue,
>   and placed into the cgroup runqueue. A heap of cgroup
>   runqueues with "overloaded" tasks can be attached to the
>   main runqueue, where the left-most task from that heap of
>   heaps gets some vruntime accounted every time we go into
>   pick_next_task.
> - The difference between the vruntime of each task in that
>   heap and the vruntime of the root runqueue can help determine
>   how much vruntime can be accounted to that task at once.
> - If the task, or its runqueue, is no longer the left-most
>   in the heap after getting vruntime accounted, that runqueue
>   and the queue of runqueues can be resorted.
> - Once a task has accounted all of its outstanding delta
>   exec runtime into vruntime, it can be moved back to the
>   main runqueue.
> - This should solve the unequal task weight scenario Paul
>   Turner pointed out last year, because after task t1 goes
>   to sleep and only t2 and t3 remain on the CPU, t2 will
>   get its delta exec runtime converted into vruntime at
>   its new priority (equal to t3).
> - By only accounting delta exec runtime to vruntime for
>   the left-most task in the "overloaded" heap at one time,
>   we guarantee only one task at a time will be added back
>   into the root runqueue.
> - Every time a task is added to the root runqueue, that
>   slows down the rate at which vruntime advances, which
>   in turn reduces the rate at which tasks get added back
>   into the runqueue, and makes it more likely that a currently
>   running task with low hierarchical priority gets booted
>   off into the "overloaded" heap.
> 
> To tackle the thundering herd at task wakeup time, another
> strategy may be needed. One thing we may be able to do
> there is place tasks into the "overloaded" heap immediately
> on wakeup, if the hierarchical priority of the task is so
> low that if the task were to run a minimal timeslice length,
> it would be unable to account that time into its vruntime
> in one go, AND the CPU already has a larger number of tasks
> on it.
> 
> Because the common case on most systems is having just
> 0, 1, or 2 runnable tasks on a CPU, this fancy scheme
> should rarely end up being used, and even when it is the
> overhead should be reasonable because most of the
> overloaded tasks will just sit there until pick_next_task
> gets around to them.
> 
> Does this seem like something worth trying out?
> 
> Did I overlook any other corner cases that would make this
> approach unworkable?
> 
> Did I forget to explain anything that is needed to help
> understand the problem and the proposed solution better?

I imagine that I can see what you want to achieve here ;-)

But it's hard since your v5 RFC
https://lkml.kernel.org/r/20190906191237.27006-1-r...@surriel.com is
pretty old by now.

Do you have a version of the patch-set against tip/sched/core? Quite a
lot has changed (runnable load avg replaced by runnable avg, rq->load is
gone, CFS load balance rework).

IIRC, the 'CFS flat runqueue design' has the advantage to reduce the
overhead in taskgroup heavy environments like systemd. And I recall that
v5 doesn't cover CFS bandwidth control yet.

IMHO it would be extremely helpful to have a current patch-set to
discuss how these other problems can be covered by patches on top.


Proposal

2020-08-05 Thread John Woods
Dear Sir, Madam,
 

Proposal

 
I am Mr.John Woods, a Consultant with the Department of Power and Steel here in 
Spain . I have been contracted by a wealthy individual and serving  government 
official from somewhere in Africa who is interested in  engaging your services 
for investment of a large volume of fund (Ten Million Five Hundred Thousand 
Dollars ) Which he Deposited with a Finance Company here in Spain . You will 
act as the beneficiary of the fund to carry out the investment of the fund in 
any business venture you consider lucrative.

If you are capable of handling any type of investment in your country, Please, 
get back to me immediately and send your telephone and fax numbers to enable me 
communicate with you and provide you with further details.

Note that 30% of the fund is your share (Commission) for securing the fund from 
the security company before the investment of the fund shall be carried out. 
This is necessary because the investor as a serving government official in his 
country of origin is eager to transfer ownership of the fund to you because he 
is not expected in his official capacity to own such huge sum of money.

I will appreciate an urgent response from you.

Yours faithfully,

Mr.John Woods


setsid2(sid) proposal - assign current process to existing session

2020-08-04 Thread Devin Bayer

Hello,

I'm wondering about the possibility of introducing a new system call for 
moving a process to an existing session. If `sid` is an existing session 
with the same owner as the current process, one could call:


setsid2(sid)

This would have similar behavior to setpgid(), and would probably 
effectively call setpgid() internally too.
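
A rough sketch of how a caller might use it, assuming the call were wired up
as a raw syscall (the syscall number below is a made-up placeholder; nothing
like it exists upstream today):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define __NR_setsid2 1000   /* placeholder, not a real syscall number */

    static int setsid2(pid_t sid)
    {
            return (int)syscall(__NR_setsid2, sid);
    }

    int main(void)
    {
            /* Join an existing session; shown here with the caller's own
             * sid purely for illustration. */
            pid_t sid = getsid(0);
            if (setsid2(sid) == -1)
                    perror("setsid2");   /* expected: ENOSYS today */
            return 0;
    }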


The use case is for something like `flatpak-spawn --host`, which allows 
you to launch a program in an outer namespace from an inner namespace. 
It behaves as a child of the caller but is actually a child of an 
external daemon.


It works by connecting stdin/out/err to those of the caller, for example 
a PTY for xterm running in the inner namespace. This works fine for 
non-interactive programs, but it's impossible for the spawned task to 
share the controlling TTY with the shell running in xterm.


I can't see where the problems are, though I'm surprised such 
functionality doesn't yet exist. Because it deals with such basic 
concepts, I'm wondering if such a change will even be considered.


There is a workaround; one can create a new PTY on the host and copy the 
I/O streams manually. Not ideal, but okay.


Any comments welcome.

Cheers
~ dev


Proposal

2020-08-02 Thread John Woods
Dear Sir, Madam,
 

Proposal

 
I am Mr.John Woods, a Consultant with the Department of Power and Steel here in 
Spain . I have been contracted by a wealthy individual and serving  government 
official from somewhere in Africa who is interested in  engaging your services 
for investment of a large volume of fund (Ten Million Five Hundred Thousand 
Dollars ) Which he Deposited with a Finance Company here in Spain . You will 
act as the beneficiary of the fund to carry out the investment of the fund in 
any business venture you consider lucrative.

If you are capable of handling any type of investment in your country, Please, 
get back to me immediately and send your telephone and fax numbers to enable me 
communicate with you and provide you with further details.

Note that 30% of the fund is your share (Commission) for securing the fund from 
the security company before the investment of the fund shall be carried out. 
This is necessary because the investor as a serving government official in his 
country of origin is eager to transfer ownership of the fund to you because he 
is not expected in his official capacity to own such huge sum of money.

I will appreciate an urgent response from you.

Yours faithfully,

Mr.John Woods


CFS flat runqueue proposal fixes/update

2020-07-31 Thread Rik van Riel
Hello,

last year at Linux Plumbers conference, I presented on my work
of turning the hierarchical CFS runqueue into a flat runqueue,
and Paul Turner pointed out some corner cases that could not
work with my design as it was last year.

Paul pointed out two corner cases, and I have come up with a
third one myself, but I believe they all revolve around the
same issue, admission control, and can all be solved with the
same solution.

This is a fairly complex thing, so this email is long and
in 3 parts:
- hierarchical runqueue problem statement
- quick design overview of the flat runqueue for CFS
- overview of the problems pointed out by Paul Turner
- description of a possible solution


   hierarchical runqueue problem statement

Currently CFS with the cgroups CPU controller uses a
hierarchical design, which means that a task is enqueued
on its cgroup's runqueue, that cgroup is enqueued on its
parent's runqueue, etc all the way up to the root runqueue.
This also means that every time a task in a cgroup wakes
up or goes to sleep, the common case (1 task on the CPU)
is that the entire hierarchy is enqueued or dequeued,
resulting in significant overhead for workloads with lots
of context switches.

For example, the hierarchy in a system with equal priority
cgroups A and B, with cgroup A subdivided into two unequal
priority cgroups A1 and A2, and a task in each leaf cgroup
would look like this:

        /\
       /  \
      A    B
     / \    \
    A1  A2   t3
    /     \
   t1      t2

The common case on most systems is that CPUs are idle
some of the time, and CPUs usually have 0, 1, or 2
running tasks at a time.

That means when task t1 wakes up when the CPU is idle,
t1 first gets enqueued on A1, which then gets enqueued
on A2, which then gets enqueued on the root runqueue.
When t1 goes back to sleep, the whole thing is torn back
down.

In case all three tasks are runnable at the same time,
the scheduler will first choose between A and B in the
root runqueue, and, in case it chose A, between A1 and A2
the next level down.


   CFS flat runqueue design

The code I presented last year operates on the principle
of separating the determination of a task's hierarchical priority,
which is done periodically, from the runqueues, and using
the hierarchical priority to scale the vruntime of a task
like is done for nice levels.

The hierarchical priority of a task is its fraction of
the weight at each level of the runqueue hierarchy along the
path to that task. In the example above, with equal priority
levels for cgroups A, B, A1, and A2, and a total weight 1000
for the root runqueue, tasks t1, t2, and t3 will have hierarchical
weights of 250, 250, and 500 respectively.

CFS tracks both wall clock run time and vruntime for each
task, where vruntime is the wall clock run time scaled
according to the task's priority. The vruntime of every
entity on a runqueue advances at the same rate.

The vruntime is calculated as follows:

 vruntime = exec_time * FIXED_POINT_ONE / priority

That is, if the priority of the task is equal to the
fixed point constant used, the vruntime corresponds
to the executed wall clock time, while a lower priority
task has its vruntime advance at a faster rate, and
a higher priority task has its vruntime advance slower.

With every entity on a runqueue having its vruntime
advance at the same rate, that translates into higher
priority tasks receiving more CPU time, and lower
priority tasks receiving less CPU time.
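
To make the scaling concrete, here is a small standalone sketch of the two
calculations described above (hierarchical weight as the product of the
share at each level, and vruntime advancing as wall clock time scaled by
that weight), using the numbers from the example hierarchy:

    #include <stdio.h>

    #define FIXED_POINT_ONE 1000ULL  /* the fixed point "one" used above */

    /* Hierarchical weight: total weight scaled by the share at each level. */
    static unsigned long long hier_weight(unsigned long long total,
                                          const double *share, int levels)
    {
            double w = (double)total;
            for (int i = 0; i < levels; i++)
                    w *= share[i];
            return (unsigned long long)w;
    }

    int main(void)
    {
            double t1_path[] = { 0.5, 0.5 }; /* root->A (1/2), A->A1 (1/2) */
            double t3_path[] = { 0.5 };      /* root->B (1/2) */
            unsigned long long w1 = hier_weight(1000, t1_path, 2); /* 250 */
            unsigned long long w3 = hier_weight(1000, t3_path, 1); /* 500 */
            unsigned long long exec_ns = 1000000; /* 1ms of wall clock time */

            printf("t1: weight %llu, vruntime += %llu ns\n",
                   w1, exec_ns * FIXED_POINT_ONE / w1);
            printf("t3: weight %llu, vruntime += %llu ns\n",
                   w3, exec_ns * FIXED_POINT_ONE / w3);
            return 0;
    }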


Problems identified by Paul Turner

Lets revisit the hierarchy from above, and assign priorities
to the cgroups, with the fixed point one being 1000. Lets
say cgroups A, A1, and B have priority 1000, while cgroup
A2 has priority 1.

        /\
       /  \
      A    B
     / \    \
    A1  A2   t3
    /     \
   t1      t2

One consequence of this is that when t1, t2, and t3 each
get a time slice, the vruntime of tasks t1 and t3 advances
at roughly the same speed as the clock time, while the
vruntime of task t2 advances 1000x faster.

This is fine if all three tasks continue to be runnable,
since t1, t2 and t3 each get their fair share of CPU time.

However, if t1 goes to sleep, t2 is the only thing running
inside cgroup A, which has the same priority as cgroup B,
and tasks t2 and t3 should be getting the same amount of
CPU time.

They eventually will, but not before task t3 has used up
enough CPU time to catch up with the enormous vruntime
advance that t2 just suffered.

That needs to be fixed, to get near-immediate convergence,
and not convergence after some unknown (potentially long)
period of time.


There are similar issues with thundering herds and cgroup
cpu.max throttling, where a large number of tasks can be
woken up simultaneously.

        /\
       /  \
      A    B
     /      \
    t1       t2 t3 t4 t5 ...
                         t42

In the current, hierarchical runqueue setup, CFS will
time slice between cgroups A and B at the top level,
which means task t1 will never go long without being
scheduled.  Only the tasks 

INVESTMENT PROPOSAL.

2020-07-29 Thread Daouda Ali
It’s my pleasure to contact you through this media because I need an
investment assistance in your country. However I have a profitable
investment proposal with  good interest to share with you, amounted
the sum of (Twenty Eight Million Four Hundred Thousand United State
Dollar ($28.400.000.00). If you  are willing to handle this project
kindly reply urgent to enable me provide you more information about
the investment funds and the project.

I am waiting to hear from you through this my private
email(daoudaali2...@gmail.com) so we can proceed further.

Best Regards.
Mr. Daouda Ali.


Re: [Proposal] DRM: AMD: Convert logging to drm_* functions.

2020-07-09 Thread Christian König

Am 08.07.20 um 18:11 schrieb Suraj Upadhyay:

Hii AMD Maintainers,
I plan to convert logging of information, error and warnings
inside the AMD driver(s) to drm_* functions and macros for logging,
as described by the TODO list in the DRM documentation[1].

I need your approval for the change before sending any patches, to make
sure that this is a good idea and that the patches will be merged.

The patches will essentially convert all the dev_info(), dev_warn(),
dev_err() and dev_err_once() to drm_info(), drm_warn(), drm_err() and
drm_err_once() respectively.


Well, to be honest, I would rather like to see the conversion done in the 
other direction.


I think the drm_* functions are just an unnecessary extra layer on top 
of the core kernel functions and should probably be removed sooner or 
later because of midlayering.


Regards,
Christian.



Thank You,

Suraj Upadhyay.

[1] 
https://dri.freedesktop.org/docs/drm/gpu/todo.html#convert-logging-to-drm-functions-with-drm-device-paramater





[Proposal] DRM: AMD: Convert logging to drm_* functions.

2020-07-08 Thread Suraj Upadhyay
Hii AMD Maintainers,
I plan to convert logging of information, error and warnings
inside the AMD driver(s) to drm_* functions and macros for logging,
as described by the TODO list in the DRM documentation[1].

I need your approval for the change before sending any patches, to make
sure that this is a good idea and that the patches will be merged.

The patches will essentially convert all the dev_info(), dev_warn(),
dev_err() and dev_err_once() to drm_info(), drm_warn(), drm_err() and
drm_err_once() respectively.
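
To illustrate the shape of the change, a typical call site would go from the
device-based macro to the DRM one roughly like this (the "dev" and "drm"
pointers below are placeholders; the actual patches would use whatever
struct device / struct drm_device each call site already has at hand):

    /* before: logging keyed on struct device */
    dev_info(dev, "firmware loaded\n");
    dev_warn(dev, "ring %d test failed\n", ring);

    /* after: logging keyed on struct drm_device */
    drm_info(drm, "firmware loaded\n");
    drm_warn(drm, "ring %d test failed\n", ring);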

Thank You,

Suraj Upadhyay.

[1] 
https://dri.freedesktop.org/docs/drm/gpu/todo.html#convert-logging-to-drm-functions-with-drm-device-paramater



signature.asc
Description: PGP signature


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-07 Thread Shakeel Butt
On Tue, Jul 7, 2020 at 5:14 AM Michal Hocko  wrote:
>
> On Fri 03-07-20 07:23:14, Shakeel Butt wrote:
> > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> > >
> > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > > [...]
> > > > Interface options:
> > > > --
> > > >
> > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > > >
> > > > + simple
> > > > + can be extended to target specific type of memory (anon, file, kmem).
> > > > - most probably restricted to cgroup v2.
> > > >
> > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > > >
> > > > + more general and applicable to other FSes (actually we are using
> > > > something similar for tmpfs).
> > > > + can be extended in future to just age the LRUs instead of reclaim or
> > > > some new use cases.
> > >
> > > Could you explain why memory.high as an interface to trigger pro-active
> > > memory reclaim is not sufficient. Also memory.low limit to protect
> > > latency sensitive workloads?
> >
> > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> > but note that it can also introduce stalls in the application running
> > in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> > and we want to reclaim 20MiB from it, we can set the memory.high to
> > 80MiB but any allocation attempt from the application running in that
> > memcg can get stalled/throttled. I want the functionality of the
> > reclaim without potential stalls.
>
> It would be great if the proposal mention this limitation.
>

Will do in the next version.


> > The memory.min is for protection against the global reclaim and is
> > unrelated to this discussion.
>
> Well, I was talking about memory.low. It is not meant only to protect
> from the global reclaim. It can be used for balancing memory reclaim
> from _any_ external memory pressure source. So it is somehow related to
> the usecase you have mentioned.
>

For the uswapd use-case, I am not concerned about the external memory
pressure source but the application hitting its own memory.high limit
and getting throttled.

> What you consider a latency sensitive workload could be protected from
> directly induced reclaim latencies. You could use low events to learn
> about the external memory pressure and update your protection to allow
> for some reclaim. I do understand that this wouldn't solve your problem
> who gets reclaimed and maybe that is the crux on why it is not
> applicable but that should really be mentioned explicitly.
>

The main aim for the proactive reclaim is to not cause an external
memory pressure. The low events can be another source of information
to tell the system level situation to the 'Memory Overcommit
Controller'. So, I see the low events as complementary, not the
replacement for the reclaim interface.

BTW by "low events from external memory pressure" am I correct in
understanding that you meant an unrelated job reclaiming and
triggering low events on a job of interest. Or do you mean to
partition a job into sub-jobs and then use the low events between
these sub-jobs somehow?


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-07 Thread Shakeel Butt
On Mon, Jul 6, 2020 at 2:38 PM Roman Gushchin  wrote:
>
> On Fri, Jul 03, 2020 at 09:27:19AM -0700, Shakeel Butt wrote:
> > On Fri, Jul 3, 2020 at 8:50 AM Roman Gushchin  wrote:
> > >
> > > On Fri, Jul 03, 2020 at 07:23:14AM -0700, Shakeel Butt wrote:
> > > > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> > > > >
> > > > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > > > > [...]
> > > > > > Interface options:
> > > > > > --
> > > > > >
> > > > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > > > > >
> > > > > > + simple
> > > > > > + can be extended to target specific type of memory (anon, file, 
> > > > > > kmem).
> > > > > > - most probably restricted to cgroup v2.
> > > > > >
> > > > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > > > > >
> > > > > > + more general and applicable to other FSes (actually we are using
> > > > > > something similar for tmpfs).
> > > > > > + can be extended in future to just age the LRUs instead of reclaim 
> > > > > > or
> > > > > > some new use cases.
> > > > >
> > > > > Could you explain why memory.high as an interface to trigger 
> > > > > pro-active
> > > > > memory reclaim is not sufficient. Also memory.low limit to protect
> > > > > latency sensitive workloads?
> > >
> > > I initially liked the proposal, but after some thoughts I've realized
> > > that I don't know a good use case where memory.high is less useful.
> > > Shakeel, what's the typical use case you thinking of?
> > > Who and how will use the new interface?
> > >
> > > >
> > > > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> > > > but note that it can also introduce stalls in the application running
> > > > in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> > > > and we want to reclaim 20MiB from it, we can set the memory.high to
> > > > 80MiB but any allocation attempt from the application running in that
> > > > memcg can get stalled/throttled. I want the functionality of the
> > > > reclaim without potential stalls.
> > >
> > > But reclaiming some pagecache/swapping out anon pages can always
> > > generate some stalls caused by pagefaults, no?
> > >
> >
> > Thanks for looking into the proposal. Let me answer both of your
> > questions together. I have added the two use-cases but let me explain
> > the proactive reclaim a bit more as we actually use that in our
> > production.
> >
> > We have defined tolerable refault rates for the applications based on
> > their type (latency sensitive or not). Proactive reclaim is triggered
> > in the application based on their current refault rates and usage. If
> > the current refault rate exceeds the tolerable refault rate then
> > stop/slowdown the proactive reclaim.
> >
> > For the second question, yes, each individual refault can induce the
> > stall as well but we have more control on that stall as compared to
> > stalls due to reclaim. For us almost all the reclaimable memory is
> > anon and we use compression based swap, so, the cost of each refault
> > is fixed and a couple of microseconds.
> >
> > I think the next question is what about the refaults from disk or
> > source with highly variable cost. Usually the latency sensitive
> > applications remove such uncertainty by mlocking the pages backed by
> > such backends (e.g. mlocking the executable) or at least that is the
> > case for us.
>
> Got it.
>
> It feels like you're suggesting something similar to memory.high with
> something similar to a different gfp flags. In other words, the
> difference is only which pages can be reclaimed and which not. I don't
> have a definitive answer here, but I wonder if we can somehow
> generalize the existing interface? E.g. if the problem is with artificially
> induced delays, we can have a config option/sysctl/sysfs knob/something else
> which would disable it. Otherwise we risk ending up with many different kinds
> of soft memory limits.
>

It is possible to achieve this functionality with memory.high with
some config/sysctls e.t.c as you suggested but it can bloat and
complicate the memory.high interface.

I understand your concern with different kinds of soft memory limits
but I see this as a simple interface (i.e. just trigger reclaim) that
gives users the flexibility to define and (soft) enforce their own
virtual limits on their jobs. More specifically this interface allows
reclaiming from a job to keep the usage below some virtual limit which
can correspond to some user defined metric. In my proactive reclaim
example, the user defined metric is refault rates. Keep the usage of
the job at a level where the refault rates are tolerable.
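
As a sketch of what such a proactive reclaimer could look like on top of
option (1), assuming the memcg exposes a memory.reclaim file as proposed and
that the job's refault rate is available from some external metric (both the
path and read_refault_rate() below are stand-ins, not existing interfaces):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static double read_refault_rate(void) { return 0.0; }  /* placeholder */

    int main(void)
    {
            const char *path = "/sys/fs/cgroup/job/memory.reclaim";
            const double tolerable = 0.01;  /* policy-defined refault rate */

            /* Reclaim in small steps while refaults stay tolerable. */
            for (int i = 0; i < 6 && read_refault_rate() < tolerable; i++) {
                    int fd = open(path, O_WRONLY);
                    if (fd < 0) {
                            perror("open memory.reclaim");
                            break;
                    }
                    if (write(fd, "10M", strlen("10M")) < 0) {
                            close(fd);
                            break;  /* nothing left to reclaim */
                    }
                    close(fd);
                    sleep(10);      /* let refault rates settle, re-evaluate */
            }
            return 0;
    }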


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-07 Thread Michal Hocko
On Fri 03-07-20 07:23:14, Shakeel Butt wrote:
> On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> >
> > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > [...]
> > > Interface options:
> > > --
> > >
> > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > >
> > > + simple
> > > + can be extended to target specific type of memory (anon, file, kmem).
> > > - most probably restricted to cgroup v2.
> > >
> > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > >
> > > + more general and applicable to other FSes (actually we are using
> > > something similar for tmpfs).
> > > + can be extended in future to just age the LRUs instead of reclaim or
> > > some new use cases.
> >
> > Could you explain why memory.high as an interface to trigger pro-active
> > memory reclaim is not sufficient. Also memory.low limit to protect
> > latency sensitive workloads?
> 
> Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> but note that it can also introduce stalls in the application running
> in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> and we want to reclaim 20MiB from it, we can set the memory.high to
> 80MiB but any allocation attempt from the application running in that
> memcg can get stalled/throttled. I want the functionality of the
> reclaim without potential stalls.

It would be great if the proposal mentioned this limitation.

> The memory.min is for protection against the global reclaim and is
> unrelated to this discussion.

Well, I was talking about memory.low. It is not meant only to protect
from the global reclaim. It can be used for balancing memory reclaim
from _any_ external memory pressure source. So it is somehow related to
the use case you have mentioned.

What you consider a latency sensitive workload could be protected from
directly induced reclaim latencies. You could use low events to learn
about the external memory pressure and update your protection to allow
for some reclaim. I do understand that this wouldn't solve your problem
of who gets reclaimed, and maybe that is the crux of why it is not
applicable, but that should really be mentioned explicitly.

-- 
Michal Hocko
SUSE Labs
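
A rough sketch of the memory.low feedback loop described above: watch the
"low" counter in memory.events and shrink the protection a little whenever
external pressure is being deflected, so some reclaim is allowed. The
cgroup path, polling interval and step size are illustrative only.

/* Sketch of adjusting memory.low based on "low" events (assumptions as above). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static const char *memcg = "/sys/fs/cgroup/myjob";      /* hypothetical */

static unsigned long long read_low_events(void)
{
        char path[256], key[64];
        unsigned long long val, low = 0;
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.events", memcg);
        f = fopen(path, "r");
        if (!f)
                return 0;
        while (fscanf(f, "%63s %llu", key, &val) == 2)
                if (!strcmp(key, "low"))
                        low = val;
        fclose(f);
        return low;
}

int main(void)
{
        unsigned long long protection = 512ULL << 20;    /* start at 512MiB */
        unsigned long long prev = read_low_events(), cur;
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.low", memcg);
        for (;;) {
                sleep(5);
                cur = read_low_events();
                /* new low events mean protection is deflecting pressure */
                if (cur > prev && protection >= (128ULL << 20))
                        protection -= 64ULL << 20;       /* allow some reclaim */
                prev = cur;
                f = fopen(path, "w");
                if (f) {
                        fprintf(f, "%llu", protection);
                        fclose(f);
                }
        }
        return 0;
}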


Re: [Proposal] drm: amd: Convert logging to drm_* functions with drm_device parameter

2020-07-06 Thread Daniel Vetter
On Mon, Jul 06, 2020 at 04:21:38PM +0530, Suraj Upadhyay wrote:
> Hi Maintainers,
>   I recently came across this list of janitorial tasks
> for starters on the DRM subsystem [1]. One of the tasks is replacing
> conventional dmesg macros (like dev_info(), dev_warn() and dev_err())
> with DRM dmesg macros [2]. And I need your input on whether the
> conversions to DRM dmesg macros are worth it.
> I would like to start working on this task if this change is needed.

For any core code I'm happy to merge such patches. If you're changing a
specific driver (all the subdirectories under drivers/gpu/drm/*) then
please ping the specific driver maintainer first. They should all be
listed in the MAINTAINERS file.

Cheers, Daniel

> 
> Thank you,
> Suraj Upadhyay.
> 
> [1] https://dri.freedesktop.org/docs/drm/gpu/todo.html.
> [2] 
> https://dri.freedesktop.org/docs/drm/gpu/todo.html#convert-logging-to-drm-functions-with-drm-device-paramater
> 



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-06 Thread Roman Gushchin
On Fri, Jul 03, 2020 at 09:27:19AM -0700, Shakeel Butt wrote:
> On Fri, Jul 3, 2020 at 8:50 AM Roman Gushchin  wrote:
> >
> > On Fri, Jul 03, 2020 at 07:23:14AM -0700, Shakeel Butt wrote:
> > > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> > > >
> > > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > > > [...]
> > > > > Interface options:
> > > > > --
> > > > >
> > > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > > > >
> > > > > + simple
> > > > > + can be extended to target specific type of memory (anon, file, 
> > > > > kmem).
> > > > > - most probably restricted to cgroup v2.
> > > > >
> > > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > > > >
> > > > > + more general and applicable to other FSes (actually we are using
> > > > > something similar for tmpfs).
> > > > > + can be extended in future to just age the LRUs instead of reclaim or
> > > > > some new use cases.
> > > >
> > > > Could you explain why memory.high as an interface to trigger pro-active
> > > > memory reclaim is not sufficient? Also, can the memory.low limit protect
> > > > latency sensitive workloads?
> >
> > I initially liked the proposal, but after some thought I've realized
> > that I don't know a good use case where memory.high is less useful.
> > Shakeel, what's the typical use case you're thinking of?
> > Who will use the new interface, and how?
> >
> > >
> > > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> > > but note that it can also introduce stalls in the application running
> > > in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> > > and we want to reclaim 20MiB from it, we can set the memory.high to
> > > 80MiB but any allocation attempt from the application running in that
> > > memcg can get stalled/throttled. I want the functionality of the
> > > reclaim without potential stalls.
> >
> > But reclaiming some pagecache/swapping out anon pages can always
> > generate some stalls caused by pagefaults, no?
> >
> 
> Thanks for looking into the proposal. Let me answer both of your
> questions together. I have added the two use-cases but let me explain
> the proactive reclaim a bit more as we actually use that in our
> production.
> 
> We have defined tolerable refault rates for the applications based on
> their type (latency sensitive or not). Proactive reclaim is triggered
> for an application based on its current refault rate and usage. If
> the current refault rate exceeds the tolerable refault rate, the
> proactive reclaim is stopped or slowed down.
> 
> For the second question, yes, each individual refault can induce a
> stall as well, but we have more control over that stall than over
> stalls due to reclaim. For us almost all the reclaimable memory is
> anon and we use compression-based swap, so the cost of each refault
> is fixed, at a couple of microseconds.
> 
> I think the next question is what about refaults from disk or another
> source with a highly variable cost. Usually the latency sensitive
> applications remove such uncertainty by mlocking the pages backed by
> such backends (e.g. mlocking the executable), or at least that is the
> case for us.

Got it.

It feels like you're suggesting something similar to memory.high with
something like different gfp flags. In other words, the
difference is only which pages can be reclaimed and which cannot. I don't
have a definitive answer here, but I wonder if we can somehow
generalize the existing interface? E.g. if the problem is with artificially
induced delays, we can have a config option/sysctl/sysfs knob/something else
which would disable it. Otherwise we risk ending up with many different kinds
of soft memory limits.

Thanks!


[Proposal] drm: amd: Convert logging to drm_* functions with drm_device parameter

2020-07-06 Thread Suraj Upadhyay
Hi Maintainers,
I recently came across this list of janitorial tasks
for starters on the DRM subsystem [1]. One of the tasks is replacing
conventional dmesg macros (like dev_info(), dev_warn() and dev_err())
with DRM dmesg macros [2]. And I need your input on whether the
conversions to DRM dmesg macros are worth it.
I would like to start working on this task if this change is needed.

Thank you,
Suraj Upadhyay.

[1] https://dri.freedesktop.org/docs/drm/gpu/todo.html.
[2] 
https://dri.freedesktop.org/docs/drm/gpu/todo.html#convert-logging-to-drm-functions-with-drm-device-paramater



signature.asc
Description: PGP signature


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-03 Thread Shakeel Butt
On Fri, Jul 3, 2020 at 8:50 AM Roman Gushchin  wrote:
>
> On Fri, Jul 03, 2020 at 07:23:14AM -0700, Shakeel Butt wrote:
> > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> > >
> > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > > [...]
> > > > Interface options:
> > > > --
> > > >
> > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > > >
> > > > + simple
> > > > + can be extended to target specific type of memory (anon, file, kmem).
> > > > - most probably restricted to cgroup v2.
> > > >
> > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > > >
> > > > + more general and applicable to other FSes (actually we are using
> > > > something similar for tmpfs).
> > > > + can be extended in future to just age the LRUs instead of reclaim or
> > > > some new use cases.
> > >
> > > Could you explain why memory.high as an interface to trigger pro-active
> > > memory reclaim is not sufficient? Also, can the memory.low limit protect
> > > latency sensitive workloads?
>
> I initially liked the proposal, but after some thought I've realized
> that I don't know a good use case where memory.high is less useful.
> Shakeel, what's the typical use case you're thinking of?
> Who will use the new interface, and how?
>
> >
> > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> > but note that it can also introduce stalls in the application running
> > in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> > and we want to reclaim 20MiB from it, we can set the memory.high to
> > 80MiB but any allocation attempt from the application running in that
> > memcg can get stalled/throttled. I want the functionality of the
> > reclaim without potential stalls.
>
> But reclaiming some pagecache/swapping out anon pages can always
> generate some stalls caused by pagefaults, no?
>

Thanks for looking into the proposal. Let me answer both of your
questions together. I have added the two use-cases but let me explain
the proactive reclaim a bit more as we actually use that in our
production.

We have defined tolerable refault rates for the applications based on
their type (latency sensitive or not). Proactive reclaim is triggered
for an application based on its current refault rate and usage. If
the current refault rate exceeds the tolerable refault rate, the
proactive reclaim is stopped or slowed down.

For the second question, yes, each individual refault can induce a
stall as well, but we have more control over that stall than over
stalls due to reclaim. For us almost all the reclaimable memory is
anon and we use compression-based swap, so the cost of each refault
is fixed, at a couple of microseconds.

I think the next question is what about refaults from disk or another
source with a highly variable cost. Usually the latency sensitive
applications remove such uncertainty by mlocking the pages backed by
such backends (e.g. mlocking the executable), or at least that is the
case for us.

Thanks,
Shakeel


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-03 Thread Roman Gushchin
On Fri, Jul 03, 2020 at 07:23:14AM -0700, Shakeel Butt wrote:
> On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> >
> > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > [...]
> > > Interface options:
> > > --
> > >
> > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > >
> > > + simple
> > > + can be extended to target specific type of memory (anon, file, kmem).
> > > - most probably restricted to cgroup v2.
> > >
> > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > >
> > > + more general and applicable to other FSes (actually we are using
> > > something similar for tmpfs).
> > > + can be extended in future to just age the LRUs instead of reclaim or
> > > some new use cases.
> >
> > Could you explain why memory.high as an interface to trigger pro-active
> > memory reclaim is not sufficient? Also, can the memory.low limit protect
> > latency sensitive workloads?

I initially liked the proposal, but after some thought I've realized
that I don't know a good use case where memory.high is less useful.
Shakeel, what's the typical use case you're thinking of?
Who will use the new interface, and how?

> 
> Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> but note that it can also introduce stalls in the application running
> in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> and we want to reclaim 20MiB from it, we can set the memory.high to
> 80MiB but any allocation attempt from the application running in that
> memcg can get stalled/throttled. I want the functionality of the
> reclaim without potential stalls.

But reclaiming some pagecache/swapping out anon pages can always
generate some stalls caused by pagefaults, no?

Thanks!


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-03 Thread Shakeel Butt
On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
>
> On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> [...]
> > Interface options:
> > --
> >
> > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> >
> > + simple
> > + can be extended to target specific type of memory (anon, file, kmem).
> > - most probably restricted to cgroup v2.
> >
> > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> >
> > + more general and applicable to other FSes (actually we are using
> > something similar for tmpfs).
> > + can be extended in future to just age the LRUs instead of reclaim or
> > some new use cases.
>
> Could you explain why memory.high as an interface to trigger pro-active
> memory reclaim is not sufficient? Also, can the memory.low limit protect
> latency sensitive workloads?

Yes, we can use memory.high to trigger [proactive] reclaim in a memcg,
but note that it can also introduce stalls in the application running
in that memcg. Let's suppose the memory.current of a memcg is 100MiB
and we want to reclaim 20MiB from it. We can set memory.high to
80MiB, but then any allocation attempt from the application running in
that memcg can get stalled/throttled. I want the functionality of the
reclaim without the potential stalls.

The memory.min is for protection against the global reclaim and is
unrelated to this discussion.


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-02 Thread Michal Hocko
On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
[...]
> Interface options:
> --
> 
> 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> 
> + simple
> + can be extended to target specific type of memory (anon, file, kmem).
> - most probably restricted to cgroup v2.
> 
> 2) fadvise(PAGEOUT) on cgroup_dir_fd
> 
> + more general and applicable to other FSes (actually we are using
> something similar for tmpfs).
> + can be extended in future to just age the LRUs instead of reclaim or
> some new use cases.

Could you explain why memory.high as an interface to trigger pro-active
memory reclaim is not sufficient? Also, can the memory.low limit protect
latency sensitive workloads?
-- 
Michal Hocko
SUSE Labs


[RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-02 Thread Shakeel Butt
This is a proposal to expose an interface to the user space to trigger
memory reclaim on a memory cgroup. The proposal contains potential use
cases, benefits of the user space interface and potential implementation
choices.

Use cases:
--

1) Per-memcg uswapd:

Usually applications consist of a combination of latency sensitive and
latency tolerant tasks. For example, tasks serving user requests vs
tasks doing data backup for a database application. At the moment the
kernel does not differentiate between such tasks when the application
hits the memcg limits. So, potentially a latency sensitive user facing
task can get stuck in memory reclaim and be throttled by the kernel.

This problem has been discussed before [1, 2].

One way to resolve this issue is to preemptively trigger the memory
reclaim from a latency tolerant task (uswapd) when the application is
near the limits. (Please note that detecting the 'near the limits'
situation is an orthogonal problem; we are exploring whether per-memcg
MemAvailable notifications can be useful [3].)

2) Proactive reclaim:

This is similar to the previous use case; the difference is that instead
of waiting for the application to be near its limit to trigger memory
reclaim, the memcg is continuously pressured to reclaim a small amount of
memory. This gives a more accurate and up-to-date working set estimation
as the LRUs are continuously sorted, and can potentially provide more
deterministic memory overcommit behavior. The memory overcommit
controller can respond proactively to the changing workload of the
running applications instead of being reactive.

Benefit of user space solution:
---

1) More flexibility on who should be charged for the cpu of the memory
reclaim. For proactive reclaim, it makes more sense to centralize the
overhead, while for uswapd it makes more sense for the application to
pay for the cpu of the memory reclaim.

2) More flexibility in dedicating resources (like cpu). The memory
overcommit controller can balance the cost between the cpu usage and
the memory reclaimed.

3) Provides a way for the applications to keep their LRUs sorted, so
that under memory pressure better reclaim candidates are selected.

Interface options:
--

1) memcg interface e.g. 'echo 10M > memory.reclaim'

+ simple
+ can be extended to target specific type of memory (anon, file, kmem).
- most probably restricted to cgroup v2.

2) fadvise(PAGEOUT) on cgroup_dir_fd

+ more general and applicable to other FSes (actually we are using
something similar for tmpfs).
+ can be extended in future to just age the LRUs instead of reclaim or
some new use cases.

[Or maybe a new fadvise2() syscall which can take FS specific options.]
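
A sketch of how a controller might drive option 2 from user space, assuming
a new POSIX_FADV_PAGEOUT advice value (the name and numeric value below are
hypothetical, nothing with that name is upstream) and assuming len is
interpreted as the number of bytes to reclaim:

#include <fcntl.h>
#include <stdio.h>

#ifndef POSIX_FADV_PAGEOUT
#define POSIX_FADV_PAGEOUT 7            /* hypothetical advice value */
#endif

int main(void)
{
        int err;
        /* hypothetical cgroup directory */
        int fd = open("/sys/fs/cgroup/myjob", O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* ask the kernel to reclaim 10MiB from this memcg */
        err = posix_fadvise(fd, 0, 10 << 20, POSIX_FADV_PAGEOUT);
        if (err)
                fprintf(stderr, "posix_fadvise: %d\n", err);
        return 0;
}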

[1] https://lwn.net/Articles/753162/
[2] http://lkml.kernel.org/r/20200219181219.54356-1-han...@cmpxchg.org
[3] 
http://lkml.kernel.org/r/alpine.deb.2.22.394.2006281445210.855...@chino.kir.corp.google.com

The following patch is my attempt to implement option 2. Please ignore
the fine details as I am more interested in getting feedback on the
proposal and the interface options.

Signed-off-by: Shakeel Butt 
---
 fs/kernfs/dir.c | 20 +++
 include/linux/cgroup-defs.h |  2 ++
 include/linux/kernfs.h  |  2 ++
 include/uapi/linux/fadvise.h|  1 +
 kernel/cgroup/cgroup-internal.h |  2 ++
 kernel/cgroup/cgroup-v1.c   |  1 +
 kernel/cgroup/cgroup.c  | 43 +
 mm/memcontrol.c | 20 +++
 8 files changed, 91 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..96b3b67f3a85 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1698,9 +1698,29 @@ static int kernfs_fop_readdir(struct file *file, struct 
dir_context *ctx)
return 0;
 }
 
+static int kernfs_dir_fadvise(struct file *file, loff_t offset, loff_t len,
+ int advise)
+{
+   struct kernfs_node *kn  = kernfs_dentry_node(file->f_path.dentry);
+   struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
+   int ret;
+
+   if (!scops || !scops->fadvise)
+   return -EPERM;
+
+   if (!kernfs_get_active(kn))
+   return -ENODEV;
+
+   ret = scops->fadvise(kn, offset, len, advise);
+
+   kernfs_put_active(kn);
+   return ret;
+}
+
 const struct file_operations kernfs_dir_fops = {
.read   = generic_read_dir,
.iterate_shared = kernfs_fop_readdir,
.release= kernfs_dir_fop_release,
.llseek = generic_file_llseek,
+   .fadvise= kernfs_dir_fadvise,
 };
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 52661155f85f..cbe46634875e 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -628,6 +628,8 @@ struct cgroup_subsys {
void (*css_rstat_flush)(struct cgroup_subsys_state *css, int cpu);
  

Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-09 Thread Bartosz Golaszewski
wt., 9 cze 2020 o 11:43 Kent Gibson  napisał(a):
>
> On Tue, Jun 09, 2020 at 10:03:42AM +0200, Bartosz Golaszewski wrote:
> > sob., 6 cze 2020 o 03:56 Kent Gibson  napisał(a):
> > >
> >
> > [snip!]
> >
> > > >
> > > > I'd say yes - consolidation and reuse of data structures is always
> > > > good and normally they are going to be wrapped in some kind of
> > > > low-level user-space library anyway.
> > > >
> > >
> > > Ok, and I've changed the values field name to bitmap, along with the 
> > > change
> > > to a bitmap type, so the stuttering is gone.
> > >
> > > And, as the change to bitmap substantially reduced the size of
> > > gpioline_config, I now embed that in the gpioline_info instead of
> > > duplicating all the other fields.  The values field will be zeroed
> > > when returned within info.
> > >
> >
> > Could you post an example, I'm not sure I follow.
> >
>
> The gpioline_info_v2 now looks like this:
>
> /**
>  * struct gpioline_info_v2 - Information about a certain GPIO line
>  * @name: the name of this GPIO line, such as the output pin of the line on
>  * the chip, a rail or a pin header name on a board, as specified by the
>  * gpio chip, may be empty
>  * @consumer: a functional name for the consumer of this GPIO line as set
>  * by whatever is using it, will be empty if there is no current user but
>  * may also be empty if the consumer doesn't set this up
>  * @config: the configuration of the line.  Note that the values field is
>  * always zeroed.
>  * @offset: the local offset on this GPIO device, fill this in when
>  * requesting the line information from the kernel
>  * @padding: reserved for future use
>  */
> struct gpioline_info_v2 {
> char name[GPIO_MAX_NAME_SIZE];
> char consumer[GPIO_MAX_NAME_SIZE];
> struct gpioline_config config;
> __u32 offset;
> __u32 padding[GPIOLINE_INFO_V2_PAD_SIZE]; /* for future use */
> };
>
> Previously that had all the fields from config - other than the values.
>
> When that is populated the config.values will always be zeroed.
>

We'll probably abstract this away in libgpiod and your Go library but
for someone looking at the ABI it may be confusing because a zeroed
values array is still valid. I don't have a better idea though.

[snip!]

>
> > > > > And would it be useful for userspace to be able to influence the size 
> > > > > of
> > > > > the event buffer (currently fixed at 16 events per line)?
> > > > >
> > > >
> > > > Good question. I would prefer to not overdo it though. The event
> > > > request would need to contain the desired kfifo size and we'd only
> > > > allow to set it on request, right?
> > > >
> > >
> > > Yeah, it would only be relevant if edge detection was set and, as per
> > > edge detection itself, would only be settable via the request, not
> > > via set_config.  It would only be a suggestion, as the kfifo size gets
> > > rounded up to a power of 2 anyway.  It would be capped - I'm open to
> > > suggestions for a suitable max value.  And the 0 value would mean use
> > > the default - currently 16 per line.
> > >
> >
> > This sounds good. How about 512 for max value for now and we can
> > always increase it if needed. I don't think we should explicitly cap
> > it though - let the user specify any value and just silently limit it
> > to 512 in the kernel.
> >
>
> It will be an internal cap only - no error if the user requests more.
> I was thinking 1024, which corresponds to the maximum default - 16*64.
>

Yes, this sounds good too.

Bart


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-09 Thread Kent Gibson
On Tue, Jun 09, 2020 at 10:03:42AM +0200, Bartosz Golaszewski wrote:
> sob., 6 cze 2020 o 03:56 Kent Gibson  napisał(a):
> >
> 
> [snip!]
> 
> > >
> > > I'd say yes - consolidation and reuse of data structures is always
> > > good and normally they are going to be wrapped in some kind of
> > > low-level user-space library anyway.
> > >
> >
> > Ok, and I've changed the values field name to bitmap, along with the change
> > to a bitmap type, so the stuttering is gone.
> >
> > And, as the change to bitmap substantially reduced the size of
> > gpioline_config, I now embed that in the gpioline_info instead of
> > duplicating all the other fields.  The values field will be zeroed
> > when returned within info.
> >
> 
> Could you post an example, I'm not sure I follow.
> 

The gpioline_info_v2 now looks like this:

/**
 * struct gpioline_info_v2 - Information about a certain GPIO line
 * @name: the name of this GPIO line, such as the output pin of the line on
 * the chip, a rail or a pin header name on a board, as specified by the
 * gpio chip, may be empty
 * @consumer: a functional name for the consumer of this GPIO line as set
 * by whatever is using it, will be empty if there is no current user but
 * may also be empty if the consumer doesn't set this up
 * @config: the configuration of the line.  Note that the values field is
 * always zeroed.
 * @offset: the local offset on this GPIO device, fill this in when
 * requesting the line information from the kernel
 * @padding: reserved for future use
 */
struct gpioline_info_v2 {
char name[GPIO_MAX_NAME_SIZE];
char consumer[GPIO_MAX_NAME_SIZE];
struct gpioline_config config;
__u32 offset;
__u32 padding[GPIOLINE_INFO_V2_PAD_SIZE]; /* for future use */
};

Previously that had all the fields from config - other than the values.

When that is populated the config.values will always be zeroed.
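
For illustration, a sketch of how user space might query a line with this
structure, assuming a uapi header that carries the v2 draft above and an
ioctl named GPIO_GET_LINEINFO_V2_IOCTL; that name is not fixed anywhere in
this thread.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/gpio.h>         /* assumed to contain the v2 draft above */

int main(void)
{
        struct gpioline_info_v2 info;
        int fd = open("/dev/gpiochip0", O_RDONLY);

        if (fd < 0)
                return 1;
        memset(&info, 0, sizeof(info));
        info.offset = 3;        /* ask about line 3 */
        /* GPIO_GET_LINEINFO_V2_IOCTL is a hypothetical name */
        if (ioctl(fd, GPIO_GET_LINEINFO_V2_IOCTL, &info) < 0) {
                perror("ioctl");
                return 1;
        }
        printf("line %u: name=%s consumer=%s flags=0x%x\n",
               info.offset, info.name, info.consumer,
               (unsigned int)info.config.flags);
        return 0;
}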

[snip!]
> 
> > > >
> > > > I'm also kicking around the idea of adding sequence numbers to events,
> > > > one per line and one per handle, so userspace can more easily detect
> > > > mis-ordering or buffer overflows.  Does that make any sense?
> > > >
> > >
> > > Hmm, now that you mention it - and in the light of the recent post by
> > > Ryan Lovelett about polling precision - I think it makes sense to have
> > > this. Especially since it's very easy to add.
> > >
> >
> > OK.  I was only thinking about the edge events, but you might want it
> > for your line info events on the chip fd as well?
> >
> 
> I don't see the need for it now, but you never know. Let's leave it
> out for now and if we ever need it - we now have the appropriate
> padding.
> 

OK. It is a trivial change - I've already got the patch for it.
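
A sketch of what the sequence numbers could look like on the user-space
side; the struct layout and field names below are hypothetical and only
illustrate how a gap check would detect dropped events.

#include <linux/types.h>

/* Hypothetical v2 event layout with the proposed sequence numbers carved
 * out of the reserved padding; none of these names are settled here. */
struct gpioline_event_v2 {
        __u64 timestamp;        /* CLOCK_MONOTONIC, as discussed above */
        __u32 id;
        __u32 offset;
        __u32 seqno;            /* per-handle sequence number (proposed) */
        __u32 line_seqno;       /* per-line sequence number (proposed) */
};

/* A gap larger than one between consecutive events on the same handle
 * means the kernel's event kfifo overflowed and events were dropped. */
static inline int gpioline_events_lost(__u32 prev_seqno, __u32 seqno)
{
        return (__u32)(seqno - prev_seqno) > 1;
}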

> > > > And would it be useful for userspace to be able to influence the size of
> > > > the event buffer (currently fixed at 16 events per line)?
> > > >
> > >
> > > Good question. I would prefer to not overdo it though. The event
> > > request would need to contain the desired kfifo size and we'd only
> > > allow to set it on request, right?
> > >
> >
> > Yeah, it would only be relevant if edge detection was set and, as per
> > edge detection itself, would only be settable via the request, not
> > via set_config.  It would only be a suggestion, as the kfifo size gets
> > rounded up to a power of 2 anyway.  It would be capped - I'm open to
> > suggestions for a suitable max value.  And the 0 value would mean use
> > the default - currently 16 per line.
> >
> 
> This sounds good. How about 512 for max value for now and we can
> always increase it if needed. I don't think we should explicitly cap
> it though - let the user specify any value and just silently limit it
> to 512 in the kernel.
> 

It will be an internal cap only - no error if the user requests more.
I was thinking 1024, which corresponds to the maximum default - 16*64.

> > If you want the equivalent for the info watch then I'm not sure where to
> > hook it in.  It should be at the chip scope, and there isn't any
> > suitable ioctl to hook it into so it would need a new one - maybe a
> > set_config for the chip?  But the buffer size would only be settable up
> > until you add a watch.
> >
> 
> I don't think we need this. Status changes are naturally much less
> frequent and the potential for buffer overflow is minuscule here.
> 

Agreed.

Cheers,
Kent.


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-09 Thread Bartosz Golaszewski
sob., 6 cze 2020 o 03:56 Kent Gibson  napisał(a):
>

[snip!]

> >
> > I'd say yes - consolidation and reuse of data structures is always
> > good and normally they are going to be wrapped in some kind of
> > low-level user-space library anyway.
> >
>
> Ok, and I've changed the values field name to bitmap, along with the change
> to a bitmap type, so the stuttering is gone.
>
> And, as the change to bitmap substantially reduced the size of
> gpioline_config, I now embed that in the gpioline_info instead of
> duplicating all the other fields.  The values field will be zeroed
> when returned within info.
>

Could you post an example, I'm not sure I follow.

> > > And I've renamed "default_values" to just "values" in my latest draft
> > > which doesn't help with the stuttering.
> > >
> >
> > Why though? Aren't these always default values for output?
> >
>
> To me "default" implies a fallback value, and that de-emphasises the
> fact that the lines will be immediately set to those values as they
> are switched to outputs.
> These are the values the outputs will take - the "default" doesn't add
> anything.
>

Fair enough, values it is.

[snip!]

> > >
> > > I'm also kicking around the idea of adding sequence numbers to events,
> > > one per line and one per handle, so userspace can more easily detect
> > > mis-ordering or buffer overflows.  Does that make any sense?
> > >
> >
> > Hmm, now that you mention it - and in the light of the recent post by
> > Ryan Lovelett about polling precision - I think it makes sense to have
> > this. Especially since it's very easy to add.
> >
>
> OK.  I was only thinking about the edge events, but you might want it
> for your line info events on the chip fd as well?
>

I don't see the need for it now, but you never know. Let's leave it
out for now and if we ever need it - we now have the appropriate
padding.

> > > And would it be useful for userspace to be able to influence the size of
> > > the event buffer (currently fixed at 16 events per line)?
> > >
> >
> > Good question. I would prefer to not overdo it though. The event
> > request would need to contain the desired kfifo size and we'd only
> > allow to set it on request, right?
> >
>
> Yeah, it would only be relevant if edge detection was set and, as per
> edge detection itself, would only be settable via the request, not
> via set_config.  It would only be a suggestion, as the kfifo size gets
> rounded up to a power of 2 anyway.  It would be capped - I'm open to
> suggestions for a suitable max value.  And the 0 value would mean use
> the default - currently 16 per line.
>

This sounds good. How about 512 for max value for now and we can
always increase it if needed. I don't think we should explicitly cap
it though - let the user specify any value and just silently limit it
to 512 in the kernel.

> If you want the equivalent for the info watch then I'm not sure where to
> hook it in.  It should be at the chip scope, and there isn't any
> suitable ioctl to hook it into so it would need a new one - maybe a
> set_config for the chip?  But the buffer size would only be settable up
> until you add a watch.
>

I don't think we need this. Status changes are naturally much less
frequent and the potential for buffer overflow is minuscule here.

Bart


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-05 Thread Kent Gibson
On Fri, Jun 05, 2020 at 11:53:05AM +0200, Bartosz Golaszewski wrote:
> czw., 4 cze 2020 o 18:00 Kent Gibson  napisał(a):
> >
> 
> [snip!]
> 
> > > > +
> > > > +enum gpioline_edge {
> > > > +   GPIOLINE_EDGE_NONE  = 0,
> > > > +   GPIOLINE_EDGE_RISING= 1,
> > > > +   GPIOLINE_EDGE_FALLING   = 2,
> > > > +   GPIOLINE_EDGE_BOTH  = 3,
> > > > +};
> > >
> > > I would skip the names of the enum types if we're not reusing them 
> > > anywhere.
> > >
> >
> > I thought it was useful to name them even if it was just to be able to
> > reference them in the documentation for relevant fields, such as that in
> > struct gpioline_config below, rather than having to either list all
> > possible values or a GPIOLINE_EDGE_* glob.
> >
> > And I'm currently using enum gpioline_edge in my edge detector
> > implementation - is that sufficient?
> >
> 
> The documentation argument is more convincing. :)
> 

I know - but your criterion was reuse... ;-).

> > > > +
> > > > +/* Line flags - V2 */
> > > > +#define GPIOLINE_FLAG_V2_KERNEL(1UL << 0) /* Line used 
> > > > by the kernel */
> > >
> > > In v1 this flag is also set if the line is used by user-space. Maybe a
> > > simple GPIOLINE_FLAG_V2_USED would be better?
> > >
> >
> > Agreed - the _KERNEL name is confusing.
> > In my latest draft I've already renamed it GPIOLINE_FLAG_V2_BUSY,
> > as EBUSY is what the ioctl returns when you try to request such a line.
> > Does that work for you?
> > I was also considering _IN_USE, and was using _UNAVAILABLE for a while.
> >
> 
> BUSY sounds less precise to me than USED or IN_USE of which both are
> fine (with a preference for the former).
>

OK, USED it shall be.

> [snip!]
> 
> > > > +
> > > > +/**
> > > > + * struct gpioline_values - Values of GPIO lines
> > > > + * @values: when getting the state of lines this contains the current
> > > > + * state of a line, when setting the state of lines these should 
> > > > contain
> > > > + * the desired target state
> > > > + */
> > > > +struct gpioline_values {
> > > > +   __u8 values[GPIOLINES_MAX];
> > >
> > > Same here for bitfield. And maybe reuse this structure in
> > > gpioline_config for default values?
> > >
> >
> > Can do.  What makes me reticent is the extra level of indirection
> > and the stuttering that would cause when referencing them.
> > e.g. config.default_values.values
> > So not sure the gain is worth the pain.
> >
> 
> I'd say yes - consolidation and reuse of data structures is always
> good and normally they are going to be wrapped in some kind of
> low-level user-space library anyway.
> 

Ok, and I've changed the values field name to bitmap, along with the change
to a bitmap type, so the stuttering is gone.

And, as the change to bitmap substantially reduced the size of
gpioline_config, I now embed that in the gpioline_info instead of
duplicating all the other fields.  The values field will be zeroed
when returned within info.

> > And I've renamed "default_values" to just "values" in my latest draft
> > which doesn't help with the stuttering.
> >
> 
> Why though? Aren't these always default values for output?
> 

To me "default" implies a fallback value, and that de-emphasises the
fact that the lines will be immediately set to those values as they
are switched to outputs.
These are the values the outputs will take - the "default" doesn't add
anything.

> [snip!]
> 
> > > > +
> > > > +/**
> > > > + * struct gpioline_event - The actual event being pushed to userspace
> > > > + * @timestamp: best estimate of time of event occurrence, in 
> > > > nanoseconds
> > > > + * @id: event identifier with value from enum gpioline_event_id
> > > > + * @offset: the offset of the line that triggered the event
> > > > + * @padding: reserved for future use
> > > > + */
> > > > +struct gpioline_event {
> > > > +   __u64 timestamp;
> > >
> > > I'd specify in the comment the type of clock used for the timestamp.
> > >
> >
> > Agreed - as this one will be guaranteed to be CLOCK_MONOTONIC.
> >
> > I'm also kicking around the idea of adding sequence numbers to events,
> > one per line and one per handle, so userspace can more easily detect
> > mis-ordering or buffer overflows.  Does that make any sense?
> >
> 
> Hmm, now that you mention it - and in the light of the recent post by
> Ryan Lovelett about polling precision - I think it makes sense to have
> this. Especially since it's very easy to add.
> 

OK.  I was only thinking about the edge events, but you might want it
for your line info events on the chip fd as well?

> > And would it be useful for userspace to be able to influence the size of
> > the event buffer (currently fixed at 16 events per line)?
> >
> 
> Good question. I would prefer to not overdo it though. The event
> request would need to contain the desired kfifo size and we'd only
> allow to set it on request, right?
>

Yeah, it would only be relevant if edge detection was set and, as per
edge 

Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-05 Thread Bartosz Golaszewski
czw., 4 cze 2020 o 18:00 Kent Gibson  napisał(a):
>

[snip!]

> > > +
> > > +enum gpioline_edge {
> > > +   GPIOLINE_EDGE_NONE  = 0,
> > > +   GPIOLINE_EDGE_RISING= 1,
> > > +   GPIOLINE_EDGE_FALLING   = 2,
> > > +   GPIOLINE_EDGE_BOTH  = 3,
> > > +};
> >
> > I would skip the names of the enum types if we're not reusing them anywhere.
> >
>
> I thought it was useful to name them even if it was just to be able to
> reference them in the documentation for relevant fields, such as that in
> struct gpioline_config below, rather than having to either list all
> possible values or a GPIOLINE_EDGE_* glob.
>
> And I'm currently using enum gpioline_edge in my edge detector
> implementation - is that sufficient?
>

The documentation argument is more convincing. :)

> > > +
> > > +/* Line flags - V2 */
> > > +#define GPIOLINE_FLAG_V2_KERNEL(1UL << 0) /* Line used 
> > > by the kernel */
> >
> > In v1 this flag is also set if the line is used by user-space. Maybe a
> > simple GPIOLINE_FLAG_V2_USED would be better?
> >
>
> Agreed - the _KERNEL name is confusing.
> In my latest draft I've already renamed it GPIOLINE_FLAG_V2_BUSY,
> as EBUSY is what the ioctl returns when you try to request such a line.
> Does that work for you?
> I was also considering _IN_USE, and was using _UNAVAILABLE for a while.
>

BUSY sounds less precise to me than USED or IN_USE of which both are
fine (with a preference for the former).

[snip!]

> > > +
> > > +/**
> > > + * struct gpioline_values - Values of GPIO lines
> > > + * @values: when getting the state of lines this contains the current
> > > + * state of a line, when setting the state of lines these should contain
> > > + * the desired target state
> > > + */
> > > +struct gpioline_values {
> > > +   __u8 values[GPIOLINES_MAX];
> >
> > Same here for bitfield. And maybe reuse this structure in
> > gpioline_config for default values?
> >
>
> Can do.  What makes me reticent is the extra level of indirection
> and the stuttering that would cause when referencing them.
> e.g. config.default_values.values
> So not sure the gain is worth the pain.
>

I'd say yes - consolidation and reuse of data structures is always
good and normally they are going to be wrapped in some kind of
low-level user-space library anyway.

> And I've renamed "default_values" to just "values" in my latest draft
> which doesn't help with the stuttering.
>

Why though? Aren't these always default values for output?

[snip!]

> > > +
> > > +/**
> > > + * struct gpioline_event - The actual event being pushed to userspace
> > > + * @timestamp: best estimate of time of event occurrence, in nanoseconds
> > > + * @id: event identifier with value from enum gpioline_event_id
> > > + * @offset: the offset of the line that triggered the event
> > > + * @padding: reserved for future use
> > > + */
> > > +struct gpioline_event {
> > > +   __u64 timestamp;
> >
> > I'd specify in the comment the type of clock used for the timestamp.
> >
>
> Agreed - as this one will be guaranteed to be CLOCK_MONOTONIC.
>
> I'm also kicking around the idea of adding sequence numbers to events,
> one per line and one per handle, so userspace can more easily detect
> mis-ordering or buffer overflows.  Does that make any sense?
>

Hmm, now that you mention it - and in the light of the recent post by
Ryan Lovelett about polling precision - I think it makes sense to have
this. Especially since it's very easy to add.

> And would it be useful for userspace to be able to influence the size of
> the event buffer (currently fixed at 16 events per line)?
>

Good question. I would prefer to not overdo it though. The event
request would need to contain the desired kfifo size and we'd only
allow to set it on request, right?

[snip!]

Bartosz


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-04 Thread Kent Gibson
On Thu, Jun 04, 2020 at 02:43:08PM +0200, Bartosz Golaszewski wrote:
> sob., 16 maj 2020 o 08:45 Kent Gibson  napisał(a):
> >
> > Add a new version of the uAPI to address existing 32/64bit alignment
> > issues, add support for debounce, and provide some future proofing by
> > adding padding reserved for future use.
> >
> > Signed-off-by: Kent Gibson 
> >
> 
> I'm a bit late to the party but here's my review.
> 
> >
> >  include/uapi/linux/gpio.h | 204 --
> >  1 file changed, 197 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/uapi/linux/gpio.h b/include/uapi/linux/gpio.h
> > index 0206383c0383..3db7e0bc1312 100644
> > --- a/include/uapi/linux/gpio.h
> > +++ b/include/uapi/linux/gpio.h
> > @@ -14,6 +14,9 @@
> >  #include 
> >  #include 
> >
> > +/* The maximum size of name and label arrays */
> > +#define GPIO_MAX_NAME_SIZE 32
> > +
> >  /**
> >   * struct gpiochip_info - Information about a certain GPIO chip
> >   * @name: the Linux kernel name of this GPIO chip
> > @@ -27,6 +30,184 @@ struct gpiochip_info {
> > __u32 lines;
> >  };
> >
> > +/* Maximum number of requested lines */
> > +#define GPIOLINES_MAX 64
> > +
> > +enum gpioline_direction {
> > +   GPIOLINE_DIRECTION_INPUT= 1,
> > +   GPIOLINE_DIRECTION_OUTPUT   = 2,
> > +};
> > +
> > +enum gpioline_drive {
> > +   GPIOLINE_DRIVE_PUSH_PULL= 0,
> > +   GPIOLINE_DRIVE_OPEN_DRAIN   = 1,
> > +   GPIOLINE_DRIVE_OPEN_SOURCE  = 2,
> > +};
> > +
> > +enum gpioline_bias {
> > +   GPIOLINE_BIAS_DISABLE   = 0,
> > +   GPIOLINE_BIAS_PULL_UP   = 1,
> > +   GPIOLINE_BIAS_PULL_DOWN = 2,
> > +};
> > +
> > +enum gpioline_edge {
> > +   GPIOLINE_EDGE_NONE  = 0,
> > +   GPIOLINE_EDGE_RISING= 1,
> > +   GPIOLINE_EDGE_FALLING   = 2,
> > +   GPIOLINE_EDGE_BOTH  = 3,
> > +};
> 
> I would skip the names of the enum types if we're not reusing them anywhere.
> 

I thought it was useful to name them even if it was just to be able to
reference them in the documentation for relevant fields, such as that in
struct gpioline_config below, rather than having to either list all
possible values or a GPIOLINE_EDGE_* glob.

And I'm currently using enum gpioline_edge in my edge detector
implementation - is that sufficient?

> > +
> > +/* Line flags - V2 */
> > +#define GPIOLINE_FLAG_V2_KERNEL(1UL << 0) /* Line used by 
> > the kernel */
> 
> In v1 this flag is also set if the line is used by user-space. Maybe a
> simple GPIOLINE_FLAG_V2_USED would be better?
> 

Agreed - the _KERNEL name is confusing.
In my latest draft I've already renamed it GPIOLINE_FLAG_V2_BUSY,
as EBUSY is what the ioctl returns when you try to request such a line.
Does that work for you?
I was also considering _IN_USE, and was using _UNAVAILABLE for a while.

> > +#define GPIOLINE_FLAG_V2_ACTIVE_LOW(1UL << 1)
> > +#define GPIOLINE_FLAG_V2_DIRECTION (1UL << 2)
> > +#define GPIOLINE_FLAG_V2_DRIVE (1UL << 3)
> > +#define GPIOLINE_FLAG_V2_BIAS  (1UL << 4)
> > +#define GPIOLINE_FLAG_V2_EDGE_DETECTION(1UL << 5)
> > +#define GPIOLINE_FLAG_V2_DEBOUNCE  (1UL << 6)
> > +
> > +/**
> > + * struct gpioline_config - Configuration for GPIO lines
> > + * @default_values: if the direction is GPIOLINE_DIRECTION_OUTPUT, this
> > + * specifies the default output value, should be 0 (low) or 1 (high),
> > + * anything else than 0 or 1 will be interpreted as 1 (high)
> > + * @flags: flags for the GPIO lines, such as GPIOLINE_FLAG_V2_ACTIVE_LOW,
> > + * GPIOLINE_FLAG_V2_DIRECTION etc, OR:ed together
> > + * @direction: if GPIOLINE_FLAG_V2_DIRECTION is set in flags, the desired
> > + * direction for the requested lines, with a value from enum
> > + * gpioline_direction
> > + * @drive: if GPIOLINE_FLAG_V2_DRIVE is set in flags, the desired drive for
> > + * the requested lines, with a value from enum gpioline_drive
> > + * @bias: if GPIOLINE_FLAG_V2_BIAS is set in flags, the desired bias for
> > + * the requested lines, with a value from enum gpioline_bias
> > + * @edge_detection: if GPIOLINE_FLAG_V2_EDGE_DETECTION is set in flags, the
> > + * desired edge_detection for the requested lines, with a value from enum
> > + * gpioline_edge
> > + * @debounce: if GPIOLINE_FLAG_V2_DEBOUNCE is set in flags, the desired
> > + * debounce period for the requested lines, in microseconds
> > + * @padding: reserved for future use and should be zero filled
> > + */
> > +struct gpioline_config {
> > +   __u8 default_values[GPIOLINES_MAX];
> 
> As I said elsewhere - bitfield is fine here for me: for instance a single u64.
> 
> > +   __u32 flags;
> > +   __u8 direction;
> > +   __u8 drive;
> > +   __u8 bias;
> > +   __u8 edge_detection;
> > +   __u32 debounce;
> 
> Maybe debounce_time for clarity?
> 
> > +   __u32 padding[7]; /* for future use */
> > +};
> > +
> > +/**
> > + * stru

Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-04 Thread Kent Gibson
On Thu, Jun 04, 2020 at 02:06:31PM +0200, Bartosz Golaszewski wrote:
> śr., 27 maj 2020 o 07:58 Linus Walleij  napisał(a):
> >
> > On Mon, May 25, 2020 at 4:19 PM Kent Gibson  wrote:
> >
> > > > > +struct gpioline_config {
> > > > > +   __u8 default_values[GPIOLINES_MAX];
> > > >
> > > > So 32 bytes
> > > >
> > >
> > > Actually that one is 64 bytes, which is the same as v1, i.e. GPIOLINES_MAX
> > > is the same as GPIOHANDLES_MAX - just renamed.
> > >
> > > On the subject of values, is there any reason to use a byte for each
> > > line value rather than a bit?
> >
> > Not really, other than making things simple for userspace.
> >
> 
> I'm in favor of using bits here. I think we can rely on libgpiod to
> make things simple for user-space, the kernel interface can be as
> brief as possible.
> 

OK, I'll take another look at it.  If changed to a bitmap it will have
to be sized as a multiple of 64 bits to maintain alignment.  Other than
that it should be pretty straightforward.
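
For illustration, a sketch of a values bitmap sized in whole 64-bit words,
as discussed: with GPIOLINES_MAX = 64 a single __u64 suffices. The struct
name and helpers below are hypothetical and only show how user space would
set and test individual line values.

#include <linux/types.h>

#define GPIOLINES_MAX 64

struct gpioline_values_v2 {                     /* hypothetical layout */
        __u64 bits[(GPIOLINES_MAX + 63) / 64];  /* multiple of 64 bits */
};

static inline void gpioline_set_value(struct gpioline_values_v2 *v,
                                      unsigned int line, int value)
{
        if (value)
                v->bits[line / 64] |= 1ULL << (line % 64);
        else
                v->bits[line / 64] &= ~(1ULL << (line % 64));
}

static inline int gpioline_get_value(const struct gpioline_values_v2 *v,
                                     unsigned int line)
{
        return (v->bits[line / 64] >> (line % 64)) & 1;
}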

Cheers,
Kent.


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-04 Thread Bartosz Golaszewski
sob., 16 maj 2020 o 08:45 Kent Gibson  napisał(a):
>
> Add a new version of the uAPI to address existing 32/64bit alignment
> issues, add support for debounce, and provide some future proofing by
> adding padding reserved for future use.
>
> Signed-off-by: Kent Gibson 
>

I'm a bit late to the party but here's my review.

>
>  include/uapi/linux/gpio.h | 204 --
>  1 file changed, 197 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/gpio.h b/include/uapi/linux/gpio.h
> index 0206383c0383..3db7e0bc1312 100644
> --- a/include/uapi/linux/gpio.h
> +++ b/include/uapi/linux/gpio.h
> @@ -14,6 +14,9 @@
>  #include 
>  #include 
>
> +/* The maximum size of name and label arrays */
> +#define GPIO_MAX_NAME_SIZE 32
> +
>  /**
>   * struct gpiochip_info - Information about a certain GPIO chip
>   * @name: the Linux kernel name of this GPIO chip
> @@ -27,6 +30,184 @@ struct gpiochip_info {
> __u32 lines;
>  };
>
> +/* Maximum number of requested lines */
> +#define GPIOLINES_MAX 64
> +
> +enum gpioline_direction {
> +   GPIOLINE_DIRECTION_INPUT= 1,
> +   GPIOLINE_DIRECTION_OUTPUT   = 2,
> +};
> +
> +enum gpioline_drive {
> +   GPIOLINE_DRIVE_PUSH_PULL= 0,
> +   GPIOLINE_DRIVE_OPEN_DRAIN   = 1,
> +   GPIOLINE_DRIVE_OPEN_SOURCE  = 2,
> +};
> +
> +enum gpioline_bias {
> +   GPIOLINE_BIAS_DISABLE   = 0,
> +   GPIOLINE_BIAS_PULL_UP   = 1,
> +   GPIOLINE_BIAS_PULL_DOWN = 2,
> +};
> +
> +enum gpioline_edge {
> +   GPIOLINE_EDGE_NONE  = 0,
> +   GPIOLINE_EDGE_RISING= 1,
> +   GPIOLINE_EDGE_FALLING   = 2,
> +   GPIOLINE_EDGE_BOTH  = 3,
> +};

I would skip the names of the enum types if we're not reusing them anywhere.

> +
> +/* Line flags - V2 */
> +#define GPIOLINE_FLAG_V2_KERNEL(1UL << 0) /* Line used by 
> the kernel */

In v1 this flag is also set if the line is used by user-space. Maybe a
simple GPIOLINE_FLAG_V2_USED would be better?

> +#define GPIOLINE_FLAG_V2_ACTIVE_LOW(1UL << 1)
> +#define GPIOLINE_FLAG_V2_DIRECTION (1UL << 2)
> +#define GPIOLINE_FLAG_V2_DRIVE (1UL << 3)
> +#define GPIOLINE_FLAG_V2_BIAS  (1UL << 4)
> +#define GPIOLINE_FLAG_V2_EDGE_DETECTION(1UL << 5)
> +#define GPIOLINE_FLAG_V2_DEBOUNCE  (1UL << 6)
> +
> +/**
> + * struct gpioline_config - Configuration for GPIO lines
> + * @default_values: if the direction is GPIOLINE_DIRECTION_OUTPUT, this
> + * specifies the default output value, should be 0 (low) or 1 (high),
> + * anything else than 0 or 1 will be interpreted as 1 (high)
> + * @flags: flags for the GPIO lines, such as GPIOLINE_FLAG_V2_ACTIVE_LOW,
> + * GPIOLINE_FLAG_V2_DIRECTION etc, OR:ed together
> + * @direction: if GPIOLINE_FLAG_V2_DIRECTION is set in flags, the desired
> + * direction for the requested lines, with a value from enum
> + * gpioline_direction
> + * @drive: if GPIOLINE_FLAG_V2_DRIVE is set in flags, the desired drive for
> + * the requested lines, with a value from enum gpioline_drive
> + * @bias: if GPIOLINE_FLAG_V2_BIAS is set in flags, the desired bias for
> + * the requested lines, with a value from enum gpioline_bias
> + * @edge_detection: if GPIOLINE_FLAG_V2_EDGE_DETECTION is set in flags, the
> + * desired edge_detection for the requested lines, with a value from enum
> + * gpioline_edge
> + * @debounce: if GPIOLINE_FLAG_V2_DEBOUNCE is set in flags, the desired
> + * debounce period for the requested lines, in microseconds
> + * @padding: reserved for future use and should be zero filled
> + */
> +struct gpioline_config {
> +   __u8 default_values[GPIOLINES_MAX];

As I said elsewhere - bitfield is fine here for me: for instance a single u64.

> +   __u32 flags;
> +   __u8 direction;
> +   __u8 drive;
> +   __u8 bias;
> +   __u8 edge_detection;
> +   __u32 debounce;

Maybe debounce_time for clarity?

> +   __u32 padding[7]; /* for future use */
> +};
> +
> +/**
> + * struct gpioline_request - Information about a request for GPIO lines
> + * @offsets: an array of desired lines, specified by offset index for the
> + * associated GPIO device
> + * @consumer: a desired consumer label for the selected GPIO lines such
> + * as "my-bitbanged-relay"
> + * @config: Requested configuration for the requested lines. Note that
> + * even if multiple lines are requested, the same configuration must be
> + * applicable to all of them. If you want lines with individual
> + * configuration, request them one by one. It is possible to select a
> + * batch of input or output lines, but they must all have the same
> + * configuration, i.e. all inputs or all outputs, all active low etc
> + * @lines: number of lines requested in this request, i.e. the number of
> + * valid fields in the GPIOLINES_MAX sized arrays, set to 1 to request a
> + * single line
> + * @padding: reserved for future use and should be zero f

Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-06-04 Thread Bartosz Golaszewski
śr., 27 maj 2020 o 07:58 Linus Walleij  napisał(a):
>
> On Mon, May 25, 2020 at 4:19 PM Kent Gibson  wrote:
>
> > > > +struct gpioline_config {
> > > > +   __u8 default_values[GPIOLINES_MAX];
> > >
> > > So 32 bytes
> > >
> >
> > Actually that one is 64 bytes, which is the same as v1, i.e. GPIOLINES_MAX
> > is the same as GPIOHANDLES_MAX - just renamed.
> >
> > On the subject of values, is there any reason to use a byte for each
> > line value rather than a bit?
>
> Not really, other than making things simple for userspace.
>

I'm in favor of using bits here. I think we can rely on libgpiod to
make things simple for user-space, so the kernel interface can be as
brief as possible.

Bart


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-05-26 Thread Linus Walleij
On Mon, May 25, 2020 at 4:19 PM Kent Gibson  wrote:

> > > +struct gpioline_config {
> > > +   __u8 default_values[GPIOLINES_MAX];
> >
> > So 32 bytes
> >
>
> Actually that one is 64 bytes, which is the same as v1, i.e. GPIOLINES_MAX
> is the same as GPIOHANDLES_MAX - just renamed.
>
> On the subject of values, is there any reason to use a byte for each
> line value rather than a bit?

Not really, other than making things simple for userspace.

> when adding future fields, the idea was to have a bit
> in the flags that indicates that the corresponding field is now valid.
> If the flag is not set then whatever value is there is irrelevant.

You would need to document that idea, say in the kerneldoc,
else when someone else comes along to do this they will
get it wrong.

> But definitely better to play it safe - will check the padding is
> zeroed as well, as well as any field for which the flag bit is clear.

Yeah, better like that. You can write a comment in the code too,
such as "when adding new parameters, update this validation code
to accept it".
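
A rough kernel-side sketch of that validation, using the draft
struct gpioline_config and the GPIOLINE_FLAG_V2_* macros from earlier in
the thread; nothing here is final.

/* Kernel-side sketch; assumes the draft struct and flag macros above. */
static int gpioline_config_validate(const struct gpioline_config *config)
{
        size_t i;
        const __u32 known_flags = GPIOLINE_FLAG_V2_ACTIVE_LOW |
                                  GPIOLINE_FLAG_V2_DIRECTION |
                                  GPIOLINE_FLAG_V2_DRIVE |
                                  GPIOLINE_FLAG_V2_BIAS |
                                  GPIOLINE_FLAG_V2_EDGE_DETECTION |
                                  GPIOLINE_FLAG_V2_DEBOUNCE;

        /* when adding new parameters, update this validation code */
        if (config->flags & ~known_flags)
                return -EINVAL;

        /* a field whose enable flag is clear must be left zeroed */
        if (!(config->flags & GPIOLINE_FLAG_V2_DEBOUNCE) && config->debounce)
                return -EINVAL;

        /* reserved padding must be zeroed so it can gain meaning later */
        for (i = 0; i < sizeof(config->padding) / sizeof(config->padding[0]); i++)
                if (config->padding[i])
                        return -EINVAL;

        return 0;
}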

> Back on retired ioctls, I notice that 5, 6, and 7 are missing from gpio.
> Have those been retired, or just skipped over by accident?

Just thought it was nice to jump to 8 for line info.
They should be used when adding generic chip information ioctls().

Yours,
Linus Walleij


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-05-26 Thread Kent Gibson
On Mon, May 25, 2020 at 06:24:04PM +0200, Bartosz Golaszewski wrote:
> sob., 16 maj 2020 o 08:45 Kent Gibson  napisał(a):
> >
> > Add a new version of the uAPI to address existing 32/64bit alignment
> > issues, add support for debounce, and provide some future proofing by
> > adding padding reserved for future use.
> >
> > Signed-off-by: Kent Gibson 
> >
> > ---
> >
> > This patch is a proposal to replace the majority of the uAPI, so some
> > background and justification is in order.
> >
> > The alignment issue relates to the gpioevent_data, which packs to different
> > sizes on 32bit and 64bit platforms. That creates problems for 32bit apps
> > running on 64bit kernels.  The patch addresses that particular issue, and
> > the problem more generally, by adding pad fields that explicitly pad
> > structs out to 64bit boundaries, so they will pack to the same size now,
> > and even if some of the reserved padding is used for __u64 fields in the
> > future.
> >
> > The lack of future proofing in v1 makes it impossible to, for example,
> > add the debounce feature that is included in v2.
> > The future proofing is addressed by providing reserved padding in all
> > structs for future features.  Specifically, the line request,
> > config and info structs get updated versions and ioctls.
> >
> > I haven't added any padding to gpiochip_info, as I haven't seen any calls
> > for new features for the corresponding ioctl, but I'm open to updating that
> > as well.
> >
> > As the majority of the structs and ioctls were being replaced, it seemed
> > opportune to rework some of the other aspects of the uAPI.
> >
> > Firstly, I've reworked the flags field throughout.  v1 has three different
> > flags fields, each with their own separate bit definitions.  In v2 that is
> > collapsed to one.  Further, the bits of the v2 flags field are used
> > as feature enable flags, with any other necessary configuration fields 
> > encoded
> > separately.  This is simpler and clearer, while also providing a foundation
> > for adding features in the future.
> >
> > I've also merged the handle and event requests into a single request, the
> > line request, as the two requests were mostly the same, other than the
> > edge detection provided by event requests.  As a byproduct, the v2 uAPI
> > allows for multiple lines producing edge events on the same line handle.
> > This is a new capability as v1 only supports a single line in an event 
> > request.
> >
> > This means there are now only two types of file handle to be concerned with,
> > the chip and the line, and it is clearer which ioctls apply to which type
> > of handle.
> >
> > There is also some minor renaming of fields for consistency compared to 
> > their
> > v1 counterparts, e.g. offset rather than lineoffset or line_offset, and
> > consumer rather than consumer_label.
> >
> > And v1 GPIOHANDLES_MAX and gpiohandle_data become GPIOLINES_MAX and
> > gpioline_values for v2 - the only change being the renaming for clarity.
> >
> > The v2 uAPI is mostly just a reorganisation of v1, so userspace code,
> > particularly libgpiod, should easily port to it.
> >
> > This patch is obviously only one patch in a much bigger series that
> > will actually implement it, but I would appreciate a review and any 
> > feedback,
> > as it is foundational to the rest of that series.
> >
> > Thanks,
> > Kent.
> >
> 
> Hi Kent,
> 
> Thanks for posting this. I like the general direction a lot. I'll
> review this in detail later this week.
> 
> Seeing the speed at which you make progress I think I won't be
> implementing support for the v1 of the watch ioctl() in libgpiod after
> all. Once the v2 is live I will probably bump the API version in
> libgpiod to v2.0.0 and make some non-compatible changes anyway.
> 

Yeah, libgpiod has similar problems to the v1 uAPI - there is no
easy way to extend it without causing breakage.

I've got a patch that ports libgpiod to the v2 uAPI, though it only goes
as far as v1 parity.  That is passing all existing tests.  The patch
doesn't address debounce as that will require libgpiod API changes.  Nor
does it make use of the bulk event capability, as that would change
behaviour and break some of the tests - the wait_multiple test was the
sticking point there.  Those are probably best combined with the v2.0.0
changes you are planning.

I've also got a couple of patches for minor things that I noticed while I
was in there.  I'll post them shortly.

Cheers,
Kent.


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-05-26 Thread Kent Gibson
On Tue, May 26, 2020 at 11:04:25AM +0300, Andy Shevchenko wrote:
> +Cc: Ville
> 
> Ville, this is a v2 of the GPIO ABI we discussed with some time ago.
> If you have time to briefly look at it and perhaps comment if it's
> right way to go.
> 
> On Sat, May 16, 2020 at 9:50 AM Kent Gibson  wrote:
> >
> > Add a new version of the uAPI to address existing 32/64bit alignment
> > issues, add support for debounce, and provide some future proofing by
> > adding padding reserved for future use.
> >
> > Signed-off-by: Kent Gibson 
> >
> > ---
> >
> > This patch is a proposal to replace the majority of the uAPI, so some
> > background and justification is in order.
> >
> > The alignment issue relates to the gpioevent_data, which packs to different
> > sizes on 32bit and 64bit platforms. That creates problems for 32bit apps
> > running on 64bit kernels.  The patch addresses that particular issue, and
> > the problem more generally, by adding pad fields that explicitly pad
> > structs out to 64bit boundaries, so they will pack to the same size now,
> > and will continue to do so even if some of the reserved padding is used for
> > __u64 fields in the future.
> >
> > The lack of future proofing in v1 makes it impossible to, for example,
> > add the debounce feature that is included in v2.
> > The future proofing is addressed by providing reserved padding in all
> > structs for future features.  Specifically, the line request,
> > config and info structs get updated versions and ioctls.
> >
> > I haven't added any padding to gpiochip_info, as I haven't seen any calls
> > for new features for the corresponding ioctl, but I'm open to updating that
> > as well.
> >
> > As the majority of the structs and ioctls were being replaced, it seemed
> > opportune to rework some of the other aspects of the uAPI.
> >
> > Firstly, I've reworked the flags field throughout.  v1 has three different
> > flags fields, each with their own separate bit definitions.  In v2 that is
> > collapsed to one.  Further, the bits of the v2 flags field are used
> > as feature enable flags, with any other necessary configuration fields 
> > encoded
> > separately.  This is simpler and clearer, while also providing a foundation
> > for adding features in the future.
> >
> > I've also merged the handle and event requests into a single request, the
> > line request, as the two requests were mostly the same, other than the
> > edge detection provided by event requests.  As a byproduct, the v2 uAPI
> > allows for multiple lines producing edge events on the same line handle.
> > This is a new capability as v1 only supports a single line in an event 
> > request.
> >
> > This means there are now only two types of file handle to be concerned with,
> > the chip and the line, and it is clearer which ioctls apply to which type
> > of handle.
> >
> > There is also some minor renaming of fields for consistency compared to 
> > their
> > v1 counterparts, e.g. offset rather than lineoffset or line_offset, and
> > consumer rather than consumer_label.
> 
> > And v1 GPIOHANDLES_MAX and gpiohandle_data become GPIOLINES_MAX and
> > gpioline_values for v2 - the only change being the renaming for clarity.
> 
> Hmm... I think it makes sense if you fully replace uAPI, otherwise in
> my opinion it adds to the confusion.
> My point is that we may try to be less invasive, perhaps?
> 

I had an earlier draft that did just that, but I found that having the
"handle" names still floating around was confusing - particularly
considering what the header will eventually look like once v1 is
removed.  I'm also assuming the v1 fields will get documentation
updates should we proceed with v2 and deprecate v1.

> > The v2 uAPI is mostly just a reorganisation of v1, so userspace code,
> > particularly libgpiod, should easily port to it.
> >
> > This patch is obviously only one patch in a much bigger series that
> > will actually implement it, but I would appreciate a review and any 
> > feedback,
> > as it is foundational to the rest of that series.
> >
> > Thanks,
> > Kent.
> >
> >  include/uapi/linux/gpio.h | 204 --
> >  1 file changed, 197 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/uapi/linux/gpio.h b/include/uapi/linux/gpio.h
> > index 0206383c0383..3db7e0bc1312 100644
> > --- a/include/uapi/linux/gpio.h
> > +++ b/include/uapi/linux/gpio.h
> > @@ -14,6 +14,9 @@
> >  #include 
> >  #include 
> 

Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-05-26 Thread Andy Shevchenko
+Cc: Ville

Ville, this is a v2 of the GPIO ABI we discussed some time ago.
If you have time to briefly look at it and perhaps comment on whether it's
the right way to go.

On Sat, May 16, 2020 at 9:50 AM Kent Gibson  wrote:
>
> Add a new version of the uAPI to address existing 32/64bit alignment
> issues, add support for debounce, and provide some future proofing by
> adding padding reserved for future use.
>
> Signed-off-by: Kent Gibson 
>
> ---
>
> This patch is a proposal to replace the majority of the uAPI, so some
> background and justification is in order.
>
> The alignment issue relates to the gpioevent_data, which packs to different
> sizes on 32bit and 64bit platforms. That creates problems for 32bit apps
> running on 64bit kernels.  The patch addresses that particular issue, and
> the problem more generally, by adding pad fields that explicitly pad
> structs out to 64bit boundaries, so they will pack to the same size now,
> and will continue to do so even if some of the reserved padding is used for
> __u64 fields in the future.
>
> The lack of future proofing in v1 makes it impossible to, for example,
> add the debounce feature that is included in v2.
> The future proofing is addressed by providing reserved padding in all
> structs for future features.  Specifically, the line request,
> config and info structs get updated versions and ioctls.
>
> I haven't added any padding to gpiochip_info, as I haven't seen any calls
> for new features for the corresponding ioctl, but I'm open to updating that
> as well.
>
> As the majority of the structs and ioctls were being replaced, it seemed
> opportune to rework some of the other aspects of the uAPI.
>
> Firstly, I've reworked the flags field throughout.  v1 has three different
> flags fields, each with their own separate bit definitions.  In v2 that is
> collapsed to one.  Further, the bits of the v2 flags field are used
> as feature enable flags, with any other necessary configuration fields encoded
> separately.  This is simpler and clearer, while also providing a foundation
> for adding features in the future.
>
> I've also merged the handle and event requests into a single request, the
> line request, as the two requests where mostly the same, other than the
> edge detection provided by event requests.  As a byproduct, the v2 uAPI
> allows for multiple lines producing edge events on the same line handle.
> This is a new capability as v1 only supports a single line in an event 
> request.
>
> This means there are now only two types of file handle to be concerned with,
> the chip and the line, and it is clearer which ioctls apply to which type
> of handle.
>
> There is also some minor renaming of fields for consistency compared to their
> v1 counterparts, e.g. offset rather than lineoffset or line_offset, and
> consumer rather than consumer_label.

> And v1 GPIOHANDLES_MAX and gpiohandle_data become GPIOLINES_MAX and
> gpioline_values for v2 - the only change being the renaming for clarity.

Hmm... I think it makes sense if you fully replace uAPI, otherwise in
my opinion it adds to the confusion.
My point is that we may try to be less invasive, perhaps?

> The v2 uAPI is mostly just a reorganisation of v1, so userspace code,
> particularly libgpiod, should easily port to it.
>
> This patch is obviously only one patch in a much bigger series that
> will actually implement it, but I would appreciate a review and any feedback,
> as it is foundational to the rest of that series.
>
> Thanks,
> Kent.
>
>  include/uapi/linux/gpio.h | 204 --
>  1 file changed, 197 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/gpio.h b/include/uapi/linux/gpio.h
> index 0206383c0383..3db7e0bc1312 100644
> --- a/include/uapi/linux/gpio.h
> +++ b/include/uapi/linux/gpio.h
> @@ -14,6 +14,9 @@
>  #include 
>  #include 
>
> +/* The maximum size of name and label arrays */
> +#define GPIO_MAX_NAME_SIZE 32
> +
>  /**
>   * struct gpiochip_info - Information about a certain GPIO chip
>   * @name: the Linux kernel name of this GPIO chip
> @@ -27,6 +30,184 @@ struct gpiochip_info {
> __u32 lines;
>  };
>
> +/* Maximum number of requested lines */
> +#define GPIOLINES_MAX 64
> +
> +enum gpioline_direction {
> +   GPIOLINE_DIRECTION_INPUT= 1,
> +   GPIOLINE_DIRECTION_OUTPUT   = 2,
> +};
> +
> +enum gpioline_drive {
> +   GPIOLINE_DRIVE_PUSH_PULL= 0,
> +   GPIOLINE_DRIVE_OPEN_DRAIN   = 1,
> +   GPIOLINE_DRIVE_OPEN_SOURCE  = 2,
> +};
> +
> +enum gpioline_bias {
> +   GPIOLINE_BIAS_DISABLE   = 0,
> +   GPIOLINE_BIAS_PULL_UP   = 1,
> +   GPIOLINE

Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-05-25 Thread Bartosz Golaszewski
sob., 16 maj 2020 o 08:45 Kent Gibson  napisał(a):
>
> Add a new version of the uAPI to address existing 32/64bit alignment
> issues, add support for debounce, and provide some future proofing by
> adding padding reserved for future use.
>
> Signed-off-by: Kent Gibson 
>
> ---
>
> This patch is a proposal to replace the majority of the uAPI, so some
> background and justification is in order.
>
> The alignment issue relates to the gpioevent_data, which packs to different
> sizes on 32bit and 64bit platforms. That creates problems for 32bit apps
> running on 64bit kernels.  The patch addresses that particular issue, and
> the problem more generally, by adding pad fields that explicitly pad
> structs out to 64bit boundaries, so they will pack to the same size now,
> and will continue to do so even if some of the reserved padding is used for
> __u64 fields in the future.
>
> The lack of future proofing in v1 makes it impossible to, for example,
> add the debounce feature that is included in v2.
> The future proofing is addressed by providing reserved padding in all
> structs for future features.  Specifically, the line request,
> config and info structs get updated versions and ioctls.
>
> I haven't added any padding to gpiochip_info, as I haven't seen any calls
> for new features for the corresponding ioctl, but I'm open to updating that
> as well.
>
> As the majority of the structs and ioctls were being replaced, it seemed
> opportune to rework some of the other aspects of the uAPI.
>
> Firstly, I've reworked the flags field throughout.  v1 has three different
> flags fields, each with their own separate bit definitions.  In v2 that is
> collapsed to one.  Further, the bits of the v2 flags field are used
> as feature enable flags, with any other necessary configuration fields encoded
> separately.  This is simpler and clearer, while also providing a foundation
> for adding features in the future.
>
> I've also merged the handle and event requests into a single request, the
> line request, as the two requests were mostly the same, other than the
> edge detection provided by event requests.  As a byproduct, the v2 uAPI
> allows for multiple lines producing edge events on the same line handle.
> This is a new capability as v1 only supports a single line in an event 
> request.
>
> This means there are now only two types of file handle to be concerned with,
> the chip and the line, and it is clearer which ioctls apply to which type
> of handle.
>
> There is also some minor renaming of fields for consistency compared to their
> v1 counterparts, e.g. offset rather than lineoffset or line_offset, and
> consumer rather than consumer_label.
>
> And v1 GPIOHANDLES_MAX and gpiohandle_data become GPIOLINES_MAX and
> gpioline_values for v2 - the only change being the renaming for clarity.
>
> The v2 uAPI is mostly just a reorganisation of v1, so userspace code,
> particularly libgpiod, should easily port to it.
>
> This patch is obviously only one patch in a much bigger series that
> will actually implement it, but I would appreciate a review and any feedback,
> as it is foundational to the rest of that series.
>
> Thanks,
> Kent.
>

Hi Kent,

Thanks for posting this. I like the general direction a lot. I'll
review this in detail later this week.

Seeing the speed at which you make progress I think I won't be
implementing support for the v1 of the watch ioctl() in libgpiod after
all. Once the v2 is live I will probably bump the API version in
libgpiod to v2.0.0 and make some non-compatible changes anyway.

Bart


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-05-25 Thread Kent Gibson
On Mon, May 25, 2020 at 10:39:42AM +0200, Linus Walleij wrote:
> On Sat, May 16, 2020 at 8:45 AM Kent Gibson  wrote:
> 
> > Add a new version of the uAPI to address existing 32/64bit alignment
> > issues, add support for debounce, and provide some future proofing by
> > adding padding reserved for future use.
> >
> > Signed-off-by: Kent Gibson 
> 
> I don't see any major problems with it, just some comments:
> 
> + * @padding: reserved for future use and should be zero filled
> 
> *MUST* be zerofilled is what it should say.
> 
> > +struct gpioline_config {
> > +   __u8 default_values[GPIOLINES_MAX];
> 
> So 32 bytes
> 

Actually that one is 64 bytes, which is the same as v1, i.e. GPIOLINES_MAX
is the same as GPIOHANDLES_MAX - just renamed.

On the subject of values, is there any reason to use a byte for each line
value rather than a bit?

> > +   __u32 flags;
> > +   __u8 direction;
> > +   __u8 drive;
> > +   __u8 bias;
> > +   __u8 edge_detection;
> 
> > Some comment in the struct that this adds up to 32 bits?
> 
> > +   __u32 debounce;
> > +   __u32 padding[7]; /* for future use */
> 
> What we need to do inside the kernel with all of these
> is to ascertain that they are 0 for now when they are
> sent in and refuse them otherwise. I think it was Lars-Peter
> Clausen who said that they had to retire a whole slew of
> ioctl()s because some userspace just used unallocated
> memory for this, and since that was random bytes it needed to
> be supported with that content forever, so the reserved
> padding could never be used for its reserved purpose.
> 
> This kind of comment goes for all the structs.
> 

OK, I wasn't sure how strict the validation should be on the kernel
side, but I'll take a strict stance and check the whole struct.

Having said that, when adding future fields, the idea was to have a bit
in the flags that indicates that the corresponding field is now valid.
If the flag is not set then whatever value is there is irrelevant.
e.g. the debounce field value is irrelevant and ignored unless the
GPIOLINE_FLAG_V2_DEBOUNCE is set in flags.
OTOH, if the bit is set then the field would have to be set correctly.
And the current kernel would reject a future request with an unsupported
bit set in flags.

But definitely better to play it safe - I will check that the padding is
zeroed, as well as any field for which the flag bit is clear.
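
As a sketch of that policy (not actual kernel code - example_validate_config()
and GPIOLINE_FLAGS_V2_VALID below are hypothetical, the latter being a mask of
all flag bits the running kernel knows about), the check could look roughly
like this:

static int example_validate_config(const struct gpioline_config *cfg)
{
        int i;

        /* reject flag bits this kernel does not implement */
        if (cfg->flags & ~GPIOLINE_FLAGS_V2_VALID)
                return -EINVAL;

        /* reserved padding must be zeroed so it can be claimed later */
        for (i = 0; i < ARRAY_SIZE(cfg->padding); i++)
                if (cfg->padding[i])
                        return -EINVAL;

        /* a field only carries meaning when its enabling flag is set */
        if (!(cfg->flags & GPIOLINE_FLAG_V2_DEBOUNCE) && cfg->debounce)
                return -EINVAL;
        if (!(cfg->flags & GPIOLINE_FLAG_V2_DIRECTION) && cfg->direction)
                return -EINVAL;

        return 0;
}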

> But mostly it is a question about the kernel code receiving
> or emitting these.
> 

For sure - all the structs returned will be zeroed before use so as not
to leak kernel stack - unless they originate from userspace of course.

Back on retired ioctls, I notice that 5, 6, and 7 are missing from gpio.
Have those been retired, or just skipped over by accident?
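
Coming back to the byte-versus-bit question above, and purely as an
illustration (this is not part of the RFC as posted - struct
example_line_values and its helper are hypothetical), a bit-packed form of
the line values could look like:

#include <linux/types.h>

struct example_line_values {
        __u64 bits;     /* bit i carries the value of requested line i */
        __u64 mask;     /* bit i selects whether line i is being accessed */
};

static inline void example_set_value(struct example_line_values *vals,
                                     unsigned int line, int value)
{
        vals->mask |= 1ULL << line;
        if (value)
                vals->bits |= 1ULL << line;
        else
                vals->bits &= ~(1ULL << line);
}

With GPIOLINES_MAX at 64, the whole set of values would then fit in 16 bytes
instead of 64.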

Cheers,
Kent.


Re: [RFC PATCH] gpio: uapi: v2 proposal

2020-05-25 Thread Linus Walleij
On Sat, May 16, 2020 at 8:45 AM Kent Gibson  wrote:

> Add a new version of the uAPI to address existing 32/64bit alignment
> issues, add support for debounce, and provide some future proofing by
> adding padding reserved for future use.
>
> Signed-off-by: Kent Gibson 

I don't see any major problems with it, just some comments:

+ * @padding: reserved for future use and should be zero filled

*MUST* be zerofilled is what it should say.

> +struct gpioline_config {
> +   __u8 default_values[GPIOLINES_MAX];

So 32 bytes

> +   __u32 flags;
> +   __u8 direction;
> +   __u8 drive;
> +   __u8 bias;
> +   __u8 edge_detection;

Some comment in the struct that this adds up to 32 bits?

> +   __u32 debounce;
> +   __u32 padding[7]; /* for future use */

What we need to do inside the kernel with all of these
is to ascertain that they are 0 for now when they are
sent in and refuse them otherwise. I think it was Lars-Peter
Clausen who said that they had to retire a whole slew of
ioctl()s because some userspace just used unallocated
memory for this, and since that was random bytes it needed to
be supported with that content forever, so the reserved
padding could never be used for its reserved purpose.

This kind of comment goes for all the structs.

But mostly it is a question about the kernel code receiving
or emitting these.

Yours,
Linus Walleij


[RFC PATCH] gpio: uapi: v2 proposal

2020-05-15 Thread Kent Gibson
Add a new version of the uAPI to address existing 32/64bit alignment
issues, add support for debounce, and provide some future proofing by
adding padding reserved for future use.

Signed-off-by: Kent Gibson 

---

This patch is a proposal to replace the majority of the uAPI, so some 
background and justification is in order.

The alignment issue relates to the gpioevent_data, which packs to different
sizes on 32bit and 64bit platforms. That creates problems for 32bit apps
running on 64bit kernels.  The patch addresses that particular issue, and
the problem more generally, by adding pad fields that explicitly pad
structs out to 64bit boundaries, so they will pack to the same size now,
and will continue to do so even if some of the reserved padding is used for
__u64 fields in the future.

The lack of future proofing in v1 makes it impossible to, for example,
add the debounce feature that is included in v2.
The future proofing is addressed by providing reserved padding in all
structs for future features.  Specifically, the line request, 
config and info structs get updated versions and ioctls.

I haven't added any padding to gpiochip_info, as I haven't seen any calls
for new features for the corresponding ioctl, but I'm open to updating that
as well.

As the majority of the structs and ioctls were being replaced, it seemed
opportune to rework some of the other aspects of the uAPI.

Firstly, I've reworked the flags field throughout.  v1 has three different
flags fields, each with their own separate bit definitions.  In v2 that is
collapsed to one.  Further, the bits of the v2 flags field are used
as feature enable flags, with any other necessary configuration fields encoded
separately.  This is simpler and clearer, while also providing a foundation
for adding features in the future.

I've also merged the handle and event requests into a single request, the
line request, as the two requests were mostly the same, other than the
edge detection provided by event requests.  As a byproduct, the v2 uAPI
allows for multiple lines producing edge events on the same line handle.
This is a new capability as v1 only supports a single line in an event request.

This means there are now only two types of file handle to be concerned with,
the chip and the line, and it is clearer which ioctls apply to which type
of handle.

There is also some minor renaming of fields for consistency compared to their
v1 counterparts, e.g. offset rather than lineoffset or line_offset, and 
consumer rather than consumer_label.

And v1 GPIOHANDLES_MAX and gpiohandle_data become GPIOLINES_MAX and 
gpioline_values for v2 - the only change being the renaming for clarity.

The v2 uAPI is mostly just a reorganisation of v1, so userspace code,
particularly libgpiod, should easily port to it.

This patch is obviously only one patch in a much bigger series that 
will actually implement it, but I would appreciate a review and any feedback,
as it is foundational to the rest of that series.

Thanks,
Kent.

 include/uapi/linux/gpio.h | 204 --
 1 file changed, 197 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/gpio.h b/include/uapi/linux/gpio.h
index 0206383c0383..3db7e0bc1312 100644
--- a/include/uapi/linux/gpio.h
+++ b/include/uapi/linux/gpio.h
@@ -14,6 +14,9 @@
 #include 
 #include 
 
+/* The maximum size of name and label arrays */
+#define GPIO_MAX_NAME_SIZE 32
+
 /**
  * struct gpiochip_info - Information about a certain GPIO chip
  * @name: the Linux kernel name of this GPIO chip
@@ -27,6 +30,184 @@ struct gpiochip_info {
__u32 lines;
 };
 
+/* Maximum number of requested lines */
+#define GPIOLINES_MAX 64
+
+enum gpioline_direction {
+   GPIOLINE_DIRECTION_INPUT= 1,
+   GPIOLINE_DIRECTION_OUTPUT   = 2,
+};
+
+enum gpioline_drive {
+   GPIOLINE_DRIVE_PUSH_PULL= 0,
+   GPIOLINE_DRIVE_OPEN_DRAIN   = 1,
+   GPIOLINE_DRIVE_OPEN_SOURCE  = 2,
+};
+
+enum gpioline_bias {
+   GPIOLINE_BIAS_DISABLE   = 0,
+   GPIOLINE_BIAS_PULL_UP   = 1,
+   GPIOLINE_BIAS_PULL_DOWN = 2,
+};
+
+enum gpioline_edge {
+   GPIOLINE_EDGE_NONE  = 0,
+   GPIOLINE_EDGE_RISING= 1,
+   GPIOLINE_EDGE_FALLING   = 2,
+   GPIOLINE_EDGE_BOTH  = 3,
+};
+
+/* Line flags - V2 */
+#define GPIOLINE_FLAG_V2_KERNEL         (1UL << 0) /* Line used by the kernel */
+#define GPIOLINE_FLAG_V2_ACTIVE_LOW     (1UL << 1)
+#define GPIOLINE_FLAG_V2_DIRECTION      (1UL << 2)
+#define GPIOLINE_FLAG_V2_DRIVE          (1UL << 3)
+#define GPIOLINE_FLAG_V2_BIAS           (1UL << 4)
+#define GPIOLINE_FLAG_V2_EDGE_DETECTION (1UL << 5)
+#define GPIOLINE_FLAG_V2_DEBOUNCE       (1UL << 6)
+
+/**
+ * struct gpioline_config - Configuration for GPIO lines
+ * @default_values: if the direction is GPIOLINE_DIRECTION_OUTPUT, this
+ * specifies the default output value, s
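
To make the flag-as-feature-enable model concrete, here is a minimal userspace
sketch of filling the proposed config for a debounced, both-edge input.  It is
illustrative only: example_fill_config() is hypothetical, the header is assumed
to come from a tree with this RFC applied, and the line-request struct and
ioctl numbers are not shown in this excerpt, so the request step itself is
omitted.

#include <string.h>
#include <linux/gpio.h>         /* assumes a tree with this RFC applied */

static void example_fill_config(struct gpioline_config *cfg)
{
        memset(cfg, 0, sizeof(*cfg));   /* reserved padding must be zero */

        cfg->flags = GPIOLINE_FLAG_V2_DIRECTION |
                     GPIOLINE_FLAG_V2_EDGE_DETECTION |
                     GPIOLINE_FLAG_V2_DEBOUNCE;
        cfg->direction = GPIOLINE_DIRECTION_INPUT;
        cfg->edge_detection = GPIOLINE_EDGE_BOTH;
        cfg->debounce = 5;      /* debounce period; units as defined by the full patch */
}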

Mutual business proposal

2020-05-09 Thread reginadan1


Greeting,
My name is Regina Daniel. I am a business consultant and I represent a group of 
companies based in the Gulf Region that wishes to invest between US$10,000,000.00 
and US$550,000,000.00 in foreign investment, depending on your investment capacity 
and the amount you can invest and manage. We are currently seeking means of 
expanding and relocating our business interests abroad in the following sectors: 
oil/gas, real estate, construction stock, mining, transportation, health, 
tobacco or communication services.

Also in Agriculture or any other viable sector.  If you think you have a solid 
background and idea of making good profit in any of the mentioned business 
sectors or any other business  in your country, please write me for possible 
business co-operation and the investment amount you can handle.

Let me hear from you.
Yours sincerely,
Mr Regina Daniel




Re: Lease semantic proposal

2019-10-10 Thread Dave Chinner
On Tue, Oct 01, 2019 at 02:01:57PM -0700, Ira Weiny wrote:
> On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> > On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > > Hence, AFAICT, the above definition of an F_RDLCK|F_LAYOUT lease
> > > > doesn't appear to be compatible with the semantics required by
> > > > existing users of layout leases.
> > > 
> > > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > > consistent
> > > with what is currently implemented.  Also, by exporting all this to user 
> > > space
> > > we can now write tests for it independent of the RDMA pinning.
> > 
> > The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> > layout changes to occur to the file while the layout lease is held.
> 
> This was not my understanding.

These are the remote procedure calls that the pNFS client uses to
map and/or allocate space in the file it has a lease on:

struct export_operations {

int (*map_blocks)(struct inode *inode, loff_t offset,
  u64 len, struct iomap *iomap,
  bool write, u32 *device_generation);
int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 int nr_iomaps, struct iattr *iattr);
};

.map_blocks() allows the pnfs client to allocate blocks in the
storage.  .commit_blocks() is called once the write is complete to
do things like unwritten extent conversion on extents that it
allocated. In the XFS implementation of these methods, they call
directly into the same XFS block mapping code that the
read/write/mmap IO paths call into.
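
As a sketch only (examplefs and its stubs are hypothetical; the real XFS
implementations call into its block mapping code as described above), a
filesystem wires these callbacks up roughly like so:

#include <linux/fs.h>
#include <linux/exportfs.h>
#include <linux/iomap.h>

static int examplefs_map_blocks(struct inode *inode, loff_t offset,
                                u64 len, struct iomap *iomap,
                                bool write, u32 *device_generation)
{
        /* map (and, for writes, allocate) blocks backing the client's layout */
        return -EOPNOTSUPP;
}

static int examplefs_commit_blocks(struct inode *inode, struct iomap *iomaps,
                                   int nr_iomaps, struct iattr *iattr)
{
        /* called once the client write completes, e.g. to convert
         * unwritten extents that were allocated for it */
        return -EOPNOTSUPP;
}

static const struct export_operations examplefs_export_ops = {
        .map_blocks     = examplefs_map_blocks,
        .commit_blocks  = examplefs_commit_blocks,
};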

A typical pNFS use case is an HPC cluster, where thousands of nodes
might all be writing to separate parts of a huge sparse file (e.g. an
out-of-core sparse matrix solver) and reading/writing directly to
the storage via iSER or some other low-level IB/RDMA storage
protocol.  Every write on every pNFS client needs space allocation,
so the pNFS server is basically just providing a remote interface to
the XFS space allocation interfaces for direct IO on the pNFS
clients.

IOWs, there can be thousands of concurrent pNFS layout leases on a
single inode at any given time and they can all be allocating space,
too.

> > IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> > to change the layout is in direct contradiction to existing users.
> > 
> > I've said this several times over the past few months now: shared
> > layout leases must allow layout modifications to be made.
> 
> I don't understand what the point of having a layout lease is then?

It's a *notification* mechanism.

Multiple processes can modify the file layout at the same time -
XFS was designed as a multi-writer filesystem from the ground up and
we make use of that with shared IO locks for direct IO writes. 

The read layout lease model we've used for pNFS is essentially the
same concurrent writer model that direct IO in XFS uses. And to
enable concurrent writers, we use shared locking for the the layout
leases.

IOWs, the pNFS client IO model is very similar to local direct IO,
except for the fact they can remotely cache layout mappings.  Hence
if you do a server-side local buffered write (delayed allocation),
truncate, punch a hole, etc, (or a remote operation through the NFS
server that ends up in these same paths) the mappings those pNFS
clients hold are no longer guaranteed to cover valid data and/or
have correct physical mappings for valid data held on the server.

At this point, the layouts need to be invalidated, and so the layout
lease is broken by the filesystem operations that may cause an
issue. The pNFS server reacts to the lease break by recalling the
client layout(s) and the pNFS client has to request a new layout
from the server to be able to directly access the storage again.

i.e. the layout lease is primarily a *notification* mechanism to
allow safe interop between application level direct access
mechanisms and local filesystem access.

What you are trying to do is turn this multi-writer layout lease
notification mechanism into a single writer access control
mechanism. i.e. F_UNBREAK is all about /control/ of the layout and
who can and can't modify it, regardless of whether write
permissions have been granted or not.

It seems I have been unable to get this point across despite trying
for months now: access control is not a function provided by layout
leases. If you need to guarantee exclusive access to a file so
nobody else can modify it, direct access or through the filesystem,
then that is what permission bits, ACLs, file locks, LSMs, etc are
for. Don't try to overload a layout change notification mechanism
with data access controls.

> I apologize for not understanding this.  My reading of the code is that layout
> changes require the read layout to be broken prior to proceeding.

There's a difference between additive layout changes (such as
al

Re: Lease semantic proposal

2019-10-07 Thread Jan Kara
On Mon 30-09-19 18:42:33, Dave Chinner wrote:
> On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > Hence, AFAICT, the above definition of an F_RDLCK|F_LAYOUT lease
> > > doesn't appear to be compatible with the semantics required by
> > > existing users of layout leases.
> > 
> > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > consistent
> > with what is currently implemented.  Also, by exporting all this to user 
> > space
> > we can now write tests for it independent of the RDMA pinning.
> 
> The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> layout changes to occur to the file while the layout lease is held.

I remember you saying that in past conversations. But I agree with Ira
that I don't see where in the code this would be implemented. AFAICS
break_layout() called from xfs_break_leased_layouts() simply breaks all the
leases with F_LAYOUT set that are attached to the inode... Now I'm no expert on
file leases, but what am I missing?

> IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> to change the layout is in direct contradiction to existing users.
> 
> I've said this several times over the past few months now: shared
> layout leases must allow layout modifications to be made. Only
> allowing an exclusive layout lease to modify the layout rules out
> many potential use cases for direct data placement and p2p DMA
> applications, not to mention conflicts with the existing pNFS usage.
> Layout leases need to support more than just RDMA, and tailoring the
> API to exactly the immediate needs of RDMA is just going to make it
> useless for anything else.

I agree we should not tailor the layout lease definition to just the RDMA
usecase. But let's talk about the semantics once our confusion about how
pNFS currently uses layout leases is cleared up.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Lease semantic proposal

2019-10-03 Thread Ira Weiny
On Thu, Oct 03, 2019 at 11:01:10AM +0200, Jan Kara wrote:
> On Tue 01-10-19 11:17:00, Ira Weiny wrote:
> > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > 
> > > Will userland require any special privileges in order to set an
> > > F_UNBREAK lease? This seems like something that could be used for DoS. I
> > > assume that these will never time out.
> > 
> > Dan and I discussed this some more and yes I think the uid of the process 
> > needs
> > to be the owner of the file.  I think that is a reasonable mechanism.
> 
> Honestly, I'm not convinced anything more than open-for-write should be
> required. Sure, an unbreakable lease may result in failing truncate and other
> ops but as we discussed at LFS/MM, this is not hugely different from
> executing a file resulting in ETXTBUSY for any truncate attempt (even from
> root). So a sufficiently privileged user has to be able to easily find which
> process(es) owns the lease so that he can kill it / take other
> administrative action to release the lease. But that's about it.

Well that was kind of what I was thinking.  However I wanted to be careful
about requiring write permission when doing an F_RDLCK.  I think that it has to
be clearly documented _why_ write permission is required.

>  
> > > How will we deal with the case where something is squatting on an
> > > F_UNBREAK lease and isn't letting it go?
> > 
> > That is a good question.  I had not considered someone taking the UNBREAK
> > without pinning the file.
> 
> IMHO the same answer as above - a sufficiently privileged user should be
> able to easily find the process holding the lease and kill it. Given the
> lease owner has to have write access to the file, he had better be from
> the same "security domain"...
> 
> > > Leases are technically "owned" by the file description -- we can't
> > > necessarily trace it back to a single task in a threaded program. The
> > > kernel task that set the lease may have exited by the time we go
> > > looking.
> > > 
> > > Will we be content trying to determine this using /proc/locks+lsof, etc,
> > > or will we need something better?
> > 
> > I think using /proc/locks is our best bet.  Similar to my intention to 
> > report
> > files being pinned.[1]
> > 
> > In fact should we consider files with F_UNBREAK leases "pinned" and just 
> > report
> > them there?
> 
> As Jeff wrote later, /proc/locks is not enough. You need PID(s) which have
> access to the lease and hold it alive. Your /proc/<pid>/ files you had in your
> patches should do that, shouldn't they? Maybe they were not tied to the
> right structure... They really need to be tied to the existence of a lease.

Yes, sorry.  I misspoke above.

Right now /proc/<pid>/file_pins indicates that the file is pinned by GUP.  I
think it may be reasonable to extend that to any file which has F_UNBREAK
specified.  'file_pins' may be the wrong name when we include F_UNBREAK'ed
leased files, so I will think on the name.  But I think this is possible and
desired.

Ira



Re: Lease semantic proposal

2019-10-03 Thread J. Bruce Fields
On Wed, Oct 02, 2019 at 04:35:55PM -0400, Jeff Layton wrote:
> On Wed, 2019-10-02 at 15:27 -0400, J. Bruce Fields wrote:
> > On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> > > For the byte ranges, the catch there is that extending the userland
> > > interface for that later will be difficult.
> > 
> > Why would it be difficult?
> 
> Legacy userland code that wanted to use byte range enabled layouts would
> have to be rebuilt to take advantage of them. If we require a range from
> the get-go, then they will get the benefit of them once they're
> available.

I can't see writing byte-range code for a kernel that doesn't support
that yet.  How would I test it?

> > > What I'd probably suggest
> > > (and what would jive with the way pNFS works) would be to go ahead and
> > > add an offset and length to the arguments and result (maybe also
> > > whence?).
> > 
> > Why not add new commands with range arguments later if it turns out to
> > be necessary?
> 
> We could do that. It'd be a little ugly, IMO, simply because then we'd
> end up with two interfaces that do almost the exact same thing.
> 
> Should byte-range layouts at that point conflict with non-byte range
> layouts, or should they be in different "spaces" (a'la POSIX and flock
> locks)? When it's all one interface, those sorts of questions sort of
> answer themselves. When they aren't we'll have to document them clearly
> and I think the result will be more confusing for userland programmers.

I was hoping they'd be in the same space, with the old interface just
defined to deal in locks with range [0,∞).

I'm just worried about getting the interface wrong if it's specified
without being implemented.  Maybe this is straightforward enough that
there's not a risk, I don't know.

--b.



Re: Lease semantic proposal

2019-10-03 Thread Jan Kara
On Tue 01-10-19 11:17:00, Ira Weiny wrote:
> On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > 
> > Will userland require any special privileges in order to set an
> > F_UNBREAK lease? This seems like something that could be used for DoS. I
> > assume that these will never time out.
> 
> Dan and I discussed this some more and yes I think the uid of the process 
> needs
> to be the owner of the file.  I think that is a reasonable mechanism.

Honestly, I'm not convinced anything more than open-for-write should be
required. Sure, an unbreakable lease may result in failing truncate and other
ops but as we discussed at LFS/MM, this is not hugely different from
executing a file resulting in ETXTBUSY for any truncate attempt (even from
root). So a sufficiently privileged user has to be able to easily find which
process(es) owns the lease so that he can kill it / take other
administrative action to release the lease. But that's about it.
 
> > How will we deal with the case where something is squatting on an
> > F_UNBREAK lease and isn't letting it go?
> 
> That is a good question.  I had not considered someone taking the UNBREAK
> without pinning the file.

IMHO the same answer as above - a sufficiently privileged user should be
able to easily find the process holding the lease and kill it. Given the
lease owner has to have write access to the file, he had better be from
the same "security domain"...

> > Leases are technically "owned" by the file description -- we can't
> > necessarily trace it back to a single task in a threaded program. The
> > kernel task that set the lease may have exited by the time we go
> > looking.
> > 
> > Will we be content trying to determine this using /proc/locks+lsof, etc,
> > or will we need something better?
> 
> I think using /proc/locks is our best bet.  Similar to my intention to report
> files being pinned.[1]
> 
> In fact should we consider files with F_UNBREAK leases "pinned" and just 
> report
> them there?

As Jeff wrote later, /proc/locks is not enough. You need PID(s) which have
access to the lease and hold it alive. Your /proc/<pid>/ files you had in your
patches should do that, shouldn't they? Maybe they were not tied to the
right structure... They really need to be tied to the existence of a lease.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Lease semantic proposal

2019-10-03 Thread Jan Kara
On Wed 02-10-19 16:35:55, Jeff Layton wrote:
> On Wed, 2019-10-02 at 15:27 -0400, J. Bruce Fields wrote:
> > On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> > > On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > > > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > > > Since the last RFC patch set[1] much of the discussion of 
> > > > > > supporting RDMA with
> > > > > > FS DAX has been around the semantics of the lease mechanism.[2]  
> > > > > > Within that
> > > > > > thread it was suggested I try and write some documentation and/or 
> > > > > > tests for the
> > > > > > new mechanism being proposed.  I have created a foundation to test 
> > > > > > lease
> > > > > > functionality within xfstests.[3] This should be close to being 
> > > > > > accepted.
> > > > > > Before writing additional lease tests, or changing lots of kernel 
> > > > > > code, this
> > > > > > email presents documentation for the new proposed "layout lease" 
> > > > > > semantic.
> > > > > > 
> > > > > > At Linux Plumbers[4] just over a week ago, I presented the current 
> > > > > > state of the
> > > > > > patch set and the outstanding issues.  Based on the discussion 
> > > > > > there, well as
> > > > > > follow up emails, I propose the following addition to the fcntl() 
> > > > > > man page.
> > > > > > 
> > > > > > Thank you,
> > > > > > Ira
> > > > > > 
> > > > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Thank you so much for doing this, Ira. This allows us to debate the
> > > > > user-visible behavior semantics without getting bogged down in the
> > > > > implementation details. More comments below:
> > > > 
> > > > Thanks.  Sorry for the delay in response.  Turns out this email was in 
> > > > my
> > > > spam...  :-/  I'll need to work out why.
> > > > 
> > > > > > 
> > > > > > Layout Leases
> > > > > > -
> > > > > > 
> > > > > > Layout (F_LAYOUT) leases are special leases which can be used to 
> > > > > > control and/or
> > > > > > be informed about the manipulation of the underlying layout of a 
> > > > > > file.
> > > > > > 
> > > > > > A layout is defined as the logical file block -> physical file 
> > > > > > block mapping
> > > > > > including the file size and sharing of physical blocks among files. 
> > > > > >  Note that
> > > > > > the unwritten state of a block is not considered part of file 
> > > > > > layout.
> > > > > > 
> > > > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > > > 
> > > > > > Read layout leases can be used to be informed of layout changes by 
> > > > > > the
> > > > > > system or other users.  This lease is similar to the standard read 
> > > > > > (F_RDLCK)
> > > > > > lease in that any attempt to change the _layout_ of the file will 
> > > > > > be reported to
> > > > > > the process through the lease break process.  But this lease is 
> > > > > > different
> > > > > > because the file can be opened for write and data can be read 
> > > > > > and/or written to
> > > > > > the file as long as the underlying layout of the file does not 
> > > > > > change.
> > > > > > Therefore, the lease is not broken if the file is simply open for 
> > > > > > write, but
> > > > > > _may_ be broken if an operation such as truncate(), fallocate() or 
> > > > > > write()
> > > > > > results in changing the underlying layout.
> > > > > > 
> > > > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > > > 
> > > > > > Write Layout leases can be used to break read layout leases to 
> > > > > > indicate that
> > > > > > the process intends to change the underlying layout of the 
> > > > > > file.
> > > > > > 
> > > > > > A process which has taken a write layout lease has exclusive 
> > > > > > ownership of the
> > > > > > file layout and can modify that layout as long as the lease is held.
> > > > > > Operations which change the layout are allowed by that process.  
> > > > > > But operations
> > > > > > from other file descriptors which attempt to change the layout will 
> > > > > > break the
> > > > > > lease through the standard lease break process.  The F_LAYOUT flag 
> > > > > > is used to
> > > > > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > > > > F_LAYOUT.  In
> > > > > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > > > > operations,
> > > > > > if they change the underlying layout, may.
> > > > > > 
> > > > > > The distinction between read layout leases and write layout leases 
> > > > > > is that
> > > > > > write layout leases can change the layout without breaking the 
> > > > > > lease within the
> > > > > > owning process.  This is useful to guarantee a layout prior to 
> > > > > > s

Re: Lease semantic proposal

2019-10-02 Thread Jeff Layton
On Wed, 2019-10-02 at 15:27 -0400, J. Bruce Fields wrote:
> On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> > On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > > Since the last RFC patch set[1] much of the discussion of supporting 
> > > > > RDMA with
> > > > > FS DAX has been around the semantics of the lease mechanism.[2]  
> > > > > Within that
> > > > > thread it was suggested I try and write some documentation and/or 
> > > > > tests for the
> > > > > new mechanism being proposed.  I have created a foundation to test 
> > > > > lease
> > > > > functionality within xfstests.[3] This should be close to being 
> > > > > accepted.
> > > > > Before writing additional lease tests, or changing lots of kernel 
> > > > > code, this
> > > > > email presents documentation for the new proposed "layout lease" 
> > > > > semantic.
> > > > > 
> > > > > At Linux Plumbers[4] just over a week ago, I presented the current 
> > > > > state of the
> > > > > patch set and the outstanding issues.  Based on the discussion there, 
> > > > > as well as
> > > > > follow up emails, I propose the following addition to the fcntl() man 
> > > > > page.
> > > > > 
> > > > > Thank you,
> > > > > Ira
> > > > > 
> > > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > > 
> > > > > 
> > > > 
> > > > Thank you so much for doing this, Ira. This allows us to debate the
> > > > user-visible behavior semantics without getting bogged down in the
> > > > implementation details. More comments below:
> > > 
> > > Thanks.  Sorry for the delay in response.  Turns out this email was in my
> > > spam...  :-/  I'll need to work out why.
> > > 
> > > > > 
> > > > > Layout Leases
> > > > > -
> > > > > 
> > > > > Layout (F_LAYOUT) leases are special leases which can be used to 
> > > > > control and/or
> > > > > be informed about the manipulation of the underlying layout of a file.
> > > > > 
> > > > > A layout is defined as the logical file block -> physical file block 
> > > > > mapping
> > > > > including the file size and sharing of physical blocks among files.  
> > > > > Note that
> > > > > the unwritten state of a block is not considered part of file layout.
> > > > > 
> > > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > > 
> > > > > Read layout leases can be used to be informed of layout changes by the
> > > > > system or other users.  This lease is similar to the standard read 
> > > > > (F_RDLCK)
> > > > > lease in that any attempt to change the _layout_ of the file will be 
> > > > > reported to
> > > > > the process through the lease break process.  But this lease is 
> > > > > different
> > > > > because the file can be opened for write and data can be read and/or 
> > > > > written to
> > > > > the file as long as the underlying layout of the file does not change.
> > > > > Therefore, the lease is not broken if the file is simply open for 
> > > > > write, but
> > > > > _may_ be broken if an operation such as truncate(), fallocate() or 
> > > > > write()
> > > > > results in changing the underlying layout.
> > > > > 
> > > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > > 
> > > > > Write Layout leases can be used to break read layout leases to 
> > > > > indicate that
> > > > > the process intends to change the underlying layout of the file.
> > > > > 
> > > > > A process which has taken a write layout lease has exclusive 
> > > > > ownership of the
> > > > > file layout and can modify that layout as long as the lease is held.
> > > > > Operations which change the layout are allowed by that process.  But 
> > > > > operations
> > > > > from other file descriptors which attempt to change the layout will 
> > > > > break the
> > > > > lease through the standard lease break process.  The F_LAYOUT flag is 
> > > > > used to
> > > > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > > > F_LAYOUT.  In
> > > > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > > > operations,
> > > > > if they change the underlying layout, may.
> > > > > 
> > > > > The distinction between read layout leases and write layout leases is 
> > > > > that
> > > > > write layout leases can change the layout without breaking the lease 
> > > > > within the
> > > > > owning process.  This is useful to guarantee a layout prior to 
> > > > > specifying the
> > > > > unbreakable flag described below.
> > > > > 
> > > > > 
> > > > 
> > > > The above sounds totally reasonable. You're essentially exposing the
> > > > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > > > leases work the same way as "normal" leases, wrt signals and timeouts?

Re: Lease semantic proposal

2019-10-02 Thread J. Bruce Fields
On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > Since the last RFC patch set[1] much of the discussion of supporting 
> > > > RDMA with
> > > > FS DAX has been around the semantics of the lease mechanism.[2]  Within 
> > > > that
> > > > thread it was suggested I try and write some documentation and/or tests 
> > > > for the
> > > > new mechanism being proposed.  I have created a foundation to test lease
> > > > functionality within xfstests.[3] This should be close to being 
> > > > accepted.
> > > > Before writing additional lease tests, or changing lots of kernel code, 
> > > > this
> > > > email presents documentation for the new proposed "layout lease" 
> > > > semantic.
> > > > 
> > > > At Linux Plumbers[4] just over a week ago, I presented the current 
> > > > state of the
> > > > patch set and the outstanding issues.  Based on the discussion there, 
> > > > as well as
> > > > follow up emails, I propose the following addition to the fcntl() man 
> > > > page.
> > > > 
> > > > Thank you,
> > > > Ira
> > > > 
> > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > 
> > > > 
> > > 
> > > Thank you so much for doing this, Ira. This allows us to debate the
> > > user-visible behavior semantics without getting bogged down in the
> > > implementation details. More comments below:
> > 
> > Thanks.  Sorry for the delay in response.  Turns out this email was in my
> > spam...  :-/  I'll need to work out why.
> > 
> > > > 
> > > > Layout Leases
> > > > -
> > > > 
> > > > Layout (F_LAYOUT) leases are special leases which can be used to 
> > > > control and/or
> > > > be informed about the manipulation of the underlying layout of a file.
> > > > 
> > > > A layout is defined as the logical file block -> physical file block 
> > > > mapping
> > > > including the file size and sharing of physical blocks among files.  
> > > > Note that
> > > > the unwritten state of a block is not considered part of file layout.
> > > > 
> > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > 
> > > > Read layout leases can be used to be informed of layout changes by the
> > > > system or other users.  This lease is similar to the standard read 
> > > > (F_RDLCK)
> > > > lease in that any attempt to change the _layout_ of the file will be 
> > > > reported to
> > > > the process through the lease break process.  But this lease is 
> > > > different
> > > > because the file can be opened for write and data can be read and/or 
> > > > written to
> > > > the file as long as the underlying layout of the file does not change.
> > > > Therefore, the lease is not broken if the file is simply open for 
> > > > write, but
> > > > _may_ be broken if an operation such as truncate(), fallocate() or 
> > > > write()
> > > > results in changing the underlying layout.
> > > > 
> > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > 
> > > > Write Layout leases can be used to break read layout leases to indicate 
> > > > that
> > > > the process intends to change the underlying layout of the file.
> > > > 
> > > > A process which has taken a write layout lease has exclusive ownership 
> > > > of the
> > > > file layout and can modify that layout as long as the lease is held.
> > > > Operations which change the layout are allowed by that process.  But 
> > > > operations
> > > > from other file descriptors which attempt to change the layout will 
> > > > break the
> > > > lease through the standard lease break process.  The F_LAYOUT flag is 
> > > > used to
> > > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > > F_LAYOUT.  In
> > > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > > operations,
> > > > if they change the underlying layout, may.
> > > > 
> > > > The distinction between read layout leases and write layout leases is 
> > > > that
> > > > write layout leases can change the layout without breaking the lease 
> > > > within the
> > > > owning process.  This is useful to guarantee a layout prior to 
> > > > specifying the
> > > > unbreakable flag described below.
> > > > 
> > > > 
> > > 
> > > The above sounds totally reasonable. You're essentially exposing the
> > > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > > leases work the same way as "normal" leases, wrt signals and timeouts?
> > 
> > That was my intention, yes.
> >
> > > I do wonder if we're better off not trying to "or" in flags for this,
> > > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > > feel terribly s

Re: Lease semantic proposal

2019-10-02 Thread Dan Williams
On Tue, Oct 1, 2019 at 2:02 PM Ira Weiny  wrote:
>
> On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> > On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > > Hence, AFAICT, the above definition of an F_RDLCK|F_LAYOUT lease
> > > > doesn't appear to be compatible with the semantics required by
> > > > existing users of layout leases.
> > >
> > > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > > consistent
> > > with what is currently implemented.  Also, by exporting all this to user 
> > > space
> > > we can now write tests for it independent of the RDMA pinning.
> >
> > The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> > layout changes to occur to the file while the layout lease is held.
>
> This was not my understanding.

I think you guys are talking past each other. F_RDLCK | F_LAYOUT can
be broken to allow writes to the file / layout. The new unbreakable
case would require explicit SIGKILL as "revocation method of last
resort", but that's the new incremental extension being proposed. No
changes to the current behavior of F_RDLCK | F_LAYOUT.

Dave, the question at hand is whether this new layout lease mode being
proposed is going to respond to BREAK_WRITE, or just BREAK_UNMAP. It
seems longterm page pinning conflicts really only care about
BREAK_UNMAP where pages that were part of the file are being removed
from the file. The unbreakable case can tolerate layout changes that
keep pinned pages mapped / allocated to the file.
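
For anyone following along, here is a minimal userspace sketch of taking the
read layout lease being proposed and reacting to a lease break.  F_LAYOUT is
the flag proposed in this thread and is not in mainline headers, so the
placeholder definition below is for illustration only; the rest is the
standard lease machinery (F_SETLEASE and the SIGIO lease-break notification).

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_LAYOUT
#define F_LAYOUT 0x10000  /* placeholder; the real value would come from the patched headers */
#endif

static volatile sig_atomic_t layout_broken;

static void lease_break_handler(int sig)
{
        (void)sig;
        layout_broken = 1;      /* the layout is changing; re-register or back off */
}

static int take_read_layout_lease(const char *path)
{
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;

        signal(SIGIO, lease_break_handler);     /* default lease-break signal */

        if (fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT) < 0) {
                perror("F_SETLEASE");
                close(fd);
                return -1;
        }
        return fd;      /* the layout can be relied on until SIGIO arrives */
}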


Re: Lease semantic proposal

2019-10-02 Thread Jeff Layton
On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > Since the last RFC patch set[1] much of the discussion of supporting RDMA 
> > > with
> > > FS DAX has been around the semantics of the lease mechanism.[2]  Within 
> > > that
> > > thread it was suggested I try and write some documentation and/or tests 
> > > for the
> > > new mechanism being proposed.  I have created a foundation to test lease
> > > functionality within xfstests.[3] This should be close to being accepted.
> > > Before writing additional lease tests, or changing lots of kernel code, 
> > > this
> > > email presents documentation for the new proposed "layout lease" semantic.
> > > 
> > > At Linux Plumbers[4] just over a week ago, I presented the current state 
> > > of the
> > > patch set and the outstanding issues.  Based on the discussion there, 
> > > as well as
> > > follow up emails, I propose the following addition to the fcntl() man 
> > > page.
> > > 
> > > Thank you,
> > > Ira
> > > 
> > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > 
> > > 
> > 
> > Thank you so much for doing this, Ira. This allows us to debate the
> > user-visible behavior semantics without getting bogged down in the
> > implementation details. More comments below:
> 
> Thanks.  Sorry for the delay in response.  Turns out this email was in my
> spam...  :-/  I'll need to work out why.
> 
> > > 
> > > Layout Leases
> > > -
> > > 
> > > Layout (F_LAYOUT) leases are special leases which can be used to control 
> > > and/or
> > > be informed about the manipulation of the underlying layout of a file.
> > > 
> > > A layout is defined as the logical file block -> physical file block 
> > > mapping
> > > including the file size and sharing of physical blocks among files.  Note 
> > > that
> > > the unwritten state of a block is not considered part of file layout.
> > > 
> > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > 
> > > Read layout leases can be used to be informed of layout changes by the
> > > system or other users.  This lease is similar to the standard read 
> > > (F_RDLCK)
> > > lease in that any attempt to change the _layout_ of the file will be 
> > > reported to
> > > the process through the lease break process.  But this lease is different
> > > because the file can be opened for write and data can be read and/or 
> > > written to
> > > the file as long as the underlying layout of the file does not change.
> > > Therefore, the lease is not broken if the file is simply open for write, 
> > > but
> > > _may_ be broken if an operation such as truncate(), fallocate() or 
> > > write()
> > > results in changing the underlying layout.
> > > 
> > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > 
> > > Write Layout leases can be used to break read layout leases to indicate 
> > > that
> > > the process intends to change the underlying layout of the file.
> > > 
> > > A process which has taken a write layout lease has exclusive ownership of 
> > > the
> > > file layout and can modify that layout as long as the lease is held.
> > > Operations which change the layout are allowed by that process.  But 
> > > operations
> > > from other file descriptors which attempt to change the layout will break 
> > > the
> > > lease through the standard lease break process.  The F_LAYOUT flag is 
> > > used to
> > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > F_LAYOUT.  In
> > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > operations,
> > > if they change the underlying layout, may.
> > > 
> > > The distinction between read layout leases and write layout leases is that
> > > write layout leases can change the layout without breaking the lease 
> > > within the
> > > owning process.  This is useful to guarantee a layout prior to specifying 
> > > the
> > > unbreakable flag described below.
> > > 
> > > 
> > 
> > The above sounds totally reasonable. You're essentially exposing the
> > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > leases work the same way as "normal" leases, wrt signals and timeouts?
> 
> That was my intention, yes.
>
> > I do wonder if we're better off not trying to "or" in flags for this,
> > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > feel terribly strongly about it.
> 
> I'm leaning that way as well, to make these even more distinct from
> F_SETLEASE.
> 
> > Also, at least in NFSv4, layouts are handed out for a particular byte
> > range in a file. Should we consider doing this with an API that allows
> > for that in the future? Is this something that would be

Re: Lease semantic proposal

2019-10-01 Thread Ira Weiny
On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > Hence, AFAICT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > doesn't appear to be compatible with the semantics required by
> > > existing users of layout leases.
> > 
> > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > consistent
> > with what is currently implemented.  Also, by exporting all this to user 
> > space
> > we can now write tests for it independent of the RDMA pinning.
> 
> The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> layout changes to occur to the file while the layout lease is held.

This was not my understanding.

> IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> to change the layout is in direct contradiction to existing users.
> 
> I've said this several times over the past few months now: shared
> layout leases must allow layout modifications to be made.

I don't understand what the point of having a layout lease is then?

>
> Only
> allowing an exclusive layout lease to modify the layout rules out
> many potential use cases for direct data placement and p2p DMA
> applications,

How?  I think that having a typical design pattern of multiple readers
and only a single writer would actually make all these use cases easier.

> not to mention conflicts with the existing pNFS usage.

I apologize for not understanding this.  My reading of the code is that layout
changes require the read layout lease to be broken prior to proceeding.

The break layout code does this by creating an F_WRLCK lease of type FL_LAYOUT, which
conflicts with any F_RDLCK lease of type FL_LAYOUT...

int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
{
        ...
        struct file_lock *new_fl, *fl, *tmp;
        ...

        new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK, 0);
        if (IS_ERR(new_fl))
                return PTR_ERR(new_fl);
        new_fl->fl_flags = type;
        ...
        list_for_each_entry_safe(fl, tmp, &ctx->flc_lease, fl_list) {
                if (!leases_conflict(fl, new_fl))
                        continue;
                ...
        }

type == FL_LAYOUT comes from the call here:

static inline int break_layout(struct inode *inode, bool wait)
{
        smp_mb();
        if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
                return __break_lease(inode,
                                wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
                                FL_LAYOUT);
        return 0;
}
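
To make the conflict rule concrete, below is a minimal, self-contained sketch of
what I am describing.  It is NOT the verbatim leases_conflict() helper from
fs/locks.c; FL_LAYOUT and the demo lease structure are local stand-ins defined
only for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <fcntl.h>                      /* F_RDLCK / F_WRLCK */

#define FL_LAYOUT 0x01                  /* stand-in for the kernel-internal flag */

struct demo_lease {
        int type;                       /* F_RDLCK or F_WRLCK */
        unsigned int flags;             /* 0 or FL_LAYOUT */
};

/* A breaker only conflicts with leases of the same kind (layout vs.
 * non-layout), and only if at least one of the two is a write lease. */
static bool demo_leases_conflict(const struct demo_lease *lease,
                                 const struct demo_lease *breaker)
{
        if ((lease->flags & FL_LAYOUT) != (breaker->flags & FL_LAYOUT))
                return false;
        return lease->type == F_WRLCK || breaker->type == F_WRLCK;
}

int main(void)
{
        struct demo_lease rd_layout = { F_RDLCK, FL_LAYOUT };
        struct demo_lease rd_plain  = { F_RDLCK, 0 };
        struct demo_lease breaker   = { F_WRLCK, FL_LAYOUT }; /* what break_layout() queues */

        /* Every F_RDLCK/FL_LAYOUT holder conflicts with the F_WRLCK/FL_LAYOUT
         * breaker and so gets revoked; a plain read lease is left alone. */
        printf("read layout lease vs layout breaker: %d\n",
               demo_leases_conflict(&rd_layout, &breaker));
        printf("plain read lease vs layout breaker:  %d\n",
               demo_leases_conflict(&rd_plain, &breaker));
        return 0;
}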

Also, I don't see any code which limits the number of read layout holders which
can be present and all of them will be revoked by the above code.

What am I missing?

Ira



Re: Lease semantic proposal

2019-10-01 Thread Ira Weiny
On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > Since the last RFC patch set[1] much of the discussion of supporting RDMA 
> > with
> > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > thread it was suggested I try and write some documentation and/or tests for 
> > the
> > new mechanism being proposed.  I have created a foundation to test lease
> > functionality within xfstests.[3] This should be close to being accepted.
> > Before writing additional lease tests, or changing lots of kernel code, this
> > email presents documentation for the new proposed "layout lease" semantic.
> > 
> > At Linux Plumbers[4] just over a week ago, I presented the current state of 
> > the
> > patch set and the outstanding issues.  Based on the discussion there, as well 
> > as
> > follow up emails, I propose the following addition to the fcntl() man page.
> > 
> > Thank you,
> > Ira
> > 
> > [1] https://lkml.org/lkml/2019/8/9/1043
> > [2] https://lkml.org/lkml/2019/8/9/1062
> > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > 
> > 
> 
> Thank you so much for doing this, Ira. This allows us to debate the
> user-visible behavior semantics without getting bogged down in the
> implementation details. More comments below:

Thanks.  Sorry for the delay in response.  Turns out this email was in my
spam...  :-/  I'll need to work out why.

> 
> > 
> > Layout Leases
> > -
> > 
> > Layout (F_LAYOUT) leases are special leases which can be used to control 
> > and/or
> > be informed about the manipulation of the underlying layout of a file.
> > 
> > A layout is defined as the logical file block -> physical file block mapping
> > including the file size and sharing of physical blocks among files.  Note 
> > that
> > the unwritten state of a block is not considered part of file layout.
> > 
> > **Read layout lease F_RDLCK | F_LAYOUT**
> > 
> > Read layout leases can be used to be informed of layout changes by the
> > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > lease in that any attempt to change the _layout_ of the file will be 
> > reported to
> > the process through the lease break process.  But this lease is different
> > because the file can be opened for write and data can be read and/or 
> > written to
> > the file as long as the underlying layout of the file does not change.
> > Therefore, the lease is not broken if the file is simply open for write, but
> > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > results in changing the underlying layout.
> > 
> > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > 
> > Write Layout leases can be used to break read layout leases to indicate that
> > the process intends to change the underlying layout lease of the file.
> > 
> > A process which has taken a write layout lease has exclusive ownership of 
> > the
> > file layout and can modify that layout as long as the lease is held.
> > Operations which change the layout are allowed by that process.  But 
> > operations
> > from other file descriptors which attempt to change the layout will break 
> > the
> > lease through the standard lease break process.  The F_LAYOUT flag is used 
> > to
> > indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  
> > In
> > the F_LAYOUT case opens for write do not break the lease.  But some 
> > operations,
> > if they change the underlying layout, may.
> > 
> > The distinction between read layout leases and write layout leases is that
> > write layout leases can change the layout without breaking the lease within 
> > the
> > owning process.  This is useful to guarantee a layout prior to specifying 
> > the
> > unbreakable flag described below.
> > 
> > 
> 
> The above sounds totally reasonable. You're essentially exposing the
> behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> leases work the same way as "normal" leases, wrt signals and timeouts?

That was my intention, yes.

> 
> I do wonder if we're better off not trying to "or" in flags for this,
> and instead have a separate set of commands (maybe F_RDLAYOUT,
> F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> feel terribly strongly about it.

I'm leaning that way as well, to make these even more distinct from
F_SETLEASE.

> 
> Also, at least in NFSv4, layouts are handed out for a particular byte
> range in a file. Should we consider doing this with an API that allows
> for that in the future? Is this something that would be desirable for
> your RDMA+DAX use-cases?

I don't see this.  I've thought it would be a nice thing to have but I don't
know of any hard use case.  But first I'd like to understand how this works for
NFS.

> 
> We could add a new F_SETLEASE variant that takes a struct with a byte
> range (something like struct fl

INVESTMENT PROPOSAL.

2019-10-01 Thread Daouda Ali
It’s my pleasure to contact you through this media because I need an
investment assistance in your country. However I have a profitable
investment proposal with  good interest to share with you, amounted
the sum of (Twenty Eight Million Four Hundred Thousand United State
Dollar ($28.400.000.00). If you  are willing to handle this project
kindly reply urgent to enable me provide you more information about
the investment funds and the project.

I am waiting to hear from you through this my private
email(daoudaali2...@gmail.com) so we can proceed further.

Best Regards.
Mr. Daouda Ali.


Re: Lease semantic proposal

2019-09-30 Thread Dave Chinner
On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > Hence, AFAICT, the above definition of a F_RDLCK|F_LAYOUT lease
> > doesn't appear to be compatible with the semantics required by
> > existing users of layout leases.
> 
> I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> with what is currently implemented.  Also, by exporting all this to user space
> we can now write tests for it independent of the RDMA pinning.

The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
layout changes to occur to the file while the layout lease is held.
IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
to change the layout is in direct contradiction to existing users.

I've said this several times over the past few months now: shared
layout leases must allow layout modifications to be made. Only
allowing an exclusive layout lease to modify the layout rules out
many potential use cases for direct data placement and p2p DMA
applications, not to mention conflicts with the existing pNFS usage.
Layout leases need to support more than just RDMA, and tailoring the
API to exactly the immediate needs of RDMA is just going to make it
useless for anything else.

I'm getting frustrated now because we still seem to be going around
in circles and getting nowhere.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Lease semantic proposal

2019-09-26 Thread Jeff Layton
On Wed, 2019-09-25 at 16:46 -0700, Ira Weiny wrote:
> On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > On Mon, Sep 23, 2019 at 12:08:53PM -0700, Ira Weiny wrote:
> > > Since the last RFC patch set[1] much of the discussion of supporting RDMA 
> > > with
> > > FS DAX has been around the semantics of the lease mechanism.[2]  Within 
> > > that
> > > thread it was suggested I try and write some documentation and/or tests 
> > > for the
> > > new mechanism being proposed.  I have created a foundation to test lease
> > > functionality within xfstests.[3] This should be close to being accepted.
> > > Before writing additional lease tests, or changing lots of kernel code, 
> > > this
> > > email presents documentation for the new proposed "layout lease" semantic.
> > > 
> > > At Linux Plumbers[4] just over a week ago, I presented the current state 
> > > of the
> > > patch set and the outstanding issues.  Based on the discussion there, 
> > > as well as
> > > follow up emails, I propose the following addition to the fcntl() man 
> > > page.
> > > 
> > > Thank you,
> > > Ira
> > > 
> > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > 
> > > 
> > > 
> > > Layout Leases
> > > -
> > > 
> > > Layout (F_LAYOUT) leases are special leases which can be used to control 
> > > and/or
> > > be informed about the manipulation of the underlying layout of a file.
> > > 
> > > A layout is defined as the logical file block -> physical file block 
> > > mapping
> > > including the file size and sharing of physical blocks among files.  Note 
> > > that
> > > the unwritten state of a block is not considered part of file layout.
> > 
> > Why even mention "unwritten" state if it's not considered something
> > that the layout lease treats differently?
> > 
> > i.e. Unwritten extents are a filesystem implementation detail that
> > is not exposed to userspace by anything other than FIEMAP. If they
> > have no impact on layout lease behaviour, then why raise it as
> > something the user needs to know about?
> 
> This paragraph was intended to define a layout.  So I guess one could say our
> internal discussion on what defines a "layout" has leaked into the external
> documentation.  Do you think we should just remove the second sentence or the
> whole paragraph?
> 
> > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > 
> > > Read layout leases can be used to be informed of layout changes by the
> > > system or other users.  This lease is similar to the standard read 
> > > (F_RDLCK)
> > > lease in that any attempt to change the _layout_ of the file will be 
> > > reported to
> > > the process through the lease break process. 
> > 
> > Similar in what way? The standard F_RDLCK lease triggers on open or
> > truncate - a layout lease does nothing of the sort.
> 
> Similar in that attempts to "write" the layout will result in breaking the
> lease just like attempts to write the file would break the standard F_RDLCK
> lease.  I'm not stuck on the verbiage though; similar may be the wrong word.
> 
> > > But this lease is different
> > > because the file can be opened for write and data can be read and/or 
> > > written to
> > > the file as long as the underlying layout of the file does not change.
> > 
> > So a F_RDLCK|F_LAYOUT can be taken on a O_WRONLY fd, unlike a
> > F_RDLCK which can only be taken on O_RDONLY fd.
> 
> That was the idea, yes.
> 
> > I think these semantics are sufficiently different to F_RDLCK they
> > need to be explicitly documented, because I see problems here.
> 
> I agree, and I intended this document to indicate how they are different.
> 
> Another option may be to define F_RDLAYOUT and not have F_LAYOUT such that it 
> is
> clear that this lease is not associated with F_RDLCK at all.  It is different.
> 
> > > Therefore, the lease is not broken if the file is simply open for write, 
> > > but
> > > _may_ be broken if an operation such as, truncate(), fallocate() or 
> > > write()
> > > results in changing the underlying layout.
> > 
> > As will mmap(), any number of XFS and ext4 ioctls, etc. 
> > 
> > So this really needs to say "_will_ be broken if *any* modification to
> > the file _might_ need to change the underlying physical layout".
> 
> Agreed.  I used the word "may" because a simple write() does not necessarily
> change the layout of the file.  But I like your verbiage better.  I did wonder
> if listing operations was a bad idea.  So I'm ok simply leaving that detail
> out.
> 
> > Now, the big question: what happens to a process with a
> > F_RDLCK|F_LAYOUT lease held does a write that triggers a layout
> > change? What happens then?
> 
> Because F_UNBREAK is not specified, the write() operation is held for lease
> break time and then the lease is broken if not voluntarily released.  This
> would be the same pattern

Business Proposal

2019-09-25 Thread Naomi Hayashi



-- 
Hello, I have sent mail to you previously ,maybe it didn't delivered. I am 
Naomi Hayashi By name. I am sorry for contacting you directly to your email. I 
will like to talk to you about something very important. Please acknowledge my 
email so I can provide you with details.await your response.Best Regards,Naomi 
Hayashi


Re: Lease semantic proposal

2019-09-25 Thread Ira Weiny
On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> On Mon, Sep 23, 2019 at 12:08:53PM -0700, Ira Weiny wrote:
> > 
> > Since the last RFC patch set[1] much of the discussion of supporting RDMA 
> > with
> > FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> > thread it was suggested I try and write some documentation and/or tests for 
> > the
> > new mechanism being proposed.  I have created a foundation to test lease
> > functionality within xfstests.[3] This should be close to being accepted.
> > Before writing additional lease tests, or changing lots of kernel code, this
> > email presents documentation for the new proposed "layout lease" semantic.
> > 
> > At Linux Plumbers[4] just over a week ago, I presented the current state of 
> > the
> > patch set and the outstanding issues.  Based on the discussion there, as well 
> > as
> > follow up emails, I propose the following addition to the fcntl() man page.
> > 
> > Thank you,
> > Ira
> > 
> > [1] https://lkml.org/lkml/2019/8/9/1043
> > [2] https://lkml.org/lkml/2019/8/9/1062
> > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > 
> > 
> > 
> > Layout Leases
> > -
> > 
> > Layout (F_LAYOUT) leases are special leases which can be used to control 
> > and/or
> > be informed about the manipulation of the underlying layout of a file.
> > 
> > A layout is defined as the logical file block -> physical file block mapping
> > including the file size and sharing of physical blocks among files.  Note 
> > that
> > the unwritten state of a block is not considered part of file layout.
> 
> Why even mention "unwritten" state if it's not considered something
> that the layout lease treats differently?
> 
> i.e. Unwritten extents are a filesystem implementation detail that
> is not exposed to userspace by anything other than FIEMAP. If they
> have no impact on layout lease behaviour, then why raise it as
> something the user needs to know about?

This paragraph was intended to define a layout.  So I guess one could say our
internal discussion on what defines a "layout" has leaked into the external
documentation.  Do you think we should just remove the second sentence or the
whole paragraph?

> 
> > **Read layout lease F_RDLCK | F_LAYOUT**
> > 
> > Read layout leases can be used to be informed of layout changes by the
> > system or other users.  This lease is similar to the standard read (F_RDLCK)
> > lease in that any attempt to change the _layout_ of the file will be 
> > reported to
> > the process through the lease break process. 
> 
> Similar in what way? The standard F_RDLCK lease triggers on open or
> truncate - a layout lease does nothing of the sort.

Similar in that attempts to "write" the layout will result in breaking the
lease just like attempts to write the file would break the standard F_RDLCK
lease.  I'm not stuck on the verbiage though; similar may be the wrong word.

> 
> > But this lease is different
> > because the file can be opened for write and data can be read and/or 
> > written to
> > the file as long as the underlying layout of the file does not change.
> 
> So a F_RDLCK|F_LAYOUT can be taken on a O_WRONLY fd, unlike a
> F_RDLCK which can only be taken on O_RDONLY fd.

That was the idea, yes.

> 
> I think these semantics are sufficiently different to F_RDLCK they
> need to be explicitly documented, because I see problems here.

I agree, and I intended this document to indicate how they are different.

Another option may be to define F_RDLAYOUT and not have F_LAYOUT such that it is
clear that this lease is not associated with F_RDLCK at all.  It is different.

> 
> > Therefore, the lease is not broken if the file is simply open for write, but
> > _may_ be broken if an operation such as, truncate(), fallocate() or write()
> > results in changing the underlying layout.
> 
> As will mmap(), any number of XFS and ext4 ioctls, etc. 
> 
> So this really needs to say "_will_ be broken if *any* modification to
> the file _might_ need to change the underlying physical layout".

Agreed.  I used the word "may" because a simple write() does not necessarily
change the layout of the file.  But I like your verbiage better.  I did wonder
if listing operations was a bad idea.  So I'm ok simply leaving that detail
out.

> 
> Now, the big question: what happens to a process with a
> F_RDLCK|F_LAYOUT lease held does a write that triggers a layout
> change? What happens then?

Because F_UNBREAK is not specified, the write() operation is held for lease
break time and then the lease is broken if not voluntarily released.  This
would be the same pattern as a process holding a F_RDLCK and opening the file
O_RDWR.
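
For reference, that existing pattern with a plain F_RDLCK lease looks like the
sketch below; this uses only today's fcntl(2) API (no proposed flags), and the
file name is just an example:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t lease_broken;

static void on_lease_break(int sig)
{
        (void)sig;
        lease_broken = 1;               /* async-signal-safe flag only */
}

int main(void)
{
        struct sigaction sa = { 0 };
        int fd;

        sa.sa_handler = on_lease_break;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGIO, &sa, NULL);

        fd = open("data.bin", O_RDONLY);        /* an F_RDLCK lease needs a read-only open */
        if (fd < 0)
                return 1;

        fcntl(fd, F_SETSIG, SIGIO);             /* deliver SIGIO when the lease is broken */
        if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
                return 1;

        while (!lease_broken)                   /* work while the lease is held */
                pause();

        /*
         * Someone opened the file for write (or truncated it); we now have
         * /proc/sys/fs/lease-break-time seconds to release the lease before
         * the kernel revokes it for us.
         */
        fcntl(fd, F_SETLEASE, F_UNLCK);
        close(fd);
        return 0;
}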

> 
> Also, have you noticed that XFS will unconditionally break layouts on
> write() because it /might/ need to change the layout? i.e. the
> BREAK_WRITE case in xfs_file_aio_write_checks()? This is needed for
> correctly supporting pNF

Re: Lease semantic proposal

2019-09-23 Thread Dave Chinner
On Mon, Sep 23, 2019 at 12:08:53PM -0700, Ira Weiny wrote:
> 
> Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> thread it was suggested I try and write some documentation and/or tests for 
> the
> new mechanism being proposed.  I have created a foundation to test lease
> functionality within xfstests.[3] This should be close to being accepted.
> Before writing additional lease tests, or changing lots of kernel code, this
> email presents documentation for the new proposed "layout lease" semantic.
> 
> At Linux Plumbers[4] just over a week ago, I presented the current state of 
> the
> patch set and the outstanding issues.  Based on the discussion there, as well as
> follow up emails, I propose the following addition to the fcntl() man page.
> 
> Thank you,
> Ira
> 
> [1] https://lkml.org/lkml/2019/8/9/1043
> [2] https://lkml.org/lkml/2019/8/9/1062
> [3] https://www.spinics.net/lists/fstests/msg12620.html
> [4] https://linuxplumbersconf.org/event/4/contributions/368/
> 
> 
> 
> Layout Leases
> -
> 
> Layout (F_LAYOUT) leases are special leases which can be used to control 
> and/or
> be informed about the manipulation of the underlying layout of a file.
> 
> A layout is defined as the logical file block -> physical file block mapping
> including the file size and sharing of physical blocks among files.  Note that
> the unwritten state of a block is not considered part of file layout.

Why even mention "unwritten" state if it's not considered something
that the layout lease treats differently?

i.e. Unwritten extents are a filesystem implementation detail that
is not exposed to userspace by anything other than FIEMAP. If they
have no impact on layout lease behaviour, then why raise it as
something the user needs to know about?

> **Read layout lease F_RDLCK | F_LAYOUT**
> 
> Read layout leases can be used to be informed of layout changes by the
> system or other users.  This lease is similar to the standard read (F_RDLCK)
> lease in that any attempt to change the _layout_ of the file will be reported 
> to
> the process through the lease break process. 

Similar in what way? The standard F_RDLCK lease triggers on open or
truncate - a layout lease does nothing of the sort.

> But this lease is different
> because the file can be opened for write and data can be read and/or written 
> to
> the file as long as the underlying layout of the file does not change.

So a F_RDLCK|F_LAYOUT can be taken on a O_WRONLY fd, unlike a
F_RDLCK which can only be taken on O_RDONLY fd.

I think these semantics are sufficiently different to F_RDLCK they
need to be explicitly documented, because I see problems here.

> Therefore, the lease is not broken if the file is simply open for write, but
> _may_ be broken if an operation such as, truncate(), fallocate() or write()
> results in changing the underlying layout.

As will mmap(), any number of XFS and ext4 ioctls, etc. 

So this really needs to say "_will_ be broken if *any* modification to
the file _might_ need to change the underlying physical layout".

Now, the big question: what happens to a process with a
F_RDLCK|F_LAYOUT lease held does a write that triggers a layout
change? What happens then?

Also, have you noticed that XFS will unconditionally break layouts on
write() because it /might/ need to change the layout? i.e. the
BREAK_WRITE case in xfs_file_aio_write_checks()? This is needed for
correctly supporting pNFS layout coherency against local IO. i.e.
local write() breaks layouts held by NFS server to get the
delegation recalled.

So by the above definition of F_RDLCK|F_LAYOUT behaviour, a holder
of such a lease doing a write() to that file would trigger a lease
break of their own lease as the filesystem has notified the lease
layer that there is a layout change about to happen. What's expected
to happen here?

Hence, AFAICT, the above definition of a F_RDLCK|F_LAYOUT lease
doesn't appear to be compatible with the semantics required by
existing users of layout leases.

> **Write layout lease (F_WRLCK | F_LAYOUT)**
> 
> Write Layout leases can be used to break read layout leases to indicate that
> the process intends to change the underlying layout lease of the file.

Any write() can change the layout of the file, and userspace cannot
tell in advance whether that will occur (neither can the
filesystem), so it seems to me that any application that needs to
write data is going to have to use F_WRLCK|F_LAYOUT.

> A process which has taken a write layout lease has exclusive ownership of the
> file layout and can modify that layout as long as the lease is held.

Which further implies single writer semantics and leases are
associated with a single open fd. Single writers are something we
are always trying to avoid in XFS.

> Operations which change the layout are allowed by that process.  But 
> operations
> from other file descriptors which attempt to ch

Re: Lease semantic proposal

2019-09-23 Thread Jeff Layton
On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> Since the last RFC patch set[1] much of the discussion of supporting RDMA with
> FS DAX has been around the semantics of the lease mechanism.[2]  Within that
> thread it was suggested I try and write some documentation and/or tests for 
> the
> new mechanism being proposed.  I have created a foundation to test lease
> functionality within xfstests.[3] This should be close to being accepted.
> Before writing additional lease tests, or changing lots of kernel code, this
> email presents documentation for the new proposed "layout lease" semantic.
> 
> At Linux Plumbers[4] just over a week ago, I presented the current state of 
> the
> patch set and the outstanding issues.  Based on the discussion there, as well as
> follow up emails, I propose the following addition to the fcntl() man page.
> 
> Thank you,
> Ira
> 
> [1] https://lkml.org/lkml/2019/8/9/1043
> [2] https://lkml.org/lkml/2019/8/9/1062
> [3] https://www.spinics.net/lists/fstests/msg12620.html
> [4] https://linuxplumbersconf.org/event/4/contributions/368/
> 
> 

Thank you so much for doing this, Ira. This allows us to debate the
user-visible behavior semantics without getting bogged down in the
implementation details. More comments below:

> 
> Layout Leases
> -
> 
> Layout (F_LAYOUT) leases are special leases which can be used to control 
> and/or
> be informed about the manipulation of the underlying layout of a file.
> 
> A layout is defined as the logical file block -> physical file block mapping
> including the file size and sharing of physical blocks among files.  Note that
> the unwritten state of a block is not considered part of file layout.
> 
> **Read layout lease F_RDLCK | F_LAYOUT**
> 
> Read layout leases can be used to be informed of layout changes by the
> system or other users.  This lease is similar to the standard read (F_RDLCK)
> lease in that any attempt to change the _layout_ of the file will be reported 
> to
> the process through the lease break process.  But this lease is different
> because the file can be opened for write and data can be read and/or written 
> to
> the file as long as the underlying layout of the file does not change.
> Therefore, the lease is not broken if the file is simply open for write, but
> _may_ be broken if an operation such as, truncate(), fallocate() or write()
> results in changing the underlying layout.
> 
> **Write layout lease (F_WRLCK | F_LAYOUT)**
> 
> Write Layout leases can be used to break read layout leases to indicate that
> the process intends to change the underlying layout lease of the file.
> 
> A process which has taken a write layout lease has exclusive ownership of the
> file layout and can modify that layout as long as the lease is held.
> Operations which change the layout are allowed by that process.  But 
> operations
> from other file descriptors which attempt to change the layout will break the
> lease through the standard lease break process.  The F_LAYOUT flag is used to
> indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> the F_LAYOUT case opens for write do not break the lease.  But some 
> operations,
> if they change the underlying layout, may.
> 
> The distinction between read layout leases and write layout leases is that
> write layout leases can change the layout without breaking the lease within 
> the
> owning process.  This is useful to guarantee a layout prior to specifying the
> unbreakable flag described below.
> 
> 

The above sounds totally reasonable. You're essentially exposing the
behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
leases work the same way as "normal" leases, wrt signals and timeouts?

I do wonder if we're better off not trying to "or" in flags for this,
and instead have a separate set of commands (maybe F_RDLAYOUT,
F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
feel terribly strongly about it.

Also, at least in NFSv4, layouts are handed out for a particular byte
range in a file. Should we consider doing this with an API that allows
for that in the future? Is this something that would be desirable for
your RDMA+DAX use-cases?

We could add a new F_SETLEASE variant that takes a struct with a byte
range (something like struct flock).
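
Purely as a strawman: neither the F_SETLEASE_RANGE command nor struct
lease_range below exist anywhere; the names and fields are made up just to
illustrate what a byte-range variant might look like.

#include <sys/types.h>
#include <fcntl.h>

/* Hypothetical only -- nothing below exists in any kernel or libc. */
struct lease_range {
        int     lr_type;                /* F_RDLCK, F_WRLCK, F_UNLCK */
        int     lr_whence;              /* SEEK_SET, SEEK_CUR, SEEK_END */
        off_t   lr_start;               /* start of the leased byte range */
        off_t   lr_len;                 /* length of the range; 0 could mean "to EOF" */
};

/*
 * Usage could then mirror the F_OFD_SETLK style:
 *
 *      struct lease_range lr = {
 *              .lr_type  = F_RDLCK,
 *              .lr_start = 0,
 *              .lr_len   = 1024 * 1024,        (lease only the first 1MiB)
 *      };
 *      fcntl(fd, F_SETLEASE_RANGE, &lr);
 */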

> **Unbreakable Layout Leases (F_UNBREAK)**
> 
> In order to support pinning of file pages by direct user space users an
> unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> lease.  When specified, F_UNBREAK indicates that any user attempting to break
> the lease will fail with ETXTBUSY rather than follow the normal breaking
> procedure.
> 
> Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> specified.  The difference between an unbreakable read layout lease and an
> unbreakable write layout lease are that an unbreakable read layout lease is
> _not_ exclusive.  This means that once a layout is established on a file

Lease semantic proposal

2019-09-23 Thread Ira Weiny


Since the last RFC patch set[1] much of the discussion of supporting RDMA with
FS DAX has been around the semantics of the lease mechanism.[2]  Within that
thread it was suggested I try and write some documentation and/or tests for the
new mechanism being proposed.  I have created a foundation to test lease
functionality within xfstests.[3] This should be close to being accepted.
Before writing additional lease tests, or changing lots of kernel code, this
email presents documentation for the new proposed "layout lease" semantic.

At Linux Plumbers[4] just over a week ago, I presented the current state of the
patch set and the outstanding issues.  Based on the discussion there, as well as
follow up emails, I propose the following addition to the fcntl() man page.

Thank you,
Ira

[1] https://lkml.org/lkml/2019/8/9/1043
[2] https://lkml.org/lkml/2019/8/9/1062
[3] https://www.spinics.net/lists/fstests/msg12620.html
[4] https://linuxplumbersconf.org/event/4/contributions/368/



Layout Leases
-------------

Layout (F_LAYOUT) leases are special leases which can be used to control and/or
be informed about the manipulation of the underlying layout of a file.

A layout is defined as the logical file block -> physical file block mapping
including the file size and sharing of physical blocks among files.  Note that
the unwritten state of a block is not considered part of file layout.

**Read layout lease F_RDLCK | F_LAYOUT**

Read layout leases can be used to be informed of layout changes by the
system or other users.  This lease is similar to the standard read (F_RDLCK)
lease in that any attempt to change the _layout_ of the file will be reported to
the process through the lease break process.  But this lease is different
because the file can be opened for write and data can be read and/or written to
the file as long as the underlying layout of the file does not change.
Therefore, the lease is not broken if the file is simply open for write, but
_may_ be broken if an operation such as truncate(), fallocate(), or write()
results in changing the underlying layout.

**Write layout lease (F_WRLCK | F_LAYOUT)**

Write layout leases can be used to break read layout leases to indicate that
the process intends to change the underlying layout of the file.

A process which has taken a write layout lease has exclusive ownership of the
file layout and can modify that layout as long as the lease is held.
Operations which change the layout are allowed by that process.  But operations
from other file descriptors which attempt to change the layout will break the
lease through the standard lease break process.  The F_LAYOUT flag is used to
indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
the F_LAYOUT case, opens for write do not break the lease.  But some operations,
if they change the underlying layout, may.

The distinction between read layout leases and write layout leases is that
write layout leases can change the layout without breaking the lease within the
owning process.  This is useful to guarantee a layout prior to specifying the
unbreakable flag described below.


**Unbreakable Layout Leases (F_UNBREAK)**

In order to support pinning of file pages by direct user space users, an
unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
lease.  When specified, F_UNBREAK indicates that any user attempting to break
the lease will fail with ETXTBUSY rather than follow the normal breaking
procedure.

Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
specified.  The difference between an unbreakable read layout lease and an
unbreakable write layout lease is that an unbreakable read layout lease is
_not_ exclusive.  This means that once a layout is established on a file,
multiple unbreakable read layout leases can be taken by multiple processes and
used to pin the underlying pages of that file.

Care must therefore be taken to ensure that the layout of the file is as the
user wants prior to using the unbreakable read layout lease.  A safe mechanism
to do this would be to take a write layout lease and use fallocate() to set the
layout of the file.  The layout lease can then be "downgraded" to unbreakable
read layout lease as long as no other user has broken the write layout lease.
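
A minimal sketch of that "safe mechanism", to make the intended calling sequence
concrete.  F_LAYOUT and F_UNBREAK are only the flags proposed above; they do not
exist in current kernels, so placeholder values are defined here purely to make
the sketch compile, and the path and size are illustrative:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef F_LAYOUT                        /* proposed flags: placeholder values only */
#define F_LAYOUT        0x100
#define F_UNBREAK       0x200
#endif

int main(void)
{
        int fd = open("/mnt/pmem/data", O_RDWR);

        if (fd < 0)
                return 1;

        /* 1. Take the exclusive write layout lease. */
        if (fcntl(fd, F_SETLEASE, F_WRLCK | F_LAYOUT) < 0)
                return 1;

        /* 2. Establish the layout we want while we own it exclusively. */
        if (fallocate(fd, 0, 0, 1UL << 30) < 0)
                return 1;

        /*
         * 3. "Downgrade" to an unbreakable read layout lease.  From here on,
         * operations from other users that would change the layout fail with
         * ETXTBSY instead of breaking the lease, so the pages backing this
         * layout can be handed out for a long-term pin (e.g. RDMA).
         */
        if (fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT | F_UNBREAK) < 0)
                return 1;

        /* ... register the memory, do I/O ... */

        /* 4. Release the lease once the last pin is gone. */
        fcntl(fd, F_SETLEASE, F_UNLCK);
        close(fd);
        return 0;
}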




INVESTMENT PROPOSAL.

2019-09-19 Thread Hadel Issa
It’s my pleasure to contact you through this media because I need an
investment assistance in your country. However I have a profitable
investment proposal with  good interest to share with you, amounted
the sum of (Twenty Eight Million Four Hundred Thousand United State
Dollar ($28.400.000.00). If you  are willing to handle this project
kindly reply urgent to enable me provide you more information about
the investment funds and the project.

I am waiting to hear from you through this my private
email(hadeliss...@gmail.com) so we can proceed further.

Best Regards.
Mr. Hadel Issa


Re: Investment Proposal...

2019-09-12 Thread Propel Consult
Greetings,

We are consultancy firm situated in Bahrain currently looking to finance new or 
existing projects in any industry.

Currently we are sourcing for opportunities for our review and consideration 
and would be delighted to discuss further.

Please feel free to contact us.

Regards,

Jamie Groom
Consultant
Propel Consult
6th Floor, Building 880
Road 3618, Seef 436 Manama
Kingdom of Bahrain


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-09-04 Thread Ira Weiny
On Tue, Sep 03, 2019 at 08:26:18AM +1000, Dave Chinner wrote:
> On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > > "Leases are associated with an open file description (see open(2)).  
> > > > This means
> > > > that duplicate file descriptors (created by, for example, fork(2) or 
> > > > dup(2))
> > > > refer to the same lease, and this lease may be modified or released 
> > > > using any
> > > > of these descriptors.  Furthermore,  the lease is released by either an
> > > > explicit F_UNLCK operation on any of these duplicate file descriptors, 
> > > > or when
> > > > all such file descriptors have been closed."
> > > 
> > > Right, the lease is attached to the struct file, so it follows
> > > wherever the struct file goes. That doesn't mean it's actually
> > > useful when the struct file is duplicated and/or passed to another
> > > process. :/
> > > 
> > > AFAICT, the problem is that when we take another reference to the
> > > struct file, or when the struct file is passed to a different
> > > process, nothing updates the lease or lease state attached to that
> > > struct file.
> > 
> > Ok, I probably should have made this more clear in the cover letter but 
> > _only_
> > the process which took the lease can actually pin memory.
> 
> Sure, no question about that.
> 
> > That pinned memory _can_ be passed to another process but those 
> > sub-processes can
> > _not_ use the original lease to pin _more_ of the file.  They would need to
> > take their own lease to do that.
> 
> Yes, they would need a new lease to extend it. But that ignores the
> fact they don't have a lease on the existing pins they are using and
> have no control over the lease those pins originated under.  e.g.
> the originating process dies (for whatever reason) and now we have
> pins without a valid lease holder.

Define "valid lease holder"?

> 
> If something else now takes an exclusive lease on the file (because
> the original exclusive lease no longer exists), it's not going to
> work correctly because of the zombied page pins caused by closing
> the exclusive lease they were gained under. IOWs, pages pinned under
> an exclusive lease are no longer "exclusive" the moment the original
> exclusive lease is dropped, and pins passed to another process are
> no longer covered by the original lease they were created under.

The page pins are not zombied; the lease is.  The lease still exists and can't
be dropped while the pins are in place.  I need to double check the
implementation but that was the intent.

Yep, I just did a quick check; I have a test for that.  If the page pins exist
then the lease can _not_ be released.  Closing the FD will "zombie" the lease
but it and the struct file will still exist until the pins go away.
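
Roughly, the shape of that check in C (F_LAYOUT/F_UNBREAK are only the proposed
flags, with placeholder values so the sketch compiles, and pin_file_pages() /
unpin_file_pages() are made-up stand-ins for the RDMA memory registration which
actually takes and later drops the long-term pin):

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_LAYOUT                        /* proposed flags: placeholder values only */
#define F_LAYOUT        0x100
#define F_UNBREAK       0x200
#endif

/* Stand-ins, not real APIs: think ibv_reg_mr()/ibv_dereg_mr() on an mmap of fd. */
void pin_file_pages(int fd);
void unpin_file_pages(void);

void zombie_lease_check(const char *path)
{
        int fd = open(path, O_RDWR);

        fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT | F_UNBREAK);
        pin_file_pages(fd);

        /* While pins exist, releasing the lease must fail. */
        assert(fcntl(fd, F_SETLEASE, F_UNLCK) == -1);

        /*
         * Closing the fd only "zombies" the lease: the struct file and the
         * lease stay around in the kernel until the last pin is dropped.
         */
        close(fd);

        unpin_file_pages();             /* last pin gone -> the lease finally goes away */
}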

Furthermore, a "zombie" lease is _not_ sufficient to pin more pages.  (I have a
test for this too.)  I apologize that I don't have something to submit to
xfstests.  I'm new to that code base.

I'm happy to share the code I have which I've been using to test...  But it is
pretty rough as it has undergone a number of changes.  I think it would be
better to convert my test series to xfstests.

However, I don't know if it is ok to require RDMA within those tests.  Right
now that is the only sub-system I have allowed to create these page pins.  So
I'm not sure what to do at this time.  I'm open to suggestions.

> 
> > Sorry for not being clear on that.
> 
> I know exactly what you are saying. What I'm failing to get across
> is that file layout leases don't actually allow the behaviour you
> want to have.

Not currently, no.  But we are discussing the semantics to allow them _to_ have
the behavior needed.

> 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> 
> Regardless, we still require an active lease for long term pins so
> that other lease holders fail operations appropriately. And that
> exclusive lease must follow the process that pins the pages so that
> the life cycle is the same...

I disagree.  See below.

> 
> > > Indeed, even closing the fd the lease was taken on without
> > > F_UNLCKing it first doesn't mean the lease has been torn down if
> > > there is some other reference to the struct file. That means the
> > > original lease owner will still get SIGIO delivered to that fd on a
> > > lease break regardless of whether it is open or not. And if we
> > > implement "layout lease not released within SIGIO response timeout"
> > > then that process will get killed, despite the fact it may not even
> > > have a reference to that file anymore.
> > 
> > I'm not seeing th

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-09-02 Thread Dave Chinner
On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > "Leases are associated with an open file description (see open(2)).  This 
> > > means
> > > that duplicate file descriptors (created by, for example, fork(2) or 
> > > dup(2))
> > > refer to the same lease, and this lease may be modified or released using 
> > > any
> > > of these descriptors.  Furthermore,  the lease is released by either an
> > > explicit F_UNLCK operation on any of these duplicate file descriptors, or 
> > > when
> > > all such file descriptors have been closed."
> > 
> > Right, the lease is attached to the struct file, so it follows
> > wherever the struct file goes. That doesn't mean it's actually
> > useful when the struct file is duplicated and/or passed to another
> > process. :/
> > 
> > AFAICT, the problem is that when we take another reference to the
> > struct file, or when the struct file is passed to a different
> > process, nothing updates the lease or lease state attached to that
> > struct file.
> 
> Ok, I probably should have made this more clear in the cover letter but _only_
> the process which took the lease can actually pin memory.

Sure, no question about that.

> That pinned memory _can_ be passed to another process but those sub-processes 
> can
> _not_ use the original lease to pin _more_ of the file.  They would need to
> take their own lease to do that.

Yes, they would need a new lease to extend it. But that ignores the
fact they don't have a lease on the existing pins they are using and
have no control over the lease those pins originated under.  e.g.
the originating process dies (for whatever reason) and now we have
pins without a valid lease holder.

If something else now takes an exclusive lease on the file (because
the original exclusive lease no longer exists), it's not going to
work correctly because of the zombied page pins caused by closing
the exclusive lease they were gained under. IOWs, pages pinned under
an exclusive lease are no longer "exclusive" the moment the original
exclusive lease is dropped, and pins passed to another process are
no longer covered by the original lease they were created under.

> Sorry for not being clear on that.

I know exactly what you are saying. What I'm failing to get across
is that file layout leases don't actually allow the behaviour you
want to have.

> > As such, leases that require callbacks to userspace are currently
> > only valid within the process context the lease was taken in.
> 
> But for long term pins we are not requiring callbacks.

Regardless, we still require an active lease for long term pins so
that other lease holders fail operations appropriately. And that
exclusive lease must follow the process that pins the pages so that
the life cycle is the same...

> > Indeed, even closing the fd the lease was taken on without
> > F_UNLCKing it first doesn't mean the lease has been torn down if
> > there is some other reference to the struct file. That means the
> > original lease owner will still get SIGIO delivered to that fd on a
> lease break regardless of whether it is open or not. And if we
> > implement "layout lease not released within SIGIO response timeout"
> > then that process will get killed, despite the fact it may not even
> > have a reference to that file anymore.
> 
> I'm not seeing that as a problem.  This is all a result of the application
> failing to do the right thing.

How is that not a problem?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

