Re: [PATCH] Force cppc_cpufreq to report values in KHz to fix user space reporting

2016-04-15 Thread kbuild test robot
Hi Al,

[auto build test WARNING on pm/linux-next]
[also build test WARNING on v4.6-rc3 next-20160415]
[if your patch is applied to the wrong git tree, please drop us a note to help
us improve the system]

url:
https://github.com/0day-ci/linux/commits/Al-Stone/Force-cppc_cpufreq-to-report-values-in-KHz-to-fix-user-space-reporting/20160416-061911
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: arm64-allmodconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64

All warnings (new ones prefixed by >>):

warning: (ACPI_CPPC_CPUFREQ) selects DMI which has unmet direct dependencies (EFI)
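
To make the warning concrete: Kconfig complains when a symbol select-s
another symbol without satisfying that symbol's own dependencies (on
arm64, DMI depends on EFI).  A generic, hypothetical sketch of the shape
of a fix (the symbol FOO is a placeholder, not the actual CPPC Kconfig):

config FOO
        tristate "example driver"
        depends on EFI          # satisfy DMI's direct dependency
        select DMI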

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




[PATCH] kernfs_path_from_node_locked: don't overwrite nlen

2016-04-15 Thread Serge E. Hallyn
We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to.  We use them
as such at the end.  But in the loop copying the actual entries, we
overwrite @nlen.  Use a temporary variable for that instead.

Without this, when the buffer is large enough, the returned length is
wrong: a positive value smaller than the actual length.  (When the buffer
is NULL or too small, the returned value is correct.  The buffer contents
are also correct.)

Interestingly, no callers of this function are affected by this yet.
However, the upcoming cgroup_show_path() will be.

Signed-off-by: Serge Hallyn 
---
 fs/kernfs/dir.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 03b688d..37f9678 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -153,9 +153,9 @@ static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
 	p = buf + len + nlen;
 	*p = '\0';
 	for (kn = kn_to; kn != common; kn = kn->parent) {
-		nlen = strlen(kn->name);
-		p -= nlen;
-		memcpy(p, kn->name, nlen);
+		size_t tmp = strlen(kn->name);
+		p -= tmp;
+		memcpy(p, kn->name, tmp);
 		*(--p) = '/';
 	}
 
-- 
2.7.4
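
A tiny, self-contained userspace demonstration of the bug pattern fixed
above (an illustration only, not kernel code; all names are invented):
reusing the precomputed suffix length as the loop's scratch variable
corrupts the value returned at the end.

#include <stdio.h>
#include <string.h>

/* Mimics the buggy flow: nlen is precomputed, then clobbered in the loop. */
static size_t total_len_buggy(const char **parts, int n, size_t len, size_t nlen)
{
	for (int i = 0; i < n; i++)
		nlen = strlen(parts[i]);	/* clobbers the precomputed value */
	return len + nlen;			/* wrong: last entry's length */
}

/* The fix: a temporary for the per-entry length leaves nlen intact. */
static size_t total_len_fixed(const char **parts, int n, size_t len, size_t nlen)
{
	for (int i = 0; i < n; i++) {
		size_t tmp = strlen(parts[i]);	/* per-entry scratch */
		(void)tmp;			/* would feed the memcpy() */
	}
	return len + nlen;			/* still the precomputed total */
}

int main(void)
{
	const char *parts[] = { "memory", "foo" };

	/* e.g. len = 6 for "/../..", nlen = 11 for "/memory/foo" */
	printf("buggy=%zu fixed=%zu\n",
	       total_len_buggy(parts, 2, 6, 11),
	       total_len_fixed(parts, 2, 6, 11));
	return 0;
}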



Re: [PATCH v11 0/3] printk: Make printk() completely async

2016-04-15 Thread Joe Perches
On Sat, 2016-04-16 at 11:55 +0900, Sergey Senozhatsky wrote:
> On (04/08/16 02:31), Sergey Senozhatsky wrote:
> > 
> > Hello,
> > 
> > This patch set makes printk() completely asynchronous: new messages
> > are appended to the kernel printk buffer, but instead of 'direct'
> > printing the actual print job is performed by a dedicated kthread.
> > This has the advantage that printing always happens from a schedulable
> > context and thus we don't lock up any particular CPU or even interrupts.
> Hello,
> 
> Sir, is there anything else you want me to improve in this patch set?

I'm not sir, but my preference would be to move as much of the
async/thread functionality as possible into a separate file.
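
For readers skimming the archive, here is a runnable userspace analogue
of the model described above (an illustration, not the patch set itself):
the logging call only appends to a ring buffer and signals a dedicated
thread, which performs the slow output from a schedulable context.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SLOTS 64

static char ring[SLOTS][128];
static int head, tail, done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;

/* The dedicated "printk kthread": drains the buffer, does the slow I/O. */
static void *printing_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!done || head != tail) {
		while (!done && head == tail)
			pthread_cond_wait(&wake, &lock);  /* sleep until woken */
		while (head != tail) {
			char msg[128];
			memcpy(msg, ring[tail], sizeof(msg));
			tail = (tail + 1) % SLOTS;
			pthread_mutex_unlock(&lock);
			fputs(msg, stderr);  /* slow part, off the caller's CPU */
			pthread_mutex_lock(&lock);
		}
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* The "printk" side: fast append plus a wakeup, never blocks on output. */
static void async_log(const char *msg)
{
	pthread_mutex_lock(&lock);
	snprintf(ring[head], sizeof(ring[head]), "%s", msg);
	head = (head + 1) % SLOTS;
	pthread_cond_signal(&wake);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, printing_thread, NULL);
	async_log("hello from the async path\n");
	pthread_mutex_lock(&lock);
	done = 1;
	pthread_cond_signal(&wake);
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	return 0;
}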



Re: [PATCH v3 08/15] dmaengine: dw: revisit data_width property

2016-04-15 Thread Vinod Koul
On Fri, Apr 15, 2016 at 03:45:34PM +0300, Andy Shevchenko wrote:
> On Wed, 2016-04-13 at 17:40 +0100, Mark Brown wrote:
> > On Wed, Apr 13, 2016 at 07:21:53PM +0300, Andy Shevchenko wrote:
> > > 
> > > On Wed, 2016-04-13 at 21:47 +0530, Vinod Koul wrote:
> > > > 
> > > > On Wed, Apr 13, 2016 at 07:05:48PM +0300, Andy Shevchenko wrote:
> > > 
> > > > 
> > > > > 
> > > > > The old is still supported and benefit is apparently in unifying
> > > > > standard properties across the drivers.
> > > 
> > > > 
> > > > Hrmmm how is that?
> > > 
> > > The common usage for a data-width property is "in bytes", and I like
> > > the idea.  I don't know why I chose to keep an encoded value there in
> > > the first place, and no one commented at the time; I suppose because of
> > > the screwed-up device tree process.  I think now it's better to follow
> > > standard / registered properties in new drivers.
> > You're unfortunately still breaking compatibility with existing DTs
> > using this property.  Now, it does appear that there is very little use
> > of this DMA controller on DT systems, and judging by the somewhat odd
> > compatible string and in-tree DTs most of those are legacy, so perhaps
> > this isn't the end of the world; but this isn't something that should
> > be dismissed as a simple cleanup.
> 
> Well, does everyone agree that keeping data-width a) with a dash in the
> name and b) in bytes is a good approach?
> 
> I will keep an array and support for old encoded property though.

That would be preferred.

Thanks
-- 
~Vinod
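
For reference, a hedged sketch of the fallback scheme agreed on above
(the property names and the log2 encoding are assumptions drawn from this
thread, not the final driver code): prefer the new byte-valued property
but keep honouring the old encoded one.

#include <linux/of.h>

/* Returns the bus width in bytes, falling back to the legacy property. */
static u32 dw_data_width_bytes(struct device_node *np)
{
	u32 val;

	/* New style: plain bytes, matching common DT usage. */
	if (!of_property_read_u32(np, "data-width", &val))
		return val;

	/* Old style: encoded value, kept for existing DTs
	 * (assumed log2-of-bytes encoding). */
	if (!of_property_read_u32(np, "data_width", &val))
		return 1U << val;

	return 4;	/* assumed default: 32-bit data bus */
}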


Re: [PATCH] dmaengine: pxa: handle bus errors

2016-04-15 Thread Vinod Koul
On Thu, Apr 14, 2016 at 08:23:26PM +0200, Robert Jarzmik wrote:
> Vinod Koul  writes:
> 
> > On Mon, Mar 28, 2016 at 11:32:24PM +0200, Robert Jarzmik wrote:
> >> In the current state, upon bus error the driver will spin endlessly,
> >> relaunching the last tx, which will fail again and again :
> >>  - a bus error happens
> >>  - pxad_chan_handler() is called
> >>  - as PXA_DCSR_STOPSTATE is true, the last non-terminated transaction is
> >>    relaunched, which is the one triggering the bus error, as it didn't
> >>    terminate
> >>  - moreover, the STOP interrupt fires anew, as the STOPIRQEN is still
> >>    active
> >> 
> >> Break this logic by stopping the automatic relaunch of a dma channel
> >> upon a bus error, even if there are still pending issued requests on it.
> >> 
> >> As dma_cookie_status() seems unable to return DMA_ERROR in its current
> >> form, ie. there seems no way to mark a DMA_ERROR on a per-async-tx
> >> basis, it is chosen in this patch to remember on the channel which
> >> transaction failed, and report it in pxad_tx_status().
> >> 
> >> It's a bit misleading because if T1, T2, T3 and T4 were queued, and T1
> >> was completed while T2 causes a bus error, the status of T3 and T4 will
> >> be reported as DMA_IN_PROGRESS, while the channel is actually stopped.
> >
> > No it is not misleading. The subsequent descriptor can be submitted and
> > continued. But yes you are right on the error reporting part, that is
> > something we need to add.
> Ok, fair enough.
> 
> > So what exactly are you trying to fix/achive here?
> Euh, you mean the first paragraph about the "endless spin" is not clear?
> This is what I'm trying to fix: the unstoppable endless relaunch of a
> descriptor doomed to cause the same bus error over and over again.

Okay, so IIUC the patch here essentially stops all transfers and aborts
the channel, right?

-- 
~Vinod
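
A sketch of the error-reporting side described in the changelog (the
failed_cookie bookkeeping is an assumption for illustration, not the
actual patch): the channel remembers which cookie hit the bus error and
tx_status reports DMA_ERROR for it.

static enum dma_status pxad_tx_status(struct dma_chan *dchan,
				      dma_cookie_t cookie,
				      struct dma_tx_state *txstate)
{
	struct pxad_chan *chan = to_pxad_chan(dchan);
	enum dma_status ret;

	ret = dma_cookie_status(dchan, cookie, txstate);
	/* assumed field, set by the IRQ handler on bus error */
	if (cookie == chan->failed_cookie)
		ret = DMA_ERROR;
	return ret;
}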


Re: [PATCH v3 0/2] Align mmap address for DAX pmd mappings

2016-04-15 Thread Andrew Morton
On Thu, 14 Apr 2016 10:48:29 -0600 Toshi Kani  wrote:

> When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page
> size.  This feature relies on both mmap virtual address and FS
> block (i.e. physical address) to be aligned by the pmd page size.
> Users can use mkfs options to specify FS to align block allocations.
> However, aligning mmap address requires code changes to existing
> applications for providing a pmd-aligned address to mmap().
> 
> For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
> It calls mmap() with a NULL address, which needs to be changed to
> provide a pmd-aligned address for testing with DAX pmd mappings.
> Changing all applications that call mmap() with NULL is undesirable.
> 
> This patch-set extends filesystems to align an mmap address for
> a DAX file so that unmodified applications can use DAX pmd mappings.

Matthew sounded unconvinced about the need for this patchset, but I
must say that

: The point is that we do not need to modify existing applications for using
: DAX PMD mappings.
: 
: For instance, fio with "ioengine=mmap" performs I/Os with mmap(). 
: https://github.com/caius/fio/blob/master/engines/mmap.c
: 
: With this change, unmodified fio can be used for testing with DAX PMD
: mappings.  There are many examples like this, and I do not think we want
: to modify all applications that we want to evaluate/test with.

sounds pretty convincing?


And if we go ahead with this, it looks like 4.7 material to me - it
affects ABI and we want to get that stabilized asap.  What do people
think?
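
A runnable illustration of the point being made (the file path is a
placeholder): unmodified applications pass NULL to mmap() and let the
kernel pick the address, so only the kernel can guarantee the 2MB
alignment a PMD mapping needs.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/mnt/dax/file", O_RDWR);	/* placeholder DAX file */
	void *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, 4 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	/* A PMD mapping requires the start to be 2MB-aligned. */
	printf("addr %p %s2MB-aligned\n", p,
	       ((unsigned long)p & ((2UL << 20) - 1)) ? "NOT " : "");
	return 0;
}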



Re: [PATCH 09/10] huge pagecache: mmap_sem is unlocked when truncation splits pmd

2016-04-15 Thread Andrew Morton
On Thu, 14 Apr 2016 13:39:22 -0400 Matthew Wilcox  wrote:

> On Tue, Apr 05, 2016 at 01:55:23PM -0700, Hugh Dickins wrote:
> > zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG()
> > will be invalid with huge pagecache, in whatever way it is implemented:
> > truncation of a hugely-mapped file to an unhugely-aligned size would
> > easily hit it.
> 
> We can reproduce this BUG() in the current Linus tree with DAX PMDs.
> Andrew, can you send this patch to Linus for inclusion in 4.7?

Wilco, thanks.




Re: [PATCH 2/2] rtlwifi: Fix reusable codes in core.c

2016-04-15 Thread Kalle Valo
Julian Calaby  writes:

> Hi Kalle,
>
> On Sat, Apr 16, 2016 at 4:25 AM, Kalle Valo  wrote:
>> Byeoungwook Kim  writes:
>>
>>> The rtl_*_delay() functions duplicated the same code around the addr
>>> variable, so I converted that code to use rtl_addr_delay().
>>>
>>> Signed-off-by: Byeoungwook Kim 
>>> Reviewed-by: Julian Calaby 
>>
>> Doesn't apply:
>>
>> Applying: rtlwifi: Fix reusable codes in core.c
>> fatal: sha1 information is lacking or useless (drivers/net/wireless/realtek/rtlwifi/core.c).
>> Repository lacks necessary blobs to fall back on 3-way merge.
>> Cannot fall back to three-way merge.
>> Patch failed at 0001 rtlwifi: Fix reusable codes in core.c
>>
>> Please rebase and resend.
>
> This one is already applied in some form. I thought I'd listed it in
> my big list of superseded patches, however I must have missed it.

Or I missed it :) But good to know, so no actions needed anymore.

-- 
Kalle Valo


Re: [PATCH v3] i2c: mediatek: i2c multi transfer optimization

2016-04-15 Thread liguo zhang
On Tue, 2016-04-12 at 23:13 +0200, Wolfram Sang wrote:
> Hi,
> 
> thanks for the submission!
> 
> On Tue, Mar 08, 2016 at 02:23:51AM +0800, Liguo Zhang wrote:
> > Signal complete() in the i2c irq handler after one transfer done,
> > and then wait_for_completion_timeout() will return, this procedure
> > may cost much time, so only signal complete() when the entire
> > transaction has been completed, it will reduce the entire transaction
> > time.
> > 
> > Signed-off-by: Liguo Zhang 
> 
> I wonder. You have less context switches, yes. On the other hand, you
> likely have bigger interrupt latency because you do more stuff in the
> interrupt handler. Is it really a gain in the end?
> 

When doing an i2c multi-message transfer (first an i2c write, then an i2c
read, without using the MTK i2c WRRD mode) repeatedly in our stress test,
we found that the complete() --> wait_for_completion_timeout() round trip
may cost a lot of time and affect the following i2c transfer.  In our
stress test it affected the i2c read: the value returned by the read was
not correct.
That is why, when a multi-message transfer is an i2c write followed by an
i2c read, we use the MTK i2c WRRD mode, as in the previous patch.
But if a multi-message transfer consists of at least three messages, we
can't use the WRRD mode, and this patch becomes important.  So far we have
not actually seen a multi-message transfer with three or more messages.

> Regards,
> 
> Wolfram
> 
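
A sketch of the optimization under discussion (the field and helper
names are assumptions, not the actual mtk driver code): the interrupt
handler completes the waiter only once per transaction instead of once
per message.

static irqreturn_t mtk_i2c_irq(int irqno, void *dev_id)
{
	struct mtk_i2c *i2c = dev_id;

	if (++i2c->msgs_done < i2c->msgs_total)
		mtk_i2c_start_next_msg(i2c);	/* hypothetical helper */
	else
		complete(&i2c->msg_complete);	/* one wakeup per transaction */

	return IRQ_HANDLED;
}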




Re: [PATCH 15/15] parallel lookups: actual switch to rwsem

2016-04-15 Thread Al Viro
On Fri, Apr 15, 2016 at 09:02:06PM -0600, Andreas Dilger wrote:

> Wouldn't it make sense to have helpers like "inode_read_lock(inode)" or
> similar, so that it is consistent with other parts of the code and easier
> to find?  It's a bit strange to have the filesystems use "inode_lock()"
> and some places here use "inode_lock_nested()", but other places use
> up_read() and down_read() directly on &inode->i_rwsem.  That would also
> simplify delegating the directory locking to the filesystems in the
> future.

FWIW, my preference would be inode_lock_shared(), but that's bikeshedding;
seeing that we have very few callers at the moment *and* there's the missing
down_write_killable() stuff...  This patch will obviously be reworked and
it's small enough to be understandable, open-coding or not.
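
For reference, helpers along the lines being discussed, essentially what
later landed in include/linux/fs.h (shown here as a sketch rather than a
quote of the final code):

static inline void inode_lock_shared(struct inode *inode)
{
	down_read(&inode->i_rwsem);
}

static inline void inode_unlock_shared(struct inode *inode)
{
	up_read(&inode->i_rwsem);
}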


[PATCH] mm: Do not discard partial pages with POSIX_FADV_DONTNEED

2016-04-15 Thread green
From: Oleg Drokin 

I noticed that the logic in the fadvise64_64 syscall is incorrect for
partial pages.  While the first page of the region is correctly skipped
if it is partial, the last page of the region is mistakenly discarded.
This leads to problems for applications that read data in
non-page-aligned chunks, discarding already-processed data between
the reads.

Signed-off-by: Oleg Drokin 
---
 mm/fadvise.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index b8024fa..6c707bf 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -126,6 +126,17 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		 */
 		start_index = (offset+(PAGE_SIZE-1)) >> PAGE_SHIFT;
 		end_index = (endbyte >> PAGE_SHIFT);
+		if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK) {
+			/* First page is tricky as 0 - 1 = -1, but pgoff_t
+			 * is unsigned, so the end_index >= start_index
+			 * check below would be true and we'll discard the
+			 * whole file cache which is not what was asked.
+			 */
+			if (end_index == 0)
+				break;
+
+			end_index--;
+		}
 
 		if (end_index >= start_index) {
 			unsigned long count = invalidate_mapping_pages(mapping,
-- 
2.1.0
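
A runnable demonstration of the scenario the changelog describes (the
input file name is a placeholder): an application reads in
non-page-aligned chunks and drops what it has processed.  Before the
fix, the partial last page of each range was discarded too, throwing
away data that still had to be read.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("data.bin", O_RDONLY);	/* placeholder input file */
	char buf[3000];				/* deliberately not page-sized */
	ssize_t n;
	off_t processed = 0;

	if (fd < 0)
		return 1;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		processed += n;
		/* The range usually ends mid-page; only whole pages that
		 * are fully processed should be dropped from the cache. */
		posix_fadvise(fd, 0, processed, POSIX_FADV_DONTNEED);
	}
	close(fd);
	return 0;
}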



Re: [PATCHSET][RFC][CFT] parallel lookups

2016-04-15 Thread Al Viro
On Fri, Apr 15, 2016 at 09:02:02PM -0600, Andreas Dilger wrote:

> Looks very interesting, and long awaited.  How do you see the parallel
> operations moving forward?  Staying as lookup only, or moving on to parallel
> modifications as well?

lookup + readdir.  Not even atomic_open at this point, and that's the
route I'd suggest for modifiers - i.e. a combined lookup + mkdir, etc.
operations.  But we'd really need to sort atomic_open pathway out first...

Let's discuss that at LSFMM, corridor track if needed.  With lookups I'd been
able to keep the surgery site pretty much entirely in VFS proper - fs/dcache.c
and (after earlier massage) a single function in fs/namei.c.  With readdir
it'll be somewhat more invasive - pre-seeding dcache is done in a bunch of
filesystems right now (mostly the network ones, where readdir request is
equivalent to bulk lookup, as well as synthetic-inodes ones a-la procfs)
and it'll need to be regularized; ncpfs is particularly nasty (what with its
case-changing crap), but at least it will be reasonably compact.  For
atomic_open, and worse yet - mkdir/mknod/symlink/link/unlink/rmdir/rename
it will really dip into filesystem code.  A lot.

FWIW, I agree that relying on i_mutex^Wi_rwsem for dcache protection is
something worth getting rid of in the longer term.  But that protection is
there right now, and getting rid of that will take quite a bit of careful
massage.  I don't have such a transition plotted yet; not enough information
at the moment, and I seriously suspect that atomic_open would be the best
place to start.  If nothing else, there are reasonably few instances of that
puppy.  Moreover, we badly need to regularize the paths around do_last() -
right now they are messy as hell.  Once that is sorted out, we'll be in better
position to deal with the rest of directory-modifying operations.


Re: [PATCHSET][RFC][CFT] parallel lookups

2016-04-15 Thread Andreas Dilger
On Apr 15, 2016, at 6:52 PM, Al Viro  wrote:
> 
>   The thing appears to be working.  It's in vfs.git#work.lookups; the
> last 5 commits are the infrastructure (fs/namei.c and fs/dcache.c; no changes
> in fs/*/*) + actual switch to rwsem.
> 
>   The missing bits: down_write_killable() (there had been a series
> posted introducing just that; for now I've replaced mutex_lock_killable()
> calls with plain inode_lock() - they are not critical for any testing and
> as soon as down_write_killable() gets there I'll replace those), lockdep
> bits might need corrections and right now it's only for lookups.
> 
>   I'm going to add readdir to the mix; the primitive added in this
> series (d_alloc_parallel()) will need to be used in dcache pre-seeding
> paths, ncpfs use of dentry_update_name_case() will need to be changed to
> something less hacky and syscalls calling iterate_dir() will need to
> switch to fdget_pos() (with FMODE_ATOMIC_POS set for directories as well
> as regulars).  The last bit is needed for exclusion on struct file
> level - there's a bunch of cases where we maintain data structures
> hanging off file->private and those really need to be serialized.  Besides,
> serializing ->f_pos updates is needed for sane semantics; right now we
> tend to use ->i_mutex for that, but it would be easier to go for the same
> mechanism as for regular files.  With any luck we'll have working parallel
> readdir in addition to parallel lookups in this cycle as well.
> 
>   The patchset is on top of switching getxattr to passing dentry and
> inode separately; that part will get changes (in particular, the stuff
> agruen has posted lately), but the lookups queue proper cares only about
> being able to move security_d_instantiate() to the point before dentry
> is attached to inode.
> 
> 1/15: security_d_instantiate(): move to the point prior to attaching dentry
> to inode.  Depends on getxattr changes, allows to do the "attach to inode"
> and "add to dentry hash" parts without dropping ->d_lock in between.
> 
> 2/15 -- 8/15: preparations - stuff similar to what went in during the last
> cycle; several places switched to lookup_one_len_unlocked(), a bunch of
> direct manipulations of ->i_mutex replaced with inode_lock, etc. helpers.
> 
> kernfs: use lookup_one_len_unlocked().
> configfs_detach_prep(): make sure that wait_mutex won't go away
> ocfs2: don't open-code inode_lock/inode_unlock
> orangefs: don't open-code inode_lock/inode_unlock
> reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
> reconnect_one(): use lookup_one_len_unlocked()
> ovl_lookup_real(): use lookup_one_len_unlocked()
> 
> 9/15: lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
> open-code real_lookup() call in lookup_slow(), move IS_DEADDIR check upwards.
> 
> 10/15: __d_add(): don't drop/regain ->d_lock
> that's what 1/15 had been for; might make sense to reorder closer to it.
> 
> 11/15 -- 14/15: actual machinery for parallel lookups.  This stuff could've
> been a single commit, along with the actual switch to rwsem and shared lock
> in lookup_slow(), but it's easier to review if carved up like that.  From the
> testing POV it's one chunk - it is bisect-safe, but the added code really
> comes into play only after we go for shared lock, which happens in 15/15.
> That's the core of the series.
> 
> beginning of transition to parallel lookups - marking in-lookup dentries
> parallel lookups machinery, part 2
> parallel lookups machinery, part 3
> parallel lookups machinery, part 4 (and last)
> 
> 15/15: parallel lookups: actual switch to rwsem
> 
> Note that filesystems would be free to switch some of their own uses of
> inode_lock() to grabbing it shared - it's really up to them.  This series
> works only with directories locking, but this field has become an rwsem
> for all inodes.  XFS folks in particular might be interested in using it...

Looks very interesting, and long awaited.  How do you see the parallel
operations moving forward?  Staying as lookup only, or moving on to parallel
modifications as well?

We've been carrying an out-of-tree patch for ext4 for several years to allow
parallel create/unlink for directory entries*, as I discussed a few times with
you in the past.  It is still a bit heavyweight for doing read-only lookups,
but after this patch series it might finally be interesting to merge into ext4,
with a hope that the VFS might allow parallel directory changes in the future?
We can already do this on a Lustre server, and it would be nice to be able to
do so on the client, since the files may even be on different servers (hashed by
name at the client to decide which server to contact) and network latency during
parallel file creates (one thread per CPU core, which is getting into the
low hundreds these days) is a much bigger deal than for local filesystems.

The actual inode_*lock() handling would need to be delegated to the filesystems,
with the VFS just using i_rwsem if the 

Re: [PATCH 15/15] parallel lookups: actual switch to rwsem

2016-04-15 Thread Andreas Dilger
On Apr 15, 2016, at 6:55 PM, Al Viro  wrote:
> 
> From: Al Viro 
> 
> ta-da!
> 
> The main issue is the lack of down_write_killable(), so the places
> like readdir.c switched to plain inode_lock(); once killable
> variants of rwsem primitives appear, that'll be dealt with.
> 
> lockdep side also might need more work
> 
> Signed-off-by: Al Viro 
> ---
> fs/btrfs/ioctl.c   | 16 ++--
> fs/configfs/inode.c|  2 +-
> fs/dcache.c|  9 +
> fs/gfs2/ops_fstype.c   |  2 +-
> fs/inode.c | 12 ++--
> fs/namei.c |  4 ++--
> fs/ocfs2/inode.c   |  2 +-
> fs/overlayfs/readdir.c |  4 +++-
> fs/readdir.c   |  7 ---
> include/linux/fs.h | 12 ++--
> 10 files changed, 39 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 053e677..db1e830 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -837,9 +837,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
>   struct dentry *dentry;
>   int error;
> 
> - error = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> - if (error == -EINTR)
> - return error;
> + inode_lock_nested(dir, I_MUTEX_PARENT);
> + // XXX: should've been
> + // mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> + // if (error == -EINTR)
> + //  return error;
> 
>   dentry = lookup_one_len(name, parent->dentry, namelen);
>   error = PTR_ERR(dentry);
> @@ -2366,9 +2368,11 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
> file *file,
>   goto out;
> 
> 
> - err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> - if (err == -EINTR)
> - goto out_drop_write;
> + inode_lock_nested(dir, I_MUTEX_PARENT);
> + // XXX: should've been
> + // err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> + // if (err == -EINTR)
> + //  goto out_drop_write;
>   dentry = lookup_one_len(vol_args->name, parent, namelen);
>   if (IS_ERR(dentry)) {
>   err = PTR_ERR(dentry);
> diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
> index 03d124a..0387968 100644
> --- a/fs/configfs/inode.c
> +++ b/fs/configfs/inode.c
> @@ -156,7 +156,7 @@ static void configfs_set_inode_lock_class(struct 
> configfs_dirent *sd,
> 
>   if (depth > 0) {
>   if (depth <= ARRAY_SIZE(default_group_class)) {
> - lockdep_set_class(&inode->i_mutex,
> + lockdep_set_class(&inode->i_rwsem,
> &default_group_class[depth - 1]);
>   } else {
>   /*
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5965588..d110040 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2911,7 +2911,8 @@ struct dentry *d_ancestor(struct dentry *p1, struct 
> dentry *p2)
> static int __d_unalias(struct inode *inode,
>   struct dentry *dentry, struct dentry *alias)
> {
> - struct mutex *m1 = NULL, *m2 = NULL;
> + struct mutex *m1 = NULL;
> + struct rw_semaphore *m2 = NULL;
>   int ret = -ESTALE;
> 
>   /* If alias and dentry share a parent, then no extra locks required */
> @@ -2922,15 +2923,15 @@ static int __d_unalias(struct inode *inode,
>   if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
>   goto out_err;
>   m1 = &dentry->d_sb->s_vfs_rename_mutex;
> - if (!inode_trylock(alias->d_parent->d_inode))
> + if (!down_read_trylock(&alias->d_parent->d_inode->i_rwsem))
>   goto out_err;
> - m2 = &alias->d_parent->d_inode->i_mutex;
> + m2 = &alias->d_parent->d_inode->i_rwsem;
> out_unalias:
>   __d_move(alias, dentry, false);
>   ret = 0;
> out_err:
>   if (m2)
> - mutex_unlock(m2);
> + up_read(m2);
>   if (m1)
>   mutex_unlock(m1);
>   return ret;
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index c09c63d..4546360 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -824,7 +824,7 @@ static int init_inodes(struct gfs2_sbd *sdp, int undo)
>* i_mutex on quota files is special. Since this inode is hidden system
>* file, we are safe to define locking ourselves.
>*/
> - lockdep_set_class(&sdp->sd_quota_inode->i_mutex,
> + lockdep_set_class(&sdp->sd_quota_inode->i_rwsem,
> &gfs2_quota_imutex_key);
> 
>   error = gfs2_rindex_update(sdp);
> diff --git a/fs/inode.c b/fs/inode.c
> index 4b884f7..4ccbc21 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -166,8 +166,8 @@ int inode_init_always(struct super_block *sb, struct 
> inode *inode)
>   spin_lock_init(&inode->i_lock);
>   lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);
> 
> - mutex_init(&inode->i_mutex);
> - lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
> + init_rwsem(&inode->i_rwsem);
> + lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);
> 
>   atomic_set(&inode->i_dio_count, 0);
> 
> @@ -925,13 +925,13 @@ void 

Re: [PATCH v11 0/3] printk: Make printk() completely async

2016-04-15 Thread Sergey Senozhatsky
On (04/08/16 02:31), Sergey Senozhatsky wrote:
> Hello,
> 
> This patch set makes printk() completely asynchronous: new messages
> are appended to the kernel printk buffer, but instead of 'direct'
> printing the actual print job is performed by a dedicated kthread.
> This has the advantage that printing always happens from a schedulable
> context and thus we don't lock up any particular CPU or even interrupts.

Hello,

Sir, is there anything else you want me to improve in this patch set?

-ss

> against next-20160407
> 
> v11:
> -- switch default to sync printk
> -- make `synchronous' param RW (Andrew, Jan)
> -- set RT priority to printk kthread (Andrew)
> -- correct comments (Andrew)
> 
> v10:
> -- simplify printk_kthread_need_flush_console (Jan, Petr)
> 
> v9:
> -- move need_flush_console assignment down in vprintk_emit (Jan)
> -- simplify need_flush_console assignment rules (Petr)
> -- clear need_flush_console in printing function (Petr)
> -- rename need_flush_console (Petr)
> 
> v8:
> -- rename kthread printing function (Petr)
> -- clear need_flush_console in console_unlock() under logbuf (Petr)
> 
> v7:
> -- do not set global printk_sync in panic in vprintk_emit() (Petr)
> -- simplify vprintk_emit(). drop some of local variables (Petr)
> -- move handling of LOGLEVEL_SCHED messages back to printk_deferred()
>so we wake_up_process()/console_trylock() in vprintk_emit() only
>for !in_sched messages
> 
> v6:
> -- move wake_up_process out of logbuf lock (Jan, Byungchul)
> -- do not disable async printk in recursion handling code.
> -- rebase against next-20160321 (w/NMI patches)
> 
> v5:
> -- make printk.synchronous RO (Petr)
> -- make printing_func() correct and do not use wait_queue (Petr)
> -- do not panic() when can't allocate printing thread (Petr)
> -- do not wake_up_process() only in IRQ, prefer vprintk_emit() (Jan)
> -- move wake_up_klogd_work_func() to a separate patch (Petr)
> -- move wake_up_process() under logbuf lock so printk recursion logic can
>help us out
> -- switch to sync_print mode if printk recursion occurred
> -- drop "printk: Skip messages on oops" patch
> 
> v4:
> -- do not directly wake_up() the printing kthread from vprintk_emit(), need
>to go via IRQ->wake_up() to avoid sched deadlocks (Jan)
> 
> v3:
> -- use a dedicated kthread for printing instead of using wq (Jan, Tetsuo, 
> Tejun)
> 
> v2:
> - use dedicated printk workqueue with WQ_MEM_RECLAIM bit
> - fallback to system-wide workqueue only if allocation of printk_wq has
>   failed
> - do not use system_wq as a fallback wq. both console_lock() and
>   console_unlock() can spend a significant amount of time; so we need to
>   use system_long_wq.
> - rework sync/!sync detection logic
>   a) we can have deferred (in_sched) messages before we allocate printk_wq,
>  so the only way to handle those messages is via IRQ context
>   b) even in printk.synchronous mode, deferred messages must not be printed
>  directly, and should go via IRQ context
>   c) even if we allocated printk_wq and have !sync_printk mode, we must route
>  deferred messages via IRQ context
> - so this adds additional bool flags to vprint_emit() and introduces a new
>   pending bit to `printk_pending'
> - fix build on !PRINTK configs
> 
> 
> Jan Kara (2):
>   printk: Make printk() completely async
>   printk: Make wake_up_klogd_work_func() async
> 
> Sergey Senozhatsky (1):
>   printk: make printk.synchronous param rw
> 
>  Documentation/kernel-parameters.txt |  12 +++
>  kernel/printk/printk.c  | 155 +---
>  2 files changed, 157 insertions(+), 10 deletions(-)
> 
> -- 
> 2.8.0
> 


Re: [PATCH 2/2] rtlwifi: Fix reusable codes in core.c

2016-04-15 Thread Julian Calaby
Hi Kalle,

On Sat, Apr 16, 2016 at 4:25 AM, Kalle Valo  wrote:
> Byeoungwook Kim  writes:
>
>> rtl_*_delay() functions reused the same code for the addr variable,
>> so I converted the code handling the addr variable to use rtl_addr_delay().
>>
>> Signed-off-by: Byeoungwook Kim 
>> Reviewed-by: Julian Calaby 
>
> Doesn't apply:
>
> Applying: rtlwifi: Fix reusable codes in core.c
> fatal: sha1 information is lacking or useless 
> (drivers/net/wireless/realtek/rtlwifi/core.c).
> Repository lacks necessary blobs to fall back on 3-way merge.
> Cannot fall back to three-way merge.
> Patch failed at 0001 rtlwifi: Fix reusable codes in core.c
>
> Please rebase and resend.

This one is already applied in some form. I thought I'd listed it in
my big list of superseded patches; however, I must have missed it.

Thanks,

-- 
Julian Calaby

Email: julian.cal...@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/


Re: [PATCHv2]brd: set max_discard_sectors properly

2016-04-15 Thread Christoph Hellwig
> - blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
> + blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX >> 9);

Shouldn't we fix the issue by capping to UINT_MAX >> 9 inside
blk_queue_max_discard_sectors?  That way we'd guard against issues
like this in any other driver as well.
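
For illustration, a minimal sketch of that capping, assuming the helper
simply stores into the queue limits (this is not the actual upstream change):

	/* Sketch: clamp inside the helper so every caller is safe.
	 * max_discard_sectors is in 512-byte sectors, so capping at
	 * UINT_MAX >> 9 keeps the byte count (sectors << 9) within a u32.
	 */
	void blk_queue_max_discard_sectors(struct request_queue *q,
					   unsigned int max_discard_sectors)
	{
		q->limits.max_discard_sectors =
			min_t(unsigned int, max_discard_sectors, UINT_MAX >> 9);
	}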


[PATCH 3/8] genirq: add a helper to spread an affinity mask for MSI/MSI-X vectors

2016-04-15 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/linux/interrupt.h | 10 +
 kernel/irq/Makefile   |  1 +
 kernel/irq/affinity.c | 54 +++
 3 files changed, 65 insertions(+)
 create mode 100644 kernel/irq/affinity.c

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9fcabeb..67bc1e1f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -278,6 +278,9 @@ extern int irq_set_affinity_hint(unsigned int irq, const 
struct cpumask *m);
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify 
*notify);
 
+int irq_create_affinity_mask(struct cpumask **affinity_mask,
+   unsigned int nr_vecs);
+
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -308,6 +311,13 @@ irq_set_affinity_notifier(unsigned int irq, struct 
irq_affinity_notify *notify)
 {
return 0;
 }
+
+static inline int irq_create_affinity_mask(struct cpumask **affinity_mask,
+   unsigned int nr_vecs)
+{
+   *affinity_mask = NULL;
+   return 0;
+}
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 2ee42e9..1d3ee31 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
 obj-$(CONFIG_PM_SLEEP) += pm.o
 obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
 obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
+obj-$(CONFIG_SMP) += affinity.o
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
new file mode 100644
index 000..ecb8915
--- /dev/null
+++ b/kernel/irq/affinity.c
@@ -0,0 +1,54 @@
+
+#include 
+#include 
+#include 
+
+static int get_first_sibling(unsigned int cpu)
+{
+   unsigned int ret;
+
+   ret = cpumask_first(topology_sibling_cpumask(cpu));
+   if (ret < nr_cpu_ids)
+   return ret;
+   return cpu;
+}
+
+/*
+ * Take a map of online CPUs and the number of available interrupt vectors
+ * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
+ * so that they are distributed as well as possible around the CPUs.  If
+ * more vectors than CPUs are available we'll map one to each CPU,
+ * otherwise we map one to the first sibling of each socket.
+ *
+ * If there are more vectors than CPUs we will still only have one bit
+ * set per CPU, but interrupt code will keep on assigning the vectors from
+ * the start of the bitmap until we run out of vectors.
+ */
+int irq_create_affinity_mask(struct cpumask **affinity_mask,
+   unsigned int nr_vecs)
+{
+   if (nr_vecs == 1) {
+   *affinity_mask = NULL;
+   return 0;
+   }
+
+   *affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
+   if (!*affinity_mask)
+   return -ENOMEM;
+
+   if (nr_vecs >= num_online_cpus()) {
+   cpumask_copy(*affinity_mask, cpu_online_mask);
+   } else {
+   unsigned int cpu;
+
+   for_each_online_cpu(cpu) {
+   if (cpu == get_first_sibling(cpu))
+   cpumask_set_cpu(cpu, *affinity_mask);
+
+   if (--nr_vecs == 0)
+   break;
+   }
+   }
+
+   return 0;
+}
-- 
2.1.4
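
A hypothetical caller, for illustration only (nr_vecs and dev are
placeholders; the series wires the result into dev->irq_affinity via
patches 1/8 and 7/8):

	/* Hypothetical driver snippet: build a spread mask for nr_vecs
	 * vectors and hand it to the IRQ core through the device's
	 * irq_affinity pointer (patch 1/8); NULL simply means "no hint".
	 */
	struct cpumask *mask;
	int ret;

	ret = irq_create_affinity_mask(&mask, nr_vecs);
	if (ret)
		return ret;
	dev->irq_affinity = mask;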



[PATCH 5/8] blk-mq: allow the driver to pass in an affinity mask

2016-04-15 Thread Christoph Hellwig
Allow drivers to pass in the affinity mask from the generic interrupt
layer, and spread queues based on that.  If the driver doesn't pass in
a mask we will create it using the genirq helper.  As this helper was
modelled after the blk-mq algorithm there should be no change in behavior.

XXX: Just as with the core IRQ spreading code this doesn't handle CPU
hotplug yet.

Signed-off-by: Christoph Hellwig 
---
 block/Makefile |   2 +-
 block/blk-mq-cpumap.c  | 120 -
 block/blk-mq.c |  60 -
 block/blk-mq.h |   8 
 include/linux/blk-mq.h |   1 +
 5 files changed, 60 insertions(+), 131 deletions(-)
 delete mode 100644 block/blk-mq-cpumap.c

diff --git a/block/Makefile b/block/Makefile
index 9eda232..aeb318d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o 
blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o \
-   blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+   blk-mq-sysfs.o blk-mq-cpu.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
 
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
deleted file mode 100644
index d0634bc..000
--- a/block/blk-mq-cpumap.c
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * CPU <-> hardware queue mapping helpers
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-#include "blk.h"
-#include "blk-mq.h"
-
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
- const int cpu)
-{
-   return cpu * nr_queues / nr_cpus;
-}
-
-static int get_first_sibling(unsigned int cpu)
-{
-   unsigned int ret;
-
-   ret = cpumask_first(topology_sibling_cpumask(cpu));
-   if (ret < nr_cpu_ids)
-   return ret;
-
-   return cpu;
-}
-
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-   const struct cpumask *online_mask)
-{
-   unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-   cpumask_var_t cpus;
-
-   if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-   return 1;
-
-   cpumask_clear(cpus);
-   nr_cpus = nr_uniq_cpus = 0;
-   for_each_cpu(i, online_mask) {
-   nr_cpus++;
-   first_sibling = get_first_sibling(i);
-   if (!cpumask_test_cpu(first_sibling, cpus))
-   nr_uniq_cpus++;
-   cpumask_set_cpu(i, cpus);
-   }
-
-   queue = 0;
-   for_each_possible_cpu(i) {
-   if (!cpumask_test_cpu(i, online_mask)) {
-   map[i] = 0;
-   continue;
-   }
-
-   /*
-* Easy case - we have equal or more hardware queues. Or
-* there are no thread siblings to take into account. Do
-* 1:1 if enough, or sequential mapping if less.
-*/
-   if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-   map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-   queue++;
-   continue;
-   }
-
-   /*
-* Less then nr_cpus queues, and we have some number of
-* threads per cores. Map sibling threads to the same
-* queue.
-*/
-   first_sibling = get_first_sibling(i);
-   if (first_sibling == i) {
-   map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
-   queue);
-   queue++;
-   } else
-   map[i] = map[first_sibling];
-   }
-
-   free_cpumask_var(cpus);
-   return 0;
-}
-
-unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
-{
-   unsigned int *map;
-
-   /* If cpus are offline, map them to first hctx */
-   map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL,
-   set->numa_node);
-   if (!map)
-   return NULL;
-
-   if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
-   return map;
-
-   kfree(map);
-   return NULL;
-}
-
-/*
- * We have no quick way of doing reverse lookups. This is only used at
- * queue init time, so runtime isn't important.
- */
-int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
-{
-   int i;
-
-   for_each_possible_cpu(i) {
-   if (index == mq_map[i])
-   return local_memory_node(cpu_to_node(i));
- 
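
The driver-facing side of this is small. A hypothetical sketch of a caller
(the blk-mq.h field name below is assumed from the one-line diffstat and is
not visible in the truncated diff):

	/* Hypothetical: point the tag set at the mask the IRQ core used,
	 * so blk-mq derives its queue mapping from the same spreading.
	 */
	set->affinity_mask = pdev->dev.irq_affinity;	/* assumed field */
	ret = blk_mq_alloc_tag_set(set);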

[PATCH 7/8] pci: spread interrupt vectors in pci_alloc_irq_vectors

2016-04-15 Thread Christoph Hellwig
Set the affinity_mask before allocating vectors.  And for now we also
need a little hack after allocation; hopefully someone smarter than me
can move this into the core code.

Signed-off-by: Christoph Hellwig 
---
 drivers/pci/irq.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index b683465..d26df69 100644
--- a/drivers/pci/irq.c
+++ b/drivers/pci/irq.c
@@ -55,9 +55,14 @@ int pci_alloc_irq_vectors(struct pci_dev *pdev, int nr_vecs)
 
nr_vecs = min(nr_vecs, pci_nr_irq_vectors(pdev));
 
+   ret = irq_create_affinity_mask(&pdev->dev.irq_affinity, nr_vecs);
+   if (ret)
+   return ret;
+
+   ret = -ENOMEM;
irqs = kcalloc(nr_vecs, sizeof(u32), GFP_KERNEL);
if (!irqs)
-   return -ENOMEM;
+   goto out_free_affinity;
 
vecs = pci_enable_msix_range_wrapper(pdev, irqs, nr_vecs);
if (vecs <= 0) {
@@ -75,11 +80,20 @@ int pci_alloc_irq_vectors(struct pci_dev *pdev, int nr_vecs)
irqs[i] = pdev->irq + i;
}
 
+   /* XXX: this should really move into the core IRQ allocation code.. */
+   if (vecs > 1) {
+   for (i = 0; i < vecs; i++)
+   irq_program_affinity(irqs[i]);
+   }
+
pdev->irqs = irqs;
return vecs;
 
 out_free_irqs:
kfree(irqs);
+out_free_affinity:
+   kfree(pdev->dev.irq_affinity);
+   pdev->dev.irq_affinity = NULL;
return ret;
 }
 EXPORT_SYMBOL(pci_alloc_irq_vectors);
-- 
2.1.4



[PATCH 8/8] nvme: switch to use pci_alloc_irq_vectors

2016-04-15 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/pci.c | 88 +
 1 file changed, 23 insertions(+), 65 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ff3c8d7..82730bf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -89,7 +89,6 @@ struct nvme_dev {
unsigned max_qid;
int q_depth;
u32 db_stride;
-   struct msix_entry *entry;
void __iomem *bar;
struct work_struct reset_work;
struct work_struct scan_work;
@@ -209,6 +208,11 @@ static unsigned int nvme_cmd_size(struct nvme_dev *dev)
nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
 }
 
+static int nvmeq_irq(struct nvme_queue *nvmeq)
+{
+   return to_pci_dev(nvmeq->dev->dev)->irqs[nvmeq->cq_vector];
+}
+
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
unsigned int hctx_idx)
 {
@@ -1016,7 +1020,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
spin_unlock_irq(&nvmeq->q_lock);
return 1;
}
-   vector = nvmeq->dev->entry[nvmeq->cq_vector].vector;
+   vector = nvmeq_irq(nvmeq);
nvmeq->dev->online_queues--;
nvmeq->cq_vector = -1;
spin_unlock_irq(&nvmeq->q_lock);
@@ -1024,7 +1028,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
blk_mq_stop_hw_queues(nvmeq->dev->ctrl.admin_q);
 
-   irq_set_affinity_hint(vector, NULL);
free_irq(vector, nvmeq);
 
return 0;
@@ -1135,11 +1138,11 @@ static int queue_request_irq(struct nvme_dev *dev, 
struct nvme_queue *nvmeq,
const char *name)
 {
if (use_threaded_interrupts)
-   return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
-   nvme_irq_check, nvme_irq, IRQF_SHARED,
-   name, nvmeq);
-   return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
-   IRQF_SHARED, name, nvmeq);
+   return request_threaded_irq(nvmeq_irq(nvmeq), nvme_irq_check,
+   nvme_irq, IRQF_SHARED, name, nvmeq);
+   else
+   return request_irq(nvmeq_irq(nvmeq), nvme_irq, IRQF_SHARED,
+   name, nvmeq);
 }
 
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
@@ -1438,7 +1441,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
struct nvme_queue *adminq = dev->queues[0];
struct pci_dev *pdev = to_pci_dev(dev->dev);
-   int result, i, vecs, nr_io_queues, size;
+   int result, nr_io_queues, size;
 
nr_io_queues = num_possible_cpus();
result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1481,29 +1484,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
}
 
/* Deregister the admin queue's interrupt */
-   free_irq(dev->entry[0].vector, adminq);
+   free_irq(pdev->irqs[0], adminq);
 
/*
 * If we enable msix early due to not intx, disable it again before
 * setting up the full range we need.
 */
-   if (pdev->msi_enabled)
-   pci_disable_msi(pdev);
-   else if (pdev->msix_enabled)
-   pci_disable_msix(pdev);
-
-   for (i = 0; i < nr_io_queues; i++)
-   dev->entry[i].entry = i;
-   vecs = pci_enable_msix_range(pdev, dev->entry, 1, nr_io_queues);
-   if (vecs < 0) {
-   vecs = pci_enable_msi_range(pdev, 1, min(nr_io_queues, 32));
-   if (vecs < 0) {
-   vecs = 1;
-   } else {
-   for (i = 0; i < vecs; i++)
-   dev->entry[i].vector = i + pdev->irq;
-   }
-   }
+   pci_free_irq_vectors(pdev);
+   nr_io_queues = pci_alloc_irq_vectors(pdev, nr_io_queues);
+   if (nr_io_queues <= 0)
+   return -EIO;
+   dev->max_qid = nr_io_queues;
 
/*
 * Should investigate if there's a performance win from allocating
@@ -1511,8 +1502,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 * path to scale better, even if the receive path is limited by the
 * number of interrupts.
 */
-   nr_io_queues = vecs;
-   dev->max_qid = nr_io_queues;
 
result = queue_request_irq(dev, adminq, adminq->irqname);
if (result) {
@@ -1526,22 +1515,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
return result;
 }
 
-static void nvme_set_irq_hints(struct nvme_dev *dev)
-{
-   struct nvme_queue *nvmeq;
-   int i;
-
-   for (i = 0; i < dev->online_queues; i++) {
-   nvmeq = dev->queues[i];
-
-   if (!nvmeq->tags || !(*nvmeq->tags))
-   continue;
-

[PATCH 2/8] genirq: Make use of dev->irq_affinity

2016-04-15 Thread Christoph Hellwig
From: Thomas Gleixner 

This allows optimized interrupt allocation and affinity settings for
multiqueue devices' MSI-X interrupts.

If the device holds a pointer to a cpumask, then this mask is used to:
   - allocate the interrupt descriptor on the proper nodes
   - set the default interrupt affinity for the interrupt

The interrupts are excluded from balancing so the user space balancer cannot
screw with the settings which have been requested by the multiqueue driver.

It's not yet clear how we are going to deal with cpu offlining/onlining. Right
now the affinity will simply break during offline. One option to handle this is:

  If the cpu goes offline, then move the irq to a different cpu on the same
  node. If it's the last cpu on the node or all remaining cpus on that node
  already have a queue, we "park" it and reuse it when cpus come online again.

XXX: currently only works properly for MSI-X, not for MSI because MSI
allocates a msi_desc for more than a single vector.

Requested-by: Christoph Hellwig 
Signed-off-by: Thomas Gleixner 
Fucked-up-by: Christoph Hellwig 
---
 arch/sparc/kernel/irq_64.c |  2 +-
 arch/x86/kernel/apic/io_apic.c |  5 +++--
 include/linux/irq.h|  4 ++--
 include/linux/irqdomain.h  |  8 ---
 kernel/irq/ipi.c   |  4 ++--
 kernel/irq/irqdesc.c   | 48 +++---
 kernel/irq/irqdomain.c | 22 ---
 kernel/irq/msi.c   | 11 --
 8 files changed, 67 insertions(+), 37 deletions(-)

diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index e22416c..437d0f7 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -242,7 +242,7 @@ unsigned int irq_alloc(unsigned int dev_handle, unsigned 
int dev_ino)
 {
int irq;
 
-   irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL);
+   irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL, -1);
if (irq <= 0)
goto out;
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index fdb0fbf..54267ea 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -981,7 +981,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, 
int ioapic, u32 gsi,
 
return __irq_domain_alloc_irqs(domain, irq, 1,
   ioapic_alloc_attr_node(info),
-  info, legacy);
+  info, legacy, -1);
 }
 
 /*
@@ -1014,7 +1014,8 @@ static int alloc_isa_irq_from_domain(struct irq_domain 
*domain,
  info->ioapic_pin))
return -ENOMEM;
} else {
-   irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true);
+   irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true,
+ -1);
if (irq >= 0) {
irq_data = irq_domain_get_irq_data(domain, irq);
data = irq_data->chip_data;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index c4de623..27779d0 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -697,11 +697,11 @@ static inline struct cpumask 
*irq_data_get_affinity_mask(struct irq_data *d)
 unsigned int arch_dynirq_lower_bound(unsigned int from);
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-   struct module *owner);
+ struct module *owner, int targetcpu);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)  \
-   __irq_alloc_descs(irq, from, cnt, node, THIS_MODULE)
+   __irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, -1)
 
 #define irq_alloc_desc(node)   \
irq_alloc_descs(-1, 0, 1, node)
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 2aed043..fa24663 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -215,7 +215,8 @@ extern struct irq_domain *irq_find_matching_fwnode(struct 
fwnode_handle *fwnode,
   enum irq_domain_bus_token 
bus_token);
 extern void irq_set_default_host(struct irq_domain *host);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
- irq_hw_number_t hwirq, int node);
+ irq_hw_number_t hwirq, int node,
+ int targetcpu);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
@@ -377,7 +378,7 @@ static inline struct irq_domain 
*irq_domain_add_hierarchy(struct irq_domain *par
 
 extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
   unsigned int nr_irqs, int node, void *arg,
-  bool realloc);
+  bool realloc, int targetcpu);


[PATCH 1/8] device: Add irq affinity hint cpumask pointer

2016-04-15 Thread Christoph Hellwig
From: Thomas Gleixner 

This optional cpumask will be used by the irq core code to optimize interrupt
allocation and affinity setup for multiqueue devices.

Signed-off-by: Thomas Gleixner 
---
 include/linux/device.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 002c597..0270103 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -741,6 +741,8 @@ struct device_dma_parameters {
  * @msi_list:  Hosts MSI descriptors
  * @msi_domain: The generic MSI domain this device is using.
  * @numa_node: NUMA node this device is close to.
+ * @irq_affinity: Hint for irq affinities and descriptor allocation
+ *   (optional).
  * @dma_mask:  Dma mask (if dma'ble device).
  * @coherent_dma_mask: Like dma_mask, but for alloc_coherent mapping as not all
  * hardware supports 64-bit addresses for consistent allocations
@@ -813,6 +815,8 @@ struct device {
 #ifdef CONFIG_NUMA
int numa_node;  /* NUMA node this device is close to */
 #endif
+
+   struct cpumask  *irq_affinity;
u64 *dma_mask;  /* dma mask (if dma'able device) */
u64 coherent_dma_mask;/* Like dma_mask, but for
 alloc_coherent mappings as
-- 
2.1.4



[PATCH 6/8] pci: provide sensible irq vector alloc/free routines

2016-04-15 Thread Christoph Hellwig
Hide all the MSI-X vs MSI vs legacy bullshit, provide an array of
interrupt vectors in the pci_dev structure, and ensure we get proper
interrupt affinity by default.

Signed-off-by: Christoph Hellwig 
---
 drivers/pci/irq.c   | 89 -
 drivers/pci/msi.c   |  2 +-
 drivers/pci/pci.h   |  5 +++
 include/linux/pci.h |  5 +++
 4 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index 6684f15..b683465 100644
--- a/drivers/pci/irq.c
+++ b/drivers/pci/irq.c
@@ -1,7 +1,8 @@
 /*
- * PCI IRQ failure handing code
+ * PCI IRQ handing code
  *
  * Copyright (c) 2008 James Bottomley 
+ * Copyright (c) 2016 Christoph Hellwig.
  */
 
 #include 
@@ -9,6 +10,92 @@
 #include 
 #include 
 #include 
+#include 
+#include "pci.h"
+
+static int pci_nr_irq_vectors(struct pci_dev *pdev)
+{
+   int nr_entries;
+
+   nr_entries = pci_msix_vec_count(pdev);
+   if (nr_entries <= 0 && pci_msi_supported(pdev, 1))
+   nr_entries = pci_msi_vec_count(pdev);
+   if (nr_entries <= 0)
+   nr_entries = 1;
+   return nr_entries;
+}
+
+static int pci_enable_msix_range_wrapper(struct pci_dev *pdev, u32 *irqs,
+   int nr_vecs)
+{
+   struct msix_entry *msix_entries;
+   int vecs, i;
+
+   msix_entries = kcalloc(nr_vecs, sizeof(struct msix_entry), GFP_KERNEL);
+   if (!msix_entries)
+   return -ENOMEM;
+
+   for (i = 0; i < nr_vecs; i++)
+   msix_entries[i].entry = i;
+
+   vecs = pci_enable_msix_range(pdev, msix_entries, 1, nr_vecs);
+   if (vecs > 0) {
+   for (i = 0; i < vecs; i++)
+   irqs[i] = msix_entries[i].vector;
+   }
+
+   kfree(msix_entries);
+   return vecs;
+}
+
+int pci_alloc_irq_vectors(struct pci_dev *pdev, int nr_vecs)
+{
+   int vecs, ret, i;
+   u32 *irqs;
+
+   nr_vecs = min(nr_vecs, pci_nr_irq_vectors(pdev));
+
+   irqs = kcalloc(nr_vecs, sizeof(u32), GFP_KERNEL);
+   if (!irqs)
+   return -ENOMEM;
+
+   vecs = pci_enable_msix_range_wrapper(pdev, irqs, nr_vecs);
+   if (vecs <= 0) {
+   vecs = pci_enable_msi_range(pdev, 1, min(nr_vecs, 32));
+   if (vecs <= 0) {
+   ret = -EIO;
+   if (!pdev->irq)
+   goto out_free_irqs;
+
+   /* use legacy irq */
+   vecs = 1;
+   }
+
+   for (i = 0; i < vecs; i++)
+   irqs[i] = pdev->irq + i;
+   }
+
+   pdev->irqs = irqs;
+   return vecs;
+
+out_free_irqs:
+   kfree(irqs);
+   return ret;
+}
+EXPORT_SYMBOL(pci_alloc_irq_vectors);
+
+void pci_free_irq_vectors(struct pci_dev *pdev)
+{
+   if (pdev->msi_enabled)
+   pci_disable_msi(pdev);
+   else if (pdev->msix_enabled)
+   pci_disable_msix(pdev);
+
+   kfree(pdev->dev.irq_affinity);
+   pdev->dev.irq_affinity = NULL;
+   kfree(pdev->irqs);
+}
+EXPORT_SYMBOL(pci_free_irq_vectors);
 
 static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
 {
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a080f44..544d306 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -815,7 +815,7 @@ out_free:
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 1, else return 0.
  **/
-static int pci_msi_supported(struct pci_dev *dev, int nvec)
+int pci_msi_supported(struct pci_dev *dev, int nvec)
 {
struct pci_bus *bus;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d0fb934..263422c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -144,8 +144,13 @@ extern unsigned int pci_pm_d3_delay;
 
 #ifdef CONFIG_PCI_MSI
 void pci_no_msi(void);
+int pci_msi_supported(struct pci_dev *dev, int nvec);
 #else
 static inline void pci_no_msi(void) { }
+static inline int pci_msi_supported(struct pci_dev *dev, int nvec)
+{
+   return 0;
+}
 #endif
 
 static inline void pci_msi_set_enable(struct pci_dev *dev, int enable)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 004b813..4fbc14f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -322,6 +322,7 @@ struct pci_dev {
 * directly, use the values stored here. They might be different!
 */
unsigned intirq;
+   unsigned int*irqs;
struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory 
regions + expansion ROMs */
 
bool match_driver;  /* Skip attaching driver */
@@ -1235,6 +1236,9 @@ resource_size_t pcibios_iov_resource_alignment(struct 
pci_dev *dev, int resno);
 int pci_set_vga_state(struct pci_dev *pdev, bool decode,
  unsigned int command_bits, u32 flags);
 
+int pci_alloc_irq_vectors(struct pci_dev *dev, int nr_vecs);
+void pci_free_irq_vectors(struct pci_dev *pdev);
+
/* kmem_cache style wrapper 

[PATCH 4/8] genirq: add a helper to program the pre-set affinity mask into the controller

2016-04-15 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/linux/interrupt.h |  2 ++
 kernel/irq/manage.c   | 14 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 67bc1e1f..ae345da 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -713,4 +713,6 @@ extern char __softirqentry_text_end[];
 #define __softirq_entry
 #endif
 
+void irq_program_affinity(unsigned int irq);
+
 #endif
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index cc1cc64..02552a4 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -240,6 +240,20 @@ int __irq_set_affinity(unsigned int irq, const struct 
cpumask *mask, bool force)
return ret;
 }
 
+void irq_program_affinity(unsigned int irq)
+{
+   struct irq_desc *desc = irq_to_desc(irq);
+   struct irq_data *data = irq_desc_get_irq_data(desc);
+   unsigned long flags;
+
+   if (WARN_ON_ONCE(!desc))
+   return;
+
+   raw_spin_lock_irqsave(&desc->lock, flags);
+   irq_set_affinity_locked(data, desc->irq_common_data.affinity, false);
+   raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+
 int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
 {
unsigned long flags;
-- 
2.1.4



RFC: automatic interrupt affinity for MSI/MSI-X capable devices

2016-04-15 Thread Christoph Hellwig
This series enhances the irq and PCI code to allow spreading around MSI and
MSI-X vectors so that they have per-cpu affinity if possible, or at least
per-node.  For that it takes the algorithm from blk-mq, moves it to
a common place, and makes it available through a vastly simplified PCI
interrupt allocation API.  It then switches blk-mq to be able to pick up
the queue mapping from the device if available, and demonstrates all this
using the NVMe driver.

There is still some work to do, mostly related to handling PCI hotplug;
more details are in the individual patches.
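
As a rough sketch, a converted driver ends up looking like this
(hypothetical device and handler names, based on the API introduced in
patches 6/8 and 7/8):

	/* probe: ask for one vector per possible CPU; the core trims the
	 * count, spreads affinity, and fills pdev->irqs[] with Linux IRQ
	 * numbers (MSI-X, MSI or legacy, whichever worked).
	 */
	nr_vecs = pci_alloc_irq_vectors(pdev, num_possible_cpus());
	if (nr_vecs <= 0)
		return -EIO;

	for (i = 0; i < nr_vecs; i++)
		request_irq(pdev->irqs[i], my_handler, IRQF_SHARED,
			    "mydev", mydev);

	/* remove: tears down MSI/MSI-X state and frees pdev->irqs */
	pci_free_irq_vectors(pdev);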



[PATCH] xen/x86: don't lose event interrupts

2016-04-15 Thread Stefano Stabellini
On slow platforms with unreliable TSC, such as QEMU emulated machines,
it is possible for the kernel to request the next event in the past. In
that case, in the current implementation of xen_vcpuop_clockevent, we
simply return -ETIME. To be precise, Xen returns -ETIME and we pass
it on. However, the result is a missed event, which simply causes
the kernel to hang.

Instead it is better to always ask the hypervisor for a timer event,
even if the timeout is in the past. That way there are no lost
interrupts and the kernel survives. To do that, remove the
VCPU_SSHOTTMR_future flag.

Signed-off-by: Stefano Stabellini 

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a0a4e55..6deba5b 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -290,11 +290,11 @@ static int xen_vcpuop_set_next_event(unsigned long delta,
WARN_ON(!clockevent_state_oneshot(evt));
 
single.timeout_abs_ns = get_abs_timeout(delta);
-   single.flags = VCPU_SSHOTTMR_future;
+   /* Get an event anyway, even if the timeout is already expired */
+   single.flags = 0;
 
ret = HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer, cpu, &single);
-
-   BUG_ON(ret != 0 && ret != -ETIME);
+   BUG_ON(ret != 0);
 
return ret;
 }
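
For context, the hypercall argument is just a timeout plus the flag being
removed (a sketch of the struct from Xen's public vcpu.h; field comments
are mine):

	struct vcpu_set_singleshot_timer {
		uint64_t timeout_abs_ns;	/* absolute expiry time in ns */
		uint32_t flags;			/* VCPU_SSHOTTMR_future: make Xen
						 * fail with -ETIME instead of
						 * firing when the timeout is
						 * already in the past */
	};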


pmd_set_huge and ACPI warnings

2016-04-15 Thread Paul Sturm
Not sure if this is the right place to post. If it is not, please direct me to 
where I should go.

I am running x86_64 kernel 4.4.6 on an Intel Xeon D system. This is an SOC 
system that includes dual 10G ethernet using the ixgbe driver. 
I have also tested this on kernels 4.2 through 4.6rc3 with the same result.

When the ixgbe driver loads, I get the following two warnings: 

[ 5453.184701] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 
4.2.1-k 
[ 5453.184704] ixgbe: Copyright (c) 1999-2015 Intel Corporation. 
[ 5453.184767] ACPI Warning: \_SB_.PCI0.BR2C._PRT: Return Package has no 
elements (empty) (20150930/nsprepkg-126) 
[ 5453.184891] pmd_set_huge: Cannot satisfy [mem 0x383fffa00000-0x383fffc00000] 
with a huge-page mapping due to MTRR override. 

BIOS is set to enable 64-bit DMA above 4GB. 
cat /proc/mtrr looks like this: 
reg00: base=0x08000 ( 2048MB ), size= 2048MB, count=1: uncachable 
reg01: base=0x3800 (58720256MB ), size=262144MB, count=1: uncachable 
reg02: base=0x383fff80 (58982392MB ), size= 8MB, count=1: write-through 
reg03: base=0x3830 (58982399MB ), size= 1MB, count=1: uncachable 

When I change the BIOS setting to disable DMA above 4GB (no other BIOS changes 
I tried had any effect on the MTRR ranges), 
cat /proc/mtrr looks like this: 
reg00: base=0x08000 ( 2048MB ), size= 2048MB, count=1: uncachable 
reg01: base=0x3800 (58720256MB ), size=262144MB, count=1: uncachable 
reg02: base=0x0f980 ( 3992MB ), size= 8MB, count=1: write-through 
reg03: base=0x0f9f0 ( 3999MB ), size= 1MB, count=1: uncachable 

and the pmd_set_huge warning indicates a memory range in the 0x0f 
uncacheable range. 

So the result is that ixgbe always seems to get its hugepage from the 
uncacheable range. 
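
For context, the check that prints this message lives in
arch/x86/mm/pgtable.c; as of the 4.4-era source it looks roughly like
this (paraphrased, not an exact quote):

	int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
	{
		u8 mtrr, uniform;

		/* Refuse the 2MB mapping unless MTRRs leave the whole
		 * range uniformly write-back. */
		mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
		if ((mtrr != MTRR_TYPE_INVALID) && !uniform &&
		    (mtrr != MTRR_TYPE_WRBACK)) {
			pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
				     __func__, addr, addr + PMD_SIZE);
			return 0;
		}
		/* ... set up the huge mapping ... */
		return 1;
	}

So the warning fires because the 8MB write-through MTRR at the top of
the BAR range makes the 2MB region non-uniform.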

I can post the full dmesg if requested, but in the meantime, here are the 
TLB-related entries: 
[ 0.027925] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8 
[ 0.027931] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4 

[ 0.325307] HugeTLB registered 1 GB page size, pre-allocated 0 pages 
[ 0.325315] HugeTLB registered 2 MB page size, pre-allocated 0 pages 
I tried to pre-allocate both 1GB and 2MB pages via the kernel command line and 
it had no effect. 
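
(For reference, the usual command-line syntax for that is something
like:

	default_hugepagesz=1G hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=512

- an illustrative line, not the exact one from my boot config.)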

I have tried both compiling the driver in the kernel and loading it as a 
module. Same results. 

I first reported this on the e1000 sourceforge list and they directed me to 
linux-mm, but majordomo is non-responsive for that list. 

In addition to the pmd_set_huge warning, there is also that ACPI warning. I am 
not sure whether it is related, but I can say it only appears when the ixgbe 
driver is loaded, and it always shows up right before the pmd_set_huge warning. 

Please advise.


Re: [PATCH 1/2] clk: imx: do not sleep if IRQ's are still disabled

2016-04-15 Thread Stephen Boyd
On 01/29, Stefan Agner wrote:
> If a clock gets enabled early during boot time, it can lead to a PLL
> startup. The wait_lock function makes sure that the PLL is really
> started up before it gets used. However, the function sleeps, which
> leads to scheduling and an error:
> bad: scheduling from the idle thread!
> ...
> 
> Use udelay in case IRQs are still disabled.
> 
> Signed-off-by: Stefan Agner 

This is really old. Shawn, are you picking these up? I'm removing
these from my queue for now.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


[PATCH 05/15] orangefs: don't open-code inode_lock/inode_unlock

2016-04-15 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/orangefs/file.c| 4 ++--
 fs/orangefs/orangefs-kernel.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index ae92795..491e82c 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -445,7 +445,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, 
struct iov_iter *ite
 
gossip_debug(GOSSIP_FILE_DEBUG, "orangefs_file_write_iter\n");
 
-   mutex_lock(&file->f_mapping->host->i_mutex);
+   inode_lock(file->f_mapping->host);
 
/* Make sure generic_write_checks sees an up to date inode size. */
if (file->f_flags & O_APPEND) {
@@ -492,7 +492,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, 
struct iov_iter *ite
 
 out:
 
-   mutex_unlock(&file->f_mapping->host->i_mutex);
+   inode_unlock(file->f_mapping->host);
return rc;
 }
 
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index a9925e2..2281882 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -612,11 +612,11 @@ do {  
\
 static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
 {
 #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
-   mutex_lock(&inode->i_mutex);
+   inode_lock(inode);
 #endif
i_size_write(inode, i_size);
 #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
-   mutex_unlock(&inode->i_mutex);
+   inode_unlock(inode);
 #endif
 }
 
-- 
2.8.0.rc3



[PATCH 07/15] reconnect_one(): use lookup_one_len_unlocked()

2016-04-15 Thread Al Viro
From: Al Viro 

... and explain the non-obvious logic in the case when lookup yields
a different dentry.

Signed-off-by: Al Viro 
---
 fs/exportfs/expfs.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index c46f1a1..402c5ca 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -143,14 +143,18 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
if (err)
goto out_err;
dprintk("%s: found name: %s\n", __func__, nbuf);
-   inode_lock(parent->d_inode);
-   tmp = lookup_one_len(nbuf, parent, strlen(nbuf));
-   inode_unlock(parent->d_inode);
+   tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
if (IS_ERR(tmp)) {
dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
goto out_err;
}
if (tmp != dentry) {
+   /*
+* Somebody has renamed it since exportfs_get_name();
+* great, since it could've only been renamed if it
+* got looked up and thus connected, and it would
+* remain connected afterwards.  We are done.
+*/
dput(tmp);
goto out_reconnected;
}
-- 
2.8.0.rc3



[PATCH 10/15] __d_add(): don't drop/regain ->d_lock

2016-04-15 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/dcache.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e9de4d9..33cad8a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2363,11 +2363,19 @@ EXPORT_SYMBOL(d_rehash);
 
 static inline void __d_add(struct dentry *dentry, struct inode *inode)
 {
+   spin_lock(&dentry->d_lock);
if (inode) {
-   __d_instantiate(dentry, inode);
+   unsigned add_flags = d_flags_for_inode(inode);
+   hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+   raw_write_seqcount_begin(&dentry->d_seq);
+   __d_set_inode_and_type(dentry, inode, add_flags);
+   raw_write_seqcount_end(&dentry->d_seq);
+   __fsnotify_d_instantiate(dentry);
+   }
+   _d_rehash(dentry);
+   spin_unlock(&dentry->d_lock);
+   if (inode)
spin_unlock(&inode->i_lock);
-   }
-   d_rehash(dentry);
 }
 
 /**
-- 
2.8.0.rc3



[PATCH 08/15] ovl_lookup_real(): use lookup_one_len_unlocked()

2016-04-15 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/overlayfs/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 14cab38..4c26225 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -378,9 +378,7 @@ static inline struct dentry *ovl_lookup_real(struct dentry 
*dir,
 {
struct dentry *dentry;
 
-   inode_lock(dir->d_inode);
-   dentry = lookup_one_len(name->name, dir, name->len);
-   inode_unlock(dir->d_inode);
+   dentry = lookup_one_len_unlocked(name->name, dir, name->len);
 
if (IS_ERR(dentry)) {
if (PTR_ERR(dentry) == -ENOENT)
-- 
2.8.0.rc3



[PATCH 09/15] lookup_slow(): bugger off on IS_DEADDIR() from the very beginning

2016-04-15 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/namei.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c0d551f..6fb33a7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1603,8 +1603,15 @@ static struct dentry *lookup_slow(const struct qstr 
*name,
  struct dentry *dir,
  unsigned int flags)
 {
-   struct dentry *dentry;
-   inode_lock(dir->d_inode);
+   struct dentry *dentry, *old;
+   struct inode *inode = dir->d_inode;
+
+   inode_lock(inode);
+   /* Don't go there if it's already dead */
+   if (unlikely(IS_DEADDIR(inode))) {
+   inode_unlock(inode);
+   return ERR_PTR(-ENOENT);
+   }
dentry = d_lookup(dir, name);
if (unlikely(dentry)) {
if ((dentry->d_flags & DCACHE_OP_REVALIDATE) &&
@@ -1618,17 +1625,21 @@ static struct dentry *lookup_slow(const struct qstr 
*name,
}
}
if (dentry) {
-   inode_unlock(dir->d_inode);
+   inode_unlock(inode);
return dentry;
}
}
dentry = d_alloc(dir, name);
if (unlikely(!dentry)) {
-   inode_unlock(dir->d_inode);
+   inode_unlock(inode);
return ERR_PTR(-ENOMEM);
}
-   dentry = lookup_real(dir->d_inode, dentry, flags);
-   inode_unlock(dir->d_inode);
+   old = inode->i_op->lookup(inode, dentry, flags);
+   if (unlikely(old)) {
+   dput(dentry);
+   dentry = old;
+   }
+   inode_unlock(inode);
return dentry;
 }
 
-- 
2.8.0.rc3



[PATCH 06/15] reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()

2016-04-15 Thread Al Viro
From: Al Viro 

... and have it use inode_lock()

Signed-off-by: Al Viro 
---
 fs/reiserfs/ioctl.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c
index 036a1fc..f49afe7 100644
--- a/fs/reiserfs/ioctl.c
+++ b/fs/reiserfs/ioctl.c
@@ -187,7 +187,11 @@ int reiserfs_unpack(struct inode *inode, struct file *filp)
}
 
/* we need to make sure nobody is changing the file size beneath us */
-   reiserfs_mutex_lock_safe(&inode->i_mutex, inode->i_sb);
+{
+   int depth = reiserfs_write_unlock_nested(inode->i_sb);
+   inode_lock(inode);
+   reiserfs_write_lock_nested(inode->i_sb, depth);
+}
 
reiserfs_write_lock(inode->i_sb);
 
-- 
2.8.0.rc3



[PATCH 11/15] beginning of transition to parallel lookups - marking in-lookup dentries

2016-04-15 Thread Al Viro
From: Al Viro 

marked as such when (would be) parallel lookup is about to pass them
to actual ->lookup(); unmarked when
* __d_add() is about to make it hashed, positive or not.
* __d_move() (from d_splice_alias(), directly or via
__d_unalias()) puts a preexisting dentry in its place
* in caller of ->lookup() if it has escaped all of the
above.  Bug (WARN_ON, actually) if it reaches the final dput()
or d_instantiate() while still marked such.

As a result, we are guaranteed that for as long as the flag is
set, dentry will
* remain negative unhashed with positive refcount
* never have its ->d_alias looked at
* never have its ->d_lru looked at
* never have its ->d_parent and ->d_name changed

Right now we have at most one such dentry for any given parent directory.
With parallel lookups that restriction will weaken to
* only exist when parent is locked shared
* at most one with given (parent,name) pair (comparison of
names is according to ->d_compare())
* only exist when there's no hashed dentry with the same
(parent,name)

The transition will take the next several commits; unfortunately, we'll
only be able to switch to rwsem at the end of this series.  The
reason for not making it a single patch is to simplify review.

Signed-off-by: Al Viro 
---
 fs/dcache.c| 12 
 fs/namei.c |  4 
 include/linux/dcache.h | 13 +
 3 files changed, 29 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 33cad8a..5cea3cb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -761,6 +761,8 @@ repeat:
/* Slow case: now with the dentry lock held */
rcu_read_unlock();
 
+   WARN_ON(dentry->d_flags & DCACHE_PAR_LOOKUP);
+
/* Unreachable? Get rid of it */
if (unlikely(d_unhashed(dentry)))
goto kill_it;
@@ -1743,6 +1745,7 @@ type_determined:
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
unsigned add_flags = d_flags_for_inode(inode);
+   WARN_ON(dentry->d_flags & DCACHE_PAR_LOOKUP);
 
spin_lock(&dentry->d_lock);
hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
@@ -2358,12 +2361,19 @@ void d_rehash(struct dentry * entry)
 }
 EXPORT_SYMBOL(d_rehash);
 
+void __d_not_in_lookup(struct dentry *dentry)
+{
+   dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
+   /* more stuff will land here */
+}
 
 /* inode->i_lock held if inode is non-NULL */
 
 static inline void __d_add(struct dentry *dentry, struct inode *inode)
 {
spin_lock(&dentry->d_lock);
+   if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP))
+   __d_not_in_lookup(dentry);
if (inode) {
unsigned add_flags = d_flags_for_inode(inode);
hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
@@ -2609,6 +2619,8 @@ static void __d_move(struct dentry *dentry, struct dentry 
*target,
BUG_ON(d_ancestor(target, dentry));
 
dentry_lock_for_move(dentry, target);
+   if (unlikely(target->d_flags & DCACHE_PAR_LOOKUP))
+   __d_not_in_lookup(target);
 
write_seqcount_begin(&dentry->d_seq);
write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED);
diff --git a/fs/namei.c b/fs/namei.c
index 6fb33a7..0ee8b9d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1634,7 +1634,11 @@ static struct dentry *lookup_slow(const struct qstr 
*name,
inode_unlock(inode);
return ERR_PTR(-ENOMEM);
}
+   spin_lock(&dentry->d_lock);
+   dentry->d_flags |= DCACHE_PAR_LOOKUP;
+   spin_unlock(&dentry->d_lock);
old = inode->i_op->lookup(inode, dentry, flags);
+   d_not_in_lookup(dentry);
if (unlikely(old)) {
dput(dentry);
dentry = old;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7cb043d..cfc1240 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -230,6 +230,8 @@ struct dentry_operations {
 
 #define DCACHE_ENCRYPTED_WITH_KEY  0x0400 /* dir is encrypted with a 
valid key */
 
+#define DCACHE_PAR_LOOKUP  0x0800 /* being looked up (with 
parent locked shared) */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -365,6 +367,17 @@ static inline void dont_mount(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
 }
 
+extern void __d_not_in_lookup(struct dentry *);
+
+static inline void d_not_in_lookup(struct dentry *dentry)
+{
+   if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP)) {
+   spin_lock(&dentry->d_lock);
+   __d_not_in_lookup(dentry);
+   spin_unlock(&dentry->d_lock);
+   }
+}
+
 extern void dput(struct dentry *);
 
 static inline bool d_managed(const struct dentry *dentry)
-- 
2.8.0.rc3



[PATCH 15/15] parallel lookups: actual switch to rwsem

2016-04-15 Thread Al Viro
From: Al Viro 

ta-da!

The main issue is the lack of down_write_killable(), so places
like readdir.c switched to plain inode_lock(); once killable
variants of the rwsem primitives appear, that'll be dealt with.

The lockdep side might also need more work.
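
The i_mutex -> i_rwsem switch itself is mostly mechanical; the
include/linux/fs.h part (which this archived copy of the diff is
truncated before reaching) boils down to wrappers along these lines -
a sketch of the intended shape, not a verbatim quote:

	static inline void inode_lock(struct inode *inode)
	{
		down_write(&inode->i_rwsem);
	}

	static inline void inode_unlock(struct inode *inode)
	{
		up_write(&inode->i_rwsem);
	}

	static inline void inode_lock_shared(struct inode *inode)
	{
		down_read(&inode->i_rwsem);
	}

	static inline void inode_unlock_shared(struct inode *inode)
	{
		up_read(&inode->i_rwsem);
	}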

Signed-off-by: Al Viro 
---
 fs/btrfs/ioctl.c   | 16 ++--
 fs/configfs/inode.c|  2 +-
 fs/dcache.c|  9 +
 fs/gfs2/ops_fstype.c   |  2 +-
 fs/inode.c | 12 ++--
 fs/namei.c |  4 ++--
 fs/ocfs2/inode.c   |  2 +-
 fs/overlayfs/readdir.c |  4 +++-
 fs/readdir.c   |  7 ---
 include/linux/fs.h | 12 ++--
 10 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 053e677..db1e830 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -837,9 +837,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
-   error = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
-   if (error == -EINTR)
-   return error;
+   inode_lock_nested(dir, I_MUTEX_PARENT);
+   // XXX: should've been
+   // mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
+   // if (error == -EINTR)
+   //  return error;
 
dentry = lookup_one_len(name, parent->dentry, namelen);
error = PTR_ERR(dentry);
@@ -2366,9 +2368,11 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
goto out;
 
 
-   err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
-   if (err == -EINTR)
-   goto out_drop_write;
+   inode_lock_nested(dir, I_MUTEX_PARENT);
+   // XXX: should've been
+   // err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
+   // if (err == -EINTR)
+   //  goto out_drop_write;
dentry = lookup_one_len(vol_args->name, parent, namelen);
if (IS_ERR(dentry)) {
err = PTR_ERR(dentry);
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 03d124a..0387968 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -156,7 +156,7 @@ static void configfs_set_inode_lock_class(struct 
configfs_dirent *sd,
 
if (depth > 0) {
if (depth <= ARRAY_SIZE(default_group_class)) {
-   lockdep_set_class(&inode->i_mutex,
+   lockdep_set_class(&inode->i_rwsem,
  &default_group_class[depth - 1]);
} else {
/*
diff --git a/fs/dcache.c b/fs/dcache.c
index 5965588..d110040 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2911,7 +2911,8 @@ struct dentry *d_ancestor(struct dentry *p1, struct 
dentry *p2)
 static int __d_unalias(struct inode *inode,
struct dentry *dentry, struct dentry *alias)
 {
-   struct mutex *m1 = NULL, *m2 = NULL;
+   struct mutex *m1 = NULL;
+   struct rw_semaphore *m2 = NULL;
int ret = -ESTALE;
 
/* If alias and dentry share a parent, then no extra locks required */
@@ -2922,15 +2923,15 @@ static int __d_unalias(struct inode *inode,
if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
goto out_err;
m1 = &dentry->d_sb->s_vfs_rename_mutex;
-   if (!inode_trylock(alias->d_parent->d_inode))
+   if (!down_read_trylock(&alias->d_parent->d_inode->i_rwsem))
goto out_err;
-   m2 = &alias->d_parent->d_inode->i_mutex;
+   m2 = &alias->d_parent->d_inode->i_rwsem;
 out_unalias:
__d_move(alias, dentry, false);
ret = 0;
 out_err:
if (m2)
-   mutex_unlock(m2);
+   up_read(m2);
if (m1)
mutex_unlock(m1);
return ret;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index c09c63d..4546360 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -824,7 +824,7 @@ static int init_inodes(struct gfs2_sbd *sdp, int undo)
 * i_mutex on quota files is special. Since this inode is hidden system
 * file, we are safe to define locking ourselves.
 */
-   lockdep_set_class(&sdp->sd_quota_inode->i_mutex,
+   lockdep_set_class(&sdp->sd_quota_inode->i_rwsem,
  &gfs2_quota_imutex_key);
 
error = gfs2_rindex_update(sdp);
diff --git a/fs/inode.c b/fs/inode.c
index 4b884f7..4ccbc21 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -166,8 +166,8 @@ int inode_init_always(struct super_block *sb, struct inode 
*inode)
spin_lock_init(&inode->i_lock);
lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);
 
-   mutex_init(&inode->i_mutex);
-   lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
+   init_rwsem(&inode->i_rwsem);
+   lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);
 
atomic_set(&inode->i_dio_count, 0);
 
@@ -925,13 +925,13 @@ void lockdep_annotate_inode_mutex_key(struct inode *inode)
struct file_system_type *type = inode->i_sb->s_type;
 
/* Set new key only if filesystem hasn't already changed it */



[PATCH 13/15] parallel lookups machinery, part 3

2016-04-15 Thread Al Viro
From: Al Viro 

We will need to be able to check if there is an in-lookup
dentry with matching parent/name.  Right now it's impossible,
but as soon as we start locking directories shared, such beasts
will appear.

Add a secondary hash for locating those.  Hash chains go through
the same space where d_alias will be once it's not in-lookup anymore.
Search is done under the same bitlock we use for modifications -
with the primary hash we can rely on d_rehash() into the wrong
chain being the worst that could happen, but here the pointers are
buggered once it's removed from the chain.  On the other hand,
the chains are not going to be long and normally we'll end up
adding to the chain anyway.  That allows us to avoid bothering with
->d_lock when doing the comparisons - everything is stable until
removed from chain.

New helper: d_alloc_parallel().  Right now it allocates, verifies
that no hashed or in-lookup matches exist, and adds to the in-lookup
hash.

Returns ERR_PTR() for error, hashed match (in the unlikely case it's
been found) or new dentry.  In-lookup matches trigger BUG() for
now; that will change in the next commit when we introduce waiting
for ongoing lookup to finish.  Note that in-lookup matches won't be
possible until we actually go for shared locking.

lookup_slow() switched to use of d_alloc_parallel().

Again, these commits are separated only for making it easier to
review.  All this machinery will start doing something useful only
when we go for shared locking; it's just that the combination is
too large for my taste.
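
The fs/namei.c side is cut off in this archived copy, but the intended
calling convention is roughly (a sketch, not a verbatim quote of the
missing hunk):

	dentry = d_alloc_parallel(dir, name);
	if (IS_ERR(dentry))
		return dentry;			/* e.g. -ENOMEM */
	if (!(dentry->d_flags & DCACHE_PAR_LOOKUP))
		return dentry;			/* preexisting hashed match */
	/* otherwise we own the in-lookup dentry and do the real lookup */
	old = inode->i_op->lookup(inode, dentry, flags);
	d_not_in_lookup(dentry);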

Signed-off-by: Al Viro 
---
 fs/dcache.c| 87 ++
 fs/namei.c | 44 +++--
 include/linux/dcache.h |  2 ++
 3 files changed, 108 insertions(+), 25 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3959f18..0552002 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -111,6 +111,17 @@ static inline struct hlist_bl_head *d_hash(const struct 
dentry *parent,
return dentry_hashtable + hash_32(hash, d_hash_shift);
 }
 
+#define IN_LOOKUP_SHIFT 10
+static struct hlist_bl_head in_lookup_hashtable[1 << IN_LOOKUP_SHIFT];
+
+static inline struct hlist_bl_head *in_lookup_hash(const struct dentry *parent,
+   unsigned int hash)
+{
+   hash += (unsigned long) parent / L1_CACHE_BYTES;
+   return in_lookup_hashtable + hash_32(hash, IN_LOOKUP_SHIFT);
+}
+
+
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
.age_limit = 45,
@@ -2377,9 +2388,85 @@ static inline void end_dir_add(struct inode *dir, 
unsigned n)
smp_store_release(&dir->i_dir_seq, n + 2);
 }
 
+struct dentry *d_alloc_parallel(struct dentry *parent,
+   const struct qstr *name)
+{
+   unsigned int len = name->len;
+   unsigned int hash = name->hash;
+   const unsigned char *str = name->name;
+   struct hlist_bl_head *b = in_lookup_hash(parent, hash);
+   struct hlist_bl_node *node;
+   struct dentry *new = d_alloc(parent, name);
+   struct dentry *dentry;
+   unsigned seq;
+
+   if (unlikely(!new))
+   return ERR_PTR(-ENOMEM);
+
+retry:
+   seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
+   dentry = d_lookup(parent, name);
+   if (unlikely(dentry)) {
+   dput(new);
+   return dentry;
+   }
+
+   hlist_bl_lock(b);
+   smp_rmb();
+   if (unlikely(parent->d_inode->i_dir_seq != seq)) {
+   hlist_bl_unlock(b);
+   goto retry;
+   }
+   /*
+* No changes for the parent since the beginning of d_lookup().
+* Since all removals from the chain happen with hlist_bl_lock(),
+* any potential in-lookup matches are going to stay here until
+* we unlock the chain.  All fields are stable in everything
+* we encounter.
+*/
+   hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
+   if (dentry->d_name.hash != hash)
+   continue;
+   if (dentry->d_parent != parent)
+   continue;
+   if (d_unhashed(dentry))
+   continue;
+   if (parent->d_flags & DCACHE_OP_COMPARE) {
+   int tlen = dentry->d_name.len;
+   const char *tname = dentry->d_name.name;
+   if (parent->d_op->d_compare(parent, dentry, tlen, 
tname, name))
+   continue;
+   } else {
+   if (dentry->d_name.len != len)
+   continue;
+   if (dentry_cmp(dentry, str, len))
+   continue;
+   }
+   dget(dentry);
+   hlist_bl_unlock(b);
+   /* impossible until we actually enable parallel lookups */
+   BUG();
+   


[PATCH 14/15] parallel lookups machinery, part 4 (and last)

2016-04-15 Thread Al Viro
From: Al Viro 

If we *do* run into an in-lookup match, we need to wait for it to
cease being in-lookup.  Fortunately, we do have unused space in
in-lookup dentries - d_lru is never looked at until it stops being
in-lookup.

So we can stash a pointer to a wait_queue_head from the stack frame of
the caller of ->lookup().  Some precautions are needed while
waiting, but it's not that hard - we do hold a reference to the dentry
we are waiting for, so it can't go away.  If it's found to be
in-lookup the wait_queue_head is still alive and will remain so
at least while ->d_lock is held.  Moreover, the condition we
are waiting for becomes true at the same point where everything
on that wq gets woken up, so we can just add ourselves to the
queue once.

d_alloc_parallel() gets a pointer to wait_queue_head_t from its
caller; lookup_slow() adjusted, d_add_ci() taught to use
d_alloc_parallel() if the dentry passed to it happens to be
in-lookup one (i.e. if it's been called from the parallel lookup).

That's pretty much it - all that remains is to switch ->i_mutex
to rwsem and have lookup_slow() take it shared.
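
On the caller side that means the waitqueue simply lives on
lookup_slow()'s stack for the duration of ->lookup(); sketched (the
fs/namei.c change is in the diffstat below but truncated from this
archived copy):

	static struct dentry *lookup_slow(const struct qstr *name,
					  struct dentry *dir,
					  unsigned int flags)
	{
		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
		...
		dentry = d_alloc_parallel(dir, name, &wq);
		...
	}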

Signed-off-by: Al Viro 
---
 fs/dcache.c| 94 +++---
 fs/namei.c |  3 +-
 include/linux/dcache.h |  8 +++--
 3 files changed, 82 insertions(+), 23 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0552002..5965588 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1984,28 +1984,36 @@ EXPORT_SYMBOL(d_obtain_root);
 struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
struct qstr *name)
 {
-   struct dentry *found;
-   struct dentry *new;
+   struct dentry *found, *res;
 
/*
 * First check if a dentry matching the name already exists,
 * if not go ahead and create it now.
 */
found = d_hash_and_lookup(dentry->d_parent, name);
-   if (!found) {
-   new = d_alloc(dentry->d_parent, name);
-   if (!new) {
-   found = ERR_PTR(-ENOMEM);
-   } else {
-   found = d_splice_alias(inode, new);
-   if (found) {
-   dput(new);
-   return found;
-   }
-   return new;
+   if (found) {
+   iput(inode);
+   return found;
+   }
+   if (dentry->d_flags & DCACHE_PAR_LOOKUP) {
+   found = d_alloc_parallel(dentry->d_parent, name,
+   dentry->d_wait);
+   if (IS_ERR(found) || !(found->d_flags & DCACHE_PAR_LOOKUP)) {
+   iput(inode);
+   return found;
}
+   } else {
+   found = d_alloc(dentry->d_parent, name);
+   if (!found) {
+   iput(inode);
+   return ERR_PTR(-ENOMEM);
+   } 
+   }
+   res = d_splice_alias(inode, found);
+   if (res) {
+   dput(found);
+   return res;
}
-   iput(inode);
return found;
 }
 EXPORT_SYMBOL(d_add_ci);
@@ -2388,8 +2396,23 @@ static inline void end_dir_add(struct inode *dir, 
unsigned n)
smp_store_release(&dir->i_dir_seq, n + 2);
 }
 
+static void d_wait_lookup(struct dentry *dentry)
+{
+   if (dentry->d_flags & DCACHE_PAR_LOOKUP) {
+   DECLARE_WAITQUEUE(wait, current);
+   add_wait_queue(dentry->d_wait, &wait);
+   do {
+   set_current_state(TASK_UNINTERRUPTIBLE);
+   spin_unlock(&dentry->d_lock);
+   schedule();
+   spin_lock(&dentry->d_lock);
+   } while (dentry->d_flags & DCACHE_PAR_LOOKUP);
+   }
+}
+
 struct dentry *d_alloc_parallel(struct dentry *parent,
-   const struct qstr *name)
+   const struct qstr *name,
+   wait_queue_head_t *wq)
 {
unsigned int len = name->len;
unsigned int hash = name->hash;
@@ -2444,18 +2467,47 @@ retry:
}
dget(dentry);
hlist_bl_unlock(b);
-   /* impossible until we actually enable parallel lookups */
-   BUG();
-   /* and this will be "wait for it to stop being in-lookup" */
-   /* this one will be handled in the next commit */
+   /* somebody is doing lookup for it right now; wait for it */
+   spin_lock(&dentry->d_lock);
+   d_wait_lookup(dentry);
+   /*
+* it's not in-lookup anymore; in principle we should repeat
+* everything from dcache lookup, but it's likely to be what
+* d_lookup() would've found anyway.  If it is, just return it;
+* otherwise we really have to repeat the whole thing.
+*/
+   if 

[PATCH 12/15] parallel lookups machinery, part 2

2016-04-15 Thread Al Viro
From: Al Viro 

We'll need to verify that there's neither a hashed nor an in-lookup
dentry with the desired parent/name before adding to the in-lookup set.

One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
time, dcache is not.  And holding the parent's ->d_lock through
something like __d_lookup_rcu() would suck too badly.

So we leave the parent's ->d_lock alone, which means that we watch
out for the following scenario:
* we verify that there's no hashed match
* existing in-lookup match gets hashed by another process
* we verify that there's no in-lookup matches and decide
that everything's fine.

Solution: per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of in-lookup.  Then the above would turn into
* read the counter
* do dcache lookup
* if no matches found, check for in-lookup matches
* if there had been none of those either, check if the
counter has changed; repeat if it has.

The "kinda-sorta" part is due to the fact that we don't have much spare
space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
so the counter part is not a problem, but spinlock is a different story.

We could use the parent's ->d_lock, and it would be less painful in
terms of contention, but for __d_add() it would be rather inconvenient to
grab; we could do that (using lock_parent()), but...

Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() from odd to even.

This commit adds the counter and the updating logic; the readers will be
added in the next commit.
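
As a self-contained illustration of the protocol (a userspace C11
model, not kernel code), the writer and reader sides fit together like
this:

	#include <stdatomic.h>

	static _Atomic unsigned dir_seq;	/* even: idle, odd: writer active */

	static unsigned start_dir_add(void)	/* take even -> odd */
	{
		for (;;) {
			unsigned n = atomic_load_explicit(&dir_seq,
							  memory_order_relaxed);
			if (!(n & 1) &&
			    atomic_compare_exchange_weak(&dir_seq, &n, n + 1))
				return n;
		}
	}

	static void end_dir_add(unsigned n)	/* release odd -> even */
	{
		atomic_store_explicit(&dir_seq, n + 2, memory_order_release);
	}

	static int reader_still_valid(unsigned seen)	/* retry on change */
	{
		return atomic_load_explicit(&dir_seq,
					    memory_order_acquire) == seen;
	}

A reader samples an even value up front, does both lookups, and repeats
the whole sequence if reader_still_valid() comes back false.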

Signed-off-by: Al Viro 
---
 fs/dcache.c| 34 --
 fs/inode.c |  1 +
 include/linux/fs.h |  1 +
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5cea3cb..3959f18 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2361,6 +2361,22 @@ void d_rehash(struct dentry * entry)
 }
 EXPORT_SYMBOL(d_rehash);
 
+static inline unsigned start_dir_add(struct inode *dir)
+{
+
+   for (;;) {
+   unsigned n = dir->i_dir_seq;
+   if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
+   return n;
+   cpu_relax();
+   }
+}
+
+static inline void end_dir_add(struct inode *dir, unsigned n)
+{
+   smp_store_release(&dir->i_dir_seq, n + 2);
+}
+
 void __d_not_in_lookup(struct dentry *dentry)
 {
dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
@@ -2371,9 +2387,14 @@ void __d_not_in_lookup(struct dentry *dentry)
 
 static inline void __d_add(struct dentry *dentry, struct inode *inode)
 {
+   struct inode *dir = NULL;
+   unsigned n;
spin_lock(&dentry->d_lock);
-   if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP))
+   if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP)) {
+   dir = dentry->d_parent->d_inode;
+   n = start_dir_add(dir);
__d_not_in_lookup(dentry);
+   }
if (inode) {
unsigned add_flags = d_flags_for_inode(inode);
hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
@@ -2383,6 +2404,8 @@ static inline void __d_add(struct dentry *dentry, struct 
inode *inode)
__fsnotify_d_instantiate(dentry);
}
_d_rehash(dentry);
+   if (dir)
+   end_dir_add(dir, n);
spin_unlock(&dentry->d_lock);
if (inode)
spin_unlock(&inode->i_lock);
@@ -2612,6 +2635,8 @@ static void dentry_unlock_for_move(struct dentry *dentry, 
struct dentry *target)
 static void __d_move(struct dentry *dentry, struct dentry *target,
 bool exchange)
 {
+   struct inode *dir = NULL;
+   unsigned n;
if (!dentry->d_inode)
printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
@@ -2619,8 +2644,11 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
BUG_ON(d_ancestor(target, dentry));
 
dentry_lock_for_move(dentry, target);
-   if (unlikely(target->d_flags & DCACHE_PAR_LOOKUP))
+   if (unlikely(target->d_flags & DCACHE_PAR_LOOKUP)) {
+   dir = target->d_parent->d_inode;
+   n = start_dir_add(dir);
__d_not_in_lookup(target);
+   }
 
	write_seqcount_begin(&dentry->d_seq);
	write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED);
@@ -2670,6 +2698,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
	write_seqcount_end(&target->d_seq);
	write_seqcount_end(&dentry->d_seq);
 
+   if (dir)
+   end_dir_add(dir, n);
dentry_unlock_for_move(dentry, target);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 4202aac..4b884f7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -151,6 +151,7 @@ int 

[PATCH 13/15] parallel lookups machinery, part 3

2016-04-15 Thread Al Viro
From: Al Viro 

We will need to be able to check if there is an in-lookup
dentry with a matching parent/name.  Right now it's impossible,
but as soon as we start locking directories shared, such beasts
will appear.

Add a secondary hash for locating those.  Hash chains go through
the same space where d_alias will be once it's not in-lookup anymore.
Search is done under the same bitlock we use for modifications -
with the primary hash we can rely on d_rehash() into the wrong
chain being the worst that could happen, but here the pointers are
buggered once it's removed from the chain.  On the other hand,
the chains are not going to be long and normally we'll end up
adding to the chain anyway.  That allows us to avoid bothering with
->d_lock when doing the comparisons - everything is stable until
removed from chain.

New helper: d_alloc_parallel().  Right now it allocates, verifies
that no hashed and in-lookup matches exist and adds to in-lookup
hash.

Returns ERR_PTR() for error, hashed match (in the unlikely case it's
been found) or new dentry.  In-lookup matches trigger BUG() for
now; that will change in the next commit when we introduce waiting
for ongoing lookup to finish.  Note that in-lookup matches won't be
possible until we actually go for shared locking.

lookup_slow() switched to use of d_alloc_parallel().
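
The calling convention, as a sketch (not the actual fs/namei.c change;
d_unhashed() stands in for the d_in_lookup() test that arrives later in
the series):

	/* Sketch of a d_alloc_parallel() caller; illustrative only. */
	static struct dentry *lookup_slow_sketch(const struct qstr *name,
						 struct dentry *dir,
						 unsigned flags)
	{
		struct inode *inode = dir->d_inode;
		struct dentry *dentry = d_alloc_parallel(dir, name);

		if (IS_ERR(dentry))		/* allocation failure */
			return dentry;
		if (d_unhashed(dentry)) {	/* new, in-lookup: do the work */
			struct dentry *old;
			old = inode->i_op->lookup(inode, dentry, flags);
			if (unlikely(old)) {
				dput(dentry);
				dentry = old;
			}
		}
		/* otherwise d_alloc_parallel() found a hashed match for us */
		return dentry;
	}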

Again, these commits are separated only for making it easier to
review.  All this machinery will start doing something useful only
when we go for shared locking; it's just that the combination is
too large for my taste.

Signed-off-by: Al Viro 
---
 fs/dcache.c| 87 ++
 fs/namei.c | 44 +++--
 include/linux/dcache.h |  2 ++
 3 files changed, 108 insertions(+), 25 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3959f18..0552002 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -111,6 +111,17 @@ static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
return dentry_hashtable + hash_32(hash, d_hash_shift);
 }
 
+#define IN_LOOKUP_SHIFT 10
+static struct hlist_bl_head in_lookup_hashtable[1 << IN_LOOKUP_SHIFT];
+
+static inline struct hlist_bl_head *in_lookup_hash(const struct dentry *parent,
+   unsigned int hash)
+{
+   hash += (unsigned long) parent / L1_CACHE_BYTES;
+   return in_lookup_hashtable + hash_32(hash, IN_LOOKUP_SHIFT);
+}
+
+
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
.age_limit = 45,
@@ -2377,9 +2388,85 @@ static inline void end_dir_add(struct inode *dir, unsigned n)
	smp_store_release(&dir->i_dir_seq, n + 2);
 }
 
+struct dentry *d_alloc_parallel(struct dentry *parent,
+   const struct qstr *name)
+{
+   unsigned int len = name->len;
+   unsigned int hash = name->hash;
+   const unsigned char *str = name->name;
+   struct hlist_bl_head *b = in_lookup_hash(parent, hash);
+   struct hlist_bl_node *node;
+   struct dentry *new = d_alloc(parent, name);
+   struct dentry *dentry;
+   unsigned seq;
+
+   if (unlikely(!new))
+   return ERR_PTR(-ENOMEM);
+
+retry:
+   seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
+   dentry = d_lookup(parent, name);
+   if (unlikely(dentry)) {
+   dput(new);
+   return dentry;
+   }
+
+   hlist_bl_lock(b);
+   smp_rmb();
+   if (unlikely(parent->d_inode->i_dir_seq != seq)) {
+   hlist_bl_unlock(b);
+   goto retry;
+   }
+   /*
+* No changes for the parent since the beginning of d_lookup().
+* Since all removals from the chain happen with hlist_bl_lock(),
+* any potential in-lookup matches are going to stay here until
+* we unlock the chain.  All fields are stable in everything
+* we encounter.
+*/
+   hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
+   if (dentry->d_name.hash != hash)
+   continue;
+   if (dentry->d_parent != parent)
+   continue;
+   if (d_unhashed(dentry))
+   continue;
+   if (parent->d_flags & DCACHE_OP_COMPARE) {
+   int tlen = dentry->d_name.len;
+   const char *tname = dentry->d_name.name;
+   if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
+   continue;
+   } else {
+   if (dentry->d_name.len != len)
+   continue;
+   if (dentry_cmp(dentry, str, len))
+   continue;
+   }
+   dget(dentry);
+   hlist_bl_unlock(b);
+   /* impossible until we actually enable parallel lookups */
+   BUG();
+   /* and this will be "wait for it to stop 

[PATCH 02/15] kernfs: use lookup_one_len_unlocked()

2016-04-15 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/kernfs/mount.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index b67dbcc..e006d30 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -120,9 +120,8 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
kntmp = find_next_ancestor(kn, knparent);
if (WARN_ON(!kntmp))
return ERR_PTR(-EINVAL);
-   mutex_lock(&d_inode(dentry)->i_mutex);
-   dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
-   mutex_unlock(&d_inode(dentry)->i_mutex);
+   dtmp = lookup_one_len_unlocked(kntmp->name, dentry,
+  strlen(kntmp->name));
dput(dentry);
if (IS_ERR(dtmp))
return dtmp;
-- 
2.8.0.rc3



[PATCH 01/15] security_d_instantiate(): move to the point prior to attaching dentry to inode

2016-04-15 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/dcache.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 32ceae3..e9de4d9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1772,11 +1772,11 @@ void d_instantiate(struct dentry *entry, struct inode *inode)
 {
	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
	if (inode) {
+   security_d_instantiate(entry, inode);
	spin_lock(&inode->i_lock);
	__d_instantiate(entry, inode);
	spin_unlock(&inode->i_lock);
}
-   security_d_instantiate(entry, inode);
 }
 EXPORT_SYMBOL(d_instantiate);
 
@@ -1793,6 +1793,7 @@ int d_instantiate_no_diralias(struct dentry *entry, struct inode *inode)
 {
	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 
+   security_d_instantiate(entry, inode);
	spin_lock(&inode->i_lock);
	if (S_ISDIR(inode->i_mode) && !hlist_empty(&inode->i_dentry)) {
	spin_unlock(&inode->i_lock);
@@ -1801,7 +1802,6 @@ int d_instantiate_no_diralias(struct dentry *entry, struct inode *inode)
	}
	__d_instantiate(entry, inode);
	spin_unlock(&inode->i_lock);
-   security_d_instantiate(entry, inode);
 
return 0;
 }
@@ -1875,6 +1875,7 @@ static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected)
goto out_iput;
}
 
+   security_d_instantiate(tmp, inode);
	spin_lock(&inode->i_lock);
res = __d_find_any_alias(inode);
if (res) {
@@ -1897,13 +1898,10 @@ static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected)
	hlist_bl_unlock(&tmp->d_sb->s_anon);
	spin_unlock(&tmp->d_lock);
	spin_unlock(&inode->i_lock);
-   security_d_instantiate(tmp, inode);
 
return tmp;
 
  out_iput:
-   if (res && !IS_ERR(res))
-   security_d_instantiate(res, inode);
iput(inode);
return res;
 }
@@ -2369,7 +2367,6 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
	__d_instantiate(dentry, inode);
	spin_unlock(&inode->i_lock);
}
-   security_d_instantiate(dentry, inode);
d_rehash(dentry);
 }
 
@@ -2384,8 +2381,10 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 
 void d_add(struct dentry *entry, struct inode *inode)
 {
-   if (inode)
+   if (inode) {
+   security_d_instantiate(entry, inode);
	spin_lock(&inode->i_lock);
+   }
__d_add(entry, inode);
 }
 EXPORT_SYMBOL(d_add);
@@ -2779,6 +2778,7 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
if (!inode)
goto out;
 
+   security_d_instantiate(dentry, inode);
	spin_lock(&inode->i_lock);
if (S_ISDIR(inode->i_mode)) {
struct dentry *new = __d_find_any_alias(inode);
@@ -2806,7 +2806,6 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
} else {
__d_move(new, dentry, false);
	write_sequnlock(&rename_lock);
-   security_d_instantiate(new, inode);
}
iput(inode);
return new;
-- 
2.8.0.rc3



[PATCH 03/15] configfs_detach_prep(): make sure that wait_mutex won't go away

2016-04-15 Thread Al Viro
From: Al Viro 

grab a reference to dentry we'd got the sucker from, and return
that dentry via *wait, rather than just returning the address of
->i_mutex.

Signed-off-by: Al Viro 
---
 fs/configfs/dir.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index ea59c89..48929c4 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -494,7 +494,7 @@ out:
  * If there is an error, the caller will reset the flags via
  * configfs_detach_rollback().
  */
-static int configfs_detach_prep(struct dentry *dentry, struct mutex **wait_mutex)
+static int configfs_detach_prep(struct dentry *dentry, struct dentry **wait)
 {
struct configfs_dirent *parent_sd = dentry->d_fsdata;
struct configfs_dirent *sd;
@@ -515,8 +515,8 @@ static int configfs_detach_prep(struct dentry *dentry, struct mutex **wait_mutex
if (sd->s_type & CONFIGFS_USET_DEFAULT) {
/* Abort if racing with mkdir() */
if (sd->s_type & CONFIGFS_USET_IN_MKDIR) {
-   if (wait_mutex)
-   *wait_mutex = &d_inode(sd->s_dentry)->i_mutex;
+   if (wait)
+   *wait = dget(sd->s_dentry);
return -EAGAIN;
}
 
@@ -524,7 +524,7 @@ static int configfs_detach_prep(struct dentry *dentry, struct mutex **wait_mutex
 * Yup, recursive.  If there's a problem, blame
 * deep nesting of default_groups
 */
-   ret = configfs_detach_prep(sd->s_dentry, wait_mutex);
+   ret = configfs_detach_prep(sd->s_dentry, wait);
if (!ret)
continue;
} else
@@ -1458,7 +1458,7 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
 * the new link is temporarily attached
 */
do {
-   struct mutex *wait_mutex;
+   struct dentry *wait;
 
mutex_lock(_symlink_mutex);
spin_lock(_dirent_lock);
@@ -1469,7 +1469,7 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
 */
ret = sd->s_dependent_count ? -EBUSY : 0;
if (!ret) {
-   ret = configfs_detach_prep(dentry, &wait_mutex);
+   ret = configfs_detach_prep(dentry, &wait);
if (ret)
configfs_detach_rollback(dentry);
}
@@ -1483,8 +1483,9 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
}
 
/* Wait until the racing operation terminates */
-   mutex_lock(wait_mutex);
-   mutex_unlock(wait_mutex);
+   inode_lock(d_inode(wait));
+   inode_unlock(d_inode(wait));
+   dput(wait);
}
} while (ret == -EAGAIN);
 
-- 
2.8.0.rc3



[PATCH 04/15] ocfs2: don't open-code inode_lock/inode_unlock

2016-04-15 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/ocfs2/aops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1581240..f048a33 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2311,7 +2311,7 @@ static void ocfs2_dio_end_io_write(struct inode *inode,
/* ocfs2_file_write_iter will get i_mutex, so we need not lock if we
 * are in that context. */
if (dwc->dw_writer_pid != task_pid_nr(current)) {
-   mutex_lock(&inode->i_mutex);
+   inode_lock(inode);
locked = 1;
}
 
@@ -2390,7 +2390,7 @@ out:
ocfs2_free_alloc_context(meta_ac);
	ocfs2_run_deallocs(osb, &dealloc);
if (locked)
-   mutex_unlock(&inode->i_mutex);
+   inode_unlock(inode);
ocfs2_dio_free_write_ctx(inode, dwc);
 }
 
-- 
2.8.0.rc3



RE: [PATCH net-next 2/2] intel: ixgbevf: Support Windows hosts (Hyper-V)

2016-04-15 Thread KY Srinivasan


> -Original Message-
> From: KY Srinivasan
> Sent: Friday, April 15, 2016 9:01 AM
> To: 'Alexander Duyck' 
> Cc: David Miller ; Netdev
> ; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; Robo Bot
> ; Jason Wang ;
> e...@mellanox.com; ja...@mellanox.com; yevge...@mellanox.com; John
> Ronciak ; intel-wired-lan  l...@lists.osuosl.org>
> Subject: RE: [PATCH net-next 2/2] intel: ixgbevf: Support Windows hosts
> (Hyper-V)
> 
> 
> 
> > -Original Message-
> > From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> > Sent: Friday, April 15, 2016 8:40 AM
> > To: KY Srinivasan 
> > Cc: David Miller ; Netdev
> > ; linux-kernel@vger.kernel.org;
> > de...@linuxdriverproject.org; o...@aepfle.de; Robo Bot
> > ; Jason Wang ;
> > e...@mellanox.com; ja...@mellanox.com; yevge...@mellanox.com; John
> > Ronciak ; intel-wired-lan  > l...@lists.osuosl.org>
> > Subject: Re: [PATCH net-next 2/2] intel: ixgbevf: Support Windows hosts
> > (Hyper-V)
> >
> > On Thu, Apr 14, 2016 at 7:49 PM, KY Srinivasan 
> wrote:
> > >
> > >
> > >> -Original Message-
> > >> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> > >> Sent: Thursday, April 14, 2016 4:18 PM
> > >> To: KY Srinivasan 
> > >> Cc: David Miller ; Netdev
> > >> ; linux-kernel@vger.kernel.org;
> > >> de...@linuxdriverproject.org; o...@aepfle.de; Robo Bot
> > >> ; Jason Wang ;
> > >> e...@mellanox.com; ja...@mellanox.com; yevge...@mellanox.com;
> > John
> > >> Ronciak ; intel-wired-lan  > >> l...@lists.osuosl.org>
> > >> Subject: Re: [PATCH net-next 2/2] intel: ixgbevf: Support Windows
> hosts
> > >> (Hyper-V)
> > >>
> > >> On Thu, Apr 14, 2016 at 4:55 PM, K. Y. Srinivasan 
> > >> wrote:
> > >> > On Hyper-V, the VF/PF communication is via a software-mediated path
> > >> > as opposed to the hardware mailbox. Make the necessary
> > >> > adjustments to support Hyper-V.
> > >> >
> > >> > Signed-off-by: K. Y. Srinivasan 
> > >> > ---
> > >> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   11 ++
> > >> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   56 ++---
> > >> >  drivers/net/ethernet/intel/ixgbevf/mbx.c  |   12 ++
> > >> >  drivers/net/ethernet/intel/ixgbevf/vf.c   |  138
> > >> +
> > >> >  drivers/net/ethernet/intel/ixgbevf/vf.h   |2 +
> > >> >  5 files changed, 201 insertions(+), 18 deletions(-)
> > >> >
> > >> > diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> > >> > index 5ac60ee..f8d2a0b 100644
> > >> > --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> > >> > +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> > >> > @@ -460,9 +460,13 @@ enum ixbgevf_state_t {
> > >> >
> > >> >  enum ixgbevf_boards {
> > >> > board_82599_vf,
> > >> > +   board_82599_vf_hv,
> > >> > board_X540_vf,
> > >> > +   board_X540_vf_hv,
> > >> > board_X550_vf,
> > >> > +   board_X550_vf_hv,
> > >> > board_X550EM_x_vf,
> > >> > +   board_X550EM_x_vf_hv,
> > >> >  };
> > >> >
> > >> >  enum ixgbevf_xcast_modes {
> > >> > @@ -477,6 +481,13 @@ extern const struct ixgbevf_info
> > >> ixgbevf_X550_vf_info;
> > >> >  extern const struct ixgbevf_info ixgbevf_X550EM_x_vf_info;
> > >> >  extern const struct ixgbe_mbx_operations ixgbevf_mbx_ops;
> > >> >
> > >> > +
> > >> > +extern const struct ixgbevf_info ixgbevf_82599_vf_hv_info;
> > >> > +extern const struct ixgbevf_info ixgbevf_X540_vf_hv_info;
> > >> > +extern const struct ixgbevf_info ixgbevf_X550_vf_hv_info;
> > >> > +extern const struct ixgbevf_info ixgbevf_X550EM_x_vf_hv_info;
> > >> > +extern const struct ixgbe_mbx_operations ixgbevf_hv_mbx_ops;
> > >> > +
> > >> >  /* needed by ethtool.c */
> > >> >  extern const char ixgbevf_driver_name[];
> > >> >  extern const char ixgbevf_driver_version[];
> > >> > diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > >> > index 007cbe0..4a0ffac 100644
> > >> > --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > >> > +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > >> > @@ -49,6 +49,7 @@
> > >> >  #include 
> > >> >  #include 
> > >> >  #include 
> > >> > +#include 
> > >> >
> > >> >  #include "ixgbevf.h"
> > >> >
> > >> > @@ -62,10 +63,14 @@ static char ixgbevf_copyright[] =
> > >> > "Copyright (c) 2009 - 2015 Intel Corporation.";
> > >> >
> > >> >  static const struct ixgbevf_info *ixgbevf_info_tbl[] = {
> > >> > -   [board_82599_vf] = &ixgbevf_82599_vf_info,
> > >> > -   [board_X540_vf]  = &ixgbevf_X540_vf_info,
> > >> > -   [board_X550_vf]  = &ixgbevf_X550_vf_info,
> > >> > -   [board_X550EM_x_vf] = &ixgbevf_X550EM_x_vf_info,
> > >> > +   [board_82599_vf]= &ixgbevf_82599_vf_info,
> > >> > +   [board_82599_vf_hv] = &ixgbevf_82599_vf_hv_info,
> > >> > +   [board_X540_vf] = &ixgbevf_X540_vf_info,
> > >> > +   [board_X540_vf_hv]  = &ixgbevf_X540_vf_hv_info,
> > >> > +   

[PATCHSET][RFC][CFT] parallel lookups

2016-04-15 Thread Al Viro
The thing appears to be working.  It's in vfs.git#work.lookups; the
last 5 commits are the infrastructure (fs/namei.c and fs/dcache.c; no changes
in fs/*/*) + actual switch to rwsem.

The missing bits: down_write_killable() (there had been a series
posted introducing just that; for now I've replaced mutex_lock_killable()
calls with plain inode_lock() - they are not critical for any testing and
as soon as down_write_killable() gets there I'll replace those), lockdep
bits might need corrections and right now it's only for lookups.

I'm going to add readdir to the mix; the primitive added in this
series (d_alloc_parallel()) will need to be used in dcache pre-seeding
paths, ncpfs use of dentry_update_name_case() will need to be changed to
something less hacky and syscalls calling iterate_dir() will need to
switch to fdget_pos() (with FMODE_ATOMIC_POS set for directories as well
as regulars).  The last bit is needed for exclusion on struct file
level - there's a bunch of cases where we maintain data structures
hanging off file->private and those really need to be serialized.  Besides,
serializing ->f_pos updates is needed for sane semantics; right now we
tend to use ->i_mutex for that, but it would be easier to go for the same
mechanism as for regular files.  With any luck we'll have working parallel
readdir in addition to parallel lookups in this cycle as well.
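
To make the fdget_pos() part concrete, the syscall side would end up
shaped roughly like this (a sketch, not the eventual patch; fdget_pos()
takes ->f_pos_lock for FMODE_ATOMIC_POS files, which is what makes
concurrent getdents on a shared struct file sane):

	SYSCALL_DEFINE3(getdents_sketch, unsigned int, fd,
			struct linux_dirent __user *, dirent,
			unsigned int, count)
	{
		struct getdents_callback buf = {
			.ctx.actor = filldir,
			.count = count,
			.current_dir = dirent
		};
		struct fd f = fdget_pos(fd);	/* instead of fdget() */
		int error;

		if (!f.file)
			return -EBADF;
		error = iterate_dir(f.file, &buf.ctx);
		if (error >= 0)
			error = buf.error;
		fdput_pos(f);			/* drops ->f_pos_lock */
		return error;
	}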

The patchset is on top of switching getxattr to passing dentry and
inode separately; that part will get changes (in particular, the stuff
agruen has posted lately), but the lookups queue proper cares only about
being able to move security_d_instantiate() to the point before dentry
is attached to inode.

1/15: security_d_instantiate(): move to the point prior to attaching dentry
to inode.  Depends on getxattr changes, allows to do the "attach to inode"
and "add to dentry hash" parts without dropping ->d_lock in between.

2/15 -- 8/15: preparations - stuff similar to what went in during the last
cycle; several places switched to lookup_one_len_unlocked(), a bunch of
direct manipulations of ->i_mutex replaced with inode_lock, etc. helpers.

kernfs: use lookup_one_len_unlocked().
configfs_detach_prep(): make sure that wait_mutex won't go away
ocfs2: don't open-code inode_lock/inode_unlock
orangefs: don't open-code inode_lock/inode_unlock
reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
reconnect_one(): use lookup_one_len_unlocked()
ovl_lookup_real(): use lookup_one_len_unlocked()

9/15: lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
open-code real_lookup() call in lookup_slow(), move IS_DEADDIR check upwards.

10/15: __d_add(): don't drop/regain ->d_lock
that's what 1/15 had been for; might make sense to reorder closer to it.

11/15 -- 14/15: actual machinery for parallel lookups.  This stuff could've
been a single commit, along with the actual switch to rwsem and shared lock
in lookup_slow(), but it's easier to review if carved up like that.  From the
testing POV it's one chunk - it is bisect-safe, but the added code really
comes into play only after we go for shared lock, which happens in 15/15.
That's the core of the series.

beginning of transition to parallel lookups - marking in-lookup dentries
parallel lookups machinery, part 2
parallel lookups machinery, part 3
parallel lookups machinery, part 4 (and last)

15/15: parallel lookups: actual switch to rwsem

Note that filesystems would be free to switch some of their own uses of
inode_lock() to grabbing it shared - it's really up to them.  This series
works only with directories locking, but this field has become an rwsem
for all inodes.  XFS folks in particular might be interested in using it...
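
For illustration, "grabbing it shared" is just the read side of that
rwsem; a sketch of the helpers (assuming the i_rwsem field name and the
inode_lock_shared()/inode_unlock_shared() wrappers that come with the
final patch):

	static inline void inode_lock_shared(struct inode *inode)
	{
		down_read(&inode->i_rwsem);
	}

	static inline void inode_unlock_shared(struct inode *inode)
	{
		up_read(&inode->i_rwsem);
	}

A filesystem's read-mostly path could then do
inode_lock_shared(inode); ... inode_unlock_shared(inode); and let
multiple readers through where ->i_mutex used to serialize them.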

I'll post the individual patches in followups.  Again, this is also available
in vfs.git #work.lookups (head at e2d622a right now).  The thing survives
LTP and xfstests without regressions, but more testing would certainly be
appreciated.  So would review, of course.



Re: [PATCH] clk: ti: dra7-atl-clock: Fix of_node reference counting

2016-04-15 Thread Stephen Boyd
On 03/11, Peter Ujfalusi wrote:
> of_find_node_by_name() will call of_node_put() on the node so we need to
> get it first to avoid warnings.
> The cfg_node needs to be put after we have finished processing the
> properties.
> 
> Signed-off-by: Peter Ujfalusi 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project



Re: [PATCH 1/5] clk: ti: am335x/am4372: Add tbclk to pwm node

2016-04-15 Thread Stephen Boyd
On 03/07, Franklin S Cooper Jr wrote:
> Add tbclk to the pwm nodes. This ensures that the ehrpwm driver has access
> to the time-based clk.
> 
> Do not remove similar entries for ehrpwm node. Later patches will switch
> from using ehrpwm node name to pwm. But to maintain ABI compatibility we
> shouldn't remove the old entries.
> 
> Signed-off-by: Franklin S Cooper Jr 
> ---

Acked-by: Stephen Boyd 

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project



Re: [RESEND PATCH v10 2/6] clk: hisilicon: add CRG driver for hi3519 soc

2016-04-15 Thread Stephen Boyd
On 04/15, Jiancheng Xue wrote:
> Hi,
> 
> On 2016/3/31 16:10, Jiancheng Xue wrote:
> > From: Jiancheng Xue 
> > 
> > The CRG(Clock and Reset Generator) block provides clock
> > and reset signals for other modules in hi3519 soc.
> > 
> > Signed-off-by: Jiancheng Xue 
> > Acked-by: Rob Herring 
> > Acked-by: Philipp Zabel 
> > ---
> I hope this patchset can be merged through the arch/arm tree.
> The dts binding part has been acked by Rob Herring, and
> the reset part has been acked by Philipp Zabel. Could you
> help me to ack this whole clk patch? Please also let me
> know if this patch still has issues. Thank you very much!

Can I merge it through the clk tree and make a stable branch to pull
through arm-soc? I assume another patch is coming but it's good
to get clarity before then.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project



Re: [RESEND PATCH v10 2/6] clk: hisilicon: add CRG driver for hi3519 soc

2016-04-15 Thread Stephen Boyd
On 03/31, Jiancheng Xue wrote:
> diff --git a/drivers/clk/hisilicon/clk-hi3519.c b/drivers/clk/hisilicon/clk-hi3519.c
> new file mode 100644
> index 0000000..ee9df82
> --- /dev/null
> +++ b/drivers/clk/hisilicon/clk-hi3519.c
> @@ -0,0 +1,129 @@
> +/*
> + * Hi3519 Clock Driver
> + *
> + * Copyright (c) 2015-2016 HiSilicon Technologies Co., Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include 
> +#include 
> +#include 

Please include  here.

> +#include "clk.h"
> +#include "reset.h"
> +
> +#define HI3519_INNER_CLK_OFFSET  64
> +#define HI3519_FIXED_24M 65
> +#define HI3519_FIXED_50M 66
> +#define HI3519_FIXED_75M 67
> +#define HI3519_FIXED_125M68
> +#define HI3519_FIXED_150M69
> +#define HI3519_FIXED_200M70
> +#define HI3519_FIXED_250M71
> +#define HI3519_FIXED_300M72
> +#define HI3519_FIXED_400M73
> +#define HI3519_FMC_MUX   74
> +
> +#define HI3519_NR_CLKS   128
> +
> +static const struct hisi_fixed_rate_clock hi3519_fixed_rate_clks[] = {
> + { HI3519_FIXED_24M, "24m", NULL, CLK_IS_ROOT, 24000000, },
> + { HI3519_FIXED_50M, "50m", NULL, CLK_IS_ROOT, 50000000, },
> + { HI3519_FIXED_75M, "75m", NULL, CLK_IS_ROOT, 75000000, },
> + { HI3519_FIXED_125M, "125m", NULL, CLK_IS_ROOT, 125000000, },
> + { HI3519_FIXED_150M, "150m", NULL, CLK_IS_ROOT, 150000000, },
> + { HI3519_FIXED_200M, "200m", NULL, CLK_IS_ROOT, 200000000, },
> + { HI3519_FIXED_250M, "250m", NULL, CLK_IS_ROOT, 250000000, },
> + { HI3519_FIXED_300M, "300m", NULL, CLK_IS_ROOT, 300000000, },
> + { HI3519_FIXED_400M, "400m", NULL, CLK_IS_ROOT, 400000000, },

CLK_IS_ROOT is dead. Please remove it.
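
Dropping the flag is mechanical - for a fixed-rate clock a NULL parent
already marks it as a root clock, so the flags field can simply go to 0.
A sketch, reusing the hisi_fixed_rate_clock layout from this patch:

	static const struct hisi_fixed_rate_clock fixed_rate_sketch[] = {
		/* NULL parent_name is what makes these root clocks */
		{ HI3519_FIXED_24M, "24m", NULL, 0, 24000000, },
		{ HI3519_FIXED_50M, "50m", NULL, 0, 50000000, },
	};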

> +};
> +
> +static const char *const fmc_mux_p[] = {
> + "24m", "75m", "125m", "150m", "200m", "250m", "300m", "400m", };
> +static u32 fmc_mux_table[] = {0, 1, 2, 3, 4, 5, 6, 7};
> +
> +static const struct hisi_mux_clock hi3519_mux_clks[] = {
> + { HI3519_FMC_MUX, "fmc_mux", fmc_mux_p, ARRAY_SIZE(fmc_mux_p),
> + CLK_SET_RATE_PARENT, 0xc0, 2, 3, 0, fmc_mux_table, },
> +};
> +
> +static const struct hisi_gate_clock hi3519_gate_clks[] = {
> + { HI3519_FMC_CLK, "clk_fmc", "fmc_mux",
> + CLK_SET_RATE_PARENT, 0xc0, 1, 0, },
> + { HI3519_UART0_CLK, "clk_uart0", "24m",
> + CLK_SET_RATE_PARENT, 0xe4, 20, 0, },
> + { HI3519_UART1_CLK, "clk_uart1", "24m",
> + CLK_SET_RATE_PARENT, 0xe4, 21, 0, },
> + { HI3519_UART2_CLK, "clk_uart2", "24m",
> + CLK_SET_RATE_PARENT, 0xe4, 22, 0, },
> + { HI3519_UART3_CLK, "clk_uart3", "24m",
> + CLK_SET_RATE_PARENT, 0xe4, 23, 0, },
> + { HI3519_UART4_CLK, "clk_uart4", "24m",
> + CLK_SET_RATE_PARENT, 0xe4, 24, 0, },
> + { HI3519_SPI0_CLK, "clk_spi0", "50m",
> + CLK_SET_RATE_PARENT, 0xe4, 16, 0, },
> + { HI3519_SPI1_CLK, "clk_spi1", "50m",
> + CLK_SET_RATE_PARENT, 0xe4, 17, 0, },
> + { HI3519_SPI2_CLK, "clk_spi2", "50m",
> + CLK_SET_RATE_PARENT, 0xe4, 18, 0, },
> +};
> +
> +static int hi3519_clk_probe(struct platform_device *pdev)
> +{
> + struct device_node *np = pdev->dev.of_node;
> + struct hisi_clock_data *clk_data;
> +
> + clk_data = hisi_clk_init(np, HI3519_NR_CLKS);
> + if (!clk_data)
> + return -ENODEV;
> +
> + hisi_clk_register_fixed_rate(hi3519_fixed_rate_clks,
> +  ARRAY_SIZE(hi3519_fixed_rate_clks),
> +  clk_data);
> + hisi_clk_register_mux(hi3519_mux_clks, ARRAY_SIZE(hi3519_mux_clks),
> + clk_data);
> + hisi_clk_register_gate(hi3519_gate_clks,
> + ARRAY_SIZE(hi3519_gate_clks), clk_data);
> +
> + return hisi_reset_init(np);

Now that this is a platform driver we need to do lots of cleanup
in error cases. I mean we need to unregister clks, OF clk
providers, and reset controllers. Please add all that code too
(a rough sketch of the shape follows at the end of this mail).

> +}
> +
> +static const struct of_device_id hi3519_clk_match_table[] = {
> + { .compatible = "hisilicon,hi3519-crg" },
> + { }
> +};
> +MODULE_DEVICE_TABLE(of, hi3519_clk_match_table);
> +
> +static struct platform_driver hi3519_clk_driver = {
> + .probe  = hi3519_clk_probe,
> + .driver = {
> + 
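
The unwind asked for above would look roughly like this (a sketch only -
hisi_clk_unregister() is a hypothetical helper, not the driver's actual
API; the real cleanup would call clk_unregister() per registered clock,
or switch to devm):

	static int hi3519_clk_probe(struct platform_device *pdev)
	{
		struct device_node *np = pdev->dev.of_node;
		struct hisi_clock_data *clk_data;
		int ret;

		clk_data = hisi_clk_init(np, HI3519_NR_CLKS);
		if (!clk_data)
			return -ENODEV;

		hisi_clk_register_fixed_rate(hi3519_fixed_rate_clks,
					     ARRAY_SIZE(hi3519_fixed_rate_clks),
					     clk_data);
		hisi_clk_register_mux(hi3519_mux_clks,
				      ARRAY_SIZE(hi3519_mux_clks), clk_data);
		hisi_clk_register_gate(hi3519_gate_clks,
				       ARRAY_SIZE(hi3519_gate_clks), clk_data);

		ret = hisi_reset_init(np);
		if (ret)
			goto err_unregister;
		return 0;

	err_unregister:
		of_clk_del_provider(np);	/* undo what hisi_clk_init() set up */
		hisi_clk_unregister(clk_data);	/* hypothetical per-clk unwind */
		return ret;
	}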

