Re: [PATCH] x86/kvm: disable fast MMIO when running nested

2018-01-24 Thread Wanpeng Li
2018-01-24 23:12 GMT+08:00 Vitaly Kuznetsov :
> I was investigating an issue with seabios >= 1.10 which stopped working
> for nested KVM on Hyper-V. The problem appears to be in
> handle_ept_violation() function: when we do fast mmio we need to skip
> the instruction so we do kvm_skip_emulated_instruction(). This, however,
> depends on the VM_EXIT_INSTRUCTION_LEN field being set correctly in the
> VMCS, which is not always the case.
>
> Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
> EPT MISCONFIG occurs. While on real hardware it was observed to be set,
> some hypervisors follow the spec and don't set it; we end up advancing
> IP with some random value.
>
> I checked with Microsoft and they confirmed they don't fill
> VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.
>
> Fix the issue by disabling fast mmio when running nested.
>
> Signed-off-by: Vitaly Kuznetsov 

Reviewed-by: Wanpeng Li 

> ---
>  arch/x86/kvm/vmx.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c829d89e2e63..54afb446f38e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6558,9 +6558,16 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> /*
>  * A nested guest cannot optimize MMIO vmexits, because we have an
>  * nGPA here instead of the required GPA.
> +* Skipping instruction below depends on undefined behavior: Intel's
> +* manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set in VMCS
> +* when EPT MISCONFIG occurs and while on real hardware it was observed
> +* to be set, other hypervisors (namely Hyper-V) don't set it, we end
> +* up advancing IP with some random value. Disable fast mmio when
> +* running nested and keep it for real hardware in hope that
> +* VM_EXIT_INSTRUCTION_LEN will always be set correctly.
>  */
> gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> -   if (!is_guest_mode(vcpu) &&
> +   if (!static_cpu_has(X86_FEATURE_HYPERVISOR) && !is_guest_mode(vcpu) &&
> !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> trace_kvm_fast_mmio(gpa);
> return kvm_skip_emulated_instruction(vcpu);
> --
> 2.14.3
>


Re: tip/master falls off NOP cliff with KPTI under KVM

2018-01-24 Thread David Woodhouse
On Wed, 2018-01-24 at 16:53 -0800, Dexuan-Linux Cui wrote:
> On Wed, Jan 10, 2018 at 2:53 PM, Woodhouse, David  wrote:
> > 
> > On Thu, 2018-01-11 at 01:34 +0300, Alexey Dobriyan wrote:
> > > 
> > > 
> > > Bisection points to
> > > 
> > > f3433c1010c6af61c9897f0f0447f81b991feac1 is the first bad commit
> > > commit f3433c1010c6af61c9897f0f0447f81b991feac1
> > > Author: David Woodhouse 
> > > Date:   Tue Jan 9 14:43:11 2018 +
> > > 
> > > x86/retpoline/entry: Convert entry assembler indirect jumps
> > Thanks. We've fixed the underlying problem with the alternatives
> > mechanism, *and* changed the retpoline code not to actually rely on
> > said fix.
> Hi David and all,
> It looks the latest upstream tree did fix the issue.
> Can you please specify the related commit(s) that fixed the issue?
> I need to cherry-pick the fix. I suppose a quick reply from you would save
>  me a lot of time. :-)

Hi Dexuan,

The above commit didn't ever make it into Linus' tree in that form; the
issues were fixed beforehand. So there isn't a subsequent commit that
fixes it. I think it might have just been removing some .align 16 from
nospec-branch.h?

The correct version has been backported to 4.9 and 4.4 releases
already; if you pulled in an early version directly from tip/x86/pti
then I'd recommend you drop it and pull in the real version instead.




Re: [PATCH 5/6] Documentation for Pmalloc

2018-01-24 Thread Igor Stoppa


On 24/01/18 21:14, Ralph Campbell wrote:
> 2 Minor typos inline below:

thanks for proof-reading, will fix accordingly.

--
igor


Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Jia-Ju Bai



On 2018/1/25 12:16, Al Viro wrote:

On Thu, Jan 25, 2018 at 11:13:56AM +0800, Jia-Ju Bai wrote:


I have checked the given call chain, and find that nvme_dev_disable in
nvme_timeout calls mutex_lock that can sleep.
Thus, I suppose this call chain is not in atomic context.

... or it is broken.


Besides, how do you find that "function (nvme_timeout()) strongly suggests
that it *is* meant to be called from bloody atomic context"?
I check the comments in nvme_timeout, and do not find related description...

Anything that reads registers for controller state presumably won't be
happy if it can happen in parallel with other threads poking the same
hardware.  Not 100% guaranteed, but it's a fairly strong sign that there's
some kind of exclusion between whatever's submitting requests / handling
interrupts and the caller of that thing.  And such exclusion is likely
to be spin_lock_irqsave()-based.

Again, that does not _prove_ it's called from atomic contexts, but does
suggest such possibility.

Looking through the callers of that method, blk_abort_request() certainly
*is* called from under queue lock.  Different drivers, though.  No idea
if nvme_timeout() blocking case is broken - I'm not familiar with that
code.  Question should go to nvme maintainers...

However, digging through other call chains, there's this:
drivers/md/dm-mpath.c:530:  clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
in multipath_clone_and_map(), aka. ->clone_and_map_rq(), called at
drivers/md/dm-rq.c:480: r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
in map_request(), which is called from dm_mq_queue_rq(), aka ->queue_rq(),
which is called from blk_mq_dispatch_rq_list(), called from
blk_mq_do_dispatch_sched(), called from blk_mq_sched_dispatch_requests(),
called under rcu_read_lock().  Not a context where you want GFP_KERNEL
allocations...


By the way, do you mean that I should add "My tool has proved that this
function is never called in atomic context" in the description?

I mean that the proof itself should be at least outlined.  Crediting the tool
for finding the proof is fine *IF* it's done along with the proof itself.

You want to convince the people applying the patch that it is correct.
Leaving out something trivial to verify is fine - "foo_bar_baz() has
no callers" doesn't require grep output + quoting the area around each
instance to prove that all of them are in the comments, etc.; readers
can bloody well check the accuracy of that claim themselves.  This
kind of analysis, however, is decidedly *NOT* trivial to verify from
scratch.

Moreover, if what you've proven is that for each call chain leading
to that place there's a blocking operation nearby, there is still
a possibility that some of those *are* called while in non-blocking
context.  In that case you've found real bugs, and strictly speaking
your change doesn't break correct code.  However, it does not make
the change itself correct - if you have something like
enter non-blocking section
.
in very unusual cases grab a mutex (or panic, or...)
.
do GFP_ATOMIC allocation
.
leave non-blocking section
changing that to GFP_KERNEL will turn "we deadlock in very hard to
hit case" into "we deadlock easily"...

At the very least, I'd like to see those cutoffs - i.e. the places
that already could block on the callchains.  You might very well
have found actual bugs there.


Okay, thanks for your detailed explanation :)
I admit that my report here is not correct, and I will improve my tool.


Thanks,
Jia-Ju Bai


Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-01-24 Thread Amir Goldstein
On Thu, Jan 25, 2018 at 3:08 AM, Shakeel Butt  wrote:
> On Wed, Jan 24, 2018 at 3:12 AM, Amir Goldstein  wrote:
>> On Wed, Jan 24, 2018 at 12:34 PM, Jan Kara  wrote:
>>> On Mon 22-01-18 22:31:20, Amir Goldstein wrote:
 On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt  wrote:
 > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara  wrote:
 >> On Wed 15-11-17 01:32:16, Yang Shi wrote:
 >>> On 11/14/17 1:39 AM, Michal Hocko wrote:
 >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
 >>> >>
 >>> >>
 >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
 >>> >>>[Sorry for the late reply]
 >>> >>>
 >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
 >>> On Tue 31-10-17 00:39:58, Yang Shi wrote:
 >>> >>>[...]
 >>> >I do agree it is not fair and not neat to account to producer 
 >>> >rather than
 >>> >misbehaving consumer, but current memcg design looks not support 
 >>> >such use
 >>> >case. And, the other question is do we know who is the listener 
 >>> >if it
 >>> >doesn't read the events?
 >>> 
 >>> So you never know who will read from the notification file 
 >>> descriptor but
 >>> you can simply account that to the process that created the 
 >>> notification
 >>> group and that is IMO the right process to account to.
 >>> >>>
 >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
 >>> >>>those objects then this should be a target of the charge.
 >>> >>>
 >>> I agree that current SLAB memcg accounting does not allow to 
 >>> account to a
 >>> different memcg than the one of the running process. However I 
 >>> *think* it
 >>> should be possible to add such interface. Michal?
 >>> >>>
 >>> >>>We do have memcg_kmem_charge_memcg but that would require some 
 >>> >>>plumbing
 >>> >>>to hook it into the specific allocation path. I suspect it uses 
 >>> >>>kmalloc,
 >>> >>>right?
 >>> >>
 >>> >>Yes.
 >>> >>
 >>> >>I took a look at the implementation and the callsites of
 >>> >>memcg_kmem_charge_memcg(). It looks it is called by:
 >>> >>
 >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
 >>> >>* allocate new slab page, charge to memcg_params.memcg
 >>> >>
 >>> >>I think this is the plumbing you mentioned, right?
 >>> >
 >>> >Maybe I have misunderstood, but you are using slab allocator. So you
 >>> >would need to force it to use a different charging context than 
 >>> >current.
 >>>
 >>> Yes.
 >>>
 >>> >I haven't checked deeply but this doesn't look trivial to me.
 >>>
 >>> I agree. This is also what I explained to Jan and Amir in earlier
 >>> discussion.
 >>
 >> And I also agree. But the fact that it is not trivial does not mean 
 >> that it
 >> should not be done...
 >>
 >
 > I am currently working on directed or remote memcg charging for a
 > different usecase and I think that would be helpful here as well.
 >
 > I have two questions though:
 >
 > 1) Is fsnotify_group the right structure to hold the reference to
 > target mem_cgroup for charging?

 I think it is. The process who set up the group and determined the 
 unlimited
 events queue size and did not consume the events from the queue in a timely
 manner is the process to blame for the OOM situation.
>>>
>>> Agreed here.
>
> Please note that for fcntl(F_NOTIFY), a global group, dnotify_group,
> is used. The allocations from dnotify_struct_cache &
> dnotify_mark_cache happens in the fcntl(F_NOTIFY), so , I take that
> the memcg of the current process should be charged.

Correct. Note that dnotify_struct_cache is allocated when setting up the
watch. That is always done by the listener, which is the correct process to
charge. Handling the event does not allocate an event struct, so there is
no issue here.

>
>>>
 > 2) Remote charging can trigger an OOM in the target memcg. In this
 > usecase, I think, there should be security concerns if the events
 > producer can trigger OOM in the memcg of the monitor. We can either
 > change these allocations to use __GFP_NORETRY or some new gfp flag to
 > not trigger oom-killer. So, is this valid concern or am I
 > over-thinking?
>
> First, let me apologize, I think I might have led the discussion in
> wrong direction by giving one wrong information. The current upstream
> kernel, from the syscall context, does not invoke oom-killer when a
> memcg hits its limit and fails to reclaim memory, instead ENOMEM is
> returned. The memcg oom-killer is only invoked on page faults. However
> in a separate effort I do plan to converge the behavior, long
> discussion at .
>

So I guess it would be better if we limit the discussion on this
thread to which memcg should be charged.

Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns

2018-01-24 Thread vcaputo
On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  :
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  :
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcap...@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux 
> >> >> git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a 
> >> >> problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built 
> >> >> kernels
> >> >> in my grub menu, it was trivial to determine approximately when this 
> >> >> began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and 
> >> >> quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray 
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >> dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >> Running dm-crypt with workqueues at the standard priority results 
> >> >> in IO
> >> >> competing for CPU time with standard user apps, which can lead to
> >> >> pipeline bubbles and seriously degraded performance.  Move to using
> >> >> WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >> Signed-off-by: Tim Murray 
> >> >> Signed-off-by: Enric Balletbo i Serra 
> >> >> Signed-off-by: Mike Snitzer 
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message 
> >> >> isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority 
> >> >> and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're 
> >> >> lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf 
> >> >> of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by 
> >> >> generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its 
> >> >> core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially 
> >> >> by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being 
> >> >> done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and 
> >> >> scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async 
> >> >> background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on 
> >> >> dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left 
> >> > wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you ping me directly. It's
> >> interesting that issue, though we didn't notice this problem. It's a
> >> bit far since I tested this patch but I'll setup the environment again
> >> and do more tests to understand better what is happening.
> >>
> >
> > Any update on this?
> >
> 
I did not reproduce the issue for now. Can you try what happens if you
remove the WQ_CPU_INTENSIVE flag?

Re: [PATCH] ARM: dts: use a correct at24 fallback for at91-nattis-2-natte-2

2018-01-24 Thread Bartosz Golaszewski
2018-01-24 23:12 GMT+01:00 Peter Rosin :
> Hi Bartosz,
>
> On 2018-01-24 22:34, Bartosz Golaszewski wrote:
>> We now require all at24 users to use the "atmel," fallback in
>> device tree for different manufacturers.
>
> I think my patch [3/4] from about a week ago was just a tiny bit
> better.
> https://lkml.org/lkml/2018/1/16/609
>
> Can we please pick that one directly instead of doing this partial
> step?
>
> Cheers,
> Peter

Hi Peter,

oh yes, sorry - I just grepped the entire code base from next for
invalid compatibles and didn't notice this was already fixed by you.
Let's drop it.

Thanks,
Bartosz


Re: [PATCH v2 4/4] RISC-V: Move to the new generic IRQ handler

2018-01-24 Thread Christoph Hellwig
On Wed, Jan 24, 2018 at 07:07:56PM -0800, Palmer Dabbelt wrote:
> The old mechanism for handling IRQs on RISC-V was pretty ugly: the arch
> code looked at the Kconfig entry for our first-level irqchip driver and
> called into it directly.
> 
> This patch uses the new generic IRQ handling infrastructure, which
> essentially just deletes a bunch of code.  This does add an additional
> load to the interrupt latency, but there's a lot of tuning left to be
> done there on RISC-V so I think it's OK for now.
> 
> Signed-off-by: Palmer Dabbelt 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v2 1/4] arm: Make set_handle_irq and handle_arch_irq generic

2018-01-24 Thread Christoph Hellwig
On Wed, Jan 24, 2018 at 07:07:53PM -0800, Palmer Dabbelt wrote:
> It looks like this same irqchip registration mechanism has been copied
> into a handful of ports, including aarch64 and openrisc.  I want to use
> this in the RISC-V port, so I thought it would be good to make this
> generic instead.
> 
> This patch simply moves set_handle_irq and handle_arch_irq from arch/arm
> to kernel/irq/handle.c.

The two important changes here are that:

 a) the handle_arch_irq definition is moved from assembly to C code
 b) it is now marked __ro_after_init

Those should be prominently mentioned in the changelog, and for a)
we probably need an explicit ACK from the ARM folks.


Re: [PATCH v2 3/4] openrisc: Use the new MULTI_IRQ_HANDLER

2018-01-24 Thread Christoph Hellwig
On Wed, Jan 24, 2018 at 07:07:55PM -0800, Palmer Dabbelt wrote:
> It appears that openrisc copied arm64's MULTI_IRQ_HANDLER code (which
> came from arm).  I wanted to make this generic so I could use it in the
> RISC-V port.  This patch converts the openrisc code to use the generic
> version.

Note that openrisc overrides previous handle_arch_irq assignments.
We'll need to know from the openrisc folks if that was intentional.

Otherwise this looks fine to me.


Re: [PATCH v2 2/4] arm64: Use the new MULTI_IRQ_HANDLER

2018-01-24 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


[PATCH] media: leds: as3645a: Add CONFIG_OF support

2018-01-24 Thread Akash Gajjar
From: Akash Gajjar 

With this change, the driver builds with CONFIG_OF support.

Signed-off-by: Akash Gajjar 
---
 drivers/media/i2c/as3645a.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/media/i2c/as3645a.c b/drivers/media/i2c/as3645a.c
index af5db71..24233fa 100644
--- a/drivers/media/i2c/as3645a.c
+++ b/drivers/media/i2c/as3645a.c
@@ -858,6 +858,14 @@ static int as3645a_remove(struct i2c_client *client)
 };
 MODULE_DEVICE_TABLE(i2c, as3645a_id_table);
 
+#if IS_ENABLED(CONFIG_OF)
+static const struct of_device_id as3645a_of_match[] = {
+   { .compatible = "ams,as3645a", },
+   { /* sentinel */ },
+};
+MODULE_DEVICE_TABLE(of, as3645a_of_match);
+#endif
+
 static const struct dev_pm_ops as3645a_pm_ops = {
.suspend = as3645a_suspend,
.resume = as3645a_resume,
@@ -867,6 +875,7 @@ static int as3645a_remove(struct i2c_client *client)
.driver = {
.name = AS3645A_NAME,
.pm   = &as3645a_pm_ops,
+   .of_match_table = of_match_ptr(as3645a_of_match),
},
.probe  = as3645a_probe,
.remove = as3645a_remove,
-- 
1.9.1



[PATCH net-next] ptr_ring: fix integer overflow

2018-01-24 Thread Jason Wang
We try to allocate one more entry for lockless peeking. This addition
may overflow, causing zero to be passed to kmalloc(), which then
returns ZERO_SIZE_PTR without ptr_ring noticing. Any attempt to produce
to or consume from such a ring will lead to a NULL pointer dereference.
Fix this by detecting the overflow and failing early.

Fixes: bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun array bounds")
Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
Cc: John Fastabend 
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9ca1726..3f99484 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -453,6 +453,8 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
+   if (unlikely(size + 1 == 0))
+   return NULL;
/* Allocate an extra dummy element at end of ring to avoid consumer head
 * or produce head access past the end of the array. Possible when
 * producer/consumer operations and __ptr_ring_peek operations run in
-- 
2.7.4



Re: [PATCH RFC 01/16] prcu: Add PRCU implementation

2018-01-24 Thread Boqun Feng
On Wed, Jan 24, 2018 at 10:16:18PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 23, 2018 at 03:59:26PM +0800, liangli...@huawei.com wrote:
> > From: Heng Zhang 
> > 
> > This RCU implementation (PRCU) is based on a fast consensus protocol
> > published in the following paper:
> > 
> > Fast Consensus Using Bounded Staleness for Scalable Read-mostly 
> > Synchronization.
> > Haibo Chen, Heng Zhang, Ran Liu, Binyu Zang, and Haibing Guan.
> > IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
> > https://dl.acm.org/citation.cfm?id=3024114.3024143
> > 
> > Signed-off-by: Heng Zhang 
> > Signed-off-by: Lihao Liang 
> 
> A few comments and questions interspersed.
> 
>   Thanx, Paul
> 
> > ---
> >  include/linux/prcu.h |  37 +++
> >  kernel/rcu/Makefile  |   2 +-
> >  kernel/rcu/prcu.c    | 125 +++
> >  kernel/sched/core.c  |   2 +
> >  4 files changed, 165 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/prcu.h
> >  create mode 100644 kernel/rcu/prcu.c
> > 
> > diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> > new file mode 100644
> > index ..653b4633
> > --- /dev/null
> > +++ b/include/linux/prcu.h
> > @@ -0,0 +1,37 @@
> > +#ifndef __LINUX_PRCU_H
> > +#define __LINUX_PRCU_H
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define CONFIG_PRCU
> > +
> > +struct prcu_local_struct {
> > +   unsigned int locked;
> > +   unsigned int online;
> > +   unsigned long long version;
> > +};
> > +
> > +struct prcu_struct {
> > +   atomic64_t global_version;
> > +   atomic_t active_ctr;
> > +   struct mutex mtx;
> > +   wait_queue_head_t wait_q;
> > +};
> > +
> > +#ifdef CONFIG_PRCU
> > +void prcu_read_lock(void);
> > +void prcu_read_unlock(void);
> > +void synchronize_prcu(void);
> > +void prcu_note_context_switch(void);
> > +
> > +#else /* #ifdef CONFIG_PRCU */
> > +
> > +#define prcu_read_lock() do {} while (0)
> > +#define prcu_read_unlock() do {} while (0)
> > +#define synchronize_prcu() do {} while (0)
> > +#define prcu_note_context_switch() do {} while (0)
> 
> If CONFIG_PRCU=n and some code is built that uses PRCU, shouldn't you
> get a build error rather than an error-free but inoperative PRCU?
> 
> Of course, Peter's question about purpose of the patch set applies
> here as well.
> 
> > +
> > +#endif /* #ifdef CONFIG_PRCU */
> > +#endif /* __LINUX_PRCU_H */
> > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > index 23803c7d..8791419c 100644
> > --- a/kernel/rcu/Makefile
> > +++ b/kernel/rcu/Makefile
> > @@ -2,7 +2,7 @@
> >  # and is generally not a function of system call inputs.
> >  KCOV_INSTRUMENT := n
> > 
> > -obj-y += update.o sync.o
> > +obj-y += update.o sync.o prcu.o
> >  obj-$(CONFIG_CLASSIC_SRCU) += srcu.o
> >  obj-$(CONFIG_TREE_SRCU) += srcutree.o
> >  obj-$(CONFIG_TINY_SRCU) += srcutiny.o
> > diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> > new file mode 100644
> > index ..a00b9420
> > --- /dev/null
> > +++ b/kernel/rcu/prcu.c
> > @@ -0,0 +1,125 @@
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include 
> > +
> > +DEFINE_PER_CPU_SHARED_ALIGNED(struct prcu_local_struct, prcu_local);
> > +
> > +struct prcu_struct global_prcu = {
> > +   .global_version = ATOMIC64_INIT(0),
> > +   .active_ctr = ATOMIC_INIT(0),
> > +   .mtx = __MUTEX_INITIALIZER(global_prcu.mtx),
> > +   .wait_q = __WAIT_QUEUE_HEAD_INITIALIZER(global_prcu.wait_q)
> > +};
> > +struct prcu_struct *prcu = &global_prcu;
> > +
> > +static inline void prcu_report(struct prcu_local_struct *local)
> > +{
> > +   unsigned long long global_version;
> > +   unsigned long long local_version;
> > +
> > +   global_version = atomic64_read(&prcu->global_version);
> > +   local_version = local->version;
> > +   if (global_version > local_version)
> > +   cmpxchg(&local->version, local_version, global_version);
> > +}
> > +
> > +void prcu_read_lock(void)
> > +{
> > +   struct prcu_local_struct *local;
> > +
> > +   local = get_cpu_ptr(&prcu_local);
> > +   if (!local->online) {
> > +   WRITE_ONCE(local->online, 1);
> > +   smp_mb();
> > +   }
> > +
> > +   local->locked++;
> > +   put_cpu_ptr(&prcu_local);
> > +}
> > +EXPORT_SYMBOL(prcu_read_lock);
> > +
> > +void prcu_read_unlock(void)
> > +{
> > +   int locked;
> > +   struct prcu_local_struct *local;
> > +
> > +   barrier();
> > +   local = get_cpu_ptr(&prcu_local);
> > +   locked = local->locked;
> > +   if (locked) {
> > +   local->locked--;
> > +   if (locked == 1)
> > +   prcu_report(local);
> 
> Is ordering important here?  It looks to me that the compiler could
> rearrange some of the accesses within prcu_report() with the local->locked
> decrement.  There appears to be some potential for load and store tearing,
> though perhaps you have verified that your compiler avoids this on
> the architecture th

[PATCH v2 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

2018-01-24 Thread Aaron Lu
When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing buddy's
page structure and that access could take a long time if it's cache cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock in hope of accessing buddy's page structure later under
zone->lock will be faster. Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e. the prefetched data
will never be unused.

In the meantime, there is the concern of a prefetch potentially evicting
existing cachelines. This can be true for L1D cache since it is not huge.
Considering the prefetch instruction used is prefetchnta, which will only
store the data in L2 for "Pentium 4 and Intel Xeon processors" according
to Intel's instruction set manual, it is not likely to cause
cache pollution. Other architectures may have this cache pollution problem
though.

There is also some additional instruction overhead, namely calculating
buddy pfn twice. Since the calculation is a XOR on two local variables,
it's expected in many cases that cycles spent will be offset by reduced
memory latency later. This is especially true for NUMA machines where multiple
CPUs are contending on zone->lock and the most time consuming part under
zone->lock is the wait of 'struct page' cacheline of the to-be-freed pages
and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)   Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.15-rc4   9037332         8000124        13642741        15728686
patch1/2    9608786 +6.3%   8368915 +4.6%  14042169 +2.9%  17433559 +10.8%
this patch  10462292 +8.9%  8602889 +2.8%  14802073 +5.4%  17624575 +1.1%

Note: this patch's performance improvement percent is against patch1/2.

Please also note the actual benefit of this patch will be workload/CPU
dependent.

[changelog stolen from Dave Hansen and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551
Suggested-by: Ying Huang 
Signed-off-by: Aaron Lu 
---
v2:
update changelog according to Dave Hansen and Mel Gorman's comments.
Add more comments in code to explain why prefetch is done as requested
by Mel Gorman.

 mm/page_alloc.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9e5ded39b16..6566a4b5b124 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1138,6 +1138,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
batch_free = count;
 
do {
+   unsigned long pfn, buddy_pfn;
+   struct page *buddy;
+
page = list_last_entry(list, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
@@ -1146,6 +1149,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
continue;
 
list_add_tail(&page->lru, &head);
+
+   /*
+* We are going to put the page back to the global
+* pool, prefetch its buddy to speed up later access
+* under zone->lock. It is believed the overhead of
+* calculating buddy_pfn here can be offset by reduced
+* memory latency later.
+*/
+   pfn = page_to_pfn(page);
+   buddy_pfn = __find_buddy_pfn(pfn, 0);
+   buddy = page + (buddy_pfn - pfn);
+   prefetch(buddy);
} while (--count && --batch_free && !list_empty(list));
}
 
-- 
2.14.3



linux-next: build failure after merge of the rdma tree

2018-01-24 Thread Stephen Rothwell
Hi all,

After merging the rdma tree, today's linux-next build (x86_64
allmodconfig) failed like this:

ERROR: "init_rcu_head" [drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined!

Caused by commit

  a11253142e6d ("IB/srpt: Rework multi-channel support")

I have used the rdma tree from next-20180119 for today.

-- 
Cheers,
Stephen Rothwell


[PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

2018-01-24 Thread Aaron Lu
When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
the zone->lock is held and then pages are chosen from PCP's migratetype
list. While there is actually no need to do this 'choose part' under
lock since it's PCP pages, the only CPU that can touch them is us and
irq is also disabled.

Moving this part outside could reduce lock held time and improve
performance. Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.15-rc4   9037332        8000124        13642741        15728686
this patch  9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%

What the test does is: start $nr_cpu processes, each of which
repeatedly does the following for 5 minutes:
1 mmap 128M of anonymous space;
2 write-access that space;
3 munmap.
The score is the aggregated number of iterations.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Acked-by: Mel Gorman 
Signed-off-by: Aaron Lu 
---
v2: use LIST_HEAD(head) as suggested by Mel Gorman.

 mm/page_alloc.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4093728f292e..c9e5ded39b16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1113,12 +1113,10 @@ static void free_pcppages_bulk(struct zone *zone, int 
count,
int migratetype = 0;
int batch_free = 0;
bool isolated_pageblocks;
-
-   spin_lock(&zone->lock);
-   isolated_pageblocks = has_isolate_pageblock(zone);
+   struct page *page, *tmp;
+   LIST_HEAD(head);
 
while (count) {
-   struct page *page;
struct list_head *list;
 
/*
@@ -1140,26 +1138,31 @@ static void free_pcppages_bulk(struct zone *zone, int 
count,
batch_free = count;
 
do {
-   int mt; /* migratetype of the to-be-freed page */
-
page = list_last_entry(list, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
 
-   mt = get_pcppage_migratetype(page);
-   /* MIGRATE_ISOLATE page should not go to pcplists */
-   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-   /* Pageblock could have been isolated meanwhile */
-   if (unlikely(isolated_pageblocks))
-   mt = get_pageblock_migratetype(page);
-
if (bulkfree_pcp_prepare(page))
continue;
 
-   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
-   trace_mm_page_pcpu_drain(page, 0, mt);
+   list_add_tail(&page->lru, &head);
} while (--count && --batch_free && !list_empty(list));
}
+
+   spin_lock(&zone->lock);
+   isolated_pageblocks = has_isolate_pageblock(zone);
+
+   list_for_each_entry_safe(page, tmp, &head, lru) {
+   int mt = get_pcppage_migratetype(page);
+   /* MIGRATE_ISOLATE page should not go to pcplists */
+   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+   /* Pageblock could have been isolated meanwhile */
+   if (unlikely(isolated_pageblocks))
+   mt = get_pageblock_migratetype(page);
+
+   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
+   trace_mm_page_pcpu_drain(page, 0, mt);
+   }
spin_unlock(&zone->lock);
 }
 
-- 
2.14.3



Re: [PATCH v4 02/10] asm/nospec, array_ptr: sanitize speculative array de-references

2018-01-24 Thread Cyril Novikov

On 1/18/2018 4:01 PM, Dan Williams wrote:

'array_ptr' is proposed as a generic mechanism to mitigate against
Spectre-variant-1 attacks, i.e. attacks that bypass bounds checks via
speculative execution. The 'array_ptr' implementation is expected to be
safe for current generation CPUs across multiple architectures
(ARM, x86).


I'm an outside reviewer, not subscribed to the list, so forgive me if I 
do something not according to protocol. I have the following comments on 
this change:


After discarding the speculation barrier variant, is array_ptr() needed 
at all? You could have a simpler sanitizing macro, say


#define array_sanitize_idx(idx, sz) ((idx) & array_ptr_mask((idx), (sz)))

(adjusted to not evaluate idx twice). And use it as follows:

if (idx < array_size) {
idx = array_sanitize_idx(idx, array_size);
do_something(array[idx]);
}

If I understand the speculation stuff correctly, unlike array_ptr(), 
this "leaks" array[0] rather than nothing (*NULL) when executed 
speculatively. However, it's still much better than leaking an arbitrary 
location in memory. The attacker can likely get array[0] "leaked" by 
passing 0 as idx anyway.



+/*
+ * If idx is negative or if idx > size then bit 63 is set in the mask,
+ * and the value of ~(-1L) is zero. When the mask is zero, bounds check
+ * failed, array_ptr will return NULL.
+ */
+#ifndef array_ptr_mask
+static inline unsigned long array_ptr_mask(unsigned long idx, unsigned long sz)
+{
+   return ~(long)(idx | (sz - 1 - idx)) >> (BITS_PER_LONG - 1);
+}
+#endif


Why does this have to resort to the implementation-defined behavior of
shifting a negative number to the right? You can do without it:


return ((idx | (sz - 1 - idx)) >> (BITS_PER_LONG - 1)) - 1;

Of course, you could argue that subtracting 1 from an unsigned zero to
get all ones is also a trick, but unsigned wraparound is at least well
defined, so it's still much better than the shift, isn't it?



+#define array_ptr(base, idx, sz)   \
+({ \
+   union { typeof(*(base)) *_ptr; unsigned long _bit; } __u;   \
+   typeof(*(base)) *_arr = (base); \
+   unsigned long _i = (idx);   \
+   unsigned long _mask = array_ptr_mask(_i, (sz)); \
+   \
+   __u._ptr = _arr + (_i & _mask); \
+   __u._bit &= _mask;  \
+   __u._ptr;   \
+})


Call me paranoid, but I think this may actually create an exploitable 
bug on 32-bit systems due to casting the index to an unsigned long, if 
the index as it comes from userland is a 64-bit value. You have 
*replaced* the "if (idx < array_size)" check with checking if 
array_ptr() returns NULL. Well, it doesn't return NULL if the low 32 
bits of the index are in-bounds, but the high 32 bits are not zero. 
Apart from the return value pointing to the wrong place, the subsequent 
code may then assume that the 64-bit idx is actually valid and trip on 
it badly.


--
Cyril


Re: [RFC PATCH v2 1/1] of: introduce event tracepoints for dynamic device_node lifecyle

2018-01-24 Thread Frank Rowand
On 01/24/18 22:48, Frank Rowand wrote:
> On 01/21/18 06:31, Wolfram Sang wrote:
>> From: Tyrel Datwyler 
>>
>> This patch introduces event tracepoints for tracking a device_node's
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>>
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR)
>> dynamic logical partitioning code (ie. POWER speak for hotplugging
>> virtual and physical resources at runtime such as cpus or IOAs). These
>> tracepoints provide an easy and quick mechanism for validating the
>> reference counting of device_nodes during their lifetime.
>>
>> Further, when pseries lpars are migrated to a different machine we
>> perform a live update of our device tree to bring it into alignment with
>> the configuration of the new machine. The of_reconfig_notify trace point
>> provides a mechanism that can be turned on for debugging the device
>> tree modifications without having to build a custom kernel to get at the
>> DEBUG code introduced by commit 00aa37206e1a54 ("of/reconfig: Add debug
>> output for OF_RECONFIG notifiers").
>>
>> The following trace events are provided: of_node_get, of_node_put,
>> of_node_release, and of_reconfig_notify. These trace points require a
> 
> Please add a note that the of_reconfig_notify trace event is not an
> added bit of debug info, but is instead replacing information that
> was previously available via pr_debug() when DEBUG was defined.

I got a little carried away; "when DEBUG was defined" is extra,
unneeded detail for the commit message.


> 
> 
>> kernel built with ftrace support to be enabled. In a typical environment
>> where debugfs is mounted at /sys/kernel/debug the entire set of
>> tracepoints can be set with the following:
>>
>>   echo "of:*" > /sys/kernel/debug/tracing/set_event
>>
>> or
>>
>>   echo 1 > /sys/kernel/debug/tracing/events/of/enable
>>
>> The following shows the trace point data from a DLPAR remove of a cpu
>> from a pseries lpar:
>>
>> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
>>
>> cpuhp/23-147   [023]    128.324827:
>> of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829:
>> of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829:
>> of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324831:
>> of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439000:
>> of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439002:
>> of_reconfig_notify: action=DETACH_NODE, 
>> dn->full_name=/cpus/PowerPC,POWER8@10,
>> prop->name=null, old_prop->name=null
>>drmgr-7284  [009]    128.439015:
>> of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439016:
>> of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
>>
>> Signed-off-by: Tyrel Datwyler 
> 
> The following belongs in a list of version 2 changes, below the "---" line:
> 
>> [wsa: fixed commit abbrev and one of the sysfs paths in commit desc,
>> removed trailing space and fixed pointer declaration in code]
> 
>> Signed-off-by: Wolfram Sang 
>> ---
>>  drivers/of/dynamic.c  | 32 ++--
>>  include/trace/events/of.h | 93 
>> +++
>>  2 files changed, 105 insertions(+), 20 deletions(-)
>>  create mode 100644 include/trace/events/of.h
> 
> mode looks incorrect.  Existing files in include/trace/events/ are -rw-rw
> 
> 
>> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>> index ab988d88704da0..b0d6ab5a35b8c6 100644
>> --- a/drivers/of/dynamic.c
>> +++ b/drivers/of/dynamic.c
>> @@ -21,6 +21,9 @@ static struct device_node *kobj_to_device_node(struct 
>> kobject *kobj)
>>  return container_of(kobj, struct device_node, kobj);
>>  }
>>  
>> +#define CREATE_TRACE_POINTS
>> +#include 
>> +
>>  /**
>>   * of_node_get() - Increment refcount of a node
>>   * @node:   Node to inc refcount, NULL is supported to simplify writing of
>> @@ -30,8 +33,10 @@ static struct device_node *kobj_to_device_node(struct 
>> kobject *kobj)
>>   */
>>  struct device_node *of_node_get(struct device_node *node)
>>  {
>> -if (node)
>> +if (node) {
>>  kobject_get(&node->kobj);
>> +trace_of_node_get(refcount_read(&node->kobj.kref.refcount), 
>> node->full_name);
> 
> See the comment from Ron that I mentioned in my previous email.
   
   Rob, darn it.


> Also, the path has been removed from node->full_name.  Does using it here
> still give all of the information that is desired?  Same for all others uses
> of full_name in this patch.
> 

[RFC PATCH V4 4/5] workqueue: convert ->nice to ->sched_attr

2018-01-24 Thread Wen Yang
The new /sys interface looks like this:
#cat /sys/devices/virtual/workqueue/writeback/sched_attr
policy=0 prio=0 nice=0

# echo "policy=0 prio=0 nice=-1" > 
/sys/devices/virtual/workqueue/writeback/sched_attr
# cat /sys/devices/virtual/workqueue/writeback/sched_attr
policy=0 prio=0 nice=-1

Also, the possibility of specifying more than just a priority
for the wq may be useful for a wide variety of applications.

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: kernel test robot 
Cc: linux-kernel@vger.kernel.org
---
 include/linux/workqueue.h |   5 --
 kernel/workqueue.c| 130 --
 2 files changed, 79 insertions(+), 56 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 9faaade..d9d0f36 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -128,11 +128,6 @@ struct delayed_work {
  */
 struct workqueue_attrs {
/**
-* @nice: nice level
-*/
-   int nice;
-
-   /**
 * @sched_attr: kworker's scheduling parameters
 */
struct sched_attr sched_attr;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e1613d0..8c5aba5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1773,7 +1773,7 @@ static struct worker *create_worker(struct worker_pool 
*pool)
 
if (pool->cpu >= 0)
snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
-pool->attrs->nice < 0  ? "H" : "");
+pool->attrs->sched_attr.sched_nice < 0  ? "H" : "");
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
@@ -1782,7 +1782,7 @@ static struct worker *create_worker(struct worker_pool 
*pool)
if (IS_ERR(worker->task))
goto fail;
 
-   set_user_nice(worker->task, pool->attrs->nice);
+   set_user_nice(worker->task, pool->attrs->sched_attr.sched_nice);
kthread_bind_mask(worker->task, pool->attrs->cpumask);
 
/* successful, attach the worker to the pool */
@@ -3179,7 +3179,6 @@ static void copy_sched_attr(struct sched_attr *to,
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 const struct workqueue_attrs *from)
 {
-   to->nice = from->nice;
copy_sched_attr(&to->sched_attr, &from->sched_attr);
cpumask_copy(to->cpumask, from->cpumask);
/*
@@ -3195,17 +3194,29 @@ static u32 wqattrs_hash(const struct workqueue_attrs 
*attrs)
 {
u32 hash = 0;
 
-   hash = jhash_1word(attrs->nice, hash);
+   hash = jhash_1word(attrs->sched_attr.sched_nice, hash);
hash = jhash(cpumask_bits(attrs->cpumask),
 BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
return hash;
 }
 
+static bool sched_attr_equal(const struct sched_attr *a,
+   const struct sched_attr *b)
+{
+   if (a->sched_policy != b->sched_policy)
+   return false;
+   if (a->sched_priority != b->sched_priority)
+   return false;
+   if (a->sched_nice != b->sched_nice)
+   return false;
+   return true;
+}
+
 /* content equality test */
 static bool wqattrs_equal(const struct workqueue_attrs *a,
  const struct workqueue_attrs *b)
 {
-   if (a->nice != b->nice)
+   if (a->sched_attr.sched_nice != b->sched_attr.sched_nice)
return false;
if (!cpumask_equal(a->cpumask, b->cpumask))
return false;
@@ -3259,8 +3270,6 @@ static void rcu_free_wq(struct rcu_head *rcu)
 
if (!(wq->flags & WQ_UNBOUND))
free_percpu(wq->cpu_pwqs);
-   else
-   free_workqueue_attrs(wq->unbound_attrs);
free_workqueue_attrs(wq->attrs);
kfree(wq->rescuer);
kfree(wq);
@@ -4353,7 +4362,8 @@ static void pr_cont_pool_info(struct worker_pool *pool)
pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
if (pool->node != NUMA_NO_NODE)
pr_cont(" node=%d", pool->node);
-   pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
+   pr_cont(" flags=0x%x nice=%d", pool->flags,
+   pool->attrs->sched_attr.sched_nice);
 }
 
 static void pr_cont_work(bool comma, struct work_struct *work)
@@ -5074,7 +5084,64 @@ static ssize_t sched_attr_show(struct device *dev,
return written;
 }
 
-static DEVICE_ATTR_RO(sched_attr);
+static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct 
*wq);
+
+static int wq_set_unbound_sched_attr(struct workqueue_struct *wq,
+   const struct sched_attr *new)
+{
+   struct workqueue_attrs *attrs;
+   int ret = -ENOMEM;
+
+   apply_wqattrs_lock();
+   attrs = wq_sysfs_prep_attrs(wq);
+   if (!attrs)
+   goto out_unlock;
+   copy_sc

[RFC PATCH V4 5/5] workqueue: introduce a way to set workqueue's scheduler

2018-01-24 Thread Wen Yang
When pinning RT threads to specific cores using CPU affinity, the
kworkers on the same CPU can starve, which may lead to a kind of
priority inversion. In that case, the RT threads themselves also
suffer a high performance impact.

The priority inversion looks like,
CPU 0:  libvirtd acquired cgroup_mutex, and triggered
lru_add_drain_per_cpu, then waiting for all the kworkers to complete:
PID: 44145  TASK: 8807bec7b980  CPU: 0   COMMAND: "libvirtd"
#0 [8807f2cbb9d0] __schedule at 816410ed
#1 [8807f2cbba38] schedule at 81641789
#2 [8807f2cbba48] schedule_timeout at 8163f479
#3 [8807f2cbbaf8] wait_for_completion at 81641b56
#4 [8807f2cbbb58] flush_work at 8109efdc
#5 [8807f2cbbbd0] lru_add_drain_all at 81179002
#6 [8807f2cbbc08] migrate_prep at 811c77be
#7 [8807f2cbbc18] do_migrate_pages at 811b8010
#8 [8807f2cbbcf8] cpuset_migrate_mm at 810fea6c
#9 [8807f2cbbd10] cpuset_attach at 810ff91e
#10 [8807f2cbbd50] cgroup_attach_task at 810f9972
#11 [8807f2cbbe08] attach_task_by_pid at 810fa520
#12 [8807f2cbbe58] cgroup_tasks_write at 810fa593
#13 [8807f2cbbe68] cgroup_file_write at 810f8773
#14 [8807f2cbbef8] vfs_write at 811dfdfd
#15 [8807f2cbbf38] sys_write at 811e089f
#16 [8807f2cbbf80] system_call_fastpath at 8164c809

CPU 43: kworker/43 starved because of the RT threads:
CURRENT: PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
RT PRIO_ARRAY: 883fff3f4950
[ 79] PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
[ 79] PID: 21295  TASK: 88276d481700  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21351  TASK: 8807be822280  COMMAND: "dispatcher"
[ 79] PID: 21129  TASK: 8807bef0f300  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21337  TASK: 88276d482e00  COMMAND: "handler_3"
[ 79] PID: 21352  TASK: 8807be824500  COMMAND: "flow_dumper"
[ 79] PID: 21336  TASK: 88276d480b80  COMMAND: "handler_2"
[ 79] PID: 21342  TASK: 88276d484500  COMMAND: "handler_8"
[ 79] PID: 21341  TASK: 88276d482280  COMMAND: "handler_7"
[ 79] PID: 21338  TASK: 88276d483980  COMMAND: "handler_4"
[ 79] PID: 21339  TASK: 88276d48  COMMAND: "handler_5"
[ 79] PID: 21340  TASK: 88276d486780  COMMAND: "handler_6"
CFS RB_ROOT: 883fff3f4868
[120] PID: 37959  TASK: 88276e148000  COMMAND: "kworker/43:1"

CPU 28: Systemd(Victim) was blocked by cgroup_mutex:
PID: 1  TASK: 883fd2d4  CPU: 28  COMMAND: "systemd"
#0 [881fd317bd60] __schedule at 816410ed
#1 [881fd317bdc8] schedule_preempt_disabled at 81642869
#2 [881fd317bdd8] __mutex_lock_slowpath at 81640565
#3 [881fd317be38] mutex_lock at 8163f9cf
#4 [881fd317be50] proc_cgroup_show at 810fd256
#5 [881fd317be98] seq_read at 81203cda
#6 [881fd317bf08] vfs_read at 811dfc6c
#7 [881fd317bf38] sys_read at 811e07bf
#8 [881fd317bf80] system_call_fastpath at 81

The simplest way to fix that is to set the scheduler of the kworkers
to a higher RT priority, e.g.:
chrt --fifo -p 61 
However, it cannot prevent other WORK_CPU_BOUND worker threads from
running and starving.

This patch introduces a way to set the scheduler (policy and priority)
of a percpu worker_pool. That way, users can set a proper scheduling
policy and priority for the worker_pool as needed, which applies to all
the WORK_CPU_BOUND workers on the same CPU. On the other hand,
/sys/devices/virtual/workqueue/cpumask can be used for WORK_CPU_UNBOUND
workers to prevent them from starving.

Tejun Heo suggested:
"* Add scheduler type to wq_attrs so that unbound workqueues can be
 configured.

* Rename system_wq's wq->name from "events" to "system_percpu", and
 similarly for the similarly named workqueues.

* Enable wq_attrs (only the applicable part should show up in the
 interface) for system_percpu and system_percpu_highpri, and use that
 to change the attributes of the percpu pools."

This patch implements the basic infrastructure and /sys interface,
such as:
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=0 prio=0 nice=0
# echo "policy=1 prio=1 nice=0" > 
/sys/devices/virtual/workqueue/system_percpu/sched_attr
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=1 prio=1 nice=0
# cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=0 prio=0 nice=-20
# echo "policy=1 prio=2 nice=0" > 
/sys/devices/virtual/workqueue/system_percpu_h

[RFC PATCH V4 3/5] workqueue: rename unbound_attrs to attrs

2018-01-24 Thread Wen Yang
Replace workqueue's unbound_attrs with attrs, so that both unbound and
bound workqueues can use it.

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: kernel test robot 
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 993f225..e1613d0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -255,7 +255,6 @@ struct workqueue_struct {
int nr_drainers;/* WQ: drain in progress */
int saved_max_active; /* WQ: saved pwq max_active */
 
-   struct workqueue_attrs  *unbound_attrs; /* PW: only for unbound wqs */
struct workqueue_attrs  *attrs;
struct pool_workqueue   *dfl_pwq;   /* PW: only for unbound wqs */
 
@@ -3737,7 +3736,7 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx 
*ctx)
/* all pwqs have been created successfully, let's install'em */
mutex_lock(&ctx->wq->mutex);
 
-   copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs);
+   copy_workqueue_attrs(ctx->wq->attrs, ctx->attrs);
 
/* save the previous pwq and install the new one */
for_each_node(node)
@@ -3854,7 +3853,7 @@ static void wq_update_unbound_numa(struct 
workqueue_struct *wq, int cpu,
lockdep_assert_held(&wq_pool_mutex);
 
if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
-   wq->unbound_attrs->no_numa)
+   wq->attrs->no_numa)
return;
 
/*
@@ -3865,7 +3864,7 @@ static void wq_update_unbound_numa(struct 
workqueue_struct *wq, int cpu,
target_attrs = wq_update_unbound_numa_attrs_buf;
cpumask = target_attrs->cpumask;
 
-   copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+   copy_workqueue_attrs(target_attrs, wq->attrs);
pwq = unbound_pwq_by_node(wq, node);
 
/*
@@ -3985,12 +3984,6 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
if (!wq)
return NULL;
 
-   if (flags & WQ_UNBOUND) {
-   wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
-   if (!wq->unbound_attrs)
-   goto err_free_wq;
-   }
-
wq->attrs = alloc_workqueue_attrs(GFP_KERNEL);
if (!wq->attrs)
goto err_free_wq;
@@ -4069,7 +4062,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char 
*fmt,
return wq;
 
 err_free_wq:
-   free_workqueue_attrs(wq->unbound_attrs);
free_workqueue_attrs(wq->attrs);
kfree(wq);
return NULL;
@@ -4941,7 +4933,7 @@ static int workqueue_apply_unbound_cpumask(void)
if (wq->flags & __WQ_ORDERED)
continue;
 
-   ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
+   ctx = apply_wqattrs_prepare(wq, wq->attrs);
if (!ctx) {
ret = -ENOMEM;
break;
@@ -5119,7 +5111,7 @@ static ssize_t wq_nice_show(struct device *dev, struct 
device_attribute *attr,
int written;
 
mutex_lock(&wq->mutex);
-   written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+   written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->attrs->nice);
mutex_unlock(&wq->mutex);
 
return written;
@@ -5136,7 +5128,7 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct 
workqueue_struct *wq)
if (!attrs)
return NULL;
 
-   copy_workqueue_attrs(attrs, wq->unbound_attrs);
+   copy_workqueue_attrs(attrs, wq->attrs);
return attrs;
 }
 
@@ -5173,7 +5165,7 @@ static ssize_t wq_cpumask_show(struct device *dev,
 
mutex_lock(&wq->mutex);
written = scnprintf(buf, PAGE_SIZE, "%*pb\n",
-   cpumask_pr_args(wq->unbound_attrs->cpumask));
+   cpumask_pr_args(wq->attrs->cpumask));
mutex_unlock(&wq->mutex);
return written;
 }
@@ -5210,7 +5202,7 @@ static ssize_t wq_numa_show(struct device *dev, struct 
device_attribute *attr,
 
mutex_lock(&wq->mutex);
written = scnprintf(buf, PAGE_SIZE, "%d\n",
-   !wq->unbound_attrs->no_numa);
+   !wq->attrs->no_numa);
mutex_unlock(&wq->mutex);
 
return written;
-- 
1.8.3.1



[RFC PATCH V4 1/5] workqueue: rename system workqueues

2018-01-24 Thread Wen Yang
Rename system_wq's wq->name from "events" to "system_percpu",
and similarly for the similarly named workqueues.

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f699122..67b68bb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5601,16 +5601,18 @@ int __init workqueue_init_early(void)
ordered_wq_attrs[i] = attrs;
}
 
-   system_wq = alloc_workqueue("events", 0, 0);
-   system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
-   system_long_wq = alloc_workqueue("events_long", 0, 0);
-   system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
+   system_wq = alloc_workqueue("system_percpu", 0, 0);
+   system_highpri_wq = alloc_workqueue("system_percpu_highpri",
+   WQ_HIGHPRI, 0);
+   system_long_wq = alloc_workqueue("system_percpu_long", 0, 0);
+   system_unbound_wq = alloc_workqueue("system_unbound", WQ_UNBOUND,
WQ_UNBOUND_MAX_ACTIVE);
-   system_freezable_wq = alloc_workqueue("events_freezable",
+   system_freezable_wq = alloc_workqueue("system_percpu_freezable",
  WQ_FREEZABLE, 0);
-   system_power_efficient_wq = alloc_workqueue("events_power_efficient",
+   system_power_efficient_wq = 
alloc_workqueue("system_percpu_power_efficient",
  WQ_POWER_EFFICIENT, 0);
-   system_freezable_power_efficient_wq = 
alloc_workqueue("events_freezable_power_efficient",
+   system_freezable_power_efficient_wq = alloc_workqueue(
+ 
"system_percpu_freezable_power_efficient",
  WQ_FREEZABLE | WQ_POWER_EFFICIENT,
  0);
BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
-- 
1.8.3.1



[RFC PATCH V4 2/5] workqueue: expose attrs for system workqueues

2018-01-24 Thread Wen Yang
Expose sched_attr for system workqueues, such as:
# cat   /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=0 prio=0 nice=0
# cat   /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=0 prio=0 nice=-20

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: kernel test robot 
Cc: linux-kernel@vger.kernel.org
---
 include/linux/workqueue.h |  6 ++
 kernel/workqueue.c| 50 +--
 2 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 4a54ef9..9faaade 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct workqueue_struct;
 
@@ -132,6 +133,11 @@ struct workqueue_attrs {
int nice;
 
/**
+* @sched_attr: kworker's scheduling parameters
+*/
+   struct sched_attr sched_attr;
+
+   /**
 * @cpumask: allowed CPUs
 */
cpumask_var_t cpumask;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 67b68bb..993f225 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -256,6 +256,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */
 
struct workqueue_attrs  *unbound_attrs; /* PW: only for unbound wqs */
+   struct workqueue_attrs  *attrs;
struct pool_workqueue   *dfl_pwq;   /* PW: only for unbound wqs */
 
 #ifdef CONFIG_SYSFS
@@ -1699,6 +1700,7 @@ static void worker_attach_to_pool(struct worker *worker,
 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 */
set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+   sched_setattr(worker->task, &pool->attrs->sched_attr);
 
/*
 * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains
@@ -3166,10 +3168,20 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t 
gfp_mask)
return NULL;
 }
 
+static void copy_sched_attr(struct sched_attr *to,
+   const struct sched_attr *from)
+{
+   to->sched_policy = from->sched_policy;
+   to->sched_priority = from->sched_priority;
+   to->sched_nice = from->sched_nice;
+   to->sched_flags = from->sched_flags;
+}
+
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 const struct workqueue_attrs *from)
 {
to->nice = from->nice;
+   copy_sched_attr(&to->sched_attr, &from->sched_attr);
cpumask_copy(to->cpumask, from->cpumask);
/*
 * Unlike hash and equality test, this function doesn't ignore
@@ -3250,7 +3262,7 @@ static void rcu_free_wq(struct rcu_head *rcu)
free_percpu(wq->cpu_pwqs);
else
free_workqueue_attrs(wq->unbound_attrs);
-
+   free_workqueue_attrs(wq->attrs);
kfree(wq->rescuer);
kfree(wq);
 }
@@ -3979,6 +3991,10 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
goto err_free_wq;
}
 
+   wq->attrs = alloc_workqueue_attrs(GFP_KERNEL);
+   if (!wq->attrs)
+   goto err_free_wq;
+
va_start(args, lock_name);
vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);
@@ -3999,6 +4015,11 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);
 
+   wq->attrs->sched_attr.sched_policy = SCHED_NORMAL;
+   wq->attrs->sched_attr.sched_priority = 0;
+   wq->attrs->sched_attr.sched_nice = wq->flags & WQ_HIGHPRI ?
+   HIGHPRI_NICE_LEVEL : 0;
+
if (alloc_and_link_pwqs(wq) < 0)
goto err_free_wq;
 
@@ -4049,6 +4070,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char 
*fmt,
 
 err_free_wq:
free_workqueue_attrs(wq->unbound_attrs);
+   free_workqueue_attrs(wq->attrs);
kfree(wq);
return NULL;
 err_destroy:
@@ -5043,9 +5065,29 @@ static ssize_t max_active_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(max_active);
 
+static ssize_t sched_attr_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   size_t written;
+   struct workqueue_struct *wq = dev_to_wq(dev);
+
+   mutex_lock(&wq->mutex);
+   written = scnprintf(buf, PAGE_SIZE,
+   "policy=%u prio=%u nice=%d\n",
+   wq->attrs->sched_attr.sched_policy,
+   wq->attrs->sched_attr.sched_priority,
+   wq->attrs->sched_attr.sched_nice);
+   mutex_unlock(&wq->mutex);
+
+   return written;
+}
+
+static DEVICE_ATTR_RO(sched_attr);
+
 static struct attribute *wq_sysfs_attrs[] = {
&dev_attr_per_cpu.attr,
&dev_attr_max_active.attr,
+

[PATCH v3] f2fs: support inode creation time

2018-01-24 Thread Chao Yu
This patch adds a creation time field to the inode layout to support
showing kstat.btime in ->statx.

Signed-off-by: Chao Yu 
---
v3:
- fix address alignment issue.
 fs/f2fs/f2fs.h  |  7 +++
 fs/f2fs/file.c  |  9 +
 fs/f2fs/inode.c | 16 
 fs/f2fs/namei.c |  3 ++-
 fs/f2fs/sysfs.c |  7 +++
 include/linux/f2fs_fs.h |  4 +++-
 6 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index b7ba496af28f..6300ac5bcbe4 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -124,6 +124,7 @@ struct f2fs_mount_info {
 #define F2FS_FEATURE_INODE_CHKSUM  0x0020
 #define F2FS_FEATURE_FLEXIBLE_INLINE_XATTR 0x0040
 #define F2FS_FEATURE_QUOTA_INO 0x0080
+#define F2FS_FEATURE_INODE_CRTIME  0x0100
 
 #define F2FS_HAS_FEATURE(sb, mask) \
((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0)
@@ -635,6 +636,7 @@ struct f2fs_inode_info {
int i_extra_isize;  /* size of extra space located in 
i_addr */
kprojid_t i_projid; /* id for project quota */
int i_inline_xattr_size;/* inline xattr size */
+   struct timespec i_crtime;   /* inode creation time */
 };
 
 static inline void get_extent_info(struct extent_info *ext,
@@ -3205,6 +3207,11 @@ static inline int f2fs_sb_has_quota_ino(struct 
super_block *sb)
return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_QUOTA_INO);
 }
 
+static inline int f2fs_sb_has_inode_crtime(struct super_block *sb)
+{
+   return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_INODE_CRTIME);
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline int get_blkz_type(struct f2fs_sb_info *sbi,
struct block_device *bdev, block_t blkaddr)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 84306c718e68..672a542e5464 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -672,8 +672,17 @@ int f2fs_getattr(const struct path *path, struct kstat 
*stat,
 {
struct inode *inode = d_inode(path->dentry);
struct f2fs_inode_info *fi = F2FS_I(inode);
+   struct f2fs_inode *ri;
unsigned int flags;
 
+   if (f2fs_has_extra_attr(inode) &&
+   f2fs_sb_has_inode_crtime(inode->i_sb) &&
+   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
+   stat->result_mask |= STATX_BTIME;
+   stat->btime.tv_sec = fi->i_crtime.tv_sec;
+   stat->btime.tv_nsec = fi->i_crtime.tv_nsec;
+   }
+
flags = fi->i_flags & (FS_FL_USER_VISIBLE | FS_PROJINHERIT_FL);
if (flags & FS_APPEND_FL)
stat->attributes |= STATX_ATTR_APPEND;
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 1dc77a40d0ad..99ee72bff628 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -278,6 +278,13 @@ static int do_read_inode(struct inode *inode)
i_projid = F2FS_DEF_PROJID;
fi->i_projid = make_kprojid(&init_user_ns, i_projid);
 
+   if (f2fs_has_extra_attr(inode) && f2fs_sb_has_inode_crtime(sbi->sb) &&
+   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
+   fi->i_crtime.tv_sec = le64_to_cpu(ri->i_crtime);
+   fi->i_crtime.tv_nsec = le32_to_cpu(ri->i_crtime_nsec);
+   }
+
+
f2fs_put_page(node_page, 1);
 
stat_inc_inline_xattr(inode);
@@ -421,6 +428,15 @@ void update_inode(struct inode *inode, struct page *node_page)
F2FS_I(inode)->i_projid);
ri->i_projid = cpu_to_le32(i_projid);
}
+
+   if (f2fs_sb_has_inode_crtime(F2FS_I_SB(inode)->sb) &&
+   F2FS_FITS_IN_INODE(ri, F2FS_I(inode)->i_extra_isize,
+   i_crtime)) {
+   ri->i_crtime =
+   cpu_to_le64(F2FS_I(inode)->i_crtime.tv_sec);
+   ri->i_crtime_nsec =
+   cpu_to_le32(F2FS_I(inode)->i_crtime.tv_nsec);
+   }
}
 
__set_inode_rdev(inode, ri);
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index 3ee97ba9d2d7..c4c94c7e9f4f 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -50,7 +50,8 @@ static struct inode *f2fs_new_inode(struct inode *dir, umode_t mode)
 
inode->i_ino = ino;
inode->i_blocks = 0;
-   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+   inode->i_mtime = inode->i_atime = inode->i_ctime =
+   F2FS_I(inode)->i_crtime = current_time(inode);
inode->i_generation = sbi->s_next_generation++;
 
err = insert_inode_locked(inode);
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 41887e6ec1b3..d978c7b6ea04 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -113,6 +113,9 @@ static ssize_t features_show(struct f2fs_attr *a,
if (f2fs_sb_has_quota_ino(sb))

Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
Hi Steve,

On 01/21/18 06:31, Wolfram Sang wrote:
> I got a bug report for a DT node refcounting problem in the I2C subsystem. This
> patch was a huge help in validating the bug report and the proposed solution.
> So, I thought I bring it to attention again. Thanks Tyrel, for the initial
> work!
> 
> Note that I did not test the dynamic updates, only of_node_{get|put} so far. I
> read that Tyrel checked dynamic updates extensively with this patch. And since
> DT overlays are also used within our Renesas dev team, this will help there,
> as well.
> 
> Tested on a Renesas Salvator-XS board (R-Car H3).
> 
> Changes since RFC v1:
>   * rebased to v4.15-rc8
>   * fixed commit abbrev and one of the sysfs paths in commit desc
>   * removed trailing space and fixed pointer declaration in code
> 
> I consider all the remaining checkpatch issues irrelevant for this patch.
> 
> So what about applying it?
> 
> Kind regards,
> 
>Wolfram
> 
> 
> Tyrel Datwyler (1):
>   of: introduce event tracepoints for dynamic device_node lifecyle
> 
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h
> 

Off the top of your head, can you tell me how early in the boot
process a trace_event can be called and successfully provide the
data to someone trying to debug early boot issues?

Also, way back when version 1 of this patch was being discussed,
a question was asked about stacktrace triggers:

 >>> # echo stacktrace > /sys/kernel/debug/tracing/trace_options
 >>> # cat trace | grep -A6 "/pci@8002018"  
 >>
 >> Just to let you know that there is now stacktrace event triggers, where
 >> you don't need to stacktrace all events, you can pick and choose. And
 >> even filter the stack trace on specific fields of the event.  
 >
 > This is great, and I did figure that out this afternoon. One thing I was
 > still trying to determine though was whether its possible to set these
 > triggers at boot? As far as I could tell I'm still limited to
 > "trace_options=stacktrace" as a kernel boot parameter to get the stack
 > for event tracepoints.

 No not yet. But I'll add that to the todo list.

 Thanks,

 -- Steve

Is this still on your todo list, or is it now available?

Thanks,

Frank
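For reference, the per-event stacktrace triggers Steven mentions above look like this at runtime (the event names here assume the proposed `of` tracepoints; the trigger interface itself is standard ftrace, and these are configuration commands for a debugfs-mounted tracefs):

```shell
cd /sys/kernel/debug/tracing

# Stack-trace only of_node_put events instead of every event:
echo stacktrace > events/of/of_node_put/trigger

# Or fire only when the refcount drops to zero, using an event-field filter:
echo 'stacktrace if refcount == 0' > events/of/of_node_put/trigger

cat trace
```

At the time of this thread there was no boot parameter for installing per-event triggers, which is the gap Frank is asking about.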


Re: [PATCH v2] f2fs: support inode creation time

2018-01-24 Thread Chao Yu
On 2018/1/24 22:59, Chao Yu wrote:
> Hi Jaegeuk,
> 
> On 2018/1/24 13:38, Jaegeuk Kim wrote:
>> On 01/22, Chao Yu wrote:
>>> From: Chao Yu 
>>>
>>> This patch adds creation time field in inode layout to support showing
>>> kstat.btime in ->statx.
>>
>> Hi Chao,
>>
>> Could you please check this patch again? I reverted this due to kernel panic.
> 
> I can't reproduce it, could you show me your call stack?
> 
> And, could you please have a try with kernel patch only?

Sigh, it's an address alignment issue, which makes sizeof(f2fs_node)
exceed 4096... let me resend the patch.

Thanks,

> 
> Thanks,
> 
>>
>> Thanks,
>>
>>>
>>> Signed-off-by: Chao Yu 
>>> ---
>>> v2:
>>> - add missing sysfs entry.
>>>  fs/f2fs/f2fs.h  |  7 +++
>>>  fs/f2fs/file.c  |  9 +
>>>  fs/f2fs/inode.c | 16 
>>>  fs/f2fs/namei.c |  3 ++-
>>>  fs/f2fs/sysfs.c |  7 +++
>>>  include/linux/f2fs_fs.h |  2 ++
>>>  6 files changed, 43 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>> index b7ba496af28f..6300ac5bcbe4 100644
>>> --- a/fs/f2fs/f2fs.h
>>> +++ b/fs/f2fs/f2fs.h
>>> @@ -124,6 +124,7 @@ struct f2fs_mount_info {
>>>  #define F2FS_FEATURE_INODE_CHKSUM  0x0020
>>>  #define F2FS_FEATURE_FLEXIBLE_INLINE_XATTR 0x0040
>>>  #define F2FS_FEATURE_QUOTA_INO 0x0080
>>> +#define F2FS_FEATURE_INODE_CRTIME  0x0100
>>>  
>>>  #define F2FS_HAS_FEATURE(sb, mask) \
>>> ((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0)
>>> @@ -635,6 +636,7 @@ struct f2fs_inode_info {
>>> int i_extra_isize;  /* size of extra space located in 
>>> i_addr */
>>> kprojid_t i_projid; /* id for project quota */
>>> int i_inline_xattr_size;/* inline xattr size */
>>> +   struct timespec i_crtime;   /* inode creation time */
>>>  };
>>>  
>>>  static inline void get_extent_info(struct extent_info *ext,
>>> @@ -3205,6 +3207,11 @@ static inline int f2fs_sb_has_quota_ino(struct super_block *sb)
>>> return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_QUOTA_INO);
>>>  }
>>>  
>>> +static inline int f2fs_sb_has_inode_crtime(struct super_block *sb)
>>> +{
>>> +   return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_INODE_CRTIME);
>>> +}
>>> +
>>>  #ifdef CONFIG_BLK_DEV_ZONED
>>>  static inline int get_blkz_type(struct f2fs_sb_info *sbi,
>>> struct block_device *bdev, block_t blkaddr)
>>> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
>>> index b9b1efe61d29..0873ba12ebc6 100644
>>> --- a/fs/f2fs/file.c
>>> +++ b/fs/f2fs/file.c
>>> @@ -672,8 +672,17 @@ int f2fs_getattr(const struct path *path, struct kstat *stat,
>>>  {
>>> struct inode *inode = d_inode(path->dentry);
>>> struct f2fs_inode_info *fi = F2FS_I(inode);
>>> +   struct f2fs_inode *ri;
>>> unsigned int flags;
>>>  
>>> +   if (f2fs_has_extra_attr(inode) &&
>>> +   f2fs_sb_has_inode_crtime(inode->i_sb) &&
>>> +   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
>>> +   stat->result_mask |= STATX_BTIME;
>>> +   stat->btime.tv_sec = fi->i_crtime.tv_sec;
>>> +   stat->btime.tv_nsec = fi->i_crtime.tv_nsec;
>>> +   }
>>> +
>>> flags = fi->i_flags & (FS_FL_USER_VISIBLE | FS_PROJINHERIT_FL);
>>> if (flags & FS_APPEND_FL)
>>> stat->attributes |= STATX_ATTR_APPEND;
>>> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
>>> index 1dc77a40d0ad..99ee72bff628 100644
>>> --- a/fs/f2fs/inode.c
>>> +++ b/fs/f2fs/inode.c
>>> @@ -278,6 +278,13 @@ static int do_read_inode(struct inode *inode)
>>> i_projid = F2FS_DEF_PROJID;
>>> fi->i_projid = make_kprojid(&init_user_ns, i_projid);
>>>  
>>> +   if (f2fs_has_extra_attr(inode) && f2fs_sb_has_inode_crtime(sbi->sb) &&
>>> +   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
>>> +   fi->i_crtime.tv_sec = le64_to_cpu(ri->i_crtime);
>>> +   fi->i_crtime.tv_nsec = le32_to_cpu(ri->i_crtime_nsec);
>>> +   }
>>> +
>>> +
>>> f2fs_put_page(node_page, 1);
>>>  
>>> stat_inc_inline_xattr(inode);
>>> @@ -421,6 +428,15 @@ void update_inode(struct inode *inode, struct page *node_page)
>>> F2FS_I(inode)->i_projid);
>>> ri->i_projid = cpu_to_le32(i_projid);
>>> }
>>> +
>>> +   if (f2fs_sb_has_inode_crtime(F2FS_I_SB(inode)->sb) &&
>>> +   F2FS_FITS_IN_INODE(ri, F2FS_I(inode)->i_extra_isize,
>>> +   i_crtime)) {
>>> +   ri->i_crtime =
>>> +   cpu_to_le64(F2FS_I(inode)->i_crtime.tv_sec);
>>> +   ri->i_crtime_nsec =
>>> +   cpu_to_le32(F2FS_I(inode)->i_crtime.tv_nsec);
>>> +   }
>>> }
>>>  
>>> __set_inode_rdev(inode, ri);
>>> diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
>>> index 3ee97ba9d2d7..c

Re: backport Rewrite sync_core() to use IRET-to-self to stable kernels?

2018-01-24 Thread gre...@linuxfoundation.org

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Thu, Jan 25, 2018 at 02:22:57AM +, Zhang, Ning A wrote:
> When Linux runs as a guest OS, CPUID is a privileged instruction, so
> sync_core() will be very slow.
> 
> If this patch is applied, about 200 ms of kernel init time is saved when
> Linux runs as a guest OS.

That's not a regression.  Why not just use a newer kernel release if you
want to get the feature of "faster boot"?  :)

thanks,

greg k-h


Re: [RFC PATCH v2 1/1] of: introduce event tracepoints for dynamic device_node lifecyle

2018-01-24 Thread Frank Rowand
On 01/21/18 06:31, Wolfram Sang wrote:
> From: Tyrel Datwyler 
> 
> This patch introduces event tracepoints for tracking a device_node's
> reference cycle as well as reconfig notifications generated in response
> to node/property manipulations.
> 
> With the recent upstreaming of the refcount API several device_node
> underflows and leaks have come to my attention in the pseries (DLPAR)
> dynamic logical partitioning code (ie. POWER speak for hotplugging
> virtual and physical resources at runtime such as cpus or IOAs). These
> tracepoints provide an easy and quick mechanism for validating the
> reference counting of device_nodes during their lifetime.
> 
> Further, when pseries lpars are migrated to a different machine we
> perform a live update of our device tree to bring it into alignment with
> the configuration of the new machine. The of_reconfig_notify trace point
> provides a mechanism that can be turned on for debugging device tree
> modifications without having to build a custom kernel to get at the
> DEBUG code introduced by commit 00aa37206e1a54 ("of/reconfig: Add debug
> output for OF_RECONFIG notifiers").
> 
> The following trace events are provided: of_node_get, of_node_put,
> of_node_release, and of_reconfig_notify. These trace points require a

Please add a note that the of_reconfig_notify trace event is not an
added bit of debug info, but is instead replacing information that
was previously available via pr_debug() when DEBUG was defined.


> kernel built with ftrace support to be enabled. In a typical environment
> where debugfs is mounted at /sys/kernel/debug the entire set of
> tracepoints can be set with the following:
> 
>   echo "of:*" > /sys/kernel/debug/tracing/set_event
> 
> or
> 
>   echo 1 > /sys/kernel/debug/tracing/events/of/enable
> 
> The following shows the trace point data from a DLPAR remove of a cpu
> from a pseries lpar:
> 
> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
> 
> cpuhp/23-147   [023]    128.324827:
> of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
> of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
> of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324831:
> of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439000:
> of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439002:
> of_reconfig_notify: action=DETACH_NODE, 
> dn->full_name=/cpus/PowerPC,POWER8@10,
> prop->name=null, old_prop->name=null
>drmgr-7284  [009]    128.439015:
> of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439016:
> of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
> 
> Signed-off-by: Tyrel Datwyler 

The following belongs in a list of version 2 changes, below the "---" line:

> [wsa: fixed commit abbrev and one of the sysfs paths in commit desc,
> removed trailing space and fixed pointer declaration in code]

> Signed-off-by: Wolfram Sang 
> ---
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h

mode looks incorrect.  Existing files in include/trace/events/ are -rw-rw


> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index ab988d88704da0..b0d6ab5a35b8c6 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -21,6 +21,9 @@ static struct device_node *kobj_to_device_node(struct 
> kobject *kobj)
>   return container_of(kobj, struct device_node, kobj);
>  }
>  
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  /**
>   * of_node_get() - Increment refcount of a node
>   * @node:Node to inc refcount, NULL is supported to simplify writing of
> @@ -30,8 +33,10 @@ static struct device_node *kobj_to_device_node(struct 
> kobject *kobj)
>   */
>  struct device_node *of_node_get(struct device_node *node)
>  {
> - if (node)
> + if (node) {
>   kobject_get(&node->kobj);
> + trace_of_node_get(refcount_read(&node->kobj.kref.refcount), node->full_name);

See the comment from Ron that I mentioned in my previous email.

Also, the path has been removed from node->full_name.  Does using it here
still give all of the information that is desired? Same for all other uses
of full_name in this patch.

The trace point should have a single argument, node.  Accessing the two
fields can be done in the tracepoint assignment.  Or is there some
reason that can't be done?  Same for the trace_of_node_put() tracepoint.


> + }
>   return node;
>  }
>  EXPORT_SYMBOL(of_node_get);
> @@ -43,8 +48,10 @@ EXPORT_SYMBOL(of_node_get);

Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
On 01/21/18 06:31, Wolfram Sang wrote:
> I got a bug report for a DT node refcounting problem in the I2C subsystem. This
> patch was a huge help in validating the bug report and the proposed solution.
> So, I thought I bring it to attention again. Thanks Tyrel, for the initial
> work!
> 
> Note that I did not test the dynamic updates, only of_node_{get|put} so far. I
> read that Tyrel checked dynamic updates extensively with this patch. And since
> DT overlays are also used within our Renesas dev team, this will help there,
> as well.

It's been nine months since version 1.  If you are going to include the
dynamic updates part of the patch then please test them.


> Tested on a Renesas Salvator-XS board (R-Car H3).
> 
> Changes since RFC v1:
>   * rebased to v4.15-rc8
>   * fixed commit abbrev and one of the sysfs paths in commit desc
>   * removed trailing space and fixed pointer declaration in code
> 

> I consider all the remaining checkpatch issues irrelevant for this patch.

I am OK with the line length warnings in this patch.

Why can't the macro error be fixed?

A file entry needs to be added to MAINTAINERS.


> 
> So what about applying it?
> 
> Kind regards,
> 
>Wolfram
> 
> 
> Tyrel Datwyler (1):
>   of: introduce event tracepoints for dynamic device_node lifecyle
> 
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h
> 



Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
On 01/22/18 03:49, Wolfram Sang wrote:
> Hi Frank,
> 
>> Please go back and read the thread for version 1.  Simply resubmitting a
>> forward port is ignoring that whole conversation.
>>
>> There is a lot of good info in that thread.  I certainly learned stuff in it.
> 
> Yes, I did that and learned stuff, too. My summary of the discussion was:
> 
> - you mentioned some drawbacks you saw (like the mixture of trace output
>   and printk output)
> - most of them look like they were addressed to me (e.g. Steven showed a
>   way to redirect printk to trace)
> - you posted your version (which was, however, marked as "not user friendly"
>   even by yourself)

Not exactly a fair quoting.  There were two things I said:

  "Here is a patch that I have used.  It is not as user friendly in terms
  of human readable stack traces (though a very small user space program
  should be able to fix that)."

 So easy to fix using existing userspace programs to convert kernel
 addresses to symbols.

  "FIXME: Currently using pr_err() so I don't need to set loglevel on boot.

  So obviously not a user friendly tool!!!
  The process is:
 - apply patch
 - configure, build, boot kernel
 - analyze data
 - remove patch"

 So not friendly because it uses pr_err() instead of pr_debug().  In
 a reply I said if I submitted my patches I would change it to use
 pr_debug() instead.  So not an issue.

 And not user friendly because it requires patching the kernel.
 Again a NOP if I submitted my patch, because the patch would
 already be in the kernel.

But whatever, let's ignore that - a poor quoting is not a reason to
reject this version of the patch.


> - The discussion stalled over having two approaches

Then you should have stated such when you resubmitted.


> So, I thought reposting would be a good way of finding out if your
> concerns were addressed in the discussion or not. If I overlooked

Then you should have stated that there were concerns raised in the
discussion and asked me if my concerns were addressed.


> something, I am sorry for that. Still, my intention is to continue the
> discussion, not to ignore it. Because as it stands, we don't have such a
> debugging mechanism in place currently, and with people working with DT
> overlays, I'd think it would be nice to have.
> 
> Kind regards,
> 
>Wolfram
> 


Rob suggested:

 >
 > @@ -25,8 +28,10 @@
 >   */
 >  struct device_node *of_node_get(struct device_node *node)
 >  {
 > -   if (node)
 > +   if (node) {
 > kobject_get(&node->kobj);
 > +   trace_of_node_get(refcount_read(&node->kobj.kref.refcount), node->full_name);

 Seems like there should be a kobj wrapper to read the refcount.

As far as I noticed, that was never addressed.  I don't know the answer, but
the question was asked.  And if there is no such function, then there is at
least kref_read(), which would improve the code a little bit.

I'll reply to the patch 0/1 and patch 1/1 emails with review comments.

-Frank


Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns

2018-01-24 Thread vcaputo
On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  :
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  :
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcap...@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux 
> >> >> git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a 
> >> >> problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built 
> >> >> kernels
> >> >> in my grub menu, it was trivial to determine approximately when this 
> >> >> began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and 
> >> >> quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray 
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >> dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >> Running dm-crypt with workqueues at the standard priority results 
> >> >> in IO
> >> >> competing for CPU time with standard user apps, which can lead to
> >> >> pipeline bubbles and seriously degraded performance.  Move to using
> >> >> WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >> Signed-off-by: Tim Murray 
> >> >> Signed-off-by: Enric Balletbo i Serra 
> >> >> Signed-off-by: Mike Snitzer 
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message 
> >> >> isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority 
> >> >> and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're 
> >> >> lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf 
> >> >> of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by 
> >> >> generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its 
> >> >> core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially 
> >> >> by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being 
> >> >> done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and 
> >> >> scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async 
> >> >> background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on 
> >> >> dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left 
> >> > wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you pinged me directly. That
> >> issue is interesting, though we didn't notice this problem. It's been a
> >> while since I tested this patch, but I'll set up the environment again
> >> and do more tests to understand better what is happening.
> >>
> >
> > Any update on this?
> >
> 
> I did not reproduce the issue for now. Can you try what happens if you
> remove the WQ_CPU_INT

linux-next: manual merge of the net-next tree with the vfs tree

2018-01-24 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/tipc/socket.c

between commit:

  ade994f4f6c8 ("net: annotate ->poll() instances")

from the vfs tree and commit:

  60c253069632 ("tipc: fix race between poll() and setsockopt()")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/tipc/socket.c
index 2aa46e8cd8fe,473a096b6fba..
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@@ -715,8 -716,7 +716,7 @@@ static __poll_t tipc_poll(struct file *
  {
struct sock *sk = sock->sk;
struct tipc_sock *tsk = tipc_sk(sk);
-   struct tipc_group *grp = tsk->group;
 -  u32 revents = 0;
 +  __poll_t revents = 0;
  
sock_poll_wait(file, sk_sleep(sk), wait);
  


Re: [PATCH RFC 09/16] prcu: Implement prcu_barrier() API

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:34PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> This is PRCU's counterpart of RCU's rcu_barrier() API.
> 
> Reviewed-by: Heng Zhang 
> Signed-off-by: Lihao Liang 
> ---
>  include/linux/prcu.h |  7 ++
>  kernel/rcu/prcu.c| 63 
> 
>  2 files changed, 70 insertions(+)
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> index 4e7d5d65..cce967fd 100644
> --- a/include/linux/prcu.h
> +++ b/include/linux/prcu.h
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #define CONFIG_PRCU
> 
> @@ -32,6 +33,7 @@ struct prcu_local_struct {
>   unsigned int online;
>   unsigned long long version;
>   unsigned long long cb_version;
> + struct rcu_head barrier_head;
>   struct prcu_cblist cblist;
>  };
> 
> @@ -39,8 +41,11 @@ struct prcu_struct {
>   atomic64_t global_version;
>   atomic64_t cb_version;
>   atomic_t active_ctr;
> + atomic_t barrier_cpu_count;
>   struct mutex mtx;
> + struct mutex barrier_mtx;
>   wait_queue_head_t wait_q;
> + struct completion barrier_completion;
>  };
> 
>  #ifdef CONFIG_PRCU
> @@ -48,6 +53,7 @@ void prcu_read_lock(void);
>  void prcu_read_unlock(void);
>  void synchronize_prcu(void);
>  void call_prcu(struct rcu_head *head, rcu_callback_t func);
> +void prcu_barrier(void);
>  void prcu_init(void);
>  void prcu_note_context_switch(void);
>  int prcu_pending(void);
> @@ -60,6 +66,7 @@ void prcu_check_callbacks(void);
>  #define prcu_read_unlock() do {} while (0)
>  #define synchronize_prcu() do {} while (0)
>  #define call_prcu() do {} while (0)
> +#define prcu_barrier() do {} while (0)
>  #define prcu_init() do {} while (0)
>  #define prcu_note_context_switch() do {} while (0)
>  #define prcu_pending() 0
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> index 373039c5..2664d091 100644
> --- a/kernel/rcu/prcu.c
> +++ b/kernel/rcu/prcu.c
> @@ -15,6 +15,7 @@ struct prcu_struct global_prcu = {
>   .cb_version = ATOMIC64_INIT(0),
>   .active_ctr = ATOMIC_INIT(0),
>   .mtx = __MUTEX_INITIALIZER(global_prcu.mtx),
> + .barrier_mtx = __MUTEX_INITIALIZER(global_prcu.barrier_mtx),
>   .wait_q = __WAIT_QUEUE_HEAD_INITIALIZER(global_prcu.wait_q)
>  };
>  struct prcu_struct *prcu = &global_prcu;
> @@ -250,6 +251,68 @@ static __latent_entropy void prcu_process_callbacks(struct softirq_action *unuse
>   local_irq_restore(flags);
>  }
> 
> +/*
> + * PRCU callback function for prcu_barrier().
> + * If we are last, wake up the task executing prcu_barrier().
> + */
> +static void prcu_barrier_callback(struct rcu_head *rhp)
> +{
> + if (atomic_dec_and_test(&prcu->barrier_cpu_count))
> + complete(&prcu->barrier_completion);
> +}
> +
> +/*
> + * Called with preemption disabled, and from cross-cpu IRQ context.
> + */
> +static void prcu_barrier_func(void *info)
> +{
> + struct prcu_local_struct *local = this_cpu_ptr(&prcu_local);
> +
> + atomic_inc(&prcu->barrier_cpu_count);
> + call_prcu(&local->barrier_head, prcu_barrier_callback);
> +}
> +
> +/* Waiting for all PRCU callbacks to complete. */
> +void prcu_barrier(void)
> +{
> + int cpu;
> +
> + /* Take mutex to serialize concurrent prcu_barrier() requests. */
> + mutex_lock(&prcu->barrier_mtx);
> +
> + /*
> +  * Initialize the count to one rather than to zero in order to
> +  * avoid a too-soon return to zero in case of a short grace period
> +  * (or preemption of this task).
> +  */
> + init_completion(&prcu->barrier_completion);
> + atomic_set(&prcu->barrier_cpu_count, 1);
> +
> + /*
> +  * Register a new callback on each CPU using IPI to prevent races
> +  * with call_prcu(). When that callback is invoked, we will know
> +  * that all of the corresponding CPU's preceding callbacks have
> +  * been invoked.
> +  */
> + for_each_possible_cpu(cpu)
> + smp_call_function_single(cpu, prcu_barrier_func, NULL, 1);

This code seems to be assuming CONFIG_HOTPLUG_CPU=n.  This might explain
your rcutorture failure.

> + /* Decrement the count as we initialize it to one. */
> + if (atomic_dec_and_test(&prcu->barrier_cpu_count))
> + complete(&prcu->barrier_completion);
> +
> + /*
> +  * Now that we have an prcu_barrier_callback() callback on each
> +  * CPU, and thus each counted, remove the initial count.
> +  * Wait for all prcu_barrier_callback() callbacks to be invoked.
> +  */
> + wait_for_completion(&prcu->barrier_completion);
> +
> + /* Other rcu_barrier() invocations can now safely proceed. */
> + mutex_unlock(&prcu->barrier_mtx);
> +}
> +EXPORT_SYMBOL(prcu_barrier);
> +
>  void prcu_init_local_struct(int cpu)
>  {
>   struct prcu_local_struct *local;
> -- 
> 2.14.1.729.g59c0ea183
> 



Re: [PATCH RFC 07/16] prcu: Implement call_prcu() API

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:32PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> This is PRCU's counterpart of RCU's call_rcu() API.
> 
> Reviewed-by: Heng Zhang 
> Signed-off-by: Lihao Liang 
> ---
>  include/linux/prcu.h | 25 
>  init/main.c  |  2 ++
>  kernel/rcu/prcu.c| 67 +---
>  3 files changed, 91 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> index 653b4633..e5e09c9b 100644
> --- a/include/linux/prcu.h
> +++ b/include/linux/prcu.h
> @@ -2,15 +2,36 @@
>  #define __LINUX_PRCU_H
> 
>  #include 
> +#include 
>  #include 
>  #include 
> 
>  #define CONFIG_PRCU
> 
> +struct prcu_version_head {
> + unsigned long long version;
> + struct prcu_version_head *next;
> +};
> +
> +/* Simple unsegmented callback list for PRCU. */
> +struct prcu_cblist {
> + struct rcu_head *head;
> + struct rcu_head **tail;
> + struct prcu_version_head *version_head;
> + struct prcu_version_head **version_tail;
> + long len;
> +};
> +
> +#define PRCU_CBLIST_INITIALIZER(n) { \
> + .head = NULL, .tail = &n.head, \
> + .version_head = NULL, .version_tail = &n.version_head, \
> +}
> +
>  struct prcu_local_struct {
>   unsigned int locked;
>   unsigned int online;
>   unsigned long long version;
> + struct prcu_cblist cblist;
>  };
> 
>  struct prcu_struct {
> @@ -24,6 +45,8 @@ struct prcu_struct {
>  void prcu_read_lock(void);
>  void prcu_read_unlock(void);
>  void synchronize_prcu(void);
> +void call_prcu(struct rcu_head *head, rcu_callback_t func);
> +void prcu_init(void);
>  void prcu_note_context_switch(void);
> 
>  #else /* #ifdef CONFIG_PRCU */
> @@ -31,6 +54,8 @@ void prcu_note_context_switch(void);
>  #define prcu_read_lock() do {} while (0)
>  #define prcu_read_unlock() do {} while (0)
>  #define synchronize_prcu() do {} while (0)
> +#define call_prcu() do {} while (0)
> +#define prcu_init() do {} while (0)
>  #define prcu_note_context_switch() do {} while (0)
> 
>  #endif /* #ifdef CONFIG_PRCU */
> diff --git a/init/main.c b/init/main.c
> index f8665104..4925964e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -38,6 +38,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -574,6 +575,7 @@ asmlinkage __visible void __init start_kernel(void)
>   workqueue_init_early();
> 
>   rcu_init();
> + prcu_init();
> 
>   /* Trace events are available after this */
>   trace_init();
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> index a00b9420..f198285c 100644
> --- a/kernel/rcu/prcu.c
> +++ b/kernel/rcu/prcu.c
> @@ -1,11 +1,12 @@
>  #include 
> -#include 
>  #include 
> -#include 
> +#include 
>  #include 
> -
> +#include 
>  #include 
> 
> +#include "rcu.h"
> +
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct prcu_local_struct, prcu_local);
> 
>  struct prcu_struct global_prcu = {
> @@ -16,6 +17,16 @@ struct prcu_struct global_prcu = {
>  };
>  struct prcu_struct *prcu = &global_prcu;
> 
> +/* Initialize simple callback list. */
> +static void prcu_cblist_init(struct prcu_cblist *rclp)
> +{
> + rclp->head = NULL;
> + rclp->tail = &rclp->head;
> + rclp->version_head = NULL;
> + rclp->version_tail = &rclp->version_head;
> + rclp->len = 0;
> +}
> +
>  static inline void prcu_report(struct prcu_local_struct *local)
>  {
>   unsigned long long global_version;
> @@ -123,3 +134,53 @@ void prcu_note_context_switch(void)
>   prcu_report(local);
>   put_cpu_ptr(&prcu_local);
>  }
> +
> +void call_prcu(struct rcu_head *head, rcu_callback_t func)
> +{
> + unsigned long flags;
> + struct prcu_local_struct *local;
> + struct prcu_cblist *rclp;
> + struct prcu_version_head *vhp;
> +
> + debug_rcu_head_queue(head);
> +
> + /* Use GFP_ATOMIC with IRQs disabled */
> + vhp = kmalloc(sizeof(struct prcu_version_head), GFP_ATOMIC);
> + if (!vhp)
> + return;

Silently failing to post the callback can cause system hangs.  I suggest
finding some way to avoid allocating on the call_prcu() code path.

Thanx, Paul

> +
> + head->func = func;
> + head->next = NULL;
> + vhp->next = NULL;
> +
> + local_irq_save(flags);
> + local = this_cpu_ptr(&prcu_local);
> + vhp->version = local->version;
> + rclp = &local->cblist;
> + rclp->len++;
> + *rclp->tail = head;
> + rclp->tail = &head->next;
> + *rclp->version_tail = vhp;
> + rclp->version_tail = &vhp->next;
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(call_prcu);
> +
> +void prcu_init_local_struct(int cpu)
> +{
> + struct prcu_local_struct *local;
> +
> + local = per_cpu_ptr(&prcu_local, cpu);
> + local->locked = 0;
> + local->online = 0;
> + local->version = 0;
> + prcu_cblist_init(&local->cblist);
> +}
> +
> +void __init prcu_init(void)
> +

Re: [PATCH RFC 06/16] rcuperf: Set gp_exp to true for tests to run

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:31PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Signed-off-by: Lihao Liang 
> ---
>  kernel/rcu/rcuperf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> index ea80fa3e..baccc123 100644
> --- a/kernel/rcu/rcuperf.c
> +++ b/kernel/rcu/rcuperf.c
> @@ -60,7 +60,7 @@ MODULE_AUTHOR("Paul E. McKenney 
> ");
>  #define VERBOSE_PERFOUT_ERRSTRING(s) \
>   do { if (verbose) pr_alert("%s" PERF_FLAG "!!! %s\n", perf_type, s); } 
> while (0)
> 
> -torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
> +torture_param(bool, gp_exp, true, "Use expedited GP wait primitives");

This is fine as a convenience for internal testing, but the usual way
to make this happen is using the rcuperf.gp_exp kernel boot parameter.
Or was that not working for you?

Thanx, Paul

>  torture_param(int, holdoff, 10, "Holdoff time before test start (s)");
>  torture_param(int, nreaders, -1, "Number of RCU reader threads");
>  torture_param(int, nwriters, -1, "Number of RCU updater threads");
> -- 
> 2.14.1.729.g59c0ea183
> 



Re: [PATCH RFC 15/16] rcutorture: Add scripts to run experiments

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:40PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Signed-off-by: Lihao Liang 
> ---
>  kvm.sh | 452 
> +
>  run-rcuperf.sh |  26 

The usual approach would be to add what you need to the existing kvm.sh...

Thanx, Paul

>  2 files changed, 478 insertions(+)
>  create mode 100755 kvm.sh
>  create mode 100755 run-rcuperf.sh
> 
> diff --git a/kvm.sh b/kvm.sh
> new file mode 100755
> index ..3b3c1b69
> --- /dev/null
> +++ b/kvm.sh
> @@ -0,0 +1,452 @@
> +#!/bin/bash
> +#
> +# Run a series of 14 tests under KVM.  These are not particularly
> +# well-selected or well-tuned, but are the current set.  Run from the
> +# top level of the source tree.
> +#
> +# Edit the definitions below to set the locations of the various directories,
> +# as well as the test duration.
> +#
> +# Usage: kvm.sh [ options ]
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, you can access it online at
> +# http://www.gnu.org/licenses/gpl-2.0.html.
> +#
> +# Copyright (C) IBM Corporation, 2011
> +#
> +# Authors: Paul E. McKenney 
> +
> +scriptname=$0
> +args="$*"
> +
> +T=/tmp/kvm.sh.$$
> +trap 'rm -rf $T' 0
> +mkdir $T
> +
> +dur=$((30*60))
> +dryrun=""
> +KVM="`pwd`/tools/testing/selftests/rcutorture"; export KVM
> +PATH=${KVM}/bin:$PATH; export PATH
> +TORTURE_DEFCONFIG=defconfig
> +TORTURE_BOOT_IMAGE=""
> +TORTURE_INITRD="$KVM/initrd"; export TORTURE_INITRD
> +TORTURE_KMAKE_ARG=""
> +TORTURE_SHUTDOWN_GRACE=180
> +TORTURE_SUITE=rcu
> +resdir=""
> +configs=""
> +cpus=0
> +ds=`date +%Y.%m.%d-%H:%M:%S`
> +jitter="-1"
> +
> +. functions.sh
> +
> +usage () {
> + echo "Usage: $scriptname optional arguments:"
> + echo "   --bootargs kernel-boot-arguments"
> + echo "   --bootimage relative-path-to-kernel-boot-image"
> + echo "   --buildonly"
> + echo "   --configs \"config-file list w/ repeat factor (3*TINY01)\""
> + echo "   --cpus N"
> + echo "   --datestamp string"
> + echo "   --defconfig string"
> + echo "   --dryrun sched|script"
> + echo "   --duration minutes"
> + echo "   --interactive"
> + echo "   --jitter N [ maxsleep (us) [ maxspin (us) ] ]"
> + echo "   --kmake-arg kernel-make-arguments"
> + echo "   --mac nn:nn:nn:nn:nn:nn"
> + echo "   --no-initrd"
> + echo "   --qemu-args qemu-system-..."
> + echo "   --qemu-cmd qemu-system-..."
> + echo "   --results absolute-pathname"
> + echo "   --torture rcu"
> + exit 1
> +}
> +
> +while test $# -gt 0
> +do
> + case "$1" in
> + --bootargs|--bootarg)
> + checkarg --bootargs "(list of kernel boot arguments)" "$#" "$2" 
> '.*' '^--'
> + TORTURE_BOOTARGS="$2"
> + shift
> + ;;
> + --bootimage)
> + checkarg --bootimage "(relative path to kernel boot image)" 
> "$#" "$2" '[a-zA-Z0-9][a-zA-Z0-9_]*' '^--'
> + TORTURE_BOOT_IMAGE="$2"
> + shift
> + ;;
> + --buildonly)
> + TORTURE_BUILDONLY=1
> + ;;
> + --configs|--config)
> + checkarg --configs "(list of config files)" "$#" "$2" '^[^/]*$' 
> '^--'
> + configs="$2"
> + shift
> + ;;
> + --cpus)
> + checkarg --cpus "(number)" "$#" "$2" '^[0-9]*$' '^--'
> + cpus=$2
> + shift
> + ;;
> + --datestamp)
> + checkarg --datestamp "(relative pathname)" "$#" "$2" '^[^/]*$' 
> '^--'
> + ds=$2
> + shift
> + ;;
> + --defconfig)
> + checkarg --defconfig "defconfigtype" "$#" "$2" '^[^/][^/]*$' 
> '^--'
> + TORTURE_DEFCONFIG=$2
> + shift
> + ;;
> + --dryrun)
> + checkarg --dryrun "sched|script" $# "$2" 'sched\|script' '^--'
> + dryrun=$2
> + shift
> + ;;
> + --duration)
> + checkarg --duration "(minutes)" $# "$2" '^[0-9]*$' '^error'
> + dur=$(($2*60))
> + shift
> + ;;
> + --interactive)
> + TORTURE_QEMU_INTERACTIVE=1; export TORTURE_QEMU_INTERACTIVE
> + ;;
> + --jitter)
> + ch

Re: [PATCH RFC 00/16] A new RCU implementation based on a fast consensus protocol

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:25PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Dear Paul,
> 
> This patch set implements a preemptive version of RCU (PRCU) based on the 
> following paper:
> 
> Fast Consensus Using Bounded Staleness for Scalable Read-mostly 
> Synchronization.
> Haibo Chen, Heng Zhang, Ran Liu, Binyu Zang, and Haibing Guan.
> IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
> https://dl.acm.org/citation.cfm?id=3024114.3024143
> 
> We have also added preliminary callback-handling support.  Thus, the current 
> version
> provides APIs prcu_read_lock(), prcu_read_unlock(), synchronize_prcu(), 
> call_prcu(),
> and prcu_barrier().
> 
> This is an experimental patch, so it would be good to have some feedback.
> 
> Known shortcoming is that the grace-period version is incremented in 
> synchronize_prcu().
> If call_prcu() or prcu_barrier() is called but there is no 
> synchronize_prcu() invoked,
> callbacks cannot be invoked.  Later version should address this issue, e.g. 
> adding a
> grace-period expedition mechanism.  Others include using a hierarchical 
> structure,
> taking into account the NUMA topology, to send IPI in synchronize_prcu().
> 
> We have tested the implementation using rcutorture on both an x86 and ARM64 
> machine.
> PRCU passed 1h and 3h tests on all the newly added config files except PRCU07 
> reported BUG 
> in a 1h run.
> 
> [ 1593.604201] ---[ end trace b3bae911bec86152 ]---
> [ 1594.629450] prcu-torture:torture_onoff task: offlining 14
> [ 1594.73] smpboot: CPU 14 is now offline
> [ 1594.757732] prcu-torture:torture_onoff task: offlined 14
> [ 1597.765149] prcu-torture:torture_onoff task: onlining 11
> [ 1597.766795] smpboot: Booting Node 0 Processor 11 APIC 0xb
> [ 1597.804102] prcu-torture:torture_onoff task: onlined 11
> [ 1599.365098] prcu-torture: rtc: b0277b90 ver: 66358 tfle: 0 rta: 
> 66358 rtaf: 0 
> rtf: 66349 rtmbe: 0 rtbe: 1 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 2233418 
> onoff: 191/191:199/199 34,199:59,5102 10403:0 (HZ=1000) barrier: 188/189:1 
> cbflood: 225
> [ 1599.367946] prcu-torture: !!!
> [ 1599.367966] [ cut here ]

The "rtbe: 1" indicates that your implementation of prcu_barrier()
failed to wait for all preceding call_prcu() callbacks to be invoked.

Does the immediately following "Reader Pipe:" list have any but the
first two numbers non-zero?

> We have also compared PRCU with TREE RCU using rcuperf with gp_exp set to 
> true, that is
> synchronize_rcu_expedited was tested.
> 
> The rcuperf results are as follows (average grace-period duration in ms of 
> ten 10min runs):
> 
> 16*Intel Xeon CPU@2.4GHz, 16GB memory, Ubuntu Linux 3.13.0-47-generic
> 
> CPUs  2   4   8  12  15   16
> PRCU   0.141.074.158.02   10.7915.16 
> TREE  49.30  104.75  277.55  390.82  620.82  1381.54
> 
> 64*Cortex-A72 CPU@2.4GHz, 130GB memory, Ubuntu Linux 4.10.0-21.23-generic
> 
> CPUs   2   48  16  32   48   6364
> PRCU0.23   19.6938.28   63.21   95.41   167.18   252.01   1841.44
> TREE  416.73  901.89  1060.86  743.00  920.66  1325.21  1646.20  23806.27

Well, at the very least, this is a bug report on either expedited RCU
grace-period latency or on rcuperf's measurements, and thank you for that.
I will look into this.  In the meantime, could you please let me know
exactly how you invoked rcuperf?

I have a few comments on some of your patches based on a quick scan
through them.

Thanx, Paul

> Best wishes,
> Lihao.
> 
> 
> Lihao Liang (15):
>   rcutorture: Add PRCU rcu_torture_ops
>   rcutorture: Add PRCU test config files
>   rcuperf: Add PRCU rcu_perf_ops
>   rcuperf: Add PRCU test config files
>   rcuperf: Set gp_exp to true for tests to run
>   prcu: Implement call_prcu() API
>   prcu: Implement PRCU callback processing
>   prcu: Implement prcu_barrier() API
>   rcutorture: Test call_prcu() and prcu_barrier()
>   rcutorture: Add basic ARM64 support to run scripts
>   prcu: Add PRCU Kconfig parameter
>   prcu: Comment source code
>   rcuperf: Add config files with various CONFIG_NR_CPUS
>   rcutorture: Add scripts to run experiments
>   Add GPLv2 license
> 
> Heng Zhang (1):
>   prcu: Add PRCU implementation
> 
>  include/linux/interrupt.h  |   3 +
>  include/linux/prcu.h   | 122 +
>  include/linux/rcupdate.h   |   1 +
>  init/Kconfig   |   7 +
>  init/main.c|   2 +
>  kernel/rcu/Makefile|   1 +
>  kernel/rcu/prcu.c  | 497 
> +
>  kernel/rcu/rcuperf.c   |  33 +-
>  kernel/rcu/rcutorture.c|  40 +-
>  kernel/rcu/tree.c  |   1 +

Re: [PATCH RFC 03/16] rcutorture: Add PRCU test config files

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:28PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Use the same config files as TREE02, TREE03, TREE06, TREE07, and TREE09.
> 
> Signed-off-by: Lihao Liang 
> ---
>  .../selftests/rcutorture/configs/rcu/CFLIST|  5 
>  .../selftests/rcutorture/configs/rcu/PRCU02| 27 
> ++
>  .../selftests/rcutorture/configs/rcu/PRCU02.boot   |  1 +
>  .../selftests/rcutorture/configs/rcu/PRCU03| 23 ++
>  .../selftests/rcutorture/configs/rcu/PRCU03.boot   |  2 ++
>  .../selftests/rcutorture/configs/rcu/PRCU06| 26 +
>  .../selftests/rcutorture/configs/rcu/PRCU06.boot   |  5 
>  .../selftests/rcutorture/configs/rcu/PRCU07| 25 
>  .../selftests/rcutorture/configs/rcu/PRCU07.boot   |  2 ++
>  .../selftests/rcutorture/configs/rcu/PRCU09| 19 +++
>  .../selftests/rcutorture/configs/rcu/PRCU09.boot   |  1 +
>  11 files changed, 136 insertions(+)
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU02
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU03
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU06
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU06.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU07
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU07.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU09
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU09.boot
> 
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST 
> b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
> index a3a1a05a..7359e194 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
> @@ -1,3 +1,8 @@
> +PRCU02
> +PRCU03
> +PRCU06
> +PRCU07
> +PRCU09
>  TREE01
>  TREE02
>  TREE03
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU02 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02
> new file mode 100644
> index ..5f532f05
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02
> @@ -0,0 +1,27 @@
> +CONFIG_SMP=y
> +CONFIG_NR_CPUS=8
> +CONFIG_PREEMPT_NONE=n
> +CONFIG_PREEMPT_VOLUNTARY=n
> +CONFIG_PREEMPT=y
> +CONFIG_PRCU=y
> +#CHECK#CONFIG_PREEMPT_RCU=y
> +CONFIG_HZ_PERIODIC=n
> +CONFIG_NO_HZ_IDLE=y
> +CONFIG_NO_HZ_FULL=n
> +CONFIG_RCU_FAST_NO_HZ=n
> +CONFIG_RCU_TRACE=n
> +CONFIG_HOTPLUG_CPU=n
> +CONFIG_SUSPEND=n
> +CONFIG_HIBERNATION=n
> +CONFIG_RCU_FANOUT=3
> +CONFIG_RCU_FANOUT_LEAF=3
> +CONFIG_RCU_NOCB_CPU=n
> +CONFIG_DEBUG_LOCK_ALLOC=y
> +CONFIG_PROVE_LOCKING=n
> +CONFIG_RCU_BOOST=n
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> +CONFIG_RCU_EXPERT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot
> new file mode 100644
> index ..6c5e626f
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot
> @@ -0,0 +1 @@
> +rcutorture.torture_type=prcu
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU03 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03
> new file mode 100644
> index ..869cadc8
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03
> @@ -0,0 +1,23 @@
> +CONFIG_SMP=y
> +CONFIG_NR_CPUS=16
> +CONFIG_PREEMPT_NONE=n
> +CONFIG_PREEMPT_VOLUNTARY=n
> +CONFIG_PREEMPT=y
> +CONFIG_PRCU=y
> +#CHECK#CONFIG_PREEMPT_RCU=y
> +CONFIG_HZ_PERIODIC=y
> +CONFIG_NO_HZ_IDLE=n
> +CONFIG_NO_HZ_FULL=n
> +CONFIG_RCU_TRACE=y
> +CONFIG_HOTPLUG_CPU=y

And from what I can see, PRCU doesn't handle CPU hotplug.  I would not
be surprised to see rcutorture failures when running this scenario.

> +CONFIG_RCU_FANOUT=2
> +CONFIG_RCU_FANOUT_LEAF=2
> +CONFIG_RCU_NOCB_CPU=n
> +CONFIG_DEBUG_LOCK_ALLOC=n
> +CONFIG_RCU_BOOST=y
> +CONFIG_RCU_KTHREAD_PRIO=2
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> +CONFIG_RCU_EXPERT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot
> new file mode 100644
> index ..0be10cba
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot
> @@ -0,0 +1,2 @@
> +rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
> +rcutorture.torture_type=prcu
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU06 
> b/tools/testing/

Re: [PATCH RFC 01/16] prcu: Add PRCU implementation

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:26PM +0800, liangli...@huawei.com wrote:
> From: Heng Zhang 
> 
> This RCU implementation (PRCU) is based on a fast consensus protocol
> published in the following paper:
> 
> Fast Consensus Using Bounded Staleness for Scalable Read-mostly 
> Synchronization.
> Haibo Chen, Heng Zhang, Ran Liu, Binyu Zang, and Haibing Guan.
> IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
> https://dl.acm.org/citation.cfm?id=3024114.3024143
> 
> Signed-off-by: Heng Zhang 
> Signed-off-by: Lihao Liang 

A few comments and questions interspersed.

Thanx, Paul

> ---
>  include/linux/prcu.h |  37 +++
>  kernel/rcu/Makefile  |   2 +-
>  kernel/rcu/prcu.c| 125 
> +++
>  kernel/sched/core.c  |   2 +
>  4 files changed, 165 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/prcu.h
>  create mode 100644 kernel/rcu/prcu.c
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> new file mode 100644
> index ..653b4633
> --- /dev/null
> +++ b/include/linux/prcu.h
> @@ -0,0 +1,37 @@
> +#ifndef __LINUX_PRCU_H
> +#define __LINUX_PRCU_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define CONFIG_PRCU
> +
> +struct prcu_local_struct {
> + unsigned int locked;
> + unsigned int online;
> + unsigned long long version;
> +};
> +
> +struct prcu_struct {
> + atomic64_t global_version;
> + atomic_t active_ctr;
> + struct mutex mtx;
> + wait_queue_head_t wait_q;
> +};
> +
> +#ifdef CONFIG_PRCU
> +void prcu_read_lock(void);
> +void prcu_read_unlock(void);
> +void synchronize_prcu(void);
> +void prcu_note_context_switch(void);
> +
> +#else /* #ifdef CONFIG_PRCU */
> +
> +#define prcu_read_lock() do {} while (0)
> +#define prcu_read_unlock() do {} while (0)
> +#define synchronize_prcu() do {} while (0)
> +#define prcu_note_context_switch() do {} while (0)

If CONFIG_PRCU=n and some code is built that uses PRCU, shouldn't you
get a build error rather than an error-free but inoperative PRCU?

Of course, Peter's question about purpose of the patch set applies
here as well.

> +
> +#endif /* #ifdef CONFIG_PRCU */
> +#endif /* __LINUX_PRCU_H */
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 23803c7d..8791419c 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT := n
> 
> -obj-y += update.o sync.o
> +obj-y += update.o sync.o prcu.o
>  obj-$(CONFIG_CLASSIC_SRCU) += srcu.o
>  obj-$(CONFIG_TREE_SRCU) += srcutree.o
>  obj-$(CONFIG_TINY_SRCU) += srcutiny.o
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> new file mode 100644
> index ..a00b9420
> --- /dev/null
> +++ b/kernel/rcu/prcu.c
> @@ -0,0 +1,125 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +DEFINE_PER_CPU_SHARED_ALIGNED(struct prcu_local_struct, prcu_local);
> +
> +struct prcu_struct global_prcu = {
> + .global_version = ATOMIC64_INIT(0),
> + .active_ctr = ATOMIC_INIT(0),
> + .mtx = __MUTEX_INITIALIZER(global_prcu.mtx),
> + .wait_q = __WAIT_QUEUE_HEAD_INITIALIZER(global_prcu.wait_q)
> +};
> +struct prcu_struct *prcu = &global_prcu;
> +
> +static inline void prcu_report(struct prcu_local_struct *local)
> +{
> + unsigned long long global_version;
> + unsigned long long local_version;
> +
> + global_version = atomic64_read(&prcu->global_version);
> + local_version = local->version;
> + if (global_version > local_version)
> + cmpxchg(&local->version, local_version, global_version);
> +}
> +
> +void prcu_read_lock(void)
> +{
> + struct prcu_local_struct *local;
> +
> + local = get_cpu_ptr(&prcu_local);
> + if (!local->online) {
> + WRITE_ONCE(local->online, 1);
> + smp_mb();
> + }
> +
> + local->locked++;
> + put_cpu_ptr(&prcu_local);
> +}
> +EXPORT_SYMBOL(prcu_read_lock);
> +
> +void prcu_read_unlock(void)
> +{
> + int locked;
> + struct prcu_local_struct *local;
> +
> + barrier();
> + local = get_cpu_ptr(&prcu_local);
> + locked = local->locked;
> + if (locked) {
> + local->locked--;
> + if (locked == 1)
> + prcu_report(local);

Is ordering important here?  It looks to me that the compiler could
rearrange some of the accesses within prcu_report() with the local->locked
decrement.  There appears to be some potential for load and store tearing,
though perhaps you have verified that your compiler avoids this on
the architecture that you are using.

> + put_cpu_ptr(&prcu_local);
> + } else {

Hmmm...  We get here if the RCU read-side critical section was preempted.
If none of them are preempted, ->active_ctr remains zero.

> + put_cpu_ptr(&prcu_local);
> + if (!atomic_dec_return(&prcu->active_ctr))
> +

Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-24 Thread jianchao.wang
Hi Eric

Thanks for your kind response and suggestion.
That's really appreciated.

Jianchao

On 01/25/2018 11:55 AM, Eric Dumazet wrote:
> On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:
>> Hi Tariq
>>
>> On 01/22/2018 10:12 AM, jianchao.wang wrote:
> On 19/01/2018 5:49 PM, Eric Dumazet wrote:
>> On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
>>> Hi Tariq
>>>
>>> Very sad that the crash was reproduced again after applied the patch.

 Memory barriers vary for different Archs, can you please share more 
 details regarding arch and repro steps?
>>> The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
>>> 12/27/2015
>>> The xen is installed. The crash occurred in DOM0.
>>> Regarding to the repro steps, it is a customer's test which does heavy disk 
>>> I/O over NFS storage without any guest.
>>>
>>
>> What is the finial suggestion on this ?
>> If we use wmb() there, will performance be pulled down?
> 
> Since 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155
> 
> we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.
> 
> I doubt the additional wmb() will have serious impact there.
> 
> 


Re: [PATCH] powerpc: pseries: use irq_of_parse_and_map helper

2018-01-24 Thread Michael Ellerman
Rob Herring  writes:

> On Tue, Jan 23, 2018 at 12:53 AM, Michael Ellerman  
> wrote:
>> Rob Herring  writes:
>>
>>> Instead of calling both of_irq_parse_one and irq_create_of_mapping, call
>>> of_irq_parse_and_map instead which does the same thing. This gets us closer
>>> to making the former 2 functions static.
...
>> Are you trying to remove the low-level routines or is this just a
>> cleanup?
>
> The former, but I'm not sure that will happen. There's a handful of
> others left, but they aren't simply a call to of_irq_parse_one and
> then irq_create_of_mapping.
>
>> The patch below works, it loses the error handling if the interrupts
>> property is corrupt/empty, but that's probably overly paranoid anyway.
>
> Not quite. Previously, it was silent if parsing failed. Only the
> mapping would give an error which would mean the interrupt parent had
> some error.
>
> Actually, we could use of_irq_get here to preserve the error handling.
> It will return error codes from parsing, 0 on mapping failure, or the
> Linux irq number. It adds an irq_find_host call for deferred probe,
> but that should be harmless. I'll respin it.

OK thanks.

cheers


linux-next: build failure after merge of the pci tree

2018-01-24 Thread Stephen Rothwell
Hi Bjorn,

After merging the pci tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

arch/powerpc/kernel/pci-common.c: In function 'pcibios_setup_device':
arch/powerpc/kernel/pci-common.c:406:15: error: 'virq' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
  pci_dev->irq = virq;
  ~^~
arch/powerpc/kernel/pci-common.c:365:15: note: 'virq' was declared here
  unsigned int virq;
   ^~~~

Caused by commit

  c5042ac60fe5 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")

I have applied the following patch for today:

From: Stephen Rothwell 
Date: Thu, 25 Jan 2018 16:44:19 +1100
Subject: [PATCH] powerpc/pci: fix for "Use of_irq_parse_and_map_pci() helper"

Fixes: c5042ac60fe5 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
Signed-off-by: Stephen Rothwell 
---
 arch/powerpc/kernel/pci-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 6be3f2c22a9b..ae2ede4de6be 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -362,7 +362,7 @@ struct pci_controller* pci_find_hose_for_OF_device(struct 
device_node* node)
  */
 static int pci_read_irq_line(struct pci_dev *pci_dev)
 {
-   unsigned int virq;
+   unsigned int virq = 0;
 
pr_debug("PCI: Try to map irq for %s...\n", pci_name(pci_dev));
 
-- 
2.15.1

-- 
Cheers,
Stephen Rothwell


[PATCH 8/8] kprobes/s390: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Remove %p because the kprobe will already be dumped by
dump_kprobe().

Signed-off-by: Masami Hiramatsu 
---
 arch/s390/kernel/kprobes.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index af3722c28fd9..df30e5b9a572 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -282,7 +282,7 @@ static void kprobe_reenter_check(struct kprobe_ctlblk *kcb, 
struct kprobe *p)
 * is a BUG. The code path resides in the .kprobes.text
 * section and is executed with interrupts disabled.
 */
-   printk(KERN_EMERG "Invalid kprobe detected at %p.\n", p->addr);
+   pr_err("Invalid kprobe detected.\n");
dump_kprobe(p);
BUG();
}



[PATCH 6/8] kprobes/arm64: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Fix %p uses in error messages by removing them, because
they are redundant or meaningless.

Signed-off-by: Masami Hiramatsu 
---
 arch/arm64/kernel/probes/kprobes.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/probes/kprobes.c 
b/arch/arm64/kernel/probes/kprobes.c
index d849d9804011..34f78d07a068 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -275,7 +275,7 @@ static int __kprobes reenter_kprobe(struct kprobe *p,
break;
case KPROBE_HIT_SS:
case KPROBE_REENTER:
-   pr_warn("Unrecoverable kprobe detected at %p.\n", p->addr);
+   pr_warn("Unrecoverable kprobe detected.\n");
dump_kprobe(p);
BUG();
break;
@@ -521,7 +521,7 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
struct pt_regs *regs)
(struct pt_regs *)kcb->jprobe_saved_regs.sp;
pr_err("current sp %lx does not match saved sp %lx\n",
   orig_sp, stack_addr);
-   pr_err("Saved registers for jprobe %p\n", jp);
+   pr_err("Saved registers for jprobe\n");
__show_regs(saved_regs);
pr_err("Current registers\n");
__show_regs(regs);



[PATCH 7/8] kprobes/MN10300: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Replace %p with %px because it is right before BUG().

Signed-off-by: Masami Hiramatsu 
---
 arch/mn10300/kernel/kprobes.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/mn10300/kernel/kprobes.c b/arch/mn10300/kernel/kprobes.c
index 0311a7fcea16..e539fac00321 100644
--- a/arch/mn10300/kernel/kprobes.c
+++ b/arch/mn10300/kernel/kprobes.c
@@ -632,9 +632,9 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
struct pt_regs *regs)
 
if (addr == (u8 *) jprobe_return_bp_addr) {
if (jprobe_saved_regs_location != regs) {
-   printk(KERN_ERR"JPROBE:"
-  " Current regs (%p) does not match saved regs"
-  " (%p).\n",
+   pr_err("JPROBE:"
+  " Current regs (%px) does not match saved regs"
+  " (%px).\n",
   regs, jprobe_saved_regs_location);
BUG();
}



Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Ming Lei
On Wed, Jan 24, 2018 at 08:34:14PM -0700, Jens Axboe wrote:
> On 1/24/18 7:46 PM, Jia-Ju Bai wrote:
> > The function ioc_create_icq here is not called in atomic context.
> > Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.
> > 
> > This is found by a static analysis tool named DCNS written by myself.
> 
> But it's running off the IO submission path, so by definition the GFP
> mask cannot include anything that will do IO. GFP_KERNEL will make
> it deadlock prone.
> 
> It could be GFP_NOIO, but that's also overlooking the fact that we can
> have preemption disabled here.

We have REQ_NOWAIT requests too, so GFP_NOIO isn't OK either.

-- 
Ming


[PATCH 5/8] kprobes/arm: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Fix %p uses in error messages by removing them and
using the generic dumper instead.

Signed-off-by: Masami Hiramatsu 
---
 arch/arm/probes/kprobes/core.c  |   10 +-
 arch/arm/probes/kprobes/test-core.c |1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/arm/probes/kprobes/core.c b/arch/arm/probes/kprobes/core.c
index 52d1cd14fda4..8f37d505194f 100644
--- a/arch/arm/probes/kprobes/core.c
+++ b/arch/arm/probes/kprobes/core.c
@@ -291,8 +291,8 @@ void __kprobes kprobe_handler(struct pt_regs *regs)
break;
case KPROBE_REENTER:
/* A nested probe was hit in FIQ, it is a BUG */
-   pr_warn("Unrecoverable kprobe detected at 
%p.\n",
-   p->addr);
+   pr_warn("Unrecoverable kprobe detected.\n");
+   dump_kprobe(p);
/* fall through */
default:
/* impossible cases */
@@ -617,11 +617,11 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
struct pt_regs *regs)
if (orig_sp != stack_addr) {
struct pt_regs *saved_regs =
(struct pt_regs *)kcb->jprobe_saved_regs.ARM_sp;
-   printk("current sp %lx does not match saved sp %lx\n",
+   pr_err("current sp %lx does not match saved sp %lx\n",
   orig_sp, stack_addr);
-   printk("Saved registers for jprobe %p\n", jp);
+   pr_err("Saved registers for jprobe\n");
show_regs(saved_regs);
-   printk("Current registers\n");
+   pr_err("Current registers\n");
show_regs(regs);
BUG();
}
diff --git a/arch/arm/probes/kprobes/test-core.c 
b/arch/arm/probes/kprobes/test-core.c
index 9ed0129bed3c..b5c892e24244 100644
--- a/arch/arm/probes/kprobes/test-core.c
+++ b/arch/arm/probes/kprobes/test-core.c
@@ -1460,7 +1460,6 @@ static bool check_test_results(void)
print_registers(&result_regs);
 
if (mem) {
-   pr_err("current_stack=%p\n", current_stack);
pr_err("expected_memory:\n");
print_memory(expected_memory, mem_size);
pr_err("result_memory:\n");



[PATCH 4/8] kprobes/x86: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Fix %p uses in error messages in kprobes/x86.
- Some %p uses are not needed. Just remove it (or remove message).
- One %p use is right before the BUG() so replaced with %px.

Signed-off-by: Masami Hiramatsu 
---
 arch/x86/kernel/kprobes/core.c |   12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index bd36f3c33cd0..aea956aedad7 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -391,8 +391,6 @@ int __copy_instruction(u8 *dest, u8 *src, u8 *real, struct insn *insn)
  - (u8 *) real;
if ((s64) (s32) newdisp != newdisp) {
		pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
-   pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",
-   src, real, insn->displacement.value);
return 0;
}
disp = (u8 *) dest + insn_offset_displacement(insn);
@@ -636,8 +634,7 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
 * Raise a BUG or we'll continue in an endless reentering loop
 * and eventually a stack overflow.
 */
-   printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
-  p->addr);
+   pr_err("Unrecoverable kprobe detected.\n");
dump_kprobe(p);
BUG();
default:
@@ -1146,12 +1143,11 @@ int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
(addr < (u8 *) jprobe_return_end)) {
if (stack_addr(regs) != saved_sp) {
struct pt_regs *saved_regs = &kcb->jprobe_saved_regs;
-   printk(KERN_ERR
-  "current sp %p does not match saved sp %p\n",
+   pr_err("current sp %px does not match saved sp %px\n",
   stack_addr(regs), saved_sp);
-   printk(KERN_ERR "Saved registers for jprobe %p\n", jp);
+   pr_err("Saved registers for jprobe\n");
show_regs(saved_regs);
-   printk(KERN_ERR "Current registers\n");
+   pr_err("Current registers\n");
show_regs(regs);
BUG();
}





[PATCH 3/8] kprobes: Replace %p with other pointer types

2018-01-24 Thread Masami Hiramatsu
Replace %p with appropriate pointer format specifiers (or just remove it):
 - Use %pS where possible.
 - Use %px only in the function called right before BUG().
 - Remove an unneeded error message.

Signed-off-by: Masami Hiramatsu 
---
 kernel/kprobes.c |   16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a5f13e379ae1..02887975d445 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -712,7 +712,7 @@ static void reuse_unused_kprobe(struct kprobe *ap)
op = container_of(ap, struct optimized_kprobe, kp);
if (unlikely(list_empty(&op->list)))
printk(KERN_WARNING "Warning: found a stray unused "
-   "aggrprobe@%p\n", ap->addr);
+   "aggrprobe@%pS\n", ap->addr);
/* Enable the probe again */
ap->flags &= ~KPROBE_FLAG_DISABLED;
/* Optimize it again (remove from op->list) */
@@ -984,7 +984,7 @@ static void arm_kprobe_ftrace(struct kprobe *p)
 
ret = ftrace_set_filter_ip(&kprobe_ftrace_ops,
   (unsigned long)p->addr, 0, 0);
-   WARN(ret < 0, "Failed to arm kprobe-ftrace at %p (%d)\n", p->addr, ret);
+   WARN(ret < 0, "Failed to arm kprobe-ftrace at %pS (%d)\n", p->addr, ret);
kprobe_ftrace_enabled++;
if (kprobe_ftrace_enabled == 1) {
ret = register_ftrace_function(&kprobe_ftrace_ops);
@@ -1004,7 +1004,7 @@ static void disarm_kprobe_ftrace(struct kprobe *p)
}
ret = ftrace_set_filter_ip(&kprobe_ftrace_ops,
   (unsigned long)p->addr, 1, 0);
-   WARN(ret < 0, "Failed to disarm kprobe-ftrace at %p (%d)\n", p->addr, ret);
+   WARN(ret < 0, "Failed to disarm kprobe-ftrace at %pS (%d)\n", p->addr, ret);
 }
 #else  /* !CONFIG_KPROBES_ON_FTRACE */
 #define prepare_kprobe(p)  arch_prepare_kprobe(p)
@@ -2124,10 +2124,11 @@ int enable_kprobe(struct kprobe *kp)
 }
 EXPORT_SYMBOL_GPL(enable_kprobe);
 
+/* Caller must NOT call this in usual path. This is only for critical case */
 void dump_kprobe(struct kprobe *kp)
 {
-   printk(KERN_WARNING "Dumping kprobe:\n");
-   printk(KERN_WARNING "Name: %s\nAddress: %p\nOffset: %x\n",
+   pr_err("Dumping kprobe:\n");
+   pr_err("Name: %s\nAddress: %px\nOffset: %x\n",
   kp->symbol_name, kp->addr, kp->offset);
 }
 NOKPROBE_SYMBOL(dump_kprobe);
@@ -2151,11 +2152,8 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
entry = arch_deref_entry_point((void *)*iter);
 
if (!kernel_text_address(entry) ||
-   !kallsyms_lookup_size_offset(entry, &size, &offset)) {
-   pr_err("Failed to find blacklist at %p\n",
-   (void *)entry);
+   !kallsyms_lookup_size_offset(entry, &size, &offset))
continue;
-   }
 
ent = kmalloc(sizeof(*ent), GFP_KERNEL);
if (!ent)



[PATCH 2/8] kprobes: Show address of kprobes if kallsyms does

2018-01-24 Thread Masami Hiramatsu
Show the probed address in the debugfs kprobe list file under the
same conditions as kallsyms does. This information is used to check
that kprobes are placed at the expected addresses, so it must be
comparable with the addresses in kallsyms.

Signed-off-by: Masami Hiramatsu 
---
 kernel/kprobes.c |   14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index cb15991ba676..a5f13e379ae1 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2281,6 +2281,7 @@ static void report_probe(struct seq_file *pi, struct kprobe *p,
const char *sym, int offset, char *modname, struct kprobe *pp)
 {
char *kprobe_type;
+   void *addr = p->addr;
 
if (p->pre_handler == pre_handler_kretprobe)
kprobe_type = "r";
@@ -2289,13 +2290,16 @@ static void report_probe(struct seq_file *pi, struct kprobe *p,
else
kprobe_type = "k";
 
+   if (!kallsyms_show_value())
+   addr = NULL;
+
if (sym)
-   seq_printf(pi, "%p  %s  %s+0x%x  %s ",
-   p->addr, kprobe_type, sym, offset,
+   seq_printf(pi, "%px  %s  %s+0x%x  %s ",
+   addr, kprobe_type, sym, offset,
(modname ? modname : " "));
-   else
-   seq_printf(pi, "%p  %s  %p ",
-   p->addr, kprobe_type, p->addr);
+   else/* try to use %pS */
+   seq_printf(pi, "%px  %s  %pS ",
+   addr, kprobe_type, p->addr);
 
if (!pp)
pp = p;
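The gating logic in this patch is small enough to model outside the kernel. The sketch below is a hypothetical userspace analogue, not kernel code: the show_value flag stands in for kallsyms_show_value(), and displayed_addr() returns the pointer that would actually be printed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace analogue of the masking in report_probe():
 * the real address is exposed only when the viewer is allowed to see
 * kernel addresses (kallsyms_show_value() in the kernel); otherwise
 * NULL is printed in its place.
 */
static const void *displayed_addr(const void *addr, bool show_value)
{
    return show_value ? addr : NULL;
}
```

With show_value false the list keeps its column layout (a null address plus the %pS symbol), so tools parsing the file keep working while the raw address stays hidden.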



[PATCH 0/8] kprobes: Fix %p in kprobes

2018-01-24 Thread Masami Hiramatsu
Hi,

This series fixes %p uses in kprobes: some by replacing them
with %pS, others by replacing them with %px while masking the
output based on kallsyms_show_value().

I've read the thread about %pK, and if I understand correctly
we shouldn't print kernel addresses. However, the kprobes debugfs
interface cannot stop showing the actual probe addresses, because
they need to be compared with the addresses in kallsyms for
debugging. So it relies on kallsyms_show_value() to decide whether
the address may be shown to the user: if it returns true, that
user can dump /proc/kallsyms anyway.

Other error messages are replaced with %pS, and one critical
function uses %px since it is called right before BUG().

Also, I tried to fix this issue on each arch port. I searched
it by

 # find arch/* | grep -e 'kprobe.*c' | xargs grep -w %p

And fixed all %p uses in those files.

Thank you,

---

Masami Hiramatsu (8):
  kprobes: Show blacklist addresses as same as kallsyms does
  kprobes: Show address of kprobes if kallsyms does
  kprobes: Replace %p with other pointer types
  kprobes/x86: Fix %p uses in error messages
  kprobes/arm: Fix %p uses in error messages
  kprobes/arm64: Fix %p uses in error messages
  kprobes/MN10300: Fix %p uses in error messages
  kprobes/s390: Fix %p uses in error messages


 arch/arm/probes/kprobes/core.c  |   10 
 arch/arm/probes/kprobes/test-core.c |1 -
 arch/arm64/kernel/probes/kprobes.c  |4 ++-
 arch/mn10300/kernel/kprobes.c   |6 +++--
 arch/s390/kernel/kprobes.c  |2 +-
 arch/x86/kernel/kprobes/core.c  |   12 +++---
 kernel/kprobes.c|   42 ++-
 7 files changed, 41 insertions(+), 36 deletions(-)

--
Masami Hiramatsu (Linaro) 


[PATCH 1/8] kprobes: Show blacklist addresses as same as kallsyms does

2018-01-24 Thread Masami Hiramatsu
Show kprobes blacklist addresses under the same conditions as
kallsyms addresses are shown.

Since there are several name conflicts among local symbols, the
kprobe blacklist needs to show each address so that the user can
identify what is on the blacklist by comparing it with kallsyms.

Signed-off-by: Masami Hiramatsu 
---
 kernel/kprobes.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index da2ccf142358..cb15991ba676 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2382,9 +2382,17 @@ static int kprobe_blacklist_seq_show(struct seq_file *m, void *v)
 {
struct kprobe_blacklist_entry *ent =
list_entry(v, struct kprobe_blacklist_entry, list);
+   void *start = (void *)ent->start_addr, *end = (void *)ent->end_addr;
 
-   seq_printf(m, "0x%p-0x%p\t%ps\n", (void *)ent->start_addr,
-  (void *)ent->end_addr, (void *)ent->start_addr);
+   /*
+* As long as kallsyms shows the address, the kprobes blacklist also
+* shows it. Otherwise, it shows a null address and the symbol.
+*/
+   if (!kallsyms_show_value())
+   start = end = NULL;
+
+   seq_printf(m, "0x%px-0x%px\t%ps\n", start, end,
+  (void *)ent->start_addr);
return 0;
 }
 



Re: [PATCH] bpf/stackmap: Implement bpf_get_next_key

2018-01-24 Thread Joel Fernandes
On Wed, Jan 24, 2018 at 8:42 PM, Alexei Starovoitov
 wrote:
> On Wed, Jan 24, 2018 at 08:37:52PM -0800, Joel Fernandes wrote:
>> Currently stackmaps can't be iterated over. The keys are obtained
>> through other maps and look ups have to be performed. In new usecases,
>> its useful to be able to iterate over the stackmap independently.
>> Implement bpf_get_next_key to make this possible.
>>
>> More details of use case:
>> Currently iterating over stack maps is done like so, for example
>> in BCC tools offcputime script:
>> 1. Obtain keys from the 'counts' hash map.
>> 2. Look up stackmap with each of the keys obtained from step 1.
>>
>> This makes the iteration of stackmap dependent on the counts map.
>> In a new tool I'm working on called BPFd [1], it gives a huge speed
>> up when I dump entire maps before transmitting them to BCC tools
>> on a different machine [2]. This patch makes it possible to dump
>> stackmaps independent of other maps.
>>
>> Tested on x86 and arm64 machines.
>>
>> [1] https://lwn.net/Articles/744522/
>> [2] https://github.com/joelagnel/bpfd/issues/8
>>
>> Cc: Alexei Starovoitov 
>> Cc: Daniel Borkmann 
>> Cc: Brendan Gregg 
>> Cc: Brenden Blanco 
>> Signed-off-by: Joel Fernandes 
>> ---
>>  kernel/bpf/stackmap.c | 18 +-
>>  1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index a15bc636cc98..b0bf7d009f76 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -228,7 +228,23 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
>>
>>  static int stack_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>>  {
>> - return -EINVAL;
>> + struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
>> + u32 id = 0;
>
> did you check net-next tree before sending the patch?

Oh. I was a bit late to send it! Glad it's done though.

> Also please see Documentation/bpf/bpf_devel_QA.txt

Sure, if you don't mind, could you elaborate on what I screwed up per the
QA doc with regard to this patch? Sorry if I missed something.

thanks,

- Joel


linux-next: manual merge of the iversion tree with the nfsd tree

2018-01-24 Thread Stephen Rothwell
Hi Jeff,

Today's linux-next merge of the iversion tree got a conflict in:

  fs/nfsd/nfsfh.h

between commit:

  3ac71a649c3d ("nfsd: store stat times in fill_pre_wcc() instead of inode times")

from the nfsd tree and commit:

  8ace40dfbdf4 ("nfsd: convert to new i_version API")

from the iversion tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non-trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc fs/nfsd/nfsfh.h
index 99be87b50ebe,b8444189223b..
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@@ -257,10 -257,10 +258,10 @@@ static inline u64 nfsd4_change_attribut
  {
u64 chattr;
  
 -  chattr =  inode->i_ctime.tv_sec;
 +  chattr =  stat->ctime.tv_sec;
chattr <<= 30;
 -  chattr += inode->i_ctime.tv_nsec;
 +  chattr += stat->ctime.tv_nsec;
-   chattr += inode->i_version;
+   chattr += inode_query_iversion(inode);
return chattr;
  }
  


Re: [PATCH] bpf/stackmap: Implement bpf_get_next_key

2018-01-24 Thread Alexei Starovoitov
On Wed, Jan 24, 2018 at 08:37:52PM -0800, Joel Fernandes wrote:
> Currently stackmaps can't be iterated over. The keys are obtained
> through other maps and look ups have to be performed. In new usecases,
> its useful to be able to iterate over the stackmap independently.
> Implement bpf_get_next_key to make this possible.
> 
> More details of use case:
> Currently iterating over stack maps is done like so, for example
> in BCC tools offcputime script:
> 1. Obtain keys from the 'counts' hash map.
> 2. Look up stackmap with each of the keys obtained from step 1.
> 
> This makes the iteration of stackmap dependent on the counts map.
> In a new tool I'm working on called BPFd [1], it gives a huge speed
> up when I dump entire maps before transmitting them to BCC tools
> on a different machine [2]. This patch makes it possible to dump
> stackmaps independent of other maps.
> 
> Tested on x86 and arm64 machines.
> 
> [1] https://lwn.net/Articles/744522/
> [2] https://github.com/joelagnel/bpfd/issues/8
> 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Brendan Gregg 
> Cc: Brenden Blanco 
> Signed-off-by: Joel Fernandes 
> ---
>  kernel/bpf/stackmap.c | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index a15bc636cc98..b0bf7d009f76 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -228,7 +228,23 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
>  
>  static int stack_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>  {
> - return -EINVAL;
> + struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
> + u32 id = 0;

did you check net-next tree before sending the patch?
Also please see Documentation/bpf/bpf_devel_QA.txt



[PATCH] bpf/stackmap: Implement bpf_get_next_key

2018-01-24 Thread Joel Fernandes
Currently stackmaps can't be iterated over: the keys are obtained
through other maps, and lookups then have to be performed. In new
use cases, it's useful to be able to iterate over the stackmap
independently. Implement bpf_get_next_key to make this possible.

More details of use case:
Currently iterating over stack maps is done like so, for example
in BCC tools offcputime script:
1. Obtain keys from the 'counts' hash map.
2. Look up stackmap with each of the keys obtained from step 1.

This makes iteration of the stackmap dependent on the counts map.
For a new tool I'm working on called BPFd [1], dumping entire maps
before transmitting them to BCC tools on a different machine gives
a huge speed-up [2]. This patch makes it possible to dump stackmaps
independently of other maps.

Tested on x86 and arm64 machines.

[1] https://lwn.net/Articles/744522/
[2] https://github.com/joelagnel/bpfd/issues/8

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Brendan Gregg 
Cc: Brenden Blanco 
Signed-off-by: Joel Fernandes 
---
 kernel/bpf/stackmap.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index a15bc636cc98..b0bf7d009f76 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -228,7 +228,23 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 
 static int stack_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
-   return -EINVAL;
+   struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+   u32 id = 0;
+
+   if (key) {
+   id = *(u32 *)key;
+   if (unlikely(id >= smap->n_buckets))
+   return -ENOENT;
+   id++;
+   }
+
+   for ( ; id < smap->n_buckets; id++) {
+   if (READ_ONCE(smap->buckets[id])) {
+   *(u32 *)next_key = id;
+   return 0;
+   }
+   }
+   return -ENOENT;
 }
 
 static int stack_map_update_elem(struct bpf_map *map, void *key, void *value,
-- 
2.16.0.rc1.238.g530d649a79-goog
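The get_next_key() semantics the patch introduces (linear scan to the next occupied bucket, -ENOENT when exhausted) can be exercised with a small userspace sketch. This is an illustrative analogue with a made-up fixed-size bucket table, not the kernel implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define N_BUCKETS 8

/* Hypothetical stand-in for smap->buckets: non-NULL slots are occupied. */
static void *buckets[N_BUCKETS];

/*
 * Mirrors the patch's scan: start just after *key (or at 0 when key is
 * NULL, i.e. the first get_next_key call), and report the id of the
 * next occupied bucket, or -ENOENT when the map is exhausted.
 */
static int get_next_key(const unsigned int *key, unsigned int *next_key)
{
    unsigned int id = 0;

    if (key) {
        id = *key;
        if (id >= N_BUCKETS)
            return -ENOENT;
        id++;
    }
    for (; id < N_BUCKETS; id++) {
        if (buckets[id]) {
            *next_key = id;
            return 0;
        }
    }
    return -ENOENT;
}
```

A caller iterates by feeding each returned key back in as the next starting point, exactly how userspace walks any BPF map with BPF_MAP_GET_NEXT_KEY.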



Re: ANNOUNCE: bpfd - a remote proxy daemon for executing bpf code (with corres. bcc changes)

2018-01-24 Thread Joel Fernandes
Sigh, Correcting Brenden's email address. Apologies.

On Wed, Jan 24, 2018 at 8:29 PM, Joel Fernandes  wrote:
> Hi Guys,
>
> Just providing an update: I made lots of progress last few weeks and
> all the issues mentioned below (in the last post) are resolved.
> I published an LWN article explaining the design of BPFd and BCC-side
> changes: https://lwn.net/SubscriberLink/744522/ba023b555957408e/
>
> There are still a few things that don't work well like symbol
> resolution, and I am working on it. The effect is just tools that need
> it wont work, which aren't that many from what I see but the goal is
> to get everything working ultimately.
>
> My next steps are to post the BCC side changes on github as a pull
> request (once I get a chance to rebase and send along with any clean
> ups). I also request you to take a look at the top 7 patches on this
> branch and provide any early feedback to me:
> https://github.com/joelagnel/bcc/commits/bcc-bpfd
>
> Based on activity on twitter and folks pinging me, it seems there is a
> LOT of interest for this both within Google and outside so its quite
> exciting. I strongly believe this will expand the use of BCC in the
> community. I look forward to presenting some demos at SCALE showing it
> in action.
>
> I am looking forward to your comments. Thanks.
>
> Regards,
>
> - Joel
>
>
> On Fri, Dec 29, 2017 at 12:58 AM, Joel Fernandes  wrote:
>> Hi Guys,
>>
>> I've been working on an idea I discussed with other kernel developers
>> (Alexei, Josef etc) last LPC about how to make it easier to run bcc
>> tools on remote systems.
>>
>> Use case
>> 
>> Run bcc tools on a remotely connected system without having to load
>> the entire LLVM infrastructure onto the remote target and have to sync
>> the kernel sources with it.
>> On architecture such as ARM64 especially, its a bit more work if you
>> were to run the tools directly on the target itself (local to the
>> target) because LLVM and Python have to be cross-compiled for it
>> (along with syncing of kernel sources which takes up space and needs
>> to be kept sync'ed for correct operation). I believe Facebook also has
>> some usecases where they want to run bcc tools on remote instances.
>> Lastly this is also the way arm64 development normally happens, you
>> cross build for it and typically the ARM64 embedded systems may not
>> have much space for kernel sources and clang so its better some times
>> if the tools are remote. All our kernel development for android is
>> cross developed with the cross-toolchain running remotely as well.
>> I am looking forward to collaborating with interested developers on
>> this idea and getting more feedback about the design etc. I am also
>> planning to talk about it next year during SCALE and OSPM.
>>
>> Implementation
>> ==
>> To facilitate this, I started working on a daemon called bpfd which is
>> executed on the remote target and listening for commands:
>> https://github.com/joelagnel/bpfd
>> The daemon does more than proxy the bpf syscall, there's several
>> things like registering a kprobe with perf, and perf callbacks that
>> need to be replicated. All this infrastructure is pretty much code
>> complete in bpfd.
>>
>> Sample commands sent to bpfd are as follows:
>> https://github.com/joelagnel/bpfd/blob/master/tests/TESTS
>> 
>> ; Program opensnoop
>> BPF_CREATE_MAP 1 8 40 10240 0
>> BPF_CREATE_MAP 4 4 4 2 0
>>
>> BPF_PROG_LOAD 2 248 GPL 264721 eRdwAAC3AQAAAHsa+P8AA[...]
>> BPF_PROG_LOAD 2 664 GPL 264721 vxYAAACFDgAAAHsK+P8AA[...]
>> 
>> Binary streams are communicated using base64 making it possible to
>> keep interaction with binary simple.
>>
>> Several patches is written on the bcc side to be able to send these
>> commands using a "remotes infrastructure", available in the branch at:
>> https://github.com/joelagnel/bcc/commits/bcc-bpfd
>> My idea was to keep the remote infrastructure as generic/plug-and-play
>> as possible - so in the future its easy to add other remotes like
>> networking. Currently I've adb (android bridge) remote and a shell
>> remote: https://github.com/joelagnel/bcc/tree/bcc-bpfd/src/python/bcc/remote
>>
>> The shell remote is more of a "test" remote that simply forks bpfd and
>> communicates with it over stdio. This makes the development quite
>> easy.
>>
>> Status
>> ==
>> What's working:
>> - executing several bcc tools across process boundary using "shell"
>> remote (bcc tools and bpfd both running on local x86 machine).
>> - communication with remote arm64 android target using the "adb
>> remote". But there are several issues to do with arm64 and bcc tools
>> that I'm ironing out. Since my arm64 bcc hackery is a bit recent, I
>> created a separate WIP branch here:
>> https://github.com/joelagnel/bcc/tree/bcc-bpfd-arm64. I don't suspect
>> these to be a major issue since I noticed some folks have been using
>> bcc tools on arm64 already.
>>
>> Issues:
>> - Since b

Re: ANNOUNCE: bpfd - a remote proxy daemon for executing bpf code (with corres. bcc changes)

2018-01-24 Thread Joel Fernandes
Hi Guys,

Just providing an update: I made lots of progress last few weeks and
all the issues mentioned below (in the last post) are resolved.
I published an LWN article explaining the design of BPFd and BCC-side
changes: https://lwn.net/SubscriberLink/744522/ba023b555957408e/

There are still a few things that don't work well, like symbol
resolution, and I am working on them. The only effect is that tools
which need it won't work, and those aren't many from what I see, but
the goal is to get everything working ultimately.

My next steps are to post the BCC side changes on github as a pull
request (once I get a chance to rebase and send along with any clean
ups). I also request you to take a look at the top 7 patches on this
branch and provide any early feedback to me:
https://github.com/joelagnel/bcc/commits/bcc-bpfd

Based on activity on Twitter and folks pinging me, it seems there is a
LOT of interest in this both within Google and outside, so it's quite
exciting. I strongly believe this will expand the use of BCC in the
community. I look forward to presenting some demos at SCALE showing it
in action.

I am looking forward to your comments. Thanks.

Regards,

- Joel


On Fri, Dec 29, 2017 at 12:58 AM, Joel Fernandes  wrote:
> Hi Guys,
>
> I've been working on an idea I discussed with other kernel developers
> (Alexei, Josef etc) last LPC about how to make it easier to run bcc
> tools on remote systems.
>
> Use case
> 
> Run bcc tools on a remotely connected system without having to load
> the entire LLVM infrastructure onto the remote target and have to sync
> the kernel sources with it.
> On architecture such as ARM64 especially, its a bit more work if you
> were to run the tools directly on the target itself (local to the
> target) because LLVM and Python have to be cross-compiled for it
> (along with syncing of kernel sources which takes up space and needs
> to be kept sync'ed for correct operation). I believe Facebook also has
> some usecases where they want to run bcc tools on remote instances.
> Lastly this is also the way arm64 development normally happens, you
> cross build for it and typically the ARM64 embedded systems may not
> have much space for kernel sources and clang so its better some times
> if the tools are remote. All our kernel development for android is
> cross developed with the cross-toolchain running remotely as well.
> I am looking forward to collaborating with interested developers on
> this idea and getting more feedback about the design etc. I am also
> planning to talk about it next year during SCALE and OSPM.
>
> Implementation
> ==
> To facilitate this, I started working on a daemon called bpfd which is
> executed on the remote target and listening for commands:
> https://github.com/joelagnel/bpfd
> The daemon does more than proxy the bpf syscall, there's several
> things like registering a kprobe with perf, and perf callbacks that
> need to be replicated. All this infrastructure is pretty much code
> complete in bpfd.
>
> Sample commands sent to bpfd are as follows:
> https://github.com/joelagnel/bpfd/blob/master/tests/TESTS
> 
> ; Program opensnoop
> BPF_CREATE_MAP 1 8 40 10240 0
> BPF_CREATE_MAP 4 4 4 2 0
>
> BPF_PROG_LOAD 2 248 GPL 264721 eRdwAAC3AQAAAHsa+P8AA[...]
> BPF_PROG_LOAD 2 664 GPL 264721 vxYAAACFDgAAAHsK+P8AA[...]
> 
> Binary streams are communicated using base64 making it possible to
> keep interaction with binary simple.
>
> Several patches is written on the bcc side to be able to send these
> commands using a "remotes infrastructure", available in the branch at:
> https://github.com/joelagnel/bcc/commits/bcc-bpfd
> My idea was to keep the remote infrastructure as generic/plug-and-play
> as possible - so in the future its easy to add other remotes like
> networking. Currently I've adb (android bridge) remote and a shell
> remote: https://github.com/joelagnel/bcc/tree/bcc-bpfd/src/python/bcc/remote
>
> The shell remote is more of a "test" remote that simply forks bpfd and
> communicates with it over stdio. This makes the development quite
> easy.
>
> Status
> ==
> What's working:
> - executing several bcc tools across process boundary using "shell"
> remote (bcc tools and bpfd both running on local x86 machine).
> - communication with remote arm64 android target using the "adb
> remote". But there are several issues to do with arm64 and bcc tools
> that I'm ironing out. Since my arm64 bcc hackery is a bit recent, I
> created a separate WIP branch here:
> https://github.com/joelagnel/bcc/tree/bcc-bpfd-arm64. I don't suspect
> these to be a major issue since I noticed some folks have been using
> bcc tools on arm64 already.
>
> Issues:
> - Since bcc is building with clang on x86 - the eBPF backend code is
> generated for x86. Although it loads fine on arm64, there seem several
> issues such as kprobe handler doesn't see arguments or return code
> correctly in opensnoop. This is (prob

Re: [PATCH v24 2/2] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-01-24 Thread Wei Wang

On 01/25/2018 01:15 AM, Michael S. Tsirkin wrote:

On Wed, Jan 24, 2018 at 06:42:42PM +0800, Wei Wang wrote:
+
+static void report_free_page_func(struct work_struct *work)
+{
+   struct virtio_balloon *vb;
+   unsigned long flags;
+
+   vb = container_of(work, struct virtio_balloon, report_free_page_work);
+
+   /* Start by sending the obtained cmd id to the host with an outbuf */
+   send_cmd_id(vb, &vb->start_cmd_id);
+
+   /*
+* Set start_cmd_id to VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID to
+* indicate a new request can be queued.
+*/
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   vb->start_cmd_id = cpu_to_virtio32(vb->vdev,
+   VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+
+   walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
Can you teach walk_free_mem_block to return the && of all
return calls, so caller knows whether it completed?


There are two cases that can cause walk_free_mem_block to return
without completing:

1) host requests to stop in advance
2) vq->broken

How about letting walk_free_mem_block simply return the value returned 
by its callback (i.e. virtio_balloon_send_free_pages)?


For a host request to stop, it returns "1", and the code above only
bails out when walk_free_mem_block returns a value < 0.
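The proposal above (walker forwards whatever the callback returned, so the caller can tell why the walk ended) is easy to model outside the kernel. This is a hedged sketch of the pattern only; walk_blocks and stop_at_sentinel are invented names, not the walk_free_mem_block API:

```c
#include <assert.h>

/*
 * Sketch of a walker that forwards its callback's return value, so the
 * caller can tell why the walk ended: 0 means the walk completed, 1
 * models "host asked to stop", and a negative value models an error
 * such as a broken virtqueue.
 */
static int walk_blocks(const int *blocks, int n,
                       int (*cb)(void *opaque, int block), void *opaque)
{
    int i, ret;

    for (i = 0; i < n; i++) {
        ret = cb(opaque, blocks[i]);
        if (ret)
            return ret;     /* bail out, preserving the reason */
    }
    return 0;               /* walked every block */
}

/* Example callback: return 1 (stop) when it sees the sentinel block id. */
static int stop_at_sentinel(void *opaque, int block)
{
    return block == *(int *)opaque ? 1 : 0;
}
```

With this shape the balloon driver's caller could treat 1 as a clean early stop and a negative value as a failure, matching the split described above.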


Best,
Wei


Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Al Viro
On Thu, Jan 25, 2018 at 11:13:56AM +0800, Jia-Ju Bai wrote:

> I have checked the given call chain, and find that nvme_dev_disable in
> nvme_timeout calls mutex_lock that can sleep.
> Thus, I suppose this call chain is not in atomic context.

... or it is broken.

> Besides, how do you find that "function (nvme_timeout()) strongly suggests
> that it *is* meant to be called from bloody atomic context"?
> I check the comments in nvme_timeout, and do not find related description...

Anything that reads registers for controller state presumably won't be
happy if it can happen in parallel with other threads poking the same
hardware.  Not 100% guaranteed, but it's a fairly strong sign that there's
some kind of exclusion between whatever's submitting requests / handling
interrupts and the caller of that thing.  And such exclusion is likely
to be spin_lock_irqsave()-based.

Again, that does not _prove_ it's called from atomic contexts, but does
suggest such possibility.

Looking through the callers of that method, blk_abort_request() certainly
*is* called from under queue lock.  Different drivers, though.  No idea
if nvme_timeout() blocking case is broken - I'm not familiar with that
code.  Question should go to nvme maintainers...

However, digging through other call chains, there's this:
drivers/md/dm-mpath.c:530:  clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
in multipath_clone_and_map(), aka. ->clone_and_map_rq(), called at
drivers/md/dm-rq.c:480: r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
in map_request(), which is called from dm_mq_queue_rq(), aka ->queue_rq(),
which is called from blk_mq_dispatch_rq_list(), called from
blk_mq_do_dispatch_sched(), called from blk_mq_sched_dispatch_requests(),
called under rcu_read_lock().  Not a context where you want GFP_KERNEL
allocations...

> By the way, do you mean that I should add "My tool has proved that this
> function is never called in atomic context" in the description?

I mean that proof itself should be at least outlined.  Crediting the tool
for finding the proof is fine *IF* it's done along wiht the proof itself.

You want to convince the people applying the patch that it is correct.
Leaving out something trivial to verify is fine - "foo_bar_baz() has
no callers" doesn't require grep output + quoting the area around each
instance to prove that all of them are in the comments, etc.; readers
can bloody well check the accuracy of that claim themselves.  This
kind of analysis, however, is decidedly *NOT* trivial to verify from
scratch.

Moreover, if what you've proven is that for each call chain leading
to that place there's a blocking operation nearby, there is still
a possibility that some of those *are* called while in non-blocking
context.  In that case you've found real bugs, and strictly speaking
your change doesn't break correct code.  However, it does not make
the change itself correct - if you have something like
	enter non-blocking section
	...
	in very unusual cases grab a mutex (or panic, or...)
	...
	do GFP_ATOMIC allocation
	...
	leave non-blocking section
changing that to GFP_KERNEL will turn "we deadlock in very hard to
hit case" into "we deadlock easily"...

At the very least, I'd like to see those cutoffs - i.e. the places
that already could block on the callchains.  You might very well
have found actual bugs there.


[PATCH 1/2] block: blk-tag: Replace GFP_ATOMIC with GFP_KERNEL in __blk_queue_init_tags

2018-01-24 Thread Jia-Ju Bai
After checking all possible call chains to kmalloc here,
my tool finds that kmalloc is never called in atomic context.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This is found by a static analysis tool named DCNS written by myself.

Signed-off-by: Jia-Ju Bai 
---
 block/blk-tag.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-tag.c b/block/blk-tag.c
index 09f19c6..84db7f663 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -124,7 +124,7 @@ static struct blk_queue_tag *__blk_queue_init_tags(struct request_queue *q,
 {
struct blk_queue_tag *tags;
 
-   tags = kmalloc(sizeof(struct blk_queue_tag), GFP_ATOMIC);
+   tags = kmalloc(sizeof(struct blk_queue_tag), GFP_KERNEL);
if (!tags)
goto fail;
 
-- 
1.7.9.5



Re: [PATCH v2] NTB: ntb_perf: fix cast to restricted __le32

2018-01-24 Thread Jon Mason
On Wed, Jan 24, 2018 at 09:07:26AM +0100, Arnd Bergmann wrote:
> On Wed, Jan 24, 2018 at 8:48 AM, Serge Semin  wrote:
> > Sparse is whining about the u32 and __le32 mixed usage in the driver
> >
> > drivers/ntb/test/ntb_perf.c:288:21: warning: cast to restricted __le32
> > drivers/ntb/test/ntb_perf.c:295:37: warning: incorrect type in argument 4 
> > (different base types)
> > drivers/ntb/test/ntb_perf.c:295:37:expected unsigned int [unsigned] 
> > [usertype] val
> > drivers/ntb/test/ntb_perf.c:295:37:got restricted __le32 [usertype] 
> > 
> > ...
> >
> > NTB hardware drivers shall accept CPU-endian data and translate it to
> > the portable format by internal means, so the explicit conversions
> > are not necessary before Scratchpad/Messages API usage anymore.
> >
> > Fixes: b83003b3fdc1 ("NTB: ntb_perf: Add full multi-port NTB API support")
> > Signed-off-by: Serge Semin 
> 
> Looks good to me,
> 
> Acked-by: Arnd Bergmann 

Applied to my ntb-next branch.  Thanks for all of the help on this.

Thanks,
Jon


Re: [PATCH 10/10] kill kernel_sock_ioctl()

2018-01-24 Thread David Miller
From: Al Viro 
Date: Thu, 25 Jan 2018 00:01:25 +

> On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
>> 
>> Al this series looks fine to me, want me to toss it into net-next?
> 
> Do you want them reposted (with updated commit messages), or would
> you prefer a pull request (with or without rebase to current tip
> of net-next)?

A pull request works for me.  Rebasing to net-next tip is pilot's
discretion.


[PATCH v2 13/15] gen_initramfs_list.sh: add -x option to enable newcx format

2018-01-24 Thread Taras Kondratiuk
From: Mimi Zohar 

The -x option populates extended attributes in the cpio_list file passed to
gen_init_cpio and selects the newcx CPIO format.

Signed-off-by: Mimi Zohar 
Signed-off-by: Taras Kondratiuk 
---
 scripts/gen_initramfs_list.sh | 13 -
 usr/Kconfig   | 11 +++
 usr/Makefile  |  3 ++-
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/scripts/gen_initramfs_list.sh b/scripts/gen_initramfs_list.sh
index 86a3c0e5cfbc..cddb82f093d9 100755
--- a/scripts/gen_initramfs_list.sh
+++ b/scripts/gen_initramfs_list.sh
@@ -24,6 +24,7 @@ $0 [-o ] [-u ] [-g ] {-d | } ...
-gGroup ID to map to group ID 0 (root).
is only meaningful if  is a
   directory.  "squash" forces all files to gid 0.
+   -x include file extended attributes in cpio archive.
  File list or directory for cpio archive.
   If  is a .cpio file it will be used
   as direct input to initramfs.
@@ -146,6 +147,9 @@ parse() {
;;
esac
 
+   $include_xattrs && \
+   getfattr -h -d -m - -e hex --absolute-names ${location} | \
+   sed -e '/^#/d' -e '/^$/d' -e 's/^/xattr /' >> ${output}
echo "${str}" >> ${output}
 
return 0
@@ -226,6 +230,8 @@ root_gid=0
 dep_list=
 cpio_file=
 cpio_list=
+cpio_opts=
+include_xattrs=false
 output="/dev/stdout"
 output_file=""
 is_cpio_compressed=
@@ -283,6 +289,10 @@ while [ $# -gt 0 ]; do
default_list="$arg"
${dep_list}default_initramfs
;;
+   "-x")   # include extended attributes
+   cpio_opts="-x"
+   include_xattrs=true
+   ;;
"-h")
usage
exit 0
@@ -312,7 +322,8 @@ if [ ! -z ${output_file} ]; then
fi
fi
cpio_tfile="$(mktemp ${TMPDIR:-/tmp}/cpiofile.XX)"
-   usr/gen_init_cpio $timestamp ${cpio_list} > ${cpio_tfile}
+   usr/gen_init_cpio $timestamp ${cpio_opts} ${cpio_list} \
+   > ${cpio_tfile}
else
cpio_tfile=${cpio_file}
fi
diff --git a/usr/Kconfig b/usr/Kconfig
index 43658b8a975e..0cc03bc4614c 100644
--- a/usr/Kconfig
+++ b/usr/Kconfig
@@ -52,6 +52,17 @@ config INITRAMFS_ROOT_GID
 
  If you are not sure, leave it set to "0".
 
+config INITRAMFS_NEWCX
+   bool "Use newcx CPIO format for initramfs"
+   depends on INITRAMFS_SOURCE!=""
+   default n
+   help
+ If selected, "usr/gen_init_cpio" will generate the newcx CPIO archive
+ format, which supports extended attributes.
+
+ See  for
+ more details.
+
 config RD_GZIP
bool "Support initial ramdisk/ramfs compressed using gzip"
depends on BLK_DEV_INITRD
diff --git a/usr/Makefile b/usr/Makefile
index 237a028693ce..1106bfd61475 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -29,7 +29,8 @@ ramfs-input := $(if $(filter-out 
"",$(CONFIG_INITRAMFS_SOURCE)), \
$(shell echo $(CONFIG_INITRAMFS_SOURCE)),-d)
 ramfs-args  := \
 $(if $(CONFIG_INITRAMFS_ROOT_UID), -u $(CONFIG_INITRAMFS_ROOT_UID)) \
-$(if $(CONFIG_INITRAMFS_ROOT_GID), -g $(CONFIG_INITRAMFS_ROOT_GID))
+$(if $(CONFIG_INITRAMFS_ROOT_GID), -g $(CONFIG_INITRAMFS_ROOT_GID)) \
+$(if $(CONFIG_INITRAMFS_NEWCX), -x)
 
 # $(datafile_d_y) is used to identify all files included
 # in initramfs and to detect if any files are added/removed.
-- 
2.10.3.dirty
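The getfattr-to-manifest transform added by this patch can be exercised on its own; the input lines below are a made-up sample of `getfattr -h -d -m - -e hex` output:

```shell
# Apply the same sed pipeline the script uses: drop comment and blank
# lines, then prefix each name=value pair with "xattr ".
printf '%s\n' '# file: /etc/hostname' 'security.ima=0x0102' '' \
| sed -e '/^#/d' -e '/^$/d' -e 's/^/xattr /'
```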



[PATCH v2 09/15] initramfs: set extended attributes

2018-01-24 Thread Taras Kondratiuk
From: Mimi Zohar 

This patch writes out the extended attributes included in the cpio file.
As the "security.ima" xattr needs to be written after the file data,
this patch separates extracting and setting the xattrs by defining a new
do_setxattrs state.

[kamensky: fixed restoring of xattrs for symbolic links by using
   sys_lsetxattr() instead of sys_setxattr()]

Signed-off-by: Mimi Zohar 
Signed-off-by: Victor Kamensky 
Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 57 +++-
 1 file changed, 52 insertions(+), 5 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 3d0f46c28459..040e26cf451a 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -310,6 +310,7 @@ static int __init do_xattrs(void);
 static int __init do_create(void);
 static int __init do_copy(void);
 static int __init do_symlink(void);
+static int __init do_setxattrs(void);
 static int __init do_reset(void);
 
 typedef int (*fsm_state_t)(void);
@@ -472,7 +473,7 @@ static int __init do_name(void)
 
 static int __init do_xattrs(void)
 {
-   /* Do nothing for now */
+   memcpy_optional(xattr_buf, collected, xattr_len);
state = do_create;
return 0;
 }
@@ -481,8 +482,7 @@ static __initdata int wfd;
 
 static int __init do_create(void)
 {
-   state = do_skip;
-   next_state = do_reset;
+   state = do_setxattrs;
clean_path(name_buf, mode);
if (S_ISREG(mode)) {
int ml = maybe_link(name_buf);
@@ -515,8 +515,11 @@ static int __init do_create(void)
do_utime(name_buf, mtime);
}
} else if (S_ISLNK(mode)) {
-   if (body_len > PATH_MAX)
+   if (body_len > PATH_MAX) {
+   state = do_skip;
+   next_state = do_reset;
return 0;
+   }
read_into(symlink_buf, body_len, do_symlink);
}
return 0;
@@ -530,7 +533,7 @@ static int __init do_copy(void)
sys_close(wfd);
do_utime(name_buf, mtime);
eat(body_len);
-   state = do_skip;
+   state = do_setxattrs;
return 0;
} else {
if (xwrite(wfd, victim, byte_count) != byte_count)
@@ -549,8 +552,52 @@ static int __init do_symlink(void)
sys_symlink(symlink_buf, name_buf);
sys_lchown(name_buf, uid, gid);
do_utime(name_buf, mtime);
+   state = do_setxattrs;
+   return 0;
+}
+
+struct xattr_hdr {
+   char c_size[8]; /* total size including c_size field */
+   char c_data[];  /* \0 */
+};
+
+static int __init do_setxattrs(void)
+{
+   char *buf = xattr_buf;
+   char *bufend = buf + xattr_len;
+   struct xattr_hdr *hdr;
+   char str[sizeof(hdr->c_size) + 1];
+
state = do_skip;
next_state = do_reset;
+   if (!xattr_len)
+   return 0;
+
+   str[sizeof(hdr->c_size)] = 0;
+
+   while (buf < bufend) {
+   char *xattr_name, *xattr_value;
+   unsigned long xattr_entry_size, xattr_value_size;
+   int ret;
+
+   hdr = (struct xattr_hdr *)buf;
+   memcpy(str, hdr->c_size, sizeof(hdr->c_size));
+   ret = kstrtoul(str, 16, &xattr_entry_size);
+   buf += xattr_entry_size;
+   if (ret || buf > bufend) {
+   error("malformed xattrs");
+   break;
+   }
+
+   xattr_name = hdr->c_data;
+   xattr_value = xattr_name + strlen(xattr_name) + 1;
+   xattr_value_size = buf - xattr_value;
+
+   ret = sys_lsetxattr(name_buf, xattr_name, xattr_value,
+   xattr_value_size, 0);
+   pr_debug("%s: %s size: %lu val: %s (ret: %d)\n", name_buf,
+   xattr_name, xattr_value_size, xattr_value, ret);
+   }
return 0;
 }
 
-- 
2.10.3.dirty



[PATCH v2 07/15] initramfs: split header layout information from parsing function

2018-01-24 Thread Taras Kondratiuk
Header parsing has hardcoded assumptions about header field sizes and
layout, which makes the function hard to modify to parse a new format.

Move information about size and layout into a data structure to
make parsing code more generic and simplify adding a new format.
This also removes some magic numbers.

Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 122 +--
 1 file changed, 92 insertions(+), 30 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index b3d39c8793be..7f0bbfde94e3 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -150,39 +150,100 @@ static void __init dir_utime(void)
}
 }
 
-static __initdata time64_t mtime;
-
 /* cpio header parsing */
-
-static __initdata unsigned long ino, major, minor, nlink;
+static __initdata time64_t mtime;
+static __initdata u32 ino, major, minor, nlink, rmajor, rminor;
 static __initdata umode_t mode;
-static __initdata unsigned long body_len, name_len;
+static __initdata u32 body_len, name_len;
 static __initdata uid_t uid;
 static __initdata gid_t gid;
-static __initdata unsigned rdev;
+static __initdata u32 mode_u32;
+
+struct cpio_hdr_field {
+   size_t offset;
+   size_t size;
+   void *out;
+   size_t out_size;
+   const char *name;
+};
+
+#define HDR_FIELD(type, field, variable) \
+   { .offset = offsetof(type, field) + \
+ BUILD_BUG_ON_ZERO(sizeof(*(variable))*2 < FIELD_SIZEOF(type, field)),\
+ .size = FIELD_SIZEOF(type, field), \
+ .out = variable, \
+ .out_size = sizeof(*(variable)), \
+ .name = #field }
+
+#define NEWC_FIELD(field, variable) \
+   HDR_FIELD(struct cpio_newc_header, field, variable)
+
+#define CPIO_MAX_HEADER_SIZE sizeof(struct cpio_newc_header)
+#define CPIO_MAX_FIELD_SIZE 8
+#define CPIO_MAGIC_SIZE 6
+
+struct cpio_newc_header {
+   charc_ino[8];
+   charc_mode[8];
+   charc_uid[8];
+   charc_gid[8];
+   charc_nlink[8];
+   charc_mtime[8];
+   charc_filesize[8];
+   charc_devmajor[8];
+   charc_devminor[8];
+   charc_rdevmajor[8];
+   charc_rdevminor[8];
+   charc_namesize[8];
+   charc_check[8];
+};
+
+static struct cpio_hdr_field cpio_newc_header_info[] __initdata = {
+   NEWC_FIELD(c_ino, &ino),
+   NEWC_FIELD(c_mode, &mode_u32),
+   NEWC_FIELD(c_uid, &uid),
+   NEWC_FIELD(c_gid, &gid),
+   NEWC_FIELD(c_nlink, &nlink),
+   NEWC_FIELD(c_mtime, &mtime),
+   NEWC_FIELD(c_filesize, &body_len),
+   NEWC_FIELD(c_devmajor, &major),
+   NEWC_FIELD(c_devminor, &minor),
+   NEWC_FIELD(c_rdevmajor, &rmajor),
+   NEWC_FIELD(c_rdevminor, &rminor),
+   NEWC_FIELD(c_namesize, &name_len),
+   { 0 },
+};
 
 static void __init parse_header(char *s)
 {
-   unsigned long parsed[12];
-   char buf[9];
-   int i;
-
-   buf[8] = '\0';
-   for (i = 0; i < 12; i++, s += 8) {
-   memcpy(buf, s, 8);
-   parsed[i] = simple_strtoul(buf, NULL, 16);
+   char buf[CPIO_MAX_FIELD_SIZE + 1];
+   struct cpio_hdr_field *field = cpio_newc_header_info;
+
+   while (field->size) {
+   int ret = 0;
+
+   memcpy(buf, s + field->offset, field->size);
+   buf[field->size] = '\0';
+   switch (field->out_size) {
+   case sizeof(u32):
+   ret = kstrtou32(buf, 16, field->out);
+   pr_debug("cpio field %s: %u, buf: %s\n",
+field->name, *(u32 *)field->out, buf);
+   break;
+   case sizeof(u64):
+   ret = kstrtou64(buf, 16, field->out);
+   pr_debug("cpio field %s: %llu, buf: %s\n",
+field->name, *(u64 *)field->out, buf);
+   break;
+   default:
+   BUG_ON(1);
+   }
+
+   if (ret)
+   pr_err("invalid cpio header field (%d)", ret);
+   field++;
}
-   ino = parsed[0];
-   mode = parsed[1];
-   uid = parsed[2];
-   gid = parsed[3];
-   nlink = parsed[4];
-   mtime = parsed[5]; /* breaks in y2106 */
-   body_len = parsed[6];
-   major = parsed[7];
-   minor = parsed[8];
-   rdev = new_encode_dev(MKDEV(parsed[9], parsed[10]));
-   name_len = parsed[11];
+   mode = mode_u32;
 }
 
 /* FSM */
@@ -234,7 +295,7 @@ static __initdata char *header_buf, *symlink_buf, *name_buf;
 
 static int __init do_start(void)
 {
-   read_into(header_buf, 6, do_format);
+   read_into(header_buf, CPIO_MAGIC_SIZE, do_format);
return 0;
 }
 
@@ -254,15 +315,15 @@ static int __init do_collect(void)
 
 static int __init do_format(void)
 {
-   if (memcmp(collected, "070707", 6)==0) {
+   if (memcmp(collected, "070707", CPIO_MAGIC_SIZE) == 

Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-24 Thread Eric Dumazet
On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:
> Hi Tariq
> 
> On 01/22/2018 10:12 AM, jianchao.wang wrote:
> > > > On 19/01/2018 5:49 PM, Eric Dumazet wrote:
> > > > > On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
> > > > > > Hi Tariq
> > > > > > 
> > > > > > Very sad that the crash was reproduced again after applying
> > > > > > the patch.
> > > 
> > > Memory barriers vary for different Archs, can you please share more 
> > > details regarding arch and repro steps?
> > The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
> > 12/27/2015
> > The xen is installed. The crash occurred in DOM0.
> > Regarding to the repro steps, it is a customer's test which does heavy disk 
> > I/O over NFS storage without any guest.
> > 
> 
> What is the final suggestion on this?
> If we use wmb() there, is performance pulled down?

Since 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155

we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.

I doubt the additional wmb() will have serious impact there.



[PATCH v2 05/15] initramfs: move files creation into separate state

2018-01-24 Thread Taras Kondratiuk
Move most of the file creation logic into a separate state. This splits
the data collection stage from data processing and makes it easier to add
additional states for a new archive format.

Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 52 ++--
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index d0ab7ad6ac05..2d5920c094e0 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -192,6 +192,7 @@ static int __init do_collect(void);
 static int __init do_header(void);
 static int __init do_skip(void);
 static int __init do_name(void);
+static int __init do_create(void);
 static int __init do_copy(void);
 static int __init do_symlink(void);
 static int __init do_reset(void);
@@ -292,12 +293,12 @@ static int __init do_reset(void)
return 1;
 }
 
-static int __init maybe_link(void)
+static int __init maybe_link(char *name)
 {
if (nlink >= 2) {
-   char *old = find_link(major, minor, ino, mode, collected);
+   char *old = find_link(major, minor, ino, mode, name);
if (old)
-   return (sys_link(old, collected) < 0) ? -1 : 1;
+   return (sys_link(old, name) < 0) ? -1 : 1;
}
return 0;
 }
@@ -321,52 +322,59 @@ static void *memcpy_optional(void *dest, const void *src, 
size_t n)
return dest;
 }
 
-static __initdata int wfd;
-
 static int __init do_name(void)
 {
-   state = do_skip;
-   next_state = do_reset;
if (strcmp(collected, "TRAILER!!!") == 0) {
+   state = do_skip;
+   next_state = do_reset;
free_hash();
return 0;
}
-   clean_path(collected, mode);
+   memcpy_optional(name_buf, collected, N_ALIGN(name_len));
+   state = do_create;
+   return 0;
+}
+
+
+static __initdata int wfd;
+
+static int __init do_create(void)
+{
+   state = do_skip;
+   next_state = do_reset;
+   clean_path(name_buf, mode);
if (S_ISREG(mode)) {
-   int ml = maybe_link();
+   int ml = maybe_link(name_buf);
if (ml >= 0) {
int openflags = O_WRONLY|O_CREAT;
if (ml != 1)
openflags |= O_TRUNC;
-   wfd = sys_open(collected, openflags, mode);
+   wfd = sys_open(name_buf, openflags, mode);
 
if (wfd >= 0) {
sys_fchown(wfd, uid, gid);
sys_fchmod(wfd, mode);
if (body_len)
sys_ftruncate(wfd, body_len);
-   memcpy_optional(name_buf, collected,
-   N_ALIGN(name_len));
state = do_copy;
}
}
} else if (S_ISDIR(mode)) {
-   sys_mkdir(collected, mode);
-   sys_chown(collected, uid, gid);
-   sys_chmod(collected, mode);
-   dir_add(collected, mtime);
+   sys_mkdir(name_buf, mode);
+   sys_chown(name_buf, uid, gid);
+   sys_chmod(name_buf, mode);
+   dir_add(name_buf, mtime);
} else if (S_ISBLK(mode) || S_ISCHR(mode) ||
   S_ISFIFO(mode) || S_ISSOCK(mode)) {
-   if (maybe_link() == 0) {
-   sys_mknod(collected, mode, rdev);
-   sys_chown(collected, uid, gid);
-   sys_chmod(collected, mode);
-   do_utime(collected, mtime);
+   if (maybe_link(name_buf) == 0) {
+   sys_mknod(name_buf, mode, rdev);
+   sys_chown(name_buf, uid, gid);
+   sys_chmod(name_buf, mode);
+   do_utime(name_buf, mtime);
}
} else if (S_ISLNK(mode)) {
if (body_len > PATH_MAX)
return 0;
-   memcpy_optional(name_buf, collected, N_ALIGN(name_len));
read_into(symlink_buf, body_len, do_symlink);
}
return 0;
-- 
2.10.3.dirty



[PATCH v2 01/15] Documentation: add newcx initramfs format description

2018-01-24 Thread Taras Kondratiuk
Many of the Linux security/integrity features are dependent on file
metadata, stored as extended attributes (xattrs), for making decisions.
These features need to be initialized during initcall and enabled as
early as possible for complete security coverage.

Initramfs (tmpfs) supports xattrs, but newc CPIO archive format does not
support including them into the archive.

This patch describes the "extended" newc format (newcx), which is based
on newc with the following changes:
- extended attributes support
- increased filesize field size to support files >4GB
- increased mtime field size for usec precision and more than
  32 bits of seconds
- removed the unused checksum field

Signed-off-by: Taras Kondratiuk 
Signed-off-by: Mimi Zohar 
Signed-off-by: Victor Kamensky 
---
 Documentation/early-userspace/buffer-format.txt | 46 ++---
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/Documentation/early-userspace/buffer-format.txt 
b/Documentation/early-userspace/buffer-format.txt
index e1fd7f9dad16..d818df4f72dc 100644
--- a/Documentation/early-userspace/buffer-format.txt
+++ b/Documentation/early-userspace/buffer-format.txt
@@ -24,6 +24,7 @@ grammar, where:
+   indicates concatenation
GZIP()  indicates the gzip(1) of the operand
ALGN(n) means padding with null bytes to an n-byte boundary
+   [n] means size of field is n bytes
 
initramfs  := ("\0" | cpio_archive | cpio_gzip_archive)*
 
@@ -31,20 +32,29 @@ grammar, where:
 
cpio_archive := cpio_file* + ( | cpio_trailer)
 
-   cpio_file := ALGN(4) + cpio_header + filename + "\0" + ALGN(4) + data
+   cpio_file := (cpio_newc_file | cpio_newcx_file)
+
+   cpio_newc_file := ALGN(4) + cpio_newc_header + filename + "\0" + \
+ ALGN(4) + data
+
+   cpio_newcx_file := ALGN(4) + cpio_newcx_header + filename + "\0" + \
+  ALGN(4) + xattrs + ALGN(4) + data
+
+   xattrs := xattr_entry*
+
+   xattr_entry := xattr_size[8] + xattr_name + "\0" + xattr_value
 
cpio_trailer := ALGN(4) + cpio_header + "TRAILER!!!\0" + ALGN(4)
 
 
 In human terms, the initramfs buffer contains a collection of
-compressed and/or uncompressed cpio archives (in the "newc" or "crc"
-formats); arbitrary amounts zero bytes (for padding) can be added
-between members.
+compressed and/or uncompressed cpio archives; arbitrary amounts of
+zero bytes (for padding) can be added between members.
 
 The cpio "TRAILER!!!" entry (cpio end-of-archive) is optional, but is
 not ignored; see "handling of hard links" below.
 
-The structure of the cpio_header is as follows (all fields contain
+The structure of the cpio_newc_header is as follows (all fields contain
 hexadecimal ASCII numbers fully padded with '0' on the left to the
 full width of the field, for example, the integer 4780 is represented
 by the ASCII string "12ac"):
@@ -81,6 +91,32 @@ algorithm used.
 If the filename is "TRAILER!!!" this is actually an end-of-archive
 marker; the c_filesize for an end-of-archive marker must be zero.
 
+"Extended" newc format (newcx)
+The "newcx" cpio format extends "newc" by increasing the size of some
+fields and adding extended attributes support. cpio_newcx_header structure:
+
+Field nameField sizeMeaning
+c_magic   6 bytes   The string "070703"
+c_ino 8 bytes   File inode number
+c_mode8 bytes   File mode and permissions
+c_uid 8 bytes   File uid
+c_gid 8 bytes   File gid
+c_nlink   8 bytes   Number of links
+c_mtime  16 bytes   Modification time (microseconds)
+c_filesize16 bytes  Size of data field
+c_maj 8 bytes   Major part of file device number
+c_min 8 bytes   Minor part of file device number
+c_rmaj8 bytes   Major part of device node reference
+c_rmin8 bytes   Minor part of device node reference
+c_namesize 8 bytes  Length of filename, including final \0
+c_xattrs_size  8 bytes  Size of xattrs field
+
+Most of the fields match cpio_newc_header, except c_mtime, which contains
+microseconds. The c_chksum field is dropped.
+
+xattr_size is the total size of an xattr_entry, including the 8 bytes of
+xattr_size itself. xattr_size uses the same hexadecimal ASCII encoding as
+the other fields of the cpio header.
 
 *** Handling of hard links
 
-- 
2.10.3.dirty



[PATCH v2 10/15] gen_init_cpio: move header formatting into function

2018-01-24 Thread Taras Kondratiuk
CPIO header is generated in multiple places with the same sprintf()
format string. Move the formatting into a single function in preparation
for adding a new cpio format.

Signed-off-by: Taras Kondratiuk 
---
 usr/gen_init_cpio.c | 186 ++--
 1 file changed, 92 insertions(+), 94 deletions(-)

diff --git a/usr/gen_init_cpio.c b/usr/gen_init_cpio.c
index 03b21189d58b..7a2a6d85345d 100644
--- a/usr/gen_init_cpio.c
+++ b/usr/gen_init_cpio.c
@@ -64,34 +64,55 @@ static void push_rest(const char *name)
}
 }
 
-static void push_hdr(const char *s)
+struct cpio_header {
+   unsigned int ino;
+   unsigned int mode;
+   uid_t uid;
+   gid_t gid;
+   unsigned int nlink;
+   time_t mtime;
+   size_t filesize;
+   int devmajor;
+   int devminor;
+   int rdevmajor;
+   int rdevminor;
+   size_t namesize;
+   unsigned int check;
+};
+
+static void push_hdr(const struct cpio_header *hdr)
 {
+   char s[256];
+
+   sprintf(s, "%s%08X%08X%08lX%08lX%08X%08lX"
+  "%08X%08X%08X%08X%08X%08X%08X",
+   "070701",
+   hdr->ino,
+   hdr->mode,
+   (long)hdr->uid,
+   (long)hdr->gid,
+   hdr->nlink,
+   (long)hdr->mtime,
+   (unsigned int)hdr->filesize,
+   hdr->devmajor,
+   hdr->devminor,
+   hdr->rdevmajor,
+   hdr->rdevminor,
+   (unsigned int)hdr->namesize,
+   hdr->check);
fputs(s, stdout);
offset += 110;
 }
 
 static void cpio_trailer(void)
 {
-   char s[256];
const char name[] = "TRAILER!!!";
+   struct cpio_header hdr = {
+   .nlink = 1,
+   .namesize = strlen(name)+1,
+   };
 
-   sprintf(s, "%s%08X%08X%08lX%08lX%08X%08lX"
-  "%08X%08X%08X%08X%08X%08X%08X",
-   "070701",   /* magic */
-   0,  /* ino */
-   0,  /* mode */
-   (long) 0,   /* uid */
-   (long) 0,   /* gid */
-   1,  /* nlink */
-   (long) 0,   /* mtime */
-   0,  /* filesize */
-   0,  /* major */
-   0,  /* minor */
-   0,  /* rmajor */
-   0,  /* rminor */
-   (unsigned)strlen(name)+1, /* namesize */
-   0); /* chksum */
-   push_hdr(s);
+   push_hdr(&hdr);
push_rest(name);
 
while (offset % 512) {
@@ -103,27 +124,21 @@ static void cpio_trailer(void)
 static int cpio_mkslink(const char *name, const char *target,
 unsigned int mode, uid_t uid, gid_t gid)
 {
-   char s[256];
-
if (name[0] == '/')
name++;
-   sprintf(s,"%s%08X%08X%08lX%08lX%08X%08lX"
-  "%08X%08X%08X%08X%08X%08X%08X",
-   "070701",   /* magic */
-   ino++,  /* ino */
-   S_IFLNK | mode, /* mode */
-   (long) uid, /* uid */
-   (long) gid, /* gid */
-   1,  /* nlink */
-   (long) default_mtime,   /* mtime */
-   (unsigned)strlen(target)+1, /* filesize */
-   3,  /* major */
-   1,  /* minor */
-   0,  /* rmajor */
-   0,  /* rminor */
-   (unsigned)strlen(name) + 1,/* namesize */
-   0); /* chksum */
-   push_hdr(s);
+   struct cpio_header hdr = {
+   .ino = ino++,
+   .mode = S_IFLNK | mode,
+   .uid = uid,
+   .gid = gid,
+   .nlink = 1,
+   .mtime = default_mtime,
+   .filesize = strlen(target)+1,
+   .devmajor = 3,
+   .devminor = 1,
+   .namesize = strlen(name)+1,
+   };
+   push_hdr(&hdr);
push_string(name);
push_pad();
push_string(target);
@@ -152,27 +167,20 @@ static int cpio_mkslink_line(const char *line)
 static int cpio_mkgeneric(const char *name, unsigned int mode,
   uid_t uid, gid_t gid)
 {
-   char s[256];
-
if (name[0] == '/')
name++;
-   sprintf(s,"%s%08X%08X%08lX%08lX%08X%08lX"
-  "%08X%08X%08X%08X%08X%08X%08X",
-   "070701",   /* magic */
-   ino++,  /* ino */
-   mode,   /* mode */
-   (long) uid, /* uid */
-   (long) gid, /* gid */
-   2,

Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Jia-Ju Bai



On 2018/1/25 11:34, Jens Axboe wrote:
> On 1/24/18 7:46 PM, Jia-Ju Bai wrote:
>> The function ioc_create_icq here is not called in atomic context.
>> Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.
>>
>> This is found by a static analysis tool named DCNS written by myself.
>
> But it's running off the IO submission path, so by definition the GFP
> mask cannot include anything that will do IO. GFP_KERNEL will make
> it deadlock prone.
>
> It could be GFP_NOIO, but that's also overlooking the fact that we can
> have preemption disabled here.
>
> On top of all that, we want something quick here, and it's OK that
> it fails. That's preferable to blocking. So we want an atomic alloc,
> even if we could tolerate a blocking one.
>
> So I think you need to fix your static analysis tool, it's missing
> a few key things.



Okay, thanks for your advice.


Thanks,
Jia-Ju Bai



[PATCH v2 12/15] gen_init_cpio: set extended attributes for newcx format

2018-01-24 Thread Taras Kondratiuk
gen_init_cpio creates CPIO archive according to cpio_list manifest file
that contains list of archive entries (one per line). To be able to
store extended attributes in newcx CPIO format we need to pass them via
cpio_list file.

One way of doing it would be to append xattrs to each entry line, but
"file" lines have a variable number of elements because of hardlinks,
and it is not obvious how to mark the end of the hardlinks and the start
of the xattrs in this case.

This patch introduces a new entry type: "xattr". Each "xattr" line
specifies one name=value pair. xattr values are applied to the next
non-xattr line, and there can be multiple "xattr" lines before a
non-xattr line.

It might be more logical to place xattr lines after the corresponding
file entry, but that would make parsing a bit more complex and require
more intrusive changes.

Xattr values are hex-encoded (see getfattr(1)). A plain-string variant
would be easier to read, but special symbols would have to be escaped;
hex encoding is much simpler.

Signed-off-by: Taras Kondratiuk 
---
 usr/gen_init_cpio.c | 142 +++-
 1 file changed, 119 insertions(+), 23 deletions(-)

diff --git a/usr/gen_init_cpio.c b/usr/gen_init_cpio.c
index 78a47a5bdcb1..e356f9e532a2 100644
--- a/usr/gen_init_cpio.c
+++ b/usr/gen_init_cpio.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -50,21 +51,10 @@ static void push_pad (void)
}
 }
 
-static void push_rest(const char *name)
+static void push_string_padded(const char *name)
 {
-   unsigned int name_len = strlen(name) + 1;
-   unsigned int tmp_ofs;
-
-   fputs(name, stdout);
-   putchar(0);
-   offset += name_len;
-
-   tmp_ofs = name_len + cpio_hdr_size;
-   while (tmp_ofs & 3) {
-   putchar(0);
-   offset++;
-   tmp_ofs++;
-   }
+   push_string(name);
+   push_pad();
 }
 
 struct cpio_header {
@@ -137,7 +127,7 @@ static void cpio_trailer(void)
};
 
push_hdr(&hdr);
-   push_rest(name);
+   push_string_padded(name);
 
while (offset % 512) {
putchar(0);
@@ -145,6 +135,96 @@ static void cpio_trailer(void)
}
 }
 
+struct xattr_hdr {
+   char c_size[8]; /* total size including c_size field */
+   char c_data[];
+};
+static unsigned int xattr_buflen;
+static char xattr_buf[4096];
+
+static void push_xattrs(void)
+{
+   if (!newcx || !xattr_buflen)
+   return;
+
+   if (fwrite(xattr_buf, xattr_buflen, 1, stdout) != 1)
+   fprintf(stderr, "writing xattrs failed\n");
+   offset += xattr_buflen;
+   xattr_buflen = 0;
+
+   push_pad();
+}
+
+static int convert_hex_string(const char *hex_str, char *out, size_t out_size)
+{
+   char buf[3];
+   size_t str_len = strlen(hex_str);
+
+   if (str_len % 2 != 0 || str_len / 2 > out_size)
+   return 0;
+
+   buf[2] = '\0';
+   while (*hex_str != '\0') {
+   buf[0] = *hex_str++;
+   buf[1] = *hex_str++;
+   *out++ = (char)strtol(buf, NULL, 16);
+   }
+
+   return str_len / 2;
+}
+
+static int collect_xattr(const char *line)
+{
+   const char *name, *value;
+   size_t name_len, value_len;
+   char *buf = xattr_buf + xattr_buflen;
+   struct xattr_hdr *hdr = (struct xattr_hdr *)buf;
+   char *bufend = xattr_buf + sizeof(xattr_buf);
+   char *value_buf;
+   size_t xattr_entry_size;
+   char size_str[sizeof(hdr->c_size) + 1];
+
+   if (!newcx)
+   return 0;
+
+   name = line;
+   value = strchr(line, '=');
+   if (!value) {
+   fprintf(stderr, "Unrecognized xattr format '%s'", line);
+   return -1;
+   }
+   name_len = value - name;
+   value++;
+
+   /*
+* For now we support only hex encoded values.
+* String or base64 can be added later.
+*/
+   if (strncmp(value, "0x", 2)) {
+   fprintf(stderr,
+   "Only hex encoded xattr value is supported '%s'",
+   value);
+   return -1;
+   }
+
+   value += 2;
+   value_buf = buf + sizeof(struct xattr_hdr) + name_len + 1;
+   value_len = convert_hex_string(value, value_buf, bufend - value_buf);
+   if (value_len == 0) {
+   fprintf(stderr, "Failed to parse xattr value '%s'", line);
+   return -1;
+   }
+   xattr_entry_size = sizeof(struct xattr_hdr) + name_len + 1 + value_len;
+
+   sprintf(size_str, "%08X", (unsigned int)xattr_entry_size);
+   memcpy(hdr->c_size, size_str, sizeof(hdr->c_size));
+   memcpy(hdr->c_data, name, name_len);
+   hdr->c_data[name_len] = '\0';
+   xattr_buflen += xattr_entry_size;
+
+   return 0;
+}
+
 static int cpio_mkslink(const char *name, const char *target,
 unsigned int mode, uid_t uid, gid_t gid)
 {
@@ -161,12 +241,12 @@ static int cpio_mkslin

Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Jens Axboe
On 1/24/18 7:46 PM, Jia-Ju Bai wrote:
> The function ioc_create_icq here is not called in atomic context.
> Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.
> 
> This is found by a static analysis tool named DCNS written by myself.

But it's running off the IO submission path, so by definition the GFP
mask cannot include anything that will do IO. GFP_KERNEL will make
it deadlock prone.

It could be GFP_NOIO, but that's also overlooking the fact that we can
have preemption disabled here.

On top of all that, we want something quick here, and it's OK that
it fails. That's preferable to blocking. So we want an atomic alloc,
even if we could tolerate a blocking one.

So I think you need to fix your static analysis tool, it's missing
a few key things.

-- 
Jens Axboe



Re: linux-next: manual merge of the vfs tree with the overlayfs tree

2018-01-24 Thread Stephen Rothwell
Hi all,

On Thu, 25 Jan 2018 14:31:55 +1100 Stephen Rothwell  
wrote:
>
> + if (!disconnected) {
>  -hlist_bl_lock(&tmp->d_sb->s_roots);
>  -hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_roots);
>  -hlist_bl_unlock(&tmp->d_sb->s_roots);
> ++hlist_bl_lock(&dentry->d_sb->s_roots);
> ++hlist_bl_add_head(&dentry->d_hash, &tmp->d_sb->s_roots);

I fixed up the "tmp" here ... it should have been "dentry".

-- 
Cheers,
Stephen Rothwell


Re: [PATCH v1 0/2] Remove duplicate driver for MyGica T230C

2018-01-24 Thread Stefan Brüns
On Wednesday, 10 January 2018 00:33:37 CET Stefan Brüns wrote:
> In 2017-02, two drivers for the T230C were submitted, but until now
> only the one based on the older dvb-usb/cxusb.c driver has been part
> of the mainline kernel. As a dvb-usb-v2 driver is preferable, remove
> the other driver.
> 
> The cxusb.c patch also contained an unrelated change for the T230,
> i.e. a correction of the RC model. As this change apparently is
> correct, restore it. This has not been tested due to lack of hardware.
> 
> 
> Evgeny Plehov (1):
>   Revert "[media] dvb-usb-cxusb: Geniatech T230C support"
> 
> Stefan Brüns (1):
>   [media] cxusb: restore RC_MAP for MyGica T230
> 
>  drivers/media/usb/dvb-usb/cxusb.c | 137
> -- 1 file changed, 137 deletions(-)


Ping!

-- 
Stefan Brüns  /  Bergstraße 21  /  52062 Aachen
home: +49 241 53809034 mobile: +49 151 50412019



[PATCH v2 11/15] gen_init_cpio: add newcx format

2018-01-24 Thread Taras Kondratiuk
Add "newcx" format that supports extended attributes and has increased
size of c_mtime and c_filesize fields.

Added -x option to select "newcx" format. Default is "newc".

Refer to Documentation/early-userspace/buffer-format.txt for detailed
format description.

Signed-off-by: Taras Kondratiuk 
---
 usr/gen_init_cpio.c | 70 +
 1 file changed, 49 insertions(+), 21 deletions(-)

diff --git a/usr/gen_init_cpio.c b/usr/gen_init_cpio.c
index 7a2a6d85345d..78a47a5bdcb1 100644
--- a/usr/gen_init_cpio.c
+++ b/usr/gen_init_cpio.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Original work by Jeff Garzik
@@ -21,6 +22,8 @@
 #define xstr(s) #s
 #define str(s) xstr(s)
 
+static int newcx;
+static unsigned int cpio_hdr_size;
 static unsigned int offset;
 static unsigned int ino = 721;
 static time_t default_mtime;
@@ -56,7 +59,7 @@ static void push_rest(const char *name)
putchar(0);
offset += name_len;
 
-   tmp_ofs = name_len + 110;
+   tmp_ofs = name_len + cpio_hdr_size;
while (tmp_ofs & 3) {
putchar(0);
offset++;
@@ -77,6 +80,7 @@ struct cpio_header {
int rdevmajor;
int rdevminor;
size_t namesize;
+   size_t xattrsize;
unsigned int check;
 };
 
@@ -84,24 +88,44 @@ static void push_hdr(const struct cpio_header *hdr)
 {
char s[256];
 
-   sprintf(s, "%s%08X%08X%08lX%08lX%08X%08lX"
-  "%08X%08X%08X%08X%08X%08X%08X",
-   "070701",
-   hdr->ino,
-   hdr->mode,
-   (long)hdr->uid,
-   (long)hdr->gid,
-   hdr->nlink,
-   (long)hdr->mtime,
-   (unsigned int)hdr->filesize,
-   hdr->devmajor,
-   hdr->devminor,
-   hdr->rdevmajor,
-   hdr->rdevminor,
-   (unsigned int)hdr->namesize,
-   hdr->check);
+   if (newcx) {
+   sprintf(s, "%s%08X%08X%08lX%08lX%08X%016llX"
+  "%016llX%08X%08X%08X%08X%08X%08X",
+   "070703",
+   hdr->ino,
+   hdr->mode,
+   (long)hdr->uid,
+   (long)hdr->gid,
+   hdr->nlink,
+   hdr->mtime * 1000000ULL,
+   (long long)hdr->filesize,
+   hdr->devmajor,
+   hdr->devminor,
+   hdr->rdevmajor,
+   hdr->rdevminor,
+   (unsigned int)hdr->namesize,
+   (unsigned int)hdr->xattrsize);
+   } else {
+   sprintf(s, "%s%08X%08X%08lX%08lX%08X%08lX"
+  "%08X%08X%08X%08X%08X%08X%08X",
+   "070701",
+   hdr->ino,
+   hdr->mode,
+   (long)hdr->uid,
+   (long)hdr->gid,
+   hdr->nlink,
+   (long)hdr->mtime,
+   (unsigned int)hdr->filesize,
+   hdr->devmajor,
+   hdr->devminor,
+   hdr->rdevmajor,
+   hdr->rdevminor,
+   (unsigned int)hdr->namesize,
+   hdr->check);
+   }
fputs(s, stdout);
-   offset += 110;
+   assert((offset & 3) == 0);
+   offset += cpio_hdr_size;
 }
 
 static void cpio_trailer(void)
@@ -301,7 +325,7 @@ static int cpio_mkfile(const char *name, const char 
*location,
 {
char *filebuf = NULL;
struct stat buf;
-   long size;
+   size_t size;
int file = -1;
int retval;
int rc = -1;
@@ -450,7 +474,7 @@ static int cpio_mkfile_line(const char *line)
 static void usage(const char *prog)
 {
fprintf(stderr, "Usage:\n"
-   "\t%s [-t ] \n"
+   "\t%s [-t ] [-x] \n"
"\n"
" is a file containing newline separated entries 
that\n"
"describe the files to be included in the initramfs archive:\n"
@@ -527,7 +551,7 @@ int main (int argc, char *argv[])
 
default_mtime = time(NULL);
while (1) {
-   int opt = getopt(argc, argv, "t:h");
+   int opt = getopt(argc, argv, "t:hx");
char *invalid;
 
if (opt == -1)
@@ -542,12 +566,16 @@ int main (int argc, char *argv[])
exit(1);
}
break;
+   case 'x':
+   newcx = 1;
+   break;
case 'h':
case '?':
usage(argv[0]);
exit(opt == 'h' ? 0 : 1);
}
}
+   cpio_hdr_size = newcx ? 134 : 110;
 
if (argc - optind != 1) {
usa

[PATCH 2/2] block: blk-tag: Replace GFP_ATOMIC with GFP_KERNEL in init_tag_map

2018-01-24 Thread Jia-Ju Bai
After checking all possible call chains to init_tag_map,
my tool finds that init_tag_map is never called in atomic context.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This was found by DCNS, a static analysis tool that I wrote.

Signed-off-by: Jia-Ju Bai 
---
 block/blk-tag.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-tag.c b/block/blk-tag.c
index 09f19c6..177b6d1 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -99,12 +99,12 @@ void blk_queue_free_tags(struct request_queue *q)
   __func__, depth);
}
 
-   tag_index = kzalloc(depth * sizeof(struct request *), GFP_ATOMIC);
+   tag_index = kzalloc(depth * sizeof(struct request *), GFP_KERNEL);
if (!tag_index)
goto fail;
 
nr_ulongs = ALIGN(depth, BITS_PER_LONG) / BITS_PER_LONG;
-   tag_map = kzalloc(nr_ulongs * sizeof(unsigned long), GFP_ATOMIC);
+   tag_map = kzalloc(nr_ulongs * sizeof(unsigned long), GFP_KERNEL);
if (!tag_map)
goto fail;
 
-- 
1.7.9.5



[PATCH v2 14/15] selinux: allow setxattr on rootfs so initramfs code can set them

2018-01-24 Thread Taras Kondratiuk
From: Victor Kamensky 

The initramfs code supporting the extended cpio format has the ability
to fill in extended attributes from the cpio archive, but if SELinux is
enabled and the security server is not initialized yet, the SELinux
callback will refuse setxattr calls made by the initramfs code.

The solution is to enable SBLABEL_MNT on rootfs even if the security
server is not initialized yet.

Signed-off-by: Victor Kamensky 
---
 security/selinux/hooks.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 8644d864e3c1..f3fe65589f02 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -706,6 +706,18 @@ static int selinux_set_mnt_opts(struct super_block *sb,
 
if (!ss_initialized) {
if (!num_opts) {
+   /*
+* Special handling for rootfs. Is genfs but supports
+* setting SELinux context on in-core inodes.
+*
+* Chicken and egg problem: policy may reside in rootfs
+* but for initramfs code to fill in attributes, it
+* needs selinux to allow that.
+*/
+   if (!strncmp(sb->s_type->name, "rootfs",
+sizeof("rootfs")))
+   sbsec->flags |= SBLABEL_MNT;
+
/* Defer initialization until selinux_complete_init,
   after the initial policy is loaded and the security
   server is ready to handle calls. */
-- 
2.10.3.dirty



[PATCH v2 04/15] initramfs: remove unnecessary symlinks processing shortcut

2018-01-24 Thread Taras Kondratiuk
Special handling of symlinks in do_header() assumes that name and body
entries are sequential and reads them together. This shortcut has no
real performance benefits, but it complicates changes to the state
machine.

Make the handling of symlinks more similar to regular files: store the
name in name_buf and the destination in symlink_buf.

Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 29 +
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index b6ee675e5cdb..d0ab7ad6ac05 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -266,16 +266,7 @@ static int __init do_header(void)
state = do_skip;
if (name_len <= 0 || name_len > PATH_MAX)
return 0;
-   if (S_ISLNK(mode)) {
-   if (body_len > PATH_MAX)
-   return 0;
-   collect = collected = symlink_buf;
-   remains = N_ALIGN(name_len) + body_len;
-   next_state = do_symlink;
-   state = do_collect;
-   return 0;
-   }
-   if (S_ISREG(mode) || !body_len)
+   if (S_ISREG(mode) || S_ISLNK(mode) || !body_len)
read_into(name_buf, N_ALIGN(name_len), do_name);
return 0;
 }
@@ -372,6 +363,11 @@ static int __init do_name(void)
sys_chmod(collected, mode);
do_utime(collected, mtime);
}
+   } else if (S_ISLNK(mode)) {
+   if (body_len > PATH_MAX)
+   return 0;
+   memcpy_optional(name_buf, collected, N_ALIGN(name_len));
+   read_into(symlink_buf, body_len, do_symlink);
}
return 0;
 }
@@ -397,11 +393,12 @@ static int __init do_copy(void)
 
 static int __init do_symlink(void)
 {
-   collected[N_ALIGN(name_len) + body_len] = '\0';
-   clean_path(collected, 0);
-   sys_symlink(collected + N_ALIGN(name_len), collected);
-   sys_lchown(collected, uid, gid);
-   do_utime(collected, mtime);
+   memcpy_optional(symlink_buf, collected, body_len);
+   symlink_buf[body_len] = '\0';
+   clean_path(name_buf, 0);
+   sys_symlink(symlink_buf, name_buf);
+   sys_lchown(name_buf, uid, gid);
+   do_utime(name_buf, mtime);
state = do_skip;
next_state = do_reset;
return 0;
@@ -453,7 +450,7 @@ static char * __init unpack_to_rootfs(char *buf, unsigned 
long len)
static __initdata char msg_buf[64];
 
header_buf = kmalloc(110, GFP_KERNEL);
-   symlink_buf = kmalloc(PATH_MAX + N_ALIGN(PATH_MAX) + 1, GFP_KERNEL);
+   symlink_buf = kmalloc(PATH_MAX + 1, GFP_KERNEL);
name_buf = kmalloc(N_ALIGN(PATH_MAX), GFP_KERNEL);
 
if (!header_buf || !symlink_buf || !name_buf)
-- 
2.10.3.dirty



[PATCH v2 08/15] initramfs: add newcx format

2018-01-24 Thread Taras Kondratiuk
Add a 'newcx' format that adds extended attributes and increases the
size of the c_mtime and c_filesize fields.

Refer to Documentation/early-userspace/buffer-format.txt for detailed
format description.

Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 95 ++--
 1 file changed, 85 insertions(+), 10 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 7f0bbfde94e3..3d0f46c28459 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -54,6 +54,7 @@ static void __init error(char *x)
 /* link hash */
 
 #define N_ALIGN(len) ((((len) + 1) & ~3) + 2)
+#define X_ALIGN(len) ((len + 3) & ~3)
 
 static __initdata struct hash {
int ino, minor, major;
@@ -154,7 +155,8 @@ static void __init dir_utime(void)
 static __initdata time64_t mtime;
 static __initdata u32 ino, major, minor, nlink, rmajor, rminor;
 static __initdata umode_t mode;
-static __initdata u32 body_len, name_len;
+static __initdata u32 name_len, xattr_len;
+static __initdata u64 body_len, mtime_u64;
 static __initdata uid_t uid;
 static __initdata gid_t gid;
 static __initdata u32 mode_u32;
@@ -167,6 +169,12 @@ struct cpio_hdr_field {
const char *name;
 };
 
+static __initdata enum cpio_format {
+   CPIO_NO_MAGIC,
+   CPIO_NEWC,
+   CPIO_NEWCX,
+} cpio_format;
+
 #define HDR_FIELD(type, field, variable) \
{ .offset = offsetof(type, field) + \
  BUILD_BUG_ON_ZERO(sizeof(*(variable))*2 < FIELD_SIZEOF(type, field)),\
@@ -177,9 +185,11 @@ struct cpio_hdr_field {
 
 #define NEWC_FIELD(field, variable) \
HDR_FIELD(struct cpio_newc_header, field, variable)
+#define NEWCX_FIELD(field, variable) \
+   HDR_FIELD(struct cpio_newcx_header, field, variable)
 
-#define CPIO_MAX_HEADER_SIZE sizeof(struct cpio_newc_header)
-#define CPIO_MAX_FIELD_SIZE 8
+#define CPIO_MAX_HEADER_SIZE sizeof(struct cpio_newcx_header)
+#define CPIO_MAX_FIELD_SIZE 16
 #define CPIO_MAGIC_SIZE 6
 
 struct cpio_newc_header {
@@ -214,10 +224,44 @@ static struct cpio_hdr_field cpio_newc_header_info[] 
__initdata = {
{ 0 },
 };
 
+struct cpio_newcx_header {
+   charc_ino[8];
+   charc_mode[8];
+   charc_uid[8];
+   charc_gid[8];
+   charc_nlink[8];
+   charc_mtime[16];
+   charc_filesize[16];
+   charc_devmajor[8];
+   charc_devminor[8];
+   charc_rdevmajor[8];
+   charc_rdevminor[8];
+   charc_namesize[8];
+   charc_xattrsize[8];
+};
+
+static struct cpio_hdr_field cpio_newcx_header_info[] __initdata = {
+   NEWCX_FIELD(c_ino, &ino),
+   NEWCX_FIELD(c_mode, &mode_u32),
+   NEWCX_FIELD(c_uid, &uid),
+   NEWCX_FIELD(c_gid, &gid),
+   NEWCX_FIELD(c_nlink, &nlink),
+   NEWCX_FIELD(c_mtime, &mtime_u64),
+   NEWCX_FIELD(c_filesize, &body_len),
+   NEWCX_FIELD(c_devmajor, &major),
+   NEWCX_FIELD(c_devminor, &minor),
+   NEWCX_FIELD(c_rdevmajor, &rmajor),
+   NEWCX_FIELD(c_rdevminor, &rminor),
+   NEWCX_FIELD(c_namesize, &name_len),
+   NEWCX_FIELD(c_xattrsize, &xattr_len),
+   { 0 },
+};
+
 static void __init parse_header(char *s)
 {
char buf[CPIO_MAX_FIELD_SIZE + 1];
-   struct cpio_hdr_field *field = cpio_newc_header_info;
+   struct cpio_hdr_field *field = (cpio_format == CPIO_NEWC) ?
+   cpio_newc_header_info : cpio_newcx_header_info;
 
while (field->size) {
int ret = 0;
@@ -243,7 +287,15 @@ static void __init parse_header(char *s)
pr_err("invalid cpio header field (%d)", ret);
field++;
}
+
mode = mode_u32;
+   if (cpio_format == CPIO_NEWCX) {
+   /* Microseconds are ignored for now */
+   do_div(mtime_u64, USEC_PER_SEC);
+   mtime = mtime_u64;
+   } else {
+   xattr_len = 0;
+   }
 }
 
 /* FSM */
@@ -254,6 +306,7 @@ static int __init do_format(void);
 static int __init do_header(void);
 static int __init do_skip(void);
 static int __init do_name(void);
+static int __init do_xattrs(void);
 static int __init do_create(void);
 static int __init do_copy(void);
 static int __init do_symlink(void);
@@ -291,7 +344,7 @@ static void __init read_into(char *buf, unsigned size, 
fsm_state_t next)
}
 }
 
-static __initdata char *header_buf, *symlink_buf, *name_buf;
+static __initdata char *header_buf, *symlink_buf, *name_buf, *xattr_buf;
 
 static int __init do_start(void)
 {
@@ -315,22 +368,34 @@ static int __init do_collect(void)
 
 static int __init do_format(void)
 {
-   if (memcmp(collected, "070707", CPIO_MAGIC_SIZE) == 0) {
+   int header_size = 0;
+
+   cpio_format = CPIO_NO_MAGIC;
+
+   if (!memcmp(collected, "070707", CPIO_MAGIC_SIZE)) {
error("incorrect cpio method used: use -H newc option");
return 1;
+   } else if (!memcmp(collected, "070701", CPIO_MAGIC_SIZE)) {
+

Re: [PATCH v6 16/36] nds32: DMA mapping API

2018-01-24 Thread Greentime Hu
Hi, Arnd:

2018-01-24 19:36 GMT+08:00 Arnd Bergmann :
> On Tue, Jan 23, 2018 at 12:52 PM, Greentime Hu  wrote:
>> Hi, Arnd:
>>
>> 2018-01-23 16:23 GMT+08:00 Greentime Hu :
>>> Hi, Arnd:
>>>
>>> 2018-01-18 18:26 GMT+08:00 Arnd Bergmann :
 On Mon, Jan 15, 2018 at 6:53 AM, Greentime Hu  wrote:
> From: Greentime Hu 
>
> This patch adds support for the DMA mapping API. It uses dma_map_ops for
> flexibility.
>
> Signed-off-by: Vincent Chen 
> Signed-off-by: Greentime Hu 

 I'm still unhappy about the way the cache flushes are done here as 
 discussed
 before. It's not a show-stopped, but no Ack from me.
>>>
>>> How about this implementation?
>
>> I am not sure if I understand it correctly.
>> I list all the combinations.
>>
>> RAM to DEVICE
>> before DMA => writeback cache
>> after DMA => nop
>>
>> DEVICE to RAM
>> before DMA => nop
>> after DMA => invalidate cache
>>
>> static void consistent_sync(void *vaddr, size_t size, int direction, int 
>> master)
>> {
>> unsigned long start = (unsigned long)vaddr;
>> unsigned long end = start + size;
>>
>> if (master == FOR_CPU) {
>> switch (direction) {
>> case DMA_TO_DEVICE:
>> break;
>> case DMA_FROM_DEVICE:
>> case DMA_BIDIRECTIONAL:
>> cpu_dma_inval_range(start, end);
>> break;
>> default:
>> BUG();
>> }
>> } else {
>> /* FOR_DEVICE */
>> switch (direction) {
>> case DMA_FROM_DEVICE:
>> break;
>> case DMA_TO_DEVICE:
>> case DMA_BIDIRECTIONAL:
>> cpu_dma_wb_range(start, end);
>> break;
>> default:
>> BUG();
>> }
>> }
>> }
>
> That looks reasonable enough, but it does depend on a number of factors,
> and the dma-mapping.h implementation is not just about cache flushes.
>
> As I don't know the microarchitecture, can you answer these questions:
>
> - are caches always write-back, or could they be write-through?
Yes, we can configure it as write-back or write-through.

> - can the cache be shared with another CPU or a device?
No, we don't support it.

> - if the cache is shared, is it always coherent, never coherent, or
> either of them?
We don't support SMP, and the device accesses memory through the bus. I
think the cache is not shared.

> - could the same memory be visible at different physical addresses
>   and have conflicting caches?
We currently don't have such kind of SoC memory map.

> - is the CPU physical address always the same as the address visible to the
>   device?
Yes, it is always the same unless the CPU uses local memory. The
physical address of local memory will overlap the original bus
address.
I think the local memory case can be ignored because we don't use it for now.

> - are there devices that can only see a subset of the physical memory?
All devices are able to see the whole physical memory in our current
SoC, but I think other SoCs may support that kind of HW behavior.

> - can there be an IOMMU?
No.

> - are there write-buffers in the CPU that might need to get flushed before
>   flushing the cache?
Yes, there are write-buffers in front of the CPU caches, but they should
be transparent to SW. We don't need to flush them.

> - could cache lines be loaded speculatively or with read-ahead while
>   a buffer is owned by a device?
No.


[PATCH v2 00/15] extend initramfs archive format to support xattrs

2018-01-24 Thread Taras Kondratiuk
Many of the Linux security/integrity features are dependent on file
metadata, stored as extended attributes (xattrs), for making decisions.
These features need to be initialized during initcall and enabled as
early as possible for complete security coverage. 

Initramfs (tmpfs) supports xattrs, but the newc CPIO archive format does
not support including them in the archive.

There are several ways to include xattrs for initramfs:

- Add TAR support. The complicated format and big headers look like too
  much overhead.

- Include a file manifest containing the xattrs in the CPIO. This should
  be easy for initramfs, because we can set xattrs at the end when all
  files are extracted, but extracting such an archive in userspace will
  be more complicated. For example, it may be necessary to set SELinux
  labels for directories before extracting files into them, so the
  manifest has to be extracted first and then searched during each file
  extraction.

- Extend the CPIO header to support xattrs. This seems to be the most
  straightforward way. It also allows other useful changes to the
  CPIO format at the same time, e.g. increasing the filesize field to
  support files >4GB.

This patch set extends the existing newc CPIO archive format to include
xattrs in the initramfs.

The series is based on v4.15-rc8. cpio_xattr branch is available here:
https://github.com/kontar/linux/commits/cpio_xattr

=== Patch summary ===

Documentation:
[PATCH 01/15] Documentation: add newcx initramfs format description

Refactoring to simplify adding the new format:
[PATCH 02/15] initramfs: replace states with function pointers
[PATCH 03/15] initramfs: store file name in name_buf
[PATCH 04/15] initramfs: remove unnecessary symlinks processing shortcut
[PATCH 05/15] initramfs: move files creation into separate state
[PATCH 06/15] initramfs: separate reading cpio method from header
[PATCH 07/15] initramfs: split header layout information from parsing
function

Parse newxc format:
[PATCH 08/15] initramfs: add newcx format
[PATCH 09/15] initramfs: set extended attributes

Generate newcx cpio archive:
[PATCH 10/15] gen_init_cpio: move header formatting into function
[PATCH 11/15] gen_init_cpio: add newcx format
[PATCH 12/15] gen_init_cpio: set extended attributes for newcx
[PATCH 13/15] gen_initramfs_list.sh: add -x option to enable newcx

SELinux patches used for testing. They will be sent to SELinux
maintainers separately.
[PATCH 14/15] selinux: allow setxattr on rootfs so initramfs code can
set them
[PATCH 15/15] selinux: delay sid population for rootfs till init is
complete

=== Testing ===

gen_initramfs_list.sh can be used to generate a newcx CPIO archive: if
CONFIG_INITRAMFS_NEWCX is enabled, CONFIG_INITRAMFS_SOURCE will be packed
into a newcx archive. This is enough for basic testing, but it is not
convenient for more complex setups.

Victor has prepared a test setup with a SELinux-labeled initramfs based
on Poky (Yocto) with the meta-selinux layer.

Repo manifest and build instructions:
https://github.com/victorkamensky/initramfs-xattrs-manifest

A reference cpio utility patch supporting the newcx format can be
found as part of the poky/meta-selinux testing environment at

https://raw.githubusercontent.com/victorkamensky/initramfs-xattrs-poky/rocko/meta/recipes-extended/cpio/cpio-2.12/cpio-xattrs.patch

=== History ===

The patch set is based on Mimi's series from Jan 2015:
https://www.mail-archive.com/initramfs@vger.kernel.org/msg03971.html
Latest discussion I was able to find is from Dec 2015:
https://www.mail-archive.com/initramfs@vger.kernel.org/msg04198.html

Format changes:
- increased size of filesize to 64 bits to support files >4GB.
- increased mtime field size to 64 bits to get usec precision
  and more than 32 bits of seconds. Rob mentioned nsec, but that
  will leave only 34 bits for seconds. If nsec precision is needed
  then we can add a separate 32 bit nsec field (it will match statx).
- The checksum field is replaced by an xattrs_size field.

Other fields are left unchanged. See patch format description in the
patch #1.

James suggested adding many other changes, but that would bloat the
header and make this new format very similar to tar. To keep the format
simple, no other fields were added.

v2 changes:
- added documentation
- made the format more consistent. In the previous version the sequence
  of fields in the newcx header was different for symlinks and regular
  files (for symlinks the data field was before the xattrs). This was
  caused by a flow shortcut during symlink entry parsing.
- removed unused checksum field in newcx header
- removed redundant xattrcount at the beginning of xattr section
  (xattrs_size is enough to determine the end of section).
- the size of an xattr entry in the xattr section includes both name and
  value. This makes the format more consistent and allows jumping over
  an entry without scanning for the end of the name string first.
- streamlined the state machine to address the previous issue and make
  it easier to add the new format
- made header parsing data-driven to remove mag

[PATCH v2 03/15] initramfs: store file name in name_buf

2018-01-24 Thread Taras Kondratiuk
There is already a name_buf buffer pre-allocated for the file name, so
there is no need to allocate vcollected for every file. Moreover, the
name may already be stored in name_buf by the read_into() function.

Add a memcpy_optional() function to handle that case.

Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 49cd2681a26f..b6ee675e5cdb 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -210,7 +210,6 @@ static inline void __init eat(unsigned n)
byte_count -= n;
 }
 
-static __initdata char *vcollected;
 static __initdata char *collected;
 static long remains __initdata;
 static __initdata char *collect;
@@ -324,6 +323,13 @@ static void __init clean_path(char *path, umode_t fmode)
}
 }
 
+static void *memcpy_optional(void *dest, const void *src, size_t n)
+{
+   if (dest != src)
+   return memcpy(dest, src, n);
+   return dest;
+}
+
 static __initdata int wfd;
 
 static int __init do_name(void)
@@ -348,7 +354,8 @@ static int __init do_name(void)
sys_fchmod(wfd, mode);
if (body_len)
sys_ftruncate(wfd, body_len);
-   vcollected = kstrdup(collected, GFP_KERNEL);
+   memcpy_optional(name_buf, collected,
+   N_ALIGN(name_len));
state = do_copy;
}
}
@@ -375,8 +382,7 @@ static int __init do_copy(void)
if (xwrite(wfd, victim, body_len) != body_len)
error("write error");
sys_close(wfd);
-   do_utime(vcollected, mtime);
-   kfree(vcollected);
+   do_utime(name_buf, mtime);
eat(body_len);
state = do_skip;
return 0;
-- 
2.10.3.dirty



Re: [PATCH 1/2] block: blk-tag: Replace GFP_ATOMIC with GFP_KERNEL in __blk_queue_init_tags

2018-01-24 Thread Jens Axboe
On 1/24/18 8:38 PM, Jia-Ju Bai wrote:
> After checking all possible call chains to kmalloc here,
> my tool finds that kmalloc is never called in atomic context.
> Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.
> 
> This is found by a static analysis tool named DCNS written by myself.

These two look OK to me, mostly. One issue is that blk_queue_init_tags()
can be used to resize tags, in which case it's called with the
queue lock held. Nobody is using it like that anymore, though. So I'd
prefer if you did a v2 of this patch and included a correction to that
comment.

-- 
Jens Axboe



[PATCH v2 06/15] initramfs: separate reading cpio method from header

2018-01-24 Thread Taras Kondratiuk
From: Mimi Zohar 

In preparation for adding xattr support, read the CPIO method
separately from the rest of the header.

Signed-off-by: Mimi Zohar 
Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 2d5920c094e0..b3d39c8793be 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -168,7 +168,7 @@ static void __init parse_header(char *s)
int i;
 
buf[8] = '\0';
-   for (i = 0, s += 6; i < 12; i++, s += 8) {
+   for (i = 0; i < 12; i++, s += 8) {
memcpy(buf, s, 8);
parsed[i] = simple_strtoul(buf, NULL, 16);
}
@@ -189,6 +189,7 @@ static void __init parse_header(char *s)
 
 static int __init do_start(void);
 static int __init do_collect(void);
+static int __init do_format(void);
 static int __init do_header(void);
 static int __init do_skip(void);
 static int __init do_name(void);
@@ -233,7 +234,7 @@ static __initdata char *header_buf, *symlink_buf, *name_buf;
 
 static int __init do_start(void)
 {
-   read_into(header_buf, 110, do_header);
+   read_into(header_buf, 6, do_format);
return 0;
 }
 
@@ -251,7 +252,7 @@ static int __init do_collect(void)
return 0;
 }
 
-static int __init do_header(void)
+static int __init do_format(void)
 {
if (memcmp(collected, "070707", 6)==0) {
error("incorrect cpio method used: use -H newc option");
@@ -261,6 +262,12 @@ static int __init do_header(void)
error("no cpio magic");
return 1;
}
+   read_into(header_buf, 104, do_header);
+   return 0;
+}
+
+static int __init do_header(void)
+{
parse_header(collected);
next_header = this_header + N_ALIGN(name_len) + body_len;
next_header = (next_header + 3) & ~3;
@@ -457,7 +464,7 @@ static char * __init unpack_to_rootfs(char *buf, unsigned 
long len)
const char *compress_name;
static __initdata char msg_buf[64];
 
-   header_buf = kmalloc(110, GFP_KERNEL);
+   header_buf = kmalloc(104, GFP_KERNEL);
symlink_buf = kmalloc(PATH_MAX + 1, GFP_KERNEL);
name_buf = kmalloc(N_ALIGN(PATH_MAX), GFP_KERNEL);
 
-- 
2.10.3.dirty



[PATCH v2 02/15] initramfs: replace states with function pointers

2018-01-24 Thread Taras Kondratiuk
Currently the FSM states are enum values mapped to function pointers via
the actions[] array. This extra level of indirection is not needed and
makes navigating the code harder: one can't jump between states directly
when browsing the code (e.g. with cscope), but has to go through the
actions[] array each time.

Replace the states with their action function pointers. No behaviour change.

Signed-off-by: Taras Kondratiuk 
---
 init/initramfs.c | 73 +---
 1 file changed, 32 insertions(+), 41 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 7e99a0038942..49cd2681a26f 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -187,16 +187,17 @@ static void __init parse_header(char *s)
 
 /* FSM */
 
-static __initdata enum state {
-   Start,
-   Collect,
-   GotHeader,
-   SkipIt,
-   GotName,
-   CopyFile,
-   GotSymlink,
-   Reset
-} state, next_state;
+static int __init do_start(void);
+static int __init do_collect(void);
+static int __init do_header(void);
+static int __init do_skip(void);
+static int __init do_name(void);
+static int __init do_copy(void);
+static int __init do_symlink(void);
+static int __init do_reset(void);
+
+typedef int (*fsm_state_t)(void);
+static __initdata fsm_state_t state, next_state;
 
 static __initdata char *victim;
 static unsigned long byte_count __initdata;
@@ -214,7 +215,7 @@ static __initdata char *collected;
 static long remains __initdata;
 static __initdata char *collect;
 
-static void __init read_into(char *buf, unsigned size, enum state next)
+static void __init read_into(char *buf, unsigned size, fsm_state_t next)
 {
if (byte_count >= size) {
collected = victim;
@@ -224,7 +225,7 @@ static void __init read_into(char *buf, unsigned size, enum 
state next)
collect = collected = buf;
remains = size;
next_state = next;
-   state = Collect;
+   state = do_collect;
}
 }
 
@@ -232,7 +233,7 @@ static __initdata char *header_buf, *symlink_buf, *name_buf;
 
 static int __init do_start(void)
 {
-   read_into(header_buf, 110, GotHeader);
+   read_into(header_buf, 110, do_header);
return 0;
 }
 
@@ -263,7 +264,7 @@ static int __init do_header(void)
parse_header(collected);
next_header = this_header + N_ALIGN(name_len) + body_len;
next_header = (next_header + 3) & ~3;
-   state = SkipIt;
+   state = do_skip;
if (name_len <= 0 || name_len > PATH_MAX)
return 0;
if (S_ISLNK(mode)) {
@@ -271,12 +272,12 @@ static int __init do_header(void)
return 0;
collect = collected = symlink_buf;
remains = N_ALIGN(name_len) + body_len;
-   next_state = GotSymlink;
-   state = Collect;
+   next_state = do_symlink;
+   state = do_collect;
return 0;
}
if (S_ISREG(mode) || !body_len)
-   read_into(name_buf, N_ALIGN(name_len), GotName);
+   read_into(name_buf, N_ALIGN(name_len), do_name);
return 0;
 }
 
@@ -327,8 +328,8 @@ static __initdata int wfd;
 
 static int __init do_name(void)
 {
-   state = SkipIt;
-   next_state = Reset;
+   state = do_skip;
+   next_state = do_reset;
if (strcmp(collected, "TRAILER!!!") == 0) {
free_hash();
return 0;
@@ -348,7 +349,7 @@ static int __init do_name(void)
if (body_len)
sys_ftruncate(wfd, body_len);
vcollected = kstrdup(collected, GFP_KERNEL);
-   state = CopyFile;
+   state = do_copy;
}
}
} else if (S_ISDIR(mode)) {
@@ -377,7 +378,7 @@ static int __init do_copy(void)
do_utime(vcollected, mtime);
kfree(vcollected);
eat(body_len);
-   state = SkipIt;
+   state = do_skip;
return 0;
} else {
if (xwrite(wfd, victim, byte_count) != byte_count)
@@ -395,29 +396,19 @@ static int __init do_symlink(void)
sys_symlink(collected + N_ALIGN(name_len), collected);
sys_lchown(collected, uid, gid);
do_utime(collected, mtime);
-   state = SkipIt;
-   next_state = Reset;
+   state = do_skip;
+   next_state = do_reset;
return 0;
 }
 
-static __initdata int (*actions[])(void) = {
-   [Start] = do_start,
-   [Collect]   = do_collect,
-   [GotHeader] = do_header,
-   [SkipIt]= do_skip,
-   [GotName]   = do_name,
-   [CopyFile]  = do_copy,
-   [GotSymlink]= do_symlink,
-   [Reset] = do_reset,
-};
-
 static long __init write_buffer(char *buf, unsigned long len)
 {
byte_count = len;
  

[PATCH v2 15/15] selinux: delay sid population for rootfs till init is complete

2018-01-24 Thread Taras Kondratiuk
From: Victor Kamensky 

With an initramfs cpio format that supports extended attributes,
we need to skip sid population on the sys_lsetxattr call made by the
initramfs code for rootfs if the security server is not initialized yet.

Otherwise the callback in selinux_inode_post_setxattr will try to
translate the given security.selinux label into a sid context, and since
the security server is not available yet, the inode will receive the
default sid (typically kernel_t). Note that at the same time the proper
label will be stored in the inode xattrs. Later, since the inode sid
is already populated, the system will never look back at the actual
xattrs. But if we skip sid population for rootfs and have a policy that
directs SELinux to use xattrs for rootfs, the proper sid will be filled
in from the extended attributes once the inode is accessed and the
server is initialized.

Note a new DELAYAFTERINIT_MNT super block flag is introduced to mark
only rootfs for this behavior. For other types of tmpfs the original
logic is still used.

Signed-off-by: Victor Kamensky 
---
 security/selinux/hooks.c| 9 -
 security/selinux/include/security.h | 1 +
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index f3fe65589f02..bb25268f734e 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -716,7 +716,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
 */
if (!strncmp(sb->s_type->name, "rootfs",
 sizeof("rootfs")))
-   sbsec->flags |= SBLABEL_MNT;
+   sbsec->flags |= SBLABEL_MNT|DELAYAFTERINIT_MNT;
 
/* Defer initialization until selinux_complete_init,
   after the initial policy is loaded and the security
@@ -3253,6 +3253,7 @@ static void selinux_inode_post_setxattr(struct dentry 
*dentry, const char *name,
 {
struct inode *inode = d_backing_inode(dentry);
struct inode_security_struct *isec;
+   struct superblock_security_struct *sbsec;
u32 newsid;
int rc;
 
@@ -3261,6 +3262,12 @@ static void selinux_inode_post_setxattr(struct dentry 
*dentry, const char *name,
return;
}
 
+   if (!ss_initialized) {
+   sbsec = inode->i_sb->s_security;
+   if (sbsec->flags & DELAYAFTERINIT_MNT)
+   return;
+   }
+
rc = security_context_to_sid_force(value, size, &newsid);
if (rc) {
printk(KERN_ERR "SELinux:  unable to map context to SID"
diff --git a/security/selinux/include/security.h 
b/security/selinux/include/security.h
index 02f0412d42f2..585acfd6cbcf 100644
--- a/security/selinux/include/security.h
+++ b/security/selinux/include/security.h
@@ -52,6 +52,7 @@
 #define ROOTCONTEXT_MNT0x04
 #define DEFCONTEXT_MNT 0x08
 #define SBLABEL_MNT0x10
+#define DELAYAFTERINIT_MNT 0x20
 /* Non-mount related flags */
 #define SE_SBINITIALIZED   0x0100
 #define SE_SBPROC  0x0200
-- 
2.10.3.dirty
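The decision the hook makes can be modeled as a small predicate (a sketch of the patch's logic with simplified types; `skip_sid_population` is a name invented here, not a kernel function):

```c
#include <stdbool.h>
#include <assert.h>

/* Mount-related flags, as in security/selinux/include/security.h */
#define SBLABEL_MNT        0x10
#define DELAYAFTERINIT_MNT 0x20

/* Sketch of the check this patch adds to selinux_inode_post_setxattr():
 * SID population is skipped only when the security server has not been
 * initialized yet AND the superblock (rootfs) was marked for delayed
 * labeling. Everything else keeps the original behavior. */
static bool skip_sid_population(bool ss_initialized, unsigned int sb_flags)
{
    return !ss_initialized && (sb_flags & DELAYAFTERINIT_MNT);
}
```

Because only rootfs gets DELAYAFTERINIT_MNT, other tmpfs mounts still take the original path even before the policy is loaded.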



linux-next: manual merge of the vfs tree with the overlayfs tree

2018-01-24 Thread Stephen Rothwell
Hi Al,

Today's linux-next merge of the vfs tree got a conflict in:

  fs/dcache.c

between commit:

  f9c34674bc60 ("vfs: factor out helpers d_instantiate_anon() and 
d_alloc_anon()")

from the overlayfs tree and commit:

  f1ee616214cb ("VFS: don't keep disconnected dentries on d_anon")

from the vfs tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc fs/dcache.c
index 99bce0ed0213,17e6b84b9656..
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@@ -1954,16 -1961,18 +1953,18 @@@ static struct dentry *__d_instantiate_a
if (disconnected)
add_flags |= DCACHE_DISCONNECTED;
  
 -  spin_lock(&tmp->d_lock);
 -  __d_set_inode_and_type(tmp, inode, add_flags);
 -  hlist_add_head(&tmp->d_u.d_alias, &inode->i_dentry);
 +  spin_lock(&dentry->d_lock);
 +  __d_set_inode_and_type(dentry, inode, add_flags);
 +  hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
-   hlist_bl_lock(&dentry->d_sb->s_anon);
-   hlist_bl_add_head(&dentry->d_hash, &dentry->d_sb->s_anon);
-   hlist_bl_unlock(&dentry->d_sb->s_anon);
+   if (!disconnected) {
 -  hlist_bl_lock(&tmp->d_sb->s_roots);
 -  hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_roots);
 -  hlist_bl_unlock(&tmp->d_sb->s_roots);
++  hlist_bl_lock(&dentry->d_sb->s_roots);
++  hlist_bl_add_head(&dentry->d_hash, &dentry->d_sb->s_roots);
++  hlist_bl_unlock(&dentry->d_sb->s_roots);
+   }
 -  spin_unlock(&tmp->d_lock);
 +  spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
  
 -  return tmp;
 +  return dentry;
  
   out_iput:
iput(inode);


Re: [PATCH AUTOSEL for 4.14 001/100] drm/vc4: Account for interrupts in flight

2018-01-24 Thread Eric Anholt
Sasha Levin  writes:

> From: Stefan Schake 
>
> [ Upstream commit 253696ccd613fbdaa5aba1de44c461a058e0a114 ]
>
> Synchronously disable the IRQ to make the following cancel_work_sync
> invocation effective.
>
> An interrupt in flight could enqueue further overflow mem work. As we
> free the binner BO immediately following vc4_irq_uninstall this caused
> a NULL pointer dereference in the work callback vc4_overflow_mem_work.
>
> Link: https://github.com/anholt/linux/issues/114
> Signed-off-by: Stefan Schake 
> Fixes: d5b1a78a772f ("drm/vc4: Add support for drawing 3D frames.")
> Signed-off-by: Eric Anholt 
> Reviewed-by: Eric Anholt 
> Link: 
> https://patchwork.freedesktop.org/patch/msgid/1510275907-993-2-git-send-email-stsch...@gmail.com
> Signed-off-by: Sasha Levin 

We found a bug, and this patch also needs:

commit ce9caf2f79a5aa170a4b6456a03db639eed9c988 (tag: drm-misc-fixes-2018-01-08)
Author: Stefan Schake 
Date:   Fri Dec 29 17:05:43 2017 +0100

drm/vc4: Move IRQ enable to PM path

We were calling enable_irq on bind, where it was already enabled previously
by the IRQ helper. Additionally, dev->irq is not set correctly until after
postinstall and so was always zero here, triggering a warning in 4.15.
Fix both by moving the enable to the power management resume path, where we
know there was a previous disable invocation during suspend.

Fixes: 253696ccd613 ("drm/vc4: Account for interrupts in flight")
Signed-off-by: Stefan Schake 
Signed-off-by: Eric Anholt 
Link: 
https://patchwork.freedesktop.org/patch/msgid/1514563543-32511-1-git-send-email-stsch...@gmail.com
Tested-by: Stefan Wahren 
Reviewed-by: Eric Anholt 
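The ordering both commits are about can be shown with a single-threaded model (names here are stand-ins, not the vc4 API): an interrupt in flight during teardown can re-queue overflow-mem work after `cancel_work_sync()`, so the work callback later runs against an already-freed binner BO; synchronously disabling the IRQ first makes the cancellation effective.

```c
#include <stdbool.h>
#include <assert.h>

/* Toy model of the teardown race (not the vc4 driver itself). */
static bool irq_enabled = true;
static bool work_pending;

static void fake_overflow_irq(void)        /* interrupt "in flight" */
{
    if (irq_enabled)
        work_pending = true;               /* schedule_work() analogue */
}

static void fake_vc4_irq_uninstall(void)
{
    irq_enabled = false;                   /* disable_irq(): no new work */
    work_pending = false;                  /* cancel_work_sync() analogue */
    /* ...only now is it safe to free the binner BO... */
}
```

With the disable done first, a late interrupt can no longer re-arm the work item after it has been cancelled.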


signature.asc
Description: PGP signature


Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel

2018-01-24 Thread Mike Galbraith
On Wed, 2018-01-24 at 15:02 -0500, Lyude Paul wrote:
> Almost forgot to mention: I came across this patch because reverting it
> locally on the mainline kernel makes request_irq() behave normally (it doesn't
> attempt to allocate the same vector twice anymore) and nouveau starts doing
> suspend/resume correctly again

Ah, someone already hunted down my resume woes.  Yup, reverting
$subject fixed up my sole reason to use nouveau (to be able to _resume_
as well as suspend:).  If anyone needs a lab rat, just holler.

-Mike


Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-24 Thread jianchao.wang
Hi Tariq

On 01/22/2018 10:12 AM, jianchao.wang wrote:
>>> On 19/01/2018 5:49 PM, Eric Dumazet wrote:
 On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
> Hi Tariq
>
> Very sad that the crash was reproduced again after applied the patch.
>> Memory barriers vary for different Archs, can you please share more details 
>> regarding arch and repro steps?

> The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
> 12/27/2015
> The xen is installed. The crash occurred in DOM0.
> Regarding to the repro steps, it is a customer's test which does heavy disk 
> I/O over NFS storage without any guest.
> 

What is the final suggestion on this?
If we use wmb() there, is performance pulled down?

Thanks in advance
Jianchao
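For context, the ordering in question can be sketched in userspace (this is an analogue, not the mlx4 driver): the RX descriptor write must be visible before the producer doorbell update. The kernel would use wmb()/dma_wmb() there; C11's release fence models the same store-before-store ordering.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

static uint32_t rx_desc[4];                 /* stand-in descriptor ring  */
static _Atomic uint32_t prod_db;            /* stand-in producer doorbell */

static void post_rx_descriptor(uint32_t val, uint32_t idx)
{
    rx_desc[idx & 3] = val;                         /* fill descriptor */
    atomic_thread_fence(memory_order_release);      /* wmb() analogue  */
    atomic_store_explicit(&prod_db, idx + 1,
                          memory_order_relaxed);    /* ring doorbell   */
}
```

The fence only constrains ordering, it adds no work per byte, so the cost is the barrier instruction itself; whether that is measurable on the hot path is exactly the open question in this thread.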


Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Jia-Ju Bai



On 2018/1/25 10:58, Al Viro wrote:

On Thu, Jan 25, 2018 at 10:46:26AM +0800, Jia-Ju Bai wrote:

The function ioc_create_icq here is not called in atomic context.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This is found by a static analysis tool named DCNS written by myself.

Umm...  Some human-readable analysis would be welcome.  FWIW, I've tried to
put a proof together, but...
struct blk_mq_ops->timeout = nvme_timeout
nvme_timeout()
nvme_alloc_request()
blk_mq_alloc_request_hctx()
blk_mq_get_request()
blk_mq_sched_assign_ioc()
... and while I have not traced the call chain further, the look of that
function (nvme_timeout()) strongly suggests that it *is* meant to be
called from bloody atomic context.

"My tool has found that place/put together a proof" is nice, but it
doesn't replace the proof itself...


Thanks for reply :)

I have checked the given call chain, and found that nvme_dev_disable in 
nvme_timeout calls mutex_lock, which can sleep.

Thus, I suppose this call chain is not in atomic context.

Besides, how did you find that "function (nvme_timeout()) strongly 
suggests that it *is* meant to be called from bloody atomic context"?

I checked the comments in nvme_timeout, and did not find a related description...

By the way, do you mean that I should add "My tool has proved that this 
function is never called in atomic context" to the description?



Thanks,
Jia-Ju Bai
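The rule under discussion can be modeled in userspace (a toy sketch, not kernel code; `toy_kmalloc` and friends are names invented here): GFP_KERNEL allocations may sleep, so they are only legal in process context, while atomic context (spinlock held, IRQ handler) must use GFP_ATOMIC.

```c
#include <stdlib.h>
#include <assert.h>

enum gfp { GFP_KERNEL, GFP_ATOMIC };

static int preempt_count;                 /* nonzero => atomic context */

static void toy_spin_lock(void)   { preempt_count++; }
static void toy_spin_unlock(void) { preempt_count--; }

/* Returns NULL where the kernel would emit a "BUG: sleeping function
 * called from invalid context" splat via might_sleep(). */
static void *toy_kmalloc(size_t n, enum gfp flags)
{
    if (flags == GFP_KERNEL && preempt_count > 0)
        return NULL;                      /* would sleep under a lock */
    return malloc(n);
}
```

This is why proving "never called in atomic context" for every caller, including timeout callbacks, is the whole argument such a conversion patch has to make.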


[PATCH v2 2/4] arm64: Use the new MULTI_IRQ_HANDLER

2018-01-24 Thread Palmer Dabbelt
It appears arm64 copied arm's MULTI_IRQ_HANDLER code, but made it
unconditional.  I wanted to make this generic so it could be used by the
RISC-V port.  This patch converts the arm64 code to use the new generic
code, which simply consists of deleting the arm64 code and selecting
MULTI_IRQ_HANDLER instead.

Signed-off-by: Palmer Dabbelt 
---
 arch/arm64/Kconfig   |  1 +
 arch/arm64/include/asm/irq.h |  2 --
 arch/arm64/kernel/irq.c  | 10 --
 3 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e1414f..effb04a80520 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -129,6 +129,7 @@ config ARM64
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
+   select MULTI_IRQ_HANDLER
select NO_BOOTMEM
select OF
select OF_EARLY_FLATTREE
diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index a0fee6985e6a..b2b0c6405eb0 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -8,8 +8,6 @@
 
 struct pt_regs;
 
-extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
-
 static inline int nr_legacy_irqs(void)
 {
return 0;
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index 713561e5bcab..67b781270402 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -41,16 +41,6 @@ int arch_show_interrupts(struct seq_file *p, int prec)
return 0;
 }
 
-void (*handle_arch_irq)(struct pt_regs *) = NULL;
-
-void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
-{
-   if (handle_arch_irq)
-   return;
-
-   handle_arch_irq = handle_irq;
-}
-
 #ifdef CONFIG_VMAP_STACK
 static void init_irq_stacks(void)
 {
-- 
2.13.6



[PATCH v2 4/4] RISC-V: Move to the new generic IRQ handler

2018-01-24 Thread Palmer Dabbelt
The old mechanism for handling IRQs on RISC-V was pretty ugly: the arch
code looked at the Kconfig entry for our first-level irqchip driver and
called into it directly.

This patch uses the new generic IRQ handling infrastructure, which
essentially just deletes a bunch of code.  This does add an additional
load to the interrupt latency, but there's a lot of tuning left to be
done there on RISC-V, so I think it's OK for now.

Signed-off-by: Palmer Dabbelt 
---
 arch/riscv/Kconfig|  1 +
 arch/riscv/include/asm/Kbuild |  1 +
 arch/riscv/kernel/entry.S |  5 +++--
 arch/riscv/kernel/irq.c   | 13 -
 4 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 2c6adf12713a..e67f42178059 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -35,6 +35,7 @@ config RISCV
select THREAD_INFO_IN_TASK
select RISCV_IRQ_INTC
select RISCV_TIMER
+   select MULTI_IRQ_HANDLER
 
 config MMU
def_bool y
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index 970460a0b492..e0d0fbe43ca2 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -16,6 +16,7 @@ generic-y += ftrace.h
 generic-y += futex.h
 generic-y += hardirq.h
 generic-y += hash.h
+generic-y += handle_irq.h
 generic-y += hw_irq.h
 generic-y += ioctl.h
 generic-y += ioctls.h
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 7404ec222406..a79869151aea 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -166,8 +166,9 @@ ENTRY(handle_exception)
/* Handle interrupts */
slli a0, s4, 1
srli a0, a0, 1
-   move a1, sp /* pt_regs */
-   tail do_IRQ
+   move a0, sp /* pt_regs */
+   REG_L a1, handle_arch_irq
+   jr a1
 1:
/* Handle syscalls */
li t0, EXC_SYSCALL
diff --git a/arch/riscv/kernel/irq.c b/arch/riscv/kernel/irq.c
index 328718e8026e..b74cbfbce2d0 100644
--- a/arch/riscv/kernel/irq.c
+++ b/arch/riscv/kernel/irq.c
@@ -24,16 +24,3 @@ void __init init_IRQ(void)
 {
irqchip_init();
 }
-
-asmlinkage void __irq_entry do_IRQ(unsigned int cause, struct pt_regs *regs)
-{
-#ifdef CONFIG_RISCV_INTC
-   /*
-* FIXME: We don't want a direct call to riscv_intc_irq here.  The plan
-* is to put an IRQ domain here and let the interrupt controller
-* register with that, but I poked around the arm64 code a bit and
-* there might be a better way to do it (ie, something fully generic).
-*/
-   riscv_intc_irq(cause, regs);
-#endif
-}
-- 
2.13.6



Re: [patches] Re: [PATCH 1/2] asm-generic: Add a generic set_handle_irq

2018-01-24 Thread Palmer Dabbelt

On Thu, 18 Jan 2018 07:45:13 PST (-0800), Christoph Hellwig wrote:

I think this should not be asm-generic and lib, but kernel/irq/handle.c
and include/linux/irq.h, under the CONFIG_MULTI_IRQ_HANDLER symbol
already used by arm.

Also for completeness of the series please convert arm, arm64 and
openrisc over to the generic version.  Last but not least I think
handle_arch_irq needs to be marked as __ro_after_init as a security
hardening measure.


Makes sense.  I sent out a new patch set.


Make set_handle_irq and handle_arch_irq generic

2018-01-24 Thread Palmer Dabbelt
This is the second version of a patch set titled "Use arm64's scheme for
registering first-level IRQ handlers on RISC-V".  That patch set's cover letter
is still the best way to describe what's going on, so I'm just copying it here:

This patch set has been sitting around for a while, but it got a bit lost
in the shuffle.  In RISC-V land we currently couple do_IRQ (the C entry
point for interrupt handling) to our first-level interrupt controller.
While this isn't completely crazy (as the first-level interrupt controller
is specified by the ISA), it is a bit awkward.

This patch set decouples our trap handler from our first-level IRQ chip
driver by copying what a handful of other architectures are doing.  This
does add an additional load to the interrupt handling path, but there's a
handful of performance problems in there that I've been meaning to look at
so I don't mind adding another one for now.  The advantage is that our
irqchip driver is decoupled from our arch port, at least at compile time.

I've build tested this on arm, arm64, and openrisc but haven't run it on any
of those systems.  The goal was to make no functional changes, but the
__ro_after_init addition does actually change behavior.

Changes since v1:

* I based this on arm instead of arm64, which means we guard the selection of
  these routines with CONFIG_MULTI_IRQ_HANDLER.
* The changes are in kernel/irq/handle.c and include/linux/irq.h instead of
  lib.
* I've converted the arm, arm64, and openrisc ports to use the generic versions
  of these routines.



[PATCH v2 3/4] openrisc: Use the new MULTI_IRQ_HANDLER

2018-01-24 Thread Palmer Dabbelt
It appears that openrisc copied arm64's MULTI_IRQ_HANDLER code (which
came from arm).  I wanted to make this generic so I could use it in the
RISC-V port.  This patch converts the openrisc code to use the generic
version.

Signed-off-by: Palmer Dabbelt 
---
 arch/openrisc/Kconfig  | 1 +
 arch/openrisc/kernel/irq.c | 7 ---
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 339df7324e9c..77a9d45ac11e 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -35,6 +35,7 @@ config OPENRISC
select ARCH_USE_QUEUED_RWLOCKS
select OMPIC if SMP
select ARCH_WANT_FRAME_POINTERS
+   select MULTI_IRQ_HANDLER
 
 config CPU_BIG_ENDIAN
def_bool y
diff --git a/arch/openrisc/kernel/irq.c b/arch/openrisc/kernel/irq.c
index 35e478a93116..5f9445effaf8 100644
--- a/arch/openrisc/kernel/irq.c
+++ b/arch/openrisc/kernel/irq.c
@@ -41,13 +41,6 @@ void __init init_IRQ(void)
irqchip_init();
 }
 
-static void (*handle_arch_irq)(struct pt_regs *);
-
-void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
-{
-   handle_arch_irq = handle_irq;
-}
-
 void __irq_entry do_IRQ(struct pt_regs *regs)
 {
handle_arch_irq(regs);
-- 
2.13.6



[PATCH v2 1/4] arm: Make set_handle_irq and handle_arch_irq generic

2018-01-24 Thread Palmer Dabbelt
It looks like this same irqchip registration mechanism has been copied
into a handful of ports, including aarch64 and openrisc.  I want to use
this in the RISC-V port, so I thought it would be good to make this
generic instead.

This patch simply moves set_handle_irq and handle_arch_irq from arch/arm
to kernel/irq/handle.c.

Signed-off-by: Palmer Dabbelt 
---
 arch/arm/Kconfig |  5 -
 arch/arm/include/asm/irq.h   |  5 -
 arch/arm/kernel/entry-armv.S |  6 --
 arch/arm/kernel/irq.c| 10 --
 include/linux/irq.h  | 18 ++
 kernel/irq/Kconfig   |  5 +
 kernel/irq/handle.c  | 10 ++
 7 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 51c8df561077..e51f907668f6 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -917,11 +917,6 @@ config IWMMXT
  Enable support for iWMMXt context switching at run time if
  running on a CPU that supports it.
 
-config MULTI_IRQ_HANDLER
-   bool
-   help
- Allow each machine to specify it's own IRQ handler at run time.
-
 if !MMU
 source "arch/arm/Kconfig-nommu"
 endif
diff --git a/arch/arm/include/asm/irq.h b/arch/arm/include/asm/irq.h
index b6f319606e30..c883fcbe93b6 100644
--- a/arch/arm/include/asm/irq.h
+++ b/arch/arm/include/asm/irq.h
@@ -31,11 +31,6 @@ extern void asm_do_IRQ(unsigned int, struct pt_regs *);
 void handle_IRQ(unsigned int, struct pt_regs *);
 void init_IRQ(void);
 
-#ifdef CONFIG_MULTI_IRQ_HANDLER
-extern void (*handle_arch_irq)(struct pt_regs *);
-extern void set_handle_irq(void (*handle_irq)(struct pt_regs *));
-#endif
-
 #ifdef CONFIG_SMP
 extern void arch_trigger_cpumask_backtrace(const cpumask_t *mask,
   bool exclude_self);
diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S
index fbc707626b3e..2d31cec2f4cb 100644
--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -1230,9 +1230,3 @@ vector_addrexcptn:
.globl  cr_alignment
 cr_alignment:
.space  4
-
-#ifdef CONFIG_MULTI_IRQ_HANDLER
-   .globl  handle_arch_irq
-handle_arch_irq:
-   .space  4
-#endif
diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
index ece04a457486..9908dacf9229 100644
--- a/arch/arm/kernel/irq.c
+++ b/arch/arm/kernel/irq.c
@@ -102,16 +102,6 @@ void __init init_IRQ(void)
uniphier_cache_init();
 }
 
-#ifdef CONFIG_MULTI_IRQ_HANDLER
-void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
-{
-   if (handle_arch_irq)
-   return;
-
-   handle_arch_irq = handle_irq;
-}
-#endif
-
 #ifdef CONFIG_SPARSE_IRQ
 int __init arch_probe_nr_irqs(void)
 {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index a0231e96a578..4e7ca0216fb9 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1170,4 +1170,22 @@ int __ipi_send_mask(struct irq_desc *desc, const struct 
cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+#ifdef CONFIG_MULTI_IRQ_HANDLER
+/*
+ * Registers a generic IRQ handling function as the top-level IRQ handler in
+ * the system, which is generally the first C code called from an assembly
+ * architecture-specific interrupt handler.
+ *
+ * The handler can be registered only once; subsequent calls are
+ * silently ignored (the function returns void).
+ */
+void __init set_handle_irq(void (*handle_irq)(struct pt_regs *));
+
+/*
+ * Allows interrupt handlers to find the irqchip that's been registered as the
+ * top-level IRQ handler.
+ */
+extern void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
+#endif
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 89e355866450..1b84dccd1a03 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -142,3 +142,8 @@ config GENERIC_IRQ_DEBUGFS
  If you don't know what to do here, say N.
 
 endmenu
+
+config MULTI_IRQ_HANDLER
+   bool
+   help
+ Allow each machine to specify it's own IRQ handler at run time.
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 79f987b942b8..eb55650a2abc 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -20,6 +20,8 @@
 
 #include "internals.h"
 
+void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
+
 /**
  * handle_bad_irq - handle spurious and unhandled irqs
  * @desc:  description of the interrupt
@@ -207,3 +209,11 @@ irqreturn_t handle_irq_event(struct irq_desc *desc)
irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
return ret;
 }
+
+void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
+{
+   if (handle_arch_irq)
+   return;
+
+   handle_arch_irq = handle_irq;
+}
-- 
2.13.6
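The registration scheme this patch moves into kernel/irq/handle.c can be modeled in userspace as a once-only function-pointer hook (a sketch under simplified types; unlike the kernel's void set_handle_irq(), the refusal is surfaced as -1 here for testability):

```c
#include <stddef.h>
#include <assert.h>

struct pt_regs;                              /* opaque, as in the kernel */

static void (*handle_arch_irq)(struct pt_regs *);

static int set_handle_irq(void (*handle_irq)(struct pt_regs *))
{
    if (handle_arch_irq)
        return -1;                           /* already set: refuse */
    handle_arch_irq = handle_irq;
    return 0;
}

/* What an arch's low-level entry reduces to after the conversion:
 * one indirect call through the registered top-level handler. */
static int irq_count;
static void demo_handler(struct pt_regs *regs) { (void)regs; irq_count++; }
static void do_IRQ(struct pt_regs *regs) { handle_arch_irq(regs); }
```

The irqchip driver calls set_handle_irq() at probe time, and every subsequent interrupt dispatches through the single handle_arch_irq pointer, which is why marking it __ro_after_init is a cheap hardening win.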


