Re: [PATCH] mfd: syscon: set regmap name to DT node name
On Tue, Jan 23, 2018 at 11:57 PM, David Lechner wrote:
> This sets the regmap name to the device tree node name. This is useful
> for debugging.
>
> Signed-off-by: David Lechner

Looks reasonable,

Acked-by: Arnd Bergmann
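[For context, a minimal sketch of the idea, not the literal patch: name the
syscon regmap after the DT node so its debugfs entries are identifiable.
regmap_config.name and of_node_full_name() are existing APIs; the helper
name below is made up for illustration.]

/* Hedged sketch: give the syscon regmap a DT-derived name for debugfs. */
#include <linux/of.h>
#include <linux/regmap.h>

static struct regmap_config syscon_config = {
	.reg_bits	= 32,
	.val_bits	= 32,
	.reg_stride	= 4,
};

static struct regmap *syscon_regmap_create(struct device_node *np,
					   void __iomem *base)
{
	/* e.g. "/soc/cfgchip@1417000" instead of an anonymous regmap */
	syscon_config.name = of_node_full_name(np);

	return regmap_init_mmio(NULL, base, &syscon_config);
}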
Re: [PATCH v2] bsg: use pr_debug instead of hand crafted macros
On Tue, Jan 23, 2018 at 06:53:18PM -0800, Joe Perches wrote:
> On Tue, 2018-01-23 at 23:12 +0000, Bart Van Assche wrote:
> > On Tue, 2018-01-23 at 04:45 -0800, Joe Perches wrote:
> > > Perhaps the email subject could be improved to describe
> > > the new macro and as well, this macro, without a pr_fmt
> > > define somewhat above it loses the __func__ output too.
> >
> > Hmm ... I thought that the pr_debug() output can be configured to include
> > the function name (__func__)? From
>
> Exactly right.
>
> > https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html:
>
> which is what I wrote when I said use +f.
>
> It's just nice to mention these changes in the
> commit message.

Sorry Joe, but this slowly approaches the bikeshedding area. Why should I
duplicate the dynamic debug howto? This was literally a drive-by patch I did
while riding the subway to the office, which was intended to ease debugging
problems in the field for our (and other) support engineers (and they are
usually well aware of the +f switch).

Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
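[For readers following the +f discussion, a small sketch of the two options;
this is illustrative, not from the patch, and the function name is made up.]

/*
 * Option 1: bake the function name in at compile time via pr_fmt
 * (defined before any includes, as usual).
 */
#define pr_fmt(fmt) "%s: " fmt, __func__

#include <linux/printk.h>

static void bsg_example(int tag)
{
	/* prints "bsg_example: queued command 42" when enabled */
	pr_debug("queued command %d\n", tag);

	/*
	 * Option 2: leave pr_fmt alone and enable it at runtime with
	 * dynamic debug ('p' enables the site, 'f' prepends __func__):
	 *
	 *   echo 'module bsg +pf' > /sys/kernel/debug/dynamic_debug/control
	 */
}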
Re: [PATCH v6 00/41] ARM: davinci: convert to common clock framework
2018-01-23 21:23 GMT+01:00 David Lechner :
> On 01/23/2018 02:05 PM, David Lechner wrote:
>> On 01/23/2018 02:01 PM, David Lechner wrote:
>>> On 01/23/2018 01:53 PM, Bartosz Golaszewski wrote:
>>>> In the mdio case - the problem is that devm_clk_get() doesn't fail,
>>>> but somehow the clock doesn't end up in the list of the device's
>>>> clocks - which is why it's not enabled by pm_runtime_get_sync().
>>>
>>> Right. This is because devm_clk_get() now finds the clock via device
>>> tree instead of a clkdev lookup entry. However, I think that the PM
>>> notifier registered in arch/arm/mach-davinci/pm_domain.c only uses
>>> the clkdev lookup to match the con_id and does not use device tree.
>>> The same thing is happening in mdio, emac and lcdc.
>>
>> Minor correction: It looks like emac doesn't do this because it doesn't
>> have a con_id of "fck". But, the same clock is shared by emac and mdio,
>> so since mdio enables the clock, emac doesn't notice or care that it did
>> not enable the clock itself.
>
> How about using pm_clk_add_clk() in these drivers to explicitly use the
> clocks for power management instead of relying on pm_clk_add_notifier()
> to do this implicitly?

Yes, this sounds good.

Bartosz
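[For reference, a hedged sketch of the suggested pm_clk_add_clk() approach
in one of the affected drivers; the function name is illustrative and error
unwinding is trimmed. pm_clk_create()/pm_clk_add_clk() are the existing
<linux/pm_clock.h> APIs.]

#include <linux/clk.h>
#include <linux/platform_device.h>
#include <linux/pm_clock.h>

static int davinci_mdio_probe_clk(struct platform_device *pdev)
{
	struct clk *fck;
	int ret;

	fck = devm_clk_get(&pdev->dev, "fck");	/* found via DT now */
	if (IS_ERR(fck))
		return PTR_ERR(fck);

	ret = pm_clk_create(&pdev->dev);
	if (ret)
		return ret;

	/* from here on, pm_runtime_get_sync() enables/disables fck */
	return pm_clk_add_clk(&pdev->dev, fck);
}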
Re: unixbench context switch performance & cpu topology
2018-01-23 21:49 GMT+08:00 Mike Galbraith :
> On Tue, 2018-01-23 at 18:36 +0800, Wanpeng Li wrote:
>> Thanks for having a try, Mike. :) Actually the two context1 tasks
>> don't stack up on one logical cpu most of the time, which is
>> observed by kernelshark. Do you have any idea why there are 4.5 times
>> more RESCHED IPIs, as mentioned in another reply to this thread?
>
> See resched_curr().

Yeah, after more digging I observe that the writer/reader pair sometimes
runs on the same core. Thanks Mike!

Regards,
Wanpeng Li
Re: [PATCH v2] NTB: ntb_perf: fix cast to restricted __le32
On Wed, Jan 24, 2018 at 8:48 AM, Serge Semin wrote:
> Sparse is whining about the u32 and __le32 mixed usage in the driver
>
> drivers/ntb/test/ntb_perf.c:288:21: warning: cast to restricted __le32
> drivers/ntb/test/ntb_perf.c:295:37: warning: incorrect type in argument 4
> (different base types)
> drivers/ntb/test/ntb_perf.c:295:37:    expected unsigned int [unsigned] [usertype] val
> drivers/ntb/test/ntb_perf.c:295:37:    got restricted __le32 [usertype]
>
> ...
>
> NTB hardware drivers shall accept CPU-endian data and translate it to
> the portable format by internal means, so the explicit conversions
> are not necessary before Scratchpad/Messages API usage anymore.
>
> Fixes: b83003b3fdc1 ("NTB: ntb_perf: Add full multi-port NTB API support")
> Signed-off-by: Serge Semin

Looks good to me,

Acked-by: Arnd Bergmann
Re: [PATCH 1/5] prctl: add PR_ISOLATE_BP process control
On 01/23/2018 06:07 PM, Dominik Brodowski wrote:
> On Tue, Jan 23, 2018 at 02:07:01PM +0100, Martin Schwidefsky wrote:
>> Add the PR_ISOLATE_BP operation to prctl. The effect of the process
>> control is to make all branch prediction entries created by the execution
>> of the user space code of this task not applicable to kernel code or the
>> code of any other task.
>
> What is the rationale for requiring a per-process *opt-in* for this added
> protection?
>
> For KPTI on x86, the exact opposite approach is being discussed (see, e.g.
> http://lkml.kernel.org/r/1515612500-14505-1-git-send-emai...@1wt.eu ): By
> default, play it safe, with KPTI enabled. But for "trusted" processes, one
> may opt out using prctl.

FWIW, this is not about KPTI. s390 always has the kernel in a separate
address space. It's only about potential spectre-like attacks. The idea is
to be able to isolate untrusted code in controlled environments, e.g. if
you have only one thread with untrusted code (e.g. jitting remote code).
The property of the branch prediction mode on s390 is that it protects in
two ways - against being attacked, but also against being able to attack
via the BTB.
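[For illustration, a wrapper in the spirit described in this thread. Note
that PR_ISOLATE_BP is only proposed by this patch set, so the constant
below is a labeled placeholder, not an allocated uapi number.]

#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_ISOLATE_BP
#define PR_ISOLATE_BP 48	/* placeholder; real value comes from the patched uapi headers */
#endif

int main(int argc, char *argv[])
{
	if (argc < 2)
		return 1;

	/* one-way switch: this task's BP entries no longer apply elsewhere */
	if (prctl(PR_ISOLATE_BP, 0, 0, 0, 0)) {
		perror("prctl(PR_ISOLATE_BP)");
		return 1;
	}

	execv(argv[1], &argv[1]);	/* run the untrusted program isolated */
	perror("execv");
	return 1;
}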
RE: backport Rewrite sync_core() to use IRET-to-self to stable kernels?
Thanks Greg, Andy

How about c198b121b1a1d7a7171770c634cd49191bac4477? Will both patches go to
the stable kernel?

BR.
Ning.

-----Original Message-----
From: gre...@linuxfoundation.org [mailto:gre...@linuxfoundation.org]
Sent: Tuesday, January 23, 2018 3:26 PM
To: Andy Lutomirski
Cc: Zhang, Ning A ; t...@linutronix.de; sta...@vger.kernel.org;
linux-kernel@vger.kernel.org
Subject: Re: backport Rewrite sync_core() to use IRET-to-self to stable
kernels?

On Mon, Jan 22, 2018 at 07:06:29PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 22, 2018 at 6:59 PM, Zhang, Ning A wrote:
> > hello, Greg, Andy, Thomas
> >
> > would you like to backport these two patches to the LTS kernel?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/arch/x86/include/asm/processor.h?h=v4.14.14&id=1c52d859cb2d417e7216d3e56bb7fea88444cec9
> >
> > x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit
> > kernels
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/arch/x86/include/asm/processor.h?h=v4.14.14&id=c198b121b1a1d7a7171770c634cd49191bac4477
> >
> > x86/asm: Rewrite sync_core() to use IRET-to-self
>
> I'd be in favor of backporting
> 1c52d859cb2d417e7216d3e56bb7fea88444cec9. I see no compelling reason
> to backport the other one, since it doesn't fix a bug. Greg, can you
> do this?

I'll work on this after this round of stable kernels is released.

thanks,

greg k-h
Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
On Tue 23-01-18 14:22:07, David Rientjes wrote:
> On Tue, 23 Jan 2018, Michal Hocko wrote:
>
> > > It can't, because the current patchset locks the system into a single
> > > selection criteria that is unnecessary and the mount option would become a
> > > no-op after the policy per subtree becomes configurable by the user as
> > > part of the hierarchy itself.
> >
> > This is simply not true! OOM victim selection has changed in the
> > past and will always be subject to changes in future. The current
> > implementation doesn't provide any externally controllable selection
> > policy and therefore the default can be assumed. Whatever that default
> > means now or in future. The only contract added here is "kill the full
> > memcg if selected", and that can be implemented on _any_ selection policy.
>
> The current implementation of memory.oom_group is based on top of a
> selection implementation that is broken in three ways I have listed for
> months:

This doesn't lead anywhere. You are not presenting any new arguments,
and you are ignoring the feedback you have received so far. We have tried
really hard. Considering that different _independent_ people have presented
a more or less consistent view on these points, I think you should deeply
reconsider how you take that feedback.

> - allows users to intentionally/unintentionally evade the oom killer,
>   requires not locking the selection implementation for the entire
>   system, requires subtree control to prevent, makes a mount option
>   obsolete, and breaks existing users who would use the implementation
>   based on 4.16 if this were merged,
>
> - unfairly compares the root mem cgroup vs leaf mem cgroup such that
>   users must structure their hierarchy only for 4.16 in such a way
>   that _all_ processes are under hierarchical control and have no
>   power to create sub cgroups because of the point above and
>   completely breaks any user of oom_score_adj in a completely
>   undocumented and unspecified way, such that fixing that breakage
>   would also break any existing users who would use the implementation
>   based on 4.16 if this were merged, and
>
> - does not allow userspace to protect important cgroups, which can be
>   built on top.

For the last time. This all can be done on top of the proposed solution
without breaking the proposed user API. I am really _convinced_ that you
underestimate how complex it is to provide a sane selection policy API, and
it will take _months_ to settle on something. The existing OOM APIs are a
sad story and I definitely do not want to repeat the same mistakes from
the past.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu
On Wed, Jan 24, 2018 at 03:50:10PM +0800, linxiu...@gmail.com wrote:
> From: "leilei.lin"
>
> Do not install a cgroup event into the CPU context if the cgroup
> is not running on this CPU.
>
> While there is no task of the cgroup running on the specified CPU, the
> current kernel still installs the cgroup event into the CPU context,
> which prevents another cgroup event from being installed on this CPU.

This changelog doesn't really cover the extent of the changes done.

> Signed-off-by: leilei.lin
> ---
>  kernel/events/core.c | 44 +++-
>  1 file changed, 31 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4df5b69..f766b60 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -933,31 +933,36 @@ list_update_cgroup_event(struct perf_event *event,
>  {
>  	struct perf_cpu_context *cpuctx;
>  	struct list_head *cpuctx_entry;
> +	struct perf_cgroup *cgrp;
>
>  	if (!is_cgroup_event(event))
>  		return;
>
>  	/*
>  	 * Because cgroup events are always per-cpu events,
>  	 * this will always be called from the right CPU.
>  	 */
>  	cpuctx = __get_cpu_context(ctx);
> +	cgrp = perf_cgroup_from_task(current, ctx);
>
> +	/* cpuctx->cgrp is NULL unless a cgroup event is running in this CPU. */
> +	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) {
> +		if (add)
> 			cpuctx->cgrp = cgrp;
> +		else
> +			cpuctx->cgrp = NULL;
>  	}
> +
> +	if (add && ctx->nr_cgroups++)
> +		return;
> +	else if (!add && --ctx->nr_cgroups)
> +		return;
> +
> +	cpuctx_entry = &cpuctx->cgrp_cpuctx_entry;
> +	if (add)
> +		list_add(cpuctx_entry, this_cpu_ptr(&cgrp_cpuctx_list));
> +	else
> +		list_del(cpuctx_entry);
>  }

I'm a little confused; you unconditionally set cpuctx->cgrp for every
add/delete.

So if we have >1 cgroup events on, and we remove one, you still clear
cpuctx->cgrp, that seems wrong.

Why did you change that? The Changelog doesn't include enough clues for
me to know what you were trying to do.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
Hi,

On Thursday 18 Jan 2018 at 10:38:07 (+0000), Morten Rasmussen wrote:
> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
> > On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
> > > Hi Peter,
> > >
> > > On 22 December 2017 at 21:42, Peter Zijlstra wrote:
> > > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > > >> Right; but I figured we'd try and do it 'right' and see how horrible it
> > > >> is before we try and do funny things.
> > > >
> > > > So now it should have a 32ms tick for up to .5s when the system goes
> > > > completely idle.
> > > >
> > > > No idea how bad that is..
> > >
> > > I have tested your branch, but the timer doesn't seem to fire correctly
> > > because I can still see blocked load in the use case I have run.
> > > I haven't found the reason yet.
> >
> > Hi Peter,
> >
> > With the patch below on top of your branch, the blocked loads are updated
> > and decayed regularly. The main differences are:
> > - It doesn't use a timer to trigger the ilb but the tick and the moment a
> >   cpu becomes idle. The main drawback of this solution is that the load is
> >   blocked when the system is fully idle, with the advantage of not waking
> >   up a fully idle system. We have to wait for the next tick or newly idle
> >   event to update blocked load when the system leaves the idle state,
> >   which can be up to a tick long. If this is too long, we can check for
> >   kicking the ilb when a task wakes up, so the blocked load will be
> >   updated as soon as the system leaves the idle state. The main advantage
> >   is that we don't wake up a fully idle system every 32ms just to update
> >   blocked load that will not be used.
> > - I'm working on one more improvement to use nohz_idle_balance in the
> >   newly idle case when the system is not overloaded and
> >   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
> >   try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as
> >   it exceeds this_rq->avg_idle. This will remove some calls to kick_ilb
> >   and some wake-ups of idle cpus.
>
> This sounds like what I meant in my other reply :-)
>
> It seems pointless to have a timer to update PELT if the system is
> completely idle, and when it isn't we can piggy back other events to
> make the updates happen.

The patch below implements what has been described above. It calls part of
nohz_idle_balance when a cpu becomes idle and kicks an ilb if that takes
too much time. This removes part of the ilbs that are kicked on an idle cpu
for updating the blocked load, but the ratio really depends on when the
tick happens compared to a cpu becoming idle and the 32ms boundary. I have
an additional patch that enables updating the blocked loads when a cpu
becomes idle 1 period before kicking an ilb, and there are far fewer ilbs
because we give more chances to the newly idle case (time_after is replaced
by time_after_eq in idle_balance()).

The patch also uses a function cfs_rq_has_blocked(), which only checks the
util/load_avg, instead of cfs_rq_is_decayed(), which checks the *_sum too.
This significantly reduces the number of updates of blocked load. The *_avg
will be fully decayed in around 300-400ms, but it takes far longer for the
*_sum, which have a higher resolution, and we can easily reach almost
seconds. But only the *_avg are used to make decisions, so keeping some
blocked *_sum is acceptable.
---
 kernel/sched/fair.c | 121 +++-
 1 file changed, 92 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 898785d..ed90303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	return true;
 }
 
+static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
+{
+	if (cfs_rq->avg.load_avg)
+		return true;
+
+	if (cfs_rq->avg.util_avg)
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 static void update_blocked_averages(int cpu)
@@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
 		 */
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
-		else
+
+		/* Don't need periodic decay once load/util_avg are null */
+		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
 	}
 
@@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (cfs_rq_is_decayed(cfs_rq))
+	if (cfs_rq_has_blocked(cfs_rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
@@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
Re: [PATCH-next] misc: mic: Use PTR_ERR_OR_ZERO
On 23.01.2018 21:10, Christopher Díaz Riveros wrote:
> Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Christopher Díaz Riveros
> ---
>  drivers/misc/mic/scif/scif_epd.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/misc/mic/scif/scif_epd.h b/drivers/misc/mic/scif/scif_epd.h
> index f39b663da287..b2a835665390 100644
> --- a/drivers/misc/mic/scif/scif_epd.h
> +++ b/drivers/misc/mic/scif/scif_epd.h
> @@ -165,9 +165,7 @@ static inline int scif_verify_epd(struct scif_endpt *ep)
>  static inline int scif_anon_inode_getfile(scif_epd_t epd)
>  {
>  	epd->anon = anon_inode_getfile("scif", &scif_anon_fops, NULL, 0);
> -	if (IS_ERR(epd->anon))
> -		return PTR_ERR(epd->anon);
> -	return 0;
> +	return PTR_ERR_OR_ZERO(epd->anon);
>  }
>

The patch looks OK, but someone should think about whether it makes sense
to have a one-liner as a function. IMHO this will only confuse readers
(note: yes, there are reasons in some cases, but I do not see one that
applies here, if any).

re,
 wh

>  static inline void scif_anon_inode_fput(scif_epd_t epd)
[PATCH v7 0/2] add tracepoints for nvme command submission and completion
Add tracepoints for nvme command submission and completion. The tracepoints are modeled after SCSI's trace_scsi_dispatch_cmd_start() and trace_scsi_dispatch_cmd_done() tracepoints and fulfil a similar purpose, namely a fast way to check which command is going to be queued into the HW or Fabric driver and which command is completed again. This version is a re-base onto nvme-4.16. Here's an example output using the qemu emulated pci nvme: # tracer: nop # # _-=> irqs-off # / _=> need-resched #| / _---=> hardirq/softirq #|| / _--=> preempt-depth #||| / delay # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | kworker/u8:0-5 [003] 9.158541: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=1, qsize=1023, cq_flags=0x3, irq_vector=0) -0 [003] d.h. 9.158705: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.158712: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_sq sqid=1, qsize=1023, sq_flags=0x1, cqid=1) -0 [003] d.h. 9.159214: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.159236: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=2, qsize=1023, cq_flags=0x3, irq_vector=1) -0 [003] d.h. 9.159288: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.159293: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_sq sqid=2, qsize=1023, sq_flags=0x1, cqid=2) -0 [003] d.h. 9.159479: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.159525: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=3, qsize=1023, cq_flags=0x3, irq_vector=2) -0 [003] d.h. 9.159565: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.159569: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_sq sqid=3, qsize=1023, sq_flags=0x1, cqid=3) -0 [003] d.h. 9.159726: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.159769: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=4, qsize=1023, cq_flags=0x3, irq_vector=3) -0 [003] d.h. 9.159795: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.159799: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_sq sqid=4, qsize=1023, sq_flags=0x1, cqid=4) -0 [003] d.h. 9.159957: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.160971: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_identify cns=0, ctrlid=1) -0 [003] d.h. 9.161059: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.161101: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_identify cns=0, ctrlid=0) -0 [003] d.h. 9.161160: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.161305: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_identify cns=0, ctrlid=0) -0 [003] d.h. 9.161344: nvme_complete_rq: cmdid=14, qid=0, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.161390: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=718, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=7, ctrl=0x0, dsmgmt=0, reftag=0) -0 [003] d.h. 
9.161578: nvme_complete_rq: cmdid=718, qid=1, res=0, retries=0, flags=0x0, status=0 kworker/u8:0-5 [003] 9.161608: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=718, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=24, len=7, ctrl=0x0, dsmgmt=0, reftag=0) -0 [003] d.h. 9.161681: nvme_complete_rq: cmdid=718, qid=1, res=0, retries=0, flags=0x0, status=0 dd-205 [001] 9.662909: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=1011, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=2559, ctrl=0x0, dsmgmt=0, reftag=0) dd-205 [001] 9.662967: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=1012, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=2560, len=1535, ctrl=0x0, dsmgmt=0, reftag=0) -0 [001] d.h. 9.664413: nvme_complete_rq: cmdid=1011, qid=1, res=0, retrie
Re: [PATCH 14/14] MIPS: memblock: Deactivate bootmem allocator
On Tue, Jan 23, 2018 at 11:59:35PM +0000, James Hogan wrote:
> On Thu, Jan 18, 2018 at 01:23:12AM +0300, Serge Semin wrote:
> > Memblock allocator can be successfully used from now on for early
> > memory management
> >
> > Signed-off-by: Serge Semin
>
> Am I correct that intermediate commits in this patchset (i.e. bisection)
> may not work correctly, since bootmem will have been stripped out but
> NO_BOOTMEM=n and memblock may not be properly operational yet?

Yes. You're absolutely right. The kernel will be buildable, but most likely
isn't operable until PATCH 14 deactivates the bootmem allocator.

> If so, is there a way to switch without breaking bisection that doesn't
> involve squashing most of the series into a single atomic commit?

I don't think so. There is no way to switch without squashing at all, at
least since the alteration involves arch and platform code, which all
relied on the bootmem allocator. Here is the list of patches which need to
be combined to keep the bisection unbroken:

[PATCH 03/14] MIPS: memblock: Reserve initrd memory in memblock
[PATCH 04/14] MIPS: memblock: Discard bootmem initialization
[PATCH 05/14] MIPS: memblock: Add reserved memory regions to memblock
[PATCH 06/14] MIPS: memblock: Reserve kdump/crash regions in memblock
[PATCH 07/14] MIPS: memblock: Mark present sparsemem sections
[PATCH 08/14] MIPS: memblock: Simplify DMA contiguous reservation
[PATCH 09/14] MIPS: memblock: Allow memblock regions resize
[PATCH 12/14] MIPS: memblock: Discard bootmem from Loongson3 code
[PATCH 13/14] MIPS: memblock: Discard bootmem from SGI IP27 code
[PATCH 14/14] MIPS: memblock: Deactivate bootmem allocator

So patches 03-09 contain the functional alterations needed for the arch
code to work correctly with memblock, and patches 12-13 alter the platform
code of the specific NUMA platforms like Loongson3 and SGI IP27. After that
is done, bootmem can finally be deactivated.

Regards,
-Sergey

> Cheers
> James
>
> > ---
> >  arch/mips/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
> > index 725b5ece7..a6c4fb6b6 100644
> > --- a/arch/mips/Kconfig
> > +++ b/arch/mips/Kconfig
> > @@ -4,7 +4,6 @@ config MIPS
> >  	default y
> >  	select ARCH_BINFMT_ELF_STATE
> >  	select ARCH_CLOCKSOURCE_DATA
> > -	select ARCH_DISCARD_MEMBLOCK
> >  	select ARCH_HAS_ELF_RANDOMIZE
> >  	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> >  	select ARCH_MIGHT_HAVE_PC_PARPORT
> > @@ -57,6 +57,7 @@ config MIPS
> >  	select HAVE_IRQ_TIME_ACCOUNTING
> >  	select HAVE_KPROBES
> >  	select HAVE_KRETPROBES
> > +	select NO_BOOTMEM
> >  	select HAVE_MEMBLOCK
> >  	select HAVE_MEMBLOCK_NODE_MAP
> >  	select HAVE_MOD_ARCH_SPECIFIC
> > --
> > 2.12.0
[PATCH v7 1/2] nvme: add tracepoint for nvme_setup_cmd
Add tracepoints for nvme_setup_cmd() for tracing admin and/or nvm commands. Examples of the two tracepoints are as follows for trace_nvme_setup_admin_cmd(): kworker/u8:0-5 [003] 2.998792: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=1, qsize=1023, cq_flags=0x3, irq_vector=0) and trace_nvme_setup_nvm_cmd(): dd-205 [001] 3.503929: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=989, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=4096, len=2047, ctrl=0x0, dsmgmt=0, reftag=0) Signed-off-by: Johannes Thumshirn Reviewed-by: Hannes Reinecke Reviewed-by: Martin K. Petersen Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/Makefile | 4 ++ drivers/nvme/host/core.c | 7 +++ drivers/nvme/host/trace.c | 131 + drivers/nvme/host/trace.h | 141 + 4 files changed, 283 insertions(+) create mode 100644 drivers/nvme/host/trace.c create mode 100644 drivers/nvme/host/trace.h diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile index a25fd43650ad..441e67e3a9d7 100644 --- a/drivers/nvme/host/Makefile +++ b/drivers/nvme/host/Makefile @@ -1,4 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 + +ccflags-y += -I$(src) + obj-$(CONFIG_NVME_CORE)+= nvme-core.o obj-$(CONFIG_BLK_DEV_NVME) += nvme.o obj-$(CONFIG_NVME_FABRICS) += nvme-fabrics.o @@ -6,6 +9,7 @@ obj-$(CONFIG_NVME_RDMA) += nvme-rdma.o obj-$(CONFIG_NVME_FC) += nvme-fc.o nvme-core-y:= core.o +nvme-core-$(CONFIG_TRACING)+= trace.o nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o nvme-core-$(CONFIG_NVM)+= lightnvm.o diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index fde6fd2e7eef..358dfdc1f6da 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -29,6 +29,9 @@ #include #include +#define CREATE_TRACE_POINTS +#include "trace.h" + #include "nvme.h" #include "fabrics.h" @@ -628,6 +631,10 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req, } cmd->common.command_id = req->tag; + if (ns) + trace_nvme_setup_nvm_cmd(req->q->id, cmd); + else + trace_nvme_setup_admin_cmd(cmd); return ret; } EXPORT_SYMBOL_GPL(nvme_setup_cmd); diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c new file mode 100644 index ..87beb77dcc6c --- /dev/null +++ b/drivers/nvme/host/trace.c @@ -0,0 +1,131 @@ +/* + * NVM Express device driver tracepoints + * Copyright (c) 2018 Johannes Thumshirn, SUSE Linux GmbH + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */
+
+#include "trace.h"
+
+static const char *nvme_trace_create_sq(struct trace_seq *p, __le32 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	__le16 qsize = cdw10[0] >> 16;
+	__le16 sqid = cdw10[0] & 0xffff;
+	__le16 cqid = cdw10[1] >> 16;
+	__le16 sq_flags = cdw10[1] & 0xffff;
+
+	trace_seq_printf(p, "sqid=%u, qsize=%u, sq_flags=0x%x, cqid=%u",
+			 le16_to_cpu(sqid), le16_to_cpu(qsize),
+			 le16_to_cpu(sq_flags), le16_to_cpu(cqid));
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
+static const char *nvme_trace_create_cq(struct trace_seq *p, __le32 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	__le16 qsize = cdw10[0] >> 16;
+	__le16 cqid = cdw10[0] & 0xffff;
+	__le16 irq_vector = cdw10[1] >> 16;
+	__le16 cq_flags = cdw10[1] & 0xffff;
+
+	trace_seq_printf(p, "cqid=%u, qsize=%u, cq_flags=0x%x, irq_vector=%u",
+			 le16_to_cpu(cqid), le16_to_cpu(qsize),
+			 le16_to_cpu(cq_flags), le16_to_cpu(irq_vector));
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
+static const char *nvme_trace_admin_identify(struct trace_seq *p, __le32 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	__u8 cns = (cdw10[0] >> 24) & 0xff;
+	__le16 ctrlid = cdw10[0] & 0xffff;
+
+	trace_seq_printf(p, "cns=%u, ctrlid=%u", cns, le16_to_cpu(ctrlid));
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
+static const char *nvme_trace_read_write(struct trace_seq *p, __le32 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	__le64 slba = ((u64)cdw10[1] << 32) | cdw10[0];
+	__le16 control = cdw10[2] >>
Re: [PATCH v8 8/9] pwm: pwm-omap-dmtimer: Adapt driver to utilize dmtimer pdata ops
On Wednesday 24 January 2018 12:54 PM, Ladislav Michl wrote:
> Keerthy,
>
> On Wed, Jan 24, 2018 at 11:14:40AM +0530, Keerthy wrote:
>> Adapt driver to utilize dmtimer pdata ops instead of pdata-quirks.
>>
>> Signed-off-by: Keerthy
>> Acked-by: Neil Armstrong
>> Reviewed-by: Claudiu Beznea
>> ---
>>
>> Changes in v8:
>>
>>   * Added of_node_put call in success case of probe.
>>
>> Boot tested on am437x-gp-evm and dra7xx-evm.
>> Also compile tested omap1_defconfig with other patches of v7
>> posted here:
>>
>> https://www.spinics.net/lists/linux-omap/msg141100.html
>>
>> With v8 version of Patch 8/9.
>>
>>  drivers/pwm/pwm-omap-dmtimer.c | 68 ++
>>  1 file changed, 42 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/pwm/pwm-omap-dmtimer.c b/drivers/pwm/pwm-omap-dmtimer.c
>> index 5ad42f3..c00e474 100644
>> --- a/drivers/pwm/pwm-omap-dmtimer.c
>> +++ b/drivers/pwm/pwm-omap-dmtimer.c
>> @@ -23,6 +23,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>  #include
>>  #include
>>  #include
>> @@ -37,7 +38,7 @@ struct pwm_omap_dmtimer_chip {
>>  	struct pwm_chip chip;
>>  	struct mutex mutex;
>>  	pwm_omap_dmtimer *dm_timer;
>> -	struct pwm_omap_dmtimer_pdata *pdata;
>> +	struct omap_dm_timer_ops *pdata;
>>  	struct platform_device *dm_timer_pdev;
>>  };
>>
>> @@ -242,19 +243,35 @@ static int pwm_omap_dmtimer_probe(struct platform_device *pdev)
>>  {
>>  	struct device_node *np = pdev->dev.of_node;
>>  	struct device_node *timer;
>> +	struct platform_device *timer_pdev;
>>  	struct pwm_omap_dmtimer_chip *omap;
>> -	struct pwm_omap_dmtimer_pdata *pdata;
>> +	struct dmtimer_platform_data *timer_pdata;
>> +	struct omap_dm_timer_ops *pdata;
>>  	pwm_omap_dmtimer *dm_timer;
>>  	u32 v;
>> -	int status;
>> +	int status, ret = 0;
>>
>> -	pdata = dev_get_platdata(&pdev->dev);
>> -	if (!pdata) {
>> -		dev_err(&pdev->dev, "Missing dmtimer platform data\n");
>> -		return -EINVAL;
>> +	timer = of_parse_phandle(np, "ti,timers", 0);
>> +	if (!timer)
>> +		return -ENODEV;
>> +
>> +	timer_pdev = of_find_device_by_node(timer);
>> +	if (!timer_pdev) {
>> +		dev_err(&pdev->dev, "Unable to find Timer pdev\n");
>> +		ret = -ENODEV;
>> +		goto put;
>>  	}
>>
>> -	if (!pdata->request_by_node ||
>> +	timer_pdata = dev_get_platdata(&timer_pdev->dev);
>> +	if (!timer_pdata) {
>> +		dev_err(&pdev->dev, "dmtimer pdata structure NULL\n");
>> +		ret = -EINVAL;
>> +		goto put;
>> +	}
>> +
>> +	pdata = timer_pdata->timer_ops;
>> +
>> +	if (!pdata || !pdata->request_by_node ||
>>  	    !pdata->free ||
>>  	    !pdata->enable ||
>>  	    !pdata->disable ||
>> @@ -267,37 +284,32 @@ static int pwm_omap_dmtimer_probe(struct platform_device *pdev)
>>  	    !pdata->set_prescaler ||
>>  	    !pdata->write_counter) {
>>  		dev_err(&pdev->dev, "Incomplete dmtimer pdata structure\n");
>> -		return -EINVAL;
>> +		ret = -EINVAL;
>> +		goto put;
>>  	}
>>
>> -	timer = of_parse_phandle(np, "ti,timers", 0);
>> -	if (!timer)
>> -		return -ENODEV;
>> -
>>  	if (!of_get_property(timer, "ti,timer-pwm", NULL)) {
>>  		dev_err(&pdev->dev, "Missing ti,timer-pwm capability\n");
>> -		return -ENODEV;
>> +		ret = -ENODEV;
>> +		goto put;
>
> Should we call of_node_put() even from here? of_get_property() failed, so
> reference was not updated.

The of_node_put() is to balance the of_parse_phandle() called for timer. I
hope that is what you wanted to check, right?

>
>>  	}
>>
>>  	dm_timer = pdata->request_by_node(timer);
>
> And timer seems to be used only here, so calling
> 	of_node_put(timer);
> just here should be enough.

Okay, yes. This can be optimized. of_node_put(timer); can be called here,
and the instances below need not have that additional step.
>
>> -	if (!dm_timer)
>> -		return -EPROBE_DEFER;
>> +	if (!dm_timer) {
>> +		ret = -EPROBE_DEFER;
>> +		goto put;
>> +	}
>>
>>  	omap = devm_kzalloc(&pdev->dev, sizeof(*omap), GFP_KERNEL);
>>  	if (!omap) {
>>  		pdata->free(dm_timer);
>> -		return -ENOMEM;
>> +		ret = -ENOMEM;
>> +		goto put;
>>  	}
>>
>>  	omap->pdata = pdata;
>>  	omap->dm_timer = dm_timer;
>> -
>> -	omap->dm_timer_pdev = of_find_device_by_node(timer);
>> -	if (!omap->dm_timer_pdev) {
>> -		dev_err(&pdev->dev, "Unable to find timer pdev\n");
>> -		omap->pdata->free(dm_timer);
>> -		return -EINVAL;
>> -	}
>> +	omap->dm_timer_pdev = timer_pdev;
>>
>>  	/*
>>  	 * Ensure that the timer is stopped before we allow PWM core to call
>> @@ -326,12 +338,16 @@ static int pwm_omap_dmtimer_probe(struct platform_device *pdev
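[Condensed, this is the refcounting shape being agreed on above - a sketch
of the probe body, not the final v9 patch: of_parse_phandle() returns the
node with a reference held, so dropping it right after its last use keeps
every later error path free of of_node_put() bookkeeping.]

	timer = of_parse_phandle(np, "ti,timers", 0);
	if (!timer)
		return -ENODEV;

	/* ... find timer_pdev, validate pdata ops, check "ti,timer-pwm" ... */

	dm_timer = pdata->request_by_node(timer);
	of_node_put(timer);		/* last use of the node, drop the ref here */
	if (!dm_timer)
		return -EPROBE_DEFER;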
[PATCH v7 2/2] nvme: add tracepoint for nvme_complete_rq
Add a tracepoint in nvme_complete_rq() for completions of NVMe commands.

An example output of the tracepoint is as follows:

   <idle>-0 [001] d.h. 3.505266: nvme_complete_rq: cmdid=989, qid=1, res=0, retries=0, flags=0x0, status=0

Signed-off-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Reviewed-by: Martin K. Petersen
Reviewed-by: Keith Busch
Reviewed-by: Sagi Grimberg
---
 drivers/nvme/host/core.c  |  2 ++
 drivers/nvme/host/trace.h | 26 ++
 2 files changed, 28 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 358dfdc1f6da..7cbd4a260d30 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -220,6 +220,8 @@ void nvme_complete_rq(struct request *req)
 {
 	blk_status_t status = nvme_error_status(req);
 
+	trace_nvme_complete_rq(req);
+
 	if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
 		if (nvme_req_needs_failover(req, status)) {
 			nvme_failover_req(req);
diff --git a/drivers/nvme/host/trace.h b/drivers/nvme/host/trace.h
index f17dbfbead5a..afd87235311b 100644
--- a/drivers/nvme/host/trace.h
+++ b/drivers/nvme/host/trace.h
@@ -130,6 +130,32 @@ TRACE_EVENT(nvme_setup_nvm_cmd,
 		      __parse_nvme_cmd(__entry->opcode, __entry->cdw10))
 );
 
+TRACE_EVENT(nvme_complete_rq,
+	    TP_PROTO(struct request *req),
+	    TP_ARGS(req),
+	    TP_STRUCT__entry(
+		    __field(int, qid)
+		    __field(int, cid)
+		    __field(__le64, result)
+		    __field(u8, retries)
+		    __field(u8, flags)
+		    __field(u16, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->qid = req->q->id;
+		    __entry->cid = req->tag;
+		    __entry->result = nvme_req(req)->result.u64;
+		    __entry->retries = nvme_req(req)->retries;
+		    __entry->flags = nvme_req(req)->flags;
+		    __entry->status = nvme_req(req)->status;
+	    ),
+	    TP_printk("cmdid=%u, qid=%d, res=%llu, retries=%u, flags=0x%x, status=%u",
+		      __entry->cid, __entry->qid,
+		      le64_to_cpu(__entry->result),
+		      __entry->retries, __entry->flags, __entry->status)
+
+);
+
 #endif /* _TRACE_NVME_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.13.6
Re: [PATCH v2 2/5] x86/cpufeatures: Add Intel feature bits for Speculation Control
On Tue, 2018-01-23 at 17:28 -0800, Dave Hansen wrote:
> On 01/23/2018 05:23 PM, Woodhouse, David wrote:
> > On Tue, 2018-01-23 at 10:43 -0800, Dave Hansen wrote:
> ...
> > > > /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
> > > > #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
> > > > #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
> > > > +#define X86_FEATURE_SPEC_CTRL		(18*32+26) /* Speculation Control (IBRS + IBPB) */
> > > > +#define X86_FEATURE_STIBP		(18*32+27) /* Single Thread Indirect Branch Predictors */
> > > > +#define X86_FEATURE_ARCH_CAPABILITIES	(18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
> > >
> > > Should we be adding flags (STIBP) for which we currently have no user in
> > > the kernel?
> >
> > They're in an existing word (now) so it costs us absolutely nothing to
> > do so. And they'll be exposed to KVM guests in imminent patches if
> > nothing else.
>
> Doesn't just defining it here generate something in the tables that then
> gets exported in /proc/cpuinfo? That's far from our most strict ABI, but
> a single #define here can be seen by users IIRC.

That's true, but still we're *working* on exposing and using these; let's
not go wild adding one feature at a time and having to tweak the
surrounding blacklist/enable/disable/expose logic at every step.
Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu
2018-01-24 16:20 GMT+08:00 Peter Zijlstra :
> On Wed, Jan 24, 2018 at 03:50:10PM +0800, linxiu...@gmail.com wrote:
>> From: "leilei.lin"
>>
>> Do not install a cgroup event into the CPU context if the cgroup
>> is not running on this CPU.
>>
>> While there is no task of the cgroup running on the specified CPU, the
>> current kernel still installs the cgroup event into the CPU context,
>> which prevents another cgroup event from being installed on this CPU.
>
> This changelog doesn't really cover the extent of the changes done.

See below please.

>
>> Signed-off-by: leilei.lin
>> ---
>>  kernel/events/core.c | 44 +++-
>>  1 file changed, 31 insertions(+), 13 deletions(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 4df5b69..f766b60 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -933,31 +933,36 @@ list_update_cgroup_event(struct perf_event *event,
>>  {
>>  	struct perf_cpu_context *cpuctx;
>>  	struct list_head *cpuctx_entry;
>> +	struct perf_cgroup *cgrp;
>>
>>  	if (!is_cgroup_event(event))
>>  		return;
>>
>>  	/*
>>  	 * Because cgroup events are always per-cpu events,
>>  	 * this will always be called from the right CPU.
>>  	 */
>>  	cpuctx = __get_cpu_context(ctx);
>> +	cgrp = perf_cgroup_from_task(current, ctx);
>>
>> +	/* cpuctx->cgrp is NULL unless a cgroup event is running in this CPU. */
>> +	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) {
>> +		if (add)
>> 			cpuctx->cgrp = cgrp;
>> +		else
>> +			cpuctx->cgrp = NULL;
>>  	}
>> +
>> +	if (add && ctx->nr_cgroups++)
>> +		return;
>> +	else if (!add && --ctx->nr_cgroups)
>> +		return;
>> +
>> +	cpuctx_entry = &cpuctx->cgrp_cpuctx_entry;
>> +	if (add)
>> +		list_add(cpuctx_entry, this_cpu_ptr(&cgrp_cpuctx_list));
>> +	else
>> +		list_del(cpuctx_entry);
>>  }
>
> I'm a little confused; you unconditionally set cpuctx->cgrp for every
> add/delete.
>
> So if we have >1 cgroup events on, and we remove one, you still clear
> cpuctx->cgrp, that seems wrong.
>
> Why did you change that? The Changelog doesn't include enough clues for
> me to know what you were trying to do.

If we have > 1 cgroup events on, whenever a cgroup is really to be deleted,
I clear cpuctx->cgrp only if this cgroup is the same as the cgroup running
on this cpu. The reverse is the same.

Here is the problem: the previous version didn't set cpuctx->cgrp anymore
if ctx->nr_cgroups > 1, which causes a second event to not be activated
immediately, because cpuctx->cgrp isn't equal to event->cgrp at
event_filter_match().
Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.
On 2018/01/22 10:01, Alexander Duyck wrote:
[...]
> > If the patch that I submitted for the current vmware issue is merged,
> > the significant commits that are left are:
> >
> > 0a8047ac68e5 e1000e: Fix msi-x interrupt automask (v4.5-rc1)
> >     Fixes a problem in the irq disabling of the napi implementation.
>
> This one I fully agree is still needed.
>
> > 19110cfbb34d e1000e: Separate signaling for link check/link up (v4.15-rc1)
> >     Fixes link flapping caused by a race condition in link
> >     detection. It was found because the Other interrupt was being
> >     triggered sort of spuriously by rxo.
>
> This one is somewhat iffy. I am not sure if the patch description
> really matches what it is doing. It doesn't appear to do what it says
> it is trying to do, since clearing get_link_status will still trigger a
> link up, it just takes an extra 2 seconds. I think there may be issues
> if you aren't using autoneg, as I don't see how you are getting the
> link to report up other than the fact that mac->get_link_status has
> been cleared but we are reporting a pseudo-error. In addition it is
> only really needed after the RXO problem was introduced, which really
> didn't exist until after we stopped checking for LSC. One interesting
> test we may want to look at is to see if there is an additional delay
> in a link coming up for a non-autoneg setup. If we find an additional
> 2 second delay then I would be even more confident that this patch has
> a bug.

It seems like you're right, but I didn't look into this part of the problem
in detail yet. I'll get back to it.

> > 4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts (v4.15-rc1)
> >     Fixes Other interrupt bursts during sustained rxo conditions.
>
> So the RXO problem probably didn't exist until we stopped checking for
> the OTHER and LSC bits in the "other" interrupt handler. Yes there
> would be more "other" cause interrupts, but they shouldn't have been
> causing much in the way of issues since the get_link_status value
> never changed. Personally I would lean more toward the option of

I agree. I tested rxo behavior on commit 4d432f67ff00 ("e1000e: Remove
unreachable code", v4.5-rc1), which is before any significant change in
that area. (I force rxo by adding mdelay(10) to e1000_clean_rx_irq and
sending a netperf UDP_STREAM from another host.)

In case of a sustained rxo condition, we get repeated Other interrupts.
Handling these irqs is useless work that could be avoided when the system
is already overloaded, but it doesn't lead to misbehavior like the race
condition described in the log of commit 19110cfbb34d ("e1000e: Separate
signaling for link check/link up", v4.15-rc1).

However, I noticed something unexpected. It seems like reading ICR doesn't
clear every bit that's set in IAM, most notably not rxo. In a different
test, I was doing a single write of RXO | OTHER to ICS; then two subsequent
reads of icr gave 0x0141. OTOH, writing a bit to ICS reliably clears it.

So if you want to remove RXO interrupt mitigation, you should at least add
a write of RXO to ICR, to clear it. On my system it reduced Other
interrupts from ~17000/s to ~1700/s when using the mdelay testing approach.

> reverting this patch and instead just focus on testing OTHER and LSC
> as we originally were, so that we don't risk messing up NAPI by messing
> with ring state from a non-ring interrupt.
>
> I will try to get to these later this week if you would like.
> Unfortunately I don't have any of these devices in any of my
> development systems so I have to go chase one down. Otherwise you are
> free to take these on and tell me if I have made another flawed
> assumption somewhere, but I am thinking the RXO issue goes away if we
> get the original "other" interrupt routine back to where it was.
>
> So the last bit in all this ends up being that because of 0a8047ac68e5
> e1000e: Fix msi-x interrupt automask (v4.5-rc1) we don't seem to
> auto-clear interrupt causes anymore on ICR read. I am not certain what
> the impact of this is. I would be interested in finding out if a cause
> left set will trigger an interrupt storm or if it just goes quiet when
> we just leave the value high. If it goes quiet then that in itself
> might solve the RXO interrupt burst problem if we don't clear it.
> Otherwise we need to make certain to clear all of the causes that can
> trigger the "other" interrupt to fire regardless of if we service the
> events or not.

In MSI-X mode, as long as Other is not set in ICR, nothing will happen
even if the bits related to Other (LSC, RXO, MDAC, SRPD, ACK, MNG) are set.
However, that doesn't solve the rxo interrupt burst, because an rxo
condition in hardware sets both RXO and Other, so it triggers an interrupt.
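[A hedged sketch of the explicit-clear suggestion above, not a tested
patch: since reading ICR was observed not to clear RXO here, write the bit
(plus the OTHER summary bit) back to ICR in the "other" vector handler.
er32()/ew32() and the E1000_ICR_* masks are the driver's existing defines.]

static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
{
	struct e1000_adapter *adapter = data;
	struct e1000_hw *hw = &adapter->hw;
	u32 icr = er32(ICR);

	/* ICR bits are write-1-to-clear; reading alone leaves RXO set */
	if (icr & E1000_ICR_RXO)
		ew32(ICR, E1000_ICR_RXO | E1000_ICR_OTHER);

	/* ... existing LSC handling, re-arm, etc. ... */

	return IRQ_HANDLED;
}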
[PATCH] crypto: sha512-mb - initialize pending lengths correctly
From: Eric Biggers

The SHA-512 multibuffer code keeps track of the number of blocks pending in
each lane. The minimum of these values is used to identify the next lane
that will be completed. Unused lanes are set to a large number (0xFFFFFFFF)
so that they don't affect this calculation.

However, it was forgotten to set the lengths to this value in the initial
state, where all lanes are unused. As a result it was possible for
sha512_mb_mgr_get_comp_job_avx2() to select an unused lane, causing a NULL
pointer dereference. Specifically this could happen in the case where
->update() was passed fewer than SHA512_BLOCK_SIZE bytes of data, so it
then called sha_complete_job() without having actually submitted any blocks
to the multi-buffer code. This hit a NULL pointer dereference if another
task happened to have submitted blocks concurrently to the same CPU and the
flush timer had not yet expired.

Fix this by initializing sha512_mb_mgr->lens correctly.

As usual, this bug was found by syzkaller.

Fixes: 45691e2d9b18 ("crypto: sha512-mb - submit/flush routines for AVX2")
Reported-by: syzbot
Cc: sta...@vger.kernel.org # v4.8+
Signed-off-by: Eric Biggers
---
 arch/x86/crypto/sha512-mb/sha512_mb_mgr_init_avx2.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/crypto/sha512-mb/sha512_mb_mgr_init_avx2.c b/arch/x86/crypto/sha512-mb/sha512_mb_mgr_init_avx2.c
index 36870b26067a..d08805032f01 100644
--- a/arch/x86/crypto/sha512-mb/sha512_mb_mgr_init_avx2.c
+++ b/arch/x86/crypto/sha512-mb/sha512_mb_mgr_init_avx2.c
@@ -57,10 +57,12 @@ void sha512_mb_mgr_init_avx2(struct sha512_mb_mgr *state)
 {
 	unsigned int j;
 
-	state->lens[0] = 0;
-	state->lens[1] = 1;
-	state->lens[2] = 2;
-	state->lens[3] = 3;
+	/* initially all lanes are unused */
+	state->lens[0] = 0xFFFFFFFF00000000;
+	state->lens[1] = 0xFFFFFFFF00000001;
+	state->lens[2] = 0xFFFFFFFF00000002;
+	state->lens[3] = 0xFFFFFFFF00000003;
+
 	state->unused_lanes = 0xFF03020100;
 	for (j = 0; j < 4; j++)
 		state->ldata[j].job_in_lane = NULL;
-- 
2.16.0
Avoiding information leaks between users and between processes by default? [Was: [PATCH 1/5] prctl: add PR_ISOLATE_BP process control]
On Wed, Jan 24, 2018 at 07:29:53AM +0100, Martin Schwidefsky wrote:
> On Tue, 23 Jan 2018 18:07:19 +0100
> Dominik Brodowski wrote:
>
> > On Tue, Jan 23, 2018 at 02:07:01PM +0100, Martin Schwidefsky wrote:
> > > Add the PR_ISOLATE_BP operation to prctl. The effect of the process
> > > control is to make all branch prediction entries created by the execution
> > > of the user space code of this task not applicable to kernel code or the
> > > code of any other task.
> >
> > What is the rationale for requiring a per-process *opt-in* for this added
> > protection?
> >
> > For KPTI on x86, the exact opposite approach is being discussed (see, e.g.
> > http://lkml.kernel.org/r/1515612500-14505-1-git-send-emai...@1wt.eu ): By
> > default, play it safe, with KPTI enabled. But for "trusted" processes, one
> > may opt out using prctl.
>
> The rationale is that there are cases where you got code from *somewhere*
> and want to run it in an isolated context. Think: a docker container that
> runs under KVM. But with spectre this is still not really safe. So you
> include a wrapper program in the docker container to use the trap door
> prctl to start the potentially malicious program. Now you should be good, no?

Well, partly. It may be that s390 and its use cases are special -- but as I
understand it, this uapi question goes beyond s390: To my understanding,
Linux traditionally tried to aim for the security goal of avoiding
information leaks *between* users[+], probably even between processes of
the same user. It wasn't a guarantee, and there always were (and will be)
information leaks -- and that is where additional safeguards such as
seccomp come into play, which reduce the attack surface against unknown or
unresolved security-related bugs. And everyone knew (or should have known)
that allowing "untrusted" code to be run (be it by a user, be it
JavaScript, etc.) is more risky. But still, avoiding information leaks
between users and between processes was (to my understanding) at least a
goal.[§]

In recent days, however, the outlook on this issue seems to have shifted:

- Your proposal would mean to trust all userspace code, unless it is
  specifically marked as untrusted. As I understand it, this would mean
  that by default, spectre isn't fully mitigated cross-user and
  cross-process, though the kernel could do so. And rogue user-run code may
  make use of that, unless it is run with a special wrapper.

- Concerning x86 and IBPB, the current proposal is to limit the protection
  offered by IBPB to non-dumpable processes. As I understand it, this would
  mean that other processes are left hanging out to dry.[~]

- Concerning x86 and STIBP, David mentioned that "[t]here's an argument
  that there are so many other information leaks between HT siblings that
  we might not care"; in the last couple of hours, a proposal emerged to
  limit the protection offered by STIBP to non-dumpable processes as well.
  To my understanding, this would mean that many processes are left hanging
  out to dry again.

I am a bit worried whether this is a sign of a shift in the security goals.
I fully understand that there might be processes (e.g. some[?] kernel
threads) and users (root) which you need to trust anyway, as they can
already access anything. Disabling additional, costly safeguards for those
special cases then seems OK. Opting out of additional protections for
single-user or single-use systems (haproxy?) might make sense as well. But
the kernel[*] not offering full[#] spectre mitigation by default for
regular users and their processes?
I'm not so sure.

Thanks,
	Dominik

[+] root is different.

[§] Whether such goals and their pursuit may have legal relevance -- e.g.
    concerning the criminal law protection against unlawful access to data
    -- is a related, fascinating topic.

[~] For example, I doubt that mutt sets the non-dumpable flag. But I
    wouldn't want other users to be able to read my mail.

[#] Well, at least the best the kernel can currently and reasonably manage.

[*] Whether CPUs should enable full mitigation (IBRS_ALL) by default in
    future has been discussed on this list as well.
Re: [PATCH v2 3/5] x86/cpufeatures: Add AMD feature bits for Speculation Control
On Tue, Jan 23, 2018 at 04:52:53PM +0000, David Woodhouse wrote:
> AMD exposes the PRED_CMD/SPEC_CTRL MSRs slightly differently to Intel.
> Documented at https://lkml.org/lkml/2018/1/21/112

lkml.org is not very stable. Maybe use a message-id based URL instead?

http://lkml.kernel.org/r/2b3e25cc-286d-8bd0-aeaf-9ac4aae39...@amd.com

-- 
 Kirill A. Shutemov
Re: [PATCH 4/8] mfd: stm32-timers: add support for dmas
On 01/23/2018 05:41 PM, Lee Jones wrote:
> On Tue, 23 Jan 2018, Fabrice Gasnier wrote:
>> On 01/23/2018 04:30 PM, Lee Jones wrote:
>>> On Tue, 23 Jan 2018, Fabrice Gasnier wrote:
>>>> On 01/23/2018 02:32 PM, Lee Jones wrote:
>>>>> On Tue, 16 Jan 2018, Fabrice Gasnier wrote:
>>>>>
>>>>>> STM32 Timers can support up to 7 dma requests:
>>>>>> 4 channels, update, compare and trigger.
>>>>>> Optionally request part, or all dmas from stm32-timers MFD core.
>>>>>> Also, keep reference of device's bus address to allow child drivers to
>>>>>> transfer data from/to device by using dma.
>>>>>>
>>>>>> Signed-off-by: Fabrice Gasnier
>>>>>> ---
>>>>>>  drivers/mfd/stm32-timers.c       | 37 -
>>>>>>  include/linux/mfd/stm32-timers.h | 14 ++
>>>>>>  2 files changed, 50 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/mfd/stm32-timers.c b/drivers/mfd/stm32-timers.c
>>>>>> static int stm32_timers_probe(struct platform_device *pdev)
>>>>>>  {
>>>>>>  	struct device *dev = &pdev->dev;
>>>>>> @@ -44,6 +61,7 @@ static int stm32_timers_probe(struct platform_device *pdev)
>>>>>>  	mmio = devm_ioremap_resource(dev, res);
>>>>>>  	if (IS_ERR(mmio))
>>>>>>  		return PTR_ERR(mmio);
>>>>>> +	ddata->phys_base = res->start;
>>>>>
>>>>> What do you use this for?
>>>>
>>>> This is used in the child driver (pwm) for capture data transfer by dma.
>>>
>>> Might be worth being clear about that.
>>>
>>> Perhaps pass in 'dma_base' (phys_base + offset) instead?
>>
>> I guess you've had a look at [PATCH 5/8] pwm: stm32: add capture
>> support. Are you talking about passing phys_base + TIM_DMAR?
>
> I have and I am.
>
>> If this is the case, I'd prefer to keep phys_base only, if you don't
>> mind, and handle the TIM_DMAR offset in the pwm driver. This way, all
>> dma slave config is kept locally in one place.
>> Or do you mean something else?
>>
>> Maybe I can add a comment here about this?
>> Something like:
>> /* phys_base to be used by child driver, e.g. DMA burst mode */
>
> I haven't seen the memory map for this device, so it's not easy for me
> to comment, but passing in the physical address of the parent MFD into
> a child device doesn't quite sit right with me.
>
> At what level does TIM_DMAR sit? Is it a child (PWM) specific
> property, or is it described at parent (Timer) level?

Hi Lee,

This isn't child (PWM) specific. TIM_DMAR is described at timer level, as
are all the timer's DMA request lines. The current patchset makes it useful
for PWM capture. Basically, I think this can be seen like interrupts, as
each (0..7) dma request has an enable bit (in DIER: interrupt enable
register). This is similar to interrupts at timer level. So, I understand
your point regarding passing the physical address of the parent MFD...
Speaking of interrupts, I'd probably have looked at irq_chip. Regarding
dma, I'm not sure what the preferred way is. Another way may be to export a
routine (EXPORT_SYMBOL) from the MFD core to handle the dma transfer from
there. Looking into drivers/mfd, I found a similar approach, e.g.
rtsx_pci_dma_transfer(). Do you think this is a better approach?

Please let me know your opinion.

Best Regards,
Fabrice
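[A rough sketch of that exported-helper option, in the spirit of
rtsx_pci_dma_transfer(). Everything below is illustrative, not an existing
API: the MFD core owns the DMA channel and knows phys_base + TIM_DMAR, so
children only pass a buffer and a burst description.]

/* Hypothetical helper exported by the stm32-timers MFD core. */
int stm32_timers_dma_burst_read(struct device *dev, u32 *buf,
				enum stm32_timers_dmas id,
				unsigned int num_transfers)
{
	struct stm32_timers *ddata = dev_get_drvdata(dev);
	struct dma_slave_config config = {
		.src_addr	= ddata->phys_base + TIM_DMAR,
		.src_addr_width	= DMA_SLAVE_BUSWIDTH_4_BYTES,
	};

	/* ... dmaengine_slave_config(), prep cyclic/slave_sg, submit,
	 * wait for completion, copy results into buf ...
	 */
	return 0;
}
EXPORT_SYMBOL_GPL(stm32_timers_dma_burst_read);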
Re: [PATCH] staging: media: remove unused VIDEO_ATOMISP_OV8858 kconfig
On Tue, Jan 23, 2018 at 07:31:27PM +0200, Andy Shevchenko wrote:
> On Tue, Jan 23, 2018 at 4:37 PM, Corentin Labbe wrote:
> > Nothing in the kernel uses VIDEO_ATOMISP_OV8858 since commit 3a81c7660f80
> > ("media: staging: atomisp: Remove IMX sensor support")
> > Let's remove this kconfig option.
>
> First of all, I hardly understand how that change is related.

It's pretty obvious if you look at the commit.

-obj-$(CONFIG_VIDEO_ATOMISP_OV8858) += atomisp-ov8858.o

It sounds like you deleted that line by mistake and re-added it to your
local Makefile?

regards,
dan carpenter
Re: [PATCH v2 3/5] x86/cpufeatures: Add AMD feature bits for Speculation Control
On Wed, 2018-01-24 at 11:39 +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 23, 2018 at 04:52:53PM +0000, David Woodhouse wrote:
> >
> > AMD exposes the PRED_CMD/SPEC_CTRL MSRs slightly differently to Intel.
> > Documented at https://lkml.org/lkml/2018/1/21/112
>
> lkml.org is not very stable. Maybe use a message-id based URL instead?
>
> http://lkml.kernel.org/r/2b3e25cc-286d-8bd0-aeaf-9ac4aae39...@amd.com

Well... I like the fact that your one has "amd.com" in it... but I did want
to see it a lot further to the left. Tom?
Re: Coccinelle: zalloc-simple: Checking consequences from the usage of at signs in Python strings
> So it works, but you are complaining anyway?

How seriously do you interpret such information in the SmPL manual?

>> https://github.com/coccinelle/coccinelle/blob/bf1c6a5869dd324f5faeeaa3a12d57270e478b21/docs/manual/cocci_syntax.tex#L50
>>
>> “…
>> Furthermore, @ should not be used in this code. …
>> …”
>
> I guess the conclusion is that it works in strings (which are pretty
> universal) and not in comments (which are language specific).

Would you like to achieve any safer data processing for potentially
“reserved” characters?

Regards,
Markus
Re: [PATCH v6 05/99] xarray: Add definition of struct xarray
Matthew,

Just a minor question.

On Wed, 2018-01-17 at 12:20 -0800, Matthew Wilcox wrote:
> This is a direct replacement for struct radix_tree_root. Some of the
> struct members have changed name; convert those, and use a #define so
> that radix_tree users continue to work without change.
>
> Signed-off-by: Matthew Wilcox

> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -10,6 +10,8 @@
>   */
>
>  #include
> +#include
> +#include

The top Makefile includes linux/kconfig.h globally. (See the odd
USERINCLUDE variable, which is actually part of the LINUXINCLUDE variable,
but split off to make things confusing.) Why do you need to include
linux/kconfig.h here?

>  #include
>  #include

Thanks,

Paul Bolle
Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
On 2018-01-09 17:37, Andrey Ryabinin wrote: > strscpy() performs word-at-a-time optimistic reads. So it may > access memory past the end of the object, which is perfectly fine > since strscpy() doesn't use that (past-the-end) data and makes sure the > optimistic read won't cross a page boundary. > > But KASAN doesn't know anything about that so it will complain. > There are several possible ways to address this issue, but none > are perfect. See > https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8...@virtuozzo.com > > It seems the best solution is to simply disable the word-at-a-time > optimization. My trivial testing shows that byte-at-a-time > could be up to 4.3 times slower than word-at-a-time. > It may seem like a lot, but it's actually ~1.2e-10 sec per symbol vs > ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy() > in performance critical paths to copy large amounts of data, > so it shouldn't matter anyway. > > Fixes: 30035e45753b7 ("string: provide strscpy()") > Signed-off-by: Andrey Ryabinin > Cc: > Acked-by: Rasmus Villemoes Your microbenchmark even favours word-at-a-time slightly, since in practice I think at least one of src or dst will be unaligned a lot of the time, and while x86 may HAVE_EFFICIENT_UNALIGNED_ACCESS, it's still a little more expensive than doing aligned access. And since strscpy is not called that often, I expect some of the ~300 bytes of instruction cache it occupies can be put to better use elsewhere. Rasmus
Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
On Tue, Jan 23, 2018 at 08:58:36PM +, David Woodhouse wrote: > +static const struct sku_microcode spectre_bad_microcodes[] = { > + { INTEL_FAM6_KABYLAKE_DESKTOP, 0x0B, 0x80 }, > + { INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 }, > + { INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 }, > + { INTEL_FAM6_KABYLAKE_MOBILE, 0x09, 0x80 }, > + { INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 }, > + { INTEL_FAM6_SKYLAKE_X, 0x04, 0x023C }, > + { INTEL_FAM6_SKYLAKE_MOBILE, 0x03, 0x00C2 }, > + { INTEL_FAM6_SKYLAKE_DESKTOP, 0x03, 0x00C2 }, > + { INTEL_FAM6_BROADWELL_CORE, 0x04, 0x28 }, > + { INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x001B }, > + { INTEL_FAM6_HASWELL_ULT, 0x01, 0x21 }, > + { INTEL_FAM6_HASWELL_GT3E, 0x01, 0x18 }, > + { INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 }, > + { INTEL_FAM6_IVYBRIDGE_X, 0x04, 0x42a }, > + { INTEL_FAM6_HASWELL_X, 0x02, 0x3b }, > + { INTEL_FAM6_HASWELL_X, 0x04, 0x10 }, > + { INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 }, > + { INTEL_FAM6_BROADWELL_XEON_D, 0x02, 0x14 }, > + { INTEL_FAM6_BROADWELL_XEON_D, 0x03, 0x711 }, > + { INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x001B }, > + /* For 406F1 Intel says "0x25, 0x23" while VMware says 0x0B25 > + * and a real CPU has a firmware in the 0x0Bxx range. So: */ > + { INTEL_FAM6_BROADWELL_X, 0x01, 0x0b25 }, > + { INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 }, > + { INTEL_FAM6_SKYLAKE_X, 0x03, 0x100013e }, > + { INTEL_FAM6_SKYLAKE_X, 0x04, 0x23c }, > +}; Typically tglx likes to use x86_match_cpu() for these things; see also commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to errata"). > + > +static int bad_spectre_microcode(struct cpuinfo_x86 *c) > +{ > + int i; > + > + for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) { > + if (c->x86_model == spectre_bad_microcodes[i].model && > + c->x86_mask == spectre_bad_microcodes[i].stepping) > + return (c->microcode <= > spectre_bad_microcodes[i].microcode); > + } > + return 0; > +} The above is Intel only, you should check vendor too I think. > static void early_init_intel(struct cpuinfo_x86 *c) > { > u64 misc_enable; > @@ -122,6 +173,18 @@ static void early_init_intel(struct cpuinfo_x86 *c) > if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64)) > c->microcode = intel_get_microcode_revision(); > > + if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) || > + cpu_has(c, X86_FEATURE_AMD_SPEC_CTRL) || > + cpu_has(c, X86_FEATURE_AMD_PRED_CMD) || > + cpu_has(c, X86_FEATURE_AMD_STIBP)) && bad_spectre_microcode(c)) { > + pr_warn("Intel Spectre v2 broken microcode detected; disabling > SPEC_CTRL\n"); > + clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL); > + clear_cpu_cap(c, X86_FEATURE_STIBP); > + clear_cpu_cap(c, X86_FEATURE_AMD_SPEC_CTRL); > + clear_cpu_cap(c, X86_FEATURE_AMD_PRED_CMD); > + clear_cpu_cap(c, X86_FEATURE_AMD_STIBP); > + } And since it's Intel-only, what are those AMD features doing there?
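For comparison, a rough sketch of the x86_match_cpu() shape used by the TSC_DEADLINE quirk. Note that struct x86_cpu_id has no stepping field, so per-stepping revisions would still need a per-model helper as that quirk does; the table entries and names below are illustrative only:

#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>

/* driver_data carries the highest known-bad microcode revision. */
static const struct x86_cpu_id spectre_bad_ucode[] = {
	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_MOBILE, X86_FEATURE_ANY, 0x00C2 },
	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_DESKTOP, X86_FEATURE_ANY, 0x00C2 },
	{}
};

static bool bad_spectre_microcode(struct cpuinfo_x86 *c)
{
	const struct x86_cpu_id *m = x86_match_cpu(spectre_bad_ucode);

	/* stepping would still be checked by hand in a real version */
	return m && c->microcode <= (u32)m->driver_data;
}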
Re: [PATCH 0/3] ftrace, orc, x86, tracing: Fix stack traces again
On Wed, 24 Jan 2018 08:46:50 +0100 Ingo Molnar wrote: > No fundamental objections from me, assuming they are well tested. > Yeah, I ran it through my ftrace test suite, and they did fine till I hit test 20 of 34, which tests static ftrace (CONFIG_DYNAMIC_FTRACE not set), and it crashed. I'm hoping to have it fixed and retested today. Thanks! -- Steve
Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
On 2018-01-09 17:47, Andrey Ryabinin wrote: > Attached user space program I used to see the difference. > Usage: > gcc -O2 -o strscpy strscpy_test.c > ./strscpy {b|w} src_str_len count > > src_str_len - length of source string in between 1-4096 > count - how many strscpy() to execute. > > Also I've noticed something strange. I'm not sure why, but certain > src_len values (e.g. 30) drive the branch predictor crazy, causing worse than > usual results > for byte-at-a-time copy: I see something similar, but at the 30->31 transition, and the branch-misses remain at 1-3% for higher values, until 42 where it drops back to 0%. Anyway, I highly doubt we do a lot of string copies of strings longer than 32. $ perf stat ./strscpy_test b 30 1000 Performance counter stats for './strscpy_test b 30 1000': 156,777082 task-clock (msec) #0,999 CPUs utilized 0 context-switches #0,000 K/sec 0 cpu-migrations#0,000 K/sec 48 page-faults #0,306 K/sec 584.646.177 cycles#3,729 GHz stalled-cycles-frontend stalled-cycles-backend 2.580.599.614 instructions #4,41 insns per cycle 660.114.283 branches # 4210,528 M/sec 4.891 branch-misses #0,00% of all branches 0,156970910 seconds time elapsed $ perf stat ./strscpy_test b 31 1000 Performance counter stats for './strscpy_test b 31 1000': 258,533250 task-clock (msec) #0,999 CPUs utilized 0 context-switches #0,000 K/sec 0 cpu-migrations#0,000 K/sec 50 page-faults #0,193 K/sec 965.505.138 cycles#3,735 GHz stalled-cycles-frontend stalled-cycles-backend 2.660.773.463 instructions #2,76 insns per cycle 680.141.051 branches # 2630,768 M/sec 19.150.367 branch-misses #2,82% of all branches 0,258725192 seconds time elapsed Rasmus
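For anyone without the attachment, a byte-at-a-time loop in the spirit of what is being measured could look like this (a sketch, not the kernel implementation; the real strscpy() returns -E2BIG on truncation, and the harness here is made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy until NUL or count bytes, always NUL-terminating dst. */
static long copy_bytewise(char *dst, const char *src, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		dst[i] = src[i];
		if (!dst[i])
			return i;
	}
	if (count)
		dst[count - 1] = '\0';
	return -1; /* truncated */
}

int main(int argc, char **argv)
{
	size_t len = argc > 1 ? (size_t)atoi(argv[1]) : 30;
	long iters = argc > 2 ? atol(argv[2]) : 10000000L;
	static char dst[4096];
	char *src = malloc(len + 1);

	memset(src, 'x', len);
	src[len] = '\0';
	while (iters--)
		copy_bytewise(dst, src, sizeof(dst));
	printf("%zu\n", strlen(dst)); /* keep the loop observable */
	return 0;
}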
Re: [PATCH] drm/bridge/synopsys: dsi: Adopt SPDX identifiers
2018-01-24 0:32 GMT+01:00 Laurent Pinchart : > Hi Philippe, > > On Tuesday, 23 January 2018 12:25:51 EET Philippe CORNU wrote: >> On 01/23/2018 12:30 AM, Laurent Pinchart wrote: >> > On Monday, 22 January 2018 12:26:08 EET Philippe Cornu wrote: >> >> Add SPDX identifiers to the Synopsys DesignWare MIPI DSI >> >> host controller driver. >> >> >> >> Signed-off-by: Philippe Cornu >> >> --- >> >> >> >> drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c | 6 +- >> >> 1 file changed, 1 insertion(+), 5 deletions(-) >> >> >> >> diff --git a/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c >> >> b/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c index >> >> 46b0e73404d1..e06836dec77c 100644 >> >> --- a/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c >> >> +++ b/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c >> >> @@ -1,12 +1,8 @@ >> >> +// SPDX-License-Identifier: GPL-2.0 >> > >> > According to Documentation/process/license-rules.txt this would change >> > the existing license. The correct identifier is GPL-2.0+. >> >> You are right, I did not put the correct identifier :( >> >> After reading more spdx.org, I wonder if the correct value should be >> GPL-2.0-or-later instead of GPL-2.0+ >> >> https://spdx.org/licenses/GPL-2.0-or-later.html >> https://spdx.org/licenses/GPL-2.0+.html >> >> What is your opinion? > > I agree in principle, and I've even asked for that before, but I've been told > that we should stick to the license identifiers defined in Documentation/ > process/license-rules.txt. The file might get updated to use GPL-2.0-or-later > and GPL-2.0-only later, and kernel sources will likely then get patched in one > go. + Philippe O. to check what I'm writing just below. In the -next branch I only see references to the GPL-2.0+ identifier, so for me it's fine to use it here. Is that right? Or should we use the GPL-2.0-or-later keyword? Benjamin > >> >> /* >> >> * Copyright (c) 2016, Fuzhou Rockchip Electronics Co., Ltd >> >> * Copyright (C) STMicroelectronics SA 2017 >> >> * >> >> - * This program is free software; you can redistribute it and/or modify >> >> - * it under the terms of the GNU General Public License as published by >> >> - * the Free Software Foundation; either version 2 of the License, or >> >> - * (at your option) any later version. >> >> - * >> >> * Modified by Philippe Cornu >> >> * This generic Synopsys DesignWare MIPI DSI host driver is based on the >> >> * Rockchip version from rockchip/dw-mipi-dsi.c with phy & bridge APIs. > > -- > Regards, > > Laurent Pinchart
Re: [PATCH 0/3] ftrace, orc, x86, tracing: Fix stack traces again
On 23.01.2018 20:32, Steven Rostedt wrote: > With the new ORC unwinder, ftrace stack tracing became dysfunctional. > > One was that ORC didn't know how to handle the ftrace callbacks in > general (which Josh fixed). The other was that ORC would just bail > if it hit a dynamically allocated trampoline. I added a check to > the ORC unwinder to see if the trampoline belonged to ftrace, and > if it did, use the orc entry of the static trampoline that was used > to create the dynamic one (it would be identical). > > Finally, I noticed that the skip values of the stack tracing are out > of whack. I went through and fixed them. > > Anyone have any issues with these patches? I'm starting my tests on > them now and if all goes well, I plan on pushing them to Linus > hopefully tonight. > > Thanks! FWIW, with this series ftrace with ORC now produces correct stacktraces both for events and function tracing. Furthermore, with the frame pointer unwinder I don't see chopped-off stack entries. > > -- Steve > > > Josh Poimboeuf (1): > x86/ftrace: Fix ORC unwinding from ftrace handlers > > Steven Rostedt (VMware) (2): > ftrace, orc, x86: Handle ftrace dynamically allocated trampolines > tracing: Update stack trace skipping for ORC unwinder > > > arch/x86/kernel/Makefile| 5 +++- > arch/x86/kernel/ftrace_64.S | 24 +++--- > arch/x86/kernel/unwind_orc.c| 48 +++- > include/linux/ftrace.h | 2 ++ > kernel/trace/ftrace.c | 26 +++- > kernel/trace/trace.c| 34 ++--- > kernel/trace/trace_events_trigger.c | 13 -- > kernel/trace/trace_functions.c | 49 > +++-- > 8 files changed, 150 insertions(+), 51 deletions(-) >
Re: backport Rewrite sync_core() to use IRET-to-self to stable kernels?
On Tue, Jan 23, 2018 at 08:25:33AM +0100, gre...@linuxfoundation.org wrote: > On Mon, Jan 22, 2018 at 07:06:29PM -0800, Andy Lutomirski wrote: > > On Mon, Jan 22, 2018 at 6:59 PM, Zhang, Ning A > > wrote: > > > hello, Greg, Andy, Thomas > > > > > > would you like to backport these two patches to LTS kernel? > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/arch/x86/include/asm/processor.h?h=v4.14.14&id=1c52d859cb2d417e7216d3e56bb7fea88444cec9 > > > > > > x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/arch/x86/include/asm/processor.h?h=v4.14.14&id=c198b121b1a1d7a7171770c634cd49191bac4477 > > > > > > x86/asm: Rewrite sync_core() to use IRET-to-self > > > > > > > I'd be in favor of backporting > > 1c52d859cb2d417e7216d3e56bb7fea88444cec9. I see no compelling reason > > to backport the other one, since it doesn't fix a bug. Greg, can you > > do this? > > I'll work on this after this round of stable kernels is released. Now applied, thanks. greg k-h
Re: backport Rewrite sync_core() to use IRET-to-self to stable kernels?
On Wed, Jan 24, 2018 at 08:19:44AM +, Zhang, Ning A wrote: > Thanks Greg, Andy > > How about c198b121b1a1d7a7171770c634cd49191bac4477? Why do you think that is needed in the older stable kernels? What problem does it solve? Do any distros have this patch in their kernels? thanks, greg k-h
Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
On Wed, 2018-01-24 at 09:47 +0100, Peter Zijlstra wrote: > Typically tglx likes to use x86_match_cpu() for these things; see also > commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to > errata"). Thanks, will fix. I think we might also end up in whitelist mode, adding "known good" microcodes to the list as they get released or retroactively blessed. I would really have liked a new bit in IA32_ARCH_CAPABILITIES to say that it's safe, but that's not possible for *existing* microcode which actually turns out to be OK in the end. That means the whitelist ends up basically empty right now. Should I add a command line parameter to override it? Otherwise we end up having to rebuild the kernel every time there's a microcode release which covers a new CPU SKU (which is why I kind of hate the whitelist, but Arjan is very insistent...) I'm kind of tempted to turn it into a whitelist just by adding 1 to the microcode revision in each table entry. Sure, that N+1 might be another microcode build that also has issues but never saw the light of day... but that's OK as long as it never *does*. And yes we'd have to tweak it if revisions that are blacklisted in the Intel doc are subsequently cleared. But at least it'd require *less* tweaking. > > > > + > > +static int bad_spectre_microcode(struct cpuinfo_x86 *c) > > +{ > > + int i; > > + > > + for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) { > > + if (c->x86_model == spectre_bad_microcodes[i].model && > > + c->x86_mask == spectre_bad_microcodes[i].stepping) > > + return (c->microcode <= > > spectre_bad_microcodes[i].microcode); > > + } > > + return 0; > > +} > The above is Intel only, you should check vendor too I think. It's in intel.c, called from early_init_intel(). Isn't that sufficient? > > > > static void early_init_intel(struct cpuinfo_x86 *c) > > { > > u64 misc_enable; > > @@ -122,6 +173,18 @@ static void early_init_intel(struct cpuinfo_x86 *c) > > if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64)) > > c->microcode = intel_get_microcode_revision(); > > > > + if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) || > > + cpu_has(c, X86_FEATURE_AMD_SPEC_CTRL) || > > + cpu_has(c, X86_FEATURE_AMD_PRED_CMD) || > > + cpu_has(c, X86_FEATURE_AMD_STIBP)) && bad_spectre_microcode(c)) { > > + pr_warn("Intel Spectre v2 broken microcode detected; disabling > > SPEC_CTRL\n"); > > + clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL); > > + clear_cpu_cap(c, X86_FEATURE_STIBP); > > + clear_cpu_cap(c, X86_FEATURE_AMD_SPEC_CTRL); > > + clear_cpu_cap(c, X86_FEATURE_AMD_PRED_CMD); > > + clear_cpu_cap(c, X86_FEATURE_AMD_STIBP); > > + } And since it's Intel-only, what are those AMD features doing there? Hypervisors which only want to expose PRED_CMD may do so using the AMD feature bit. SPEC_CTRL requires save/restore and live migration support, and isn't needed with retpoline anyway (since guests won't be calling directly into firmware).
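As an illustration of the command line override mentioned above, a minimal early_param() sketch (the parameter and variable names are made up):

static bool spectre_ucode_override __initdata;

/* "spectre_ucode_override" on the command line would make
 * bad_spectre_microcode() return 0 unconditionally. */
static int __init spectre_ucode_override_setup(char *str)
{
	spectre_ucode_override = true;
	return 0;
}
early_param("spectre_ucode_override", spectre_ucode_override_setup);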
[PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
An application running with libhugetlbfs fails to allocate additional pages to HEAP due to the hugemap being done unconditionally as a topdown mapping: mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8 [...] mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8 munmap(0x73d8, 1048576) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] As one can see from the above strace log, mmap() allocates further pages below the initial one because no space is available on top of it. This patch fixes it by requesting a bottomup mapping, as the non-generic hugetlb_get_unmapped_area() does. Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ") Signed-off-by: Christophe Leroy --- v3: Was a standalone patch before, but conflicts with this series. arch/powerpc/mm/hugetlbpage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 79e1378ee303..368ea6b248ad 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, return radix__hugetlb_get_unmapped_area(file, addr, len, pgoff, flags); #endif - return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1); + return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0); } #endif -- 2.13.3
[PATCH v3 4/5] powerpc/mm: Allow up to 64 low slices
While the implementation of the "slices" address space allows a significant number of high slices, it limits the number of low slices to 16 due to the use of a single u64 low_slices_psize element in struct mm_context_t. On the 8xx, the minimum slice size is the size of the area covered by a single PMD entry, i.e. 4M in 4K pages mode and 64M in 16K pages mode. This means we could have at least 64 slices. In order to override this limitation, this patch switches the handling of low_slices_psize to a char array, as already done for high_slices_psize. This allows increasing the number of low slices to 64 on the 8xx. Signed-off-by: Christophe Leroy --- v2: Using slice_bitmap_xxx() macros instead of bitmap_xxx() functions. v3: keep low_slices as a u64; this allows 64 slices, which is enough. arch/powerpc/include/asm/book3s/64/mmu.h | 3 +- arch/powerpc/include/asm/mmu-8xx.h | 7 +++- arch/powerpc/include/asm/paca.h | 2 +- arch/powerpc/include/asm/slice.h | 1 - arch/powerpc/include/asm/slice_32.h | 2 ++ arch/powerpc/include/asm/slice_64.h | 2 ++ arch/powerpc/kernel/paca.c | 3 +- arch/powerpc/mm/hash_utils_64.c | 13 arch/powerpc/mm/slb_low.S| 8 +++-- arch/powerpc/mm/slice.c | 57 +--- 10 files changed, 56 insertions(+), 42 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index c9448e19847a..b076a2d74c69 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -91,7 +91,8 @@ typedef struct { struct npu_context *npu_context; #ifdef CONFIG_PPC_MM_SLICES - u64 low_slices_psize; /* SLB page size encodings */ +/* SLB page size encodings*/ + unsigned char low_slices_psize[BITS_PER_LONG / BITS_PER_BYTE]; unsigned char high_slices_psize[SLICE_ARRAY_SIZE]; unsigned long slb_addr_limit; #else diff --git a/arch/powerpc/include/asm/mmu-8xx.h b/arch/powerpc/include/asm/mmu-8xx.h index 5f89b6010453..5f37ba06b56c 100644 --- a/arch/powerpc/include/asm/mmu-8xx.h +++ b/arch/powerpc/include/asm/mmu-8xx.h @@ -164,6 +164,11 @@ */ #define SPRN_M_TW 799 +#ifdef CONFIG_PPC_MM_SLICES +#include +#define SLICE_ARRAY_SIZE (1 << (32 - SLICE_LOW_SHIFT - 1)) +#endif + #ifndef __ASSEMBLY__ typedef struct { unsigned int id; @@ -171,7 +176,7 @@ typedef struct { unsigned long vdso_base; #ifdef CONFIG_PPC_MM_SLICES u16 user_psize; /* page size index */ - u64 low_slices_psize; /* page size encodings */ + unsigned char low_slices_psize[SLICE_ARRAY_SIZE]; unsigned char high_slices_psize[0]; unsigned long slb_addr_limit; #endif diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 23ac7fc0af23..a3e531fe9ac7 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -141,7 +141,7 @@ struct paca_struct { #ifdef CONFIG_PPC_BOOK3S mm_context_id_t mm_ctx_id; #ifdef CONFIG_PPC_MM_SLICES - u64 mm_ctx_low_slices_psize; + unsigned char mm_ctx_low_slices_psize[BITS_PER_LONG / BITS_PER_BYTE]; unsigned char mm_ctx_high_slices_psize[SLICE_ARRAY_SIZE]; unsigned long mm_ctx_slb_addr_limit; #else diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h index 2b4b70de7e71..b67ba8faa507 100644 --- a/arch/powerpc/include/asm/slice.h +++ b/arch/powerpc/include/asm/slice.h @@ -16,7 +16,6 @@ #define HAVE_ARCH_UNMAPPED_AREA #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN -#define SLICE_LOW_SHIFT28 #define SLICE_LOW_TOP (0x1ull) #define SLICE_NUM_LOW (SLICE_LOW_TOP >> SLICE_LOW_SHIFT) #define GET_LOW_SLICE_INDEX(addr) ((addr) >> SLICE_LOW_SHIFT) diff --git
a/arch/powerpc/include/asm/slice_32.h b/arch/powerpc/include/asm/slice_32.h index 7e27c0dfb913..349187c20100 100644 --- a/arch/powerpc/include/asm/slice_32.h +++ b/arch/powerpc/include/asm/slice_32.h @@ -2,6 +2,8 @@ #ifndef _ASM_POWERPC_SLICE_32_H #define _ASM_POWERPC_SLICE_32_H +#define SLICE_LOW_SHIFT26 /* 64 slices */ + #define SLICE_HIGH_SHIFT 0 #define SLICE_NUM_HIGH 0ul #define GET_HIGH_SLICE_INDEX(addr) (addr & 0) diff --git a/arch/powerpc/include/asm/slice_64.h b/arch/powerpc/include/asm/slice_64.h index 9d1c97b83010..0959475239c6 100644 --- a/arch/powerpc/include/asm/slice_64.h +++ b/arch/powerpc/include/asm/slice_64.h @@ -2,6 +2,8 @@ #ifndef _ASM_POWERPC_SLICE_64_H #define _ASM_POWERPC_SLICE_64_H +#define SLICE_LOW_SHIFT28 + #define SLICE_HIGH_SHIFT 40 #define SLICE_NUM_HIGH (H_PGTABLE_RANGE >> SLICE_HIGH_SHIFT) #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index d6597038931d..8e1566bf82b8 100644 --
Re: [PATCH 4.9] usbip: prevent vhci_hcd driver from leaking a socket pointer address
On Tue, Jan 23, 2018 at 07:30:06PM -0700, Shuah Khan wrote: > commit 2f2d0088eb93 ("usbip: prevent vhci_hcd driver from leaking a > socket pointer address") Now queued up for 4.4 and 4.9, thanks. greg k-h
Re: [PATCH 4/4] rtc: isl1208: add support for isl1219 with hwmon for tamper detection
On Tue, Jan 23, 2018 at 10:22:54AM -0800, Guenter Roeck wrote: > On Tue, Jan 23, 2018 at 01:18:01PM +0100, Michael Grzeschik wrote: > > We add support for the ISL1219 chip that has an integrated tamper > > detection function. This patch implements the feature by using a hwmon > > interface. > > > > The ISL1219 can also record the timestamp of the intrusion > > event. For this we add the documentation of the new interface > > intrusion[0-*]_timestamp. > > > > The devicetree documentation for the ISL1219 device tree > > binding is added with a short example. > > > > Signed-off-by: Michael Grzeschik > > Signed-off-by: Denis Osterland > > --- > > .../rtc/{intersil,isl1208.txt => isil,isl1208.txt} | 18 +- > > Documentation/hwmon/sysfs-interface| 7 + > > drivers/rtc/rtc-isl1208.c | 190 > > +++-- > > 3 files changed, 201 insertions(+), 14 deletions(-) > > rename Documentation/devicetree/bindings/rtc/{intersil,isl1208.txt => > > isil,isl1208.txt} (57%) > > > > diff --git a/Documentation/devicetree/bindings/rtc/intersil,isl1208.txt > > b/Documentation/devicetree/bindings/rtc/isil,isl1208.txt > > similarity index 57% > > rename from Documentation/devicetree/bindings/rtc/intersil,isl1208.txt > > rename to Documentation/devicetree/bindings/rtc/isil,isl1208.txt > > index a54e99feae1ca..d549699e1cfc4 100644 > > --- a/Documentation/devicetree/bindings/rtc/intersil,isl1208.txt > > +++ b/Documentation/devicetree/bindings/rtc/isil,isl1208.txt > > @@ -1,14 +1,21 @@ > > -Intersil ISL1208, ISL1218 I2C RTC/Alarm chip > > +Intersil ISL1208, ISL1218, ISL1219 I2C RTC/Alarm chip > > > > ISL1208 is a trivial I2C device (it has simple device tree bindings, > > consisting of a compatible field, an address and possibly an interrupt > > line). > > > > +ISL1219 supports tamper detection user space representation through > > +a case intrusion hwmon sensor. > > +ISL1219 has additional pins EVIN and #EVDET for tamper detection. > > +I2C devices support only one irq. #IRQ and #EVDET are open-drain active > > low, > > +so it is possible to route them to one SoC pin with a pull-up. > > + > > Required properties supported by the device: > > > > - "compatible": must be one of > > "isil,isl1208" > > "isil,isl1218" > > + "isil,isl1219" > > - "reg": I2C bus address of the device > > > > Optional properties: > > @@ -33,3 +40,12 @@ Example isl1208 node with #IRQ pin connected to SoC > > gpio1 pin 12: > > interrupt-parent = <&gpio1>; > > interrupts = <12 IRQ_TYPE_EDGE_FALLING>; > > }; > > + > > +Example isl1219 node with #IRQ pin and #EVDET pin connected to SoC gpio1 > > pin 12: > > + > > + isl1219: isl1219@68 { > > + compatible = "intersil,isl1219"; > > + reg = <0x68>; > > + interrupts-extended = <&gpio1 12 IRQ_TYPE_EDGE_FALLING>; > > + }; > > + > > diff --git a/Documentation/hwmon/sysfs-interface > > b/Documentation/hwmon/sysfs-interface > > index fc337c317c673..a12b3c2b2a18c 100644 > > --- a/Documentation/hwmon/sysfs-interface > > +++ b/Documentation/hwmon/sysfs-interface > > @@ -702,6 +702,13 @@ intrusion[0-*]_alarm > > the user. This is done by writing 0 to the file. Writing > > other values is unsupported. > > > > +intrusion[0-*]_timestamp > > + Chassis intrusion detection > > + -MM-DD HH:MM:SS UTC (ts.sec): intrusion detected > > + RO > > + The corresponding timestamp on which the intrusion > > + was detected. > > + > > Sneaky. Nack. You don't just add attributes to the ABI because you want it, without serious discussion, and much less so hidden in an RTC driver (and even less as an unparseable attribute).
Right; but it was not meant to be sneaky. I should have stuck to my first thought and labeled this patch RFC. Sorry for that. > In addition to that, I consider the attribute unnecessary. The intrusion > already generates an event which should be sufficient for all practical > purposes. Would it make sense to put it in between the other sysfs attributes of this driver? Thanks, Michael > > intrusion[0-*]_beep > > Chassis intrusion beep > > 0: disable > > diff --git a/drivers/rtc/rtc-isl1208.c b/drivers/rtc/rtc-isl1208.c > > index a13a4ba79004d..6e4d24614d98b 100644 > > --- a/drivers/rtc/rtc-isl1208.c > > +++ b/drivers/rtc/rtc-isl1208.c > > @@ -14,6 +14,8 @@ > > #include > > #include > > #include > > +#include > > +#include > > > > /* Register map */ > > /* rtc section */ > > @@ -33,6 +35,7 @@ > > #define ISL1208_REG_SR_ARST(1<<7) /* auto reset */ > > #define ISL1208_REG_SR_XTOSCB (1<<6) /* crystal oscillator */ > > #define ISL1208_REG_SR_WRTC(1<<4) /* write rtc */ > > +#define ISL1208_REG_SR_EVT (1<<3) /* event */ > > #define ISL1208_REG_SR_ALM (1<<2) /* alarm */ > > #define ISL1208_REG_SR_BAT (1<<1) /* battery */ >
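If it did move next to the driver's existing attributes, the shape would presumably follow the driver's other DEVICE_ATTR entries, roughly like this (a sketch only; register handling and naming are assumptions):

static ssize_t
isl1208_sysfs_show_timestamp0(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	/* reading of the event timestamp registers omitted in this sketch */
	return sprintf(buf, "%lld\n", 0LL);
}
static DEVICE_ATTR(timestamp0, S_IRUGO, isl1208_sysfs_show_timestamp0, NULL);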
[PATCH v3 1/5] powerpc/mm: Remove intermediate bitmap copy in 'slices'
bitmap_or() and bitmap_andnot() can work properly with dst identical to src1 or src2. There is no need for an intermediate result bitmap that is copied back to dst in a second step. Signed-off-by: Christophe Leroy --- v2: New in v2 v3: patch moved up front of the series to avoid an ephemeral slice_bitmap_copy() function in the following patch arch/powerpc/mm/slice.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 23ec2c5e3b78..98b53d48968f 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -388,21 +388,17 @@ static unsigned long slice_find_area(struct mm_struct *mm, unsigned long len, static inline void slice_or_mask(struct slice_mask *dst, struct slice_mask *src) { - DECLARE_BITMAP(result, SLICE_NUM_HIGH); - dst->low_slices |= src->low_slices; - bitmap_or(result, dst->high_slices, src->high_slices, SLICE_NUM_HIGH); - bitmap_copy(dst->high_slices, result, SLICE_NUM_HIGH); + bitmap_or(dst->high_slices, dst->high_slices, src->high_slices, + SLICE_NUM_HIGH); } static inline void slice_andnot_mask(struct slice_mask *dst, struct slice_mask *src) { - DECLARE_BITMAP(result, SLICE_NUM_HIGH); - dst->low_slices &= ~src->low_slices; - bitmap_andnot(result, dst->high_slices, src->high_slices, SLICE_NUM_HIGH); - bitmap_copy(dst->high_slices, result, SLICE_NUM_HIGH); + bitmap_andnot(dst->high_slices, dst->high_slices, src->high_slices, + SLICE_NUM_HIGH); } #ifdef CONFIG_PPC_64K_PAGES -- 2.13.3
[PATCH v3 2/5] powerpc/mm: Enhance 'slice' for supporting PPC32
In preparation for the following patch, which will fix an issue on the 8xx by re-using the 'slices', this patch enhances the 'slices' implementation to support 32-bit CPUs. On PPC32, the address space is limited to 4Gbytes, hence only the low slices will be used. This patch moves the "slices" function prototypes from page_64.h to slice.h The high slices use bitmaps. As bitmap functions are not prepared to handle bitmaps of size 0, the bitmap_xxx() calls are wrapped into slice_bitmap_xxx() functions, which do nothing on PPC32 Signed-off-by: Christophe Leroy --- v2: First patch of v1 serie split in two parts ; added slice_bitmap_xxx() macros. v3: Moving slice related stuff into slice.h and slice_32/64.h slice_bitmap_xxx() are now static inline functions and platform dependent SLICE_LOW_TOP declared ull on PPC32 with correct casts allows to keep it 0x1 arch/powerpc/include/asm/page.h | 1 + arch/powerpc/include/asm/page_64.h | 59 -- arch/powerpc/include/asm/slice.h| 63 + arch/powerpc/include/asm/slice_32.h | 56 + arch/powerpc/include/asm/slice_64.h | 61 +++ arch/powerpc/mm/slice.c | 38 -- 6 files changed, 203 insertions(+), 75 deletions(-) create mode 100644 arch/powerpc/include/asm/slice.h create mode 100644 arch/powerpc/include/asm/slice_32.h create mode 100644 arch/powerpc/include/asm/slice_64.h diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index 8da5d4c1cab2..d5f1c41b7dba 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -344,5 +344,6 @@ typedef struct page *pgtable_t; #include #endif /* __ASSEMBLY__ */ +#include #endif /* _ASM_POWERPC_PAGE_H */ diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h index 56234c6fcd61..af04acdb873f 100644 --- a/arch/powerpc/include/asm/page_64.h +++ b/arch/powerpc/include/asm/page_64.h @@ -86,65 +86,6 @@ extern u64 ppc64_pft_size; #endif /* __ASSEMBLY__ */ -#ifdef CONFIG_PPC_MM_SLICES - -#define SLICE_LOW_SHIFT28 -#define SLICE_HIGH_SHIFT 40 - -#define SLICE_LOW_TOP (0x1ul) -#define SLICE_NUM_LOW (SLICE_LOW_TOP >> SLICE_LOW_SHIFT) -#define SLICE_NUM_HIGH (H_PGTABLE_RANGE >> SLICE_HIGH_SHIFT) - -#define GET_LOW_SLICE_INDEX(addr) ((addr) >> SLICE_LOW_SHIFT) -#define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT) - -#ifndef __ASSEMBLY__ -struct mm_struct; - -extern unsigned long slice_get_unmapped_area(unsigned long addr, -unsigned long len, -unsigned long flags, -unsigned int psize, -int topdown); - -extern unsigned int get_slice_psize(struct mm_struct *mm, - unsigned long addr); - -extern void slice_set_user_psize(struct mm_struct *mm, unsigned int psize); -extern void slice_set_range_psize(struct mm_struct *mm, unsigned long start, - unsigned long len, unsigned int psize); - -#endif /* __ASSEMBLY__ */ -#else -#define slice_init() -#ifdef CONFIG_PPC_BOOK3S_64 -#define get_slice_psize(mm, addr) ((mm)->context.user_psize) -#define slice_set_user_psize(mm, psize)\ -do { \ - (mm)->context.user_psize = (psize); \ - (mm)->context.sllp = SLB_VSID_USER | mmu_psize_defs[(psize)].sllp; \ -} while (0) -#else /* !CONFIG_PPC_BOOK3S_64 */ -#ifdef CONFIG_PPC_64K_PAGES -#define get_slice_psize(mm, addr) MMU_PAGE_64K -#else /* CONFIG_PPC_64K_PAGES */ -#define get_slice_psize(mm, addr) MMU_PAGE_4K -#endif /* !CONFIG_PPC_64K_PAGES */ -#define slice_set_user_psize(mm, psize)do { BUG(); } while(0) -#endif /* CONFIG_PPC_BOOK3S_64 */ - -#define slice_set_range_psize(mm, start, len, psize) \ - slice_set_user_psize((mm), (psize)) -#endif /* CONFIG_PPC_MM_SLICES */ - -#ifdef
CONFIG_HUGETLB_PAGE - -#ifdef CONFIG_PPC_MM_SLICES -#define HAVE_ARCH_HUGETLB_UNMAPPED_AREA -#endif - -#endif /* !CONFIG_HUGETLB_PAGE */ - #define VM_DATA_DEFAULT_FLAGS \ (is_32bit_task() ? \ VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64) diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h new file mode 100644 index ..2b4b70de7e71 --- /dev/null +++ b/arch/powerpc/include/asm/slice.h @@ -0,0 +1,63 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_SLICE_H +#define _ASM_POWERPC_SLICE_H + +#ifdef CONFIG_PPC_MM_SLICES + +#ifdef CONFIG_PPC64 +#include +#else +#include +#endif + +#ifdef CONFIG_HUGETLB_PAGE +#define HAVE_ARCH_HUGETLB_UNMAPPED_AREA +#endif +#define HAVE_ARCH_UNMAPPED_AREA +#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN +
[PATCH v3 3/5] powerpc/32: Fix hugepage allocation on 8xx at hint address
On the 8xx, the page size is set in the PMD entry and applies to all pages of the page table pointed to by the said PMD entry. When an app has some regular pages allocated (e.g. see below) and tries to mmap() a huge page at a hint address covered by the same PMD entry, the kernel accepts the hint although the 8xx cannot handle different page sizes in the same PMD entry. 1000-10001000 r-xp 00:0f 2597 /root/malloc 1001-10011000 rwxp 00:0f 2597 /root/malloc mmap(0x1008, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x1008 This results in the app remaining forever in do_page_fault()/hugetlb_fault() and when interrupting that app, we get the following warning: [162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4 [162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W 4.14.6 #85 [162980.035744] task: c67e2c00 task.stack: c668e000 [162980.035783] NIP: c000fe18 LR: c00e1eec CTR: c00f90c0 [162980.035830] REGS: c668fc20 TRAP: 0700 Tainted: G W(4.14.6) [162980.035854] MSR: 00029032 CR: 24044224 XER: 2000 [162980.036003] [162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 0010 c6869410 1008 77fb4000 [162980.036003] GPR08: 0001 0683c001 ff80 44028228 10018a34 4008 418004fc [162980.036003] GPR16: c668e000 00040100 c668e000 c06c c668fe78 c668e000 c6835ba0 c668fd48 [162980.036003] GPR24: 73ff 7400 0001 77fb4000 100f 1010 1010 [162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4 [162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150 [162980.036861] Call Trace: [162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable) [162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150 [162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4 [162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8 [162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c [162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc [162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614 [162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244 [162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88 [162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4 [162980.037781] Instruction dump: [162980.037821] 7fdff378 8137 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94 [162980.038014] 2f89 419e0010 712a0ff0 408200e0 <0fe0> 54a9000a 7f984840 419d0094 [162980.038216] ---[ end trace c0ceeca8e7a5800a ]--- [162980.038754] BUG: non-zero nr_ptes on freeing mm: 1 [162985.363322] BUG: non-zero nr_ptes on freeing mm: -1 In order to fix this, this patch uses the address space "slices" implemented for BOOK3S/64 and enhanced to support PPC32 by the preceding patch. This patch modifies the context.id on the 8xx to be in the range [1:16] instead of [0:15] in order to identify context.id == 0 as an uninitialised context, as done on BOOK3S. This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is selected for the 8xx. Although we could in theory have as many slices as PMD entries, the current slices implementation limits the number of low slices to 16. This limitation does not prevent us from fixing the initial issue, although it is suboptimal. It will be cured in a subsequent patch.
Fixes: 4b91428699477 ("powerpc/8xx: Implement support of hugepages") Signed-off-by: Christophe Leroy --- v2: First patch of v1 serie split in two parts v3: No changes arch/powerpc/include/asm/mmu-8xx.h | 6 ++ arch/powerpc/kernel/setup-common.c | 2 ++ arch/powerpc/mm/8xx_mmu.c | 2 +- arch/powerpc/mm/hugetlbpage.c | 2 ++ arch/powerpc/mm/mmu_context_nohash.c | 18 -- arch/powerpc/platforms/Kconfig.cputype | 1 + 6 files changed, 28 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/mmu-8xx.h b/arch/powerpc/include/asm/mmu-8xx.h index 5bb3dbede41a..5f89b6010453 100644 --- a/arch/powerpc/include/asm/mmu-8xx.h +++ b/arch/powerpc/include/asm/mmu-8xx.h @@ -169,6 +169,12 @@ typedef struct { unsigned int id; unsigned int active; unsigned long vdso_base; +#ifdef CONFIG_PPC_MM_SLICES + u16 user_psize; /* page size index */ + u64 low_slices_psize; /* page size encodings */ + unsigned char high_slices_psize[0]; + unsigned long slb_addr_limit; +#endif } mm_context_t; #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff8) diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 8fd3a70047f1..edf98ea92035 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -916,6 +916,8 @@ void __init setup_arch(char **cmdline_p) #ifdef CONFIG_PPC64 if (!radix_enabled()) init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64; +#elif defined(CONFIG_PPC_8xx)
[PATCH RFC 0/3] API for 128-bit IO access
Hi all, This series adds an API for 128-bit memory IO access and enables it for ARM64. The original motivation for the 128-bit API came from a new Cavium network device driver. The hardware requires 128-bit access to make things work. See the description in patch 3 for details. Also, starting from ARMv8.4, stp and ldp instructions become atomic, and an API for 128-bit access would be helpful in core arm64 code. This series is an RFC. I'd like to collect opinions on the idea and implementation details. * I didn't implement all 128-bit operations existing for 64-bit variables and other types (__swab128p etc). Do we need them all right now, or can we add them when actually needed? * The u128 name is already used in crypto code. So here I use __uint128_t that comes from GCC for 128-bit types. Should I rename the existing type in crypto and make core code for 128-bit variables consistent with u64, u32 etc? (I think yes, but would like to ask the crypto people about it.) * Some compilers don't support __uint128_t, so I protected all generic code with config option HAVE_128BIT_ACCESS. I think it's OK, but... * For 128-bit read/write functions I use the suffix 'o', which means read/write the octet of bytes. Is this name OK? * my mips-linux-gnu-gcc v6.3.0 doesn't support __uint128_t, and I don't have another BE setup on hand, so the BE case is formally not tested. The BE code for arm64 looks fine though. With all that, this example code: static int __init 128bit_test(void) { __uint128_t v; __uint128_t addr; __uint128_t val = (__uint128_t) 0x1234567890abc; val |= ((__uint128_t) 0xdeadbeaf) << 64; writeo(val, &addr); v = reado(&addr); pr_err("%llx%llx\n", (u64) (val >> 64), (u64) val); pr_err("%llx%llx\n", (u64) (v >> 64), (u64) v); return v != val; } Generates this listing for arm64-le: <128bit_test>: 0: a9bb7bfdstp x29, x30, [sp, #-80]! 4: 910003fdmov x29, sp 8: a90153f3stp x19, x20, [sp, #16] c: a9025bf5stp x21, x22, [sp, #32] 10: f9001bf7str x23, [sp, #48] 14: d5033e9fdsb st 18: d2815797mov x23, #0xabc // #2748 1c: d297d5f6mov x22, #0xbeaf// #48815 20: f2acf137movkx23, #0x6789, lsl #16 24: f2bbd5b6movkx22, #0xdead, lsl #16 28: f2c468b7movkx23, #0x2345, lsl #32 2c: f2e00037movkx23, #0x1, lsl #48 30: a9045bb7stp x23, x22, [x29, #64] 34: a94453b3ldp x19, x20, [x29, #64] 38: d5033d9fdsb ld 3c: 9015adrpx21, 0 <128bit_test> 40: 910002b5add x21, x21, #0x0 44: aa1703e2mov x2, x23 48: aa1603e1mov x1, x22 4c: aa1503e0mov x0, x21 50: 9400bl 0 54: aa1303e2mov x2, x19 58: aa1403e1mov x1, x20 5c: ca170273eor x19, x19, x23 60: ca160294eor x20, x20, x22 64: aa1503e0mov x0, x21 68: aa140273orr x19, x19, x20 6c: 9400bl 0 70: f9401bf7ldr x23, [sp, #48] 74: f100027fcmp x19, #0x0 78: a94153f3ldp x19, x20, [sp, #16] 7c: 1a9f07e0csetw0, ne // ne = any 80: a9425bf5ldp x21, x22, [sp, #32] 84: a8c57bfdldp x29, x30, [sp], #80 88: d65f03c0ret And for arm64-be: <128bit_test>: 0: a9bb7bfdstp x29, x30, [sp, #-80]!
4: 910003fdmov x29, sp 8: a90153f3stp x19, x20, [sp, #16] c: a9025bf5stp x21, x22, [sp, #32] 10: f9001bf7str x23, [sp, #48] 14: d5033e9fdsb st 18: d2802001mov x1, #0x100 // #256 1c: d2d5bbc0mov x0, #0xadde // #191168994344960 20: f2a8a461movkx1, #0x4523, lsl #16 24: f2f5f7c0movkx0, #0xafbe, lsl #48 28: f2d12ce1movkx1, #0x8967, lsl #32 2c: f2f78141movkx1, #0xbc0a, lsl #48 30: a90407a0stp x0, x1, [x29, #64] 34: a94453b3ldp x19, x20, [x29, #64] 38: dac00e73rev x19, x19 3c: dac00e94rev x20, x20 40: d5033d9fdsb ld 44: d2815796mov x22, #0xabc // #2748 48: 9015adrpx21, 0 <128bit_test> 4c: f2acf136movkx22, #0x6789, lsl #16 50: 910002b5add x21, x21, #0x0 54: f2c468b6movkx22, #0x2345, lsl #32 58: d297d5f7mov x23, #0xbeaf// #48815 5c: f2e00036movkx22, #0x1, lsl #48 60: f2bbd5b7movkx23, #0xdead, lsl #16 64: aa1603e2mov x2, x22 68: aa1703e1mov x1, x23 6c:
[PATCH 1/3] UAPI: Introduce 128-bit types and byteswap operations
Architectures like arm64 support 128-bit integer types and operations. This patch introduces corresponding types and __swab128() operation for be/le conversions. They are required to implement 128-bit access to the memory, in following patches. Signed-off-by: Yury Norov --- include/linux/byteorder/generic.h| 8 include/uapi/asm-generic/int-ll64.h | 8 include/uapi/linux/byteorder/big_endian.h| 4 include/uapi/linux/byteorder/little_endian.h | 8 include/uapi/linux/swab.h| 22 ++ include/uapi/linux/types.h | 4 6 files changed, 54 insertions(+) diff --git a/include/linux/byteorder/generic.h b/include/linux/byteorder/generic.h index 451aaa0786ae..aa61662ee3dc 100644 --- a/include/linux/byteorder/generic.h +++ b/include/linux/byteorder/generic.h @@ -85,12 +85,20 @@ #define cpu_to_le64 __cpu_to_le64 #define le64_to_cpu __le64_to_cpu +#ifdef CONFIG_HAVE_128BIT_ACCESS +#define cpu_to_le128 __cpu_to_le128 +#define le128_to_cpu __le128_to_cpu +#endif #define cpu_to_le32 __cpu_to_le32 #define le32_to_cpu __le32_to_cpu #define cpu_to_le16 __cpu_to_le16 #define le16_to_cpu __le16_to_cpu #define cpu_to_be64 __cpu_to_be64 #define be64_to_cpu __be64_to_cpu +#ifdef CONFIG_HAVE_128BIT_ACCESS +#define cpu_to_be128 __cpu_to_be128 +#define be128_to_cpu __be128_to_cpu +#endif #define cpu_to_be32 __cpu_to_be32 #define be32_to_cpu __be32_to_cpu #define cpu_to_be16 __cpu_to_be16 diff --git a/include/uapi/asm-generic/int-ll64.h b/include/uapi/asm-generic/int-ll64.h index 1ed06964257c..4bc2241988a9 100644 --- a/include/uapi/asm-generic/int-ll64.h +++ b/include/uapi/asm-generic/int-ll64.h @@ -29,9 +29,17 @@ typedef unsigned int __u32; #ifdef __GNUC__ __extension__ typedef __signed__ long long __s64; __extension__ typedef unsigned long long __u64; +#ifdef CONFIG_HAVE_128BIT_ACCESS +__extension__ typedef __int128_t __s128; +__extension__ typedef __uint128_t __u128; +#endif #else typedef __signed__ long long __s64; typedef unsigned long long __u64; +#ifdef CONFIG_HAVE_128BIT_ACCESS +typedef __int128_t __s128; +typedef __uint128_t __u128; +#endif #endif #endif /* __ASSEMBLY__ */ diff --git a/include/uapi/linux/byteorder/big_endian.h b/include/uapi/linux/byteorder/big_endian.h index 2199adc6a6c2..28a69ec10dd2 100644 --- a/include/uapi/linux/byteorder/big_endian.h +++ b/include/uapi/linux/byteorder/big_endian.h @@ -30,6 +30,10 @@ #define __constant_be16_to_cpu(x) ((__force __u16)(__be16)(x)) #define __cpu_to_le64(x) ((__force __le64)__swab64((x))) #define __le64_to_cpu(x) __swab64((__force __u64)(__le64)(x)) +#ifdef CONFIG_HAVE_128BIT_ACCESS +#define __cpu_to_le128(x) ((__force __le128)__swab128((x))) +#define __le128_to_cpu(x) __swab128((__force __u128)(__le128)(x)) +#endif #define __cpu_to_le32(x) ((__force __le32)__swab32((x))) #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x)) #define __cpu_to_le16(x) ((__force __le16)__swab16((x))) diff --git a/include/uapi/linux/byteorder/little_endian.h b/include/uapi/linux/byteorder/little_endian.h index 601c904fd5cd..15365bd0fe29 100644 --- a/include/uapi/linux/byteorder/little_endian.h +++ b/include/uapi/linux/byteorder/little_endian.h @@ -18,6 +18,10 @@ #define __constant_ntohs(x) ___constant_swab16((__force __be16)(x)) #define __constant_cpu_to_le64(x) ((__force __le64)(__u64)(x)) #define __constant_le64_to_cpu(x) ((__force __u64)(__le64)(x)) +#ifdef CONFIG_HAVE_128BIT_ACCESS +#define __constant_cpu_to_le128(x) ((__force __le128)(__u128)(x)) +#define __constant_le128_to_cpu(x) ((__force __u128)(__le128)(x)) +#endif #define __constant_cpu_to_le32(x) ((__force 
__le32)(__u32)(x)) #define __constant_le32_to_cpu(x) ((__force __u32)(__le32)(x)) #define __constant_cpu_to_le16(x) ((__force __le16)(__u16)(x)) @@ -30,6 +34,10 @@ #define __constant_be16_to_cpu(x) ___constant_swab16((__force __u16)(__be16)(x)) #define __cpu_to_le64(x) ((__force __le64)(__u64)(x)) #define __le64_to_cpu(x) ((__force __u64)(__le64)(x)) +#ifdef CONFIG_HAVE_128BIT_ACCESS +#define __cpu_to_le128(x) ((__force __le128)(__u128)(x)) +#define __le128_to_cpu(x) ((__force __u128)(__le128)(x)) +#endif #define __cpu_to_le32(x) ((__force __le32)(__u32)(x)) #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) #define __cpu_to_le16(x) ((__force __le16)(__u16)(x)) diff --git a/include/uapi/linux/swab.h b/include/uapi/linux/swab.h index 23cd84868cc3..a7e97eb06a3e 100644 --- a/include/uapi/linux/swab.h +++ b/include/uapi/linux/swab.h @@ -75,6 +75,20 @@ static inline __attribute_const__ __u64 __fswab64(__u64 val) #endif } +#ifdef CONFIG_HAVE_128BIT_ACCESS +static inline __attribute_const__ __u128 __fswab128(__u128 val) +{ +#if defined(__arch_swab128) + return __arch_swab128(val); +#else + __u64 h = (__u64) (val >> 64); + __u64 l = (__u64) val; + + return (((__u128)__fswab64(l)) << 64) | (__u128)(__fswab64(h)); +#endif +} +#endif +
[PATCH 3/3] arm64: enable 128-bit memory read/write support
Introduce __raw_writeo(), __raw_reado() and other arch-specific RW functions for 128-bit memory access, and enable it for arm64. 128-bit I/O is required for example by Octeon TX2 device to access some registers. According to Hardware Reference Manual: A 128-bit write to the OP_FREE0/1 registers frees a pointer into a given [...] pool. All other accesses to these registers (e.g. reads and 64-bit writes) are RAZ/WI. Starting from ARMv8.4, stp and ldp instructions become atomic, and API for 128-bit access would be helpful for core code. Signed-off-by: Yury Norov --- arch/Kconfig| 7 +++ arch/arm64/include/asm/io.h | 31 +++ 2 files changed, 38 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index 76c0b54443b1..2baff7de405d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -116,6 +116,13 @@ config UPROBES managed by the kernel and kept transparent to the probed application. ) +config HAVE_128BIT_ACCESS + def_bool ARM64 + help + Architectures having 128-bit access require corresponding APIs, + like reado() and writeo(), which stands for reading and writing + the octet of bytes at once. + config HAVE_64BIT_ALIGNED_ACCESS def_bool 64BIT && !HAVE_EFFICIENT_UNALIGNED_ACCESS help diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 35b2e50f17fb..7c5d834abfd8 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -60,6 +60,18 @@ static inline void __raw_writeq(u64 val, volatile void __iomem *addr) asm volatile("str %x0, [%1]" : : "rZ" (val), "r" (addr)); } +#define __raw_writeo __raw_writeo +static inline void __raw_writeo(__uint128_t val, volatile void __iomem *addr) +{ + u64 l = (u64) val; + u64 h = (u64) (val >> 64); + __uint128_t *__addr = (__uint128_t *) addr; + + asm volatile("stp %x[x0], %x[x1], %x[p1]" +: [p1]"=Ump"(*__addr) +: [x0]"r"(l), [x1]"r"(h)); +} + #define __raw_readb __raw_readb static inline u8 __raw_readb(const volatile void __iomem *addr) { @@ -105,6 +117,19 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) return val; } +#define __raw_reado __raw_reado +static inline __uint128_t __raw_reado(const volatile void __iomem *addr) +{ + u64 l, h; + __uint128_t *__addr = (__uint128_t *) addr; + + asm volatile("ldp %x[x0], %x[x1], %x[p1]" +: [x0]"=r"(l), [x1]"=r"(h) +: [p1]"Ump"(*__addr)); + + return (__uint128_t) l | ((__uint128_t) h) << 64; +} + /* IO barriers */ #define __iormb() rmb() #define __iowmb() wmb() @@ -120,11 +145,13 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) #define readw_relaxed(c) ({ u16 __r = le16_to_cpu((__force __le16)__raw_readw(c)); __r; }) #define readl_relaxed(c) ({ u32 __r = le32_to_cpu((__force __le32)__raw_readl(c)); __r; }) #define readq_relaxed(c) ({ u64 __r = le64_to_cpu((__force __le64)__raw_readq(c)); __r; }) +#define reado_relaxed(c) ({ __uint128_t __r = le128_to_cpu((__force __le128)__raw_reado(c)); __r; }) #define writeb_relaxed(v,c)((void)__raw_writeb((v),(c))) #define writew_relaxed(v,c)((void)__raw_writew((__force u16)cpu_to_le16(v),(c))) #define writel_relaxed(v,c)((void)__raw_writel((__force u32)cpu_to_le32(v),(c))) #define writeq_relaxed(v,c)((void)__raw_writeq((__force u64)cpu_to_le64(v),(c))) +#define writeo_relaxed(v,c)((void)__raw_writeo((__force __uint128_t)cpu_to_le128(v),(c))) /* * I/O memory access primitives. 
Reads are ordered relative to any @@ -135,11 +162,13 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) #define readw(c) ({ u16 __v = readw_relaxed(c); __iormb(); __v; }) #define readl(c) ({ u32 __v = readl_relaxed(c); __iormb(); __v; }) #define readq(c) ({ u64 __v = readq_relaxed(c); __iormb(); __v; }) +#define reado(c) ({ __uint128_t __v = reado_relaxed(c); __iormb(); __v; }) #define writeb(v,c)({ __iowmb(); writeb_relaxed((v),(c)); }) #define writew(v,c)({ __iowmb(); writew_relaxed((v),(c)); }) #define writel(v,c)({ __iowmb(); writel_relaxed((v),(c)); }) #define writeq(v,c)({ __iowmb(); writeq_relaxed((v),(c)); }) +#define writeo(v,c)({ __iowmb(); writeo_relaxed((v),(c)); }) /* * I/O port access primitives. @@ -188,10 +217,12 @@ extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size); #define ioread16be(p) ({ __u16 __v = be16_to_cpu((__force __be16)__raw_readw(p)); __iormb(); __v; }) #define ioread32be(p) ({ __u32 __v = be32_to_cpu((__force __be32)__raw_readl(p)); __iormb(); __v; }) #define ioread64be(p) ({ __u64 __v = be64_to_cpu((__force __be64)__raw_readq(p)); __iormb(); __v
[PATCH 2/3] asm-generic/io.h: API for 128-bit memory accessors
Some architectures, like arm64, support 128-bit memory access. For ARM64 - using load/store pair instructions. This patch introduces reado() and writeo() functions family, where suffix 'o' stands for reading and writing the octet of bytes at once. Signed-off-by: Yury Norov --- include/asm-generic/io.h | 147 +++ 1 file changed, 147 insertions(+) diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h index b4531e3b2120..ac4a4de69efc 100644 --- a/include/asm-generic/io.h +++ b/include/asm-generic/io.h @@ -67,6 +67,16 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) #endif #endif /* CONFIG_64BIT */ +#ifdef CONFIG_HAVE_128BIT_ACCESS +#ifndef __raw_reado +#define __raw_reado __raw_reado +static inline __uint128_t __raw_reado(const volatile void __iomem *addr) +{ + return *(const volatile __uint128_t __force *)addr; +} +#endif +#endif /* CONFIG_HAVE_128BIT_ACCESS */ + #ifndef __raw_writeb #define __raw_writeb __raw_writeb static inline void __raw_writeb(u8 value, volatile void __iomem *addr) @@ -101,6 +111,16 @@ static inline void __raw_writeq(u64 value, volatile void __iomem *addr) #endif #endif /* CONFIG_64BIT */ +#ifdef CONFIG_HAVE_128BIT_ACCESS +#ifndef __raw_writeo +#define __raw_writeo __raw_writeo +static inline void __raw_writeo(__uint128_t value, volatile void __iomem *addr) +{ + *(volatile __uint128_t __force *)addr = value; +} +#endif +#endif /* CONFIG_HAVE_128BIT_ACCESS */ + /* * {read,write}{b,w,l,q}() access little endian memory and return result in * native endianness. @@ -140,6 +160,16 @@ static inline u64 readq(const volatile void __iomem *addr) #endif #endif /* CONFIG_64BIT */ +#ifdef CONFIG_HAVE_128BIT_ACCESS +#ifndef reado +#define reado reado +static inline __uint128_t reado(const volatile void __iomem *addr) +{ + return __le128_to_cpu(__raw_reado(addr)); +} +#endif +#endif /* CONFIG_HAVE_128BIT_ACCESS */ + #ifndef writeb #define writeb writeb static inline void writeb(u8 value, volatile void __iomem *addr) @@ -174,6 +204,16 @@ static inline void writeq(u64 value, volatile void __iomem *addr) #endif #endif /* CONFIG_64BIT */ +#ifdef CONFIG_HAVE_128BIT_ACCESS +#ifndef writeo +#define writeo writeo +static inline void writeo(__uint128_t value, volatile void __iomem *addr) +{ + __raw_writeo(__cpu_to_le128(value), addr); +} +#endif +#endif /* CONFIG_HAVE_128BIT_ACCESS */ + /* * {read,write}{b,w,l,q}_relaxed() are like the regular version, but * are not guaranteed to provide ordering against spinlocks or memory @@ -195,6 +235,10 @@ static inline void writeq(u64 value, volatile void __iomem *addr) #define readq_relaxed readq #endif +#if defined(reado) && !defined(reado_relaxed) +#define reado_relaxed reado +#endif + #ifndef writeb_relaxed #define writeb_relaxed writeb #endif @@ -211,6 +255,10 @@ static inline void writeq(u64 value, volatile void __iomem *addr) #define writeq_relaxed writeq #endif +#if defined(writeo) && !defined(writeo_relaxed) +#define writeo_relaxed writeo +#endif + /* * {read,write}s{b,w,l,q}() repeatedly access the same memory address in * native endianness in 8-, 16-, 32- or 64-bit chunks (@count times). 
@@ -281,6 +329,24 @@ static inline void readsq(const volatile void __iomem *addr, void *buffer, #endif #endif /* CONFIG_64BIT */ +#ifdef CONFIG_HAVE_128BIT_ACCESS +#ifndef readso +#define readso readso +static inline void readso(const volatile void __iomem *addr, void *buffer, + unsigned int count) +{ + if (count) { + __uint128_t *buf = buffer; + + do { + __uint128_t x = __raw_reado(addr); + *buf++ = x; + } while (--count); + } +} +#endif +#endif /* CONFIG_HAVE_128BIT_ACCESS */ + #ifndef writesb #define writesb writesb static inline void writesb(volatile void __iomem *addr, const void *buffer, @@ -343,6 +409,23 @@ static inline void writesq(volatile void __iomem *addr, const void *buffer, #endif #endif /* CONFIG_64BIT */ +#ifdef CONFIG_HAVE_128BIT_ACCESS +#ifndef writeso +#define writeso writeso +static inline void writeso(volatile void __iomem *addr, const void *buffer, + unsigned int count) +{ + if (count) { + const __uint128_t *buf = buffer; + + do { + __raw_writeo(*buf++, addr); + } while (--count); + } +} +#endif +#endif /* CONFIG_HAVE_128BIT_ACCESS */ + #ifndef PCI_IOBASE #define PCI_IOBASE ((void __iomem *)0) #endif @@ -595,6 +678,16 @@ static inline u64 ioread64(const volatile void __iomem *addr) #endif #endif /* CONFIG_64BIT */ +#ifdef CONFIG_HAVE_128BIT_ACCESS +#ifndef ioread128 +#define ioread128 ioread128 +static inline __uint128_t ioread128(const volatile void __iomem *addr) +{ + return reado(addr); +} +#endif +#endif /* CONFIG_HAVE_128BIT_ACCESS */ + #ifndef iowrit
Re: ppc elf_map breakage with MAP_FIXED_NOREPLACE
On Wed 24-01-18 10:39:41, Anshuman Khandual wrote: > On 01/23/2018 09:36 PM, Michal Hocko wrote: > > On Tue 23-01-18 21:28:28, Anshuman Khandual wrote: > >> On 01/23/2018 06:15 PM, Michal Hocko wrote: > >>> On Tue 23-01-18 16:55:18, Anshuman Khandual wrote: > On 01/17/2018 01:37 PM, Michal Hocko wrote: > > On Thu 11-01-18 15:38:37, Anshuman Khandual wrote: > >> On 01/09/2018 09:43 PM, Michal Hocko wrote: > > [...] > >>> Did you manage to catch _who_ is requesting that anonymous mapping? Do > >>> you need help with the debugging patch? > >> > >> Not yet, will get back on this. > > > > ping? > > Hey Michal, > > Missed this thread, my apologies. This problem is happening only with > certain binaries like 'sed', 'tmux', 'hostname', 'pkg-config' etc. As > you had mentioned before, the map request collision is happening on > [1003, 1004] and [1003, 1004] ranges only, which is > just a single PAGE_SIZE. You asked previously who might have > requested the anon mapping which is already present in there ? Would > not that be the same process itself ? I am a bit confused. > >>> > >>> We are early in the ELF loading. If we are mapping over an existing > >>> mapping then we are effectively corrupting it. In other words exactly > >>> what this patch tries to prevent. I fail to see what would be a relevant > >>> anon mapping this early and why it would be colliding with ELF > >>> segments. > >>> > Would it be > helpful to trap all the mmap() requests from any of the binaries > and see where we might have created that anon mapping ? > >>> > >>> Yeah, that is exactly what I was suggesting. Sorry for not being clear > >>> about that. > >>> > >> > >> Tried to instrument just for the 'sed' binary and don't see anywhere > >> it actually requests the anon VMA which got hit when loading the ELF > >> section, which is strange. All these requested flags here already have > >> MAP_FIXED_NOREPLACE (0x10). Wondering where the anon VMA > >> actually came from. > > > > Could you try to dump backtrace? > > This is when it fails inside the elf_map() function due to collision with > the existing anon VMA mapping. This is not the interesting one. This is the ELF loader. And we know it fails. We are really interested in the one _who_ installs the original VMA. Because nothing should really be there. It would also be very helpful to translate the backtrace with faddr2line to get line numbers. > [c000201c9ad07880] [c0b0b4c0] dump_stack+0xb0/0xf0 (unreliable) > [c000201c9ad078c0] [c03c4550] elf_map+0x2d0/0x310 > [c000201c9ad07b60] [c03c6258] load_elf_binary+0x6f8/0x158c > [c000201c9ad07c80] [c0352900] search_binary_handler+0xd0/0x270 > [c000201c9ad07d10] [c0354838] do_execveat_common.isra.31+0x658/0x890 > [c000201c9ad07df0] [c0354e80] SyS_execve+0x40/0x50 > [c000201c9ad07e30] [c000b220] system_call+0x58/0x6c -- Michal Hocko SUSE Labs
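For reference, the translation Michal asks for can be done with the in-tree script against a vmlinux built with debug info, e.g.:

$ ./scripts/faddr2line vmlinux elf_map+0x2d0/0x310 load_elf_binary+0x6f8/0x158c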
Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
On Wed, Jan 24, 2018 at 09:02:21AM +, David Woodhouse wrote: > On Wed, 2018-01-24 at 09:47 +0100, Peter Zijlstra wrote: > > Typically tglx likes to use x86_match_cpu() for these things; see also > > commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to > > errata"). > > Thanks, will fix. I think we might also end up in whitelist mode, > adding "known good" microcodes to the list as they get released or > retroactively blessed. > > I would really have liked a new bit in IA32_ARCH_CAPABILITIES to say > that it's safe, but that's not possible for *existing* microcode which > actually turns out to be OK in the end. > > That means the whitelist ends up basically empty right now. Should I > add a command line parameter to override it? Otherwise we end up having > to rebuild the kernel every time there's a microcode release which > covers a new CPU SKU (which is why I kind of hate the whitelist, but > Arjan is very insistent...) Ick, no, whitelists are a pain for everyone involved. Don't do that unless it is absolutely the only way it will ever work. Arjan, why do you think this can only be done as a whitelist? It's much easier to just mark the "bad" microcode versions as those _should_ be a much smaller list that Intel knows about today. And of course, any future microcode updates will not be "bad" because they know how to properly test for this now before they are released :) thanks, greg k-h
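The command line override David asks about would typically be wired up with early_param(). A hedged sketch, reusing the table check quoted later in this thread; the parameter name is invented for illustration:

static bool nospec_ucode_check_off __read_mostly;

static int __init nospec_ucode_check_setup(char *str)
{
	nospec_ucode_check_off = true;
	return 0;
}
early_param("nospec_ucode_check", nospec_ucode_check_setup);

static bool bad_spectre_microcode(struct cpuinfo_x86 *c)
{
	int i;

	/* the admin vouched for the microcode on the command line */
	if (nospec_ucode_check_off)
		return false;

	for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) {
		if (c->x86_model == spectre_bad_microcodes[i].model &&
		    c->x86_mask == spectre_bad_microcodes[i].stepping)
			return (c->microcode <= spectre_bad_microcodes[i].microcode);
	}
	return false;
}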
Re: [PATCH 4.9] usbip: Fix implicit fallthrough warning
On Tue, Jan 23, 2018 at 07:32:11PM -0700, Shuah Khan wrote: > From: Jonathan Dieter > > Upstream commit cfd6ed4537a9 ("usbip: Fix implicit fallthrough warning") > > GCC 7 now warns when switch statements fall through implicitly, and with > -Werror enabled in configure.ac, that makes these tools unbuildable. > > We fix this by notifying the compiler that this particular case statement > is meant to fall through. > > Reviewed-by: Peter Senna Tschudin > Signed-off-by: Jonathan Dieter > Signed-off-by: Greg Kroah-Hartman > Signed-off-by: Shuah Khan > --- > tools/usb/usbip/src/usbip.c | 2 ++ > 1 file changed, 2 insertions(+) All of these usbip tools patches are now queued up, thanks! greg k-h
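For context, the pattern the backported commit relies on: GCC 7's -Wimplicit-fallthrough is silenced by a comment matching its fallthrough regex. A self-contained sketch (the exact comment wording in the upstream commit may differ):

#include <stdio.h>

enum cmd { CMD_ATTACH, CMD_DETACH };

static void handle(enum cmd c)
{
	switch (c) {
	case CMD_ATTACH:
		printf("attach-specific setup\n");
		/* FALLTHROUGH */
	case CMD_DETACH:
		printf("work shared by both commands\n");
		break;
	}
}

int main(void) { handle(CMD_ATTACH); return 0; }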
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
On 01/24/2018 02:32 PM, Christophe Leroy wrote:

An application running with libhugetlbfs fails to allocate additional pages to HEAP due to the hugemap being done unconditionally as topdown mapping:

mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8
[...]
mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8
munmap(0x73d8, 1048576) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0
munmap(0x73d0, 1572864) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0
munmap(0x73d0, 1572864) = 0
[...]

As one can see from the above strace log, mmap() allocates further pages below the initial one because no space is available on top of it.

This patch fixes it by requesting bottomup mapping as the non-generic hugetlb_get_unmapped_area() does.

Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ")
Signed-off-by: Christophe Leroy
---
v3: Was a standalone patch before, but conflicts with this series.

 arch/powerpc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 79e1378ee303..368ea6b248ad 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		return radix__hugetlb_get_unmapped_area(file, addr, len, pgoff, flags);
 #endif
-	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
+	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0);
 }
 #endif

Why make this change also for PPC64? Can you do this #ifdef 8xx? You can ideally move hugetlb_get_unmapped_area to slice.h and then make this much simpler for 8xx?

-aneesh
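A minimal sketch of Aneesh's #ifdef suggestion, assuming the bottomup behaviour is only wanted on 8xx; this is not what the series does:

unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
					unsigned long len, unsigned long pgoff,
					unsigned long flags)
{
	struct hstate *hstate = hstate_file(file);
	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));

#ifdef CONFIG_PPC_RADIX_MMU
	if (radix_enabled())
		return radix__hugetlb_get_unmapped_area(file, addr, len,
							pgoff, flags);
#endif
#ifdef CONFIG_PPC_8xx
	/* 8xx: grow bottomup so a libhugetlbfs heap can keep extending */
	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0);
#else
	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
#endif
}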
Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu
On Wed, Jan 24, 2018 at 04:32:38PM +0800, Lin Xiulei wrote: > >> kernel/events/core.c | 44 +++- > >> 1 file changed, 31 insertions(+), 13 deletions(-) > >> > >> diff --git a/kernel/events/core.c b/kernel/events/core.c > >> index 4df5b69..f766b60 100644 > >> --- a/kernel/events/core.c > >> +++ b/kernel/events/core.c > >> @@ -933,31 +933,36 @@ list_update_cgroup_event(struct perf_event *event, > >> { > >> struct perf_cpu_context *cpuctx; > >> struct list_head *cpuctx_entry; > >> + struct perf_cgroup *cgrp; > >> > >> if (!is_cgroup_event(event)) > >> return; > >> > >> /* > >>* Because cgroup events are always per-cpu events, > >>* this will always be called from the right CPU. > >>*/ > >> cpuctx = __get_cpu_context(ctx); > >> + cgrp = perf_cgroup_from_task(current, ctx); > >> > >> + /* cpuctx->cgrp is NULL unless a cgroup event is running in this CPU > >> .*/ > >> + if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) > >> { > >> + if (add) > >> cpuctx->cgrp = cgrp; > >> + else > >> + cpuctx->cgrp = NULL; > >> } > >> + > >> + if (add && ctx->nr_cgroups++) > >> + return; > >> + else if (!add && --ctx->nr_cgroups) > >> + return; > >> + > >> + cpuctx_entry = &cpuctx->cgrp_cpuctx_entry; > >> + if (add) > >> + list_add(cpuctx_entry, this_cpu_ptr(&cgrp_cpuctx_list)); > >> + else > >> + list_del(cpuctx_entry); > >> } > > > > I'm a little confused; you unconditionally set cpuctx->cgrp for every > > add/delete. > > > > So if we have >1 cgroup events on, and we remove one, you still clear > > cpuctx->cgrp, that seems wrong. > > > > Why did you change that? The Changelog doesn't include enough clues for > > me to know what you were trying to do. > > if we have > 1 cgroup events on, whenever a cgroup was really to be > deleted, only if this cgroup is the same as the cgroup running on this > cpu, I would clear cpuctx->cgrp. But that might still be too early, we might still have more cgroup events active. What goes wrong if we leave it set? > Here is the problem, previous version didn't set cpuctx->cgrp anymore > if ctx->nr_cgroups > 1, which cases a second event would not be > activated immediately because cpuctx->cgrp isn't equal to event->cgrp > at event_filter_match() OK, I think I can see that happening. Please clarify the Changelog and maybe put a comment in the code as well.
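A hedged sketch of the semantics the two mails converge on, keeping cpuctx->cgrp set while any cgroup event remains on this CPU and clearing it only with the last one; not necessarily the patch as merged:

	if (add) {
		if (ctx->nr_cgroups++ == 0)
			list_add(&cpuctx->cgrp_cpuctx_entry,
				 this_cpu_ptr(&cgrp_cpuctx_list));
		/* let a second event for the same cgroup match right away */
		if (cgroup_is_descendant(cgrp->css.cgroup,
					 event->cgrp->css.cgroup))
			cpuctx->cgrp = cgrp;
	} else {
		/* other cgroup events may still be active; only clear
		 * cpuctx->cgrp once the last one goes away */
		if (--ctx->nr_cgroups == 0) {
			list_del(&cpuctx->cgrp_cpuctx_entry);
			cpuctx->cgrp = NULL;
		}
	}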
Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu
2018-01-24 17:14 GMT+08:00 Peter Zijlstra : > On Wed, Jan 24, 2018 at 04:32:38PM +0800, Lin Xiulei wrote: >> >> kernel/events/core.c | 44 +++- >> >> 1 file changed, 31 insertions(+), 13 deletions(-) >> >> >> >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> >> index 4df5b69..f766b60 100644 >> >> --- a/kernel/events/core.c >> >> +++ b/kernel/events/core.c >> >> @@ -933,31 +933,36 @@ list_update_cgroup_event(struct perf_event *event, >> >> { >> >> struct perf_cpu_context *cpuctx; >> >> struct list_head *cpuctx_entry; >> >> + struct perf_cgroup *cgrp; >> >> >> >> if (!is_cgroup_event(event)) >> >> return; >> >> >> >> /* >> >>* Because cgroup events are always per-cpu events, >> >>* this will always be called from the right CPU. >> >>*/ >> >> cpuctx = __get_cpu_context(ctx); >> >> + cgrp = perf_cgroup_from_task(current, ctx); >> >> >> >> + /* cpuctx->cgrp is NULL unless a cgroup event is running in this >> >> CPU .*/ >> >> + if (cgroup_is_descendant(cgrp->css.cgroup, >> >> event->cgrp->css.cgroup)) { >> >> + if (add) >> >> cpuctx->cgrp = cgrp; >> >> + else >> >> + cpuctx->cgrp = NULL; >> >> } >> >> + >> >> + if (add && ctx->nr_cgroups++) >> >> + return; >> >> + else if (!add && --ctx->nr_cgroups) >> >> + return; >> >> + >> >> + cpuctx_entry = &cpuctx->cgrp_cpuctx_entry; >> >> + if (add) >> >> + list_add(cpuctx_entry, this_cpu_ptr(&cgrp_cpuctx_list)); >> >> + else >> >> + list_del(cpuctx_entry); >> >> } >> > >> > I'm a little confused; you unconditionally set cpuctx->cgrp for every >> > add/delete. >> > >> > So if we have >1 cgroup events on, and we remove one, you still clear >> > cpuctx->cgrp, that seems wrong. >> > >> > Why did you change that? The Changelog doesn't include enough clues for >> > me to know what you were trying to do. >> >> if we have > 1 cgroup events on, whenever a cgroup was really to be >> deleted, only if this cgroup is the same as the cgroup running on this >> cpu, I would clear cpuctx->cgrp. > > But that might still be too early, we might still have more cgroup > events active. > > What goes wrong if we leave it set? > >> Here is the problem, previous version didn't set cpuctx->cgrp anymore >> if ctx->nr_cgroups > 1, which cases a second event would not be >> activated immediately because cpuctx->cgrp isn't equal to event->cgrp >> at event_filter_match() > > OK, I think I can see that happening. Please clarify the Changelog and > maybe put a comment in the code as well. Sure, and I consider this "OK" works for "What goes wrong if we leave it set?". : )
Re: Avoiding information leaks between users and between processes by default? [Was: : [PATCH 1/5] prctl: add PR_ISOLATE_BP process control]
On Wed, 2018-01-24 at 09:37 +0100, Dominik Brodowski wrote: > On Wed, Jan 24, 2018 at 07:29:53AM +0100, Martin Schwidefsky wrote: > > > > On Tue, 23 Jan 2018 18:07:19 +0100 > > Dominik Brodowski wrote: > > > > > > > > On Tue, Jan 23, 2018 at 02:07:01PM +0100, Martin Schwidefsky wrote: > > > > > > > > Add the PR_ISOLATE_BP operation to prctl. The effect of the process > > > > control is to make all branch prediction entries created by the > > > > execution > > > > of the user space code of this task not applicable to kernel code or the > > > > code of any other task. > > > > > > What is the rationale for requiring a per-process *opt-in* for this added > > > protection? > > > > > > For KPTI on x86, the exact opposite approach is being discussed (see, e.g. > > > http://lkml.kernel.org/r/1515612500-14505-1-git-send-emai...@1wt.eu ): By > > > default, play it safe, with KPTI enabled. But for "trusted" processes, one > > > may opt out using prctrl. > > > > The rationale is that there are cases where you got code from *somewhere* > > and want to run it in an isolated context. Think: a docker container that > > runs under KVM. But with spectre this is still not really safe. So you > > include a wrapper program in the docker container to use the trap door > > prctl to start the potential malicious program. Now you should be good, no? > > Well, partly. It may be that s390 and its use cases are special -- but as I > understand it, this uapi question goes beyond this question: > > To my understanding, Linux traditionally tried to aim for the security goal > of avoiding information leaks *between* users[+], probably even between > processes of the same user. It wasn't a guarantee, and there always were > (and will be) information leaks -- and that is where additional safeguards > such as seccomp come into play, which reduce the attack surface against > unknown or unresolved security-related bugs. And everyone knew (or should > have known) that allowing "untrusted" code to be run (be it by an user, be > it JavaScript, etc.) is more risky. But still, avoiding information leaks > between users and between processes was (to my understanding) at least a > goal.[§] > > In recent days however, the outlook on this issue seems to have shifted: > > - Your proposal would mean to trust all userspace code, unless it is > specifically marked as untrusted. As I understand it, this would mean that > by default, spectre isn't fully mitigated cross-user and cross-process, > though the kernel could. And rogue user-run code may make use of that, > unless it is run with a special wrapper. > > - Concerning x86 and IPBP, the current proposal is to limit the protection > offered by IPBP to non-dumpable processes. As I understand it, this would > mean that other processes are left hanging out to dry.[~] > > - Concerning x86 and STIBP, David mentioned that "[t]here's an argument that > there are so many other information leaks between HT siblings that we > might not care"; in the last couple of hours, a proposal emerged to limit > the protection offered by STIBP to non-dumpable processes as well. To my > understanding, this would mean that many processes are left hanging out to > dry again. > > I am a bit worried whether this is a sign for a shift in the security goals. > I fully understand that there might be processes (e.g. some[?] kernel > threads) and users (root) which you need to trust anyway, as they can > already access anything. Disabling additional, costly safeguards for > those special cases then seems OK. 
Opting out of additional protections for > single-user or single-use systems (haproxy?) might make sense as well. But > the kernel[*] not offering full[#] spectre mitigation by default for regular > users and their processes? I'm not so sure. Note that for STIBP/IBPB the operation of the flag is different in another way. We're using it as a "protect this process from others" flag, not a "protect others from this process" flag. I'm not sure this is a fundamental shift in overall security goals; more a recognition that on *current* hardware the cost of 100% protection against an attack that was fairly unlikely in the first place, is fairly prohibitive. For a process to make itself non-dumpable is a simple enough way to opt in. And *maybe* we could contemplate a command line option for 'IBPB always' but I'm *really* wary of exposing too much of that stuff, rather than simply trying to Do The Right Thing. > [*] Whether CPUs should enable full mitigation (IBRS_ALL) by default > in future has been discussed on this list as well. The kernel will do that; it's just not implemented yet because it's slightly non-trivial and can't be fully tested yet. We *will* want to ALTERNATIVE away the retpolines and just set IBRS_ALL because it'll be faster to do so. For IBRS_ALL, note that we still need the same IBPB flushes on context switch; just not STIBP. That's because IBRS_ALL, as Linus so eloquently
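The non-dumpable opt-in David refers to is reachable from plain userspace today; a minimal sketch:

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
	/* opt this task in to the "protect me" side of IBPB/STIBP */
	if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0))
		perror("prctl");

	/* ... run the sensitive workload ... */
	return 0;
}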
Re: [PATCH v3] drm/bridge/synopsys: dsi: add optional pixel clock
Hi Brian, On 01/23/2018 09:49 PM, Brian Norris wrote: > Hi, > > Philippe asked me to review the last version. I'm not sure I have a lot > to contribute. Maybe Rockchip folks who wrote this stuff in the first > place might. I've CC'd some. > > On Tue, Jan 23, 2018 at 06:08:06PM +0100, Philippe Cornu wrote: >> The pixel clock is optional. When available, it offers a better >> preciseness for timing computations and allows to reduce the extra dsi >> bandwidth in burst mode (from ~20% to ~10-12%, hw platform dependant). >> >> Reviewed-by: Andrzej Hajda >> Signed-off-by: Philippe Cornu >> --- >> Changes in v3: Simplify px_clk probing thanks to Andrzej Hajda comments >> >> Changes in v2: Improve px_clk probing in case of ENOENT dt returned value >> (thanks to Philipp Zabel & Andrzej Hajda comments) >> >> drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c | 25 >> +++-- >> 1 file changed, 19 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c >> b/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c >> index ed8af32f8e52..9fb35385c348 100644 >> --- a/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c >> +++ b/drivers/gpu/drm/bridge/synopsys/dw-mipi-dsi.c >> @@ -217,6 +217,7 @@ struct dw_mipi_dsi { >> void __iomem *base; >> >> struct clk *pclk; >> +struct clk *px_clk; >> >> unsigned int lane_mbps; /* per lane */ >> u32 channel; >> @@ -703,24 +704,28 @@ static void dw_mipi_dsi_bridge_mode_set(struct >> drm_bridge *bridge, >> struct dw_mipi_dsi *dsi = bridge_to_dsi(bridge); >> const struct dw_mipi_dsi_phy_ops *phy_ops = dsi->plat_data->phy_ops; >> void *priv_data = dsi->plat_data->priv_data; >> +struct drm_display_mode px_clk_mode = *mode; >> int ret; >> >> clk_prepare_enable(dsi->pclk); >> >> -ret = phy_ops->get_lane_mbps(priv_data, mode, dsi->mode_flags, >> +if (dsi->px_clk) >> +px_clk_mode.clock = clk_get_rate(dsi->px_clk) / 1000; >> + >> +ret = phy_ops->get_lane_mbps(priv_data, &px_clk_mode, dsi->mode_flags, >> dsi->lanes, dsi->format, &dsi->lane_mbps); >> if (ret) >> DRM_DEBUG_DRIVER("Phy get_lane_mbps() failed\n"); >> >> pm_runtime_get_sync(dsi->dev); >> dw_mipi_dsi_init(dsi); >> -dw_mipi_dsi_dpi_config(dsi, mode); >> +dw_mipi_dsi_dpi_config(dsi, &px_clk_mode); >> dw_mipi_dsi_packet_handler_config(dsi); >> dw_mipi_dsi_video_mode_config(dsi); >> -dw_mipi_dsi_video_packet_config(dsi, mode); >> +dw_mipi_dsi_video_packet_config(dsi, &px_clk_mode); >> dw_mipi_dsi_command_mode_config(dsi); >> -dw_mipi_dsi_line_timer_config(dsi, mode); >> -dw_mipi_dsi_vertical_timing_config(dsi, mode); >> +dw_mipi_dsi_line_timer_config(dsi, &px_clk_mode); >> +dw_mipi_dsi_vertical_timing_config(dsi, &px_clk_mode); >> >> dw_mipi_dsi_dphy_init(dsi); >> dw_mipi_dsi_dphy_timing_config(dsi); >> @@ -734,7 +739,7 @@ static void dw_mipi_dsi_bridge_mode_set(struct >> drm_bridge *bridge, >> >> dw_mipi_dsi_dphy_enable(dsi); >> >> -dw_mipi_dsi_wait_for_two_frames(mode); >> +dw_mipi_dsi_wait_for_two_frames(&px_clk_mode); >> >> /* Switch to cmd mode for panel-bridge pre_enable & panel prepare */ >> dw_mipi_dsi_set_mode(dsi, 0); >> @@ -828,6 +833,14 @@ __dw_mipi_dsi_probe(struct platform_device *pdev, >> return ERR_PTR(ret); >> } >> >> +dsi->px_clk = devm_clk_get(dev, "px_clk"); > > Did you write a device tree binding document update for this anywhere? 
> > Brian > Many thanks for your review, yes, "px_clk" is already documented, please have a look to Documentation/devicetree/bindings/display/bridge/dw_mipi_dsi.txt Philippe :-) >> +if (IS_ERR(dsi->px_clk)) { >> +ret = PTR_ERR(dsi->px_clk); >> +if (ret != ENOENT) >> +dev_err(dev, "Unable to get opt. px_clk: %d\n", ret); >> +dsi->px_clk = NULL; >> +} >> + >> /* >> * Note that the reset was not defined in the initial device tree, so >> * we have to be prepared for it not being found. >> -- >> 2.15.1 >>
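One detail in the quoted hunk worth a second look: PTR_ERR() yields a negative errno, so "ret != ENOENT" can never match and the error message would always fire for a missing optional clock. A sketch of the presumably intended check:

	dsi->px_clk = devm_clk_get(dev, "px_clk");
	if (IS_ERR(dsi->px_clk)) {
		ret = PTR_ERR(dsi->px_clk);
		/* PTR_ERR() is negative, so compare against -ENOENT */
		if (ret != -ENOENT)
			dev_err(dev, "Unable to get opt. px_clk: %d\n", ret);
		/* the clock is optional: fall back to mode->clock */
		dsi->px_clk = NULL;
	}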
[PATCH] clk: imx: imx6sx: update cko mux options
According to the latest reference manual (Rev. 2, 9/2017), the previous CKO1/CKO2 mux options are incorrect; update them.

Signed-off-by: Anson Huang
---
 drivers/clk/imx/clk-imx6sx.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/clk/imx/clk-imx6sx.c b/drivers/clk/imx/clk-imx6sx.c
index e6d389e..bc3f9eb 100644
--- a/drivers/clk/imx/clk-imx6sx.c
+++ b/drivers/clk/imx/clk-imx6sx.c
@@ -63,17 +63,17 @@ static const char *lcdif2_sels[]	= { "lcdif2_podf", "ipp_di0", "ipp_di1", "ldb_d
 static const char *display_sels[]	= { "pll2_bus", "pll2_pfd2_396m", "pll3_usb_otg", "pll3_pfd1_540m", };
 static const char *csi_sels[]		= { "osc", "pll2_pfd2_396m", "pll3_120m", "pll3_pfd1_540m", };
 static const char *cko1_sels[]	= {
-	"pll3_usb_otg", "pll2_bus", "pll1_sys", "pll5_video_div",
-	"dummy", "ocram", "dummy", "pxp_axi", "epdc_axi", "lcdif_pix",
-	"epdc_pix", "ahb", "ipg", "perclk", "ckil", "pll4_audio_div",
+	"dummy", "dummy", "dummy", "dummy",
+	"vadc", "ocram", "qspi2", "m4", "enet_ahb", "lcdif2_pix",
+	"lcdif1_pix", "ahb", "ipg", "perclk", "ckil", "pll4_audio_div",
 };
 static const char *cko2_sels[] = {
 	"dummy", "mmdc_p0_fast", "usdhc4", "usdhc1", "dummy", "wrck",
 	"ecspi_root", "dummy", "usdhc3", "pcie", "arm", "csi_core",
-	"lcdif_axi", "dummy", "osc", "dummy", "gpu2d_ovg_core",
-	"usdhc2", "ssi1", "ssi2", "ssi3", "gpu2d_core", "dummy",
-	"dummy", "dummy", "dummy", "esai_extal", "eim_slow", "uart_serial",
-	"spdif", "asrc", "dummy",
+	"display_axi", "dummy", "osc", "dummy", "dummy",
+	"usdhc2", "ssi1", "ssi2", "ssi3", "gpu_axi_podf", "dummy",
+	"can_podf", "lvds1_out", "qspi1", "esai_extal", "eim_slow",
+	"uart_serial", "spdif", "audio", "dummy",
 };
 static const char *cko_sels[] = { "cko1", "cko2", };
 static const char *lvds_sels[] = {
--
2.7.4
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
Le 24/01/2018 à 10:15, Aneesh Kumar K.V a écrit : On 01/24/2018 02:32 PM, Christophe Leroy wrote: An application running with libhugetlbfs fails to allocate additional pages to HEAP due to the hugemap being done inconditionally as topdown mapping: mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8 [...] mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8 munmap(0x73d8, 1048576) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] As one can see from the above strace log, mmap() allocates further pages below the initial one because no space is available on top of it. This patch fixes it by requesting bottomup mapping as the non generic hugetlb_get_unmapped_area() does Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ") Signed-off-by: Christophe Leroy --- v3: Was a standalone patch before, but conflicts with this serie. arch/powerpc/mm/hugetlbpage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 79e1378ee303..368ea6b248ad 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, return radix__hugetlb_get_unmapped_area(file, addr, len, pgoff, flags); #endif - return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1); + return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0); } #endif Why make this change also for PPC64? Can you do this #ifdef 8xx?.You can ideally move hugetlb_get_unmapped_area to slice.h and then make this much simpler for 8xxx? Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I suggested in my last email on this subject (22/01/2018 9:22) ? Before doing anything specific to the PPC32/8xx, I'd like to be sure the issue is definitly only on PPC32. Thanks, Christophe
Re: [RFC] Per file OOM badness
On Tue 23-01-18 17:39:19, Michel Dänzer wrote: > On 2018-01-23 04:36 PM, Michal Hocko wrote: > > On Tue 23-01-18 15:27:00, Roman Gushchin wrote: > >> On Thu, Jan 18, 2018 at 06:00:06PM +0100, Michal Hocko wrote: > >>> On Thu 18-01-18 11:47:48, Andrey Grodzovsky wrote: > Hi, this series is a revised version of an RFC sent by Christian König > a few years ago. The original RFC can be found at > https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.freedesktop.org_archives_dri-2Ddevel_2015-2DSeptember_089778.html&d=DwIDAw&c=5VD0RTtNlTh3ycd41b3MUw&r=jJYgtDM7QT-W-Fz_d29HYQ&m=R-JIQjy8rqmH5qD581_VYL0Q7cpWSITKOnBCE-3LI8U&s=QZGqKpKuJ2BtioFGSy8_721owcWJ0J6c6d4jywOwN4w&; > >>> Here is the origin cover letter text > >>> : I'm currently working on the issue that when device drivers allocate > >>> memory on > >>> : behalf of an application the OOM killer usually doesn't knew about that > >>> unless > >>> : the application also get this memory mapped into their address space. > >>> : > >>> : This is especially annoying for graphics drivers where a lot of the VRAM > >>> : usually isn't CPU accessible and so doesn't make sense to map into the > >>> : address space of the process using it. > >>> : > >>> : The problem now is that when an application starts to use a lot of VRAM > >>> those > >>> : buffers objects sooner or later get swapped out to system memory, but > >>> when we > >>> : now run into an out of memory situation the OOM killer obviously > >>> doesn't knew > >>> : anything about that memory and so usually kills the wrong process. > >>> : > >>> : The following set of patches tries to address this problem by > >>> introducing a per > >>> : file OOM badness score, which device drivers can use to give the OOM > >>> killer a > >>> : hint how many resources are bound to a file descriptor so that it can > >>> make > >>> : better decisions which process to kill. > >>> : > >>> : So question at every one: What do you think about this approach? > >>> : > >>> : My biggest concern right now is the patches are messing with a core > >>> kernel > >>> : structure (adding a field to struct file). Any better idea? I'm > >>> considering > >>> : to put a callback into file_ops instead. > >> > >> Hello! > >> > >> I wonder if groupoom (aka cgroup-aware OOM killer) can work for you? > > > > I do not think so. The problem is that the allocating context is not > > identical with the end consumer. > > That's actually not really true. Even in cases where a BO is shared with > a different process, it is still used at least occasionally in the > process which allocated it as well. Otherwise there would be no point in > sharing it between processes. OK, but somebody has to be made responsible. Otherwise you are just killing a process which doesn't really release any memory. > There should be no problem if the memory of a shared BO is accounted for > in each process sharing it. It might be nice to scale each process' > "debt" by 1 / (number of processes sharing it) if possible, but in the > worst case accounting it fully in each process should be fine. So how exactly then helps to kill one of those processes? The memory stays pinned behind or do I still misunderstand? -- Michal Hocko SUSE Labs
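Purely as a hypothetical illustration of the cover letter's file_ops idea combined with the per-sharer scaling discussed above; neither the hook nor the helpers below exist anywhere:

/* hypothetical callback a driver could hang off file_operations so the
 * OOM killer can charge driver-pinned memory to the right tasks */
static unsigned long gem_file_oom_badness(struct file *file)
{
	struct drm_file *priv = file->private_data;

	/*
	 * Charge pinned pages scaled by the sharer count, so a BO shared
	 * by N processes is not charged in full against each of them.
	 * Both helpers here are invented for this illustration.
	 */
	return gem_pinned_pages(priv) / max(1U, gem_sharer_count(priv));
}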
[PATCH v4 0/2] media: ov7670: Implement mbus configuration
Hello, 4th round for this series, now based on Hans' 'parm' branch from git://linuxtv.org/hverkuil/media_tree.git I addressed Sakari's comments on bindings documentation and driver error path, and I hope to get both driver and bindings acked to have this included in next merge window. Thanks j v3->v4: - Change bindings documentation to drop default value as vsync and hsync polarities are now required properties - Do not put fwnode handle in driver dt parse error path as dev_fwnode() does not increase ref counting v2->v3: - Drop 'pll-bypass' property - Make 'plck-hb-disable' a boolean optional property - List 'hsync' and 'vsync' polarities as required endpoint properties - Restructured 'ov7670_parse_dt()' function to reflect the above changes v1->v2: - Split bindings description and implementation - Addressed Sakari's comments on v1 - Check for return values of register writes in set_fmt() - TODO: decide if "pll-bypass" should be an OF property. Jacopo Mondi (2): media: dt-bindings: Add OF properties to ov7670 v4l2: i2c: ov7670: Implement OF mbus configuration .../devicetree/bindings/media/i2c/ov7670.txt | 16 +++- drivers/media/i2c/ov7670.c | 98 ++ 2 files changed, 98 insertions(+), 16 deletions(-) -- 2.7.4
[PATCH v4 2/2] v4l2: i2c: ov7670: Implement OF mbus configuration
ov7670 driver supports two optional properties supplied through platform data, but currently does not support any standard video interface property. Add support through OF parsing for 2 generic properties (vsync and hsync polarities) and for one custom property already supported through platform data to suppress pixel clock output during horizontal blanking. While at there, check return value of register writes in set_fmt function and rationalize spacings. Signal polarities and pixel clock blanking verified through scope and image capture. Signed-off-by: Jacopo Mondi --- drivers/media/i2c/ov7670.c | 98 +++--- 1 file changed, 84 insertions(+), 14 deletions(-) diff --git a/drivers/media/i2c/ov7670.c b/drivers/media/i2c/ov7670.c index 61c472e..80c822c 100644 --- a/drivers/media/i2c/ov7670.c +++ b/drivers/media/i2c/ov7670.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -242,6 +243,7 @@ struct ov7670_info { struct clk *clk; struct gpio_desc *resetb_gpio; struct gpio_desc *pwdn_gpio; + unsigned int mbus_config; /* Media bus configuration flags */ int min_width; /* Filter out smaller sizes */ int min_height; /* Filter out smaller sizes */ int clock_speed;/* External clock speed (MHz) */ @@ -1018,7 +1020,7 @@ static int ov7670_set_fmt(struct v4l2_subdev *sd, #ifdef CONFIG_VIDEO_V4L2_SUBDEV_API struct v4l2_mbus_framefmt *mbus_fmt; #endif - unsigned char com7; + unsigned char com7, com10 = 0; int ret; if (format->pad) @@ -1038,7 +1040,6 @@ static int ov7670_set_fmt(struct v4l2_subdev *sd, } ret = ov7670_try_fmt_internal(sd, &format->format, &ovfmt, &wsize); - if (ret) return ret; /* @@ -1049,16 +1050,41 @@ static int ov7670_set_fmt(struct v4l2_subdev *sd, */ com7 = ovfmt->regs[0].value; com7 |= wsize->com7_bit; - ov7670_write(sd, REG_COM7, com7); + ret = ov7670_write(sd, REG_COM7, com7); + if (ret) + return ret; + + /* +* Configure the media bus through COM10 register +*/ + if (info->mbus_config & V4L2_MBUS_VSYNC_ACTIVE_LOW) + com10 |= COM10_VS_NEG; + if (info->mbus_config & V4L2_MBUS_HSYNC_ACTIVE_LOW) + com10 |= COM10_HREF_REV; + if (info->pclk_hb_disable) + com10 |= COM10_PCLK_HB; + ret = ov7670_write(sd, REG_COM10, com10); + if (ret) + return ret; + /* * Now write the rest of the array. Also store start/stops */ - ov7670_write_array(sd, ovfmt->regs + 1); - ov7670_set_hw(sd, wsize->hstart, wsize->hstop, wsize->vstart, - wsize->vstop); - ret = 0; - if (wsize->regs) + ret = ov7670_write_array(sd, ovfmt->regs + 1); + if (ret) + return ret; + + ret = ov7670_set_hw(sd, wsize->hstart, wsize->hstop, wsize->vstart, + wsize->vstop); + if (ret) + return ret; + + if (wsize->regs) { ret = ov7670_write_array(sd, wsize->regs); + if (ret) + return ret; + } + info->fmt = ovfmt; /* @@ -1071,8 +1097,10 @@ static int ov7670_set_fmt(struct v4l2_subdev *sd, * to write it unconditionally, and that will make the frame * rate persistent too. 
*/ - if (ret == 0) - ret = ov7670_write(sd, REG_CLKRC, info->clkrc); + ret = ov7670_write(sd, REG_CLKRC, info->clkrc); + if (ret) + return ret; + return 0; } @@ -1698,6 +1726,45 @@ static int ov7670_init_gpio(struct i2c_client *client, struct ov7670_info *info) return 0; } +/* + * ov7670_parse_dt() - Parse device tree to collect mbus configuration + * properties + */ +static int ov7670_parse_dt(struct device *dev, + struct ov7670_info *info) +{ + struct fwnode_handle *fwnode = dev_fwnode(dev); + struct v4l2_fwnode_endpoint bus_cfg; + struct fwnode_handle *ep; + int ret; + + if (!fwnode) + return -EINVAL; + + info->pclk_hb_disable = false; + if (fwnode_property_present(fwnode, "ov7670,pclk-hb-disable")) + info->pclk_hb_disable = true; + + ep = fwnode_graph_get_next_endpoint(fwnode, NULL); + if (!ep) + return -EINVAL; + + ret = v4l2_fwnode_endpoint_parse(ep, &bus_cfg); + if (ret) { + fwnode_handle_put(ep); + return ret; + } + + if (bus_cfg.bus_type != V4L2_MBUS_PARALLEL) { + dev_err(dev, "Unsupported media bus type\n"); + fwnode_handle_put(ep); + return ret; + } + info->mbus_config
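One note on the parse error path quoted above: by the time the bus type is checked, "ret" is 0 because v4l2_fwnode_endpoint_parse() succeeded, so returning it reports success for an unsupported bus. A sketch of how that branch presumably should fail:

	if (bus_cfg.bus_type != V4L2_MBUS_PARALLEL) {
		dev_err(dev, "Unsupported media bus type\n");
		fwnode_handle_put(ep);
		return -EINVAL;	/* 'ret' is 0 here, so fail explicitly */
	}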
[PATCH v4 1/2] media: dt-bindings: Add OF properties to ov7670
Describe newly introduced OF properties for ov7670 image sensor. The driver supports two standard properties to configure synchronism signals polarities and one custom property already supported as platform data options to suppress pixel clock during horizontal blankings. Re-phrase child nodes description while at there. Signed-off-by: Jacopo Mondi --- Documentation/devicetree/bindings/media/i2c/ov7670.txt | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/media/i2c/ov7670.txt b/Documentation/devicetree/bindings/media/i2c/ov7670.txt index 826b656..2c972a5 100644 --- a/Documentation/devicetree/bindings/media/i2c/ov7670.txt +++ b/Documentation/devicetree/bindings/media/i2c/ov7670.txt @@ -9,14 +9,21 @@ Required Properties: - clocks: reference to the xclk input clock. - clock-names: should be "xclk". +Required Endpoint Properties: +- hsync-active: active state of the HSYNC signal, 0/1 for LOW/HIGH respectively. +- vsync-active: active state of the VSYNC signal, 0/1 for LOW/HIGH respectively. + Optional Properties: - reset-gpios: reference to the GPIO connected to the resetb pin, if any. Active is low. - powerdown-gpios: reference to the GPIO connected to the pwdn pin, if any. Active is high. +- ov7670,pclk-hb-disable: a boolean property to suppress pixel clock output + signal during horizontal blankings. -The device node must contain one 'port' child node for its digital output -video port, in accordance with the video interface bindings defined in +The device node must contain one 'port' child node with one 'endpoint' child +sub-node for its digital output video port, in accordance with the video +interface bindings defined in: Documentation/devicetree/bindings/media/video-interfaces.txt. Example: @@ -34,8 +41,13 @@ Example: assigned-clocks = <&pck0>; assigned-clock-rates = <2500>; + ov7670,pclk-hb-disable; + port { ov7670_0: endpoint { + hsync-active = <0>; + vsync-active = <0>; + remote-endpoint = <&isi_0>; }; }; -- 2.7.4
Re: [PATCH v8 8/9] pwm: pwm-omap-dmtimer: Adapt driver to utilize dmtimer pdata ops
On Wed, Jan 24, 2018 at 01:58:19PM +0530, Keerthy wrote: > On Wednesday 24 January 2018 12:54 PM, Ladislav Michl wrote: > > Keerthy, > > > > On Wed, Jan 24, 2018 at 11:14:40AM +0530, Keerthy wrote: > >> Adapt driver to utilize dmtimer pdata ops instead of pdata-quirks. > >> > >> Signed-off-by: Keerthy > >> Acked-by: Neil Armstrong > >> Reviewed-by: Claudiu Beznea > >> --- > >> > >> Changes in v8: > >> > >> * Added of_node_put call in success case of probe. > >> > >> Boot tested on am437x-gp-evm and dra7xx-evm. > >> Also compile tested omap1_defconfig with other patches of v7 > >> posted here: > >> > >> https://www.spinics.net/lists/linux-omap/msg141100.html > >> > >> With v8 version of Patch 8/9. > >> > >> drivers/pwm/pwm-omap-dmtimer.c | 68 > >> ++ > >> 1 file changed, 42 insertions(+), 26 deletions(-) > >> > >> diff --git a/drivers/pwm/pwm-omap-dmtimer.c > >> b/drivers/pwm/pwm-omap-dmtimer.c > >> index 5ad42f3..c00e474 100644 > >> --- a/drivers/pwm/pwm-omap-dmtimer.c > >> +++ b/drivers/pwm/pwm-omap-dmtimer.c > >> @@ -23,6 +23,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> #include > >> #include > >> #include > >> @@ -37,7 +38,7 @@ struct pwm_omap_dmtimer_chip { > >>struct pwm_chip chip; > >>struct mutex mutex; > >>pwm_omap_dmtimer *dm_timer; > >> - struct pwm_omap_dmtimer_pdata *pdata; > >> + struct omap_dm_timer_ops *pdata; > >>struct platform_device *dm_timer_pdev; > >> }; > >> > >> @@ -242,19 +243,35 @@ static int pwm_omap_dmtimer_probe(struct > >> platform_device *pdev) > >> { > >>struct device_node *np = pdev->dev.of_node; > >>struct device_node *timer; > >> + struct platform_device *timer_pdev; > >>struct pwm_omap_dmtimer_chip *omap; > >> - struct pwm_omap_dmtimer_pdata *pdata; > >> + struct dmtimer_platform_data *timer_pdata; > >> + struct omap_dm_timer_ops *pdata; > >>pwm_omap_dmtimer *dm_timer; > >>u32 v; > >> - int status; > >> + int status, ret = 0; > >> > >> - pdata = dev_get_platdata(&pdev->dev); > >> - if (!pdata) { > >> - dev_err(&pdev->dev, "Missing dmtimer platform data\n"); > >> - return -EINVAL; > >> + timer = of_parse_phandle(np, "ti,timers", 0); > >> + if (!timer) > >> + return -ENODEV; > >> + > >> + timer_pdev = of_find_device_by_node(timer); > >> + if (!timer_pdev) { > >> + dev_err(&pdev->dev, "Unable to find Timer pdev\n"); > >> + ret = -ENODEV; > >> + goto put; > >>} > >> > >> - if (!pdata->request_by_node || > >> + timer_pdata = dev_get_platdata(&timer_pdev->dev); > >> + if (!timer_pdata) { > >> + dev_err(&pdev->dev, "dmtimer pdata structure NULL\n"); > >> + ret = -EINVAL; > >> + goto put; > >> + } > >> + > >> + pdata = timer_pdata->timer_ops; > >> + > >> + if (!pdata || !pdata->request_by_node || > >>!pdata->free || > >>!pdata->enable || > >>!pdata->disable || > >> @@ -267,37 +284,32 @@ static int pwm_omap_dmtimer_probe(struct > >> platform_device *pdev) > >>!pdata->set_prescaler || > >>!pdata->write_counter) { > >>dev_err(&pdev->dev, "Incomplete dmtimer pdata structure\n"); > >> - return -EINVAL; > >> + ret = -EINVAL; > >> + goto put; > >>} > >> > >> - timer = of_parse_phandle(np, "ti,timers", 0); > >> - if (!timer) > >> - return -ENODEV; > >> - > >>if (!of_get_property(timer, "ti,timer-pwm", NULL)) { > >>dev_err(&pdev->dev, "Missing ti,timer-pwm capability\n"); > >> - return -ENODEV; > >> + ret = -ENODEV; > >> + goto put; > > > > Should we call of_node_put() even from here? of_get_property() failed, so > > reference was not updated. > > The of_node_put to balance the of_parse_handle called for timer. 
I hope > that is what you wanted to check right? Right, your code does it already, so please ignore my previous comment. > > > >>} > >> > >>dm_timer = pdata->request_by_node(timer); > > > > And timer seems to be used only here, so calling > > of_node_put(timer); > > just here should be enough. > > Okay yes. This can be optimized. of_node_put(timer); can be called > here and the instances below need not have that additional step. Yes, thank you. Then I'll send rebased patches. > >> - if (!dm_timer) > >> - return -EPROBE_DEFER; > >> + if (!dm_timer) { > >> + ret = -EPROBE_DEFER; > >> + goto put; > >> + } > >> > >>omap = devm_kzalloc(&pdev->dev, sizeof(*omap), GFP_KERNEL); > >>if (!omap) { > >>pdata->free(dm_timer); > >> - return -ENOMEM; > >> + ret = -ENOMEM; > >> + goto put; > >>} > >> > >>omap->pdata = pdata; > >>omap->dm_timer = dm_timer; > >> - > >> - omap->dm_timer_pdev = of_find_device_by_node(timer); > >> - if (!omap->dm_timer_pdev) { > >> - dev_err(&pdev->dev, "Unable to find timer p
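A sketch of the simplification agreed on above: drop the node reference immediately after the only remaining use of 'timer', so the later error paths no longer need the "put" label:

	dm_timer = pdata->request_by_node(timer);
	of_node_put(timer);	/* last use of the node, drop the reference */
	if (!dm_timer)
		return -EPROBE_DEFER;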
Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
> > > + for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) { > > > + if (c->x86_model == spectre_bad_microcodes[i].model && > > > + c->x86_mask == spectre_bad_microcodes[i].stepping) > > > + return (c->microcode <= > > > spectre_bad_microcodes[i].microcode); > > > + } > > > + return 0; > > > +} > > The above is Intel only, you should check vendor too I think. > > It's in intel.c, called from early_init_intel(). Isn't that sufficient? Duh, so much for reading skillz on my end ;-) > > > + pr_warn("Intel Spectre v2 broken microcode detected; disabling > > > SPEC_CTRL\n"); > > > + clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL); > > > + clear_cpu_cap(c, X86_FEATURE_STIBP); > > > + clear_cpu_cap(c, X86_FEATURE_AMD_SPEC_CTRL); > > > + clear_cpu_cap(c, X86_FEATURE_AMD_PRED_CMD); > > > + clear_cpu_cap(c, X86_FEATURE_AMD_STIBP); > > > + } > > And since its Intel only, what are those AMD features doing there? > > Hypervisors which only want to expose PRED_CMD may do so using the AMD > feature bit. SPEC_CTRL requires save/restore and live migration > support, and isn't needed with retpoline anyway (since guests won't be > calling directly into firmware). Egads, I suppose that makes some sense, but it does make a horrible muddle of things.
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
On 01/24/2018 02:57 PM, Christophe LEROY wrote: Le 24/01/2018 à 10:15, Aneesh Kumar K.V a écrit : On 01/24/2018 02:32 PM, Christophe Leroy wrote: An application running with libhugetlbfs fails to allocate additional pages to HEAP due to the hugemap being done inconditionally as topdown mapping: mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8 [...] mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8 munmap(0x73d8, 1048576) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] As one can see from the above strace log, mmap() allocates further pages below the initial one because no space is available on top of it. This patch fixes it by requesting bottomup mapping as the non generic hugetlb_get_unmapped_area() does Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ") Signed-off-by: Christophe Leroy --- v3: Was a standalone patch before, but conflicts with this serie. arch/powerpc/mm/hugetlbpage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 79e1378ee303..368ea6b248ad 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, return radix__hugetlb_get_unmapped_area(file, addr, len, pgoff, flags); #endif - return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1); + return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0); } #endif Why make this change also for PPC64? Can you do this #ifdef 8xx?.You can ideally move hugetlb_get_unmapped_area to slice.h and then make this much simpler for 8xxx? Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I suggested in my last email on this subject (22/01/2018 9:22) ? yes. The test ran fine for me kvaneesh@ltctulc6a-p1:[~]$ HUGETLB_MORECORE=yes HUGETLB_MORECORE_HEAPBASE=0x3000 ./a.out 1000-1001 r-xp fc:00 9044312 /home/kvaneesh/a.out 1001-1002 r--p fc:00 9044312 /home/kvaneesh/a.out 1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out 3000-3300 rw-p 00:0d 1062697 /anon_hugepage (deleted) 3300-3500 rw-p 0300 00:0d 1062698 /anon_hugepage (deleted) 3500-3700 rw-p 0500 00:0d 1062699 /anon_hugepage (deleted) 77d6-77f1 r-xp fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so 77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so 77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so 77f4-77f6 r-xp fc:00 10754812 /usr/lib/libhugetlbfs.so.0 77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0 77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0 77f8-77fa r-xp 00:00 0 [vdso] 77fa-77fe r-xp fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so 77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so 77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so 7ffd-8000 rw-p 00:00 0 [stack] Before doing anything specific to the PPC32/8xx, I'd like to be sure the issue is definitly only on PPC32. I am not sure I understand the problem correctly. If there is a free space in the required range, both topdown/bottomup search should be able to find it. Unless topdown found another free area suitable for hugetlb allocation above. 
My take is we should not change the topdown to bottomup without really understanding the failure scenarios. -aneesh
Re: [PATCH 4.4] sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks
On Mon, Jan 22, 2018 at 06:16:19PM +0100, Daniel Bristot de Oliveira wrote: > --- %< --- > Here it is! the backport for 4.4.y-stable. > > The main difference from the original commit is that > the BW_SHIFT define was not present yet. As BW_SHIFT was > introduced in a new feature, I just used the value (20), > likewise we used to use before the #define. > Other changes were required because of comments. Thanks for the backport, now queued up. greg k-h
Re: [PATCH v5 0/2] printk: Console owner and waiter logic cleanup
On Wed, Jan 10, 2018 at 09:02:23AM -0800, Tejun Heo wrote: > 1. Console is IPMI emulated serial console. Super slow. Also > netconsole is in use. So my IPMI SoL consoles typically run at 115200 Baud (or higher) and I've not had trouble like that (granted I don't typically trigger OOM storms, but they do occasionally happen). Is your IPMI much slower and not fixable to be faster?
[RFC][PATCH] printk: do not flush printk_safe from irq_work
Tejun Heo reported that net_console can cause recursive printk()-s from call_console_drivers() (under OOM/etc.) and this can lockup that CPU, because CPU calls call_console_drivers() to print pending logbufs, but recursive printk() issued by one of console drivers adds several new messages to the logbuf via recursive printk(). Note, this is not a printk_safe problem. In fact printk_safe helps us to address the issue here and to rate-limit recursive printk()-s. Schematically, what we have at the moment is: == printk() preempt_disable() console_unlock() { for (;;) { local_irq_save() printk_safe_enter() call_console_drivers() printk() vprintk_safe() irq_work_queue() printk_safe_exit() local_irq_restore() << irq work >> printk_safe_flush() printk() } } preempt_enable() == So CPU triggers printk() recursion in call_console_drivers(), immediately after local_irq_restore() it executes printk_safe flush irq_work, which adds even more pending messages to the logbuf which it has to print. CPU hits printk() recursion again while it prints pending messages, which results in even more pending messages flushed from printk_safe via irq_work. And so on. The idea is to delay printk_safe flush until CPU is in preemptible context, so we won't lockup it up. The new behaviour is: == printk() preempt_disable() console_unlock() { for (;;) { local_irq_save() printk_safe_enter() call_console_drivers() printk() vprintk_safe() queue_work_on(smp_processor_id()) printk_safe_exit() local_irq_restore() } } preempt_enable() << work queue >> printk_safe_flush() printk() == Since console_unlock() is called under preemption_disabled() now we cannot schedule flush work as long as CPU is printing messages. We only can flush printk_safe messages after CPU returns from console_unlock() and enables preemption. This might look like with the delayed flush we increase the possibility of lost printk_safe messages (printk_safe per-CPU buffers are pretty limited in size). But prinkt_safe was never meant to be used for huge amounts of data which can generate e.g. debugging of serial consoles, etc. Its goal is to make recursive printk()-s less deadlock prone. With this patch we add one more thing: avoid CPU lockup which might be caused by printk_safe_flush(). Signed-off-by: Sergey Senozhatsky Reported-by: Tejun Heo --- kernel/printk/printk_safe.c | 90 +++-- 1 file changed, 78 insertions(+), 12 deletions(-) diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c index 3e3c2004bb23..eb5a4049314a 100644 --- a/kernel/printk/printk_safe.c +++ b/kernel/printk/printk_safe.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" @@ -49,7 +50,10 @@ static int printk_safe_irq_ready __read_mostly; struct printk_safe_seq_buf { atomic_tlen;/* length of written data */ atomic_tmessage_lost; - struct irq_work work; /* IRQ work that flushes the buffer */ + /* IRQ work that flushes printk_nmi buffer */ + struct irq_work irq_flush_work; + /* WQ work that flushes printk_safe buffer */ + struct work_struct wq_flush_work; unsigned char buffer[SAFE_LOG_BUF_LEN]; }; @@ -61,10 +65,33 @@ static DEFINE_PER_CPU(struct printk_safe_seq_buf, nmi_print_seq); #endif /* Get flushed in a more safe context. */ -static void queue_flush_work(struct printk_safe_seq_buf *s) +static void queue_irq_flush_work(struct printk_safe_seq_buf *s) { if (printk_safe_irq_ready) - irq_work_queue(&s->work); + irq_work_queue(&s->irq_flush_work); +} + +/* + * Get flushed from a process context. 
printk() recursion may happen in + * call_console_drivers(), while CPU that hit printk() recursion is + * printing pending logbud messages. In this case flushing printk_safe buffer + * from IRQ work may result in CPU lockup - we flush printk_safe every time + * we enable local IRQs after call_console_drivers(), which adds new messages + * to the logbuf and, thus, CPU has more messages to print, triggering more + * recursive printk()-s in call_console_drivers() and appending even more + * messages to the logbuf. + * + * Process context cannot flush messages while CPU is in console_unlock(), + * because printk() calls console_unlock() with preemption disabled. So + * we can flush recursive call_console_drivers() printk()-s only after CPU + * leaves console_unlock() printing loop and enables preemption. + */ +static void queue_wq_flush_work(struct printk_safe_seq_buf *s) +{ + if (print
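The diff is cut off just as the new helper begins; going by the schematic in the changelog, a plausible completion (hypothetical, not the author's code) would be:

static void queue_wq_flush_work(struct printk_safe_seq_buf *s)
{
	/* defer the flush to a per-CPU work item, not an irq_work */
	if (printk_safe_irq_ready)
		queue_work_on(smp_processor_id(), system_wq,
			      &s->wq_flush_work);
}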
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
Le 24/01/2018 à 10:35, Aneesh Kumar K.V a écrit : On 01/24/2018 02:57 PM, Christophe LEROY wrote: Le 24/01/2018 à 10:15, Aneesh Kumar K.V a écrit : On 01/24/2018 02:32 PM, Christophe Leroy wrote: An application running with libhugetlbfs fails to allocate additional pages to HEAP due to the hugemap being done inconditionally as topdown mapping: mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8 [...] mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8 munmap(0x73d8, 1048576) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0 munmap(0x73d0, 1572864) = 0 [...] As one can see from the above strace log, mmap() allocates further pages below the initial one because no space is available on top of it. This patch fixes it by requesting bottomup mapping as the non generic hugetlb_get_unmapped_area() does Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ") Signed-off-by: Christophe Leroy --- v3: Was a standalone patch before, but conflicts with this serie. arch/powerpc/mm/hugetlbpage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 79e1378ee303..368ea6b248ad 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, return radix__hugetlb_get_unmapped_area(file, addr, len, pgoff, flags); #endif - return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1); + return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0); } #endif Why make this change also for PPC64? Can you do this #ifdef 8xx?.You can ideally move hugetlb_get_unmapped_area to slice.h and then make this much simpler for 8xxx? Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I suggested in my last email on this subject (22/01/2018 9:22) ? yes. The test ran fine for me You tried with 0x3000, it works as well on PPC32. I'd really like you to try with 0x1100 which is in the same slice as the 1002-1003 range. Christophe kvaneesh@ltctulc6a-p1:[~]$ HUGETLB_MORECORE=yes HUGETLB_MORECORE_HEAPBASE=0x3000 ./a.out 1000-1001 r-xp fc:00 9044312 /home/kvaneesh/a.out 1001-1002 r--p fc:00 9044312 /home/kvaneesh/a.out 1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out 3000-3300 rw-p 00:0d 1062697 /anon_hugepage (deleted) 3300-3500 rw-p 0300 00:0d 1062698 /anon_hugepage (deleted) 3500-3700 rw-p 0500 00:0d 1062699 /anon_hugepage (deleted) 77d6-77f1 r-xp fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so 77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so 77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so 77f4-77f6 r-xp fc:00 10754812 /usr/lib/libhugetlbfs.so.0 77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0 77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0 77f8-77fa r-xp 00:00 0 [vdso] 77fa-77fe r-xp fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so 77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so 77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so 7ffd-8000 rw-p 00:00 0 [stack] Before doing anything specific to the PPC32/8xx, I'd like to be sure the issue is definitly only on PPC32. I am not sure I understand the problem correctly. 
If there is a free space in the required range, both topdown/bottomup search should be able to find it. Unless topdown found another free area suitable for hugetlb allocation above. My take is we should not change the topdown to bottomup without really understanding the failure scenarios. -aneesh
Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu
On Wed, Jan 24, 2018 at 05:19:34PM +0800, Lin Xiulei wrote: > Sure, and I consider this "OK" works for "What goes wrong if we leave > it set?". : ) It would be good if you inspect the code for the case of leaving cpuctx->cgrp set with no cgroup events left -- AND -- put a blurb about what you found in your new Changelog. I suspect it works out and something like perf_cgroup_switch() will fix things up for us later, but double check and test.
Re: [PATCH 11/14] MIPS: memblock: Print out kernel virtual mem layout
Hi Serge, On 23/01/18 19:10, Serge Semin wrote: Hello Matt, On Tue, Jan 23, 2018 at 03:35:14PM +, Matt Redfearn wrote: Hi Serge, On 19/01/18 14:27, Serge Semin wrote: On Fri, Jan 19, 2018 at 07:59:43AM +, Matt Redfearn wrote: Hello Matt, Hi Serge, On 18/01/18 20:18, Serge Semin wrote: On Thu, Jan 18, 2018 at 12:03:03PM -0800, Florian Fainelli wrote: On 01/17/2018 02:23 PM, Serge Semin wrote: It is useful to have the kernel virtual memory layout printed at boot time so to have the full information about the booted kernel. In some cases it might be unsafe to have virtual addresses freely visible in logs, so the %pK format is used if one want to hide them. Signed-off-by: Serge Semin I personally like having that information because that helps debug and have a quick reference, but there appears to be a trend to remove this in the name of security: https://patchwork.kernel.org/patch/10124007/ maybe hide this behind a configuration option? Yeah, arm code was the place I picked the function up.) But in my case I've used %pK so the pointers would disappear from logging when kptr_restrict sysctl is 1 or 2. I agree, that we might need to make the printouts optional. If there is any kernel config, which for instance increases the kernel security we could also use it or anything else to discard the printouts at compile time. Certainly, when KASLR is active it would be preferable to hide this information, so you could use CONFIG_RELOCATABLE. The existing KASLR stuff additionally hides this kind of information behind CONFIG_DEBUG_KERNEL, so that only people actively debugging the kernel see it: http://elixir.free-electrons.com/linux/v4.15-rc8/source/arch/mips/kernel/setup.c#L604 Ok. I'll hide the printouts behind both of that config macros in the next patchset version. Another thing to note - since ad67b74d2469d ("printk: hash addresses printed with %p") %pK at this time in the boot process is useless since the RNG is not sufficiently initialised and all prints end up being "(ptrval)". Hence after v4.15-rc2 we end up with output like: [0.00] Kernel virtual memory layout: [0.00] lowmem : 0x(ptrval) - 0x(ptrval) ( 256 MB) [0.00] .text : 0x(ptrval) - 0x(ptrval) (7374 kB) [0.00] .data : 0x(ptrval) - 0x(ptrval) (1901 kB) [0.00] .init : 0x(ptrval) - 0x(ptrval) (1600 kB) [0.00] .bss : 0x(ptrval) - 0x(ptrval) ( 415 kB) [0.00] vmalloc : 0x(ptrval) - 0x(ptrval) (1023 MB) [0.00] fixmap : 0x(ptrval) - 0x(ptrval) ( 68 kB) It must be some bug in the algo. What point in the %pK then? According to the documentation the only way to see the pointers is when (kptr_restrict == 0). But if it is we don't get into the restricted_pointer() method at all: http://elixir.free-electrons.com/linux/v4.15-rc9/source/lib/vsprintf.c#L1934 In this case the vsprintf() executes the method ptr_to_id(), which of course default to _not_ leak addresses, and hash it before printing. Really %pK isn't supposed to be dependent from RNG at all since kptr_restrict doesn't do any value randomization. That was true until v4.15-rc2. The behavior of %pK was changed without that being reflected in the documentation. A patch (https://patchwork.kernel.org/patch/10124413/) is in progress to update this. The %px format specifier was added for cases such as this, where we really want to print the unmodified address. And as long as this function is suitably guarded to only do this when KASLR is deactivated / CONFIG_DEBUG_KERNEL is activated, etc, then we are not unwittingly leaking information - we are deliberately making it available. 
> If %pK worked as stated by the kernel documentation:
> https://www.kernel.org/doc/Documentation/printk-formats.txt
> then the only change I'd suggest here is to guard the kernel memory layout printout method with the CONFIG_DEBUG_KERNEL ifdef-macro. The kptr_restrict should default to 1/2 if KASLR is activated:
> https://lwn.net/Articles/444556/

Yeah, again, the documentation is no longer correct, and %pK will always be hashed, and before the RNG is initialized it does not even hash it, just returning "(ptrval)". So I'd recommend guarding with CONFIG_DEBUG_KERNEL and switching the format specifier to %px.

Thanks,
Matt

> Regards,
> -Sergey
>
>> Thanks,
>> Matt
>
>>> Regards,
>>> -Sergey
>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>>>> --
>>>>>> Florian
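To make the %pK/%px distinction above concrete, here is a minimal sketch of the guarded printout Matt recommends; the function name is hypothetical and this is illustrative only, not the posted patch:

```c
#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Sketch: print a layout line only in debug builds, using %px for the
 * raw address; %pK would be hashed (or "(ptrval)" this early in boot).
 */
static void __init print_lowmem_layout_sketch(void)
{
#ifdef CONFIG_DEBUG_KERNEL
	pr_notice("lowmem : 0x%px - 0x%px\n",
		  (void *)PAGE_OFFSET, high_memory);
#endif
}
```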
Re: [PATCH 02/14] MIPS: memblock: Surely map BSS kernel memory section
Hi Serge,

On 23/01/18 19:27, Serge Semin wrote:
> Hello Matt,
>
> On Tue, Jan 23, 2018 at 11:03:27AM +0000, Matt Redfearn wrote:
>> Hi Serge,
>>
>> On 22/01/18 21:47, Serge Semin wrote:
>>> Hello Matt,
>>>
>>> On Mon, Jan 22, 2018 at 04:35:26PM +0000, Matt Redfearn wrote:
>>>> Hi Serge,
>>>>
>>>> On 17/01/18 22:23, Serge Semin wrote:
>>>>> The current MIPS code makes sure the kernel code/data/init sections are in the maps, but BSS should also be there.
>>>>
>>>> Quite right - it should. But this was protected against by reserving all bootmem up to the _end symbol here:
>>>> http://elixir.free-electrons.com/linux/v4.15-rc8/source/arch/mips/kernel/setup.c#L388
>>>> Which you remove in the next patch in this series. I'm not sure it is worth
>>>
>>> Right. Missed that part. The old code just doesn't set the kernel memory free, calling the free_bootmem() method for non-reserved parts below reserved_end.
>>>
>>>> disentangling the reserved_end stuff from the next patch to make this into a single logical change of reserving just .bss rather than everything below _end.
>>>
>>> Good point. I'll move this change into "[PATCH 05/14] MIPS: memblock: Add reserved memory regions to memblock". It logically belongs in that place. Since basically by the arch_mem_addpart() calls we reserve all the kernel
>>
>> Actually I was wrong - it's not this sequence of arch_mem_addpart's that reserves the kernel's memory. At least on DT based systems, it's pretty likely that these regions will overlap with the system memory already added. of_scan_flat_dt will look for the memory node and add it via early_init_dt_add_memory_arch. These calls to add the kernel text, init and bss detect that they overlap with the already present system memory, so don't get added, here:
>> http://elixir.free-electrons.com/linux/v4.15-rc9/source/arch/mips/kernel/setup.c#L759
>>
>> As such, when we print out the content of boot_mem_map, we only have a single entry:
>>
>> [0.00] Determined physical RAM map:
>> [0.00]  memory: 1000 @ (usable)
>>
>>> memory now I'd also merged them into a single call for the range [_text, _end]. What do you think?
>>
>> I think that this patch makes sense in case the .bss is for some reason not covered by an existing entry, but I would leave it as a separate patch.
>>
>> Your [PATCH 05/14] MIPS: memblock: Add reserved memory regions to memblock is actually self-contained since it replaces reserving all memory up to _end with the single reservation of the kernel's whole size:
>>
>> +	size = __pa_symbol(&_end) - __pa_symbol(&_text);
>> +	memblock_reserve(__pa_symbol(&_text), size);
>>
>> Which I think is definitely an improvement since it is much clearer.
>
> Alright, let's sum it up. First of all, yeah, you are right, arch_mem_addpart() is created to make sure the kernel memory is added to the memblock/bootmem pool. The previous arch code was leaving such a memory range non-freed since it was higher than reserved_end, so as to make sure the early memory allocations wouldn't be made from the pages where the kernel actually resides.
>
> In my code I still wanted to make sure the kernel memory is in the memblock pool. But I also noticed that the .bss memory range wouldn't be added to the pool if neither dts nor platform-specific code added any memory to the boot_mem_map pool. So I decided to fix it. The actual kernel memory reservation is performed after all the memory regions are declared by the code you cited. It's essential to do the [_text, _end] memory range reservation there, otherwise memblock may allocate from the memory range occupied by the kernel code/data.
>
> Since you agree with leaving it in the separate patch, I'd only suggest calling the arch_mem_addpart() method for just one range [_text, _end] instead of doing it three times for the separate _text, _data and bss sections (see the sketch after this thread). What do you think?

I think it's best left as 3 separate reservations, mainly due to the different attribute used for the init section. So all in all, I like this patch as it is.

Thanks,
Matt

> Regards,
> -Sergey

>>>> Reviewed-by: Matt Redfearn
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>>> Signed-off-by: Serge Semin
>>>>> ---
>>>>>  arch/mips/kernel/setup.c | 3 +++
>>>>>  1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
>>>>> index 76e9e2075..0d21c9e04 100644
>>>>> --- a/arch/mips/kernel/setup.c
>>>>> +++ b/arch/mips/kernel/setup.c
>>>>> @@ -845,6 +845,9 @@ static void __init arch_mem_init(char **cmdline_p)
>>>>>  	arch_mem_addpart(PFN_UP(__pa_symbol(&__init_begin)) << PAGE_SHIFT,
>>>>>  			 PFN_DOWN(__pa_symbol(&__init_end)) << PAGE_SHIFT,
>>>>>  			 BOOT_MEM_INIT_RAM);
>>>>> +	arch_mem_addpart(PFN_DOWN(__pa_symbol(&__bss_start)) << PAGE_SHIFT,
>>>>> +			 PFN_UP(__pa_symbol(&__bss_stop)) << PAGE_SHIFT,
>>>>> +			 BOOT_MEM_RAM);
>>>>>
>>>>>  	pr_info("Determined physical RAM map:\n");
>>>>>  	print_memory_map();
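For reference, this is roughly what Serge's floated single-range alternative would look like; it is a sketch of the idea discussed above, not what was applied (the three per-section calls were kept so .init retains its own BOOT_MEM_INIT_RAM attribute):

```c
/* Hypothetical merged variant: one region spanning the whole kernel image. */
arch_mem_addpart(PFN_DOWN(__pa_symbol(&_text)) << PAGE_SHIFT,
		 PFN_UP(__pa_symbol(&_end)) << PAGE_SHIFT,
		 BOOT_MEM_RAM);
```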
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
On 01/24/2018 03:09 PM, Christophe LEROY wrote:

Le 24/01/2018 à 10:35, Aneesh Kumar K.V a écrit :

Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I suggested in my last email on this subject (22/01/2018 9:22)?

yes. The test ran fine for me

You tried with 0x3000, it works as well on PPC32. I'd really like you to try with 0x1100 which is in the same slice as the 1002-1003 range.

Now that explains it better. But then the requested HEAPBASE was not free and hence the topdown search got an address in the below range.

7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)

The new range allocated is such that there is no scope for expansion of the heap if we do a topdown search. But why should that require us to change from topdown to bottomup search?

1000-1001 r-xp fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)
72d4-77d6 rw-p 00:00 0
77d6-77f1 r-xp fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp 00:00 0 [vdso]
77fa-77fe r-xp fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p 00:00 0 [stack]

For the specific test, one should pass the HEAPBASE value such that it can be expanded if required, isn't it?

-aneesh
Re: [PATCH v2] platform/x86: silead_dmi: Add touchscreen platform data for the Teclast X3 Plus tablet
There is no problem whatsoever! I'm a beginner on this so every correction or warning is welcome. Thank you both for your help!

--
Alberto Ponces

On Tue, Jan 23, 2018 at 05:18:44PM -0800, Dmitry Torokhov wrote:
> On Tue, Jan 23, 2018 at 05:02:59PM -0800, Darren Hart wrote:
> > On Tue, Jan 23, 2018 at 04:37:43PM -0800, Dmitry Torokhov wrote:
> > > On Tue, Jan 23, 2018 at 3:27 PM, Darren Hart wrote:
> > > > On Tue, Jan 23, 2018 at 06:33:38PM +0000, Alberto Ponces wrote:
> > > >> Add touchscreen platform data for the Teclast X3 Plus tablet.
> > > >>
> > > >> Reviewed-by: Hans de Goede
> > > >>
> > > >> Signed-off-by: Alberto Ponces
> > > >
> > > > Queued, thanks.
> > > >
> > > > Note for the future: Author signoff goes first, then reviewers, then committer.
> > >
> > > In this case Alberto was the committer, so his sign-off is last, as it should be. Any reported-by, suggested-by, acked-by or reviewed-by he collected should go above his sign-off. Once you picked up his patch you'll become the committer, so any markings you add should go between his sign-off and yours.
> >
> > Of course, you're correct. Thank you for the correction - and Alberto, apologies for the noise/confusion.
> >
> > Dmitry, I reviewed Documentation/process/submitting-patches.rst and 5.Posting.rst and didn't find this clearly defined. Have I missed where this is documented, or is an update in order?
>
> I do not think it was ever stated explicitly; the closest is "14) The canonical patch format", which states that the sign-off goes before the --- divider.
>
> I think it comes naturally if you consider patch vs pull request: if you decided to pull from Alberto's tree (or anyone else's tree) instead of taking the patch via email, then if they'd put non-sign-off tags after the sign-off, they'd end up in your and then Linus' tree like that, as you would not add your sign-off when doing git merge.
>
> But if you believe this should be called out explicitly then adding a few more words to section 14 should work.
>
> Thanks.
>
> --
> Dmitry
Re: [PATCH] drm/bridge/synopsys: dsi: use common mipi_dsi_create_packet()
Hi Brian,

On 01/23/2018 10:15 PM, Brian Norris wrote:
> Hi Philippe,
>
> On Thu, Jan 18, 2018 at 11:40:48AM +0000, Philippe CORNU wrote:
>> On 01/11/2018 12:16 PM, Philippe CORNU wrote:
>>> To be honest, I do not really like the memcpy here either, and I agree with you regarding the BE issue.
>>>
>>> My first "stm" driver (ie. before using this "freescale/rockchip" dw-mipi-dsi driver with the memcpy) used the "exact" same code as the Tegra dsi tegra_dsi_writesl() function with the 2 loops.
>>>
>>> https://elixir.free-electrons.com/linux/v4.14/source/drivers/gpu/drm/tegra/dsi.c#L1248
>>>
>>> IMHO, it is better than memcpy...
>>> I added these 3 "documentation" lines, maybe we may reuse them or something similar...
>>>
>>> /*
>>>  * Write 8-bit payload data into the 32-bit payload data register.
>>>  * ex: payload data "0x01, 0x02, 0x03, 0x04, 0x05, 0x06" will become
>>>  * "0x04030201 0x0605" 32-bit writes
>>>  */
>>>
>>> Not sure it helps to fix the BE issue but we may add a TODO stating that "this loop has not been tested on BE"...
>>>
>>> What is your opinion?
>
> I'm sorry, I don't think I noticed your reply here. I'm trying to unbury some email, but that's sometimes a losing battle...
>
> That code actually does look correct, and it's perhaps marginally better-looking in my opinion. It's up to you if you want to propose another patch :) At this point, it's only a matter of nice code, not correctness I believe.
>
>> As your patch has been merged, I have a few short questions, and for each related new patch I would like to know if you prefer that I implement it or if you prefer to do it yourself. It's really as you want; on my side, no problem to make them all, some or none. I just don't want us to implement these in parallel :-)
>>
>> * Do you have any opinion regarding Tegra-like loops vs the memcpy? (see my comment above) If you think the Tegra-like loops are a better approach than memcpy, there is a small patch to write.
>
> My opinion is above.

I do not know yet if I will send a patch but several reasons may push me to do it:
* Andrzej proposed nicer code in his last review, so it means the actual code with memcpy's is "not so nice" (even if it works fine)
* Several dsi drivers use the Tegra-like loops (Tegra, intel, ...) and in the vc4/exynos/mtk drivers memcpy is not used; msm uses memcpy... well, not sure it is then a good argument, different solutions for different hw...
* The coming cadence dsi bridge driver uses Tegra-like loops.
* I think my read function will also have Tegra-like loops; if that is the case, it could be nice to have something homogeneous...

Anyway, it is not an important point : )

>> * Returned value with number of bytes received/transferred: there is a small patch to write
>
> I don't think I followed that one very well. I'm not sure my opinion really matters, as long as you get someone else to agree. I do not plan to write any such patch in the near term.
>
>> * Regarding read operations: I propose to add a TODO + DRM_WARN in case someone wants to use the API for read operations. Note that I plan to implement the read feature but I do not know yet when, and maybe Rockchip people already have something ~ready?
>
> The warning would be nice to do now, regardless.
>
> Rockchip folks wrote up something for read support here [1], but it's based on a semi-forked version of the driver (we're trying to clean up the divergence, but it's not there yet). Perhaps it would provide useful fodder for your work. I don't think Rockchip is immediately working on upstreaming this particular patch, so it's totally fair to handle it yourself. It's got the GPL sign-off ;)
>
> Brian
>
> [1] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/863347

Very good information, I will have a look, many thanks
Philippe :-)
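For readers following the memcpy-vs-loops question, a sketch of the Tegra-like packing discussed above (the function name is illustrative, not the dw-mipi-dsi API); it produces the byte order from Philippe's comment, e.g. bytes 0x01 0x02 0x03 0x04 become the 32-bit write 0x04030201:

```c
#include <linux/io.h>
#include <linux/types.h>

/* Sketch: pack an 8-bit payload into 32-bit register writes, LSB first. */
static void dsi_write_payload_sketch(void __iomem *payload_reg,
				     const u8 *data, size_t len)
{
	size_t i, j;

	for (j = 0; j < len; j += 4) {
		u32 val = 0;

		/* Fold up to four bytes into one word, lowest byte first. */
		for (i = 0; i < 4 && j + i < len; i++)
			val |= (u32)data[j + i] << (8 * i);

		writel(val, payload_reg);
	}
}
```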
Re: [PATCH v4 08/13] iommu/rockchip: Control clocks needed to access the IOMMU
On Mon, Jan 22, 2018 at 1:09 PM, JeffyChen wrote:
> Hi Randy,
>
> On 01/22/2018 10:15 AM, JeffyChen wrote:
>> Hi Randy,
>>
>> On 01/22/2018 09:18 AM, Randy Li wrote:
>>> Also the power domain driver could manage the clocks as well, I would suggest using pm_runtime_*.
>>
>> actually the clocks required by the pm domain may not be the same as what we want to control here, there might be some clocks only needed when accessing mmu registers.
>>
>> but i'm not very sure about that, will confirm it with Simon Xue.
>
> confirmed with Simon, there might be some iommus that don't have a pd, and CONFIG_PM could be disabled.
>
> so it might be better to control clocks in the iommu driver itself.

Agreed with Jeffy. I'd give Reviewed-by, but this is my own patch reposted by Jeffy (thanks!), so it wouldn't have any value. :)

Best regards,
Tomasz
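As a rough sketch of the direction agreed above (the struct and clock names are hypothetical, not the actual rk_iommu fields), the IOMMU driver would bracket its register accesses with its own clock handling instead of relying on the power domain:

```c
#include <linux/clk.h>

struct rk_iommu_clocks_sketch {
	struct clk *aclk;  /* hypothetical: bus clock */
	struct clk *iface; /* hypothetical: register-interface clock */
};

/* Enable the clocks needed before touching the MMU registers. */
static int rk_iommu_enable_clocks_sketch(struct rk_iommu_clocks_sketch *c)
{
	int ret = clk_prepare_enable(c->aclk);

	if (ret)
		return ret;

	ret = clk_prepare_enable(c->iface);
	if (ret)
		clk_disable_unprepare(c->aclk);

	return ret;
}
```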
Re: [PULL] alpha.git
On Tue, 23 Jan 2018, Matt Turner wrote: > On Tue, Jan 23, 2018 at 2:23 AM, Mikulas Patocka wrote: > > > > > > On Sat, 20 Jan 2018, Matt Turner wrote: > > > >> Hi Linus, > >> > >> Please pull my alpha git tree. It contains a build fix and a regression > >> fix. > >> > >> Hopefully still in time for 4.15 :) > >> > >> Thanks, > >> Matt > > > > Hi > > > > Will you also submit these patches? The first one fixes a crash when > > pthread_create races with signal delivery, it could cause random crashing > > in applications. > > > > https://marc.info/?l=linux-alpha&m=151491969711913&w=2 > > https://marc.info/?l=linux-alpha&m=151491960011839&w=2 > > https://marc.info/?l=linux-alpha&m=151491963911901&w=2 > > Yes, I've already queued them up for the next merge window. I wasn't > sure if they were appropriate for 4.15 so late in the cycle. If you > think they are, I can send another pull request for 4.15. > > https://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha.git/log/?h=alpha-next > > Thanks, > Matt It's OK to send these patches in the 4.16 merge window. They will be backported to the stable kernels anyway. Mikulas
Re: [PATCH] x86/cpuid: Fewer CPUID invocations in init_scattered_cpuid_features()
On Tue, Jan 23, 2018 at 06:32:16PM +0100, Borislav Petkov wrote: > It's probably not even worth doing anything though - I doubt the speedup > is visible at all. One more thing I forgot to mention yesterday: I'm working on changing the CPUID parsing we do now and we'll probably end up simply reading in all CPUID leafs so scattered.c will disappear eventually. So I wouldn't waste my energy on it. :-) -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v8 04/12] x86/spectre: Add boot time option to select Spectre v2 mitigation
On Wed, 24 Jan 2018, Greg Kroah-Hartman wrote: > But will Andi's patch work well for you? Adding a MODULE_INFO() tag to > every module? Yes, that would work -- all the modules that get built in tree, or out of tree but with retpolined compiler, would have that marker that could be optionally checked for by the kernel. > I just thought since you were already using modversions in enterprise > distros already, that adding it there would be the simplest. The patch as-is introduces immediate modversion mismatch between retpolined kernel and non-retpolined module, making each and every one fail to load. Thanks, -- Jiri Kosina SUSE Labs
[PATCH] sched: completion: use bool in try_wait_for_completion
Use bool in place of int in the function try_wait_for_completion.

Signed-off-by: Gaurav Jindal
---
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 0926aef..3e15e8d 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -283,7 +283,7 @@ int __sched wait_for_completion_killable(struct completion *x)
 bool try_wait_for_completion(struct completion *x)
 {
 	unsigned long flags;
-	int ret = 1;
+	bool ret = true;
 
 	/*
 	 * Since x->done will need to be locked only
@@ -292,11 +292,11 @@ bool try_wait_for_completion(struct completion *x)
 	 * return early in the blocking case.
 	 */
 	if (!READ_ONCE(x->done))
-		return 0;
+		return false;
 
 	spin_lock_irqsave(&x->wait.lock, flags);
 	if (!x->done)
-		ret = 0;
+		ret = false;
 	else if (x->done != UINT_MAX)
 		x->done--;
 	spin_unlock_irqrestore(&x->wait.lock, flags);
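For context, a minimal illustrative caller of this API (the helper is hypothetical, not part of the patch); the bool return now reads naturally at the call site:

```c
#include <linux/completion.h>
#include <linux/printk.h>

/* Consume the completion only if it is already done; never blocks. */
static void poll_done_sketch(struct completion *done)
{
	if (try_wait_for_completion(done))
		pr_info("already complete, consumed without blocking\n");
	else
		pr_info("not done yet, caller may do other work\n");
}
```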
Re: [PATCH v8 04/12] x86/spectre: Add boot time option to select Spectre v2 mitigation
On Wed, Jan 24, 2018 at 11:32:00AM +1100, Kees Cook wrote: > I've wanted this for a while (especially for the coming detected > support for stack protector). Having more than just the clfags is, I > think, important. We'd likely want to record the entire environment > (compiler version, linker version, flags on both, etc). ... which makes me think of /proc/config.gz Can we do something like that here too? -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu
2018-01-24 17:46 GMT+08:00 Peter Zijlstra :
> On Wed, Jan 24, 2018 at 05:19:34PM +0800, Lin Xiulei wrote:
>> Sure, and I consider this "OK" works for "What goes wrong if we leave it set?". : )
>
> It would be good if you inspect the code for the case of leaving cpuctx->cgrp set with no cgroup events left -- AND -- put a blurb about what you found in your new Changelog.

I have some test cases for this issue; I don't know if it's good to put those in the changelog. The reproduction is as below.

Step 1: Create a program for measuring; write the below to the file d.py

```
while True:
    sumup = 0
    for i in range(1000):
        sumup += i
```

Step 2: Create the cgroup paths and run the relative programs

```
mkdir /sys/fs/cgroup/perf_event/test1
mkdir /sys/fs/cgroup/perf_event/test2
python d.py &
echo $! > /sys/fs/cgroup/perf_event/test1/cgroup.procs
python d.py &
echo $! > /sys/fs/cgroup/perf_event/test2/cgroup.procs
```

Step 3

```
perf stat -e cycles -G test1 -e cycles -G test2 -a sleep 1
```

You would see output like below

```
 Performance counter stats for 'system wide':

     2,161,022,123      cycles      test1
       138,626,073      cycles      test2

       1.001858328 seconds time elapsed
```

The result of test2 is much less than test1, which happens commonly. This is just because of what I mentioned above, that a second event couldn't be activated immediately, which causes some loss.

> I suspect it works out and something like perf_cgroup_switch() will fix things up for us later, but double check and test.

Exactly; the case above wouldn't show any discrepancy if no perf_cgroup_switch() happened.
Re: [PATCH v3] vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs
Alex / Joerg,

On 1/24/18 5:04 AM, Alex Williamson wrote:
>> @@ -648,12 +685,40 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>  	return i > npage ? npage : (i > 0 ? i : -EINVAL);
>>  }
>>  
>> +static size_t try_unmap_unpin_fast(struct vfio_domain *domain, dma_addr_t iova,
>> +				   size_t len, phys_addr_t phys,
>> +				   struct list_head *unmapped_regions)
>> +{
>> +	struct vfio_regions *entry;
>> +	size_t unmapped;
>> +
>> +	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
>> +	if (!entry)
>> +		return -ENOMEM;
>> +
>> +	unmapped = iommu_unmap_fast(domain->domain, iova, len);
>> +	if (WARN_ON(!unmapped)) {
>> +		kfree(entry);
>> +		return -EINVAL;
>> +	}
>
> Not sure about the handling of this, the zero check is just a consistency validation. If there's nothing mapped where we think there should be something mapped, we warn and throw out the whole vfio_dma. After this patch, such an error gets warned twice, which doesn't really seem to be an improvement.

Since iommu_unmap() and iommu_unmap_fast() can return errors, instead of just the zero check we should also check for errors, warn, and bail out of the whole vfio_dma.

Thanks,
Suravee
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
Le 24/01/2018 à 10:51, Aneesh Kumar K.V a écrit :

On 01/24/2018 03:09 PM, Christophe LEROY wrote:

Le 24/01/2018 à 10:35, Aneesh Kumar K.V a écrit :

Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I suggested in my last email on this subject (22/01/2018 9:22)?

yes. The test ran fine for me

You tried with 0x3000, it works as well on PPC32. I'd really like you to try with 0x1100 which is in the same slice as the 1002-1003 range.

Now that explains it better. But then the requested HEAPBASE was not free and hence the topdown search got an address in the below range.

7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)

The new range allocated is such that there is no scope for expansion of the heap if we do a topdown search. But why should that require us to change from topdown to bottomup search?

1000-1001 r-xp fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)
72d4-77d6 rw-p 00:00 0
77d6-77f1 r-xp fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp 00:00 0 [vdso]
77fa-77fe r-xp fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p 00:00 0 [stack]

For the specific test, one should pass the HEAPBASE value such that it can be expanded if required, isn't it?

For the test, yes, it is dumb to pass an unusable HEAPBASE, but what happens in real life:

* PPC32: No HEAPBASE, hugetlbfs defines a HEAPBASE at sbrk(0) + PAGE_SIZE = 0x1080 ==> This is in the same slice as already allocated ==> the kernel does as if mmap() had been called with no hint address and allocates something unusable instead.

* PPC64: No HEAPBASE, hugetlbfs seems to define a HEAPBASE at 1000, which doesn't conflict with an already allocated mapping ==> it works.

Now, when we take the generic case, ie when slice is not activated: when you call mmap() without a hint address, it allocates a suitable address because it does bottom-up. Why do it differently with slices?

Christophe
Re: [PATCH 02/14] MIPS: memblock: Surely map BSS kernel memory section
Hello Matt,

On Wed, Jan 24, 2018 at 09:49:31AM +0000, Matt Redfearn wrote:
> Hi Serge,
>
> On 23/01/18 19:27, Serge Semin wrote:
> > Hello Matt,
> >
> > On Tue, Jan 23, 2018 at 11:03:27AM +0000, Matt Redfearn wrote:
> > > Hi Serge,
> > >
> > > On 22/01/18 21:47, Serge Semin wrote:
> > > > Hello Matt,
> > > >
> > > > On Mon, Jan 22, 2018 at 04:35:26PM +0000, Matt Redfearn wrote:
> > > > > Hi Serge,
> > > > >
> > > > > On 17/01/18 22:23, Serge Semin wrote:
> > > > > > The current MIPS code makes sure the kernel code/data/init sections are in the maps, but BSS should also be there.
> > > > >
> > > > > Quite right - it should. But this was protected against by reserving all bootmem up to the _end symbol here:
> > > > > http://elixir.free-electrons.com/linux/v4.15-rc8/source/arch/mips/kernel/setup.c#L388
> > > > > Which you remove in the next patch in this series. I'm not sure it is worth
> > > >
> > > > Right. Missed that part. The old code just doesn't set the kernel memory free, calling the free_bootmem() method for non-reserved parts below reserved_end.
> > > >
> > > > > disentangling the reserved_end stuff from the next patch to make this into a single logical change of reserving just .bss rather than everything below _end.
> > > >
> > > > Good point. I'll move this change into "[PATCH 05/14] MIPS: memblock: Add reserved memory regions to memblock". It logically belongs in that place. Since basically by the arch_mem_addpart() calls we reserve all the kernel
> > >
> > > Actually I was wrong - it's not this sequence of arch_mem_addpart's that reserves the kernel's memory. At least on DT based systems, it's pretty likely that these regions will overlap with the system memory already added. of_scan_flat_dt will look for the memory node and add it via early_init_dt_add_memory_arch. These calls to add the kernel text, init and bss detect that they overlap with the already present system memory, so don't get added, here:
> > > http://elixir.free-electrons.com/linux/v4.15-rc9/source/arch/mips/kernel/setup.c#L759
> > >
> > > As such, when we print out the content of boot_mem_map, we only have a single entry:
> > >
> > > [0.00] Determined physical RAM map:
> > > [0.00]  memory: 1000 @ (usable)
> > >
> > > > memory now I'd also merged them into a single call for the range [_text, _end]. What do you think?
> > >
> > > I think that this patch makes sense in case the .bss is for some reason not covered by an existing entry, but I would leave it as a separate patch.
> > >
> > > Your [PATCH 05/14] MIPS: memblock: Add reserved memory regions to memblock is actually self-contained since it replaces reserving all memory up to _end with the single reservation of the kernel's whole size:
> > >
> > > +	size = __pa_symbol(&_end) - __pa_symbol(&_text);
> > > +	memblock_reserve(__pa_symbol(&_text), size);
> > >
> > > Which I think is definitely an improvement since it is much clearer.
> >
> > Alright, let's sum it up. First of all, yeah, you are right, arch_mem_addpart() is created to make sure the kernel memory is added to the memblock/bootmem pool. The previous arch code was leaving such a memory range non-freed since it was higher than reserved_end, so as to make sure the early memory allocations wouldn't be made from the pages where the kernel actually resides.
> >
> > In my code I still wanted to make sure the kernel memory is in the memblock pool. But I also noticed that the .bss memory range wouldn't be added to the pool if neither dts nor platform-specific code added any memory to the boot_mem_map pool. So I decided to fix it. The actual kernel memory reservation is performed after all the memory regions are declared by the code you cited. It's essential to do the [_text, _end] memory range reservation there, otherwise memblock may allocate from the memory range occupied by the kernel code/data.
> >
> > Since you agree with leaving it in the separate patch, I'd only suggest calling the arch_mem_addpart() method for just one range [_text, _end] instead of doing it three times for the separate _text, _data and bss sections. What do you think?
>
> I think it's best left as 3 separate reservations, mainly due to the different attribute used for the init section. So all in all, I like this patch as it is.

Alright. I'll leave it as is. Let's see what others think about it.

Regards,
-Sergey

> Thanks,
> Matt
>
> > Regards,
> > -Sergey
> >
> > > > > Reviewed-by: Matt Redfearn
> > > > >
> > > > > Thanks,
> > > > > Matt
> > > > >
> > > > > > Signed-off-by: Serge Semin
> > > > > > ---
> > > > > >  arch/mips/kernel/setup.c | 3 +++
> > > > > >  1 file changed, 3 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
> > > > > > index 76e9e2075..0d21c9e04 100644
> > > > > > --- a/arch/mips/kernel/setup.c
Re: [PATCH 0/3] ftrace, orc, x86, tracing: Fix stack traces again
On Wed, 24 Jan 2018 03:49:04 -0500 Steven Rostedt wrote:
> On Wed, 24 Jan 2018 08:46:50 +0100 Ingo Molnar wrote:
>
> > No fundamental objections from me, assuming they are well tested.
>
> Yeah, I ran it through my ftrace test suite, and they did fine till I hit test 20 of 34, which tests static ftrace (CONFIG_DYNAMIC_FTRACE not set), and it crashed. I'm hoping to have it fixed and retested today.

Looking at the crash output, it was hung tasks and not an actual crash. This is the first time I ran this test on 4.15-rc9; I'll make sure it's not something associated with rc9 before blaming my patches. With DYNAMIC_FTRACE disabled, the machine runs much slower. With all the PTI work, that could possibly have pushed it past a breaking point.

-- Steve
Re: [PATCH v3 bpf] bpf: introduce BPF_JIT_ALWAYS_ON config
On Tue, 2018-01-09 at 22:39 +0100, Daniel Borkmann wrote:
> On 01/09/2018 07:04 PM, Alexei Starovoitov wrote:
>> The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
>>
>> A quote from the google project zero blog:
>> "At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel - while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel's text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets."
>>
>> To make the attacker's job harder, introduce a BPF_JIT_ALWAYS_ON config option that removes the interpreter from the kernel in favor of JIT-only mode. So far eBPF JIT is supported by: x64, arm64, arm32, sparc64, s390, powerpc64, mips64
>>
>> The start of the JITed program is randomized and the code page is marked as read-only. In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
>>
>> v2->v3:
>> - move __bpf_prog_ret0 under ifdef (Daniel)
>>
>> v1->v2:
>> - fix init order, test_bpf and cBPF (Daniel's feedback)
>> - fix offloaded bpf (Jakub's feedback)
>> - add 'return 0' dummy in case something can invoke prog->bpf_func
>> - retarget bpf tree. For bpf-next the patch would need one extra hunk. It will be sent when the trees are merged back to net-next
>>
>> Considered doing:
>>   int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
>> but it seems better to land the patch as-is and in bpf-next remove the bpf_jit_enable global variable from all JITs, consolidate it in one place and remove this jit_init() function.
>>
>> Signed-off-by: Alexei Starovoitov
>
> Applied to bpf tree, thanks Alexei!

For stable too?
Re: [PATCH 11/14] MIPS: memblock: Print out kernel virtual mem layout
Hello Matt,

On Wed, Jan 24, 2018 at 09:46:07AM +0000, Matt Redfearn wrote:
> Hi Serge,
>
> On 23/01/18 19:10, Serge Semin wrote:
> > Hello Matt,
> >
> > On Tue, Jan 23, 2018 at 03:35:14PM +0000, Matt Redfearn wrote:
> > > Hi Serge,
> > >
> > > On 19/01/18 14:27, Serge Semin wrote:
> > > > On Fri, Jan 19, 2018 at 07:59:43AM +0000, Matt Redfearn wrote:
> > > >
> > > > Hello Matt,
> > > >
> > > > > Hi Serge,
> > > > >
> > > > > On 18/01/18 20:18, Serge Semin wrote:
> > > > > > On Thu, Jan 18, 2018 at 12:03:03PM -0800, Florian Fainelli wrote:
> > > > > > > On 01/17/2018 02:23 PM, Serge Semin wrote:
> > > > > > > > It is useful to have the kernel virtual memory layout printed at boot time so as to have the full information about the booted kernel. In some cases it might be unsafe to have virtual addresses freely visible in logs, so the %pK format is used if one wants to hide them.
> > > > > > > >
> > > > > > > > Signed-off-by: Serge Semin
> > > > > > >
> > > > > > > I personally like having that information because it helps debug and gives a quick reference, but there appears to be a trend to remove this in the name of security:
> > > > > > >
> > > > > > > https://patchwork.kernel.org/patch/10124007/
> > > > > > >
> > > > > > > maybe hide this behind a configuration option?
> > > > > >
> > > > > > Yeah, arm code was the place I picked the function up.) But in my case I've used %pK so the pointers would disappear from logging when the kptr_restrict sysctl is 1 or 2. I agree that we might need to make the printouts optional. If there is any kernel config which for instance increases the kernel security, we could also use it or anything else to discard the printouts at compile time.
> > > > >
> > > > > Certainly, when KASLR is active it would be preferable to hide this information, so you could use CONFIG_RELOCATABLE. The existing KASLR stuff additionally hides this kind of information behind CONFIG_DEBUG_KERNEL, so that only people actively debugging the kernel see it:
> > > > >
> > > > > http://elixir.free-electrons.com/linux/v4.15-rc8/source/arch/mips/kernel/setup.c#L604
> > > >
> > > > Ok. I'll hide the printouts behind both of those config macros in the next patchset version.
> > >
> > > Another thing to note - since ad67b74d2469d ("printk: hash addresses printed with %p"), %pK at this time in the boot process is useless since the RNG is not sufficiently initialised and all prints end up being "(ptrval)". Hence after v4.15-rc2 we end up with output like:
> > >
> > > [0.00] Kernel virtual memory layout:
> > > [0.00] lowmem  : 0x(ptrval) - 0x(ptrval)  ( 256 MB)
> > > [0.00] .text   : 0x(ptrval) - 0x(ptrval)  (7374 kB)
> > > [0.00] .data   : 0x(ptrval) - 0x(ptrval)  (1901 kB)
> > > [0.00] .init   : 0x(ptrval) - 0x(ptrval)  (1600 kB)
> > > [0.00] .bss    : 0x(ptrval) - 0x(ptrval)  ( 415 kB)
> > > [0.00] vmalloc : 0x(ptrval) - 0x(ptrval)  (1023 MB)
> > > [0.00] fixmap  : 0x(ptrval) - 0x(ptrval)  (  68 kB)
> >
> > It must be some bug in the algo. What is the point of %pK then? According to the documentation the only way to see the pointers is when (kptr_restrict == 0). But if it is, we don't get into the restricted_pointer() method at all:
> > http://elixir.free-electrons.com/linux/v4.15-rc9/source/lib/vsprintf.c#L1934
> > In this case vsprintf() executes the method ptr_to_id(), which of course defaults to _not_ leak addresses, and hashes the value before printing.
> >
> > Really %pK isn't supposed to be dependent on the RNG at all, since kptr_restrict doesn't do any value randomization.
>
> That was true until v4.15-rc2. The behavior of %pK was changed without that being reflected in the documentation. A patch (https://patchwork.kernel.org/patch/10124413/) is in progress to update this.
>
> > > The %px format specifier was added for cases such as this, where we really want to print the unmodified address. And as long as this function is suitably guarded to only do this when KASLR is deactivated / CONFIG_DEBUG_KERNEL is activated, etc, then we are not unwittingly leaking information - we are deliberately making it available.
> >
> > If %pK worked as stated by the kernel documentation:
> > https://www.kernel.org/doc/Documentation/printk-formats.txt
> > then the only change I'd suggest here is to guard the kernel memory layout printout method with the CONFIG_DEBUG_KERNEL ifdef-macro. The kptr_restrict should default to 1/2 if KASLR is activated:
> > https://lwn.net/Articles/444556/
>
> Yeah, again, the documentation is no longer correct, and %pK will always be hashed, and before the RNG is initialized it does not even hash it, just returning "(ptrval)". So I'd recommend guarding with CONFIG_DEBUG_KERNEL and switching the format specifier to %px.

Oh, it isn't the bug then) I'll do as you suggest and replace %pK with %px.
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
On 01/24/2018 03:33 PM, Christophe LEROY wrote:

Le 24/01/2018 à 10:51, Aneesh Kumar K.V a écrit :

On 01/24/2018 03:09 PM, Christophe LEROY wrote:

Le 24/01/2018 à 10:35, Aneesh Kumar K.V a écrit :

Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I suggested in my last email on this subject (22/01/2018 9:22)?

yes. The test ran fine for me

You tried with 0x3000, it works as well on PPC32. I'd really like you to try with 0x1100 which is in the same slice as the 1002-1003 range.

Now that explains it better. But then the requested HEAPBASE was not free and hence the topdown search got an address in the below range.

7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)

The new range allocated is such that there is no scope for expansion of the heap if we do a topdown search. But why should that require us to change from topdown to bottomup search?

1000-1001 r-xp fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)
72d4-77d6 rw-p 00:00 0
77d6-77f1 r-xp fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp 00:00 0 [vdso]
77fa-77fe r-xp fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p 00:00 0 [stack]

For the specific test, one should pass the HEAPBASE value such that it can be expanded if required, isn't it?

For the test, yes, it is dumb to pass an unusable HEAPBASE, but what happens in real life:

* PPC32: No HEAPBASE, hugetlbfs defines a HEAPBASE at sbrk(0) + PAGE_SIZE = 0x1080 ==> This is in the same slice as already allocated ==> the kernel does as if mmap() had been called with no hint address and allocates something unusable instead.

* PPC64: No HEAPBASE, hugetlbfs seems to define a HEAPBASE at 1000, which doesn't conflict with an already allocated mapping ==> it works.

Now, when we take the generic case, ie when slice is not activated: when you call mmap() without a hint address, it allocates a suitable address because it does bottom-up. Why do it differently with slices?

IIUC that is largely arch dependent; PPC64 always did a topdown search. Even for regular non-hugetlb mmap it did a topdown search. If you set legacy mmap we selected the bottom-up approach. You can check arch_pick_mmap_layout() for more details. Now x86 is slightly different: for the default search, if we can't find a mapping address it will try a bottomup search.

Having said that, if you think libhugetlbfs made assumptions with respect to 8xx and you don't want to break it, make the 8xx unmapped area search bottomup.

-aneesh
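For readers following along, the shape of the arch_pick_mmap_layout() logic Aneesh refers to is roughly this (a simplified sketch, not the exact powerpc source):

```c
/* Simplified: legacy mode searches bottom-up from TASK_UNMAPPED_BASE,
 * the default searches top-down from just below the stack. */
void arch_pick_mmap_layout_sketch(struct mm_struct *mm)
{
	if (mmap_is_legacy()) {
		mm->mmap_base = TASK_UNMAPPED_BASE;
		mm->get_unmapped_area = arch_get_unmapped_area;
	} else {
		mm->mmap_base = mmap_base();	/* below the stack, minus gap */
		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
	}
}
```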
Re: [PATCH RFC 0/4] irqchip: qcom: add support for PDC interrupt controller
On 23/01/18 18:44, Lina Iyer wrote:
> On Tue, Jan 23 2018 at 18:15 +0000, Sudeep Holla wrote:
>> Hi Lina,
>>
>> On Tue, Jan 23, 2018 at 5:56 PM, Lina Iyer wrote:
>>> On newer Qualcomm Technologies Inc's SoCs like the SDM845, the GIC is in a power domain that can be powered off when not needed. Interrupts that need to be sensed even when the GIC is powered off are routed through an interrupt controller in an always-on domain called the Power Domain Controller, a.k.a PDC. This series adds support for the PDC's interrupt controller.
>>
>> Sorry for the basic questions:
>>
>> 1. Will the GIC be powered off in any state other than system suspend?
>
> Yes. When all the CPUs are in idle, there is an opportunity to power off the CPU's power domain and the GIC. QCOM SoCs have been doing that for many generations now.

OK, interesting. In that case, is the GIC state saved/restored with some out-of-tree patches, or does the firmware take care of it so that it's transparent to Linux? Also, when do these PDC wakeup interrupts get configured?

>> 2. Why does this need to be done in Linux? Why can't it be transparent and hidden in the firmware doing the actual GIC power down? I assume Linux is not powering down the GIC.
>
> No. You are right, Linux is not powering off the GIC directly. A dedicated processor for power management in the SoC does that. Platform drivers in Linux know and configure the wakeup interrupts (depending on the usecase). This is runtime specific and this is the way to tell the SoC to wake up the processor even if the GIC and the CPU domain were powered off.

OK, understood. By transparent, I mean the firmware could copy the interrupts enabled in the GIC to the PDC. It need not be kernel driven.

>> 3. I see some bits that enable secure interrupts in one of the patches. Is it even safe to allow Linux to enable some secure interrupts in the PDC?
>
> Linux should not and would not configure secure interrupts. We would not have permissions for secure interrupts. The interrupt names might be a misnomer, but the interrupts listed in patch #4 are all non-secure interrupts.

OK. So I can assume the PDC is partitioned into secure and non-secure. If not, it's safer not to have any access to the PDC in the kernel. A couple of designs of similar PDCs I have seen are system wide and handle even the secure part of the system. That was the main reason for checking.

--
Regards,
Sudeep
Re: [PATCH v3 bpf] bpf: introduce BPF_JIT_ALWAYS_ON config
On 01/24/2018 11:07 AM, David Woodhouse wrote:
> On Tue, 2018-01-09 at 22:39 +0100, Daniel Borkmann wrote:
>> On 01/09/2018 07:04 PM, Alexei Starovoitov wrote:
>>> The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
>>>
>>> A quote from the google project zero blog:
>>> "At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel - while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel's text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets."
>>>
>>> To make the attacker's job harder, introduce a BPF_JIT_ALWAYS_ON config option that removes the interpreter from the kernel in favor of JIT-only mode. So far eBPF JIT is supported by: x64, arm64, arm32, sparc64, s390, powerpc64, mips64
>>>
>>> The start of the JITed program is randomized and the code page is marked as read-only. In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
>>>
>>> v2->v3:
>>> - move __bpf_prog_ret0 under ifdef (Daniel)
>>>
>>> v1->v2:
>>> - fix init order, test_bpf and cBPF (Daniel's feedback)
>>> - fix offloaded bpf (Jakub's feedback)
>>> - add 'return 0' dummy in case something can invoke prog->bpf_func
>>> - retarget bpf tree. For bpf-next the patch would need one extra hunk. It will be sent when the trees are merged back to net-next
>>>
>>> Considered doing:
>>>   int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
>>> but it seems better to land the patch as-is and in bpf-next remove the bpf_jit_enable global variable from all JITs, consolidate it in one place and remove this jit_init() function.
>>>
>>> Signed-off-by: Alexei Starovoitov
>>
>> Applied to bpf tree, thanks Alexei!
>
> For stable too?

Yes, this will go into stable as well; a batch of backports will come Thurs/Fri.
[PATCH v2 3/3] mtd: nand: davinci: Use devm_ioremap_shared_resource()
Simplify error handling by using devm_ioremap_shared_resource().

Signed-off-by: Ladislav Michl
---
Changes:
- v2: None

 drivers/mtd/nand/davinci_nand.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/mtd/nand/davinci_nand.c b/drivers/mtd/nand/davinci_nand.c
index ccc8c43abcff..9b6f06b177b9 100644
--- a/drivers/mtd/nand/davinci_nand.c
+++ b/drivers/mtd/nand/davinci_nand.c
@@ -612,10 +612,8 @@ static int nand_davinci_probe(struct platform_device *pdev)
 {
 	struct davinci_nand_pdata	*pdata;
 	struct davinci_nand_info	*info;
-	struct resource			*res1;
-	struct resource			*res2;
+	struct resource			*res;
 	void __iomem			*vaddr;
-	void __iomem			*base;
 	int				ret;
 	uint32_t			val;
 	struct mtd_info			*mtd;
@@ -638,14 +636,8 @@ static int nand_davinci_probe(struct platform_device *pdev)
 
 	platform_set_drvdata(pdev, info);
 
-	res1 = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	res2 = platform_get_resource(pdev, IORESOURCE_MEM, 1);
-	if (!res1 || !res2) {
-		dev_err(&pdev->dev, "resource missing\n");
-		return -EINVAL;
-	}
-
-	vaddr = devm_ioremap_resource(&pdev->dev, res1);
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	vaddr = devm_ioremap_resource(&pdev->dev, res);
 	if (IS_ERR(vaddr))
 		return PTR_ERR(vaddr);
 
@@ -655,14 +647,12 @@ static int nand_davinci_probe(struct platform_device *pdev)
 	 * by AEMIF, so we cannot request it twice, just ioremap.
 	 * The AEMIF and NAND drivers not use the same registers in this range.
 	 */
-	base = devm_ioremap(&pdev->dev, res2->start, resource_size(res2));
-	if (!base) {
-		dev_err(&pdev->dev, "ioremap failed for resource %pR\n", res2);
-		return -EADDRNOTAVAIL;
-	}
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
+	info->base = devm_ioremap_shared_resource(&pdev->dev, res);
+	if (IS_ERR(info->base))
+		return PTR_ERR(info->base);
 
 	info->dev = &pdev->dev;
-	info->base = base;
 	info->vaddr = vaddr;
 
 	mtd = nand_to_mtd(&info->chip);
-- 
2.15.1
[PATCH v2 1/3] devres: Move devm_ioremap_resource() out of device.h
Move devm_ioremap_resource() out of device.h into io.h to be consistent with similar APIs.

Signed-off-by: Ladislav Michl
---
Changes:
- v2: new patch

 include/linux/device.h | 2 --
 include/linux/io.h     | 1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 4d88b6b9cda9..c9fcee2f5b91 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -683,8 +683,6 @@ extern unsigned long devm_get_free_pages(struct device *dev, gfp_t gfp_mask,
 					 unsigned int order);
 extern void devm_free_pages(struct device *dev, unsigned long addr);
 
-void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res);
-
 /* allows to add/remove a custom action to devres stack */
 int devm_add_action(struct device *dev, void (*action)(void *), void *data);
 void devm_remove_action(struct device *dev, void (*action)(void *), void *data);
diff --git a/include/linux/io.h b/include/linux/io.h
index 32e30e8fb9db..2aea3363bfb2 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -79,6 +79,7 @@ void __iomem *devm_ioremap_nocache(struct device *dev, resource_size_t offset,
 				   resource_size_t size);
 void __iomem *devm_ioremap_wc(struct device *dev, resource_size_t offset,
 			      resource_size_t size);
+void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res);
 void devm_iounmap(struct device *dev, void __iomem *addr);
 int check_signature(const volatile void __iomem *io_addr,
 		    const unsigned char *signature, int length);
-- 
2.15.1
[PATCH v2 2/3] devres: Add devm_ioremap_shared_resource()
Implement managed ioremap function for shared resources.

Signed-off-by: Ladislav Michl
---
Changes:
- v2: Rebased on top of PATCH v2 1/3

 include/linux/io.h |  8 +++-
 lib/devres.c       | 22 ++
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/io.h b/include/linux/io.h
index 2aea3363bfb2..2b9eb48e0f49 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -79,10 +79,16 @@ void __iomem *devm_ioremap_nocache(struct device *dev, resource_size_t offset,
 				   resource_size_t size);
 void __iomem *devm_ioremap_wc(struct device *dev, resource_size_t offset,
 			      resource_size_t size);
-void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res);
+#define devm_ioremap_resource(dev, res) \
+	__devm_ioremap_resource(dev, res, false)
+#define devm_ioremap_shared_resource(dev, res) \
+	__devm_ioremap_resource(dev, res, true)
+void __iomem *__devm_ioremap_resource(struct device *dev, struct resource *res,
+				      bool shared);
 void devm_iounmap(struct device *dev, void __iomem *addr);
 int check_signature(const volatile void __iomem *io_addr,
 		    const unsigned char *signature, int length);
+
 void devm_ioremap_release(struct device *dev, void *res);
 
 void *devm_memremap(struct device *dev, resource_size_t offset,
diff --git a/lib/devres.c b/lib/devres.c
index 5f2aedd58bc5..7711ff40a572 100644
--- a/lib/devres.c
+++ b/lib/devres.c
@@ -22,6 +22,9 @@ static int devm_ioremap_match(struct device *dev, void *res, void *match_data)
  * @size: Size of map
  *
  * Managed ioremap(). Map is automatically unmapped on driver detach.
+ *
+ * When possible, use devm_ioremap_resource() or
+ * devm_ioremap_shared_resource() instead.
  */
 void __iomem *devm_ioremap(struct device *dev, resource_size_t offset,
 			   resource_size_t size)
@@ -116,13 +119,14 @@ void devm_iounmap(struct device *dev, void __iomem *addr)
 EXPORT_SYMBOL(devm_iounmap);
 
 /**
- * devm_ioremap_resource() - check, request region, and ioremap resource
+ * __devm_ioremap_resource() - check, request region, and ioremap resource
  * @dev: generic device to handle the resource for
  * @res: resource to be handled
+ * @shared: region is not requested when true
  *
- * Checks that a resource is a valid memory region, requests the memory
- * region and ioremaps it. All operations are managed and will be undone
- * on driver detach.
+ * Checks that a resource is a valid memory region, eventually requests the
+ * memory region and ioremaps it. All operations are managed and will be
+ * undone on driver detach.
  *
  * Returns a pointer to the remapped memory or an ERR_PTR() encoded error code
  * on failure.
 * Usage example:
@@ -132,7 +136,8 @@ EXPORT_SYMBOL(devm_iounmap);
  *	if (IS_ERR(base))
  *		return PTR_ERR(base);
  */
-void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res)
+void __iomem *__devm_ioremap_resource(struct device *dev, struct resource *res,
+				      bool shared)
 {
 	resource_size_t size;
 	const char *name;
@@ -148,7 +153,7 @@ void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res)
 	size = resource_size(res);
 	name = res->name ?: dev_name(dev);
 
-	if (!devm_request_mem_region(dev, res->start, size, name)) {
+	if (!shared && !devm_request_mem_region(dev, res->start, size, name)) {
 		dev_err(dev, "can't request region for resource %pR\n", res);
 		return IOMEM_ERR_PTR(-EBUSY);
 	}
@@ -156,13 +161,14 @@ void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res)
 	dest_ptr = devm_ioremap(dev, res->start, size);
 	if (!dest_ptr) {
 		dev_err(dev, "ioremap failed for resource %pR\n", res);
-		devm_release_mem_region(dev, res->start, size);
+		if (!shared)
+			devm_release_mem_region(dev, res->start, size);
 		dest_ptr = IOMEM_ERR_PTR(-ENOMEM);
 	}
 
 	return dest_ptr;
 }
-EXPORT_SYMBOL(devm_ioremap_resource);
+EXPORT_SYMBOL(__devm_ioremap_resource);
 
 #ifdef CONFIG_HAS_IOPORT_MAP
 /*
-- 
2.15.1
[PATCH v2 0/3] Add managed ioremap function for shared resources
Many drivers can be simplified by using devm_ioremap_resource() instead of open coding its functionality. However, as pointed out by Wei Yongjun, that function cannot be used when the memory region is already taken. See the previous discussion here:
https://www.spinics.net/lists/linux-pci/msg68495.html

To ease the job of driver developers, a new function for that purpose is implemented and its usage shown on the davinci mtd driver.

Changes from previous version:
- moved function prototype in headers the other way around (PATCH 1/3); the rest of the patches was dropped.

Ladislav Michl (3):
  devres: Move devm_ioremap_resource() out of device.h
  devres: Add devm_ioremap_shared_resource()
  mtd: nand: davinci: Use devm_ioremap_shared_resource()

 drivers/mtd/nand/davinci_nand.c | 24 +++-
 include/linux/device.h          |  2 --
 include/linux/io.h              |  7 +++
 lib/devres.c                    | 22 ++
 4 files changed, 28 insertions(+), 27 deletions(-)

-- 
2.15.1
Re: [PATCH V5 4/8] perf/x86/intel/uncore: add new data structures for free running counters
On Tue, Jan 23, 2018 at 10:00:58PM +0000, Liang, Kan wrote:
>>> On Fri, Jan 19, 2018 at 12:24:17PM -0800, Andi Kleen wrote:
>>>> Oh, think a bit more. I think we cannot do the same thing as we did for the CPU PMU's fixed counters.
>>>>
>>>> The counters here are free running counters. They cannot be started/stopped.
>>>
>>> Yes, free running counters have completely different semantics. They need a separate event code.
>>
>> The only thing that matters is if they count the same thing or not.
>
> Hi Peter,
>
> There is NO event available on the GPs that is exactly the same as the free-running counters.
>
> For example, the BW free-running counters count the requests associated with writes and completions. The most similar events on the GPs are the DATA_REQ_{OF,BY}_CPU.* events, except that some of their sub-events count requests which are not completions. There are also other minor differences. So we don't have alternative events for the free-running counters. I think we have to use 0xff.

OK, but explicitly mention this as the reason for having to invent event codes. Them being fixed purpose or free running isn't a valid reason for that.
Re: [PATCH v2] staging: android: ion: Zero CMA allocated memory
On 01/22/2018 09:46 AM, Liam Mark wrote:

Since commit 204f672255c2 ("staging: android: ion: Use CMA APIs directly") the CMA API is now used directly and therefore the allocated memory is no longer automatically zeroed. Explicitly zero CMA allocated memory to ensure that no data is exposed to userspace.

Fixes: 204f672255c2 ("staging: android: ion: Use CMA APIs directly")
Signed-off-by: Liam Mark
---
Changes in v2:
- Clean up the commit message.
- Add 'Fixes:'

 drivers/staging/android/ion/ion_cma_heap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/staging/android/ion/ion_cma_heap.c b/drivers/staging/android/ion/ion_cma_heap.c
index 86196ffd2faf..91a98785607a 100644
--- a/drivers/staging/android/ion/ion_cma_heap.c
+++ b/drivers/staging/android/ion/ion_cma_heap.c
@@ -51,6 +51,8 @@ static int ion_cma_allocate(struct ion_heap *heap, struct ion_buffer *buffer,
 	if (!pages)
 		return -ENOMEM;
 
+	memset(page_address(pages), 0, size);
+
 	table = kmalloc(sizeof(*table), GFP_KERNEL);
 	if (!table)
 		goto err;

This won't work for highmem pages. You need to zero by page via kmap_atomic in that case. You can use PageHighMem to separate the paths.

Thanks,
Laura
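A sketch of the per-page zeroing Laura describes, for illustration (the helper name is hypothetical and this is not the applied v2 patch):

```c
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Zero a CMA allocation page by page so highmem pages are covered too. */
static void ion_zero_pages_sketch(struct page *pages, unsigned long nr_pages,
				  size_t size)
{
	if (PageHighMem(pages)) {
		unsigned long nr = nr_pages;
		struct page *page = pages;

		while (nr--) {
			/* Map each page temporarily; highmem has no
			 * permanent kernel mapping to memset directly. */
			void *vaddr = kmap_atomic(page++);

			memset(vaddr, 0, PAGE_SIZE);
			kunmap_atomic(vaddr);
		}
	} else {
		memset(page_address(pages), 0, size);
	}
}
```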
Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice
Le 24/01/2018 à 11:08, Aneesh Kumar K.V a écrit :

On 01/24/2018 03:33 PM, Christophe LEROY wrote:

Le 24/01/2018 à 10:51, Aneesh Kumar K.V a écrit :

On 01/24/2018 03:09 PM, Christophe LEROY wrote:

Le 24/01/2018 à 10:35, Aneesh Kumar K.V a écrit :

Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I suggested in my last email on this subject (22/01/2018 9:22)?

yes. The test ran fine for me

You tried with 0x3000, it works as well on PPC32. I'd really like you to try with 0x1100 which is in the same slice as the 1002-1003 range.

Now that explains it better. But then the requested HEAPBASE was not free and hence the topdown search got an address in the below range.

7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)

The new range allocated is such that there is no scope for expansion of the heap if we do a topdown search. But why should that require us to change from topdown to bottomup search?

1000-1001 r-xp fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
7efffd00-7f00 rw-p 00:0d 1082770 /anon_hugepage (deleted)
72d4-77d6 rw-p 00:00 0
77d6-77f1 r-xp fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp 00:00 0 [vdso]
77fa-77fe r-xp fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p 00:00 0 [stack]

For the specific test, one should pass the HEAPBASE value such that it can be expanded if required, isn't it?

For the test, yes, it is dumb to pass an unusable HEAPBASE, but what happens in real life:

* PPC32: No HEAPBASE, hugetlbfs defines a HEAPBASE at sbrk(0) + PAGE_SIZE = 0x1080 ==> This is in the same slice as already allocated ==> the kernel does as if mmap() had been called with no hint address and allocates something unusable instead.

* PPC64: No HEAPBASE, hugetlbfs seems to define a HEAPBASE at 1000, which doesn't conflict with an already allocated mapping ==> it works.

Now, when we take the generic case, ie when slice is not activated: when you call mmap() without a hint address, it allocates a suitable address because it does bottom-up. Why do it differently with slices?

IIUC that is largely arch dependent; PPC64 always did a topdown search. Even for regular non-hugetlb mmap it did a topdown search. If you set legacy mmap we selected the bottom-up approach. You can check arch_pick_mmap_layout() for more details. Now x86 is slightly different: for the default search, if we can't find a mapping address it will try a bottomup search.

Having said that, if you think libhugetlbfs made assumptions with respect to 8xx and you don't want to break it, make the 8xx unmapped area search bottomup.

Or would there be a way to make libhugetlbfs aware of the slice constraints and make it choose a suitable hint address at the first try?

Christophe
Re: [PATCH RFC 0/3] API for 128-bit IO access
On Wed, Jan 24, 2018 at 12:05:16PM +0300, Yury Norov wrote: > This series adds API for 128-bit memory IO access and enables it for ARM64. > The original motivation for 128-bit API came from new Cavium network device > driver. The hardware requires 128-bit access to make things work. See > description in patch 3 for details. > > Also, starting from ARMv8.4, stp and ldp instructions become atomic, and > API for 128-bit access would be helpful in core arm64 code. Only for normal, cacheable memory, so they're not suitable for IO accesses as you're proposing here. Will