Re: a question of split_huge_page

2020-07-10 Thread Mika Penttilä


On 10.7.2020 10.00, Alex Shi wrote:
>
> 在 2020/7/10 下午1:28, Mika Penttilä 写道:
>>> Thanks a lot for the quick reply!
>>> What confuses me is the call chain from __iommu_dma_alloc_pages()
>>> to split_huge_page(): in that function, the page being split comes from
>>> page = alloc_pages_node(nid, alloc_flags, order);
>>> And if the pages were added to the LRU, they could be reclaimed and lost,
>>> which would be a panic bug. But in fact, this has never happened in a long time.
>>> I also put a BUG() at that line; it never triggered in ltp or run_vmtests
>> In __iommu_dma_alloc_pages(), after split_huge_page(), who takes a
>> reference on the tail pages? It seems the tail pages are freed, and the
>> function erroneously returns them in the pages[] array for use.
>>
> Why do you say so? It looks like the tail pages are returned and used via
>   pages = __iommu_dma_alloc_pages() in iommu_dma_alloc_remap(),
> and are still on the node's lru. Is this right?
>
> thanks!
IMHO they are new pages coming from alloc_pages_node(), so they are not
on the LRU. And split_huge_page() frees the unpinned tail pages back to the
page allocator.

Thanks,
Mika







Re: a question of split_huge_page

2020-07-09 Thread Mika Penttilä


On 10.7.2020 7.51, Alex Shi wrote:
>
> 在 2020/7/10 上午12:07, Kirill A. Shutemov 写道:
>> On Thu, Jul 09, 2020 at 04:50:02PM +0100, Matthew Wilcox wrote:
>>> On Thu, Jul 09, 2020 at 11:11:11PM +0800, Alex Shi wrote:
 Hi Kirill & Matthew,

 In the call chain from split_huge_page() to lru_add_page_tail(),
 tail pages seem to be added to the lru list at line 963, but in this scenario
 the head page has no lru bit and the bit isn't set later. Why do we do this,
 or am I missing something?
>>> I don't understand how we get to split_huge_page() with a page that's
>>> not on an LRU list.  Both anonymous and page cache pages should be on
>>> an LRU list.  What am I missing?
>
> Thanks a lot for the quick reply!
> What confuses me is the call chain from __iommu_dma_alloc_pages()
> to split_huge_page(): in that function, the page being split comes from
>   page = alloc_pages_node(nid, alloc_flags, order);
> And if the pages were added to the LRU, they could be reclaimed and lost,
> which would be a panic bug. But in fact, this has never happened in a long time.
> I also put a BUG() at that line; it never triggered in ltp or run_vmtests


In __iommu_dma_alloc_pages(), after split_huge_page(), who takes a
reference on the tail pages? It seems the tail pages are freed, and the
function erroneously returns them in the pages[] array for use.

> in kselftest.
>
>> Right, and it never got removed from the LRU during the split. The tail
>> pages have to be added to the LRU because they are now separate from the
>> head page.
>>
> According to that explanation, it looks like we could remove the code path,
> since it is never reached (based on my v15 patchset). Any comments?
>
> Thanks
> Alex
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7c52c5228aab..c28409509ad3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2357,17 +2357,6 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
>  	if (!list)
>  		SetPageLRU(page_tail);
>
>  	if (likely(PageLRU(head)))
>  		list_add_tail(&page_tail->lru, &head->lru);
>  	else if (list) {
>  		/* page reclaim is reclaiming a huge page */
>  		get_page(page_tail);
>  		list_add_tail(&page_tail->lru, list);
> -	} else {
> -		/*
> -		 * Head page has not yet been counted, as an hpage,
> -		 * so we must account for each subpage individually.
> -		 *
> -		 * Put page_tail on the list at the correct position
> -		 * so they all end up in order.
> -		 */
> -		VM_BUG_ON_PAGE(1, head);
> -		add_page_to_lru_list_tail(page_tail, lruvec,
> -					  page_lru(page_tail));
> -	}
>  }





Re: [PATCH 4/6] vhost_vdpa: support doorbell mapping via mmap

2020-05-29 Thread Mika Penttilä

Hi,

On 29.5.2020 11.03, Jason Wang wrote:

Currently the doorbell is relayed via eventfd which may have
significant overhead because of the cost of vmexits or syscall. This
patch introduces mmap() based doorbell mapping which can eliminate the
overhead caused by vmexit or syscall.


Just wondering; I know very little about vdpa. But how is such a "sw
doorbell" monitored or observed, if there is no fault or vmexit etc.?

Is there some kind of polling used?


To ease the userspace modeling of the doorbell layout (usually
virtio-pci), this patch starts from a doorbell-per-page
model. Vhost-vdpa only supports a hardware doorbell that sits at the
boundary of a page and does not share the page with other registers.

The doorbell of each virtqueue must be mapped separately; pgoff is the
index of the virtqueue. This allows userspace to map a subset of the
doorbells, which may be useful for the implementation of software
assisted virtqueues (control vq) in the future.

Signed-off-by: Jason Wang 
---
  drivers/vhost/vdpa.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 59 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 6ff72289f488..bbe23cea139a 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -741,12 +742,70 @@ static int vhost_vdpa_release(struct inode *inode, struct file *filep)
return 0;
  }
  
+static vm_fault_t vhost_vdpa_fault(struct vm_fault *vmf)
+{
+   struct vhost_vdpa *v = vmf->vma->vm_file->private_data;
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+   struct vdpa_notification_area notify;
+   struct vm_area_struct *vma = vmf->vma;
+   u16 index = vma->vm_pgoff;
+
+   notify = ops->get_vq_notification(vdpa, index);
+
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+   if (remap_pfn_range(vma, vmf->address & PAGE_MASK,
+   notify.addr >> PAGE_SHIFT, PAGE_SIZE,
+   vma->vm_page_prot))
+   return VM_FAULT_SIGBUS;
+
+   return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vhost_vdpa_vm_ops = {
+   .fault = vhost_vdpa_fault,
+};
+
+static int vhost_vdpa_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct vhost_vdpa *v = vma->vm_file->private_data;
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+   struct vdpa_notification_area notify;
+   int index = vma->vm_pgoff;
+
+   if (vma->vm_end - vma->vm_start != PAGE_SIZE)
+   return -EINVAL;
+   if ((vma->vm_flags & VM_SHARED) == 0)
+   return -EINVAL;
+   if (vma->vm_flags & VM_READ)
+   return -EINVAL;
+   if (index > 65535)
+   return -EINVAL;
+   if (!ops->get_vq_notification)
+   return -ENOTSUPP;
+
+   /* To be safe and easily modelled by userspace, we only
+    * support a doorbell which sits on the page boundary and
+    * does not share the page with other registers.
+    */
+   notify = ops->get_vq_notification(vdpa, index);
+   if (notify.addr & (PAGE_SIZE - 1))
+   return -EINVAL;
+   if (vma->vm_end - vma->vm_start != notify.size)
+   return -ENOTSUPP;
+
+   vma->vm_ops = &vhost_vdpa_vm_ops;
+   return 0;
+}
+
  static const struct file_operations vhost_vdpa_fops = {
.owner  = THIS_MODULE,
.open   = vhost_vdpa_open,
.release= vhost_vdpa_release,
.write_iter = vhost_vdpa_chr_write_iter,
.unlocked_ioctl = vhost_vdpa_unlocked_ioctl,
+   .mmap   = vhost_vdpa_mmap,
.compat_ioctl   = compat_ptr_ioctl,
  };
  




BUG/Oops - unix domain sockets / wakeups

2019-08-25 Thread Mika Penttilä
Hello,

I have gotten a couple of these in a week with 5.3-rc. It happens very
randomly and infrequently; the system has been nearly idle.
Otoh, this seems new to 5.3; I don't remember it happening before.

Thanks,
Mika

# [386517.339218] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[386517.344726] Modules linked in: btwilink radio_quantek st_drv
[386517.350488] CPU: 0 PID: 363 Comm: compositor Not tainted 5.3.0-rc5 #5
[386517.357016] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[386517.363644] PC is at __wake_up_common+0x5c/0x134
[386517.368351] LR is at __wake_up_common+0x90/0x134
[386517.373057] pc : [<80152bec>]    lr : [<80152c20>]    psr: 80070093
[386517.379411] sp : a9193cf0  ip :   fp : 0001
[386517.384724] r10: 0001  r9 : a9193d2c  r8 : 80219b08
[386517.390037] r7 : 0001  r6 : a63668c4  r5 :   r4 : 00f4
[386517.396653] r3 : 0304  r2 : 0100  r1 : 0001  r0 : 
[386517.403270] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM 
Segment user
[386517.410580] Control: 10c5387d  Table: 3942404a  DAC: 0055
[386517.416415] Process compositor (pid: 363, stack limit = 0x5fe37b3d)
[386517.422771] Stack: (0xa9193cf0 to 0xa9194000)
[386517.427218] 3ce0: 0304
0001  a63668c0
[386517.435486] 3d00: 0001 20070013 0001 0001 a9193d2c
0304  80152e90
[386517.443754] 3d20: 0304 a9193d2c 012dd844  
 0100 0122
[386517.452022] 3d40: a8cc8f00 a8cc8fec a8cca7a0 a9193e4c a8cca700
0fc0 0040 80152f74
[386517.460289] 3d60: 0304  a9192000 8079cc44 a8cc8f00
806d4310 9a0e6180 
[386517.468556] 3d80: a8cca7a0 807a0744 a9193e4c  
  
[386517.476824] 3da0: 9a0e6180 806da340 9a0e6180 806da3ac 9a0e6180
806daa0c 9a0e6180 8079d730
[386517.485091] 3dc0: 7eb24360 a9193e14  0040 a8cca944
  a8cca700
[386517.493358] 3de0: a6752f40 a8cca90c 0001  a9192000
0001  
[386517.501627] 3e00:    a9193e14 7eb245c8
a9193e78 80101204 a9193f70
[386517.509896] 3e20: a9193e74 a9193f68 4040 a9193eb8 7eb245c8
7eb245e4 a6752f40 0129
[386517.518163] 3e40: 7615db10 8079dac8 0002 8079c408 a6752f40
a9193f68  1000
[386517.526431] 3e60: 4040  7eb245c8 806cfe7c 
 014c9b90 0d00
[386517.534698] 3e80: 014c9890 0300  011c52cc 75bc9788
0001 a8cdfa00 a8cdfa0c
[386517.542966] 3ea0: a8f7e980 8024afd8 0148ef88  
0019 75bc9788 8024aeb4
[386517.551233] 3ec0: a8f7e980 a8f7e9ac a8f7e9b4 a8f7e980 
808fe018  8024a7ac
[386517.559500] 3ee0: a9193ee0 a9193ee0 a94b0c00 0020 7eb246e0
a94b0c01 a94b0c00 a8f7e980
[386517.567767] 3f00: 0001 a8f7e9ac  8024b43c 
7eb24318 a89ccf68 a94d5480
[386517.576034] 3f20: a8fce600 80223bc0 a9193f64 a9193f60 4040
0129 80101204 a6752f40
[386517.584301] 3f40: 7eb245c8 4040 0129 80101204 a9192000
806d0fb8  7eb246e0
[386517.592568] 3f60: 0001 fff7   0004
0040 0fc0 a9193e78
[386517.600837] 3f80: 0002 011c5460 7eb245e4 007c 4000
 0040 0028
[386517.609105] 3fa0: 7eb245c8 80101000 0040 0028 0028
7eb245c8 4040 
[386517.617374] 3fc0: 0040 0028 7eb245c8 0129 76f01600
7eb245c8 0024 7615db10
[386517.625642] 3fe0:  7eb24598 76f826e0 75acd798 80070010
0028  
[386517.633921] [<80152bec>] (__wake_up_common) from [<80152e90>]
(__wake_up_common_lock+0x64/0x88)
[386517.642715] [<80152e90>] (__wake_up_common_lock) from [<80152f74>]
(__wake_up_sync_key+0x24/0x2c)
[386517.651683] [<80152f74>] (__wake_up_sync_key) from [<8079cc44>]
(unix_write_space+0x58/0x84)
[386517.660220] [<8079cc44>] (unix_write_space) from [<806d4310>]
(sock_wfree+0x58/0x74)
[386517.668059] [<806d4310>] (sock_wfree) from [<807a0744>]
(unix_destruct_scm+0x6c/0x74)
[386517.675984] [<807a0744>] (unix_destruct_scm) from [<806da340>]
(skb_release_head_state+0x50/0xb0)
[386517.684947] [<806da340>] (skb_release_head_state) from [<806da3ac>]
(skb_release_all+0xc/0x24)
[386517.693649] [<806da3ac>] (skb_release_all) from [<806daa0c>]
(consume_skb+0x28/0x48)
[386517.701485] [<806daa0c>] (consume_skb) from [<8079d730>]
(unix_stream_read_generic+0x494/0x76c)
[386517.710275] [<8079d730>] (unix_stream_read_generic) from
[<8079dac8>] (unix_stream_recvmsg+0x3c/0x44)
[386517.719586] [<8079dac8>] (unix_stream_recvmsg) from [<806cfe7c>]
(___sys_recvmsg+0x90/0x158)
[386517.728115] [<806cfe7c>] (___sys_recvmsg) from [<806d0fb8>]
(__sys_recvmsg+0x3c/0x6c)
[386517.736038] [<806d0fb8>] (__sys_recvmsg) from [<80101000>]
(ret_fast_syscall+0x0/0x4c)





Re: [PATCH 2/4] pid: add pidctl()

2019-03-25 Thread Mika Penttilä
Hi!


> +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, 
> target,
> + unsigned int, flags)
> +{
> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> + struct pid *struct_pid;
> + pid_t result;
> +
> + switch (cmd) {
> + case PIDCMD_QUERY_PIDNS:
> + if (pid != 0)
> + return -EINVAL;
> + pid = 1;
> + /* fall through */
> + case PIDCMD_QUERY_PID:
> + if (flags != 0)
> + return -EINVAL;
> + break;
> + case PIDCMD_GET_PIDFD:
> + if (flags & ~PIDCTL_CLOEXEC)
> + return -EINVAL;
> + break;
> + default:
> + return -EOPNOTSUPP;
> + }
> +
> + source_ns = get_pid_ns_by_fd(source);
> + result = PTR_ERR(source_ns);
> + if (IS_ERR(source_ns))
> + goto err_source;
> +
> + target_ns = get_pid_ns_by_fd(target);
> + result = PTR_ERR(target_ns);
> + if (IS_ERR(target_ns))
> + goto err_target;
> +
> + if (cmd == PIDCMD_QUERY_PIDNS) {
> + result = pidns_related(source_ns, target_ns);
> + } else {
> + rcu_read_lock();
> + struct_pid = find_pid_ns(pid, source_ns);
> + result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;

Should you do get_pid(struct_pid) here to keep it alive until
pidfd_create_fd()?

> + rcu_read_unlock();
> +
> + if (cmd == PIDCMD_GET_PIDFD) {
> + int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
> + if (result > 0)
> + result = pidfd_create_fd(struct_pid, cloexec);
> + else if (result == 0)
> + result = -ENOENT;
> + }
> + }
> +
> + if (target)
> + put_pid_ns(target_ns);
> +err_target:
> + if (source)
> + put_pid_ns(source_ns);
> +err_source:
> + return result;
> +}
> +
>  void __init pid_idr_init(void)
>  {
>   /* Verify no one has done anything silly: */
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index aa6e72fb7c08..1c863fb3d55a 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct 
> ns_common *ns)
>   return &get_pid_ns(pid_ns)->ns;
>  }
>  
> +/**
> + * pidnscmp - Determine if @ancestor is ancestor of @descendant
> + * @ancestor:   pidns suspected to be the ancestor of @descendant
> + * @descendant: pidns suspected to be the descendant of @ancestor
> + *
> + * Returns -1 if @ancestor is not an ancestor of @descendant,
> + * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
> + * is an ancestor of @descendant.
> + */
> +int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace 
> *descendant)
> +{
> + if (ancestor == descendant)
> + return 0;
> +
> + for (;;) {
> + if (!descendant)
> + return -1;
> + if (descendant == ancestor)
> + break;
> + descendant = descendant->parent;
> + }
> +
> + return 1;
> +}
> +
>  static struct user_namespace *pidns_owner(struct ns_common *ns)
>  {
>   return to_pid_ns(ns)->user_ns;


Re: Regression: i.MX6: pinctrl: fsl: add scu based pinctrl support

2019-01-14 Thread Mika Penttilä
Hi!

On 14.1.2019 18.58, Fabio Estevam wrote:
> Hi Mika,
>
> On Mon, Jan 14, 2019 at 8:21 AM Mika Penttilä
>  wrote:
>> Hello,
>>
>>
>> The patch titled "pinctrl: fsl: add scu based pinctrl support" causes a
>> regression on i.MX6. Tested on a custom board based on
>> i.MX6Q with an sgtl5000 codec: there is no sound, tested with simple wav
>> playback.
>>
>> Reverting the patch makes audio work again.
> Which kernel version do you use?
>
> Most likely this has already been fixed by the following commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/pinctrl/freescale/pinctrl-imx.c?h=v5.0-rc2&id=571610678bf344006ab4c47c6fd0a842e9ac6a1b
>
> Please let us know if you still have issues with such commit applied.

I am merging mainline, and this regression is present in 5.0-rc2, so it should
have the mentioned patch applied.

Obviously I had to revert both ("pinctrl: imx: fix NO_PAD_CTL setting for
MMIO pads" and "pinctrl: fsl: add scu based pinctrl support"), and after that
it works. Is there something more I could test?


--Mika




Regression: i.MX6: pinctrl: fsl: add scu based pinctrl support

2019-01-14 Thread Mika Penttilä
Hello,


The patch titled "pinctrl: fsl: add scu based pinctrl support" causes a
regression on i.MX6. Tested on a custom board based on
i.MX6Q with an sgtl5000 codec: there is no sound, tested with simple wav
playback.

Reverting the patch makes audio work again.

--Mika





Re: [PATCH] PM / suspend: Count suspend-to-idle loop as sleep time

2018-09-14 Thread Mika Penttilä
On 09/14/2018 11:46 AM, Rafael J. Wysocki wrote:
> On Friday, September 14, 2018 10:28:44 AM CEST Mika Penttilä wrote:
>> Hi!
>>
>>
>> On 09/14/2018 09:59 AM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki 
>>>
>>> There is a difference in behavior between suspend-to-idle and
>>> suspend-to-RAM in the timekeeping handling that leads to functional
>>> issues.  Namely, every iteration of the loop in s2idle_loop()
>>> increases the monotonic clock somewhat, even if timekeeping_suspend()
>>> and timekeeping_resume() are invoked from s2idle_enter(), and if
>>> many of them are carried out in a row, the monotonic clock can grow
>>> significantly while the system is regarded as suspended, which
>>> doesn't happen during suspend-to-RAM and so it is unexpected and
>>> leads to confusion and misbehavior in user space (similar to what
>>> ensued when we tried to combine the boottime and monotonic clocks).
>>>
>>> To avoid that, count all iterations of the loop in s2idle_loop()
>>> as "sleep time" and adjust the clock for that on exit from
>>> suspend-to-idle.
>>>
>>> [That also covers systems on which timekeeping is not suspended
>>>  by s2idle_enter().]
>>>
>>> Signed-off-by: Rafael J. Wysocki 
>>> ---
>>>
>>> This is a replacement for https://patchwork.kernel.org/patch/10599209/
>>>
>>> I decided to count the entire loop in s2idle_loop() as "sleep time" as the
>>> patch is then simpler and it also covers systems where timekeeping is not
>>> suspended in the final step of suspend-to-idle.
>>>
>>> I dropped the "Fixes:" tag, because the monotonic clock delta problem
>>> has been present on the latter since the very introduction of "freeze"
>>> (as suspend-to-idle was referred to previously) and so this doesn't fix
>>> any particular later commits.
>>>
>>> ---
>>>  kernel/power/suspend.c |   18 ++
>>>  1 file changed, 18 insertions(+)
>>>
>>> Index: linux-pm/kernel/power/suspend.c
>>> ===
>>> --- linux-pm.orig/kernel/power/suspend.c
>>> +++ linux-pm/kernel/power/suspend.c
>>> @@ -109,8 +109,12 @@ static void s2idle_enter(void)
>>>  
>>>  static void s2idle_loop(void)
>>>  {
>>> +   ktime_t start, delta;
>>> +
>>> pm_pr_dbg("suspend-to-idle\n");
>>>  
>>> +   start = ktime_get();
>>> +
>>> for (;;) {
>>> int error;
>>>  
>>> @@ -150,6 +154,20 @@ static void s2idle_loop(void)
>>> pm_wakeup_clear(false);
>>> }
>>>  
>>> +   /*
>>> +* If the monotonic clock difference between the start of the loop and
>>> +* this point is too large, user space may get confused about whether or
>>> +* not the system has been suspended and tasks may get killed by
>>> +* watchdogs etc., so count the loop as "sleep time" to compensate for
>>> +* that.
>>> +*/
>>> +   delta = ktime_sub(ktime_get(), start);
>>> +   if (ktime_to_ns(delta) > 0) {
>>> +   struct timespec64 timespec64_delta = ktime_to_timespec64(delta);
>>> +
>>> +   timekeeping_inject_sleeptime64(&timespec64_delta);
>>> +   }
>>
>> But doesn't injecting sleep time here make the monotonic clock too large by
>> the amount of sleeptime?
>> tick_freeze() / tick_unfreeze() already injects the sleeptime (otherwise
>> the delta would be 0).
> 
> No, it doesn't.
> 
> The delta here is the extra time taken by the loop which hasn't been counted
> as sleep time yet.

I incorrectly said the monotonic clock; timekeeping_inject_sleeptime64()
forwards the wall time by the amount of the delta.
Why wouldn't some other cpu update xtime while one cpu is in the loop? And if
all cpus enter s2idle, tick_unfreeze() injects the sleeptime. My point is that
this extra injection makes the wall time wrong, no?

> 
> Thanks,
> Rafael
> 



Re: [PATCH] PM / suspend: Count suspend-to-idle loop as sleep time

2018-09-14 Thread Mika Penttilä
Hi!


On 09/14/2018 09:59 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> There is a difference in behavior between suspend-to-idle and
> suspend-to-RAM in the timekeeping handling that leads to functional
> issues.  Namely, every iteration of the loop in s2idle_loop()
> increases the monotonic clock somewhat, even if timekeeping_suspend()
> and timekeeping_resume() are invoked from s2idle_enter(), and if
> many of them are carried out in a row, the monotonic clock can grow
> significantly while the system is regarded as suspended, which
> doesn't happen during suspend-to-RAM and so it is unexpected and
> leads to confusion and misbehavior in user space (similar to what
> ensued when we tried to combine the boottime and monotonic clocks).
> 
> To avoid that, count all iterations of the loop in s2idle_loop()
> as "sleep time" and adjust the clock for that on exit from
> suspend-to-idle.
> 
> [That also covers systems on which timekeeping is not suspended
>  by s2idle_enter().]
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> This is a replacement for https://patchwork.kernel.org/patch/10599209/
> 
> I decided to count the entire loop in s2idle_loop() as "sleep time" as the
> patch is then simpler and it also covers systems where timekeeping is not
> suspended in the final step of suspend-to-idle.
> 
> I dropped the "Fixes:" tag, because the monotonic clock delta problem
> has been present on the latter since the very introduction of "freeze"
> (as suspend-to-idle was referred to previously) and so this doesn't fix
> any particular later commits.
> 
> ---
>  kernel/power/suspend.c |   18 ++
>  1 file changed, 18 insertions(+)
> 
> Index: linux-pm/kernel/power/suspend.c
> ===
> --- linux-pm.orig/kernel/power/suspend.c
> +++ linux-pm/kernel/power/suspend.c
> @@ -109,8 +109,12 @@ static void s2idle_enter(void)
>  
>  static void s2idle_loop(void)
>  {
> + ktime_t start, delta;
> +
>   pm_pr_dbg("suspend-to-idle\n");
>  
> + start = ktime_get();
> +
>   for (;;) {
>   int error;
>  
> @@ -150,6 +154,20 @@ static void s2idle_loop(void)
>   pm_wakeup_clear(false);
>   }
>  
> + /*
> +  * If the monotonic clock difference between the start of the loop and
> +  * this point is too large, user space may get confused about whether or
> +  * not the system has been suspended and tasks may get killed by
> +  * watchdogs etc., so count the loop as "sleep time" to compensate for
> +  * that.
> +  */
> + delta = ktime_sub(ktime_get(), start);
> + if (ktime_to_ns(delta) > 0) {
> + struct timespec64 timespec64_delta = ktime_to_timespec64(delta);
> +
> + timekeeping_inject_sleeptime64(&timespec64_delta);
> + }

But doesn't injecting sleep time here make the monotonic clock too large by
the amount of sleeptime?
tick_freeze() / tick_unfreeze() already injects the sleeptime (otherwise the
delta would be 0).

> +
>   pm_pr_dbg("resume from suspend-to-idle\n");
>  }
>  
> 



Re: [RFC v6 PATCH 2/2] mm: mmap: zap pages with read mmap_sem in munmap

2018-07-26 Thread Mika Penttilä



On 26.07.2018 21:10, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
>> 300GB), the below hung task issue may happen occasionally.
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> It is because munmap holds mmap_sem exclusively from very beginning to
> all the way down to the end, and doesn't release it in the middle. When
> unmapping a large mapping, it may take a long time (it takes ~18 seconds to
> unmap a 320GB mapping with every single page mapped on an idle machine).
>
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hocko [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas.
>
> But, some part may need write mmap_sem, for example, vma splitting. So,
> the design is as follows:
> acquire write mmap_sem
> lookup vmas (find and split vmas)
>   detach vmas
> deal with special mappings
> downgrade_write
>
> zap pages
>   free page tables
> release mmap_sem
>
> The vm events with read mmap_sem may come in during page zapping, but
> since vmas have been detached before, they, i.e. page fault, gup, etc,
> will not be able to find valid vma, then just return SIGSEGV or -EFAULT
> as expected.
>
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, they are
> considered as special mappings. They will be dealt with before zapping
> pages with write mmap_sem held. Basically, just update vm_flags.
>
> And, since they are also manipulated by unmap_single_vma() which is
> called by unmap_vma() with read mmap_sem held in this case, to
> prevent from updating vm_flags in read critical section, a new
> parameter, called "skip_flags" is added to unmap_region(), unmap_vmas()
> and unmap_single_vma(). If it is true, then just skip unmap those
> special mappings. Currently, the only place which pass true to this
> parameter is us.
>
> With this approach we don't have to re-acquire mmap_sem again to clean
> up vmas to avoid race window which might get the address space changed.
>
> And, since the lock acquire/release cost is managed to the minimum and
> almost as same as before, the optimization could be extended to any size
> of mapping without incurring significant penalty to small mappings.
>
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
> intact for stability reason.
>
> With the patches, exclusive mmap_sem hold time when munmap a 80GB
> address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to
> us level from second.
>
> munmap_test-15002 [008]   594.380138: funcgraph_entry: |  
> vm_munmap_zap_rlock() {
> munmap_test-15002 [008]   594.380146: funcgraph_entry:  !2485684 us |
> unmap_region();
> munmap_test-15002 [008]   596.865836: funcgraph_exit:   !2485692 us |  }
>
> Here the execution time of unmap_region() is used to evaluate the time spent
> holding the read mmap_sem, and the remaining time is spent holding the
> exclusive lock.
>
> [1] https://lwn.net/Articles/753269/
>
> Suggested-by: Michal Hocko 
> Suggested-by: Kirill A. Shutemov 
> Cc: Matthew Wilcox 
> Cc: Laurent Dufour 
> Cc: Andrew Morton 
> Signed-off-by: Yang Shi 
> ---
>  include/linux/mm.h |  2 +-
>  mm/memory.c| 41 --
>  mm/mmap.c  | 99 
> +-
>  3 files changed, 123 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9f..e4480d8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1321,7 +1321,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
>   unsigned long size);
>  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> - unsigned long start, unsigned long end);
> + unsigned long start, unsigned long end, bool skip_vm_flags);
>  
>  /**
>   * mm_walk - callbacks for 

Re: [PATCHv3 14/17] x86/mm: Introduce direct_mapping_size

2018-06-12 Thread Mika Penttilä



On 12.06.2018 17:39, Kirill A. Shutemov wrote:
> Kernel need to have a way to access encrypted memory. We are going to
> use per-KeyID direct mapping to facilitate the access with minimal
> overhead.
>
> Direct mapping for each KeyID will be put next to each other in the
> virtual address space. We need to have a way to find boundaries of
> direct mapping for particular KeyID.
>
> The new variable direct_mapping_size specifies the size of direct
> mapping. With the value, it's trivial to find direct mapping for
> KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
>
> The size of the direct mapping is calculated during KASLR setup. If KASLR is
> disabled, it happens during MKTME initialization.
>
> Signed-off-by: Kirill A. Shutemov 
> ---
>  arch/x86/include/asm/mktme.h   |  2 ++
>  arch/x86/include/asm/page_64.h |  1 +
>  arch/x86/kernel/head64.c   |  2 ++
>  arch/x86/mm/kaslr.c| 21 ---
>  arch/x86/mm/mktme.c| 48 ++
>  5 files changed, 71 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> index 9363b989a021..3bf481fe3f56 100644
> --- a/arch/x86/include/asm/mktme.h
> +++ b/arch/x86/include/asm/mktme.h
> @@ -40,6 +40,8 @@ int page_keyid(const struct page *page);
>  
>  void mktme_disable(void);
>  
> +void setup_direct_mapping_size(void);
> +
>  #else
>  #define mktme_keyid_mask ((phys_addr_t)0)
>  #define mktme_nr_keyids  0
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 939b1cff4a7b..53c32af895ab 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -14,6 +14,7 @@ extern unsigned long phys_base;
>  extern unsigned long page_offset_base;
>  extern unsigned long vmalloc_base;
>  extern unsigned long vmemmap_base;
> +extern unsigned long direct_mapping_size;
>  
>  static inline unsigned long __phys_addr_nodebug(unsigned long x)
>  {
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index a21d6ace648e..b6175376b2e1 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -59,6 +59,8 @@ EXPORT_SYMBOL(vmalloc_base);
>  unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
>  EXPORT_SYMBOL(vmemmap_base);
>  #endif
> +unsigned long direct_mapping_size __ro_after_init = -1UL;
> +EXPORT_SYMBOL(direct_mapping_size);
>  
>  #define __head   __section(.head.text)
>  
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> index 4408cd9a3bef..3d8ef8cb97e1 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
>   return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
>  }
>  
> +#ifndef CONFIG_X86_INTEL_MKTME
> +static void __init setup_direct_mapping_size(void)
> +{
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
> + direct_mapping_size += (1UL << TB_SHIFT) * 
> CONFIG_MEMORY_PHYSICAL_PADDING;
> +}
> +#endif
> +
>  /* Initialize base and padding for each memory region randomized with KASLR 
> */
>  void __init kernel_randomize_memory(void)
>  {
> @@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
>   if (!kaslr_memory_enabled())
>   return;
>  
> - kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
> + /*
> +  * Upper limit for direct mapping size is 1/4 of whole virtual
> +  * address space

or 1/2?

> +  */
> + kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);



>   kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
>  
>   /*
> @@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
>* add padding if needed (especially for memory hotplug support).
>*/
>   BUG_ON(kaslr_regions[0].base != &page_offset_base);
> - memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> - CONFIG_MEMORY_PHYSICAL_PADDING;
> +
> + setup_direct_mapping_size();
> +
> + memory_tb = direct_mapping_size * mktme_nr_keyids + 1;

Missing parentheses?

memory_tb = direct_mapping_size * (mktme_nr_keyids + 1);


>  
>   /* Adapt phyiscal memory region size based on available memory */
>   if (memory_tb < kaslr_regions[0].size_tb)
> diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> index 43a44f0f2a2d..3e5322bf035e 100644
> --- a/arch/x86/mm/mktme.c
> +++ b/arch/x86/mm/mktme.c
> @@ -89,3 +89,51 @@ static bool need_page_mktme(void)
>  struct page_ext_operations page_mktme_ops = {
>   .need = need_page_mktme,
>  };
> +
> +void __init setup_direct_mapping_size(void)
> +{
> + unsigned long available_va;
> +
> + /* 1/4 of virtual address space is dedicated to direct mapping */
> + available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
> +
> + /* How much memory does the system have? */
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_si

Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-04 Thread Mika Penttilä
On 06/05/2018 08:36 AM, H. Peter Anvin wrote:
> On 06/04/18 20:57, Mika Penttilä wrote:
>>
>> This won't work on X86-32 because it actually uses the segment limit with 
>> fs: access. So there 
>> is a reason why the lsl based method is X86-64 only.
>>
> 
> 
> 
> Why does that matter in any shape, way, or form?  The LSL instruction
> doesn't touch any of the segment registers, it just uses a segment
> selector number.
> 
> 
> 
> I see... we have a VERY unfortunate name collision: the x86-64
> GDT_ENTRY_PER_CPU and the i386 GDT_ENTRY_PERCPU are in fact two
> completely different things, with the latter being the actual percpu
> offset used by the kernel.
> 
> So yes, this patch is wrong, because the naming of the x86-64 segment is
> insane especially in the proximity of the  -- it should be something
> like GDT_ENTRY_CPU_NUMBER.
> 
> Unfortunately we probably can't use the same GDT entry on x86-32 and
> x86-64, because entry 15 (selector 0x7b) is USER_DS, which is something
> we really don't want to screw with.  This means i386 programs that
> execute LSL directly for whatever reason will have to understand the
> difference, but most of the other segment numbers differ as well,
> including user space %cs (USER_CS/USER32_CS) and %ds/%es/%ss (USER_DS).
> Perhaps we could bump down segments 23-28 by one and put it as 23, that
> way the CPU_NUMBER segment would always be at %ss+80 for the default
> (flat, initial) user space %ss.  (We want to specify using %ss rather
> than %ds, because it is less likely to be changed and because 64 bits,
> too, have %ss defined, but not %ds.)
> 
> 
> 
> Rename the x86-64 segment to CPU_NUMBER, fixing the naming conflict.
> Add 1 to GDT entry numbers 23-28 for i386 (all of these are
> kernel-internal segments and so have no impact on user space).
> Add i386 CPU_NUMBER equivalent to x86-64 at GDT entry 23.
> Document the above relationship between segments.
> 
> OK, everyone?
> 
>   -hpa
> 

Yes, GDT_ENTRY_PER_CPU and GDT_ENTRY_PERCPU meaning two totally different
things is really confusing; the proposal seems OK to me!

--Mika


Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-04 Thread Mika Penttilä
On 06/05/2018 07:44 AM, Bae, Chang Seok wrote:
>>> diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
>>> index ea554f8..e716e94 100644
>>> --- a/arch/x86/kernel/setup_percpu.c
>>> +++ b/arch/x86/kernel/setup_percpu.c
>>> @@ -155,12 +155,21 @@ static void __init pcpup_populate_pte(unsigned long addr)
>>>  
>>>  static inline void setup_percpu_segment(int cpu)
>>>  {
>>> -#ifdef CONFIG_X86_32
>>> -   struct desc_struct d = GDT_ENTRY_INIT(0x8092, per_cpu_offset(cpu),
>>> - 0xF);
>>> +#ifdef CONFIG_NUMA
>>> +   unsigned long node = early_cpu_to_node(cpu);
>>> +#else
>>> +   unsigned long node = 0;
>>> +#endif
>>> +   struct desc_struct d = GDT_ENTRY_INIT(0x0, per_cpu_offset(cpu),
>>> +  make_lsl_tscp(cpu, node));
>>> +
>>> +   d.type = 5; /* R0 data, expand down, accessed */
>>> +   d.dpl = 3;  /* Visible to user code */
>>> +   d.s = 1;/* Not a system segment */
>>> +   d.p = 1;/* Present */
>>> +   d.d = 1;/* 32-bit */
>>>  
>>> write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, &d, DESCTYPE_S);
>>> -#endif
>>>  }
> 
>> This won't work on X86-32 because it actually uses the segment limit with 
>> fs: access. So there 
>> is a reason why the lsl based method is X86-64 only.
> 
> The limit will be consumed in X86-64 only, while the unification with i386 
> was suggested for a
> different reason.
> 
> Thanks,
> Chang
> 

The unification affects i386, and the limit is consumed by the processor on
fs: accesses.
The limit was 0xF before; now it depends on the cpu/node, so accesses on
low-numbered CPUs are likely to fault.

--Mika


Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-04 Thread Mika Penttilä
On 06/04/2018 10:24 PM, Chang S. Bae wrote:
> The CPU (and node) number will be written, as early enough,
> to the segment limit of per CPU data and TSC_AUX MSR entry.
> The information has been retrieved by vgetcpu in user space
> and will be also loaded from the paranoid entry, when
> FSGSBASE enabled. So, it is moved out from vDSO to the CPU
> initialization path where IST setup is serialized.
> 
> Now, redundant setting of the segment in entry/vdso/vma.c
> was removed; a substantial code removal. It removes a
> hotplug notifier, makes a facility useful to both the kernel
> and userspace unconditionally available much sooner, and
> unification with i386. (Thanks to HPA for suggesting the
> cleanup)
> 
> Signed-off-by: Chang S. Bae 
> Cc: H. Peter Anvin 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Andi Kleen 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> ---

> diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
> index ea554f8..e716e94 100644
> --- a/arch/x86/kernel/setup_percpu.c
> +++ b/arch/x86/kernel/setup_percpu.c
> @@ -155,12 +155,21 @@ static void __init pcpup_populate_pte(unsigned long addr)
>  
>  static inline void setup_percpu_segment(int cpu)
>  {
> -#ifdef CONFIG_X86_32
> - struct desc_struct d = GDT_ENTRY_INIT(0x8092, per_cpu_offset(cpu),
> -   0xF);
> +#ifdef CONFIG_NUMA
> + unsigned long node = early_cpu_to_node(cpu);
> +#else
> + unsigned long node = 0;
> +#endif
> + struct desc_struct d = GDT_ENTRY_INIT(0x0, per_cpu_offset(cpu),
> +make_lsl_tscp(cpu, node));
> +
> + d.type = 5; /* R0 data, expand down, accessed */
> + d.dpl = 3;  /* Visible to user code */
> + d.s = 1;/* Not a system segment */
> + d.p = 1;/* Present */
> + d.d = 1;/* 32-bit */
>  
>   write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, &d, DESCTYPE_S);
> -#endif
>  }


This won't work on X86-32 because it actually uses the segment limit with fs: 
access. So there 
is a reason why the lsl based method is X86-64 only.


-Mika

>  
>  void __init setup_per_cpu_areas(void)
> 



Re: [PATCH v2 4/9] x86, memcpy_mcsafe: add write-protection-fault handling

2018-05-02 Thread Mika Penttilä
On 05/03/2018 07:59 AM, Dan Williams wrote:
> In preparation for using memcpy_mcsafe() to handle user copies it needs
> to be to handle write-protection faults while writing user pages. Add
> MMU-fault handlers alongside the machine-check exception handlers.
> 
> Note that the machine check fault exception handling makes assumptions
> about source buffer alignment and poison alignment. In the write fault
> case, given the destination buffer is arbitrarily aligned, it needs a
> separate / additional fault handling approach. The mcsafe_handle_tail()
> helper is reused. The @limit argument is set to @len since there is no
> safety concern about retriggering an MMU fault, and this simplifies the
> assembly.
> 

> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 75d3776123cc..9787f5ee0cf9 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -75,6 +75,23 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>   return len;
>  }
>  
> +/*
> + * Similar to copy_user_handle_tail, probe for the write fault point,
> + * but reuse __memcpy_mcsafe in case a new read error is encountered.
> + * clac() is handled in _copy_to_iter_mcsafe().
> + */
> +__visible unsigned long
> +mcsafe_handle_tail(char *to, char *from, unsigned len)
> +{
> + for (; len; --len, to++) {
> + unsigned long rem = memcpy_mcsafe(to, from, 1);
> +


Hmm, why not advance the source pointer as well:

for (; len; --len, from++, to++)



> + if (rem)
> + break;
> + }
> + return len;
> +}


--Mika



Re: [REGRESSION][BISECTED] i.MX6 pinctrl hogs stopped working

2018-04-10 Thread Mika Penttilä


On 10.04.2018 13:21, Richard Fitzgerald wrote:
> On 04/04/18 06:33, Mika Penttilä wrote:
>> Hi!
>>
>> Reverting this made the hogs on a i.MX6 board work again. :
>>
>>
>> commit b89405b6102fcc3746f43697b826028caa94c823
>> Author: Richard Fitzgerald 
>> Date:   Wed Feb 28 15:53:06 2018 +
>>
>>  pinctrl: devicetree: Fix dt_to_map_one_config handling of hogs
>>
>>
>>
>> --Mika
>>
>
> I think you should check whether the bug is with the i.MX6 driver
> relying on the previous buggy behaviour of pinctrl. I haven't got
> i.MX6 hardware to test myself.
>
> The bug I fixed in that patch was that when pinctrl is probing a
> pinctrl driver it would try to apply all the pinctrl settings
> listed in a dt node to the pinctrl driver it is probing instead
> of the pinctrl drivers they actually refer to. This was a bug
> introduced by an earlier patch (which unfortunately I forgot to
> include a fixes line reference to)
>
>   pinctrl: core: Use delayed work for hogs
>
> So if a pinctrl driver "A" had a dependency on another pinctrl
> driver "B" those dependencies wouldn't be properly created because
> all the "B" pinctrl DT entries would be attempted against "A"
> instead of "B". This caused failures if a pinctrl driver had a
> dependency on another pinctrl driver, of if creating a pinctrl
> driver that is a child of an MFD and that MFD has dependencies
> on another pinctrl driver.
>

Hard to say, but the kernel/DTS has worked OK for 3+ years, from 3.17 until
4.17-rc. Nothing fancy, just normal hogs, in two groups.
I can send you the relevant pieces of DT if interested.

--Mika



Re: [PATCH] ASoC: fsl_ssi: Fix mode setting when changing channel number

2018-04-08 Thread Mika Penttilä
On 04/08/2018 07:40 AM, Nicolin Chen wrote:
> This is a partial revert (in a cleaner way) of commit ebf08ae3bc90
> ("ASoC: fsl_ssi: Keep ssi->i2s_net updated") to fix a regression
> at test cases when switching between mono and stereo audio.
> 
> The problem is that ssi->i2s_net is initialized in set_dai_fmt()
> only, while this set_dai_fmt() is only called during the dai-link
> probe(). The original patch assumed set_dai_fmt() would be called
> during every playback instance, so it failed at the overriding use
> cases.
> 
> This patch adds the local variable i2s_net back to let regular use
> cases still follow the mode settings from the set_dai_fmt().
> 
> Meanwhile, the original commit of keeping ssi->i2s_net updated was
> to make set_tdm_slot() clean by checking the ssi->i2s_net directly
> instead of reading SCR register. However, the change itself is not
> necessary (or even harmful) because the set_tdm_slot() might fail
> to check the slot number for Normal-Mode-None-Net settings while
> mono audio cases still need 2 slots. So this patch can also fix it.
> And it adds an extra line of comments to declare ssi->i2s_net does
> not reflect the register value but merely the initial setting from
> the set_dai_fmt().
> 
> Reported-by: Mika Penttilä 
> Signed-off-by: Nicolin Chen 
> Cc: Mika Penttilä 
> ---
>  sound/soc/fsl/fsl_ssi.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
> index 0823b08..89df2d9 100644
> --- a/sound/soc/fsl/fsl_ssi.c
> +++ b/sound/soc/fsl/fsl_ssi.c
> @@ -217,6 +217,7 @@ struct fsl_ssi_soc_data {
>   * @dai_fmt: DAI configuration this device is currently used with
>   * @streams: Mask of current active streams: BIT(TX) and BIT(RX)
>   * @i2s_net: I2S and Network mode configurations of SCR register
> + *   (this is the initial settings based on the DAI format)
>   * @synchronous: Use synchronous mode - both of TX and RX use STCK and SFCK
>   * @use_dma: DMA is used or FIQ with stream filter
>   * @use_dual_fifo: DMA with support for dual FIFO mode
> @@ -829,16 +830,23 @@ static int fsl_ssi_hw_params(struct snd_pcm_substream *substream,
>   }
>  
>   if (!fsl_ssi_is_ac97(ssi)) {
> + /*
> +  * Keep the ssi->i2s_net intact while having a local variable
> +  * to override settings for special use cases. Otherwise, the
> +  * ssi->i2s_net will lose the settings for regular use cases.
> +  */
> + u8 i2s_net = ssi->i2s_net;
> +
>   /* Normal + Network mode to send 16-bit data in 32-bit frames */
>   if (fsl_ssi_is_i2s_cbm_cfs(ssi) && sample_size == 16)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
>  
>   /* Use Normal mode to send mono data at 1st slot of 2 slots */
>   if (channels == 1)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL;
>  
>   regmap_update_bits(regs, REG_SSI_SCR,
> -    SSI_SCR_I2S_NET_MASK, ssi->i2s_net);
> +SSI_SCR_I2S_NET_MASK, i2s_net);
>   }
>  
>   /* In synchronous mode, the SSI uses STCCR for capture */
> 

This patch fixes my problems, so: 

Tested-by: Mika Penttilä 


--Mika


Fwd: regression, imx6 and sgtl5000 sound problems

2018-04-06 Thread Mika Penttilä



> On Fri, Apr 06, 2018 at 07:46:37AM +0300, Mika Penttilä wrote:
>>
>> With a recent merge into the pre-4.17-rc tree, audio stopped working (or it's
>> audible but way too slow).
>> imx6q + sgtl5000 codec.
> 
> Could you please be more specific at your test cases?
> 
> Which board? Whose is the DAI master? Which sample rate?
> 
>> Maybe some of the soc/fsl changes are causing this.
> 
> There are quite a few clean-up patches of SSI driver being merged.
> Would you please try to revert/bisect the changes of fsl_ssi driver
> so as to figure out which one breaks your test cases?
> 
> If there is a regression because of one of the changes, I will need
> to fix it.
> 
> Thanks
> Nicolin
> 


Hi,


Bisected, and with this reverted I get it working:

ebf08ae3bc906fc5dd33d02977efa5d4b9831517 is the first bad commit
commit ebf08ae3bc906fc5dd33d02977efa5d4b9831517
Author: Nicolin Chen 
Date:   Mon Feb 12 14:03:10 2018 -0800

ASoC: fsl_ssi: Keep ssi->i2s_net updated



--Mika


Re: regression, imx6 and sgtl5000 sound problems

2018-04-06 Thread Mika Penttilä
On 04/06/2018 10:23 AM, Nicolin Chen wrote:
> On Fri, Apr 06, 2018 at 07:46:37AM +0300, Mika Penttilä wrote:
>>
>> With a recent merge into the pre-4.17-rc tree, audio stopped working (or it's
>> audible but way too slow).
>> imx6q + sgtl5000 codec.
> 
> Could you please be more specific at your test cases?
> 
> Which board? Whose is the DAI master? Which sample rate?
> 
>> Maybe some of the soc/fsl changes are causing this.
> 
> There are quite a few clean-up patches of SSI driver being merged.
> Would you please try to revert/bisect the changes of fsl_ssi driver
> so as to figure out which one breaks your test cases?
> 
> If there is a regression because of one of the changes, I will need
> to fix it.
> 
> Thanks
> Nicolin
> 

Hi,

We have a custom board (very close to the Karo TX evkit). The test is simply
aplay file.wav; at least a 48 kHz sample rate was tested and is not working.
Nothing special there hardware-wise.

I try to bisect it and report back to you.

--Mika




regression, imx6 and sgtl5000 sound problems

2018-04-05 Thread Mika Penttilä
Hi,

With a recent merge into the pre-4.17-rc tree, audio stopped working (or it's
audible but way too slow).
imx6q + sgtl5000 codec.

Maybe some of the soc/fsl changes are causing this.

--Mika


[REGRESSION][BISECTED] i.MX6 pinctrl hogs stopped working

2018-04-03 Thread Mika Penttilä
Hi!

Reverting this made the hogs on an i.MX6 board work again:


commit b89405b6102fcc3746f43697b826028caa94c823
Author: Richard Fitzgerald 
Date:   Wed Feb 28 15:53:06 2018 +

pinctrl: devicetree: Fix dt_to_map_one_config handling of hogs



--Mika



Re: [regression, bisected] 4.11+ imx uarts broken

2017-05-09 Thread Mika Penttilä
On 05/09/2017 10:14 AM, Uwe Kleine-König wrote:
> Hello Mika,
> 
> On Tue, May 09, 2017 at 07:18:09AM +0300, Mika Penttilä wrote:
>> The following commit e61c38d85b7 makes the uarts on i.MX6 nonfunctional (no 
>> data transmitted or received). 
>> With e61c38d85b7 reverted the uarts work ok.
>>
>> ---
>> commit e61c38d85b7392e033ee03bca46f1d6006156175
>> Author: Uwe Kleine-König 
>> Date:   Tue Apr 4 11:18:51 2017 +0200
>>
>> serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
>>  
>> 
> 
> are you operating the UART in DTE or DCE mode? Does this affect all
> UARTs or only those that are not used in the bootloader?

I am operating in DCE mode. The debug/console UART works OK, but two others
don't.

> 
> Looking at the patch I wonder if setting IMX21_UCR3_RXDMUXSEL |
> UCR3_ADNIMP is missing for you.
> 

Probably yes, but I can verify this later and get back to you.

> Can you please check which hunk of e61c38d85b73 is giving you problems?
> 
> Best regards
> Uwe
> 

--Mika



[regression, bisected] 4.11+ imx uarts broken

2017-05-08 Thread Mika Penttilä
Hi,

The following commit e61c38d85b7 makes the UARTs on i.MX6 nonfunctional (no
data transmitted or received).
With e61c38d85b7 reverted, the UARTs work OK.

---
commit e61c38d85b7392e033ee03bca46f1d6006156175
Author: Uwe Kleine-König 
Date:   Tue Apr 4 11:18:51 2017 +0200

serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
 



--Mika


nvdimm/pmem device lifetime

2017-04-27 Thread Mika Penttilä
Hi,

Just wondering about the pmem struct device vs gendisk lifetimes, from
pmem_attach_disk():

device_add_disk(dev, disk);
devm_add_action_or_reset(dev, pmem_release_disk, disk);


where:
static void pmem_release_disk(void *disk)
{
del_gendisk(disk);
put_disk(disk);
}


but device_add_disk() makes the disk pin dev (as its parent), and that reference
is only dropped by del_gendisk(), which is called when dev is released; but dev
is never released, precisely because of this circular dependency?

--Mika


Re: [PATCH 2/5] zram: partial IO refactoring

2017-04-02 Thread Mika Penttilä
On 04/03/2017 09:12 AM, Minchan Kim wrote:
> Hi Mika,
> 
> On Mon, Apr 03, 2017 at 08:52:33AM +0300, Mika Penttilä wrote:
>>
>> Hi!
>>
>> On 04/03/2017 08:17 AM, Minchan Kim wrote:
>>> For architectures with PAGE_SIZE > 4K, zram has supported partial IO.
>>> However, the mixed code for handling normal/partial IO is too messy and
>>> error-prone to modify IO handler functions with the upcoming feature,
>>> so this patch aims at cleaning up zram's IO handling functions.
>>>
>>> Signed-off-by: Minchan Kim 
>>> ---
>>>  drivers/block/zram/zram_drv.c | 333 
>>> +++---
>>>  1 file changed, 184 insertions(+), 149 deletions(-)
>>>
>>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>>> index 28c2836f8c96..7938f4b98b01 100644
>>> --- a/drivers/block/zram/zram_drv.c
>>> +++ b/drivers/block/zram/zram_drv.c
>>> @@ -45,6 +45,8 @@ static const char *default_compressor = "lzo";
>>>  /* Module params (documentation at end) */
>>>  static unsigned int num_devices = 1;
>>>  
>>> +static void zram_free_page(struct zram *zram, size_t index);
>>> +
>>>  static inline bool init_done(struct zram *zram)
>>>  {
>>> return zram->disksize;
>>> @@ -98,10 +100,17 @@ static void zram_set_obj_size(struct zram_meta *meta,
>>> meta->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
>>>  }
>>>  
>>> +#if PAGE_SIZE != 4096
>>>  static inline bool is_partial_io(struct bio_vec *bvec)
>>>  {
>>> return bvec->bv_len != PAGE_SIZE;
>>>  }
>>> +#else
>>
>> For a page size of 4096, bv_len can still be < 4096, and partial pages should
>> be supported (uncompress before write, etc.)?
> 
> zram declares this.
> 
> #define ZRAM_LOGICAL_BLOCK_SIZE (1<<12)
> 
>   blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE);
>   blk_queue_logical_block_size(zram->disk->queue,
>   ZRAM_LOGICAL_BLOCK_SIZE);
> 
> So, I thought there is no such partial IO in 4096 page architecture.
> Am I missing something? Could you tell the scenario if it happens?

I think you're right. At least swap operates with a minimum size of 4096.

> 
> Thanks!
> 



Re: [PATCH 2/5] zram: partial IO refactoring

2017-04-02 Thread Mika Penttilä

Hi!

On 04/03/2017 08:17 AM, Minchan Kim wrote:
> For architectures with PAGE_SIZE > 4K, zram has supported partial IO.
> However, the mixed code for handling normal/partial IO is too messy and
> error-prone to modify IO handler functions with the upcoming feature,
> so this patch aims at cleaning up zram's IO handling functions.
> 
> Signed-off-by: Minchan Kim 
> ---
>  drivers/block/zram/zram_drv.c | 333 
> +++---
>  1 file changed, 184 insertions(+), 149 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 28c2836f8c96..7938f4b98b01 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -45,6 +45,8 @@ static const char *default_compressor = "lzo";
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>  
> +static void zram_free_page(struct zram *zram, size_t index);
> +
>  static inline bool init_done(struct zram *zram)
>  {
>   return zram->disksize;
> @@ -98,10 +100,17 @@ static void zram_set_obj_size(struct zram_meta *meta,
>   meta->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
>  }
>  
> +#if PAGE_SIZE != 4096
>  static inline bool is_partial_io(struct bio_vec *bvec)
>  {
>   return bvec->bv_len != PAGE_SIZE;
>  }
> +#else

For a page size of 4096, bv_len can still be < 4096, and partial pages should
be supported (uncompress before write, etc.)?

> +static inline bool is_partial_io(struct bio_vec *bvec)
> +{
> + return false;
> +}
> +#endif
>  
>  static void zram_revalidate_disk(struct zram *zram)
>  {
> @@ -191,18 +200,6 @@ static bool page_same_filled(void *ptr, unsigned long *element)
>   return true;
>  }
>  
> -static void handle_same_page(struct bio_vec *bvec, unsigned long element)
> -{
> - struct page *page = bvec->bv_page;
> - void *user_mem;
> -
> - user_mem = kmap_atomic(page);
> - zram_fill_page(user_mem + bvec->bv_offset, bvec->bv_len, element);
> - kunmap_atomic(user_mem);
> -
> - flush_dcache_page(page);
> -}
> -
>  static ssize_t initstate_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> @@ -418,6 +415,53 @@ static DEVICE_ATTR_RO(io_stat);
>  static DEVICE_ATTR_RO(mm_stat);
>  static DEVICE_ATTR_RO(debug_stat);
>  
> +static bool zram_special_page_read(struct zram *zram, u32 index,
> + struct page *page,
> + unsigned int offset, unsigned int len)
> +{
> + struct zram_meta *meta = zram->meta;
> +
> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
> + if (unlikely(!meta->table[index].handle) ||
> + zram_test_flag(meta, index, ZRAM_SAME)) {
> + void *mem;
> +
> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
> + mem = kmap_atomic(page);
> + zram_fill_page(mem + offset, len, meta->table[index].element);
> + kunmap_atomic(mem);
> + return true;
> + }
> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
> +
> + return false;
> +}
> +
> +static bool zram_special_page_write(struct zram *zram, u32 index,
> + struct page *page)
> +{
> + unsigned long element;
> + void *mem = kmap_atomic(page);
> +
> + if (page_same_filled(mem, &element)) {
> + struct zram_meta *meta = zram->meta;
> +
> + kunmap_atomic(mem);
> + /* Free memory associated with this sector now. */
> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
> + zram_free_page(zram, index);
> + zram_set_flag(meta, index, ZRAM_SAME);
> + zram_set_element(meta, index, element);
> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
> +
> + atomic64_inc(&zram->stats.same_pages);
> + return true;
> + }
> + kunmap_atomic(mem);
> +
> + return false;
> +}
> +
>  static void zram_meta_free(struct zram_meta *meta, u64 disksize)
>  {
>   size_t num_pages = disksize >> PAGE_SHIFT;
> @@ -504,169 +548,104 @@ static void zram_free_page(struct zram *zram, size_t 
> index)
>   zram_set_obj_size(meta, index, 0);
>  }
>  
> -static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
> +static int zram_decompress_page(struct zram *zram, struct page *page, u32 index)
>  {
> - int ret = 0;
> - unsigned char *cmem;
> - struct zram_meta *meta = zram->meta;
> + int ret;
>   unsigned long handle;
>   unsigned int size;
> + void *src, *dst;
> + struct zram_meta *meta = zram->meta;
> +
> + if (zram_special_page_read(zram, index, page, 0, PAGE_SIZE))
> + return 0;
>  
>   bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
>   handle = meta->table[index].handle;
>   size = zram_get_obj_size(meta, index);
>  
> - if (!handle || zram_test_flag(meta, index, ZRAM

[PATCH] fix pinctrl setup for i.MX6

2017-02-27 Thread Mika Penttilä
Hi!

Recent pulls into mainline (pre-4.11) introduced pinctrl setup changes and moved
pinctrl-imx to use generic helpers.

The net effect was that the hog group could not be resolved. I made it work for
myself with a two-stage setup, with create and start separated and the DT probe
in between them.


Signed-off-by: Mika Penttilä 
---

diff --git a/drivers/pinctrl/core.c b/drivers/pinctrl/core.c
index d690465..33659c5a 100644
--- a/drivers/pinctrl/core.c
+++ b/drivers/pinctrl/core.c
@@ -2237,6 +2237,47 @@ int devm_pinctrl_register_and_init(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(devm_pinctrl_register_and_init);
 
+int devm_pinctrl_register_and_init_nostart(struct device *dev,
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev **pctldev)
+{
+   struct pinctrl_dev **ptr;
+   struct pinctrl_dev *p;
+
+   ptr = devres_alloc(devm_pinctrl_dev_release, sizeof(*ptr), GFP_KERNEL);
+   if (!ptr)
+   return -ENOMEM;
+
+   p = pinctrl_init_controller(pctldesc, dev, driver_data);
+   if (IS_ERR(p)) {
+   devres_free(ptr);
+   return PTR_ERR(p);
+   }
+
+   *ptr = p;
+   devres_add(dev, ptr);
+   *pctldev = p;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(devm_pinctrl_register_and_init_nostart);
+
+int devm_pinctrl_start(struct device *dev,
+   struct pinctrl_dev *pctldev)
+{
+   int error = 0;
+
+   error = pinctrl_create_and_start(pctldev);
+   if (error) {
+   mutex_destroy(&pctldev->mutex);
+   return error;
+   }
+
+   return error;
+}
+EXPORT_SYMBOL_GPL(devm_pinctrl_start);
+
 /**
  * devm_pinctrl_unregister() - Resource managed version of 
pinctrl_unregister().
  * @dev: device for which which resource was allocated
diff --git a/drivers/pinctrl/freescale/pinctrl-imx.c b/drivers/pinctrl/freescale/pinctrl-imx.c
index a7ace9e..3644aed 100644
--- a/drivers/pinctrl/freescale/pinctrl-imx.c
+++ b/drivers/pinctrl/freescale/pinctrl-imx.c
@@ -686,16 +686,6 @@ static int imx_pinctrl_probe_dt(struct platform_device *pdev,
return 0;
 }
 
-/*
- * imx_free_resources() - free memory used by this driver
- * @info: info driver instance
- */
-static void imx_free_resources(struct imx_pinctrl *ipctl)
-{
-   if (ipctl->pctl)
-   pinctrl_unregister(ipctl->pctl);
-}
-
 int imx_pinctrl_probe(struct platform_device *pdev,
  struct imx_pinctrl_soc_info *info)
 {
@@ -774,26 +764,30 @@ int imx_pinctrl_probe(struct platform_device *pdev,
ipctl->info = info;
ipctl->dev = info->dev;
platform_set_drvdata(pdev, ipctl);
-   ret = devm_pinctrl_register_and_init(&pdev->dev,
-imx_pinctrl_desc, ipctl,
-&ipctl->pctl);
+
+   ret = devm_pinctrl_register_and_init_nostart(&pdev->dev,
+   imx_pinctrl_desc, ipctl,
+   &ipctl->pctl);
+
if (ret) {
dev_err(&pdev->dev, "could not register IMX pinctrl driver\n");
-   goto free;
+   return ret;
}
 
ret = imx_pinctrl_probe_dt(pdev, ipctl);
if (ret) {
dev_err(&pdev->dev, "fail to probe dt properties\n");
-   goto free;
+   return ret;
+   }
+
+   ret = devm_pinctrl_start(&pdev->dev, ipctl->pctl);
+   if (ret) {
+   dev_err(&pdev->dev, "could not start IMX pinctrl driver\n");
+   return ret;
}
 
dev_info(&pdev->dev, "initialized IMX pinctrl driver\n");
 
return 0;
 
-free:
-   imx_free_resources(ipctl);
-
-   return ret;
 }
diff --git a/include/linux/pinctrl/pinctrl.h b/include/linux/pinctrl/pinctrl.h
index 8ce2d87..e7020f0 100644
--- a/include/linux/pinctrl/pinctrl.h
+++ b/include/linux/pinctrl/pinctrl.h
extern struct pinctrl_dev *pinctrl_register(struct pinctrl_desc *pctldesc,
 extern void pinctrl_unregister(struct pinctrl_dev *pctldev);
 
 extern int devm_pinctrl_register_and_init(struct device *dev,
-   struct pinctrl_desc *pctldesc,
-   void *driver_data,
-   struct pinctrl_dev **pctldev);
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev **pctldev);
+
+extern int devm_pinctrl_register_and_init_nostart(struct device *dev,
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+ 

[REGRESSION] pinctrl, of, unable to find hogs

2017-02-26 Thread Mika Penttilä

With current Linus git (pre-4.11), the pinctrl hogs cannot be found:


 imx6q-pinctrl 20e.iomuxc: unable to find group for node hoggrp


Device is i.MX6 based.


--Mika



Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held

2017-02-13 Thread Mika Penttilä

On 13.02.2017 16:15, Oleg Nesterov wrote:
> + retval = de_thread(current);
> + if (retval)
> + return retval;
>  
>   if (N_MAGIC(ex) == OMAGIC) {
>   unsigned long text_addr, map_size;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 4223702..79508f7 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
>   setup_new_exec(bprm);
>   install_exec_creds(bprm);
>  
> + retval = de_thread(current);
> + if (retval)
> + goto out_free_dentry;
> +
>   /* Do this so that we can load the interpreter, if need be.  We will
>  change some of these later */
>   retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
>executable_stack);
>   if (retval < 0)
>   goto out_free_dentry;
> - 
> +
>   current->mm->start_stack = bprm->p;
>  
>   /* Now we do a little grungy work by mmapping the ELF image into
> diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
> index d2e36f8..75fd6d8 100644
> --- a/fs/binfmt_elf_fdpic.c
> +++ b/fs/binfmt_elf_fdpic.c
> @@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
>  #endif
>  
>   install_exec_creds(bprm);
> + retval = de_thread(current);
> + if (retval)
> + goto error;
> +
>   if (create_elf_fdpic_tables(bprm, current->mm,
>   &exec_params, &interp_params) < 0)
>   goto error;
> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
> index 9b2917a..a0ad9a3 100644
> --- a/fs/binfmt_flat.c
> +++ b/fs/binfmt_flat.c
> @@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm)
>   }
>  
>   install_exec_creds(bprm);
> + res = de_thread(current);
> + if (res)
> + return res;
>  
>   set_binfmt(&flat_format);
>  
> diff --git a/fs/exec.c b/fs/exec.c
> index e579466..8591c56 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm)
>   return 0;
>  }
>  
> +static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig)
> +{
> + for (;;) {
> + if (unlikely(__fatal_signal_pending(tsk)))
> + goto killed;
> + set_current_state(TASK_KILLABLE);
> + if (!sig->notify_count)
> + break;
> + schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> + return 0;
> +
> +killed:
> + /* protects against exit_notify() and __exit_signal() */
> + read_lock(&tasklist_lock);
> + sig->group_exit_task = NULL;
> + sig->notify_count = 0;
> + read_unlock(&tasklist_lock);
> + return -EINTR;
> +}
> +
> +/*
> + * Kill all the sub-threads and wait until they all pass exit_notify().
> + */
> +static int kill_sub_threads(struct task_struct *tsk)
> +{
> + struct signal_struct *sig = tsk->signal;
> + int err = -EINTR;
> +
> + if (thread_group_empty(tsk))
> + return 0;
> +
> + read_lock(&tasklist_lock);
> + spin_lock_irq(&tsk->sighand->siglock);
> + if (!signal_group_exit(sig)) {
> + sig->group_exit_task = tsk;
> + sig->notify_count = -zap_other_threads(tsk);
> + err = 0;
> + }
> + spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> +
> + if (!err)
> + err = wait_for_notify_count(tsk, sig);
> + return err;
> +
> +}
> +
>  /*
> - * This function makes sure the current process has its own signal table,
> - * so that flush_signal_handlers can later reset the handlers without
> - * disturbing other processes.  (Other processes might share the signal
> - * table via the CLONE_SIGHAND option to clone().)
> + * This function makes sure the current process has no other threads and
> + * has a private signal table so that flush_signal_handlers() can reset
> + * the handlers without disturbing other processes which might share the
> + * signal table via the CLONE_SIGHAND option to clone().
>   */
> -static int de_thread(struct task_struct *tsk)
> +int de_thread(struct task_struct *tsk)
>  {
>   struct signal_struct *sig = tsk->signal;
>   struct sighand_struct *oldsighand = tsk->sighand;
> @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk)
>   if (thread_group_empty(tsk))
>   goto no_thread_group;
>  
> - /*
> -  * Kill all other threads in the thread group.
> -  */
>   spin_lock_irq(lock);
> - if (signal_group_exit(sig)) {
> - /*
> -  * Another group action in progress, just
> -  * return so that the signal is processed.
> -  */
> - spin_unlock_irq(lock);
> - return -EAGAIN;
> - }
> -
> - sig->group_exit_task = tsk;
> - sig->notify_count = zap_other_threads(tsk);
> + sig->

Re: [PATCH 2/2] efi: efi_mem_reserve(): don't reserve through memblock after mm_init()

2016-12-21 Thread Mika Penttilä


On 21.12.2016 20:28, Nicolai Stange wrote:
> Before invoking the arch specific handler, efi_mem_reserve() reserves
> the given memory region through memblock.
>
> efi_mem_reserve() can get called after mm_init() though -- through
> efi_bgrt_init(), for example. After mm_init(), memblock is dead and should
> not be used anymore.
>
> Let efi_mem_reserve() check whether memblock is dead and not do the
> reservation if so. Emit a warning from the generic efi_arch mem_reserve()
> in this case: if the architecture doesn't provide any other means of
> registering the region as reserved, the operation would be a nop.
>
> Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying 
> image data")
> Signed-off-by: Nicolai Stange 
> ---
>  drivers/firmware/efi/efi.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 92914801e388..12b2e3a6d73f 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -403,7 +403,10 @@ u64 __init efi_mem_desc_end(efi_memory_desc_t *md)
>   return end;
>  }
>  
> -void __init __weak efi_arch_mem_reserve(phys_addr_t addr, u64 size) {}
> +void __init __weak efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> +{
> + WARN(slab_is_available(), "efi_mem_reserve() has no effect");
> +}
>  
>  /**
>   * efi_mem_reserve - Reserve an EFI memory region
> @@ -419,7 +422,7 @@ void __init __weak efi_arch_mem_reserve(phys_addr_t addr, 
> u64 size) {}
>   */
>  void __init efi_mem_reserve(phys_addr_t addr, u64 size)
>  {
> - if (!memblock_is_region_reserved(addr, size))
> + if (slab_is_available() && !memblock_is_region_reserved(addr, size))
>   memblock_reserve(addr, size);
Maybe !slab_is_available() ?

>  
--Mika




Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

2016-10-20 Thread Mika Penttilä


On 20.10.2016 15:38, zhouxianr...@huawei.com wrote:
> From: z00281421 
>
> The bdi flusher should be throttled only depends on 
> own bdi and is decoupled with others.
>
> separate PGDAT_WRITEBACK into PGDAT_ANON_WRITEBACK and
> PGDAT_FILE_WRITEBACK avoid scanning anon lru and it is ok 
> then throttled on file WRITEBACK.
>
> i think above may be not right.
>
> Signed-off-by: z00281421 
> ---
>  fs/fs-writeback.c  |8 ++--
>  include/linux/mmzone.h |7 +--
>  mm/vmscan.c|   20 
>  3 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 05713a5..ddcc70f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1905,10 +1905,13 @@ void wb_workfn(struct work_struct *work)
>  {
>   struct bdi_writeback *wb = container_of(to_delayed_work(work),
>   struct bdi_writeback, dwork);
> + struct backing_dev_info *bdi = container_of(to_delayed_work(work),
> + struct backing_dev_info, 
> wb.dwork);
>   long pages_written;
>  
>   set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> - current->flags |= PF_SWAPWRITE;
> + current->flags |= (PF_SWAPWRITE | PF_LESS_THROTTLE);
> + current->bdi = bdi;
>  
>   if (likely(!current_is_workqueue_rescuer() ||
>  !test_bit(WB_registered, &wb->state))) {
> @@ -1938,7 +1941,8 @@ void wb_workfn(struct work_struct *work)
>   else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>   wb_wakeup_delayed(wb);
>  
> - current->flags &= ~PF_SWAPWRITE;
> + current->bdi = NULL;
> + current->flags &= ~(PF_SWAPWRITE | PF_LESS_THROTTLE);
>  }
>  
>  /*
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7f2ae99..fa602e9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -528,8 +528,11 @@ enum pgdat_flags {
>* many dirty file pages at the tail
>* of the LRU.
>*/
> - PGDAT_WRITEBACK,/* reclaim scanning has recently found
> -  * many pages under writeback
> + PGDAT_ANON_WRITEBACK,   /* reclaim scanning has recently found
> +  * many anonymous pages under writeback
> +  */
> + PGDAT_FILE_WRITEBACK,   /* reclaim scanning has recently found
> +  * many file pages under writeback
>*/
>   PGDAT_RECLAIM_LOCKED,   /* prevents concurrent reclaim */

Nobody seems to be clearing those bits (the same was true with PGDAT_WRITEBACK)?


--Mika



Re: EXT: pre 4.9-rc dma regression, imx6 sound etc

2016-10-13 Thread Mika Penttilä
On 10/13/2016 08:25 AM, Han, Nandor (GE Healthcare) wrote:
> 
> 
> On 12/10/2016 16:25, Mika Penttilä wrote:
>> Hi!
>>
>> Bisected that commit 5881826 - "dmaengine: imx-sdma - update the residue 
>> calculation for cyclic channels" is first bad commit to cause audio 
>> regression on imx6q, sgtl5000 codec. It causes audible disturbing background 
>> noise when playing wavs.
>>
>> Unfortunately, reverting only 5881826 causes uarts to fail, so another fix 
>> is needed.
>>
>> Thanks,
>>
>> Mika
>>
>>
>>
>>
>>
> 
> Hi Mika,
>Thanks for info. I have already posted a patch that will fix this issue.
> 
> https://lkml.org/lkml/2016/10/11/319
> 
> It was already tested on imx6 and imx53. Let me know how it works.
> 

Hi,

Great, with this patch audio is ok again! This patch should go in for 4.9!

Thanks,
Mika



pre 4.9-rc dma regression, imx6 sound etc

2016-10-12 Thread Mika Penttilä
Hi!

Bisected commit 5881826 ("dmaengine: imx-sdma - update the residue calculation
for cyclic channels") as the first bad commit causing an audio regression on
imx6q with the sgtl5000 codec. It causes disturbing, audible background noise
when playing wavs.

Unfortunately, reverting only 5881826 causes the UARTs to fail, so another fix
is needed.

Thanks,

Mika







Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-15 Thread Mika Penttilä
On 09/15/2016 07:25 AM, Wanpeng Li wrote:
> 2016-09-15 12:08 GMT+08:00 Mika Penttilä :
>> On 09/14/2016 10:58 AM, Wanpeng Li wrote:
>>> From: Wanpeng Li 
>>>
>>> I observed that kvmvapic(to optimize flexpriority=N or AMD) is used
>>> to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
>>> on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f9542
>>> x86, apicv: add virtual x2apic support) disable virtual x2apic mode
>>> completely if w/o APICv, and the author also told me that windows guest
>>> can't enter into x2apic mode when he developed the APICv feature several
>>> years ago. However, it is not truth currently, Interrupt Remapping and
>>> vIOMMU is added to qemu and the developers from Intel test windows 8 can
>>> work in x2apic mode w/ Interrupt Remapping enabled recently.
>>>
>>> This patch enables TPR shadow for virtual x2apic mode to boost
>>> windows guest in x2apic mode even if w/o APICv.
>>>
>>> Can pass the kvm-unit-test.
>>>
>>
>> While at it, is the vmx flexpriotity stuff still valid code?
>> AFAICS it gets enabled iff TPR shadow is on. flexpriority
>> is on when :
>>
>> (flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow && 
>> cpu_has_vmx_virtualize_apic_accesses)
>>
>> But apic accesses to TPR mmio are not then trapped and TPR changes not 
>> reported because
>> the “use TPR shadow” VM-execution control is 1.
> 
> Please note the patch is for MSR-BASED TPR shadow w/o APICv, TPR
> shadow can work correctly in other configurations.
> 
> Regards,
> Wanpeng Li
> 

Hi,

Yes, I see; this is slightly off-topic, but while we are at flexpriority, how is
it relevant in current KVM? In other words, I see it as dead code: it is enabled
only when TPR shadow is enabled, and as such it is ineffective, because TPR
shadow disables the VM exits that TPR reporting uses.

Thanks,
Mika



Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-14 Thread Mika Penttilä
On 09/14/2016 10:58 AM, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> I observed that kvmvapic(to optimize flexpriority=N or AMD) is used 
> to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
> on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f9542 
> x86, apicv: add virtual x2apic support) disable virtual x2apic mode 
> completely if w/o APICv, and the author also told me that windows guest
> can't enter into x2apic mode when he developed the APICv feature several 
> years ago. However, it is not truth currently, Interrupt Remapping and 
> vIOMMU is added to qemu and the developers from Intel test windows 8 can 
> work in x2apic mode w/ Interrupt Remapping enabled recently. 
> 
> This patch enables TPR shadow for virtual x2apic mode to boost 
> windows guest in x2apic mode even if w/o APICv.
> 
> Can pass the kvm-unit-test.
> 

While at it, is the VMX flexpriority stuff still valid code?
AFAICS it gets enabled iff TPR shadow is on. flexpriority is on when:

(flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow &&
cpu_has_vmx_virtualize_apic_accesses)

But APIC accesses to TPR MMIO are not then trapped, and TPR changes are not
reported, because the “use TPR shadow” VM-execution control is 1.

Thanks,
Mika


> Suggested-by: Wincy Van 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Wincy Van 
> Cc: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 41 ++---
>  1 file changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5cede40..e703129 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6336,7 +6336,7 @@ static void wakeup_handler(void)
>  
>  static __init int hardware_setup(void)
>  {
> - int r = -ENOMEM, i, msr;
> + int r = -ENOMEM, i;
>  
>   rdmsrl_safe(MSR_EFER, &host_efer);
>  
> @@ -6464,18 +6464,6 @@ static __init int hardware_setup(void)
>  
>   set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
>  
> - for (msr = 0x800; msr <= 0x8ff; msr++)
> - vmx_disable_intercept_msr_read_x2apic(msr);
> -
> - /* TMCCT */
> - vmx_enable_intercept_msr_read_x2apic(0x839);
> - /* TPR */
> - vmx_disable_intercept_msr_write_x2apic(0x808);
> - /* EOI */
> - vmx_disable_intercept_msr_write_x2apic(0x80b);
> - /* SELF-IPI */
> - vmx_disable_intercept_msr_write_x2apic(0x83f);
> -
>   if (enable_ept) {
>   kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
>   (enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
> @@ -8435,12 +8423,7 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   return;
>   }
>  
> - /*
> -  * There is not point to enable virtualize x2apic without enable
> -  * apicv
> -  */
> - if (!cpu_has_vmx_virtualize_x2apic_mode() ||
> - !kvm_vcpu_apicv_active(vcpu))
> + if (!cpu_has_vmx_virtualize_x2apic_mode())
>   return;
>  
>   if (!cpu_need_tpr_shadow(vcpu))
> @@ -8449,8 +8432,28 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   sec_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>  
>   if (set) {
> + int msr;
> +
>   sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>   sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +
> + if (kvm_vcpu_apicv_active(vcpu)) {
> + for (msr = 0x800; msr <= 0x8ff; msr++)
> + vmx_disable_intercept_msr_read_x2apic(msr);
> +
> + /* TMCCT */
> + vmx_enable_intercept_msr_read_x2apic(0x839);
> + /* TPR */
> + vmx_disable_intercept_msr_write_x2apic(0x808);
> + /* EOI */
> + vmx_disable_intercept_msr_write_x2apic(0x80b);
> + /* SELF-IPI */
> + vmx_disable_intercept_msr_write_x2apic(0x83f);
> + } else if (vmx_exec_control(to_vmx(vcpu)) & 
> CPU_BASED_TPR_SHADOW) {
> + /* TPR */
> + vmx_disable_intercept_msr_read_x2apic(0x808);
> + vmx_disable_intercept_msr_write_x2apic(0x808);
> + }
>   } else {
>   sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>   sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 



[PATCH] bluetooth, regression: MSG_TRUNC fixes

2016-08-24 Thread Mika Penttilä
Recent 4.8-rc changes to Bluetooth MSG_TRUNC handling introduced a regression:
pairing finishes, but connecting profiles does not.

With the fixes below to the MSG_TRUNC handling, the connection is established
normally.

--Mika


Signed-off-by: Mika Penttilä 
---

diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index ece45e0..0b5f729 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -250,7 +250,7 @@ int bt_sock_recvmsg(struct socket *sock, struct msghdr 
*msg, siz
 
skb_free_datagram(sk, skb);
 
-   if (msg->msg_flags & MSG_TRUNC)
+   if (flags & MSG_TRUNC)
copied = skblen;
 
return err ? : copied;
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 6ef8a01..96f04b7 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -1091,7 +1091,7 @@ static int hci_sock_recvmsg(struct socket *sock, struct 
msghdr
 
skb_free_datagram(sk, skb);
 
-   if (msg->msg_flags & MSG_TRUNC)
+   if (flags & MSG_TRUNC)
copied = skblen;
 
return err ? : copied;


Re: kvm vmx shadow paging question

2016-08-15 Thread Mika Penttilä
On 13.08.2016 18:47, Mika Penttilä wrote:

> On 13.08.2016 17:38, Mika Penttilä wrote:
>
>> Hi,
>>
>> While studying the vmx code, and the shadow page tables usage (no ept 
>> involved),
>> I wondered the GUEST_CR3 usage. If no ept, GUEST_CR3 points to the shadow 
>> tables.
>> But the format of sptes is always 64 bit. How is that with 32 bit hosts, is 
>> the GUEST_CR3
>> similar to EPT format, 4 level 64 bit always or how is this working on 32 
>> bit?
>>
>> Thanks,
>> Mika
>>
> Hmm seems non PAE 32 bit host without ept enabled is not supported 
> combination, right ?
>
> --Mika
>
>

Ping; am I correct that to use the shadow MMU on an x86 KVM host you need at
least a 32-bit PAE-enabled host (because shadow PTEs are always 64 bit, in PAE
or long format)?

--Mika



Re: kvm vmx shadow paging question

2016-08-14 Thread Mika Penttilä
On 13.08.2016 17:38, Mika Penttilä wrote:

> Hi,
>
> While studying the vmx code, and the shadow page tables usage (no ept 
> involved),
> I wondered the GUEST_CR3 usage. If no ept, GUEST_CR3 points to the shadow 
> tables.
> But the format of sptes is always 64 bit. How is that with 32 bit hosts, is 
> the GUEST_CR3
> similar to EPT format, 4 level 64 bit always or how is this working on 32 bit?
>
> Thanks,
> Mika
>
Hmm, it seems a non-PAE 32-bit host without EPT enabled is not a supported
combination, right?

--Mika



Re: [PATCH v1 2/2] x86/KASLR: Increase BRK pages for KASLR memory randomization

2016-08-08 Thread Mika Penttilä
On 08/08/2016 09:40 PM, Thomas Garnier wrote:
> Default implementation expects 6 pages maximum are needed for low page
> allocations. If KASLR memory randomization is enabled, the worse case
> of e820 layout would require 12 pages (no large pages). It is due to the
> PUD level randomization and the variable e820 memory layout.
> 
> This bug was found while doing extensive testing of KASLR memory
> randomization on different type of hardware.
> 
> Signed-off-by: Thomas Garnier 
> ---
> Based on next-20160805
> ---
>  arch/x86/mm/init.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 6209289..3a27e6a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -130,6 +130,14 @@ void  __init early_alloc_pgt_buf(void)
>   unsigned long tables = INIT_PGT_BUF_SIZE;
>   phys_addr_t base;
>  
> + /*
> +  * Depending on the machine e860 memory layout and the PUD alignement.
> +  * We may need twice more pages when KASLR memoy randomization is
> +  * enabled.
> +  */
> + if (IS_ENABLED(CONFIG_RANDOMIZE_MEMORY))
> + tables *= 2;
> +
>   base = __pa(extend_brk(tables, PAGE_SIZE));
>  
>   pgt_buf_start = base >> PAGE_SHIFT;
> 

You should also increase the reserve:
RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);


--Mika



Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup

2016-06-29 Thread Mika Penttilä


On 29.06.2016 10:06, Mika Penttilä wrote:
> On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
>> Hi all-
>>
>> Known issues:
>>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>>has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>>and virtio_console.
>>
>  How about PTRACE_SETREGS, it's using the child stack's vmapped address to 
> put regs?
>
> --Mika
>
PTRACE_SETREGS is ok of course.

--Mika



Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup

2016-06-29 Thread Mika Penttilä
On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
> Hi all-
> 

> 
> Known issues:
>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>and virtio_console.
> 
How about PTRACE_SETREGS? It uses the child stack's vmapped address to put the
regs.

--Mika




Re: [PATCH 12/13] x86/mm/64: Enable vmapped stacks

2016-06-15 Thread Mika Penttilä
Hi,

On 06/16/2016 03:28 AM, Andy Lutomirski wrote:
> This allows x86_64 kernels to enable vmapped stacks.  There are a
> couple of interesting bits.
> 
> First, x86 lazily faults in top-level paging entries for the vmalloc
> area.  This won't work if we get a page fault while trying to access
> the stack: the CPU will promote it to a double-fault and we'll die.
> To avoid this problem, probe the new stack when switching stacks and
> forcibly populate the pgd entry for the stack when switching mms.
> 
> Second, once we have guard pages around the stack, we'll want to
> detect and handle stack overflow.
> 
> I didn't enable it on x86_32.  We'd need to rework the double-fault
> code a bit and I'm concerned about running out of vmalloc virtual
> addresses under some workloads.
> 
> This patch, by itself, will behave somewhat erratically when the
> stack overflows while RSP is still more than a few tens of bytes
> above the bottom of the stack.  Specifically, we'll get #PF and make
> it to no_context and an oops without triggering a double-fault, and
> no_context doesn't know about stack overflows.  The next patch will
> improve that case.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/Kconfig |  1 +
>  arch/x86/include/asm/switch_to.h | 28 +++-
>  arch/x86/kernel/traps.c  | 32 
>  arch/x86/mm/tlb.c| 15 +++
>  4 files changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0a7b885964ba..b624b24d1dc1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -92,6 +92,7 @@ config X86
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>   select HAVE_EBPF_JITif X86_64
> + select HAVE_ARCH_VMAP_STACK if X86_64
>   select HAVE_CC_STACKPROTECTOR
>   select HAVE_CMPXCHG_DOUBLE
>   select HAVE_CMPXCHG_LOCAL
> diff --git a/arch/x86/include/asm/switch_to.h 
> b/arch/x86/include/asm/switch_to.h
> index 8f321a1b03a1..14e4b20f0aaf 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,28 @@ struct tss_struct;
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
> struct tss_struct *tss);
>  
> +/* This runs runs on the previous thread's stack. */
> +static inline void prepare_switch_to(struct task_struct *prev,
> +  struct task_struct *next)
> +{
> +#ifdef CONFIG_VMAP_STACK
> + /*
> +  * If we switch to a stack that has a top-level paging entry
> +  * that is not present in the current mm, the resulting #PF will
> +  * will be promoted to a double-fault and we'll panic.  Probe
> +  * the new stack now so that vmalloc_fault can fix up the page
> +  * tables if needed.  This can only happen if we use a stack
> +  * in vmap space.
> +  *
> +  * We assume that the stack is aligned so that it never spans
> +  * more than one top-level paging entry.
> +  *
> +  * To minimize cache pollution, just follow the stack pointer.
> +  */
> + READ_ONCE(*(unsigned char *)next->thread.sp);
> +#endif
> +}
> +
>  #ifdef CONFIG_X86_32
>  
>  #ifdef CONFIG_CC_STACKPROTECTOR
> @@ -39,6 +61,8 @@ do {
> \
>*/ \
>   unsigned long ebx, ecx, edx, esi, edi;  \
>   \
> + prepare_switch_to(prev, next);  \
> + \
>   asm volatile("pushl %%ebp\n\t"  /* saveEBP   */ \
>"movl %%esp,%[prev_sp]\n\t"/* saveESP   */ \
>"movl %[next_sp],%%esp\n\t"/* restore ESP   */ \
> @@ -103,7 +127,9 @@ do {  
> \
>   * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
>   * has no effect.
>   */
> -#define switch_to(prev, next, last) \
> +#define switch_to(prev, next, last)\
> + prepare_switch_to(prev, next);\
> +   \
>   asm volatile(SAVE_CONTEXT \
>"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */   \
>"movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */\
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 00f03d82e69a..9cb7ea781176 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP, SIGBUS,  "segment not 
> present",   segment_not

Re: [PATCH v2] sched/cputime: add steal time support to full dynticks CPU time accounting

2016-05-18 Thread Mika Penttilä
On 05/18/2016 11:28 AM, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> This patch adds steal guest time support to full dynticks CPU 
> time accounting. After 'commit ff9a9b4c4334 ("sched, time: Switch 
> VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")', time is jiffy 
> based sampling even if it's still listened to ring boundaries, so 
> steal_account_process_tick() is reused to account how much 'ticks' 
> are steal time after the last accumulation. 
> 
> Suggested-by: Rik van Riel 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra (Intel) 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Frederic Weisbecker 
> Cc: Paolo Bonzini 
> Cc: Radim 
> Signed-off-by: Wanpeng Li 
> ---
> v1 -> v2:
>  * fix divide zero bug, thanks Rik
> 
>  kernel/sched/cputime.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..bfa50a0 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
>   cpustat[CPUTIME_IDLE] += (__force u64) cputime;
>  }
>  
> -static __always_inline bool steal_account_process_tick(void)
> +static __always_inline unsigned long steal_account_process_tick(void)
>  {
>  #ifdef CONFIG_PARAVIRT
>   if (static_key_false(¶virt_steal_enabled)) {
> @@ -279,7 +279,7 @@ static __always_inline bool 
> steal_account_process_tick(void)
>   return steal_jiffies;
>   }
>  #endif
> - return false;
> + return 0;
>  }
>  
>  /*
> @@ -691,8 +691,12 @@ static cputime_t get_vtime_delta(struct task_struct *tsk)
>  
>  static void __vtime_account_system(struct task_struct *tsk)
>  {
> + unsigned long steal_time = steal_account_process_tick();
>   cputime_t delta_cpu = get_vtime_delta(tsk);
>  
> + if (steal_time >= delta_cpu)
> + return;
> + delta_cpu -= steal_time;
>   account_system_time(tsk, irq_count(), delta_cpu, 
> cputime_to_scaled(delta_cpu));
>  }
>  
> @@ -723,7 +727,12 @@ void vtime_account_user(struct task_struct *tsk)
>   write_seqcount_begin(&tsk->vtime_seqcount);
>   tsk->vtime_snap_whence = VTIME_SYS;
>   if (vtime_delta(tsk)) {
> + unsigned long steal_time = steal_account_process_tick();
>   delta_cpu = get_vtime_delta(tsk);


AFAIK steal_account_process_tick() returns jiffies and get_vtime_delta()
returns cputime, so you can't mix them like this, can you?

> +
> + if (steal_time >= delta_cpu)
> + return;



> + delta_cpu -= steal_time;
>   account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
>   }
>   write_seqcount_end(&tsk->vtime_seqcount);
> 


--Mika



[BUG] 4.4.7 af_unix, wakeup

2016-04-21 Thread Mika Penttilä
I have hit this same-looking oops every now and then since at least 4.2 or so.
It is not easy to reproduce systematically.

--Mika



[10973.891726] Unable to handle kernel NULL pointer dereference at virtual 
address 
[10973.899839] pgd = a8ce4000
[10973.902549] [] *pgd=38c38831, *pte=, *ppte=
[10973.908866] Internal error: Oops: 8007 [#1] PREEMPT SMP ARM
[10973.914787] Modules linked in: btwilink radio_quantek st_drv
[10973.920507] CPU: 1 PID: 310 Comm: compositor Not tainted 4.4.7 #1
[10973.926602] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[10973.933133] task: a82dc740 ti: a8dd2000 task.ti: a8dd2000
[10973.938534] PC is at 0x0
[10973.941080] LR is at __wake_up_common+0x4c/0x80
[10973.945615] pc : [<>]lr : [<80058584>]psr: 800f00b3
[10973.945615] sp : a8dd3d98  ip : a9077d48  fp : 0001
[10973.957093] r10: 0001  r9 : 0001  r8 : 00c3
[10973.962320] r7 : a9041c84  r6 : 011e6ebc  r5 : 0001  r4 : ab712074
[10973.968849] r3 : 00c3  r2 : 0001  r1 : 0001  r0 : a9077d48
[10973.975380] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment 
user
[10973.982777] Control: 10c5387d  Table: 38ce404a  DAC: 0055
[10973.988525] Process compositor (pid: 310, stack limit = 0xa8dd2210)
[10973.994793] Stack: (0xa8dd3d98 to 0xa8dd4000)
[10973.999153] 3d80:   
0001 a9041c80
[10974.007335] 3da0: 0001 00c3 0001 a00f0013 0030 a88dd080 
a88df64c 80058b68
[10974.015517] 3dc0: 00c3 806e28e0 a88df440  a88df440 a9c3ac00 
0001 8051b7ac
[10974.023698] 3de0: 0030 805c392c a8dd3e08 0003 7e9e25d8 0030 
a64d8c00 0001
[10974.031879] 3e00: a8dd3f74 a8dd3f6c     
 
[10974.040060] 3e20: a8dd3e54 a8dd3f6c 4040  a64d8c00  
a8dd3e58 
[10974.048240] 3e40: 0170 80518634 a8dd3f6c 80518fb8   
0005 757d2b98
[10974.056421] 3e60:    00c5e2dc 00c5d724 75750310 
00deeaac 0030
[10974.064602] 3e80:  00afd9c4 00c5e2dc  758057e0 7e9e22a8 
757fa4c8 757782e8
[10974.072783] 3ea0: 75805790  0068 00e364b8  00afd9c4 
00e364e0 0003
[10974.080963] 3ec0: 758057f0  00d4 75791b0c 04a6 0001 
00aff83c 00afd9c4
[10974.089145] 3ee0: 0034 00e2c374 00e364c4 00e1b634  0001 
00d4 0300
[10974.097325] 3f00: 04a6 75763790 00e364e0   0001 
3fff a84b3a98
[10974.105506] 3f20: a8dfc9c0 800fd2e8 a8dd3f68 a8dd3f64 4040 0128 
8000f6a4 a64d8c00
[10974.113686] 3f40: 7e9e25e8 4040 0128 8000f6a4 a8dd2000 80519c14 
 7e9e2730
[10974.121867] 3f60: 0107 0001 fff7   0001 
 
[10974.130047] 3f80: a8dd3e80    4040  
0001 00df0a30
[10974.138228] 3fa0: 00deea20 8000f500 0001 00df0a30 001e 7e9e25e8 
4040 
[10974.146409] 3fc0: 0001 00df0a30 00deea20 0128 00df1a18 000b308c 
7e9e25e8 0170
[10974.154590] 3fe0:  7e9e25d0 76ef54c0 75b568f8 800f0010 001e 
 
[10974.162783] [<80058584>] (__wake_up_common) from [<80058b68>] 
(__wake_up_sync_key+0x44/0x60)
[10974.171236] [<80058b68>] (__wake_up_sync_key) from [<8051b7ac>] 
(sock_def_readable+0x3c/0x6c)
[10974.179779] [<8051b7ac>] (sock_def_readable) from [<805c392c>] 
(unix_stream_sendmsg+0x154/0x340)
[10974.188572] [<805c392c>] (unix_stream_sendmsg) from [<80518634>] 
(sock_sendmsg+0x14/0x24)
[10974.196757] [<80518634>] (sock_sendmsg) from [<80518fb8>] 
(___sys_sendmsg+0x1d0/0x1d8)
[10974.204680] [<80518fb8>] (___sys_sendmsg) from [<80519c14>] 
(__sys_sendmsg+0x3c/0x6c)
[10974.212521] [<80519c14>] (__sys_sendmsg) from [<8000f500>] 
(ret_fast_syscall+0x0/0x34)
[10974.220441] Code: bad PC value
[10974.223502] ---[ end trace 7f5c4ba07462311f ]---



Re: [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages

2016-04-06 Thread Mika Penttilä
On 04/06/2016 12:53 AM, Hugh Dickins wrote:



> +static void shmem_recovery_work(struct work_struct *work)
> +{
> + struct recovery *recovery;
> + struct shmem_inode_info *info;
> + struct address_space *mapping;
> + struct page *page;
> + struct page *head = NULL;
> + int error = -ENOENT;
> +
> + recovery = container_of(work, struct recovery, work);
> + info = SHMEM_I(recovery->inode);
> + if (!shmem_work_still_useful(recovery)) {
> + shr_stats(work_too_late);
> + goto out;
> + }
> +
> + /* Are we resuming from an earlier partially successful attempt? */
> + mapping = recovery->inode->i_mapping;
> + spin_lock_irq(&mapping->tree_lock);
> + page = shmem_clear_tag_hugehole(mapping, recovery->head_index);
> + if (page)
> + head = team_head(page);
> + spin_unlock_irq(&mapping->tree_lock);
> + if (head) {
> + /* Serialize with shrinker so it won't mess with our range */
> + spin_lock(&shmem_shrinklist_lock);
> + spin_unlock(&shmem_shrinklist_lock);
> + }
> +
> + /* If team is now complete, no tag and head would be found above */
> + page = recovery->page;
> + if (PageTeam(page))
> + head = team_head(page);
> +
> + /* Get a reference to the head of the team already being assembled */
> + if (head) {
> + if (!get_page_unless_zero(head))
> + head = NULL;
> + else if (!PageTeam(head) || head->mapping != mapping ||
> + head->index != recovery->head_index) {
> + put_page(head);
> + head = NULL;
> + }
> + }
> +
> + if (head) {
> + /* We are resuming work from a previous partial recovery */
> + if (PageTeam(page))
> + shr_stats(resume_teamed);
> + else
> + shr_stats(resume_tagged);
> + } else {
> + gfp_t gfp = mapping_gfp_mask(mapping);
> + /*
> +  * XXX: Note that with swapin readahead, page_to_nid(page) will
> +  * often choose an unsuitable NUMA node: something to fix soon,
> +  * but not an immediate blocker.
> +  */
> + head = __alloc_pages_node(page_to_nid(page),
> + gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);  
>  
> + if (!head) {
> + shr_stats(huge_failed);
> + error = -ENOMEM;
> + goto out;
> + }

Should this head be marked PageTeam? Because in patch 27/31, when it is given
as a hint to shmem_getpage_gfp():

hugehint = NULL;
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+   sgp == SGP_TEAM && *pagep) {
+   struct page *head;
+
+   if (!get_page_unless_zero(*pagep)) {
+   error = -ENOENT;
+   goto decused;
+   }
+   page = *pagep;
+   lock_page(page);
+   head = page - (index & (HPAGE_PMD_NR-1)); 

we always fail, because:
+   if (!PageTeam(head)) {
+   error = -ENOENT;
+   goto decused;
+   }


> + if (!shmem_work_still_useful(recovery)) {
> + __free_pages(head, HPAGE_PMD_ORDER);
> + shr_stats(huge_too_late);
> + goto out;
> + }
> + split_page(head, HPAGE_PMD_ORDER);
> + get_page(head);
> + shr_stats(huge_alloced);
> + }


Thanks,
Mika



Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()

2016-03-10 Thread Mika Penttilä


On 18.02.2016 21:47, Tony Luck wrote:
> Make use of the EXTABLE_FAULT exception table entries to write
> a kernel copy routine that doesn't crash the system if it
> encounters a machine check. Prime use case for this is to copy
> from large arrays of non-volatile memory used as storage.
>
> We have to use an unrolled copy loop for now because current
> hardware implementations treat a machine check in "rep mov"
> as fatal. When that is fixed we can simplify.
>
> Signed-off-by: Tony Luck 
> ---
>
> Is this what we want now?  Return type is a "bool". True means
> that we copied OK, false means that it didn't (this is all that
> Dan says that he needs).  Dropped all the complex code to figure
> out how many bytes we didn't copy as Linus says this isn't the
> right place to do this (and besides we should just make "rep mov"
>
> +
> + /* Copy successful. Return true */
> +.L_done_memcpy_trap:
> + xorq %rax, %rax
> + ret
> +ENDPROC(memcpy_mcsafe)
> +
> + .section .fixup, "ax"
> + /* Return false for any failure */
> +.L_memcpy_mcsafe_fail:
> + mov $1, %rax
> + ret
> +
>
But you return 0 == false for success and 1 == true for failure.

--Mika



[BUG] 4.5-rc7 af unix related

2016-03-08 Thread Mika Penttilä

I have got some of these same-looking crashes after 4.4 (maybe before also, not 
sure). Very random, not easy to reproduce.

--Mika



[ 1999.948171] Unable to handle kernel NULL pointer dereference at virtual 
address 00000000
[ 1999.955740] pgd = a8ba4000
[ 1999.958264] [] *pgd=38ca7831, *pte=, *ppte=
[ 1999.964118] Internal error: Oops: 8007 [#1] PREEMPT SMP ARM
[ 1999.969600] Modules linked in: btwilink radio_quantek st_drv
[ 1999.974895] CPU: 0 PID: 335 Comm: compositor Not tainted 4.5.0-rc7 #26
[ 1999.980939] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 1999.986983] task: a8da8fc0 ti: a8ce8000 task.ti: a8ce8000
[ 1999.991983] PC is at 0x0
[ 1999.994343] LR is at __wake_up_common+0x4c/0x80
[ 1999.998540] pc : [<>]lr : [<80058934>]psr: 800e00b3
[ 1999.998540] sp : a8ce9d98  ip : a8a31d28  fp : 0001
[ 2000.009164] r10: 0001  r9 : 0001  r8 : 00c3
[ 2000.014002] r7 : 9c641944  r6 : 0001  r5 : 0001  r4 : 80150007
[ 2000.020044] r3 : 00c3  r2 : 0001  r1 : 0001  r0 : a8a31d28
[ 2000.026088] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment 
user
[ 2000.032934] Control: 10c5387d  Table: 38ba404a  DAC: 0055
[ 2000.038254] Process compositor (pid: 335, stack limit = 0xa8ce8210)
[ 2000.044056] Stack: (0xa8ce9d98 to 0xa8cea000)
[ 2000.048091] 9d80:   
0001 9c641940
[ 2000.055663] 9da0: 0001 00c3 0001 a00e0013 0120 a9d9fc80 
a9d9fbcc 80058f1c
[ 2000.063236] 9dc0: 00c3 806f3630 a9d9f9c0  a9d9f9c0 a8a8e480 
0001 8052a4fc
[ 2000.070808] 9de0: 0120 805d3a6c a8ce9e08 0003 7eba18e0 0120 
a6047680 0001
[ 2000.078380] 9e00: a8ce9f74 a8ce9f6c     
 
[ 2000.085952] 9e20: a8ce9e54 a8ce9f6c 4040  a6047680  
a8ce9e58 
[ 2000.093524] 9e40:  8052724c a8ce9f6c 80527bd0 a8ce9eb4  
a80cd400 587c
[ 2000.101095] 9e60: 806f6b90 80282e2c a80cd400 0001 a80cd808 800e0193 
00b8879c 0120
[ 2000.108668] 9e80: a80cd400 80044708 80b85704 80b85704 0097c6d4 a84d1d5c 
80059020 
[ 2000.116240] 9ea0: a81c3e2c   0003 0001 8005902c 
a81c3e20 80058934
[ 2000.123811] 9ec0:  a81c3e28 600e0193  0003 0001 
a816ecc0 
[ 2000.131383] 9ee0: 0001 80b7e450 0002d6c8 802f67a4 a81c3c10 bb9c838b 
 0001
[ 2000.138955] 9f00: 0001  0125 a816ecc0 0001 600e0013 
a8ce9f2c 800458f4
[ 2000.146527] 9f20: 9c750c00 80102fbc a8ce9f68 a8ce9f64 4040 0128 
8000f7e4 a6047680
[ 2000.154099] 9f40: 7eba18f0 4040 0128 8000f7e4 a8ce8000 8052882c 
 00b8f1a0
[ 2000.161670] 9f60: 0107 0001 fff7   0001 
 
[ 2000.169241] 9f80: a8ce9e80    4040  
0001 00b8a040
[ 2000.176813] 9fa0: 00b88030 8000f640 0001 00b8a040 0024 7eba18f0 
4040 
[ 2000.184384] 9fc0: 0001 00b8a040 00b88030 0128 00b8b028 0002876c 
7eba18f0 
[ 2000.191956] 9fe0:  7eba18d8 76faa4c0 75c0e8f8 800e0010 0024 
3bf59811 3bf59c11
[ 2000.199541] [<80058934>] (__wake_up_common) from [<80058f1c>] 
(__wake_up_sync_key+0x44/0x60)
[ 2000.207363] [<80058f1c>] (__wake_up_sync_key) from [<8052a4fc>] 
(sock_def_readable+0x3c/0x6c)
[ 2000.215267] [<8052a4fc>] (sock_def_readable) from [<805d3a6c>] 
(unix_stream_sendmsg+0x154/0x340)
[ 2000.223414] [<805d3a6c>] (unix_stream_sendmsg) from [<8052724c>] 
(sock_sendmsg+0x14/0x24)
[ 2000.230989] [<8052724c>] (sock_sendmsg) from [<80527bd0>] 
(___sys_sendmsg+0x1d0/0x1d8)
[ 2000.238321] [<80527bd0>] (___sys_sendmsg) from [<8052882c>] 
(__sys_sendmsg+0x3c/0x6c)
[ 2000.245579] [<8052882c>] (__sys_sendmsg) from [<8000f640>] 
(ret_fast_syscall+0x0/0x34)
[ 2000.252911] Code: bad PC value
[ 2000.255744] ---[ end trace a385e81f19607805 ]---


Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä


On 02.03.2016 18:55, Borislav Petkov wrote:
> On Wed, Mar 02, 2016 at 06:38:15PM +0200, Mika Penttilä wrote:
>> I actually looked at it a while too...
>>
>> The
>>   movq stack_start - __START_KERNEL_map, %rsp
>>
>> turns into (objdump disassembly)
>>
>>   mov    0x0,%rsp
>>
>> with relocation
>> 0004 R_X86_64_32S  stack_start+0x8000
>>
>> Now stack_start is at ffffffff81ef3380, so the relocation gives 1ef3380 
>> which would be correct, so why the
>> second subq ?
>>
>> You may explain :)
> Here it is :-)
>
> $ readelf -a vmlinux | grep stack_start
>  70526: ffffffff81cbabf8 0 NOTYPE  GLOBAL DEFAULT   14 stack_start
>
> 0xffffffff81cbabf8 - __START_KERNEL_map =
> 0xffffffff81cbabf8 - 0xffffffff80000000 =
> 0x1cbabf8
>
> (gdb) x/x 0x1cbabf8
> 0x1cbabf8:  0xffffffff81c03ff8
>
> (You don't need gdb for that - you can hexdump or objdump vmlinux).
>
> Now stack_start is:
>
> GLOBAL(stack_start)
> .quad  init_thread_union+THREAD_SIZE-8
>
> which is
>
> $ readelf -a vmlinux | grep init_thread_union
>  82491: ffffffff81c00000 16384 OBJECT  GLOBAL DEFAULT   14 init_thread_union
>
> so init_thread_union+THREAD_SIZE-8 = 0xffffffff81c00000 + 4*4096-8 = 
> 0xffffffff81c03ff8
>
> So you have to subtract __START_KERNEL_map again because it has there a
> virtual address and we haven't enabled paging yet:
>
> 0xffffffff81c03ff8 - 0xffffffff80000000 = 0x1c03ff8.
>
> Makes sense?
>
Ah, I completely missed that stack_start is effectively a pointer to the stack.

Thanks,
Mika



Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä

On 02.03.2016 18:15, Borislav Petkov wrote:
> On Wed, Mar 02, 2016 at 05:55:14PM +0200, Mika Penttilä wrote:
>>> +   /* Setup a stack for verify_cpu */
>>> +   movq    stack_start - __START_KERNEL_map, %rsp
>>> +   subq    $__START_KERNEL_map, %rsp
>>> +
>> You subtract __START_KERNEL_map twice ?
> Yes. That's not very obvious and it took me a while. I probably should
> add a comment.
>
> Want to stare at it a little bit more and try to figure it out or should
> I explain?
>
> :-)
>

I actually looked at it a while too...

The
  movq stack_start - __START_KERNEL_map, %rsp

turns into (objdump disassembly)

  mov    0x0,%rsp

with relocation
0004 R_X86_64_32S  stack_start+0x8000

Now stack_start is at ffffffff81ef3380, so the relocation gives 1ef3380 which 
would be correct, so why the
second subq ?

You may explain :)

--Mika



Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä
On 02.03.2016 13:20, Borislav Petkov wrote:
> From: Borislav Petkov 
>
> 04633df0c43d ("x86/cpu: Call verify_cpu() after having entered long mode too")
> added the call to verify_cpu() for sanitizing CPU configuration.
>
> The latter uses the stack minimally and it can happen that we land in
> startup_64() directly from a 64-bit bootloader. Then we want to use our
> own, known good stack.
>
> Do that.
>
> APs don't need this as the trampoline sets up a stack for them.
>
> Reported-by: Tom Lendacky 
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/kernel/head_64.S | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 22fbf9df61bb..d60a044c2fdc 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -64,6 +64,10 @@ startup_64:
>* tables and then reload them.
>*/
>  
> + /* Setup a stack for verify_cpu */
> + movq    stack_start - __START_KERNEL_map, %rsp
> + subq    $__START_KERNEL_map, %rsp
> +

 
You subtract __START_KERNEL_map twice ?

--Mika



Re: [REGRESSION, bisected] 4.5rc4 sound fsl-soc

2016-02-20 Thread Mika Penttilä
Mark,

I can confirm that the patch:

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/base/regmap/regcache.c?id=3245d460a1eb55b5c3ca31dde7b5c5ac71546edf


solved the audio codec probing issue for me. Without it there's no sound on 
imx6 with 4.5-rc4.

Please apply.

Thanks,
Mika

On 20.02.2016 22:45, Fabio Estevam wrote:
> Hi Mika,
>
> Did it work for you?
>
> If so, please ask in the mailing list for Mark Brown to apply that patch.
>
> Thanks
>
> ____
> From: Mika Penttilä 
> Sent: Monday, February 15, 2016 11:00 AM
> To: Fabio Estevam
> Subject: Re: [REGRESSION, bisected] 4.5rc4 sound fsl-soc
>
> On 02/15/2016 01:30 PM, Fabio Estevam wrote:
>> [Sorry for the top post, can't reply properly from this Inbox]
> Hi,
>
> will test that tomorrow morning EET.
>
>
> thanks,
> Mika
>
>
>> Could you please try applying this commit from linux-next?
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/base/regmap/regcache.c?id=3245d460a1eb55b5c3ca31dde7b5c5ac71546edf
>>
>> Thanks,
>>
>> Fabio Estevam
>>
>> 
>> From: Mika Penttilä 
>> Sent: Monday, February 15, 2016 9:25 AM
>> To: LKML; m...@maciej.szmigiero.name; Fabio Estevam
>> Subject: [REGRESSION, bisected] 4.5rc4 sound fsl-soc
>>
>> Hi,
>>
>> The following commit :
>>
>> 5c408fee254633a5be69505bc86c6b034f871ab4 is the first bad commit
>> commit 5c408fee254633a5be69505bc86c6b034f871ab4
>> Author: Maciej S. Szmigiero 
>> Date:   Mon Jan 18 20:07:44 2016 +0100
>>
>> ASoC: fsl_ssi: remove explicit register defaults
>>
>> There is no guarantee that on fsl_ssi module load
>> SSI registers will have their power-on-reset values.
>>
>> In fact, if the driver is reloaded the values in
>> registers will be whatever they were set to previously.
>>
>> However, the cache needs to be fully populated at probe
>> time to avoid non-atomic allocations during register
>> access.
>>
>> Special case here is imx21-class SSI, since
>> according to datasheet it don't have SACC{ST,EN,DIS}
>> regs.
>>
>> This fixes hard lockup on fsl_ssi module reload,
>> at least in AC'97 mode.
>>
>> Fixes: 05cf237972fe ("ASoC: fsl_ssi: Add driver suspend and resume to 
>> support MEGA Fast")
>> Signed-off-by: Maciej S. Szmigiero 
>> Tested-by: Fabio Estevam 
>> Signed-off-by: Mark Brown 
>>
>>
>> causes regmap init failure when loading the sgtl5000 codec on imx6q, and
>> leads to no audio.
>>
>> With the mentioned patch reverted sound works ok.
>>
>> --Mika
>>



[REGRESSION, bisected] 4.5rc4 sound fsl-soc

2016-02-15 Thread Mika Penttilä
Hi,

The following commit :

5c408fee254633a5be69505bc86c6b034f871ab4 is the first bad commit
commit 5c408fee254633a5be69505bc86c6b034f871ab4
Author: Maciej S. Szmigiero 
Date:   Mon Jan 18 20:07:44 2016 +0100

ASoC: fsl_ssi: remove explicit register defaults

There is no guarantee that on fsl_ssi module load
SSI registers will have their power-on-reset values.

In fact, if the driver is reloaded the values in
registers will be whatever they were set to previously.

However, the cache needs to be fully populated at probe
time to avoid non-atomic allocations during register
access.

Special case here is imx21-class SSI, since
according to datasheet it don't have SACC{ST,EN,DIS}
regs.

This fixes hard lockup on fsl_ssi module reload,
at least in AC'97 mode.

Fixes: 05cf237972fe ("ASoC: fsl_ssi: Add driver suspend and resume to 
support MEGA Fast")
Signed-off-by: Maciej S. Szmigiero 
Tested-by: Fabio Estevam 
Signed-off-by: Mark Brown 


causes regmap init failure when loading the sgtl5000 codec on imx6q, and
leads to no audio.

With the mentioned patch reverted sound works ok.

--Mika


Re: [PATCHv2 2/2] x86: SROP mitigation: implement signal cookies

2016-02-06 Thread Mika Penttilä
Hi,


On 07.02.2016 01:39, Scott Bauer wrote:
> This patch adds SROP mitigation logic to the x86 signal delivery
> and sigreturn code. The cookie is placed in the unused alignment
> space above the saved FP state, if it exists. If there is no FP
> state to save then the cookie is placed in the alignment space above
> the sigframe.
>
> Cc: Abhiram Balasubramanian 
> Signed-off-by: Scott Bauer 
> ---
>  arch/x86/ia32/ia32_signal.c| 63 +---
>  arch/x86/include/asm/fpu/signal.h  |  1 +
>  arch/x86/include/asm/sighandling.h |  5 ++-
>  arch/x86/kernel/fpu/signal.c   | 10 +
>  arch/x86/kernel/signal.c   | 86 
> +-
>  5 files changed, 146 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index 0552884..2751f47 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -68,7 +68,8 @@
>  }
>  
>  static int ia32_restore_sigcontext(struct pt_regs *regs,
> -struct sigcontext_32 __user *sc)
> +struct sigcontext_32 __user *sc,
> +void __user **user_cookie)
>  {
>   unsigned int tmpflags, err = 0;
>   void __user *buf;
> @@ -105,6 +106,16 @@ static int ia32_restore_sigcontext(struct pt_regs *regs,
>   buf = compat_ptr(tmp);
>   } get_user_catch(err);
>  
> + /*
> +  * If there is fp state get cookie from the top of the fp state,
> +  * else get it from the top of the sig frame.
> +  */
> +
> + if (tmp != 0)
> + *user_cookie = compat_ptr(tmp + fpu__getsize(1));
> + else
> + *user_cookie = NULL;
> +
>   err |= fpu__restore_sig(buf, 1);
>  
>   force_iret();
> @@ -117,6 +128,7 @@ asmlinkage long sys32_sigreturn(void)
>   struct pt_regs *regs = current_pt_regs();
>   struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user 
> *)(regs->sp-8);
>   sigset_t set;
> + void __user *user_cookie;
>  
>   if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
>   goto badframe;
> @@ -129,8 +141,15 @@ asmlinkage long sys32_sigreturn(void)
>  
>   set_current_blocked(&set);
>  
> - if (ia32_restore_sigcontext(regs, &frame->sc))
> + if (ia32_restore_sigcontext(regs, &frame->sc, &user_cookie))
> + goto badframe;
> +
> + if (user_cookie == NULL)
> + user_cookie = compat_ptr((regs->sp - 8) + sizeof(*frame));
> +
> + if (verify_clear_sigcookie(user_cookie))
>   goto badframe;
> +
>   return regs->ax;
>  
>  badframe:
> @@ -142,6 +161,7 @@ asmlinkage long sys32_rt_sigreturn(void)
>  {
>   struct pt_regs *regs = current_pt_regs();
>   struct rt_sigframe_ia32 __user *frame;
> + void __user *user_cookie;
>   sigset_t set;
>  
>   frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4);
> @@ -153,7 +173,13 @@ asmlinkage long sys32_rt_sigreturn(void)
>  
>   set_current_blocked(&set);
>  
> - if (ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext))
> + if (ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext, &user_cookie))
> + goto badframe;
> +
> + if (user_cookie == NULL)
> + user_cookie = (void __user *)((regs->sp - 4) + sizeof(*frame));
regs->sp is already restored, so you should use frame instead.

--Mika



Re: [PATCH v2 RESEND 1/2] arm, arm64: change_memory_common with numpages == 0 should be no-op.

2016-01-27 Thread Mika Penttilä

Hi Will,

On 26.01.2016 17:59, Will Deacon wrote:
> Hi Mika,
>
> On Tue, Jan 26, 2016 at 04:59:52PM +0200, mika.pentt...@nextfour.com wrote:
>> From: Mika Penttilä 
>>
>> This makes the caller set_memory_xx() consistent with x86.
>>
>> arm64 part is rebased on 4.5.0-rc1 with Ard's patch
>>  
>> lkml.kernel.org/g/<1453125665-26627-1-git-send-email-ard.biesheu...@linaro.org>
>> applied.
>>
>> Signed-off-by: Mika Penttilä mika.pentt...@nextfour.com
>> Reviewed-by: Laura Abbott 
>> Acked-by: David Rientjes 
>>
>> ---
>>  arch/arm/mm/pageattr.c   | 3 +++
>>  arch/arm64/mm/pageattr.c | 3 +++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm/mm/pageattr.c b/arch/arm/mm/pageattr.c
>> index cf30daf..d19b1ad 100644
>> --- a/arch/arm/mm/pageattr.c
>> +++ b/arch/arm/mm/pageattr.c
>> @@ -49,6 +49,9 @@ static int change_memory_common(unsigned long addr, int 
>> numpages,
>>  WARN_ON_ONCE(1);
>>  }
>>  
>> +if (!numpages)
>> +return 0;
>> +
>>  if (start < MODULES_VADDR || start >= MODULES_END)
>>  return -EINVAL;
>>  
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 1360a02..b582fc2 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -53,6 +53,9 @@ static int change_memory_common(unsigned long addr, int 
>> numpages,
>>  WARN_ON_ONCE(1);
>>  }
>>  
>> +if (!numpages)
>> +return 0;
>> +
> Thanks for this. I can reproduce the failure on my Juno board, so I'd
> like to queue this for 4.5 since it fixes a real issue. I've taken the
> liberty of rebasing the arm64 part to my fixes branch and writing a
> commit message. Does the patch below look ok to you?
>
> Will
>
> --->8
>
> From 57adec866c0440976c96a4b8f5b59fb411b1cacb Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?Mika=20Penttil=C3=A4?= 
> Date: Tue, 26 Jan 2016 15:47:25 +
> Subject: [PATCH] arm64: mm: avoid calling apply_to_page_range on empty range
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Calling apply_to_page_range with an empty range results in a BUG_ON
> from the core code. This can be triggered by trying to load the st_drv
> module with CONFIG_DEBUG_SET_MODULE_RONX enabled:
>
>   kernel BUG at mm/memory.c:1874!
>   Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>   Modules linked in:
>   CPU: 3 PID: 1764 Comm: insmod Not tainted 4.5.0-rc1+ #2
>   Hardware name: ARM Juno development board (r0) (DT)
>   task: ffc9763b8000 ti: ffc975af8000 task.ti: ffc975af8000
>   PC is at apply_to_page_range+0x2cc/0x2d0
>   LR is at change_memory_common+0x80/0x108
>
> This patch fixes the issue by making change_memory_common (called by the
> set_memory_* functions) a NOP when numpages == 0, therefore avoiding the
> erroneous call to apply_to_page_range and bringing us into line with x86
> and s390.
>
> Cc: 
> Reviewed-by: Laura Abbott 
> Acked-by: David Rientjes 
> Signed-off-by: Mika Penttilä 
> Signed-off-by: Will Deacon 
> ---
>  arch/arm64/mm/pageattr.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 3571c7309c5e..cf6240741134 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -57,6 +57,9 @@ static int change_memory_common(unsigned long addr, int 
> numpages,
>   if (end < MODULES_VADDR || end >= MODULES_END)
>   return -EINVAL;
>  
> + if (!numpages)
> + return 0;
> +
>   data.set_mask = set_mask;
>   data.clear_mask = clear_mask;
>  

Yes I'm fine with that,
Thanks!
Mika



[PATCH, REGRESSION v4] mm: make apply_to_page_range more robust

2016-01-21 Thread Mika Penttilä
Recent changes (4.4.0+) in the module loader triggered an oops on ARM:

The module in question is the in-tree module:
drivers/misc/ti-st/st_drv.ko

The BUG is here :

[ 53.638335] [ cut here ]
[ 53.642967] kernel BUG at mm/memory.c:1878!
[ 53.647153] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 53.652987] Modules linked in:
[ 53.656061] CPU: 0 PID: 483 Comm: insmod Not tainted 4.4.0 #3
[ 53.661808] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 53.668338] task: a989d400 ti: 9e6a2000 task.ti: 9e6a2000
[ 53.673751] PC is at apply_to_page_range+0x204/0x224
[ 53.678723] LR is at change_memory_common+0x90/0xdc
[ 53.683604] pc : [<800ca0ec>] lr : [<8001d668>] psr: 600b0013
[ 53.683604] sp : 9e6a3e38 ip : 8001d6b4 fp : 7f0042fc
[ 53.695082] r10:  r9 : 9e6a3e90 r8 : 0080
[ 53.700309] r7 :  r6 : 7f008000 r5 : 7f008000 r4 : 7f008000
[ 53.706837] r3 : 8001d5a4 r2 : 7f008000 r1 : 7f008000 r0 : 80b8d3c0
[ 53.713368] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 53.720504] Control: 10c5387d Table: 2e6b804a DAC: 0055
[ 53.726252] Process insmod (pid: 483, stack limit = 0x9e6a2210)
[ 53.732173] Stack: (0x9e6a3e38 to 0x9e6a4000)
[ 53.736532] 3e20: 7f007fff 7f008000
[ 53.744714] 3e40: 80b8d3c0 80b8d3c0  7f007000 7f00426c 7f008000 
 7f008000
[ 53.752895] 3e60: 7f004140 7f008000  0080   
7f0042fc 8001d668
[ 53.761076] 3e80: 9e6a3e90  8001d6b4 7f00426c 0080  
9e6a3f58 7f004140
[ 53.769257] 3ea0: 7f004240 7f00414c  8008bbe0  7f00 
 
[ 53.777438] 3ec0: a8b12f00 0001cfd4 7f004250 7f004240 80b8159c  
00e0 7f0042fc
[ 53.785619] 3ee0: c183d000 74f8 18fd  0b3c  
 7f002024
[ 53.793800] 3f00: 0002      
 
[ 53.801980] 3f20:     0040  
0003 0001cfd4
[ 53.810161] 3f40: 017b 8000f7e4 9e6a2000  0002 8008c498 
c183d000 74f8
[ 53.818342] 3f60: c1841588 c1841409 c1842950 5000 52a0  
 
[ 53.826523] 3f80: 0023 0024 001a 001e 0016  
 
[ 53.834703] 3fa0: 003e3d60 8000f640   0003 0001cfd4 
 003e3d60
[ 53.842884] 3fc0:   003e3d60 017b 003e3d20 7eabc9d4 
76f2c000 0002
[ 53.851065] 3fe0: 7eabc990 7eabc980 00016320 76e81d00 600b0010 0003 
 
[ 53.859256] [<800ca0ec>] (apply_to_page_range) from [<8001d668>] 
(change_memory_common+0x90/0xdc)
[ 53.868139] [<8001d668>] (change_memory_common) from [<8008bbe0>] 
(load_module+0x194c/0x2068)
[ 53.876671] [<8008bbe0>] (load_module) from [<8008c498>] 
(SyS_finit_module+0x64/0x74)
[ 53.884512] [<8008c498>] (SyS_finit_module) from [<8000f640>] 
(ret_fast_syscall+0x0/0x34)
[ 53.892694] Code: e0834104 eabc e51a1008 eaac (e7f001f2)
[ 53.898792] ---[ end trace fe43fc78ebde29a3 ]---


apply_to_page_range() gets a zero length, triggering:
   
  BUG_ON(addr >= end)

This is a regression and a consequence of changes in module section handling.

The BUG_ON() is not needed here; keeping it would require auditing every call
site, because there may be callers that expect a zero size to succeed, and the
BUG_ON() offers an easy way to DOS the kernel.

With this patch, loading the module emits a warning instead; that can be
handled in arch code with a separate patch.

v2: add more explanation
v3: added even more explanation and stack trace, tagged as regression
v4: change BUG_ON() to WARN_ON() and return -EINVAL

Signed-off-by: Mika Penttilä mika.pentt...@nextfour.com
---

diff --git a/mm/memory.c b/mm/memory.c
index 30991f8..9178ee6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1871,7 +1871,9 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
unsigned long end = addr + size;
int err;
 
-   BUG_ON(addr >= end);
+   if (WARN_ON(addr >= end))
+   return -EINVAL;
+
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);




Re: [PATCH, REGRESSION v3] mm: make apply_to_page_range more robust

2016-01-21 Thread Mika Penttilä
On 01/22/2016 01:12 AM, David Rientjes wrote:
> On Thu, 21 Jan 2016, Mika Penttilä wrote:
> 
>> Recent changes (4.4.0+) in module loader triggered oops on ARM : 
>>
>> The module in question is in-tree module :
>> drivers/misc/ti-st/st_drv.ko
>>
>> The BUG is here :
>>
>> [ 53.638335] [ cut here ]
>> [ 53.642967] kernel BUG at mm/memory.c:1878!
>> [ 53.647153] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
>> [ 53.652987] Modules linked in:
>> [ 53.656061] CPU: 0 PID: 483 Comm: insmod Not tainted 4.4.0 #3
>> [ 53.661808] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
>> [ 53.668338] task: a989d400 ti: 9e6a2000 task.ti: 9e6a2000
>> [ 53.673751] PC is at apply_to_page_range+0x204/0x224
>> [ 53.678723] LR is at change_memory_common+0x90/0xdc
>> [ 53.683604] pc : [<800ca0ec>] lr : [<8001d668>] psr: 600b0013
>> [ 53.683604] sp : 9e6a3e38 ip : 8001d6b4 fp : 7f0042fc
>> [ 53.695082] r10:  r9 : 9e6a3e90 r8 : 0080
>> [ 53.700309] r7 :  r6 : 7f008000 r5 : 7f008000 r4 : 7f008000
>> [ 53.706837] r3 : 8001d5a4 r2 : 7f008000 r1 : 7f008000 r0 : 80b8d3c0
>> [ 53.713368] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
>> [ 53.720504] Control: 10c5387d Table: 2e6b804a DAC: 0055
>> [ 53.726252] Process insmod (pid: 483, stack limit = 0x9e6a2210)
>> [ 53.732173] Stack: (0x9e6a3e38 to 0x9e6a4000)
>> [ 53.736532] 3e20: 7f007fff 7f008000
>> [ 53.744714] 3e40: 80b8d3c0 80b8d3c0  7f007000 7f00426c 7f008000 
>>  7f008000
>> [ 53.752895] 3e60: 7f004140 7f008000  0080   
>> 7f0042fc 8001d668
>> [ 53.761076] 3e80: 9e6a3e90  8001d6b4 7f00426c 0080  
>> 9e6a3f58 7f004140
>> [ 53.769257] 3ea0: 7f004240 7f00414c  8008bbe0  7f00 
>>  
>> [ 53.777438] 3ec0: a8b12f00 0001cfd4 7f004250 7f004240 80b8159c  
>> 00e0 7f0042fc
>> [ 53.785619] 3ee0: c183d000 74f8 18fd  0b3c  
>>  7f002024
>> [ 53.793800] 3f00: 0002      
>>  
>> [ 53.801980] 3f20:     0040  
>> 0003 0001cfd4
>> [ 53.810161] 3f40: 017b 8000f7e4 9e6a2000  0002 8008c498 
>> c183d000 74f8
>> [ 53.818342] 3f60: c1841588 c1841409 c1842950 5000 52a0  
>>  
>> [ 53.826523] 3f80: 0023 0024 001a 001e 0016  
>>  
>> [ 53.834703] 3fa0: 003e3d60 8000f640   0003 0001cfd4 
>>  003e3d60
>> [ 53.842884] 3fc0:   003e3d60 017b 003e3d20 7eabc9d4 
>> 76f2c000 0002
>> [ 53.851065] 3fe0: 7eabc990 7eabc980 00016320 76e81d00 600b0010 0003 
>>  
>> [ 53.859256] [<800ca0ec>] (apply_to_page_range) from [<8001d668>] 
>> (change_memory_common+0x90/0xdc)
>> [ 53.868139] [<8001d668>] (change_memory_common) from [<8008bbe0>] 
>> (load_module+0x194c/0x2068)
>> [ 53.876671] [<8008bbe0>] (load_module) from [<8008c498>] 
>> (SyS_finit_module+0x64/0x74)
>> [ 53.884512] [<8008c498>] (SyS_finit_module) from [<8000f640>] 
>> (ret_fast_syscall+0x0/0x34)
>> [ 53.892694] Code: e0834104 eabc e51a1008 eaac (e7f001f2)
>> [ 53.898792] ---[ end trace fe43fc78ebde29a3 ]---
>>
> 
> NACK to your patch as it is just covering up buggy code silently.  The 
> problem needs to be addressed in change_memory_common() to return if there 
> is no size to change (numpages == 0).  It's a two line fix to that 
> function.
> 

That surely would make this particular problem disappear on ARM. But we 
probably get similar behavior on other arches too (arm64 at least).

Also, you are suggesting it is OK to call set_memory_xx() with numpages==0, but 
a bug to call apply_to_page_range() with size==0? I think these are similar 
APIs, each taking a size-type argument. Functions taking a range [start, end) 
are a different story; calling them with start==end should be illegal.

Also, taking a quick look at the call sites of apply_to_page_range(), not all 
of them check for !size (some Xen code, for instance), so they could trigger a 
kernel BUG (potentially from user code). So something that was meant to help 
find buggy code could be turned into an easy way to DOS.

Thanks,
--Mika






[PATCH, REGRESSION v3] mm: make apply_to_page_range more robust

2016-01-20 Thread Mika Penttilä
Recent changes (4.4.0+) in the module loader triggered an oops on ARM:

The module in question is the in-tree module:
drivers/misc/ti-st/st_drv.ko

The BUG is here :

[ 53.638335] [ cut here ]
[ 53.642967] kernel BUG at mm/memory.c:1878!
[ 53.647153] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 53.652987] Modules linked in:
[ 53.656061] CPU: 0 PID: 483 Comm: insmod Not tainted 4.4.0 #3
[ 53.661808] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 53.668338] task: a989d400 ti: 9e6a2000 task.ti: 9e6a2000
[ 53.673751] PC is at apply_to_page_range+0x204/0x224
[ 53.678723] LR is at change_memory_common+0x90/0xdc
[ 53.683604] pc : [<800ca0ec>] lr : [<8001d668>] psr: 600b0013
[ 53.683604] sp : 9e6a3e38 ip : 8001d6b4 fp : 7f0042fc
[ 53.695082] r10:  r9 : 9e6a3e90 r8 : 0080
[ 53.700309] r7 :  r6 : 7f008000 r5 : 7f008000 r4 : 7f008000
[ 53.706837] r3 : 8001d5a4 r2 : 7f008000 r1 : 7f008000 r0 : 80b8d3c0
[ 53.713368] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 53.720504] Control: 10c5387d Table: 2e6b804a DAC: 0055
[ 53.726252] Process insmod (pid: 483, stack limit = 0x9e6a2210)
[ 53.732173] Stack: (0x9e6a3e38 to 0x9e6a4000)
[ 53.736532] 3e20: 7f007fff 7f008000
[ 53.744714] 3e40: 80b8d3c0 80b8d3c0  7f007000 7f00426c 7f008000 
 7f008000
[ 53.752895] 3e60: 7f004140 7f008000  0080   
7f0042fc 8001d668
[ 53.761076] 3e80: 9e6a3e90  8001d6b4 7f00426c 0080  
9e6a3f58 7f004140
[ 53.769257] 3ea0: 7f004240 7f00414c  8008bbe0  7f00 
 
[ 53.777438] 3ec0: a8b12f00 0001cfd4 7f004250 7f004240 80b8159c  
00e0 7f0042fc
[ 53.785619] 3ee0: c183d000 74f8 18fd  0b3c  
 7f002024
[ 53.793800] 3f00: 0002      
 
[ 53.801980] 3f20:     0040  
0003 0001cfd4
[ 53.810161] 3f40: 017b 8000f7e4 9e6a2000  0002 8008c498 
c183d000 74f8
[ 53.818342] 3f60: c1841588 c1841409 c1842950 5000 52a0  
 
[ 53.826523] 3f80: 0023 0024 001a 001e 0016  
 
[ 53.834703] 3fa0: 003e3d60 8000f640   0003 0001cfd4 
 003e3d60
[ 53.842884] 3fc0:   003e3d60 017b 003e3d20 7eabc9d4 
76f2c000 0002
[ 53.851065] 3fe0: 7eabc990 7eabc980 00016320 76e81d00 600b0010 0003 
 
[ 53.859256] [<800ca0ec>] (apply_to_page_range) from [<8001d668>] 
(change_memory_common+0x90/0xdc)
[ 53.868139] [<8001d668>] (change_memory_common) from [<8008bbe0>] 
(load_module+0x194c/0x2068)
[ 53.876671] [<8008bbe0>] (load_module) from [<8008c498>] 
(SyS_finit_module+0x64/0x74)
[ 53.884512] [<8008c498>] (SyS_finit_module) from [<8000f640>] 
(ret_fast_syscall+0x0/0x34)
[ 53.892694] Code: e0834104 eabc e51a1008 eaac (e7f001f2)
[ 53.898792] ---[ end trace fe43fc78ebde29a3 ]---


The call path is SyS_finit_module()->set_memory_xx()->apply_to_page_range(),
and apply_to_page_range() gets a zero length, triggering:
   
  BUG_ON(addr >= end)

This is a regression and a consequence of changes in module section handling 
(Rusty CC:ed).
It may be triggerable only with certain modules and/or gcc versions. 

Plus, I think the spirit of the BUG_ON is to catch overflows, not to bug on
legitimate zero-length callers. So whatever the reason for this triggering,
some day there will be another caller with zero length. And, as Rusty
mentioned, he expected a zero-length range to do nothing, which is what
intuition says.

Fix by letting a call with zero size succeed.

v2: add more explanation
v3: added even more explanation and stack trace, tagged as regression

Signed-off-by: Mika Penttilä mika.pentt...@nextfour.com
Reviewed-by: Pekka Enberg 
---

diff --git a/mm/memory.c b/mm/memory.c
index c387430..c3d1a2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1884,6 +1884,9 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
unsigned long end = addr + size;
int err;
 
+   if (!size)
+   return 0;
+
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
do {



Re: 4.4-rc crash (af_unix)

2016-01-03 Thread Mika Penttilä
Just got another one with rc8 (random, not easily reproducible):

[ 1254.780923] Unable to handle kernel NULL pointer dereference at
virtual address 00000000
[ 1254.789308] pgd = a920
[ 1254.789320] [] *pgd=39120831, *pte=, *ppte=
[ 1254.789331] Internal error: Oops: 817 [#1] PREEMPT SMP ARM
[ 1254.789340] Modules linked in: btwilink st_drv
[ 1254.789352] CPU: 3 PID: 319 Comm: compositor Tainted: GW
  4.4.0-rc8 #8
[ 1254.789361] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 1254.789365] task: a888d580 ti: a90a4000 task.ti: a90a4000
[ 1254.789376] PC is at skb_queue_tail+0x24/0x48
[ 1254.789385] LR is at _raw_spin_lock_irqsave+0x18/0x5c
[ 1254.789390] pc : [<8051ded8>]lr : [<806dff80>]psr: 600d0093
[ 1254.789390] sp : a90a5e40  ip : 000a  fp : a911288c
[ 1254.789394] r10: a9111b80  r9 : 003e  r8 : 0001
[ 1254.789397] r7 : a9112704  r6 : a9112710  r5 : a9112704  r4 : a9112704
[ 1254.789400] r3 :   r2 :   r1 :   r0 : 600d0013
[ 1254.789406] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
Segment user
[ 1254.789411] Control: 10c5387d  Table: 3920004a  DAC: 0055
[ 1254.789414] Process compositor (pid: 319, stack limit = 0xa90a4210)
[ 1254.789418] Stack: (0xa90a5e40 to 0xa90a6000)
[ 1254.789425] 5e40: 003e  a9112680 805c1034 a90a5e70
0003  003e
[ 1254.789432] 5e60: a63bac00 0001 a90a5ec4 a90a5ebc 
  
[ 1254.789438] 5e80:   013f a90a5f14 a8e49e40
0001  
[ 1254.789445] 5ea0:   7e90b4bc 8051622c a90a5f14
805162c4  
[ 1254.789450] 5ec0:  0001   a90a5f14
  
[ 1254.789457] 5ee0:  a90a5f28 a90a5efc a8e49e40 
a90a5f88 a90a5f88 800e37b0
[ 1254.789475] 5f00: 003e 806e01f4  7e908fb0 003e
0001  003e
[ 1254.789485] 5f20: a90a5f0c 0001 a8e49e40  
  
[ 1254.789492] 5f40:   cbb1c6a8 a8e49e40 003e
7e908fb0 a90a5f88 8000f6a4
[ 1254.789498] 5f60: a90a4000 800e3f00 7e908fb0 003e a8e49e40
a8e49e41 7e908fb0 003e
[ 1254.789505] 5f80: 8000f6a4 800e4718   04e7
003e 7e908fb0 75b41cc0
[ 1254.789511] 5fa0: 0004 8000f500 003e 7e908fb0 0002
7e908fb0 003e 
[ 1254.789517] 5fc0: 003e 7e908fb0 75b41cc0 0004 003e
 0002 7e90b4bc
[ 1254.789524] 5fe0:  7e908e88 73e3f4c0 75ad0d34 800d0010
0002  
[ 1254.789553] [<8051ded8>] (skb_queue_tail) from [<805c1034>]
(unix_stream_sendmsg+0x134/0x340)
[ 1254.789567] [<805c1034>] (unix_stream_sendmsg) from [<8051622c>]
(sock_sendmsg+0x14/0x24)
[ 1254.789577] [<8051622c>] (sock_sendmsg) from [<805162c4>]
(sock_write_iter+0x88/0xbc)
[ 1254.789594] [<805162c4>] (sock_write_iter) from [<800e37b0>]
(__vfs_write+0xac/0xdc)
[ 1254.789605] [<800e37b0>] (__vfs_write) from [<800e3f00>]
(vfs_write+0x90/0x164)
[ 1254.789614] [<800e3f00>] (vfs_write) from [<800e4718>]
(SyS_write+0x44/0x9c)
[ 1254.789630] [<800e4718>] (SyS_write) from [<8000f500>]
(ret_fast_syscall+0x0/0x34)
[ 1254.789639] Code: eb070826 e5943004 e5854000 e5853004 (e5835000)
[ 1254.789644] ---[ end trace d7af6297ad511a4e ]---


On 12/22/2015 02:51 AM, Cong Wang wrote:
> (Cc'ing netdev and Rainer)
> 
> On Thu, Dec 17, 2015 at 9:12 PM, Mika Penttilä
>  wrote:
>> Still something with af_unix and/or wake code on rc5 :
>>
>>
>> [34971.300210] Unable to handle kernel paging request at virtual address
>> 56ac56ac
>>
>> [34971.307455] pgd = a8c3
>>
>> [34971.310164] [56ac56ac] *pgd=
>>
>> [34971.313761] Internal error: Oops: 8005 [#1] PREEMPT SMP ARM
>>
>> [34971.319683] Modules linked in: btwilink st_drv
>>
>> [34971.324174] CPU: 1 PID: 333 Comm: compositor Not tainted 4.4.0-rc5 #1
>>
>> [34971.330620] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
>>
>> [34971.337152] task: a8c71c80 ti: a8aea000 task.ti: a8aea000
>>
>> [34971.342554] PC is at 0x56ac56ac
>>
>> [34971.345710] LR is at __wake_up_common+0x4c/0x80
>>
>> [34971.350246] pc : [<56ac56ac>]lr : [<800585e4>]psr: 200f0093
>>
>> [34971.350246] sp : a8aebd20  ip : a8ea56bc  fp : 0001
>>
>> [34971.361725] r10: 0001  r9 : 0001  r8 : 0304
>>
>> [34971.366952] r7 : a8ea5744  r6 : 8023a9e4  r5 : 56ac56ac  r4 : a8c95d28
>>
>> [34971.373480] r3 : 0304  r2 : 0001  r1 : 0001  r0 : a8ea56bc
>>
>> [34971.380010] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
>> Segment user
>>
>> [34971.387234] Co

4.4-rc5 crash (af_unix)

2015-12-17 Thread Mika Penttilä
Still something with af_unix and/or wake code on rc5 :


[34971.300210] Unable to handle kernel paging request at virtual address
56ac56ac

[34971.307455] pgd = a8c3

[34971.310164] [56ac56ac] *pgd=

[34971.313761] Internal error: Oops: 8005 [#1] PREEMPT SMP ARM

[34971.319683] Modules linked in: btwilink st_drv

[34971.324174] CPU: 1 PID: 333 Comm: compositor Not tainted 4.4.0-rc5 #1

[34971.330620] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)

[34971.337152] task: a8c71c80 ti: a8aea000 task.ti: a8aea000

[34971.342554] PC is at 0x56ac56ac

[34971.345710] LR is at __wake_up_common+0x4c/0x80

[34971.350246] pc : [<56ac56ac>]lr : [<800585e4>]psr: 200f0093

[34971.350246] sp : a8aebd20  ip : a8ea56bc  fp : 0001

[34971.361725] r10: 0001  r9 : 0001  r8 : 0304

[34971.366952] r7 : a8ea5744  r6 : 8023a9e4  r5 : 56ac56ac  r4 : a8c95d28

[34971.373480] r3 : 0304  r2 : 0001  r1 : 0001  r0 : a8ea56bc

[34971.380010] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
Segment user

[34971.387234] Control: 10c5387d  Table: 38c3004a  DAC: 0055

[34971.392982] Process compositor (pid: 333, stack limit = 0xa8aea210)

[34971.399250] Stack: (0xa8aebd20 to 0xa8aec000)

[34971.403612] bd20: 0001 a8ea5740 0001 0304 0001
a00f0013 0098 a8aebe4c

[34971.411793] bd40:  80058bc8 0304 a8aebe78 a8db42c0
a8db42c0 a8db4000 0f68

[34971.419974] bd60: a8db4084 805bd7c0 a8db4394 805196d8 a9e37600
 a8db4000 805bd270

[34971.428155] bd80: a8aebd94    
 a9e37600 8051a810

[34971.436336] bda0: a9e37600 8051a970 a9e37600 8051aa44 a9e37600
805be498 7ec113e8 8025f2e4

[34971.444517] bdc0: a8f35044 0098 a8db420c 0098 a653e780
0001  

[34971.452697] bde0: a8db41e4  a8aebdf8  a8aea000
800e4928  

[34971.460878] be00:     7ec115a8
80264d8c a8aebe78 a8aebe24

[34971.469059] be20: 00a5cda4 80513658 a8aebf6c 4040 7ec115b8
7ec115d4 a8aebeb8 a653e780

[34971.477240] be40: 00c50388 805be604 00c87240 805bd130 a653e780
a8aebf6c  1000

[34971.485420] be60: 4040  00c50388 8051515c 
 00c506fc 0c8c

[34971.493601] be80: 00c50388 0374 014d  0004
0320 00a70664 00983dcc

[34971.501782] bea0: 758ce8e0 00983dcc 758ce8e0 758ce8ec 00a70664
758ce8ec  80513a8c

[34971.509963] bec0: a8b6df8c a8aebf10 a8b6df8c a8aebf10 7ec116d8
8011da50 00a70664 

[34971.518143] bee0: 0001 600f0013 a8aebefc 8004559c a8c07500
600f0013 a8c07534 806de1a0

[34971.526324] bf00: a8c07534 806de414  8011df1c a8aebf10
a8aebf10 a8aebf2c 0020

[34971.534505] bf20: a895ccc0 800fd364 a8aebf68 a8aebf64 4040
0129 a653e780 7ec115b8

[34971.542686] bf40: 4040 0129 8000f6a4 a8aea000 
80515eb0  

[34971.550867] bf60: 0020 0001 fff7  
 0098 0f68

[34971.559047] bf80: a8aebe78 0002 7ec115d4 007c 4000
 0040 001c

[34971.567227] bfa0: 7ec115b8 8000f500 0040 001c 001c
7ec115b8 4040 

[34971.575409] bfc0: 0040 001c 7ec115b8 0129 0006
7ec115b8 76145d68 00c50388

[34971.583589] bfe0:  7ec11588 73d6f4c0 75bef794 800f0010
001c 3bf5e861 3bf5ec61

[34971.591782] [<800585e4>] (__wake_up_common) from [<80058bc8>] (__wake_up_sync_key+0x44/0x60)
[34971.600235] [<80058bc8>] (__wake_up_sync_key) from [<805bd7c0>] (unix_write_space+0x58/0x88)
[34971.608686] [<805bd7c0>] (unix_write_space) from [<805196d8>] (sock_wfree+0x78/0x80)
[34971.616437] [<805196d8>] (sock_wfree) from [<805bd270>] (unix_destruct_scm+0x64/0x6c)
[34971.624276] [<805bd270>] (unix_destruct_scm) from [<8051a810>] (skb_release_head_state+0x84/0xec)
[34971.633154] [<8051a810>] (skb_release_head_state) from [<8051a970>] (skb_release_all+0xc/0x24)
[34971.641772] [<8051a970>] (skb_release_all) from [<8051aa44>] (consume_skb+0x24/0x60)
[34971.649523] [<8051aa44>] (consume_skb) from [<805be498>] (unix_stream_read_generic+0x71c/0x7d0)
[34971.658228] [<805be498>] (unix_stream_read_generic) from [<805be604>] (unix_stream_recvmsg+0x38/0x40)
[34971.667453] [<805be604>] (unix_stream_recvmsg) from [<8051515c>] (___sys_recvmsg+0x94/0x12c)
[34971.675897] [<8051515c>] (___sys_recvmsg) from [<80515eb0>] (__sys_recvmsg+0x3c/0x6c)
[34971.683738] [<80515eb0>] (__sys_recvmsg) from [<8000f500>] (ret_fast_syscall+0x0/0x34)

[34971.691659] Code: bad PC value

[34971.694718] ---[ end trace b54a6d4b7a89f212 ]---

[34971.699339] Kernel panic - not syncing: Fatal exception

[34971.704572] CPU2: stopping

[34971.707292] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G  D
4.4.0-rc5 #1

[34971.714691] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)

[34971.721240] [<80016be4>] (unwind_backtrace) from [<80012b68>]
(show_stack+0x10/0x14)

[34971.728997] [<80012b68>] 

4.4-rc4 crash net/80211 related

2015-12-16 Thread Mika Penttilä
Hi,

Triggered this with rc4, but the relevant parts are the same in rc5:

The offending line is:

(gdb) list *(ieee80211_scan_rx+0x158)
0xf68 is in ieee80211_scan_rx (net/mac80211/scan.c:205).
200             if (!(sdata1 &&
201                   (ether_addr_equal(mgmt->da, sdata1->vif.addr) ||
202                    scan_req->flags & NL80211_SCAN_FLAG_RANDOM_ADDR)) &&
203                 !(sdata2 &&
204                   (ether_addr_equal(mgmt->da, sdata2->vif.addr) ||
205                    sched_scan_req->flags & NL80211_SCAN_FLAG_RANDOM_ADDR)))
206                     return;
207
208             elements = mgmt->u.probe_resp.variable;
209             baselen = offsetof(struct ieee80211_mgmt, u.probe_resp.variable);
(gdb)

i.e. sched_scan_req->flags which means sched_scan_req is NULL.

It is not easy to trigger (it has been running for days), so it's hard
to say whether it still triggers with rc5.

relevant hw info : i.mx6 + ti wl1835 wlan

--

[471559.635143] Unable to handle kernel NULL pointer dereference at
virtual address 0018

Internal error: Oops: 17 [#1] PREEMPT SMP ARM

CPU: 1 PID: 24194 Comm: kworker/u8:1 Tainted: GW   4.4.0-rc4 #1

Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)

[471559.717313] PC is at ieee80211_scan_rx+0x158/0x168

LR is at 0x2f04a578

[471559.729744] pc : [<806a0bb0>]lr : [<2f04a578>]psr: a0030113

[471559.729744] sp : a8aa7da0  ip : 0066  fp : a800ac00

[471559.742599] r10: a89e6a00  r9 :   r8 : 

[471559.747913] r7 : a8b00440  r6 : a87764c0  r5 : 647b  r4 : a8b00440

[471559.754529] r3 : d0fbdb87  r2 : 9b84  r1 : a8cc76c0  r0 : a84d43e0

[471559.761146] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
Segment kernel

[471559.768544] Control: 10c5387d  Table: 1b48804a  DAC: 0055

[471559.774379] Process kworker/u8:1 (pid: 24194, stack limit = 0xa8aa6210)

[471559.781081] Stack: (0xa8aa7da0 to 0xa8aa8000)

[471559.785531] 7da0: 0006f631  afb50401 ab712080 a8aa7dfc
806dc340 ab712080 80042018

[471559.793799] 7dc0:  8a14a000 0002 8003e980 a82d5f48
a82d5f50 a82d5f48 800500d4

[471559.802066] 7de0:   5129e9f0 0001ace1 0001
 a8aa7e3c 806d870c

[471559.810334] 7e00:   a8aa7e1c 800455e4 9c119808
ab7120c0 625e a82d5f00

[471559.818601] 7e20: ab7120c0 a82d5f48 80b6170c 0002 0001
 ab712080 80053738

[471559.826868] 7e40: 9c119808 ab7120c0 1259  1259
 0001 a84d43e0

[471559.835136] 7e60: 0050 a8cc76c0 a8b00440  
806b6ee8 80b5c080 80b5c080

[471559.843403] 7e80: 0004  02953182  a8cc76c0
a84d43e0  

[471559.851670] 7ea0:   0010 0010 
 a800ac00 a84d4c40

[471559.859938] 7ec0: a8cc76c0 a84d43e0 a84d4e00 803b37a4 
a89e6a00 a800ac00 803b37c0

[471559.868205] 7ee0: a84d4ecc a84d4c40 a800ac00 a83c2f00 
803b383c a89e6a00 a84d4ecc

[471559.876473] 7f00: a800ac00 800388ac a800ac14 a800ac14 0001
a800ac00 a89e6a18 a800ac14

[471559.884740] 7f20: a8aa6000 0088 80b9a73b a89e6a00 a800ac00
80038b1c 80b60100 a800ad64

[471559.893007] 7f40: 80038ad0  a8a96f40 a89e6a00 80038ad0
  

[471559.901274] 7f60:  8003dd78 fff5  
a89e6a00  

[471559.909542] 7f80: a8aa7f80 a8aa7f80   a8aa7f90
a8aa7f90 a8aa7fac a8a96f40

[471559.917809] 7fa0: 8003dc90   8000f5a8 
  

[471559.926076] 7fc0:     
  

[471559.934343] 7fe0:     0013
  

[471559.942623] [<806a0bb0>] (ieee80211_scan_rx) from [<806b6ee8>] (ieee80211_rx_napi+0x680/0x7a0)
[471559.951330] [<806b6ee8>] (ieee80211_rx_napi) from [<803b37c0>] (wl1271_flush_deferred_work+0x30/0x98)
[471559.960643] [<803b37c0>] (wl1271_flush_deferred_work) from [<803b383c>] (wl1271_netstack_work+0x14/0x24)
[471559.970216] [<803b383c>] (wl1271_netstack_work) from [<800388ac>] (process_one_work+0x120/0x344)
[471559.979093] [<800388ac>] (process_one_work) from [<80038b1c>] (worker_thread+0x4c/0x490)
[471559.987279] [<80038b1c>] (worker_thread) from [<8003dd78>] (kthread+0xe8/0x104)
[471559.994686] [<8003dd78>] (kthread) from [<8000f5a8>] (ret_from_fork+0x14/0x2c)

[471560.002000] Code: e0222005 e023300e e1923003 0ac0 (e5993018)

[471560.008219] ---[ end trace eb084eff56d23079 ]---

[471560.012947] Kernel panic - not syncing: Fatal exception in interrupt

[471560.012954] CPU0: stopping

[471560.012962] CPU: 0 PID: 24339 Comm: compositor Tainted: G  D W
 4.4.0-rc4 #1

[471560.012965] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)

[471560.012988] [<80016be4>] (unwind_backtrace) from [<80

Re: [PATCH v5 01/12] mm: support madvise(MADV_FREE)

2015-11-30 Thread Mika Penttilä
> +  * If pmd isn't transhuge but the page is THP and
> +  * is owned by only this process, split it and
> +  * deactivate all pages.
> +  */
> + if (PageTransCompound(page)) {
> + if (page_mapcount(page) != 1)
> + goto out;
> + get_page(page);
> + if (!trylock_page(page)) {
> + put_page(page);
> + goto out;
> + }
> + pte_unmap_unlock(orig_pte, ptl);
> + if (split_huge_page(page)) {
> + unlock_page(page);
> + put_page(page);
> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> + goto out;
> + }
> + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + pte--;
> + addr -= PAGE_SIZE;
> + continue;
> + }

looks like this leaks the page reference if split_huge_page() is
successful (returns zero).

--Mika

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


4.4-rc2 crash: block related

2015-11-24 Thread Mika Penttilä

Hi,

With recent block layer pull i see a 100% repeatable crash on boot while
mounting roots (ext4 partition on eMMC, with cfq io scheduler).

---

5.674294] Unable to handle kernel NULL pointer dereference at virtual
address 0004
[5.682399] pgd = a8ca4000
[5.685113] [0004] *pgd=38a5e831, *pte=, *ppte=
[5.691428] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[5.696830] Modules linked in: st_drv
[5.700533] CPU: 1 PID: 221 Comm: mount Not tainted 4.4.0-rc2 #49
[5.706631] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[5.713163] task: a88e2ac0 ti: a88d4000 task.ti: a88d4000
[5.718578] PC is at cfq_init_prio_data+0x8/0xec
[5.723206] LR is at cfq_insert_request+0x28/0x4f0
[5.723211] pc : [<8024bf9c>]lr : [<8024e768>]psr: 600d0093
[5.723211] sp : a88d5bc0  ip :   fp : a8ab5400
[5.723219] r10: 0001  r9 : a617f4c0  r8 : 80b6359c
[5.723223] r7 : 80b62100  r6 : a873e200  r5 : a885ac30  r4 : 
[5.723226] r3 : a88d5bc0  r2 : a89106c0  r1 :   r0 : 
[5.723232] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
Segment user
[5.723235] Control: 10c5387d  Table: 38ca404a  DAC: 0055
[5.723239] Process mount (pid: 221, stack limit = 0xa88d4210)
[5.723242] Stack: (0xa88d5bc0 to 0xa88d6000)
[5.723251] 5bc0:  a885ac30 a873e200 8024e768 a87c
a885ac30 0005 a88d4000
[5.723257] 5be0: 80b6359c a617f4c0 0001 8023817c 
a89106c0 a885ac30 
[5.723263] 5c00: a89106c0  a87c 8023654c 
 a8ab5400 a89106c0
[5.723269] 5c20: 0008 1411 f000 80236680 a88d5c44
a87c0168 a617f4c0 a81a45c0
[5.723276] 5c40: 0001 0240 80b6359c a617f4c0 0001
80231b04 a00d0013 000f
[5.723282] 5c60: a617f4c0 a89106c0 1411 f000 80b6359c
a617f4c0 0001 80110950
[5.723288] 5c80: a617f4c0 0001 1411 80b6370c 80b6359c
80112490 a8b35c00 
[5.723295] 5ca0: 80b63658 801602e4 0205a9d9  a62b4738
a8ab5400 a8b35c00 a8b36000
[5.723301] 5cc0:   a8b36000 a8ab5400 a88d5e8c
80162644 a62621e8 800f7004
[5.723307] 5ce0: a88d5e8c 806dd610 a62621e8 a617f4c0 a8b35c00
a8b36000 0001 80165480
[5.723313] 5d00:   a88d5d58 a88d5d50 a87f2a90
a88d5d54 01897158 800ec9dc
[5.723319] 5d20:  0002  a88d5dc8 0001
a88d5dc0 0001 a6023000
[5.723325] 5d40: a88d5d90 a88d5d88 a8887f10 a88d5d8c 01897158
800ec8d0 a8887f10 0004
[5.723332] 5d60:  a88d5dc0 a88d5dc0 a6029110 0001
a80fd000 a88d5d8c a8744800
[5.723338] 5d80:   0001 0980 b67c
 0001 800bf478
[5.723343] 5da0: a615e490 0001 006c a8102db0 
0001 000a 0001
[5.723349] 5dc0:     002b
a82ec200 80b6e735 0004
[5.723355] 5de0:   a8ab5400  a8b36264
 001013d0 
[5.723361] 5e00: 0001  a8b36000  1000
a8b35e88  
[5.723366] 5e20:   a8ab5594  80be3e54
  
[5.723372] 5e40:  4003  80b70288 01897158
8025e5bc a6298e00 a88d5e6c
[5.723378] 5e60: 3b9aca00 0009 a6298e00 a6298e74 a8b35c00
a6298e00 0083 
[5.723384] 5e80:  80b70288 01897158 800e6324 a6298e00
800c0050 62636d6d 70306b6c
[5.723391] 5ea0: a835 800d0013 0004 80be3e2c a8dca80e
 0001 8015f030
[5.723397] 5ec0: a8dca800  80b70288 80b70288 80b6aeb0
8015f048 801636d8 a8ab1a48
[5.723403] 5ee0: 01897158 800e6f14  a8dca800 a8ab19c0
a8dca800  80b70288
[5.723409] 5f00:  800febbc  0020 
a8dca800 a8dca840 80101a14
[5.723416] 5f20:  80b60be0 a8001f00 024000c0 88c5
800df23c 007f a8dca800
[5.723421] 5f40: a87f2a90 a6138cc0 c0ed a8dca800 000f
 000f a8dca840
[5.723428] 5f60: a8dca800  018971a0 c0ed a88d4000
 01897158 801027e4
[5.723434] 5f80:  28936a1b 563c86d0  
76f35688 c0ed 0015
[5.723440] 5fa0: 8000f6a4 8000f500  76f35688 01897188
018971a0 01897158 c0ed
[5.723447] 5fc0:  76f35688 c0ed 0015 018971a0
01897188 76f36dac 01897158
[5.723453] 5fe0: 76e56dc0 7eedcc30 76f09e70 76e56dd0 600d0010
01897188  
[5.723473] [<8024bf9c>] (cfq_init_prio_data) from [<8024e768>] (cfq_insert_request+0x28/0x4f0)
[5.723484] [<8024e768>] (cfq_insert_request) from [<8023817c>] (blk_queue_bio+0x254/0x260)
[5.723500] [<8023817c>] (blk_queue_bio) from [<8023654c>] (generic_make_request+0xcc/0x17c)
[5.723510] [<8023654c>] (generic_make_request) from [<80236680>]
[5.723527] [<80236680>] (submit_bio) from [<80110950>] (submit_bh_wbc+0x10c/0x144)
[5.723537] [<80110950>] (submit_bh_wbc) 

Deferred struct page initialization issue vs memblocks

2015-09-28 Thread Mika Penttilä
deferred_init_memmap() uses for_each_mem_pfn_range() which (in x86)
uses memblocks. Because of CONFIG_ARCH_DISCARD_MEMBLOCK, the memblock
infos have already been freed to page allocator (in
free_low_memory_core_early()), which happens before
deferred_init_memmap().

Maybe the fix is not to allow DISCARD_MEMBLOCK in deferred case, or
discard memblock infos later.

Thanks,

Mika


[no subject]

2015-09-01 Thread Mika Penttilä
This one causes an imx6q with the debug UART connected to hit
"schedule while atomic" endlessly:


9e7b399d6528eac33a6fbfceb2b92af209c3454d is the first bad commit
commit 9e7b399d6528eac33a6fbfceb2b92af209c3454d
Author: Eduardo Valentin 
Date:   Tue Aug 11 10:21:20 2015 -0700

serial: imx: remove unbalanced clk_prepare

The current code attempts to prepare clk_per and clk_ipg
before using the device. However, the result is an extra
prepare call on each clock. Here is the output of uart
clocks (only uart enabled and used as console):

$  grep uart /sys/kernel/debug/clk/clk_summary
 uart_serial   128000  0 0
   uart   126600  0 0

This patch balances the calls of prepares. The result is:

$  grep uart /sys/kernel/debug/clk/clk_summary
 uart_serial   118000  0 0
   uart   116600  0 0

Cc: Fabio Estevam 
Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
Cc: linux-ser...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Eduardo Valentin 
Signed-off-by: Greg Kroah-Hartman 


Re: [PATCH 3/5] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.

2015-07-07 Thread Mika Penttilä
I think you forgot to reserve CPU 0 for the BSP in the cpuid mask.

--Mika

On Tue, Jul 7, 2015 at 12:30 PM, Tang Chen  wrote:
> From: Gu Zheng 
>
> In this patch, we introduce a new static array named apicid_to_cpuid[],
> which is large enough to store info for all possible cpus.
>
> And then, we modify the cpuid calculation. In generic_processor_info(),
> it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
> mapping changes with node hotplug.
>
> After this patch, we find the next unused cpuid, map it to an apicid,
> and store the mapping in apicid_to_cpuid[], so that cpuid <-> apicid
> mapping will be persistent.
>
> And finally we will use this array to make cpuid <-> nodeid persistent.
>
> cpuid <-> apicid mapping is established at local apic registeration time.
> But non-present or disabled cpus are ignored.
>
> In this patch, we establish all possible cpuid <-> apicid mapping when
> registering local apic.
>
>
> Signed-off-by: Gu Zheng 
> Signed-off-by: Tang Chen 
> ---
>  arch/x86/include/asm/mpspec.h |  1 +
>  arch/x86/kernel/acpi/boot.c   |  6 ++
>  arch/x86/kernel/apic/apic.c   | 47 
> ---
>  3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
> index b07233b..db902d8 100644
> --- a/arch/x86/include/asm/mpspec.h
> +++ b/arch/x86/include/asm/mpspec.h
> @@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
>  #endif
>
>  int generic_processor_info(int apicid, int version);
> +int __generic_processor_info(int apicid, int version, bool enabled);
>
>  #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
>
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index e49ee24..bcc85b2 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
> return -EINVAL;
> }
>
> -   if (!enabled) {
> +   if (!enabled)
> ++disabled_cpus;
> -   return -EINVAL;
> -   }
>
> if (boot_cpu_physical_apicid != -1U)
> ver = apic_version[boot_cpu_physical_apicid];
>
> -   return generic_processor_info(id, ver);
> +   return __generic_processor_info(id, ver, enabled);
>  }
>
>  static int __init
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index a9c9830..c744ffb 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
> apic_write(APIC_LVT1, value);
>  }
>
> -static int __generic_processor_info(int apicid, int version, bool enabled)
> +/*
> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
> + * Do not clear the mapping even if cpu is hot-removed.
> + */
> +static int apicid_to_cpuid[] = {
> +   [0 ... NR_CPUS - 1] = -1,
> +};
> +
> +/*
> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
> + */
> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
> +
> +static int get_cpuid(int apicid)
> +{
> +   int free_id, i;
> +
> +   free_id = cpumask_next_zero(-1, &cpuid_mask);
> +   if (free_id >= nr_cpu_ids)
> +   return -1;
> +
> +   for (i = 0; i < free_id; i++)
> +   if (apicid_to_cpuid[i] == apicid)
> +   return i;
> +
> +   apicid_to_cpuid[free_id] = apicid;
> +   cpumask_set_cpu(free_id, &cpuid_mask);
> +
> +   return free_id;
> +}
> +
> +int __generic_processor_info(int apicid, int version, bool enabled)
>  {
> int cpu, max = nr_cpu_ids;
> bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
> @@ -2058,8 +2089,18 @@ static int __generic_processor_info(int apicid, int 
> version, bool enabled)
>  * for BSP.
>  */
> cpu = 0;
> -   } else
> -   cpu = cpumask_next_zero(-1, cpu_present_mask);
> +   } else {
> +   cpu = get_cpuid(apicid);
> +   if (cpu < 0) {
> +   int thiscpu = max + disabled_cpus;
> +
> +   pr_warning("  Processor %d/0x%x ignored.\n",
> +  thiscpu, apicid);
> +   if (enabled)
> +   disabled_cpus++;
> +   return -EINVAL;
> +   }
> +   }
>
> /*
>  * Validate version
> --
> 1.9.3
>

Fwd: imx6 eth phy broken

2015-02-26 Thread Mika Penttilä
Seems to be the same clock tree change problem as in:

http://www.spinics.net/lists/arm-kernel/msg400244.html

I am able to help to reproduce/test/fix the problem with KaRo imx6q board.

--Mika


-- Forwarded message --
From: Mika Penttilä 
Date: Thu, Feb 26, 2015 at 11:58 AM
Subject: imx6 eth phy broken
To: linux-kernel@vger.kernel.org


Ethernet PHY not working on current Linus git on imx6 (KaRo tx6q):


[8.781755] fec 2188000.ethernet eth0: no PHY, assuming direct
connection to switch
[8.791175] libphy: PHY fixed-0:00 not found
[8.797571] fec 2188000.ethernet eth0: could not attach to PHY


Bisected :

035a61c314eb3dab5bcc5683afaf4d412689858a is the first bad commit
commit 035a61c314eb3dab5bcc5683afaf4d412689858a
Author: Tomeu Vizoso 
Date:   Fri Jan 23 12:03:30 2015 +0100

clk: Make clk API return per-user struct clk instances


--Mika


imx6 eth phy broken

2015-02-26 Thread Mika Penttilä
Ethernet PHY not working on current Linus git on imx6 (KaRo tx6q):


[8.781755] fec 2188000.ethernet eth0: no PHY, assuming direct
connection to switch
[8.791175] libphy: PHY fixed-0:00 not found
[8.797571] fec 2188000.ethernet eth0: could not attach to PHY


Bisected :

035a61c314eb3dab5bcc5683afaf4d412689858a is the first bad commit
commit 035a61c314eb3dab5bcc5683afaf4d412689858a
Author: Tomeu Vizoso 
Date:   Fri Jan 23 12:03:30 2015 +0100

clk: Make clk API return per-user struct clk instances


--Mika


Re: [PATCH] x86, kaslr: Prevent .bss from overlaping initrd

2014-10-30 Thread Mika Penttilä
> When choosing a random address, the current implementation does not take into
> account the reserved space for the .bss and .brk sections. Thus the
> relocated kernel may overlap other components in memory, e.g. the initrd
> image:

initrd should be included in the avoid arrays already, and bss is
included in the output_size for choose_kernel_location(). So something
else is going on?

--Mika

On Thu, Oct 30, 2014 at 9:06 PM, Mika Penttilä
 wrote:
>> When choosing a random address, the current implementation does not take
>> into account the reserved space for the .bss and .brk sections. Thus the
>> relocated kernel may overlap other components in memory, e.g. the initrd
>> image:
>
> initrd should be included in the avoid arrays already, and bss is included
> in the output_size for choose_kernel_location(). So something else is
> going on?
>
> --Mika
>
>


Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

2008-01-20 Thread Mika Penttilä



+ * This is how much memory *in addition to the memory covered up to
+ * and including _end* we need mapped initially.  We need one bit for
+ * each possible page, but only in low memory, which means
+ * 2^32/4096/8 = 128K worst case (4G/4G split.)
+ *
+ * Modulo rounding, each megabyte assigned here requires a kilobyte of
+ * memory, which is currently unreclaimed.
+ *
+ * This should be a multiple of a page.
+ */
+#define INIT_MAP_BEYOND_END(128*1024)
+
+/*
  


You have dropped the requirement to map all of low memory (the boot
allocator is used, for instance, to construct the physical memory
mapping). Either you should fix your INIT_MAP_BEYOND_END or add a big
comment explaining why it is no longer necessary to map low memory.


--Mika


Re: [patch 2/4] x86: PAT followup - Remove KERNPG_TABLE from pte entry

2008-01-16 Thread Mika Penttilä

[EMAIL PROTECTED] wrote:

KERNPG_TABLE was a bug in earlier patch. Remove it from pte.
pte_val() check is redundant as this routine is called immediately after a
ptepage is allocated afresh.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>
Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>

Index: linux-2.6.git/arch/x86/mm/init_64.c
===
--- linux-2.6.git.orig/arch/x86/mm/init_64.c2008-01-15 11:02:23.0 
-0800
+++ linux-2.6.git/arch/x86/mm/init_64.c 2008-01-15 11:06:37.0 -0800
@@ -541,9 +541,6 @@
if (address >= end)
break;
 
-		if (pte_val(*pte))
-			continue;
-
/* Nothing to map. Map the null page */
if (!(address & (~PAGE_MASK)) &&
(address + PAGE_SIZE <= end) &&
@@ -561,9 +558,9 @@
}
 
 		if (exec)
-			entry = _PAGE_NX|_KERNPG_TABLE|_PAGE_GLOBAL|address;
+			entry = _PAGE_NX|_PAGE_GLOBAL|address;
 		else
-			entry = _KERNPG_TABLE|_PAGE_GLOBAL|address;
+			entry = _PAGE_GLOBAL|address;
entry &= __supported_pte_mask;
set_pte(pte, __pte(entry));
}

  


Hmm, then what's the point of mapping not-present 4k pages for valid
memory here?


--Mika




Re: [PATCH -mm 1/2] wait_task_stopped: remove unneeded delay_group_leader check

2007-11-24 Thread Mika Penttilä



wait_task_stopped() doesn't need the "delay_group_leader" parameter. If the
child is not traced it must be a group leader. With or without subthreads
  

What do you mean "has to be a group leader"? It could be a stopped thread.


->group_stop_count == 0 when the whole task is stopped.

  
Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>


--- PT/kernel/exit.c~5_ck_group_stop2007-11-22 19:08:43.0 +0300
+++ PT/kernel/exit.c2007-11-23 20:31:21.0 +0300
@@ -1348,7 +1348,7 @@ static int wait_task_zombie(struct task_
  * the lock and this task is uninteresting.  If we return nonzero, we have
  * released the lock and the system call should return.
  */
-static int wait_task_stopped(struct task_struct *p, int delayed_group_leader,
+static int wait_task_stopped(struct task_struct *p,
 int noreap, struct siginfo __user *infop,
 int __user *stat_addr, struct rusage __user *ru)
 {
@@ -1362,8 +1362,7 @@ static int wait_task_stopped(struct task
if (unlikely(!is_task_stopped_or_traced(p)))
goto unlock_sig;
 
-	if (delayed_group_leader && !(p->ptrace & PT_PTRACED) &&
-	    p->signal->group_stop_count > 0)
+	if (!(p->ptrace & PT_PTRACED) && p->signal->group_stop_count > 0)
/*
 * A group stop is in progress and this is the group leader.
 * We won't report until all threads have stopped.
@@ -1519,7 +1518,7 @@ repeat:
!(options & WUNTRACED))
continue;
 
-				retval = wait_task_stopped(p, ret == 2,
+				retval = wait_task_stopped(p,
(options & WNOWAIT), infop,
stat_addr, ru);
} else if (p->exit_state == EXIT_ZOMBIE) {



Re: nmi_watchdog fix for x86_64 to be more like i386

2007-10-01 Thread Mika Penttilä

Thomas Gleixner wrote:

On Tue, 2 Oct 2007, Andi Kleen wrote:
  

OTOH, the accounting hook would allow us to remove the IRQ#0 -> CPU#0
restriction. Not sure whether it's worth the trouble.
  

Some SIS chipsets hang the machine when you migrate irq 0 to another
CPU. It's better to keep that. Also I wouldn't be surprised if there are some
other assumptions about this elsewhere.

Ok in theory it could be done only on SIS, but that probably would really
not be worth the trouble



Agreed.

I just got a x8664-hrt report, where I found the following oddity:

 0:   1197 172881   IO-APIC-edge  timer

That's one of those infamous AMD C1E boxen. Strange, all my systems have 
IRQ#0 on CPU#0 and nowhere else. Any idea ?


tglx

  

Here is what I have with the stock FC7 (2.6.22.9-91) kernel:
0: 107835  133459760   IO-APIC-edge  timer

Processor:
vendor_id: AuthenticAMD
cpu family: 15
model: 107
model name: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
stepping: 1
cpu MHz: 2109.721
cache size: 512 KB

MB:
Asus M2N-E (NF570)

--Mika




Re: [linux-pm] [RFC][PATCH 0/2 -mm] kexec based hibernation -v3

2007-09-21 Thread Mika Penttilä

huang ying wrote:

On 9/21/07, Mika Penttilä <[EMAIL PROTECTED]> wrote:
  

huang ying wrote:


On 9/21/07, Mika Penttilä <[EMAIL PROTECTED]> wrote:

  

Usage:

1. Compile kernel with following options selected:

CONFIG_X86_32=y
CONFIG_RELOCATABLE=y # not needed strictly, but it is more convenient with it
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y # only needed by kexeced kernel to save/restore memory image
CONFIG_PM=y
CONFIG_KEXEC_JUMP=y

2. Download the kexec-tools-testing git tree, apply the kexec-tools
   kjump patches (or download the source tar ball directly) and
   compile.

3. Download and compile the krestore tool.

4. Prepare 2 root partition used by kernel A and kernel B/C, referred
   as /dev/hda, /dev/hdb in following text. This is not strictly
   necessary, I use this scheme for testing during development.

5. Boot kernel compiled for normal usage (kernal A).

6. Load kernel compiled for hibernating/restore usage (kernel B) with
   kexec, the same kernel as that of 5 can be used if
   CONFIG_RELOCATABLE=y and CONFIG_CRASH_DUMP=y are selected.

   The --elf64-core-headers should be specified in command line of
   kexec, because only the 64bit ELF is supported by krestore tool.

   For example, the shell command line can be as follows:

   kexec -p -n /boot/bzImage --mem-min=0x10 --mem-max=0xff
   --elf64-core-headers --append="root=/dev/hdb single"

7. Jump to the hibernating kernel (kernel B) with following shell
   command line:

   kexec -j

8. In the hibernating kernel (kernel B), the memory image of
   hibernated kernel (kernel A) can be saved as follows:

   cp /proc/vmcore .
   cp /sys/kernel/kexec_jump_back_entry .


  

Here we save also kernel B's pages.



No, kernel B's pages will not be saved, because when we build the
elfcore (/proc/vmcore) header we exclude the memory area used by kernel
B. The details can be found in the kexec-tools patches.


  

Ok I see. But should the kernel B's e820 mem map be limited to 1m-16m in
order not to allocate pages found also in A's space? Or does the
--mem-min and --mem-max do that also?



That is what "memmap=exactmap [EMAIL PROTECTED] [EMAIL PROTECTED]" is for. The
contents of the e820 memmap will be overridden when these kernel parameters
are specified.

Best Regards,
Huang Ying
  
Yes, you just didn't specify exactmap for kernel B in your instructions, 
but only for C. But it is also required for kernel B then?


Thanks,
Mika



Re: [linux-pm] [RFC][PATCH 0/2 -mm] kexec based hibernation -v3

2007-09-21 Thread Mika Penttilä

huang ying wrote:

On 9/21/07, Mika Penttilä <[EMAIL PROTECTED]> wrote:
  

Usage:

1. Compile kernel with following options selected:

CONFIG_X86_32=y
CONFIG_RELOCATABLE=y # not needed strictly, but it is more convenient with it
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y # only needed by kexeced kernel to save/restore memory image
CONFIG_PM=y
CONFIG_KEXEC_JUMP=y

2. Download the kexec-tools-testing git tree, apply the kexec-tools
   kjump patches (or download the source tar ball directly) and
   compile.

3. Download and compile the krestore tool.

4. Prepare 2 root partitions used by kernel A and kernel B/C, referred
   to as /dev/hda and /dev/hdb in the following text. This is not strictly
   necessary; I use this scheme for testing during development.

5. Boot the kernel compiled for normal usage (kernel A).

6. Load the kernel compiled for hibernating/restore usage (kernel B) with
   kexec; the same kernel as in step 5 can be used if
   CONFIG_RELOCATABLE=y and CONFIG_CRASH_DUMP=y are selected.

   The --elf64-core-headers should be specified in command line of
   kexec, because only the 64bit ELF is supported by krestore tool.

   For example, the shell command line can be as follows:

   kexec -p -n /boot/bzImage --mem-min=0x10 --mem-max=0xff
   --elf64-core-headers --append="root=/dev/hdb single"

7. Jump to the hibernating kernel (kernel B) with following shell
   command line:

   kexec -j

8. In the hibernating kernel (kernel B), the memory image of
   hibernated kernel (kernel A) can be saved as follows:

   cp /proc/vmcore .
   cp /sys/kernel/kexec_jump_back_entry .

  

Here we save also kernel B's pages.



No, kernel B's pages will not be saved, because when we build the
elfcore (/proc/vmcore) header we exclude the memory area used by kernel
B. The details can be found in the kexec-tools patches.

  
Ok I see. But should the kernel B's e820 mem map be limited to 1m-16m in 
order not to allocate pages found also in A's space? Or does the 
--mem-min and --mem-max do that also?

Thanks,
Mika




Re: [linux-pm] [RFC][PATCH 0/2 -mm] kexec based hibernation -v3

2007-09-21 Thread Mika Penttilä



Usage:

1. Compile kernel with following options selected:

CONFIG_X86_32=y
CONFIG_RELOCATABLE=y # not needed strictly, but it is more convenient with it
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y # only needed by kexeced kernel to save/restore memory image
CONFIG_PM=y
CONFIG_KEXEC_JUMP=y

2. Download the kexec-tools-testing git tree, apply the kexec-tools
   kjump patches (or download the source tar ball directly) and
   compile.

3. Download and compile the krestore tool.

4. Prepare 2 root partitions used by kernel A and kernel B/C, referred
   to as /dev/hda and /dev/hdb in the following text. This is not strictly
   necessary; I use this scheme for testing during development.

5. Boot the kernel compiled for normal usage (kernel A).

6. Load the kernel compiled for hibernating/restore usage (kernel B) with
   kexec; the same kernel as in step 5 can be used if
   CONFIG_RELOCATABLE=y and CONFIG_CRASH_DUMP=y are selected.

   The --elf64-core-headers should be specified in command line of
   kexec, because only the 64bit ELF is supported by krestore tool.

   For example, the shell command line can be as follows:

   kexec -p -n /boot/bzImage --mem-min=0x10 --mem-max=0xff
   --elf64-core-headers --append="root=/dev/hdb single"

7. Jump to the hibernating kernel (kernel B) with following shell
   command line:

   kexec -j

8. In the hibernating kernel (kernel B), the memory image of
   hibernated kernel (kernel A) can be saved as follows:

   cp /proc/vmcore .
   cp /sys/kernel/kexec_jump_back_entry .
  

Here we save also kernel B's pages.

9. Shutdown or reboot in hibernating kernel (kernel B).

10. Boot kernel (kernel C) compiled for hibernating/restore usage on
the root file system /dev/hdb in memory range of kernel B.

For example, the following kernel command line parameters can be
used:

root=/dev/hdb single memmap=exactmap [EMAIL PROTECTED] [EMAIL PROTECTED]
  

0-640K from kernel A overrides 0-640K of kernel C at restore time.

11. In restore kernel (kernel C), the memory image of kernel A can be
restored as follows:

cp kexec_jump_back_entry /sys/kernel/kexec_jump_back_entry
krestore vmcore

  
This step replaces kernel C's pages with kernel B's (at least 15M-16M), 
saved at step 8, so these kernels should be equal? Or they must be 
physically located in non-overlapping regions, such that C is in B's 
memory range but non-overlapping. The proposed setup doesn't guarantee 
this AFAICS.

12. Jump back to hibernated kernel (kernel A)

kexec -b

Best Regards,
Huang Ying
___
linux-pm mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/linux-pm

  

--Mika




Re: [RFC] Crash on modpost, addend_386_rel()

2007-05-23 Thread Mika Penttilä

Atsushi Nemoto wrote:

On Tue, 22 May 2007 17:48:09 +0300, Mika Penttilä <[EMAIL PROTECTED]> wrote:
  

I can't see how this use of r_addend is going to work. find_elf_symbol
compares relsym->st_value to Elf_Rela->r_addend. I think it doesn't work
for RELA archs and even with this patch for REL.



It seems to work fine with RELA archs, at least mips64.

For example, set_up_list3s is correctly reported.

WARNING: mm/built-in.o - Section mismatch: reference to 
.init.text:set_up_list3s from .text between 'kmem_cache_create' (at offset 
0x26358) and 'cache_reap'

Here is excerpt from readelf output.  Addend value 0x21d8 matches
st_value of its target symbol.

$ mips64-linux-readelf -sr ../build-sb1250/mm/built-in.o
Relocation section '.rela.text' at offset 0x33fe0 contains 5100 entries:
  Offset  Info   Type       Sym. Value    Sym. Name + Addend
...
00026358  00040004 R_MIPS_26  .init.text + 21d8
Type2: R_MIPS_NONE
Type3: R_MIPS_NONE
...
Symbol table '.symtab' contains 1652 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
...
   746: 21d8   148 FUNC    LOCAL  DEFAULT    4 set_up_list3s

---
Atsushi Nemoto

  
So with mips64 you are lucky, because the relocation symbol is .init.text 
and hence the addend matches (has to match) the symbol's offset. I can't find 
any spec where it is stated that addend == address; maybe it's in the mips64 
ABI or something. It is quite normal to have an addend of 0.


--Mika



Re: [RFC] Crash on modpost, addend_386_rel()

2007-05-22 Thread Mika Penttilä

Atsushi Nemoto wrote:

On Tue, 22 May 2007 14:29:29 +0900 (JST), Atsushi Nemoto <[EMAIL PROTECTED]> 
wrote:
  

Subject: [PATCH] kbuild: make better section mismatch reports on i386, arm and 
mips (take 3)



Updated again to get a little bit better report on i386 relocatable
vmlinux.


Subject: [PATCH] kbuild: make better section mismatch reports on i386, arm and 
mips (take 4)

On i386, ARM and MIPS, warn_sec_mismatch() sometimes fails to show a
useful symbol name.  This is because of an empty 'refsym' due to a 0
r_addend value.  This patch adjusts the r_addend value, consulting the
apply_relocate() routine in the kernel code.

Signed-off-by: Atsushi Nemoto <[EMAIL PROTECTED]>
---
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 8e5610d..760b2b3 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -374,6 +374,7 @@ static int parse_elf(struct elf_info *info, const char 
*filename)
hdr->e_shstrndx = TO_NATIVE(hdr->e_shstrndx);
hdr->e_shnum= TO_NATIVE(hdr->e_shnum);
hdr->e_machine  = TO_NATIVE(hdr->e_machine);
+   hdr->e_type = TO_NATIVE(hdr->e_type);
sechdrs = (void *)hdr + hdr->e_shoff;
info->sechdrs = sechdrs;
 
@@ -384,6 +385,8 @@ static int parse_elf(struct elf_info *info, const char *filename)

sechdrs[i].sh_size   = TO_NATIVE(sechdrs[i].sh_size);
sechdrs[i].sh_link   = TO_NATIVE(sechdrs[i].sh_link);
sechdrs[i].sh_name   = TO_NATIVE(sechdrs[i].sh_name);
+   sechdrs[i].sh_info   = TO_NATIVE(sechdrs[i].sh_info);
+   sechdrs[i].sh_addr   = TO_NATIVE(sechdrs[i].sh_addr);
}
/* Find symbol table. */
for (i = 1; i < hdr->e_shnum; i++) {
@@ -753,6 +756,8 @@ static Elf_Sym *find_elf_symbol(struct elf_info *elf, 
Elf_Addr addr,
for (sym = elf->symtab_start; sym < elf->symtab_stop; sym++) {
if (sym->st_shndx != relsym->st_shndx)
continue;
+   if (ELF_ST_TYPE(sym->st_info) == STT_SECTION)
+   continue;
if (sym->st_value == addr)
return sym;
}
@@ -895,6 +900,72 @@ static void warn_sec_mismatch(const char *modname, const 
char *fromsec,
}
 }
 
+static inline unsigned int *reloc_location(struct elf_info *elf,
+  int rsection, Elf_Rela *r)
+{
+   Elf_Shdr *sechdrs = elf->sechdrs;
+   int section = sechdrs[rsection].sh_info;
+
+   return (void *)elf->hdr + sechdrs[section].sh_offset +
+   (r->r_offset - sechdrs[section].sh_addr);
+}
+
+static void addend_386_rel(struct elf_info *elf, int rsection, Elf_Rela *r)
+{
+   unsigned int r_typ = ELF_R_TYPE(r->r_info);
+   unsigned int *location = reloc_location(elf, rsection, r);
+
+   switch (r_typ) {
+   case R_386_32:
+   r->r_addend = TO_NATIVE(*location);
+   break;
+   case R_386_PC32:
+   r->r_addend = TO_NATIVE(*location) + 4;
+   /* For CONFIG_RELOCATABLE=y */
+   if (elf->hdr->e_type == ET_EXEC)
+   r->r_addend += r->r_offset;
+   break;
+   }
+}
+
+static void addend_arm_rel(struct elf_info *elf, int rsection, Elf_Rela *r)
+{
+   unsigned int r_typ = ELF_R_TYPE(r->r_info);
+   unsigned int *location = reloc_location(elf, rsection, r);
+
+   switch (r_typ) {
+   case R_ARM_ABS32:
+   r->r_addend = TO_NATIVE(*location);
+   break;
+   case R_ARM_PC24:
+   r->r_addend = ((TO_NATIVE(*location) & 0x00ffffff) << 2) + 8;
+   break;
+   }
+}
+
+static int addend_mips_rel(struct elf_info *elf, int rsection, Elf_Rela *r)
+{
+   unsigned int r_typ = ELF_R_TYPE(r->r_info);
+   unsigned int *location = reloc_location(elf, rsection, r);
+   unsigned int inst;
+
+   if (r_typ == R_MIPS_HI16)
+   return 1;   /* skip this */
+   inst = TO_NATIVE(*location);
+   switch (r_typ) {
+   case R_MIPS_LO16:
+   r->r_addend = inst & 0xffff;
+   break;
+   case R_MIPS_26:
+   r->r_addend = (inst & 0x03ffffff) << 2;
+   break;
+   case R_MIPS_32:
+   r->r_addend = inst;
+   break;
+   }
+   return 0;
+}
+
 /**
  * A module includes a number of sections that are discarded
  * either when loaded or when used as built-in.
@@ -938,8 +1009,11 @@ static void check_sec_ref(struct module *mod, const char 
*modname,
r.r_offset = TO_NATIVE(rela->r_offset);
 #if KERNEL_ELFCLASS == ELFCLASS64
if (hdr->e_machine == EM_MIPS) {
+   unsigned int r_typ;
r_sym = ELF64_MIPS_R_SYM(rela->r_info);
r_sym = TO_NATIVE(r_sym);
+   

Re: [PATCH 13/14] x86_64 irq: Safely cleanup an irq after moving it.

2007-02-25 Thread Mika Penttilä

Eric W. Biederman wrote:

  * Vectors 0x20-0x2f are used for ISA interrupts.
  */
-#define IRQ0_VECTOR	FIRST_EXTERNAL_VECTOR
+#define IRQ0_VECTOR	FIRST_EXTERNAL_VECTOR + 0x10
 #define IRQ1_VECTOR	IRQ0_VECTOR + 1
 #define IRQ2_VECTOR	IRQ0_VECTOR + 2
 #define IRQ3_VECTOR	IRQ0_VECTOR + 3
@@ -82,7 +87,7 @@
  

I think we have a dependency in i8259.c that irq0 is mapped to vector 0x20.

--Mika



Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Mika Penttilä

Al Viro wrote:


On Fri, Aug 19, 2005 at 10:16:47PM +0300, Mika Penttilä wrote:
 

Just out of curiosity - what protects even local filesystems against 
concurrent truncate and symlink resolving when using the page cache helpers?
   



How do you get truncate(2) or ftruncate(2) to do something with a symlink?
The former follows links, the latter takes an open file...

 

Yes, that is right; there is no way to invalidate the symlink inode 
mapping page(s) from user space.


--Mika






Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Mika Penttilä

Al Viro wrote:


On Fri, Aug 19, 2005 at 05:53:32PM +0100, Al Viro wrote:
 


I'm taking NFS helpers to libfs.c and switching ncpfs to them.  IMO that's
better than copying the damn thing and other network filesystems might have
the same needs eventually...
   



[something like this - completely untested]

* stray_page_get_link(inode, filler) - returns ERR_PTR(error) or pointer
to symlink body.  Said symlink body sits in a page at offset equal to
offsetof(page, struct stray_page_link).  filler() is expected to put it
at such offset. Page is cached.

* stray_page_put_link() - ->put_link() suitable for links obtained from
stray_page_get_link().  Unlike the usual pagecache-based variants, this
sucker does _not_ rely on page staying cached.

* nfs and ncpfs switched to the helpers above.

Signed-off-by: Al Viro <[EMAIL PROTECTED]>

 



Just out of curiosity - what protects even local filesystems against 
concurrent truncate and symlink resolving when using the page cache helpers?


--Mika
