Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom

2021-02-18 Thread Eiichi Tsukata
Hi Michal

> On Feb 17, 2021, at 21:31, Michal Hocko  wrote:
> 
> On Wed 17-02-21 10:42:24, Eiichi Tsukata wrote:
>> Hi All,
>> 
>> Firstly, thank you for your careful review and attention to my patch
>> (and apologies for top-posting!).  Let me first explain why our use
>> case requires hugetlb over THP, and then elaborate on the difficulty we
>> have in maintaining the correct number of hugepages in the pool, finally
>> concluding with why the proposed approach would help us. Hopefully you
>> can extend it to other use cases and justify the proposal.
>> 
>> We use Linux to operate a KVM-based hypervisor. Using hugepages to
>> back VM memory significantly increases performance and density. Each
>> VM incurs a 4k regular page overhead which can vary drastically even
>> at runtime (eg. depending on network traffic). In addition, the
>> software doesn't know upfront if users will power on one large VM or
>> several small VMs.
>> 
>> To manage the varying balance of 4k pages vs. hugepages, we originally
>> leveraged THP. However, constant fragmentation due to VM power cycles,
>> the varying overhead I mentioned above, and other operations like
>> reconfiguration of NIC RX buffers resulted in two problems:
>> 1) There were no guarantees hugepages would be used; and
>> 2) Constant memory compaction incurred a measurable overhead.
>> 
>> Having a userspace service managing hugetlb gave us significant
>> performance advantages and much needed determinism. It chooses when to
>> try and create more hugepages as well as how many hugepages to go
>> after. Elements like how many hugepages it actually gets, combined
>> with what operations are happening on the host, allow our service to
>> make educated decisions about when to compact memory, drop caches, and
>> retry growing (or shrinking) the pool.
> 
> OK, thanks for the clarification. Just to make sure I understand. This
> means that you are pro-actively and optimistically pre-allocating hugetlb
> pages even when there is no immediate need for those, right?

Right, but this is not a "pre-allocation just in case". We need to
know how many hugepages are available for VM memory upfront. That
allows us to plan for disaster scenarios where a host goes down and we
need to restart VMs in other hosts. In addition, going from zero to
TBs worth of hugepages may take a long time and makes VM power on
times too slow. Of course in bloat conditions we could lose hugepages
we pre-allocated, but our placement models can react to that.


> 
>> But that comes with a challenge: despite listening on cgroup for
>> pressure notifications (which happen from those runtime events we do
>> not control),
> 
> We do also have global pressure (PSI) counters. Have you tried to look
> into those and try to back off even when the situation becomes critical?

Yes. PSI counters help us to some extent, but we've found that in some cases
OOM can happen before we observe memory pressure if memory bloat occurs
rapidly. The proposed failsafe mechanism covers even such a situation.
Also, as I mentioned in the commit message, OOM notifiers don't work if the
OOM is triggered by a memory allocation for the kernel.

> 
>> the service is not guaranteed to sacrifice hugepages
>> fast enough and that causes an OOM. The killer will normally take out
>> a VM even if there are plenty of unused hugepages and that's obviously
>> disruptive for users. For us, free hugepages are almost always expendable.
>> 
>> For the bloat cases which are predictable, a memory management service
>> can adjust the hugepage pool size ahead of time. But it can be hard to
>> anticipate all scenarios, and some can be very volatile. Having a
>> failsafe mechanism as proposed in this patch offers invaluable
>> protection when things are missed.
>> 
>> The proposal solves this problem by sacrificing hugepages inline even
>> when the pressure comes from kernel allocations. The userspace service
>> can later readjust the pool size without being under pressure. Given
>> this is configurable, and defaults to being off, we thought it would
>> be a nice addition to the kernel and appreciated by other users that
>> may have similar requirements.
> 
> Thanks for your usecase description. It helped me to understand what you
> are doing and how this can be really useful for your particular setup.
> This is really a very specific situation from my POV. I am not yet sure
> this is generic enough to warrant yet another tunable. One thing
> you can do [1] is to
> hook into the oom notifiers interface (register_oom_notifier) and release
> pages from the callback. Why is that better than a global tunable?

Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom

2021-02-17 Thread Eiichi Tsukata
Hi All,

Firstly, thank you for your careful review and attention to my patch
(and apologies for top-posting!).  Let me first explain why our use
case requires hugetlb over THP, and then elaborate on the difficulty we
have in maintaining the correct number of hugepages in the pool, finally
concluding with why the proposed approach would help us. Hopefully you
can extend it to other use cases and justify the proposal.

We use Linux to operate a KVM-based hypervisor. Using hugepages to
back VM memory significantly increases performance and density. Each
VM incurs a 4k regular page overhead which can vary drastically even
at runtime (eg. depending on network traffic). In addition, the
software doesn't know upfront if users will power on one large VM or
several small VMs.

To manage the varying balance of 4k pages vs. hugepages, we originally
leveraged THP. However, constant fragmentation due to VM power cycles,
the varying overhead I mentioned above, and other operations like
reconfiguration of NIC RX buffers resulted in two problems:
1) There were no guarantees hugepages would be used; and
2) Constant memory compaction incurred a measurable overhead.

Having a userspace service managing hugetlb gave us significant
performance advantages and much needed determinism. It chooses when to
try and create more hugepages as well as how many hugepages to go
after. Elements like how many hugepages it actually gets, combined
with what operations are happening on the host, allow our service to
make educated decisions about when to compact memory, drop caches, and
retry growing (or shrinking) the pool.

But that comes with a challenge: despite listening on cgroup for
pressure notifications (which happen from those runtime events we do
not control), the service is not guaranteed to sacrifice hugepages
fast enough and that causes an OOM. The killer will normally take out
a VM even if there are plenty of unused hugepages and that's obviously
disruptive for users. For us, free hugepages are almost always expendable.

For the bloat cases which are predictable, a memory management service
can adjust the hugepage pool size ahead of time. But it can be hard to
anticipate all scenarios, and some can be very volatile. Having a
failsafe mechanism as proposed in this patch offers invaluable
protection when things are missed.

The proposal solves this problem by sacrificing hugepages inline even
when the pressure comes from kernel allocations. The userspace service
can later readjust the pool size without being under pressure. Given
this is configurable, and defaults to being off, we thought it would
be a nice addition to the kernel and appreciated by other users that
may have similar requirements.

I welcome your comments and thank you again for your time!

Eiichi

> On Feb 17, 2021, at 16:57, Michal Hocko  wrote:
> 
> On Tue 16-02-21 14:30:15, Mike Kravetz wrote:
> [...]
>> However, this is an 'opt in' feature.  So, I would not expect anyone who
>> carefully plans the size of their hugetlb pool to enable such a feature.
>> If there is a use case where hugetlb pages are used in a non-essential
>> application, this might be of use.
> 
> I would really like to hear about the specific usecase. Because it
> smells more like a misconfiguration. What would be non-essential hugetlb
> pages? This is not a resource to be pre-allocated just in case, right?
> 
> -- 
> Michal Hocko
> SUSE Labs



[RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom

2021-02-15 Thread Eiichi Tsukata
Hugepages can be preallocated to avoid unpredictable allocation latency.
If we run into a 4k page shortage, the kernel can trigger OOM even though
there are free hugepages. When OOM is triggered by the user address page
fault handler, we can use an oom notifier to free hugepages in user space,
but if it is triggered by a memory allocation for the kernel, there is no
way to synchronously handle it in user space.

This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
enabled, it first tries to free a hugepage if available before invoking
the oom-killer. The default value is disabled so as not to change the
current behavior.

Signed-off-by: Eiichi Tsukata 
---
 Documentation/admin-guide/sysctl/vm.rst | 12 
 include/linux/hugetlb.h |  2 ++
 include/linux/oom.h |  1 +
 kernel/sysctl.c |  9 +
 mm/hugetlb.c|  4 ++--
 mm/oom_kill.c   | 23 +++
 6 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e35a3f2fb006..f2f195524be6 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -65,6 +65,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - panic_on_oom
 - percpu_pagelist_fraction
+- sacrifice_hugepage_on_oom
 - stat_interval
 - stat_refresh
 - numa_stat
@@ -807,6 +808,17 @@ The initial value is zero.  Kernel does not use this value at boot time to set
 the high water marks for each per cpu page list.  If the user writes '0' to this
 sysctl, it will revert to this default behavior.
 
+sacrifice_hugepage_on_oom
+=========================
+
+This value controls whether the kernel should attempt to break up hugepages
+when out-of-memory happens. An OOM that happens under a memory cgroup
+will not invoke this.
+
+If set to 0 (default), the kernel doesn't touch the hugepage pool during OOM
+conditions.
+If set to 1, the kernel frees one hugepage at a time, if available, before
+invoking the oom-killer.
 
 stat_interval
 =
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..8aad2f2ab6e6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -145,6 +145,8 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
 bool isolate_huge_page(struct page *page, struct list_head *list);
+int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+   bool acct_surplus);
 void putback_active_hugepage(struct page *page);
 void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..0bfae027ec16 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -127,4 +127,5 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern int sysctl_sacrifice_hugepage_on_oom;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9fbdd848138..d2e3ec625f5f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2708,6 +2708,15 @@ static struct ctl_table vm_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
+   {
+   .procname   = "sacrifice_hugepage_on_oom",
+		.data		= &sysctl_sacrifice_hugepage_on_oom,
+   .maxlen = sizeof(sysctl_sacrifice_hugepage_on_oom),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = SYSCTL_ZERO,
+   .extra2 = SYSCTL_ONE,
+   },
{
.procname   = "overcommit_ratio",
		.data		= &sysctl_overcommit_ratio,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..e2d57200fd00 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1726,8 +1726,8 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
  * balanced over allowed nodes.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-bool acct_surplus)
+int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+   bool acct_surplus)
 {
int nr_nodes, node;
int ret = 0;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04b19b7b5435..fd2c1f427926 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include <linux/hugetlb.h>
 
 #include 
 #include "i

Re: [PATCH] xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init

2020-08-06 Thread Eiichi Tsukata
Thanks, I sent it to linux-xfs ML. I had some trouble with gmail server.

Eiichi

On 2020/08/07 0:13, Darrick J. Wong wrote:
> On Fri, Aug 07, 2020 at 12:05:27AM +0900, Eiichi Tsukata wrote:
>> If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN
>> shows the following warning:
> 
> This needs to be sent to the xfs mailing list, per get_maintainers.pl.
> 
> --D
> 
>>   UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23
>>   member access within null pointer of type 'struct xfs_kobj'
>>   Call Trace:
>>dump_stack+0x10e/0x195
>>ubsan_type_mismatch_common+0x241/0x280
>>__ubsan_handle_type_mismatch_v1+0x32/0x40
>>init_xfs_fs+0x12b/0x28f
>>do_one_initcall+0xdd/0x1d0
>>do_initcall_level+0x151/0x1b6
>>do_initcalls+0x50/0x8f
>>do_basic_setup+0x29/0x2b
>>kernel_init_freeable+0x19f/0x20b
>>kernel_init+0x11/0x1e0
>>ret_from_fork+0x22/0x30
>>
>> Fix it by checking parent_kobj before the code accesses its member.
>>
>> Signed-off-by: Eiichi Tsukata 
>> ---
>>  fs/xfs/xfs_sysfs.h | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_sysfs.h b/fs/xfs/xfs_sysfs.h
>> index e9f810fc6731..aad67dc4ab5b 100644
>> --- a/fs/xfs/xfs_sysfs.h
>> +++ b/fs/xfs/xfs_sysfs.h
>> @@ -32,9 +32,9 @@ xfs_sysfs_init(
>>  struct xfs_kobj *parent_kobj,
>>  const char  *name)
>>  {
>> +	struct kobject *parent = parent_kobj ? &parent_kobj->kobject : NULL;
>> 	init_completion(&kobj->complete);
>> -	return kobject_init_and_add(&kobj->kobject, ktype,
>> -				    &parent_kobj->kobject, "%s", name);
>> +	return kobject_init_and_add(&kobj->kobject, ktype, parent, "%s", name);
>>  }
>>  
>>  static inline void
>> -- 
>> 2.26.2
>>


[PATCH] xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init

2020-08-06 Thread Eiichi Tsukata
If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN
shows the following warning:

  UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23
  member access within null pointer of type 'struct xfs_kobj'
  Call Trace:
   dump_stack+0x10e/0x195
   ubsan_type_mismatch_common+0x241/0x280
   __ubsan_handle_type_mismatch_v1+0x32/0x40
   init_xfs_fs+0x12b/0x28f
   do_one_initcall+0xdd/0x1d0
   do_initcall_level+0x151/0x1b6
   do_initcalls+0x50/0x8f
   do_basic_setup+0x29/0x2b
   kernel_init_freeable+0x19f/0x20b
   kernel_init+0x11/0x1e0
   ret_from_fork+0x22/0x30

Fix it by checking parent_kobj before the code accesses its member.

Signed-off-by: Eiichi Tsukata 
---
 fs/xfs/xfs_sysfs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_sysfs.h b/fs/xfs/xfs_sysfs.h
index e9f810fc6731..aad67dc4ab5b 100644
--- a/fs/xfs/xfs_sysfs.h
+++ b/fs/xfs/xfs_sysfs.h
@@ -32,9 +32,9 @@ xfs_sysfs_init(
struct xfs_kobj *parent_kobj,
const char  *name)
 {
+	struct kobject *parent = parent_kobj ? &parent_kobj->kobject : NULL;
	init_completion(&kobj->complete);
-	return kobject_init_and_add(&kobj->kobject, ktype,
-				    &parent_kobj->kobject, "%s", name);
+	return kobject_init_and_add(&kobj->kobject, ktype, parent, "%s", name);
 }
 
 static inline void
-- 
2.26.2




Re: [RFC PATCH] KVM: x86: Fix APIC page invalidation race

2020-06-09 Thread Eiichi Tsukata



> On Jun 9, 2020, at 18:54, Paolo Bonzini  wrote:
> 
> 
> No need to resend, the patch is good.  Here is my take on the commit message:

Thank you Paolo! Your commit message is much clearer.
I really appreciate your great job.

Best

Eiichi

> 
>Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
>to fix inappropriate APIC page invalidation by re-introducing arch
>specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
>kvm_mmu_notifier_invalidate_range_start. However, the patch left a
>possible race where the VMCS APIC address cache is updated *before*
>it is unmapped:
> 
>  (Invalidator) kvm_mmu_notifier_invalidate_range_start()
>  (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
>  (KVM VCPU) vcpu_enter_guest()
>  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
>  (Invalidator) actually unmap page
> 
>Because of the above race, there can be a mismatch between the
>host physical address stored in the APIC_ACCESS_PAGE VMCS field and
>the host physical address stored in the EPT entry for the APIC GPA
>(0xfee00000).  When this happens, the processor will not trap APIC
>accesses, and will instead show the raw contents of the APIC-access page.
>Because Windows OS periodically checks for unexpected modifications to
>the LAPIC register, this will show up as a BSOD crash with BugCheck
>CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
>https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
> 
>The root cause of the issue is that 
> kvm_arch_mmu_notifier_invalidate_range()
>cannot guarantee that no additional references are taken to the pages in
>the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
>this case is supported by the MMU notifier API, as documented in
>include/linux/mmu_notifier.h:
> 
> * If the subsystem
> * can't guarantee that no additional references are taken to
> * the pages in the range, it has to implement the
> * invalidate_range() notifier to remove any references taken
> * after invalidate_range_start().
> 
>The fix therefore is to reload the APIC-access page field in the VMCS
>from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
> 
> Thanks,
> 
> Paolo



Re: [RFC PATCH] KVM: x86: Fix APIC page invalidation race

2020-06-08 Thread Eiichi Tsukata


> On Jun 8, 2020, at 22:13, Paolo Bonzini  wrote:
> 
> On 06/06/20 06:26, Eiichi Tsukata wrote:
>> Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to
>> fix inappropriate APIC page invalidation by re-introducing arch specific
>> kvm_arch_mmu_notifier_invalidate_range() and calling it from
>> kvm_mmu_notifier_invalidate_range_start. But there could be the
>> following race because VMCS APIC address cache can be updated
>> *before* it is unmapped.
>> 
>> Race:
>>  (Invalidator) kvm_mmu_notifier_invalidate_range_start()
>>  (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
>>  (KVM VCPU) vcpu_enter_guest()
>>  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
>>  (Invalidator) actually unmap page
>> 
>> Symptom:
>>  The above race can make Guest OS see already freed page and Guest OS
>> will see broken APIC register values.
> 
> This is not exactly the issue.  The values in the APIC-access page do
> not really matter, the problem is that the host physical address values
> won't match between the page tables and the APIC-access page address.
> Then the processor will not trap APIC accesses, and will instead show
> the raw contents of the APIC-access page (zeroes), and cause the crash
> as you mention below.
> 
> Still, the race explains the symptoms and the patch matches this text in
> include/linux/mmu_notifier.h:
> 
>* If the subsystem
> * can't guarantee that no additional references are taken to
> * the pages in the range, it has to implement the
> * invalidate_range() notifier to remove any references taken
> * after invalidate_range_start().
> 
> where the "additional reference" is in the VMCS: because we have to
> account for kvm_vcpu_reload_apic_access_page running between
> invalidate_range_start() and invalidate_range_end(), we need to
> implement invalidate_range().
> 
> The patch seems good, but I'd like Andrea Arcangeli to take a look as
> well so I've CCed him.
> 
> Thank you very much!
> 
> Paolo
> 

Hello Paolo

Thanks for detailed explanation!
I’ll fix the commit message like this:

```
Symptom:
  The above race can cause a mismatch between the page tables and the
APIC-access page address in the VMCS. Then the processor will not trap APIC
accesses, and will instead show the raw contents of the APIC-access page
(zeroes). In particular, Windows OS checks for LAPIC modification, so it can
cause a BSOD crash with BugCheck CRITICAL_STRUCTURE_CORRUPTION (109). These symptoms
are the same as we previously saw in
https://bugzilla.kernel.org/show_bug.cgi?id=197951
and we are currently seeing in
https://bugzilla.redhat.com/show_bug.cgi?id=1751017.

To prevent mismatch between page tables and APIC-access page address,
this patch calls kvm_arch_mmu_notifier_invalidate_range() from
kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
We need to implement invalidate_range() because we have to
account for kvm_vcpu_reload_apic_access_page() running between
invalidate_range_start() and invalidate_range_end().
```


Best

Eiichi



Re: [RFC PATCH] KVM: x86: Fix APIC page invalidation race

2020-06-05 Thread Eiichi Tsukata
Hello

The race window I mentioned in the commit message is pretty small. So it’s 
difficult to reproduce it.
But with the following ‘delay’ patch, it can be very easy to reproduce.

```
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c17e6eb9ad43..b6728bf80a7d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include <linux/delay.h>
 
 #include 
 
@@ -8161,8 +8162,10 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 * Update it when it becomes invalid.
 */
apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
-   if (start <= apic_address && apic_address < end)
+   if (start <= apic_address && apic_address < end) {
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+   mdelay(1000);
+   }
 
return 0;
 }
```

Steps to Reproduce:
- start Windows VM(ex: Windows Server 2016) and watch YouTube video to 
stimulate VM_ENTER/EXIT
- 'stress --vm X --vm-bytes Y' to make the APIC page swapped out
- Windows OS will crash with BugCheck 0x109

Thanks,

Eiichi

> On Jun 6, 2020, at 13:26, Eiichi Tsukata  wrote:
> 
> Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to
> fix inappropriate APIC page invalidation by re-introducing arch specific
> kvm_arch_mmu_notifier_invalidate_range() and calling it from
> kvm_mmu_notifier_invalidate_range_start. But there could be the
> following race because VMCS APIC address cache can be updated
> *before* it is unmapped.
> 
> Race:
>  (Invalidator) kvm_mmu_notifier_invalidate_range_start()
>  (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
>  (KVM VCPU) vcpu_enter_guest()
>  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
>  (Invalidator) actually unmap page
> 
> Symptom:
>  The above race can make Guest OS see already freed page and Guest OS
> will see broken APIC register values. Especially, Windows OS checks
> LAPIC modification so it can cause BSOD crash with BugCheck
> CRITICAL_STRUCTURE_CORRUPTION (109). These symptoms are the same as we
> previously saw in https://bugzilla.kernel.org/show_bug.cgi?id=197951 and
> we are currently seeing in
> https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
> 
> To prevent Guest OS from accessing already freed page, this patch calls
> kvm_arch_mmu_notifier_invalidate_range() from
> kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
> 
> Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
> Signed-off-by: Eiichi Tsukata 
> ---
> arch/x86/kvm/x86.c   |  7 ++-
> include/linux/kvm_host.h |  4 ++--
> virt/kvm/kvm_main.c  | 26 --
> 3 files changed, 20 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c17e6eb9ad43..1700aade39d1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8150,9 +8150,8 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   kvm_x86_ops.load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> }
> 
> -int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> - unsigned long start, unsigned long end,
> - bool blockable)
> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +					    unsigned long start, unsigned long end)
> {
>   unsigned long apic_address;
> 
> @@ -8163,8 +8162,6 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   if (start <= apic_address && apic_address < end)
>   kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> -
> - return 0;
> }
> 
> void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 131cc1527d68..92efa39ea3d7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1406,8 +1406,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
> }
> #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
> 
> -int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> - unsigned long start, unsigned long end, bool blockable);
> +v

[RFC PATCH] KVM: x86: Fix APIC page invalidation race

2020-06-05 Thread Eiichi Tsukata
Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to
fix inappropriate APIC page invalidation by re-introducing arch specific
kvm_arch_mmu_notifier_invalidate_range() and calling it from
kvm_mmu_notifier_invalidate_range_start. But there could be the
following race because VMCS APIC address cache can be updated
*before* it is unmapped.

Race:
  (Invalidator) kvm_mmu_notifier_invalidate_range_start()
  (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
  (KVM VCPU) vcpu_enter_guest()
  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
  (Invalidator) actually unmap page

Symptom:
  The above race can make the guest OS see an already freed page and
read broken APIC register values. In particular, Windows OS checks for
LAPIC modification, so it can cause a BSOD crash with BugCheck
CRITICAL_STRUCTURE_CORRUPTION (109). These symptoms are the same as we
previously saw in https://bugzilla.kernel.org/show_bug.cgi?id=197951 and
we are currently seeing in
https://bugzilla.redhat.com/show_bug.cgi?id=1751017.

To prevent Guest OS from accessing already freed page, this patch calls
kvm_arch_mmu_notifier_invalidate_range() from
kvm_mmu_notifier_invalidate_range() instead of ..._range_start().

Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
Signed-off-by: Eiichi Tsukata 
---
 arch/x86/kvm/x86.c   |  7 ++-
 include/linux/kvm_host.h |  4 ++--
 virt/kvm/kvm_main.c  | 26 --
 3 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c17e6eb9ad43..1700aade39d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8150,9 +8150,8 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
kvm_x86_ops.load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-   unsigned long start, unsigned long end,
-   bool blockable)
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+					    unsigned long start, unsigned long end)
 {
unsigned long apic_address;
 
@@ -8163,8 +8162,6 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
if (start <= apic_address && apic_address < end)
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
-
-   return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 131cc1527d68..92efa39ea3d7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1406,8 +1406,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-   unsigned long start, unsigned long end, bool blockable);
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+					    unsigned long start, unsigned long end);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
 int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 731c1e517716..77aa91fb08d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -155,10 +155,9 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-   unsigned long start, unsigned long end, bool blockable)
+__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+  unsigned long start, unsigned long end)
 {
-   return 0;
 }
 
 bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
@@ -384,6 +383,18 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
 }
 
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+   struct kvm *kvm = mmu_notifier_to_kvm(mn);
+   int idx;
+
+   idx = srcu_read_lock(&kvm->srcu);
+   kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+   srcu_read_unlock(&kvm->srcu, idx);
+}
+
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -408,7 +419,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
-   int ret;
 
idx = srcu_read_lock(&kvm->srcu);

Re: [PATCH] tracing: Prevent RCU EQS breakage in preemptirq events

2019-07-29 Thread Eiichi Tsukata
Thanks for comments.

On 2019/07/30 0:21, Steven Rostedt wrote:
> On Mon, 29 Jul 2019 10:07:34 +0900
> Eiichi Tsukata  wrote:
> 
>> If context tracking is enabled, causing a page fault in preemptirq
>> irq_enable or irq_disable events triggers the following RCU EQS warning.
>>
>> Reproducer:
>>
>>   // CONFIG_PREEMPTIRQ_EVENTS=y
>>   // CONFIG_CONTEXT_TRACKING=y
>>   // CONFIG_RCU_EQS_DEBUG=y
>>   # echo 1 > events/preemptirq/irq_disable/enable
>>   # echo 1 > options/userstacktrace
> 
> So the problem is only with userstacktrace enabled?

It can happen when tracing code causes a page fault in preemptirq events.
For example, the following perf command also hit the warning:

  # perf record -e 'preemptirq:irq_enable' -g ls


>>  
>>  __visible void trace_hardirqs_on_caller(unsigned long caller_addr)
>>  {
>> +enum ctx_state prev_state;
>> +
>>  if (this_cpu_read(tracing_irq_cpu)) {
>> -if (!in_nmi())
>> +if (!in_nmi()) {
> 
> This is a very high fast path (for tracing irqs off and such). Instead
> of adding a check here for a case that is seldom used (userstacktrace
> and tracing irqs on/off). Move this to surround the userstack trace
> code.
> 
> -- Steve

If the problem were only with userstacktrace, it would be reasonable to
surround only the userstack unwinder. But the situation is similar to
the previous "tracing vs CR2" case. As Peter taught me in
https://lore.kernel.org/lkml/20190708074823.gv3...@hirez.programming.kicks-ass.net/
there are other code paths which are likely to do user accesses.
So I surround the preemptirq events earlier.


Re: [PATCH] tracing: Prevent RCU EQS breakage in preemptirq events

2019-07-29 Thread Eiichi Tsukata
Thanks for comments.

On 2019/07/29 19:29, Peter Zijlstra wrote:
> On Sun, Jul 28, 2019 at 09:25:58PM -0700, Andy Lutomirski wrote:
>> On Sun, Jul 28, 2019 at 6:08 PM Eiichi Tsukata  wrote:
> 
>>> diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
>>> index 4d8e99fdbbbe..031b51cb94d0 100644
>>> --- a/kernel/trace/trace_preemptirq.c
>>> +++ b/kernel/trace/trace_preemptirq.c
>>> @@ -10,6 +10,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include "trace.h"
>>>
>>>  #define CREATE_TRACE_POINTS
>>> @@ -49,9 +50,14 @@ NOKPROBE_SYMBOL(trace_hardirqs_off);
>>>
>>>  __visible void trace_hardirqs_on_caller(unsigned long caller_addr)
>>>  {
>>> +   enum ctx_state prev_state;
>>> +
>>> if (this_cpu_read(tracing_irq_cpu)) {
>>> -   if (!in_nmi())
>>> +   if (!in_nmi()) {
>>> +   prev_state = exception_enter();
>>> trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
>>> +   exception_exit(prev_state);
>>> +   }
>>> tracer_hardirqs_on(CALLER_ADDR0, caller_addr);
>>> this_cpu_write(tracing_irq_cpu, 0);
>>> }
>>
>> This seems a bit distressing.  Now we're going to do a whole bunch of
>> context tracking transitions for each kernel entry.  Would a better
>> fix be to change trace_hardirqs_on_caller to skip the trace event if
>> the previous state was already IRQs on and, more importantly, to skip
>> tracing IRQs off if IRQs were already off?  The x86 code is very
>> careful to avoid ever having IRQs on and CONTEXT_USER at the same
>> time.
> 
> I think they already (try to) do that; see 'tracing_irq_cpu'.
> 

Or you mean something like this?
As for trace_hardirqs_off_caller:

diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 4d8e99fdbbbe..d39478bcf0f2 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -66,7 +66,7 @@ __visible void trace_hardirqs_off_caller(unsigned long caller_addr)
if (!this_cpu_read(tracing_irq_cpu)) {
this_cpu_write(tracing_irq_cpu, 1);
tracer_hardirqs_off(CALLER_ADDR0, caller_addr);
-   if (!in_nmi())
+   if (!in_nmi() && !irqs_disabled())
trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
}

Or

diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 4d8e99fdbbbe..e08c5c6ff2b3 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -66,8 +66,6 @@ __visible void trace_hardirqs_off_caller(unsigned long caller_addr)
if (!this_cpu_read(tracing_irq_cpu)) {
this_cpu_write(tracing_irq_cpu, 1);
tracer_hardirqs_off(CALLER_ADDR0, caller_addr);
-   if (!in_nmi())
-   trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
}


As for trace_hardirqs_on_caller, it is called with IRQs off and CONTEXT_USER.
So even if we skipped the trace event when the previous state was already
IRQs on, we would fall into the same situation.


Re: [PATCH] sched/core: Remove the unused schedule_user() function

2019-07-29 Thread Eiichi Tsukata



On 2019/07/29 17:30, Qais Yousef wrote:
> On 07/28/19 01:55, Eiichi Tsukata wrote:
>> Since commit 02bc7768fe44 ("x86/asm/entry/64: Migrate error and IRQ exit
>> work to C and remove old assembly code"), it's no longer used.
> 
> It seems to me that powerpc and sparc are still using it?
> 
> $ git grep SCHEDULE_USER
> arch/powerpc/include/asm/context_tracking.h:#define SCHEDULE_USER bl
> schedule_user
> arch/powerpc/include/asm/context_tracking.h:#define SCHEDULE_USER bl
> schedule
> arch/powerpc/kernel/entry_64.S: SCHEDULE_USER
> arch/sparc/kernel/rtrap_64.S:# define SCHEDULE_USER schedule_user
> arch/sparc/kernel/rtrap_64.S:# define SCHEDULE_USER schedule
> arch/sparc/kernel/rtrap_64.S:   callSCHEDULE_USER
> 
> Cheers
> 
> --
> Qais Yousef
> 

Sorry, I misused cscope. The function is still used as you say.

Thanks

Eiichi


[PATCH] tracing: Prevent RCU EQS breakage in preemptirq events

2019-07-28 Thread Eiichi Tsukata
If context tracking is enabled, causing a page fault in preemptirq
irq_enable or irq_disable events triggers the following RCU EQS warning.

Reproducer:

  // CONFIG_PREEMPTIRQ_EVENTS=y
  // CONFIG_CONTEXT_TRACKING=y
  // CONFIG_RCU_EQS_DEBUG=y
  # echo 1 > events/preemptirq/irq_disable/enable
  # echo 1 > options/userstacktrace

  WARNING: CPU: 0 PID: 2574 at kernel/rcu/tree.c:262 rcu_dynticks_eqs_exit+0x48/0x50
  CPU: 0 PID: 2574 Comm: sh Not tainted 5.3.0-rc1+ #105
  RIP: 0010:rcu_dynticks_eqs_exit+0x48/0x50
  Call Trace:
   rcu_eqs_exit+0x4e/0xd0
   rcu_user_exit+0x13/0x20
   __context_tracking_exit.part.0+0x74/0x120
   context_tracking_exit.part.0+0x28/0x50
   context_tracking_exit+0x1d/0x20
   do_page_fault+0xab/0x1b0
   do_async_page_fault+0x35/0xb0
   async_page_fault+0x3e/0x50
  RIP: 0010:arch_stack_walk_user+0x8e/0x100
   stack_trace_save_user+0x7d/0xa9
   trace_buffer_unlock_commit_regs+0x178/0x220
   trace_event_buffer_commit+0x6b/0x200
   trace_event_raw_event_preemptirq_template+0x7b/0xc0
   trace_hardirqs_off_caller+0xb3/0xf0
   trace_hardirqs_off_thunk+0x1a/0x20
   entry_SYSCALL_64_after_hwframe+0x3e/0xbe

Details of call trace and RCU EQS/Context:

  entry_SYSCALL_64_after_hwframe() EQS: IN, CTX: USER
trace_irq_disable_rcuidle()
  rcu_irq_enter_irqson()
rcu_dynticks_eqs_exit() EQS: IN => OUT
  stack_trace_save_user() EQS: OUT, CTX: USER
page_fault()
  do_page_fault()
exception_enter() EQS: OUT, CTX: USER
  context_tracking_exit()
rcu_eqs_exit()
  rcu_dynticks_eqs_exit() EQS: OUT => OUT? (warning)

trace_irq_disable/enable_rcuidle() are called from user mode in entry
code and call rcu_irq_enter_irqson() in __DO_TRACE(). This can produce
the state "RCU EQS: OUT but CTX: USER"; stack_trace_save_user() can then
cause a page fault which calls rcu_dynticks_eqs_exit() again and hits
the EQS validation warning if CONFIG_RCU_EQS_DEBUG is enabled.

Fix it by calling exception_enter/exit() around
trace_irq_disable/enable_rcuidle() to enter CONTEXT_KERNEL before the
tracing code can cause a page fault. Also move the state change to
CONTEXT_KERNEL earlier to prevent tracing code from calling
context_tracking_exit() recursively.

Ideally, the problem could be fixed by calling enter_from_user_mode()
before TRACE_IRQS_OFF in the entry code (then we would need to tell
lockdep that IRQs are off earlier) and calling prepare_exit_to_usermode()
after TRACE_IRQS_ON. But this patch is much simpler and limits the
changes to the tracing code.

Fixes: 865e63b04e9b ("tracing: Add back in rcu_irq_enter/exit_irqson() for rcuidle tracepoints")
Signed-off-by: Eiichi Tsukata 
---
 kernel/context_tracking.c   |  6 +-
 kernel/trace/trace_preemptirq.c | 15 +--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index be01a4d627c9..860eaf9780e5 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -148,6 +148,11 @@ void __context_tracking_exit(enum ctx_state state)
return;
 
if (__this_cpu_read(context_tracking.state) == state) {
+   /*
+* Change state before executing code which can trigger a
+* page fault and lead to unnecessary re-entrance.
+*/
+   __this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
if (__this_cpu_read(context_tracking.active)) {
/*
 * We are going to run code that may use RCU. Inform
@@ -159,7 +164,6 @@ void __context_tracking_exit(enum ctx_state state)
trace_user_exit(0);
}
}
-   __this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
}
context_tracking_recursion_exit();
 }
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 4d8e99fdbbbe..031b51cb94d0 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "trace.h"
 
 #define CREATE_TRACE_POINTS
@@ -49,9 +50,14 @@ NOKPROBE_SYMBOL(trace_hardirqs_off);
 
 __visible void trace_hardirqs_on_caller(unsigned long caller_addr)
 {
+   enum ctx_state prev_state;
+
if (this_cpu_read(tracing_irq_cpu)) {
-   if (!in_nmi())
+   if (!in_nmi()) {
+   prev_state = exception_enter();
trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
+   exception_exit(prev_state);
+   }
tracer_hardirqs_on(CALLER_ADDR0, caller_addr);
this_cpu_write(tracing_irq_cpu, 0);
}
@@ -63,11 +69,16 @@ NOKPROBE_SYMBOL(trace_hardirqs_on_caller);
 
 __visible void trace_hardirqs_off_caller(unsigned long caller_

[PATCH] sched/core: Remove the unused schedule_user() function

2019-07-27 Thread Eiichi Tsukata
Since commit 02bc7768fe44 ("x86/asm/entry/64: Migrate error and IRQ exit
work to C and remove old assembly code"), it's no longer used.

Signed-off-by: Eiichi Tsukata 
---
 kernel/sched/core.c | 19 ---
 1 file changed, 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..0079bebe0086 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3973,25 +3973,6 @@ void __sched schedule_idle(void)
} while (need_resched());
 }
 
-#ifdef CONFIG_CONTEXT_TRACKING
-asmlinkage __visible void __sched schedule_user(void)
-{
-   /*
-* If we come here after a random call to set_need_resched(),
-* or we have been woken up remotely but the IPI has not yet arrived,
-* we haven't yet exited the RCU idle mode. Do it here manually until
-* we find a better solution.
-*
-* NB: There are buggy callers of this function.  Ideally we
-* should warn if prev_state != CONTEXT_USER, but that will trigger
-* too frequently to make sense yet.
-*/
-   enum ctx_state prev_state = exception_enter();
-   schedule();
-   exception_exit(prev_state);
-}
-#endif
-
 /**
  * schedule_preempt_disabled - called with preemption disabled
  *
-- 
2.21.0



[tip:x86/urgent] x86/stacktrace: Prevent access_ok() warnings in arch_stack_walk_user()

2019-07-22 Thread tip-bot for Eiichi Tsukata
Commit-ID:  2af7c85714d8cafadf925d55441458eae312cd6b
Gitweb: https://git.kernel.org/tip/2af7c85714d8cafadf925d55441458eae312cd6b
Author: Eiichi Tsukata 
AuthorDate: Mon, 22 Jul 2019 17:32:16 +0900
Committer:  Thomas Gleixner 
CommitDate: Mon, 22 Jul 2019 10:42:36 +0200

x86/stacktrace: Prevent access_ok() warnings in arch_stack_walk_user()

When arch_stack_walk_user() is called from atomic contexts, access_ok() can
trigger the following warning if compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y.

Reproducer:

  // CONFIG_DEBUG_ATOMIC_SLEEP=y
  # cd /sys/kernel/debug/tracing
  # echo 1 > options/userstacktrace
  # echo 1 > events/irq/irq_handler_entry/enable

  WARNING: CPU: 0 PID: 2649 at arch/x86/kernel/stacktrace.c:103 arch_stack_walk_user+0x6e/0xf6
  CPU: 0 PID: 2649 Comm: bash Not tainted 5.3.0-rc1+ #99
  RIP: 0010:arch_stack_walk_user+0x6e/0xf6
  Call Trace:
   
   stack_trace_save_user+0x10a/0x16d
   trace_buffer_unlock_commit_regs+0x185/0x240
   trace_event_buffer_commit+0xec/0x330
   trace_event_raw_event_irq_handler_entry+0x159/0x1e0
   __handle_irq_event_percpu+0x22d/0x440
   handle_irq_event_percpu+0x70/0x100
   handle_irq_event+0x5a/0x8b
   handle_edge_irq+0x12f/0x3f0
   handle_irq+0x34/0x40
   do_IRQ+0xa6/0x1f0
   common_interrupt+0xf/0xf
   

Fix it by calling __range_not_ok() directly instead of access_ok() as
copy_from_user_nmi() does. This is fine here because the actual copy is
inside a pagefault disabled region.

Reported-by: Juri Lelli 
Signed-off-by: Eiichi Tsukata 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190722083216.16192-2-de...@etsukata.com

---
 arch/x86/kernel/stacktrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 4f36d3241faf..2d6898c2cb64 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -100,7 +100,7 @@ copy_stack_frame(const void __user *fp, struct stack_frame_user *frame)
 {
int ret;
 
-   if (!access_ok(fp, sizeof(*frame)))
+   if (__range_not_ok(fp, sizeof(*frame), TASK_SIZE))
return 0;
 
ret = 1;


[PATCH 1/1] x86/stacktrace: Fix userstacktrace access_ok() WARNING in irq events

2019-07-22 Thread Eiichi Tsukata
When arch_stack_walk_user() is called from irq context, access_ok() can
trigger the following WARNING if compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y.

Reproducer:

  // CONFIG_DEBUG_ATOMIC_SLEEP=y
  # cd /sys/kernel/debug/tracing
  # echo 1 > options/userstacktrace
  # echo 1 > events/irq/irq_handler_entry/enable

WARNING:

  WARNING: CPU: 0 PID: 2649 at arch/x86/kernel/stacktrace.c:103 arch_stack_walk_user+0x6e/0xf6
  Modules linked in:
  CPU: 0 PID: 2649 Comm: bash Not tainted 5.3.0-rc1+ #99
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 
04/01/2014
  RIP: 0010:arch_stack_walk_user+0x6e/0xf6
  Code: 00 48 89 45 c8 48 89 da 49 89 c7 49 89 c5 65 8b 05 5f 3f 3c 72 a9 00 01 
1f 00 74 10 48 8b 45 c8 8b 80 58 16 00 00 85 c0 75 02 <0f> 0b 49 8b 85 18 17 00 
00 48 83 e8 10 48 39 c2 77 32 41 83 85 58
  RSP: 0018:888068a09bc0 EFLAGS: 00010046
  RAX:  RBX: 5567f28dc6a0 RCX: 8ddf6b71
  RDX: 5567f28dc6a0 RSI: 7f3fcf7d20f8 RDI: 888068475048
  RBP: 888068a09bf8 R08: 8ddf6b4b R09: ed100ced26f1
  R10: ed100ced26f0 R11: 888067693787 R12: 88807c1bff58
  R13: 888067693780 R14: 888068a09c28 R15: 888067693780
  FS:  7f3fcf6e3740() GS:888068a0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 55cd64646630 CR3: 5e230004 CR4: 00160ef0
  Call Trace:
   
   ? stack_trace_save+0xc0/0xc0
   stack_trace_save_user+0x10a/0x16d
   ? stack_trace_save_tsk_reliable+0x1c0/0x1c0
   ? __kasan_check_read+0x11/0x20
   trace_buffer_unlock_commit_regs+0x185/0x240
   trace_event_buffer_commit+0xec/0x330
   trace_event_raw_event_irq_handler_entry+0x159/0x1e0
   ? perf_trace_softirq+0x250/0x250
   ? check_chain_key+0x1da/0x2d0
   ? perf_trace_softirq+0x250/0x250
   __handle_irq_event_percpu+0x22d/0x440
   handle_irq_event_percpu+0x70/0x100
   ? __handle_irq_event_percpu+0x440/0x440
   ? __kasan_check_read+0x11/0x20
   ? preempt_count_sub+0x1a/0x120
   handle_irq_event+0x5a/0x8b
   handle_edge_irq+0x12f/0x3f0
   handle_irq+0x34/0x40
   do_IRQ+0xa6/0x1f0
   common_interrupt+0xf/0xf
   

Fix it by calling __range_not_ok() directly instead of access_ok() as
copy_from_user_nmi() does. This is fine here because the actual copy is
inside a pagefault disabled region.

Reported-by: Juri Lelli 
Signed-off-by: Eiichi Tsukata 
---
 arch/x86/kernel/stacktrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 4f36d3241faf..2d6898c2cb64 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -100,7 +100,7 @@ copy_stack_frame(const void __user *fp, struct stack_frame_user *frame)
 {
int ret;
 
-   if (!access_ok(fp, sizeof(*frame)))
+   if (__range_not_ok(fp, sizeof(*frame), TASK_SIZE))
return 0;
 
ret = 1;
-- 
2.21.0



[PATCH 0/1] x86/stacktrace: Fix userstacktrace access_ok() WARNING in irq events

2019-07-22 Thread Eiichi Tsukata
Hello

I also hit the same WARNING previously reported by Juri.

Hiramatsu san's patch looks good to me, but I found that the perf and
oprofile code do a similar thing by just directly calling
__range_not_ok().

  perf: perf_callchain_user()@arch/x86/events/core.c
  oprofile: dump_user_backtrace()@arch/x86/oprofile/backtrace.c

So for simplicity, I wrote a patch to fix the warning the same way the
other code does.

Ideally, we should merge these similar stacktrace code paths (perf, ftrace,
oprofile) into one, but this time I made the minimal fix.

Eiichi Tsukata (1):
  x86/stacktrace: Fix userstacktrace access_ok() WARNING in irq events

 arch/x86/kernel/stacktrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.21.0



workqueue: possible deadlock when setting sysfs "debug_force_rr_cpu"

2019-07-20 Thread Eiichi Tsukata
Hello

This is just a report for someone who hits the same problem.
Writing 1 to the sysfs parameter debug_force_rr_cpu can trigger the
following lockdep warning.

Reproducer:

  # ssh localhost // to use pty
  # echo 1 > /sys/module/workqueue/parameters/debug_force_rr_cpu


This problem is similar to the one which syzkaller previously found
and it had been discussed in the thread: 
  - 
https://lore.kernel.org/lkml/20180607051019.GA10406@jagdpanzerIV/T/#ec7d3d2bdb8ccdf0f063fd2e5b2a4d8fd45d8d9e5

The fundamental solution for the problem will be the ongoing printk() rework:
  - 
https://lore.kernel.org/lkml/20190607162349.18199-1-john.ogn...@linutronix.de/

WARNING:

[   42.605235] ==
[   42.605236] WARNING: possible circular locking dependency detected
[   42.605237] 5.2.0+ #63 Tainted: G U   
[   42.605238] --
[   42.605239] sysctl_fuzzer/2697 is trying to acquire lock:
[   42.605239] d28828ac (console_owner){-.-.}, at: console_unlock+0x23f/0x9a0
[   42.605243] 
[   42.605243] but task is already holding lock:
[   42.605244] f7c4cf7b (&(&port->lock)->rlock){-.-.}, at: pty_write+0xd5/0x1e0
[   42.605247] 
[   42.605248] which lock already depends on the new lock.
[   42.605248] 
[   42.605249] 
[   42.605250] the existing dependency chain (in reverse order) is:
[   42.605250] 
[   42.605251] -> #2 (&(&port->lock)->rlock){-.-.}:
[   42.605254]_raw_spin_lock_irqsave+0x44/0x90
[   42.605255]tty_port_tty_get+0x24/0x90
[   42.605255]tty_port_default_wakeup+0x10/0x30
[   42.605256]tty_port_tty_wakeup+0x5c/0x70
[   42.605257]uart_write_wakeup+0x3c/0x50
[   42.605258]serial8250_tx_chars+0x3d2/0xb10
[   42.605259]serial8250_handle_irq.part.0+0x14a/0x240
[   42.605259]serial8250_default_handle_irq+0x8c/0xf0
[   42.605260]serial8250_interrupt+0xd6/0x150
[   42.605261]__handle_irq_event_percpu+0x1c2/0x640
[   42.605262]handle_irq_event_percpu+0x71/0x150
[   42.605263]handle_irq_event+0xa7/0x134
[   42.605263]handle_edge_irq+0x218/0xa50
[   42.605264]handle_irq+0x46/0x60
[   42.605265]do_IRQ+0xbc/0x250
[   42.605265]ret_from_intr+0x0/0x1d
[   42.605266]default_idle+0x2f/0x2d0
[   42.605267]arch_cpu_idle+0x15/0x20
[   42.605268]default_idle_call+0x40/0x60
[   42.605268]do_idle+0x315/0x3c0
[   42.605269]cpu_startup_entry+0x20/0x22
[   42.605270]rest_init+0x1ea/0x333
[   42.605270]arch_call_rest_init+0xe/0x1b
[   42.605271]start_kernel+0x6e0/0x71b
[   42.605272]x86_64_start_reservations+0x24/0x26
[   42.605273]x86_64_start_kernel+0x75/0x79
[   42.605273]secondary_startup_64+0xa4/0xb0
[   42.605274] 
[   42.605275] -> #1 (&port_lock_key){-.-.}:
[   42.605277]_raw_spin_lock_irqsave+0x44/0x90
[   42.605278]serial8250_console_write+0x142/0x730
[   42.605279]univ8250_console_write+0x6c/0xa0
[   42.605280]console_unlock+0x5cf/0x9a0
[   42.605280]register_console+0x559/0xa10
[   42.605281]univ8250_console_init+0x23/0x2d
[   42.605282]console_init+0x339/0x4b9
[   42.605283]start_kernel+0x43d/0x71b
[   42.605283]x86_64_start_reservations+0x24/0x26
[   42.605284]x86_64_start_kernel+0x75/0x79
[   42.605285]secondary_startup_64+0xa4/0xb0
[   42.605286] 
[   42.605286] -> #0 (console_owner){-.-.}:
[   42.605289]__lock_acquire+0x2921/0x56a0
[   42.605290]lock_acquire+0x151/0x380
[   42.605290]console_unlock+0x29c/0x9a0
[   42.605291]vprintk_emit+0xee/0x2b0
[   42.605292]vprintk_default+0x1f/0x30
[   42.605292]vprintk_func+0x50/0x1df
[   42.605293]printk+0xb2/0xe3
[   42.605294]__queue_work.cold+0x79/0xc3
[   42.605295]queue_work_on+0x72/0x80
[   42.605295]tty_flip_buffer_push+0xbf/0x120
[   42.605296]pty_write+0x164/0x1e0
[   42.605297]n_tty_write+0x368/0x10e0
[   42.605298]tty_write+0x381/0x710
[   42.605298]__vfs_write+0x65/0x110
[   42.605299]vfs_write+0x171/0x4b0
[   42.605300]ksys_write+0x122/0x220
[   42.605301]__x64_sys_write+0x73/0xb0
[   42.605302]do_syscall_64+0x9a/0x450
[   42.605303]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   42.605304] 
[   42.605305] other info that might help us debug this:
[   42.605306] 
[   42.605307] Chain exists of:
[   42.605308]   console_owner --> &port_lock_key --> &(&port->lock)->rlock
[   42.605314] 
[   42.605316]  Possible unsafe locking scenario:
[   42.605316] 
[   42.605318]CPU0CPU1
[   42.605319]
[   42.605320]   lock(&(&port->lock)->rlock);
[   42.605323]lock(&port_lock_key);
[   42.605326]lock(&(&port->lock)->rlock);
[   42.605329]   

Re: [PATCH v3 0/6] Tracing vs CR2

2019-07-20 Thread Eiichi Tsukata


On 2019/07/20 21:49, Andy Lutomirski wrote:
> On Fri, Jul 19, 2019 at 8:59 PM Eiichi Tsukata  wrote:
>>
...
>>
>> 
>>
>> debug() // dr6: 0x4ff0, user_mode: 1
>>   TRACE_IRQS_OFF
>> arch_stack_user_walk()
>>   debug()  // dr6: 0x4ff1 == 0x4ff0 | 0x0ff1 ... (*)
>> do_debug()
>>   WARN_ON_ONCE
>>   do_debug() // dr6: 0x0ff0(cleared in the above do_debug())
> 
> The dr6 register will indeed be cleared like this, but the dr6
> variable should still be 0x4ff0.

I should have used "DR6" to mean the register, not the variable.
"dr6" was ambiguous.

> 
>>
...
>>
>> Note: printk() in do_debug() can cause an infinite loop (printk() ->
>> irq_disable() -> do_debug() -> printk() ...), so printk_deferred()
>> was preferable.
>>
> 
> Shouldn't that be fixed with my patches?  It should only be able to
> recurse two deep: do_debug() from user mode can indeed trip
> breakpoints, but the next do_debug() will clear DR7 in paranoid_entry.
> 

Sorry, I missed that. Now I confirmed your patches fixed the loop.

Thanks

Eiichi 




Re: [PATCH v3 0/6] Tracing vs CR2

2019-07-19 Thread Eiichi Tsukata


On 2019/07/19 5:27, Andy Lutomirski wrote:
> Hi all-
> 
> I suspect that a bunch of the bugs you're all finding boil down to:
> 
>  - Nested debug exceptions could corrupt the outer exception's DR6.
>  - Nested debug exceptions in which *both* exceptions came from the
> kernel were probably all kinds of buggy
>  - Data breakpoints in bad places in the kernel were bad news
> 
> Could you give this not-quite-finished series a try?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/
> 

Though I'm still trying to find other cases (other areas which could
be buggy if we set hw breakpoints), as far as I have tested there is
no problem so far.

If I understand correctly, the call trace and the dr6 value will be: 



debug() // dr6: 0x4ff0, user_mode: 1
  TRACE_IRQS_OFF
arch_stack_user_walk()
  debug()  // dr6: 0x4ff1 == 0x4ff0 | 0x0ff1 ... (*)
do_debug()
  WARN_ON_ONCE
  do_debug() // dr6: 0x0ff0(cleared in the above do_debug())

(*) :
>   * The Intel SDM says:
>   *
>   *   Certain debug exceptions may clear bits 0-3. The remaining
>   *   contents of the DR6 register are never cleared by the
>   *   processor. To avoid confusion in identifying debug
>   *   exceptions, debug handlers should clear the register before
>   *   returning to the interrupted task.



Note: printk() in do_debug() can cause an infinite loop (printk() ->
irq_disable() -> do_debug() -> printk() ...), so printk_deferred()
was preferable.

Thanks

Eiichi


Re: [PATCH] tracing: Fix user stack trace "??" output

2019-07-17 Thread Eiichi Tsukata
Hello Steven

Would you review the patch?

On 2019/06/30 17:54, Eiichi Tsukata wrote:
> Commit c5c27a0a5838 ("x86/stacktrace: Remove the pointless ULONG_MAX
> marker") removes ULONG_MAX marker from user stack trace entries but
> trace_user_stack_print() still uses the marker and it outputs unnecessary
> "??".
> 
> For example:
> 
> less-1911  [001] d..234.758944: 
>=>  <7f16f2295910>
>=> ??
>=> ??
>=> ??
>=> ??
>=> ??
>=> ??
>=> ??
> 
> The user stack trace code zeroes the storage before saving the stack, so if
> the trace is shorter than the maximum number of entries it can terminate
> the print loop if a zero entry is detected.
> 
> Fixes: 4285f2fcef80 ("tracing: Remove the ULONG_MAX stack trace hackery")
> Signed-off-by: Eiichi Tsukata 
> ---
>  kernel/trace/trace_output.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index ba751f993c3b..cab4a5398f1d 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -1109,17 +1109,10 @@ static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
>   for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
>   unsigned long ip = field->caller[i];
>  
> - if (ip == ULONG_MAX || trace_seq_has_overflowed(s))
> + if (!ip || trace_seq_has_overflowed(s))
>   break;
>  
>   trace_seq_puts(s, " => ");
> -
> - if (!ip) {
> - trace_seq_puts(s, "??");
> - trace_seq_putc(s, '\n');
> - continue;
> - }
> -
>   seq_print_user_ip(s, mm, ip, flags);
>   trace_seq_putc(s, '\n');
>   }
> 


Re: [PATCH v3 0/6] Tracing vs CR2

2019-07-17 Thread Eiichi Tsukata



On 2019/07/17 6:51, Vegard Nossum wrote:
> 
...
> 
> Got a different one:
> 
> WARNING: CPU: 0 PID: 2150 at arch/x86/kernel/traps.c:791 do_debug+0xfe/0x240
> CPU: 0 PID: 2150 Comm: init Not tainted 5.2.0+ #124
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> RIP: 0010:do_debug+0xfe/0x240
...


Hello Vegard

I found a way to reproduce the #DB WARNING by setting a hardware watchpoint
on the address arch_stack_walk_user() will touch.


[Steps to Reproduce #DB WARNING]

poc.s:

```
.global _start

.text
_start:
# exit(0)
mov $60, %rax
xor %rdi, %rdi
syscall
```

build:

  # gcc -g -c poc.s; ld -o poc poc.o

setup ftrace:

  # echo 1 > options/userstacktrace
  # echo 1 > events/preemptirq/irq_disable/enable

exec gdb:(set hardware watch point to $rbp)

  [18:28:48 root@vm loops]# gdb ./poc
  GNU gdb (GDB) Fedora 8.3-6.fc30
  Copyright (C) 2019 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later 
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.
  Type "show copying" and "show warranty" for details.
  This GDB was configured as "x86_64-redhat-linux-gnu".
  Type "show configuration" for configuration details.
  For bug reporting instructions, please see:
  .
  Find the GDB manual and other documentation resources online at:
  .

  For help, type "help".
  Type "apropos word" to search for commands related to "word"...
  Reading symbols from ./poc...
  (gdb) l
  1   .global _start
  2
  3   .text
  4   _start:
  5   # exit(0)
  6   mov $60, %rax
  7   xor %rdi, %rdi
  8   syscall
  (gdb) b 6
  Breakpoint 1 at 0x401000: file poc.s, line 6.
  (gdb) start
  Function "main" not defined.
  Make breakpoint pending on future shared library load? (y or [n]) n
  Starting program: /root/tmp/loops/poc

  Breakpoint 1, _start () at poc.s:6
  6   mov $60, %rax
  (gdb) set $rbp = $rsp
  (gdb) p $rbp
  $1 = (void *) 0x7fffe4b0
  (gdb) rwatch *0x7fffe4b0
  Hardware read watchpoint 2: *0x7fffe4b0
  (gdb) c
  Continuing.
  [Inferior 1 (process 2744) exited normally]

dmesg:

[  564.646159][ T2744] WARNING: CPU: 0 PID: 2744 at arch/x86/kernel/traps.c:791 do_debug+0x220/0x490
[  564.648581][ T2744] Modules linked in:
[  564.649530][ T2744] CPU: 0 PID: 2744 Comm: poc Tainted: GW 
5.2.0+ #77
[  564.651121][ T2744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.12.0-2.fc30 04/01/2014
[  564.653569][ T2744] RIP: 0010:do_debug+0x220/0x490
[  564.654847][ T2744] Code: 00 48 8b 95 60 ff ff ff 48 b8 00 00 00 00 00 fc ff 
df 48 c1 ea 03 80 3c 02 00 0f 85 03 02 00 00 41 f6 87 88 00 00 00 03 75 60 <0f> 
0b 4c 89 f2 49 81 e5 ff bf ff ff 48 b8 00 00 00 00 00 fc ff df
[  564.659905][ T2744] RSP: :fe014e98 EFLAGS: 00010046
[  564.661500][ T2744] RAX: dc00 RBX: 1fc029d8 RCX: 
11100f81c2d3
[  564.663531][ T2744] RDX: 1fc029fc RSI:  RDI: 
85c19f00
[  564.665553][ T2744] RBP: fe014f48 R08: fe014fe8 R09: 
88807c0e08a0
[  564.667637][ T2744] R10: 0001 R11: 11100d1042ba R12: 
88807c0e
[  564.669700][ T2744] R13: 4001 R14: 88807c0e1698 R15: 
fe014f58
[  564.671768][ T2744] FS:  () GS:88806880() 
knlGS:
[  564.674032][ T2744] CS:  0010 DS:  ES:  CR0: 80050033
[  564.675752][ T2744] CR2: 0001 CR3: 5fe08002 CR4: 
00160ef0
[  564.677570][ T2744] DR0: 7fffe4b0 DR1:  DR2: 

[  564.679686][ T2744] DR3:  DR6: 0ff0 DR7: 
000f0602
[  564.681788][ T2744] Call Trace:
[  564.682700][ T2744]  <#DB>
[  564.683492][ T2744]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  564.684954][ T2744]  ? do_int3+0x1f0/0x1f0
[  564.686074][ T2744]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  564.687512][ T2744]  debug+0x2d/0x70
[  564.688456][ T2744] RIP: 0010:arch_stack_walk_user+0x7d/0xf2
[  564.689899][ T2744] Code: 00 0f 85 8d 00 00 00 49 8b 87 d8 16 00 00 48 83 e8 
10 49 39 c6 77 32 41 83 87 e8 15 00 00 01 0f 1f 00 0f ae e8 31 c0 49 8b 0e <85> 
c0 75 6d 49 8b 76 08 0f 1f 00 85 c0 74 1f 65 48 8b 04 25 00 ef
[  564.694763][ T2744] RSP: :888061fb7c48 EFLAGS: 0046
[  564.696316][ T2744] RAX:  RBX: 88807c0e RCX: 
0001
[  564.698342][ T2744] RDX: 11100ba08e93 RSI: 00401009 RDI: 
888061fb7cbc
[  564.700323][ T2744] RBP: 888061fb7c80 R08: 11100ba08e93 R09: 
88805d04749c
[  564.702337][ T2744] R10: ed100ba08e9b R11: 88805d0474db R12: 
888061fb7cb0
[  564.704359][ T2744] 

[tip:x86/urgent] x86/stacktrace: Prevent infinite loop in arch_stack_walk_user()

2019-07-11 Thread tip-bot for Eiichi Tsukata
Commit-ID:  cbf5b73d162b22e044fe0b7d51dcaa33be065253
Gitweb: https://git.kernel.org/tip/cbf5b73d162b22e044fe0b7d51dcaa33be065253
Author: Eiichi Tsukata 
AuthorDate: Thu, 11 Jul 2019 11:35:01 +0900
Committer:  Thomas Gleixner 
CommitDate: Thu, 11 Jul 2019 08:22:03 +0200

x86/stacktrace: Prevent infinite loop in arch_stack_walk_user()

arch_stack_walk_user() checks `if (fp == frame.next_fp)` to prevent an
infinite loop caused by a self reference, but that is not enough for a
circular reference.

Once a missing return address is found, there is no point in continuing
the loop, so break out.

Fixes: 02b67518e2b1 ("tracing: add support for userspace stacktraces in 
tracing/iter_ctrl")
Signed-off-by: Eiichi Tsukata 
Signed-off-by: Thomas Gleixner 
Acked-by: Linus Torvalds 
Link: https://lkml.kernel.org/r/20190711023501.963-1-de...@etsukata.com

---
 arch/x86/kernel/stacktrace.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 2abf27d7df6b..4f36d3241faf 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -129,11 +129,9 @@ void arch_stack_walk_user(stack_trace_consume_fn 
consume_entry, void *cookie,
break;
if ((unsigned long)fp < regs->sp)
break;
-   if (frame.ret_addr) {
-   if (!consume_entry(cookie, frame.ret_addr, false))
-   return;
-   }
-   if (fp == frame.next_fp)
+   if (!frame.ret_addr)
+   break;
+   if (!consume_entry(cookie, frame.ret_addr, false))
break;
fp = frame.next_fp;
}
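
The failure mode the fix addresses can be sketched in plain C. This is a toy model, not kernel code: `struct frame` stands in for `struct stack_frame_user`, and the walkers return an iteration count capped at `max_iter` instead of consuming entries.

```c
/* Toy model of a user stack frame (stand-in for struct stack_frame_user). */
struct frame {
    struct frame *next_fp;
    unsigned long ret_addr;
};

/* Old logic: only a direct self-reference (fp == fp->next_fp) stops the
 * walk, so an A -> B -> A cycle of frames with no return address spins
 * until the iteration cap. */
int walk_old(const struct frame *fp, int max_iter)
{
    int iters = 0;
    while (fp && iters < max_iter) {
        iters++;
        if (fp == fp->next_fp)          /* misses longer cycles */
            break;
        fp = fp->next_fp;
    }
    return iters;
}

/* Fixed logic: a frame without a return address ends the walk, which
 * also bounds any cycle made of empty frames. */
int walk_new(const struct frame *fp, int max_iter)
{
    int iters = 0;
    while (fp && iters < max_iter) {
        iters++;
        if (!fp->ret_addr)              /* nothing left to report */
            break;
        fp = fp->next_fp;
    }
    return iters;
}
```

In the real fix the walk is additionally bounded because consume_entry() returns false once the storage is full; the sketch only shows why the empty-frame cycle terminates.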


[PATCH] x86/stacktrace: Fix infinite loop in arch_stack_walk_user()

2019-07-10 Thread Eiichi Tsukata
Current arch_stack_walk_user() checks `if (fp == frame.next_fp)`
to prevent an infinite loop caused by self reference, but it's not
enough for a circular reference.

Once we find a frame lacking a return address, there is no need to continue
the loop, so let's break out.

Fixes: 02b67518e2b1 ("tracing: add support for userspace stacktraces in 
tracing/iter_ctrl")
Signed-off-by: Eiichi Tsukata 
---
 arch/x86/kernel/stacktrace.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 2abf27d7df6b..b1a1f4b4c943 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -129,11 +129,8 @@ void arch_stack_walk_user(stack_trace_consume_fn 
consume_entry, void *cookie,
break;
if ((unsigned long)fp < regs->sp)
break;
-   if (frame.ret_addr) {
-   if (!consume_entry(cookie, frame.ret_addr, false))
-   return;
-   }
-   if (fp == frame.next_fp)
+   if (!frame.ret_addr ||
+   !consume_entry(cookie, frame.ret_addr, false))
break;
fp = frame.next_fp;
}
-- 
2.21.0



Re: [PATCH v2 5/7] x86/mm, tracing: Fix CR2 corruption

2019-07-08 Thread Eiichi Tsukata



On 2019/07/08 18:42, Eiichi Tsukata wrote:
> 
> 
> On 2019/07/08 17:58, Eiichi Tsukata wrote:
> 
>>
>> By the way, is there possibility that the WARNING(#GP in execve(2)) which 
>> Steven
>> previously hit? : 
>> https://lore.kernel.org/lkml/20190321095502.47b51...@gandalf.local.home/
>>
>> Even if there were, it will *Not* be the bug introduced by this patch series,
>> but the bug revealed by this series.
>>
> 
> I reproduced with the kernel:
>   - 
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/core=057b48d544b384c47ed921f616128b802ab1fdc3
> 
> Reproducer:
> 
>   # echo 1 > events/preemptirq/irq_disable/enable
>   # echo userstacktrace > trace_options
>   # service sshd restart
> 
> [   20.338200] [ cut here ]
> [   20.339471] General protection fault in user access. Non-canonical address?
> [   20.339488] WARNING: CPU: 1 PID: 1957 at arch/x86/mm/extable.c:126 
> ex_handler_uaccess+0x52/0x60
> [   20.342550] Modules linked in:
> [   20.343209] CPU: 1 PID: 1957 Comm: systemctl Tainted: GW 
> 5.2.0-rc7+ #48
> [   20.344935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.12.0-2.fc30 04/01/2014
> [   20.346688] RIP: 0010:ex_handler_uaccess+0x52/0x60
> [   20.347667] Code: c4 08 b8 01 00 00 00 5b 5d c3 80 3d b6 8c 94 01 00 75 db 
> 48 c7 c7 b8 e4 6f aa 48 89 75 f0 c6 05 ad
> [   20.351508] RSP: 0018:b28940453688 EFLAGS: 00010082
> [   20.352707] RAX:  RBX: aa2023c0 RCX: 
> 0001
> [   20.354478] RDX:  RSI:  RDI: 
> aa7dab8d
> [   20.355706] RBP: b28940453698 R08:  R09: 
> 0001
> [   20.356665] R10:  R11:  R12: 
> b28940453728
> [   20.357594] R13:  R14: 000d R15: 
> 
> [   20.358876] FS:  7fec9fa248c0() GS:921d3d80() 
> knlGS:
> [   20.360573] CS:  0010 DS:  ES:  CR0: 80050033
> [   20.361792] CR2: 0004 CR3: 74d4e000 CR4: 
> 06e0
> [   20.363300] Call Trace:
> [   20.363830]  fixup_exception+0x4a/0x61
> [   20.364637]  __idt_do_general_protection+0x65/0x1d0
> [   20.365684]  do_general_protection+0x1e/0x30
> [   20.366654]  general_protection+0x1e/0x30
> [   20.367513] RIP: 0010:arch_stack_walk_user+0x7a/0x100
> [   20.368583] Code: 00 01 1f 00 0f 85 8d 00 00 00 49 8b 87 98 16 00 00 48 83 
> e8 10 49 39 c6 77 32 41 83 87 d0 15 00 04
> [   20.372797] RSP: 0018:b289404537d0 EFLAGS: 00010046
> [   20.374027] RAX:  RBX: 921d383bcb00 RCX: 
> 0074726174736572
> [   20.375674] RDX: 921d38393b84 RSI: 7265732e64687373 RDI: 
> b28940453818
> [   20.377179] RBP: b28940453808 R08: 0001 R09: 
> 921d3d169c00
> [   20.378681] R10: 0b60 R11: 921d38393b70 R12: 
> b28940453818
> [   20.380195] R13: b28940453f58 R14: 0074726174736572 R15: 
> 921d383bcb00
> [   20.381703]  ? profile_setup.cold+0x9b/0x9b
> [   20.382604]  stack_trace_save_user+0x60/0x7d
> [   20.383520]  trace_buffer_unlock_commit_regs+0x17b/0x220
> [   20.384686]  trace_event_buffer_commit+0x6b/0x200
> [   20.385741]  trace_event_raw_event_preemptirq_template+0x7b/0xc0
> [   20.387067]  ? get_page_from_freelist+0x10a/0x13b0
> [   20.388129]  ? __alloc_pages_nodemask+0x178/0x380
> [   20.389132]  trace_hardirqs_off+0xc6/0x100
> [   20.390006]  get_page_from_freelist+0x10a/0x13b0
> [   20.390997]  ? kvm_clock_read+0x18/0x30
> [   20.391813]  ? sched_clock+0x9/0x10
> [   20.392647]  __alloc_pages_nodemask+0x178/0x380
> [   20.393706]  alloc_pages_current+0x87/0xe0
> [   20.394609]  __page_cache_alloc+0xcd/0x110
> [   20.395491]  __do_page_cache_readahead+0xa1/0x1a0
> [   20.396547]  ondemand_readahead+0x220/0x540
> [   20.397486]  page_cache_sync_readahead+0x35/0x50
> [   20.398511]  generic_file_read_iter+0x8d8/0xbd0
> [   20.399335]  ? __might_sleep+0x4b/0x80
> [   20.400110]  ext4_file_read_iter+0x35/0x40
> [   20.400937]  new_sync_read+0x10f/0x1a0
> [   20.401833]  __vfs_read+0x29/0x40
> [   20.402608]  vfs_read+0xc0/0x170
> [   20.403319]  kernel_read+0x31/0x50

The cause was obvious as Linus already said in:
  
https://lore.kernel.org/lkml/CAHk-=whvxjjbbomqssrb-xhkc6xm8zghsbrgpxh14usey_g...@mail.gmail.com/

stack_trace_save_user() is called after set_fs(KERNEL_DS).
I don't know why systemctl always hits this, but a user space process can
make up its stack as it likes, so we have to check it ourselves?


ssize_t kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
{
mm_segment_t 

Re: [PATCH v2 5/7] x86/mm, tracing: Fix CR2 corruption

2019-07-08 Thread Eiichi Tsukata



On 2019/07/08 17:58, Eiichi Tsukata wrote:

> 
> By the way, is there possibility that the WARNING(#GP in execve(2)) which 
> Steven
> previously hit? : 
> https://lore.kernel.org/lkml/20190321095502.47b51...@gandalf.local.home/
> 
> Even if there were, it will *Not* be the bug introduced by this patch series,
> but the bug revealed by this series.
> 

I reproduced with the kernel:
  - 
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/core=057b48d544b384c47ed921f616128b802ab1fdc3

Reproducer:

  # echo 1 > events/preemptirq/irq_disable/enable
  # echo userstacktrace > trace_options
  # service sshd restart

[   20.338200] [ cut here ]
[   20.339471] General protection fault in user access. Non-canonical address?
[   20.339488] WARNING: CPU: 1 PID: 1957 at arch/x86/mm/extable.c:126 
ex_handler_uaccess+0x52/0x60
[   20.342550] Modules linked in:
[   20.343209] CPU: 1 PID: 1957 Comm: systemctl Tainted: GW 
5.2.0-rc7+ #48
[   20.344935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.12.0-2.fc30 04/01/2014
[   20.346688] RIP: 0010:ex_handler_uaccess+0x52/0x60
[   20.347667] Code: c4 08 b8 01 00 00 00 5b 5d c3 80 3d b6 8c 94 01 00 75 db 
48 c7 c7 b8 e4 6f aa 48 89 75 f0 c6 05 ad
[   20.351508] RSP: 0018:b28940453688 EFLAGS: 00010082
[   20.352707] RAX:  RBX: aa2023c0 RCX: 0001
[   20.354478] RDX:  RSI:  RDI: aa7dab8d
[   20.355706] RBP: b28940453698 R08:  R09: 0001
[   20.356665] R10:  R11:  R12: b28940453728
[   20.357594] R13:  R14: 000d R15: 
[   20.358876] FS:  7fec9fa248c0() GS:921d3d80() 
knlGS:
[   20.360573] CS:  0010 DS:  ES:  CR0: 80050033
[   20.361792] CR2: 0004 CR3: 74d4e000 CR4: 06e0
[   20.363300] Call Trace:
[   20.363830]  fixup_exception+0x4a/0x61
[   20.364637]  __idt_do_general_protection+0x65/0x1d0
[   20.365684]  do_general_protection+0x1e/0x30
[   20.366654]  general_protection+0x1e/0x30
[   20.367513] RIP: 0010:arch_stack_walk_user+0x7a/0x100
[   20.368583] Code: 00 01 1f 00 0f 85 8d 00 00 00 49 8b 87 98 16 00 00 48 83 
e8 10 49 39 c6 77 32 41 83 87 d0 15 00 04
[   20.372797] RSP: 0018:b289404537d0 EFLAGS: 00010046
[   20.374027] RAX:  RBX: 921d383bcb00 RCX: 0074726174736572
[   20.375674] RDX: 921d38393b84 RSI: 7265732e64687373 RDI: b28940453818
[   20.377179] RBP: b28940453808 R08: 0001 R09: 921d3d169c00
[   20.378681] R10: 0b60 R11: 921d38393b70 R12: b28940453818
[   20.380195] R13: b28940453f58 R14: 0074726174736572 R15: 921d383bcb00
[   20.381703]  ? profile_setup.cold+0x9b/0x9b
[   20.382604]  stack_trace_save_user+0x60/0x7d
[   20.383520]  trace_buffer_unlock_commit_regs+0x17b/0x220
[   20.384686]  trace_event_buffer_commit+0x6b/0x200
[   20.385741]  trace_event_raw_event_preemptirq_template+0x7b/0xc0
[   20.387067]  ? get_page_from_freelist+0x10a/0x13b0
[   20.388129]  ? __alloc_pages_nodemask+0x178/0x380
[   20.389132]  trace_hardirqs_off+0xc6/0x100
[   20.390006]  get_page_from_freelist+0x10a/0x13b0
[   20.390997]  ? kvm_clock_read+0x18/0x30
[   20.391813]  ? sched_clock+0x9/0x10
[   20.392647]  __alloc_pages_nodemask+0x178/0x380
[   20.393706]  alloc_pages_current+0x87/0xe0
[   20.394609]  __page_cache_alloc+0xcd/0x110
[   20.395491]  __do_page_cache_readahead+0xa1/0x1a0
[   20.396547]  ondemand_readahead+0x220/0x540
[   20.397486]  page_cache_sync_readahead+0x35/0x50
[   20.398511]  generic_file_read_iter+0x8d8/0xbd0
[   20.399335]  ? __might_sleep+0x4b/0x80
[   20.400110]  ext4_file_read_iter+0x35/0x40
[   20.400937]  new_sync_read+0x10f/0x1a0
[   20.401833]  __vfs_read+0x29/0x40
[   20.402608]  vfs_read+0xc0/0x170
[   20.403319]  kernel_read+0x31/0x50
[   20.404128]  prepare_binprm+0x150/0x190
[   20.404766]  __do_execve_file.isra.0+0x4c0/0xaa0
[   20.405559]  __x64_sys_execve+0x39/0x50
[   20.406173]  do_syscall_64+0x55/0x1b0
[   20.406762]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   20.407703] RIP: 0033:0x7fec9f0ee647
[   20.408441] Code: ff 76 e0 f7 d8 64 41 89 01 eb d8 0f 1f 84 00 00 00 00 00 
f7 d8 64 41 89 01 eb d7 0f 1f 84 00 00 08
[   20.411636] RSP: 002b:7ffe7f316228 EFLAGS: 0202 ORIG_RAX: 
003b
[   20.412686] RAX: ffda RBX: 7fec9f8105a4 RCX: 7fec9f0ee647
[   20.414047] RDX: 7ffe7f3167d8 RSI: 7ffe7f316230 RDI: 7fec9f73cea0
[   20.415048] RBP: 7ffe7f316480 R08: 0030 R09: 0001
[   20.416076] R10: 7ffe7f316498 R11: 0202 R12: 7fec9f73cea0
[   20.417556] R13: 0001 R14: 0001 R15: 
[   20.419032] irq event stamp: 832
[   20.419493] hardirqs last  enabled at (831)

Re: [PATCH v2 5/7] x86/mm, tracing: Fix CR2 corruption

2019-07-08 Thread Eiichi Tsukata



On 2019/07/08 16:48, Peter Zijlstra wrote:
...

> 
> Or are we going to put the CR2 save/restore on every single tracepoint?
> But then we also need it on the mcount/fentry stubs and we again have
> multiple places.
> 
> Whereas if we stick it in the entry path, like I proposed, we fix it in
> one location and we're done.
> 

Thanks to your detailed comments and the discussion, I've come to realize
that your solution, "save CR2 early in the entry path", is the simplest and
most reasonable.

By the way, is there a possibility of hitting the WARNING (#GP in execve(2))
which Steven previously hit?:
https://lore.kernel.org/lkml/20190321095502.47b51...@gandalf.local.home/

Even if there were, it would *not* be a bug introduced by this patch series,
but a bug revealed by it.

Thanks,

Eiichi


Re: [PATCH v2 5/7] x86/mm, tracing: Fix CR2 corruption

2019-07-06 Thread Eiichi Tsukata



On 2019/07/07 7:27, Steven Rostedt wrote:

> 
> We also have to deal with reading vmalloc'd data as that can fault too.
> The perf ring buffer IIUC is vmalloc, so if perf records in one of
> these locations, then the reading of the vmalloc area has a potential
> to fault corrupting the CR2 register as well. Or have we changed
> vmalloc to no longer fault?
> 
> -- Steve
> 

It seems that the perf ring buffer does not normally use vmalloc.
Whether it does depends on CONFIG_PERF_USE_VMALLOC, introduced by the
following commit:

commit 906010b2134e14a2e377decbadd357b3d0ab9c6a
Author: Peter Zijlstra 
Date:   Mon Sep 21 16:08:49 2009 +0200

perf_event: Provide vmalloc() based mmap() backing

Some architectures such as Sparc, ARM and MIPS (basically
everything with flush_dcache_page()) need to deal with dcache
aliases by carefully placing pages in both kernel and user maps.

These architectures typically have to use vmalloc_user() for this.

However, on other architectures, vmalloc() is not needed and has
the downsides of being more restricted and slower than regular
allocations.

Signed-off-by: Peter Zijlstra 
Acked-by: David Miller 
Cc: Andrew Morton 
Cc: Jens Axboe 
Cc: Paul Mackerras 
LKML-Reference: <1254830228.21044.272.camel@laptop>
Signed-off-by: Ingo Molnar 



Re: [PATCH v2 5/7] x86/mm, tracing: Fix CR2 corruption

2019-07-06 Thread Eiichi Tsukata



On 2019/07/05 11:18, Linus Torvalds wrote:
> On Fri, Jul 5, 2019 at 5:03 AM Peter Zijlstra  wrote:
>>
>> Despire the current efforts to read CR2 before tracing happens there
>> still exist a number of possible holes:
> 
> So this whole series disturbs me for the simple reason that I thought
> tracing was supposed to save/restore cr2 and make it unnecessary to
> worry about this in non-tracing code.
> 
> That is very much what the NMI code explicitly does. Why shouldn't all
> the other tracing code do the same thing in case they can take page
> faults?
> 
> So I don't think the patches are wrong per se, but this seems to solve
> it at the wrong level.
> 
>  Linus
> 

Steven previously tried to fix it by saving CR2 in TRACE_IRQS_OFF:
https://lore.kernel.org/lkml/20190320221534.165ab...@oasis.local.home/

But hit the following WARNING:
https://lore.kernel.org/lkml/20190321095502.47b51...@gandalf.local.home/

I tried to find out the root cause of the WARNING, and found that it is
caused by touching trace point(TRACE_IRQS_OFF) before search_binary_handler()
at exeve.

To prevent the user stack trace code from reading the user stack before it
becomes ready, checking current->in_execve in stack_trace_save_user() can
help Steven's approach, though trace_sched_process_exec() is called before
current->in_execve is set to 0, so it changes the current behavior.

The PoC code is as follows:

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 2abf27d7df6b..30fa6e1b7a87 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -116,10 +116,12 @@ void arch_stack_walk_user(stack_trace_consume_fn 
consume_entry, void *cookie,
  const struct pt_regs *regs)
 {
const void __user *fp = (const void __user *)regs->bp;
+   unsigned long address;
 
if (!consume_entry(cookie, regs->ip, false))
return;
 
+   address = read_cr2();
while (1) {
struct stack_frame_user frame;
 
@@ -131,11 +133,14 @@ void arch_stack_walk_user(stack_trace_consume_fn 
consume_entry, void *cookie,
break;
if (frame.ret_addr) {
if (!consume_entry(cookie, frame.ret_addr, false))
-   return;
+   break;
}
if (fp == frame.next_fp)
break;
fp = frame.next_fp;
}
+
+   if (address != read_cr2())
+   write_cr2(address);
 }
 
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 36139de0a3c4..489d33bb5d28 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -230,6 +230,9 @@ unsigned int stack_trace_save_user(unsigned long *store, 
unsigned int size)
/* Trace user stack if not a kernel thread */
if (!current->mm)
return 0;
+   /* current can reach some trace points before its stack is ready */
+   if (current->in_execve)
+   return 0;
 
arch_stack_walk_user(consume_entry, , task_pt_regs(current));
return c.len;
  
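
The effect the PoC guards against can be modeled in a few lines of plain C. This is a toy model: a global variable stands in for the CR2 register, and `fault()` for the page-fault handler recording the faulting address; the names are made up for illustration.

```c
/* Toy model of the CR2 clobber: a "handler" records the faulting
 * address in a global cr2; if a tracer faults while an earlier fault
 * is still being handled, the original address is lost unless the
 * tracer saves and restores it. */
unsigned long cr2;

void fault(unsigned long addr)
{
    cr2 = addr;
}

/* Tracer touches a bad user address without protecting CR2: the
 * original faulting address is overwritten. */
unsigned long traced_walk_bad(unsigned long tracer_fault)
{
    fault(tracer_fault);
    return cr2;
}

/* Tracer saves CR2 up front and restores it afterwards, mirroring the
 * read_cr2()/write_cr2() pair in the PoC. */
unsigned long traced_walk_good(unsigned long tracer_fault)
{
    unsigned long saved = cr2;      /* read_cr2() */
    fault(tracer_fault);
    if (cr2 != saved)
        cr2 = saved;                /* write_cr2() */
    return cr2;
}
```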





[PATCH] x86/stacktrace: Do not access user space memory unnecessarily

2019-07-01 Thread Eiichi Tsukata
Move the boundary check before the user space access to prevent an
unnecessary access which might crash the machine.

Especially, the ftrace preemptirq/irq_disable event with the user stack
trace option can trigger a SEGV in pid 1 which leads to a panic.

Reproducer:

  CONFIG_PREEMPTIRQ_TRACEPOINTS=y
  # echo 1 > events/preemptirq/enable
  # echo userstacktrace > trace_options

Output:

  Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b
  CPU: 1 PID: 1 Comm: systemd Not tainted 5.2.0-rc7+ #10
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Call Trace:
   dump_stack+0x67/0x90
   panic+0x100/0x2c6
   do_exit.cold+0x4e/0x101
   do_group_exit+0x3a/0xa0
   get_signal+0x14a/0x8e0
   do_signal+0x36/0x650
   exit_to_usermode_loop+0x92/0xb0
   prepare_exit_to_usermode+0x6f/0xb0
   retint_user+0x8/0x18
  RIP: 0033:0x55be7ad1c89f
  Code: Bad RIP value.
  RSP: 002b:7ffe329a4b00 EFLAGS: 00010202
  RAX: 0768 RBX: 7ffe329a4ba0 RCX: 7ff0063aa469
  RDX: 7ff0066761de RSI: 7ffe329a4b20 RDI: 0768
  RBP: 000b R08:  R09: 7ffe329a4e2f
  R10:  R11: 0246 R12: 0768
  R13:  R14: 0004 R15: 55be7b3d3560
  Kernel Offset: 0x2a00 from 0x8100 (relocation range: 
0x8000-0xbfff)

Fixes: 02b67518e2b1 ("tracing: add support for userspace stacktraces in 
tracing/iter_ctrl")
Signed-off-by: Eiichi Tsukata 
---
 arch/x86/kernel/stacktrace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 2abf27d7df6b..6d0c608ffe34 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -123,12 +123,12 @@ void arch_stack_walk_user(stack_trace_consume_fn 
consume_entry, void *cookie,
while (1) {
struct stack_frame_user frame;
 
+   if ((unsigned long)fp < regs->sp)
+   break;
frame.next_fp = NULL;
frame.ret_addr = 0;
if (!copy_stack_frame(fp, ))
break;
-   if ((unsigned long)fp < regs->sp)
-   break;
if (frame.ret_addr) {
if (!consume_entry(cookie, frame.ret_addr, false))
return;
-- 
2.21.0



Re: [PATCH] tracing: Fix out-of-range read in trace_stack_print()

2019-06-30 Thread Eiichi Tsukata



On 2019/06/15 5:31, Steven Rostedt wrote:
...

> 
>>
>> Fixes: 4a9bd3f134dec ("tracing: Have dynamic size event stack traces")
> 
> Actually it fixes:
> 
>  4285f2fcef80 ("tracing: Remove the ULONG_MAX stack trace hackery")
> 
> Because before that, a ULONG_MAX was inserted into the buffer.
> 
> -- Steve

Thank you for the advice.
Now that there is no ULONG_MAX marker, I should fix it by just
removing the `*p != ULONG_MAX` check, right?


Thanks

Eiichi

> 
>> Signed-off-by: Eiichi Tsukata 
>> ---
>>  kernel/trace/trace_output.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
>> index 54373d93e251..ba751f993c3b 100644
>> --- a/kernel/trace/trace_output.c
>> +++ b/kernel/trace/trace_output.c
>> @@ -1057,7 +1057,7 @@ static enum print_line_t trace_stack_print(struct 
>> trace_iterator *iter,
>>  
>>  trace_seq_puts(s, "\n");
>>  
>> -for (p = field->caller; p && *p != ULONG_MAX && p < end; p++) {
>> +for (p = field->caller; p && p < end && *p != ULONG_MAX; p++) {
>>  
>>  if (trace_seq_has_overflowed(s))
>>  break;
> 


[PATCH] tracing: Fix user stack trace "??" output

2019-06-30 Thread Eiichi Tsukata
Commit c5c27a0a5838 ("x86/stacktrace: Remove the pointless ULONG_MAX
marker") removed the ULONG_MAX marker from user stack trace entries, but
trace_user_stack_print() still checks for the marker and outputs
unnecessary "??" entries.

For example:

less-1911  [001] d..234.758944: 
   =>  <7f16f2295910>
   => ??
   => ??
   => ??
   => ??
   => ??
   => ??
   => ??

The user stack trace code zeroes the storage before saving the stack, so if
the trace is shorter than the maximum number of entries, the print loop can
terminate as soon as a zero entry is detected.

Fixes: 4285f2fcef80 ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata 
---
 kernel/trace/trace_output.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index ba751f993c3b..cab4a5398f1d 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1109,17 +1109,10 @@ static enum print_line_t trace_user_stack_print(struct 
trace_iterator *iter,
for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
unsigned long ip = field->caller[i];
 
-   if (ip == ULONG_MAX || trace_seq_has_overflowed(s))
+   if (!ip || trace_seq_has_overflowed(s))
break;
 
trace_seq_puts(s, " => ");
-
-   if (!ip) {
-   trace_seq_puts(s, "??");
-   trace_seq_putc(s, '\n');
-   continue;
-   }
-
seq_print_user_ip(s, mm, ip, flags);
trace_seq_putc(s, '\n');
}
-- 
2.21.0
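
The termination logic described above amounts to the following sketch, where `caller` stands in for `field->caller` and `max` for FTRACE_STACK_ENTRIES:

```c
/* Count the valid entries in a zero-initialized stack trace buffer:
 * the storage is zeroed before the trace is saved, so the first zero
 * entry marks the end of a short trace. */
int count_user_stack_entries(const unsigned long *caller, int max)
{
    int n = 0;
    while (n < max && caller[n])
        n++;
    return n;
}
```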



[PATCH net-next] net/ipv6: Fix misuse of proc_dointvec "flowlabel_reflect"

2019-06-27 Thread Eiichi Tsukata
/proc/sys/net/ipv6/flowlabel_reflect assumes the written value to be in the
range 0 to 3. Use proc_dointvec_minmax instead of proc_dointvec.

Fixes: 323a53c41292 ("ipv6: tcp: enable flowlabel reflection in some RST 
packets")
Signed-off-by: Eiichi Tsukata 
---
 net/ipv6/sysctl_net_ipv6.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 6d86fac472e7..831573461e19 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -114,7 +114,7 @@ static struct ctl_table ipv6_table_template[] = {
.data   = _net.ipv6.sysctl.flowlabel_reflect,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
.extra1 = ,
.extra2 = ,
},
-- 
2.21.0
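
The difference between the two handlers can be sketched in C. This is a simplified model of the sysctl handlers, not the kernel implementation: `lo`/`hi` stand in for the `.extra1`/`.extra2` bounds, and `-1` for -EINVAL.

```c
/* Model of proc_dointvec: stores whatever integer was parsed. */
int set_plain(int *data, int val)
{
    *data = val;
    return 0;
}

/* Model of proc_dointvec_minmax: rejects values outside [lo, hi] and
 * leaves the stored value untouched. */
int set_minmax(int *data, int val, int lo, int hi)
{
    if (val < lo || val > hi)
        return -1;
    *data = val;
    return 0;
}
```

With plain proc_dointvec, the `.extra1`/`.extra2` fields in the ctl_table are simply ignored, which is why out-of-range writes were accepted.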



[tip:smp/urgent] cpu/hotplug: Fix out-of-bounds read when setting fail state

2019-06-27 Thread tip-bot for Eiichi Tsukata
Commit-ID:  33d4a5a7a5b4d02915d765064b2319e90a11cbde
Gitweb: https://git.kernel.org/tip/33d4a5a7a5b4d02915d765064b2319e90a11cbde
Author: Eiichi Tsukata 
AuthorDate: Thu, 27 Jun 2019 11:47:32 +0900
Committer:  Thomas Gleixner 
CommitDate: Thu, 27 Jun 2019 09:34:04 +0200

cpu/hotplug: Fix out-of-bounds read when setting fail state

Writing an invalid value to /sys/devices/system/cpu/cpuX/hotplug/fail
can control the `struct cpuhp_step *sp` address, resulting in the following
global-out-of-bounds read.

Reproducer:

  # echo -2 > /sys/devices/system/cpu/cpu0/hotplug/fail

KASAN report:

  BUG: KASAN: global-out-of-bounds in write_cpuhp_fail+0x2cd/0x2e0
  Read of size 8 at addr 89734438 by task bash/1941

  CPU: 0 PID: 1941 Comm: bash Not tainted 5.2.0-rc6+ #31
  Call Trace:
   write_cpuhp_fail+0x2cd/0x2e0
   dev_attr_store+0x58/0x80
   sysfs_kf_write+0x13d/0x1a0
   kernfs_fop_write+0x2bc/0x460
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7f05e4f4c970

  The buggy address belongs to the variable:
   cpu_hotplug_lock+0x98/0xa0

  Memory state around the buggy address:
   89734300: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
   89734380: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  >89734400: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
  ^
   89734480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   89734500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Add a sanity check for the value written from user space.

Fixes: 1db49484f21ed ("smp/hotplug: Hotplug state fail injection")
Signed-off-by: Eiichi Tsukata 
Signed-off-by: Thomas Gleixner 
Cc: pet...@infradead.org
Link: https://lkml.kernel.org/r/20190627024732.31672-1-de...@etsukata.com

---
 kernel/cpu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 551db494f153..ef1c565edc5d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1964,6 +1964,9 @@ static ssize_t write_cpuhp_fail(struct device *dev,
if (ret)
return ret;
 
+   if (fail < CPUHP_OFFLINE || fail > CPUHP_ONLINE)
+   return -EINVAL;
+
/*
 * Cannot fail STARTING/DYING callbacks.
 */
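
The added check amounts to a range validation before the value is ever used as a state index. A sketch with made-up state names, where `TOY_OFFLINE`/`TOY_ONLINE` stand in for the real CPUHP_OFFLINE/CPUHP_ONLINE enum bounds:

```c
enum toy_state { TOY_OFFLINE = 0, TOY_BRINGUP, TOY_ONLINE, TOY_NR_STATES };

/* A negative or too-large `fail` would otherwise be used to index the
 * global state table, reading before or past it (the KASAN report above
 * shows the read landing in cpu_hotplug_lock). */
int fail_state_valid(int fail)
{
    return fail >= TOY_OFFLINE && fail <= TOY_ONLINE;
}
```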


[PATCH] cpu/hotplug: Fix out-of-bounds read when setting fail state

2019-06-26 Thread Eiichi Tsukata
Writing an invalid value to /sys/devices/system/cpu/cpuX/hotplug/fail
can control the `struct cpuhp_step *sp` address, resulting in the following
global-out-of-bounds read.

Reproducer:

  # echo -2 > /sys/devices/system/cpu/cpu0/hotplug/fail

KASAN report:

  BUG: KASAN: global-out-of-bounds in write_cpuhp_fail+0x2cd/0x2e0
  Read of size 8 at addr 89734438 by task bash/1941

  CPU: 0 PID: 1941 Comm: bash Not tainted 5.2.0-rc6+ #31
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 
04/01/2014
  Call Trace:
   dump_stack+0xca/0x13e
   print_address_description.cold+0x5/0x246
   __kasan_report.cold+0x75/0x9a
   ? write_cpuhp_fail+0x2cd/0x2e0
   kasan_report+0xe/0x20
   write_cpuhp_fail+0x2cd/0x2e0
   ? takedown_cpu+0x3a0/0x3a0
   ? takedown_cpu+0x3a0/0x3a0
   dev_attr_store+0x58/0x80
   ? put_device+0x30/0x30
   sysfs_kf_write+0x13d/0x1a0
   kernfs_fop_write+0x2bc/0x460
   ? sysfs_kf_bin_read+0x270/0x270
   ? kernfs_notify+0x1f0/0x1f0
   __vfs_write+0x81/0x100
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   ? __ia32_sys_read+0xb0/0xb0
   ? do_syscall_64+0x1f/0x390
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7f05e4f4c970
  Code: 73 01 c3 48 8b 0d 28 d5 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 
00 00 83 3d 99 2d 2c 00 04
  RSP: 002b:7ffd7fa630f8 EFLAGS: 0246 ORIG_RAX: 0001
  RAX: ffda RBX: 0003 RCX: 7f05e4f4c970
  RDX: 0003 RSI: 02039c08 RDI: 0001
  RBP: 02039c08 R08: 7f05e520c760 R09: 7f05e5858700
  R10: 0073 R11: 0246 R12: 0003
  R13: 0001 R14: 7f05e520b600 R15: 0003

  The buggy address belongs to the variable:
   cpu_hotplug_lock+0x98/0xa0

  Memory state around the buggy address:
   89734300: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
   89734380: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  >89734400: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
  ^
   89734480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   89734500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Add a sanity check for the given value `fail`.

Fixes: 1db49484f21ed ("smp/hotplug: Hotplug state fail injection")
Signed-off-by: Eiichi Tsukata 
---
 kernel/cpu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 077fde6fb953..336254a48502 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1964,6 +1964,9 @@ static ssize_t write_cpuhp_fail(struct device *dev,
if (ret)
return ret;
 
+   if (fail < CPUHP_OFFLINE || fail > CPUHP_ONLINE)
+   return -EINVAL;
+
/*
 * Cannot fail STARTING/DYING callbacks.
 */
-- 
2.21.0



[PATCH] EDAC: Fix global-out-of-bounds write when setting edac_mc_poll_msec

2019-06-25 Thread Eiichi Tsukata
Commit 9da21b1509d8 ("EDAC: Poll timeout cannot be zero, p2") assumes
edac_mc_poll_msec to be unsigned long, but the type of the variable
remained int. Setting edac_mc_poll_msec can trigger an out-of-bounds
write.

Reproducer:

  # echo 1001 > /sys/module/edac_core/parameters/edac_mc_poll_msec

KASAN report:

  BUG: KASAN: global-out-of-bounds in edac_set_poll_msec+0x140/0x150
  Write of size 8 at addr b91b2d00 by task bash/1996

  CPU: 1 PID: 1996 Comm: bash Not tainted 5.2.0-rc6+ #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 
04/01/2014
  Call Trace:
   dump_stack+0xca/0x13e
   print_address_description.cold+0x5/0x246
   __kasan_report.cold+0x75/0x9a
   ? edac_set_poll_msec+0x140/0x150
   kasan_report+0xe/0x20
   edac_set_poll_msec+0x140/0x150
   ? dimmdev_location_show+0x30/0x30
   ? vfs_lock_file+0xe0/0xe0
   ? _raw_spin_lock+0x87/0xe0
   param_attr_store+0x1b5/0x310
   ? param_array_set+0x4f0/0x4f0
   module_attr_store+0x58/0x80
   ? module_attr_show+0x80/0x80
   sysfs_kf_write+0x13d/0x1a0
   kernfs_fop_write+0x2bc/0x460
   ? sysfs_kf_bin_read+0x270/0x270
   ? kernfs_notify+0x1f0/0x1f0
   __vfs_write+0x81/0x100
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   ? __ia32_sys_read+0xb0/0xb0
   ? do_syscall_64+0x1f/0x390
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7fa7caa5e970
  Code: 73 01 c3 48 8b 0d 28 d5 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 
00 00 83 3d 99 2d 2c 00 00 75 10 b8 01 00 00 00 04
  RSP: 002b:7fff6acfdfe8 EFLAGS: 0246 ORIG_RAX: 0001
  RAX: ffda RBX: 0005 RCX: 7fa7caa5e970
  RDX: 0005 RSI: 00e95c08 RDI: 0001
  RBP: 00e95c08 R08: 7fa7cad1e760 R09: 7fa7cb36a700
  R10: 0073 R11: 0246 R12: 0005
  R13: 0001 R14: 7fa7cad1d600 R15: 0005

  The buggy address belongs to the variable:
   edac_mc_poll_msec+0x0/0x40

  Memory state around the buggy address:
   b91b2c00: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
   b91b2c80: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
  >b91b2d00: 04 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa
 ^
   b91b2d80: 04 fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
   b91b2e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Fix it by changing the type of edac_mc_poll_msec to unsigned int.
This patch adopts unsigned int rather than unsigned long because
msecs_to_jiffies() takes an unsigned int argument. This avoids integer
conversion bugs, and unsigned int is large enough for edac_mc_poll_msec.

Fixes: 9da21b1509d8 ("EDAC: Poll timeout cannot be zero, p2")
Signed-off-by: Eiichi Tsukata 
---
 drivers/edac/edac_mc_sysfs.c | 16 
 drivers/edac/edac_module.h   |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 7c01e1cc030c..4386ea4b9b5a 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -26,7 +26,7 @@
 static int edac_mc_log_ue = 1;
 static int edac_mc_log_ce = 1;
 static int edac_mc_panic_on_ue;
-static int edac_mc_poll_msec = 1000;
+static unsigned int edac_mc_poll_msec = 1000;
 
 /* Getter functions for above */
 int edac_mc_get_log_ue(void)
@@ -45,30 +45,30 @@ int edac_mc_get_panic_on_ue(void)
 }
 
 /* this is temporary */
-int edac_mc_get_poll_msec(void)
+unsigned int edac_mc_get_poll_msec(void)
 {
return edac_mc_poll_msec;
 }
 
 static int edac_set_poll_msec(const char *val, const struct kernel_param *kp)
 {
-   unsigned long l;
+   unsigned int i;
int ret;
 
if (!val)
return -EINVAL;
 
-   ret = kstrtoul(val, 0, );
+   ret = kstrtouint(val, 0, );
if (ret)
return ret;
 
-   if (l < 1000)
+   if (i < 1000)
return -EINVAL;
 
-   *((unsigned long *)kp->arg) = l;
+   *((unsigned int *)kp->arg) = i;
 
/* notify edac_mc engine to reset the poll period */
-   edac_mc_reset_delay_period(l);
+   edac_mc_reset_delay_period(i);
 
return 0;
 }
@@ -82,7 +82,7 @@ MODULE_PARM_DESC(edac_mc_log_ue,
 module_param(edac_mc_log_ce, int, 0644);
 MODULE_PARM_DESC(edac_mc_log_ce,
 "Log correctable error to console: 0=off 1=on");
-module_param_call(edac_mc_poll_msec, edac_set_poll_msec, param_get_int,
+module_param_call(edac_mc_poll_msec, edac_set_poll_msec, param_get_uint,
  _mc_poll_msec, 0644);
 MODULE_PARM_DESC(edac_mc_poll_msec, "Polling period in milliseconds");
 
diff --git a/drivers/edac/edac_module.h b/drivers/edac/edac_module.h
index bc4b806dc9cc..b2f59ee76c22 100644
--- a/drivers/edac/edac_module.h
+++ b/drivers/edac/edac_module.h
@@ -36,7 +36,7 @@ extern int edac_mc_get_log_u
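
The width mismatch the patch fixes can be reproduced in isolation. This is a toy model: `struct params` stands in for adjacent module variables, and memcpy makes the over-wide store of the original `*((unsigned long *)kp->arg) = l;` explicit without aliasing tricks.

```c
#include <string.h>

struct params {
    int poll_msec;   /* the int the setter was actually handed */
    int neighbor;    /* whatever happens to sit next to it */
};

/* Buggy shape: stores sizeof(unsigned long) bytes through a pointer
 * to a 4-byte int, clobbering the neighboring object on 64-bit. */
void buggy_set(struct params *p, unsigned long val)
{
    memcpy(&p->poll_msec, &val, sizeof(val));
}

/* Fixed shape: parse and store the value as unsigned int throughout,
 * so the store is exactly as wide as the object. */
void fixed_set(struct params *p, unsigned int val)
{
    p->poll_msec = (int)val;
}
```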

Re: [PATCH 1/2] tty: ldisc: Fix misuse of proc_dointvec "ldisc_autoload"

2019-06-24 Thread Eiichi Tsukata



On 2019/06/25 12:32, Greg KH wrote:
> On Tue, Jun 25, 2019 at 12:08:00PM +0900, Eiichi Tsukata wrote:
>> /proc/sys/dev/tty/ldisc_autoload assumes given value to be 0 or 1. Use
>> proc_dointvec_minmax instead of proc_dointvec.
>>
>> Fixes: 7c0cca7c847e "(tty: ldisc: add sysctl to prevent autoloading of 
>> ldiscs)"
>> Signed-off-by: Eiichi Tsukata 
>> ---
>>  drivers/tty/tty_ldisc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
>> index e38f104db174..a8ea7a35c94e 100644
>> --- a/drivers/tty/tty_ldisc.c
>> +++ b/drivers/tty/tty_ldisc.c
>> @@ -863,7 +863,7 @@ static struct ctl_table tty_table[] = {
>>  .data   = &tty_ldisc_autoload,
>>  .maxlen = sizeof(tty_ldisc_autoload),
>>  .mode   = 0644,
>> -.proc_handler   = proc_dointvec,
>> +.proc_handler   = proc_dointvec_minmax,
>>  .extra1 = ,
>>  .extra2 = ,
> 
> Ah, nice catch.  But this really isn't an issue as if you use a bigger
> value, things will not "break", right?
> 

Someone may misuse -1 to disable ldisc autoload, but basically it does not
"break".

Thanks,

Eiichi


[PATCH 2/2] net/ipv6: Fix misuse of proc_dointvec "skip_notify_on_dev_down"

2019-06-24 Thread Eiichi Tsukata
/proc/sys/net/ipv6/route/skip_notify_on_dev_down assumes the given value
is 0 or 1. Use proc_dointvec_minmax instead of proc_dointvec.

Fixes: 7c6bb7d2faaf ("net/ipv6: Add knob to skip DELROUTE message on device down")
Signed-off-by: Eiichi Tsukata 
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 11ad62effd56..aade636c6be6 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5281,7 +5281,7 @@ static struct ctl_table ipv6_route_table_template[] = {
.data   =   &init_net.ipv6.sysctl.skip_notify_on_dev_down,
.maxlen =   sizeof(int),
.mode   =   0644,
-   .proc_handler   =   proc_dointvec,
+   .proc_handler   =   proc_dointvec_minmax,
.extra1 =   ,
.extra2 =   ,
},
-- 
2.21.0



[PATCH 1/2] tty: ldisc: Fix misuse of proc_dointvec "ldisc_autoload"

2019-06-24 Thread Eiichi Tsukata
/proc/sys/dev/tty/ldisc_autoload assumes the given value is 0 or 1. Use
proc_dointvec_minmax instead of proc_dointvec.

Fixes: 7c0cca7c847e ("tty: ldisc: add sysctl to prevent autoloading of ldiscs")
Signed-off-by: Eiichi Tsukata 
---
 drivers/tty/tty_ldisc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index e38f104db174..a8ea7a35c94e 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -863,7 +863,7 @@ static struct ctl_table tty_table[] = {
.data   = &tty_ldisc_autoload,
.maxlen = sizeof(tty_ldisc_autoload),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
.extra1 = ,
.extra2 = ,
},
-- 
2.21.0



[PATCH] tracing/snapshot: resize spare buffer if size changed

2019-06-24 Thread Eiichi Tsukata
The current snapshot implementation swaps the two ring buffers even when
their sizes differ, which can cause an inconsistency between the
contents of the buffer_size_kb file and the actual buffer size.

For example:

  # cat buffer_size_kb
  7 (expanded: 1408)
  # echo 1 > events/enable
  # grep bytes per_cpu/cpu0/stats
  bytes: 1441020
  # echo 1 > snapshot // current:1408, spare:1408
  # echo 123 > buffer_size_kb // current:123,  spare:1408
  # echo 1 > snapshot // current:1408, spare:123
  # grep bytes per_cpu/cpu0/stats
  bytes: 1443700
  # cat buffer_size_kb
  123 // != current:1408

A similar per-cpu case also hits the following WARNING:

Reproducer:

  # echo 1 > per_cpu/cpu0/snapshot
  # echo 123 > buffer_size_kb
  # echo 1 > per_cpu/cpu0/snapshot

WARNING:

  WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 
update_max_tr_single.part.0+0x2b8/0x380
  Modules linked in:
  CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 
04/01/2014
  RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
  Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff 
ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 
ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
  RSP: 0018:888063e4fca0 EFLAGS: 00010093
  RAX: 888066214380 RBX: 99850fe0 RCX: 964298a8
  RDX:  RSI: fff5 RDI: 0005
  RBP: 11100c7c9f96 R08: 888066214380 R09: ed100c7c9f9b
  R10: ed100c7c9f9a R11: 0003 R12: 
  R13: ffea R14: 888066214380 R15: 99851060
  FS:  7f9f8173c700() GS:88806d00() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 00714dc0 CR3: 66fa6000 CR4: 06f0
  Call Trace:
   ? trace_array_printk_buf+0x140/0x140
   ? __mutex_lock_slowpath+0x10/0x10
   tracing_snapshot_write+0x4c8/0x7f0
   ? trace_printk_init_buffers+0x60/0x60
   ? selinux_file_permission+0x3b/0x540
   ? tracer_preempt_off+0x38/0x506
   ? trace_printk_init_buffers+0x60/0x60
   __vfs_write+0x81/0x100
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   ? __ia32_sys_read+0xb0/0xb0
   ? do_syscall_64+0x1f/0x390
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

This patch adds a call to resize_buffer_duplicate_size() to check
whether the current and spare buffer sizes differ and resize the spare
buffer if necessary.

Signed-off-by: Eiichi Tsukata 
---
 kernel/trace/trace.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 83e08b78dbee..3edd4c1b96be 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6719,11 +6719,13 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,
break;
}
 #endif
-   if (!tr->allocated_snapshot) {
+   if (tr->allocated_snapshot)
+   ret = resize_buffer_duplicate_size(&tr->max_buffer,
+   &tr->trace_buffer, iter->cpu_file);
+   else
ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
-   break;
-   }
+   if (ret < 0)
+   break;
local_irq_disable();
/* Now, we're going to swap */
if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
-- 
2.21.0



[PATCH 1/2] tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()

2019-06-14 Thread Eiichi Tsukata
Just like the case of commit 8b05a3a7503c ("tracing/kprobes: Fix NULL
pointer dereference in trace_kprobe_create()"), writing an incorrectly
formatted string to uprobe_events can trigger NULL pointer dereference.

Reproducer:

  # echo r > /sys/kernel/debug/tracing/uprobe_events

dmesg:

  BUG: kernel NULL pointer dereference, address: 
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 800079d12067 P4D 800079d12067 PUD 7b7ab067 PMD 0
  Oops:  [#1] PREEMPT SMP PTI
  CPU: 0 PID: 1903 Comm: bash Not tainted 5.2.0-rc3+ #15
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 
04/01/2014
  RIP: 0010:strchr+0x0/0x30
  Code: c0 eb 0d 84 c9 74 18 48 83 c0 01 48 39 d0 74 0f 0f b6 0c 07 3a 0c 06 74 
ea 19 c0 83 c8 01 c3 31 c0 c3 0f 1f 84 00 00 00 00 00 <0f> b6 07 89 f2 40 38 f0 
75 0e eb 13 0f b6 47 01 48 83 c
  RSP: 0018:b55fc0403d10 EFLAGS: 00010293

  RAX: 993ffb793400 RBX:  RCX: a4852625
  RDX:  RSI: 002f RDI: 
  RBP: b55fc0403dd0 R08: 993ffb793400 R09: 
  R10:  R11:  R12: 
  R13: 993ff9cc1668 R14: 0001 R15: 
  FS:  7f30c5147700() GS:993ffda0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2:  CR3: 7b628000 CR4: 06f0
  Call Trace:
   trace_uprobe_create+0xe6/0xb10
   ? __kmalloc_track_caller+0xe6/0x1c0
   ? __kmalloc+0xf0/0x1d0
   ? trace_uprobe_create+0xb10/0xb10
   create_or_delete_trace_uprobe+0x35/0x90
   ? trace_uprobe_create+0xb10/0xb10
   trace_run_command+0x9c/0xb0
   trace_parse_run_command+0xf9/0x1eb
   ? probes_open+0x80/0x80
   __vfs_write+0x43/0x90
   vfs_write+0x14a/0x2a0
   ksys_write+0xa2/0x170
   do_syscall_64+0x7f/0x200
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 0597c49c69d5 ("tracing/uprobes: Use dyn_event framework for uprobe 
events")
Signed-off-by: Eiichi Tsukata 
---
 kernel/trace/trace_uprobe.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 0d60d6856de5..7dc8ee55cf84 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -443,10 +443,17 @@ static int trace_uprobe_create(int argc, const char 
**argv)
ret = 0;
ref_ctr_offset = 0;
 
-   /* argc must be >= 1 */
-   if (argv[0][0] == 'r')
+   switch (argv[0][0]) {
+   case 'r':
is_return = true;
-   else if (argv[0][0] != 'p' || argc < 2)
+   break;
+   case 'p':
+   break;
+   default:
+   return -ECANCELED;
+   }
+
+   if (argc < 2)
return -ECANCELED;
 
if (argv[0][1] == ':')
-- 
2.21.0



[PATCH 2/2] tracing/uprobe: Fix obsolete comment on trace_uprobe_create()

2019-06-14 Thread Eiichi Tsukata
Commit 0597c49c69d5 ("tracing/uprobes: Use dyn_event framework for
uprobe events") cleaned up the usage of trace_uprobe_create(), and the
function is no longer used for removing uprobes/uretprobes.

Signed-off-by: Eiichi Tsukata 
---
 kernel/trace/trace_uprobe.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 7dc8ee55cf84..7860e3f59fad 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -426,8 +426,6 @@ static int register_trace_uprobe(struct trace_uprobe *tu)
 /*
  * Argument syntax:
  *  - Add uprobe: p|r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS]
- *
- *  - Remove uprobe: -:[GRP/]EVENT
  */
 static int trace_uprobe_create(int argc, const char **argv)
 {
-- 
2.21.0



[PATCH] tracing: Fix out-of-range read in trace_stack_print()

2019-06-09 Thread Eiichi Tsukata
Put the range check before dereferencing the pointer.

Reproducer:

  # echo stacktrace > trace_options
  # echo 1 > events/enable
  # cat trace > /dev/null

KASAN report:

  ==
  BUG: KASAN: use-after-free in trace_stack_print+0x26b/0x2c0
  Read of size 8 at addr 888069d2 by task cat/1953

  CPU: 0 PID: 1953 Comm: cat Not tainted 5.2.0-rc3+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 
04/01/2014
  Call Trace:
   dump_stack+0x8a/0xce
   print_address_description+0x60/0x224
   ? trace_stack_print+0x26b/0x2c0
   ? trace_stack_print+0x26b/0x2c0
   __kasan_report.cold+0x1a/0x3e
   ? trace_stack_print+0x26b/0x2c0
   kasan_report+0xe/0x20
   trace_stack_print+0x26b/0x2c0
   print_trace_line+0x6ea/0x14d0
   ? tracing_buffers_read+0x700/0x700
   ? trace_find_next_entry_inc+0x158/0x1d0
   s_show+0xea/0x310
   seq_read+0xaa7/0x10e0
   ? seq_escape+0x230/0x230
   __vfs_read+0x7c/0x100
   vfs_read+0x16c/0x3a0
   ksys_read+0x121/0x240
   ? kernel_write+0x110/0x110
   ? perf_trace_sys_enter+0x8a0/0x8a0
   ? syscall_slow_exit_work+0xa9/0x410
   do_syscall_64+0xb7/0x390
   ? prepare_exit_to_usermode+0x165/0x200
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f867681f910
  Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 
00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 04
  RSP: 002b:7ffdabf23488 EFLAGS: 0246 ORIG_RAX: 
  RAX: ffda RBX: 0002 RCX: 7f867681f910
  RDX: 0002 RSI: 7f8676cde000 RDI: 0003
  RBP: 7f8676cde000 R08:  R09: 
  R10: 0871 R11: 0246 R12: 7f8676cde000
  R13: 0003 R14: 0002 R15: 0ec0

  Allocated by task 1214:
   save_stack+0x1b/0x80
   __kasan_kmalloc.constprop.0+0xc2/0xd0
   kmem_cache_alloc+0xaf/0x1a0
   getname_flags+0xd2/0x5b0
   do_sys_open+0x277/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 1214:
   save_stack+0x1b/0x80
   __kasan_slab_free+0x12c/0x170
   kmem_cache_free+0x8a/0x1c0
   putname+0xe1/0x120
   do_sys_open+0x2c5/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at 888069d2
   which belongs to the cache names_cache of size 4096
  The buggy address is located 0 bytes inside of
   4096-byte region [888069d2, 888069d21000)
  The buggy address belongs to the page:
  page:ea0001a74800 refcount:1 mapcount:0 mapping:88806ccd1380 
index:0x0 compound_mapcount: 0
  flags: 0x1010200(slab|head)
  raw: 01010200 dead0100 dead0200 88806ccd1380
  raw:  00070007 0001 
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   888069d1ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   888069d1ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >888069d2: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ^
   888069d20080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   888069d20100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==

Fixes: 4a9bd3f134dec ("tracing: Have dynamic size event stack traces")
Signed-off-by: Eiichi Tsukata 
---
 kernel/trace/trace_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 54373d93e251..ba751f993c3b 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1057,7 +1057,7 @@ static enum print_line_t trace_stack_print(struct 
trace_iterator *iter,
 
trace_seq_puts(s, "\n");
 
-   for (p = field->caller; p && *p != ULONG_MAX && p < end; p++) {
+   for (p = field->caller; p && p < end && *p != ULONG_MAX; p++) {
 
if (trace_seq_has_overflowed(s))
break;
-- 
2.21.0



[RESEND PATCH v3] scsi: Add timeout to avoid infinite command retry

2014-02-27 Thread Eiichi Tsukata
Currently, scsi error handling in scsi_io_completion() unconditionally
requeues a scsi command while the device remains in an error state. For
example, UNIT_ATTENTION causes infinite retry with
action == ACTION_RETRY.
This is because retryable errors are assumed to be temporary, so such a
retry policy is normally appropriate: the device is expected to recover
from the temporary error state soon.

But there is no guarantee that the device can recover from the error
state immediately. Some hardware errors can prevent the device from
recovering.

This patch adds a timeout to scsi_io_completion() to avoid infinite
command retries. Once a command has been retried for longer than this
timeout, it is treated as a failure.

Changes in v3:
 - use existing timeout instead of adding new sysfs attribute
   (Thanks to James)

Changes in v2:
 - check retry timeout in scsi_io_completion() instead of scsi_softirq_done()

Signed-off-by: Eiichi Tsukata 
Cc: "James E.J. Bottomley" 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_lib.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 62ec84b..f97a1a6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -788,6 +788,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int 
good_bytes)
enum {ACTION_FAIL, ACTION_REPREP, ACTION_RETRY,
  ACTION_DELAYED_RETRY} action;
char *description = NULL;
+   unsigned long wait_for = (cmd->allowed + 1) * req->timeout;
 
if (result) {
sense_valid = scsi_command_normalize_sense(cmd, &sshdr);
@@ -989,6 +990,12 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
action = ACTION_FAIL;
}
 
+   if (action != ACTION_FAIL &&
+   time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
+   action = ACTION_FAIL;
+   description = "Command timed out";
+   }
+
switch (action) {
case ACTION_FAIL:
/* Give up and fail the remainder of the request */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3] scsi: Add timeout to avoid infinite command retry

2014-02-10 Thread Eiichi Tsukata
Currently, scsi error handling in scsi_io_completion() unconditionally
requeues a scsi command while the device remains in an error state. For
example, UNIT_ATTENTION causes infinite retry with
action == ACTION_RETRY.
This is because retryable errors are assumed to be temporary, so such a
retry policy is normally appropriate: the device is expected to recover
from the temporary error state soon.

But there is no guarantee that the device can recover from the error
state immediately. Some hardware errors can prevent the device from
recovering.

This patch adds a timeout to scsi_io_completion() to avoid infinite
command retries. Once a command has been retried for longer than this
timeout, it is treated as a failure.

Changes in v3:
 - use existing timeout instead of adding new sysfs attribute
   (Thanks to James)

Changes in v2:
 - check retry timeout in scsi_io_completion() instead of scsi_softirq_done()

Signed-off-by: Eiichi Tsukata 
Cc: "James E.J. Bottomley" 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_lib.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7bd7f0d..fa9707d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -788,6 +788,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int 
good_bytes)
enum {ACTION_FAIL, ACTION_REPREP, ACTION_RETRY,
  ACTION_DELAYED_RETRY} action;
char *description = NULL;
+   unsigned long wait_for = (cmd->allowed + 1) * req->timeout;
 
if (result) {
sense_valid = scsi_command_normalize_sense(cmd, &sshdr);
@@ -989,6 +990,12 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
action = ACTION_FAIL;
}
 
+   if (action != ACTION_FAIL &&
+   time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
+   action = ACTION_FAIL;
+   description = "Command timed out";
+   }
+
switch (action) {
case ACTION_FAIL:
/* Give up and fail the remainder of the request */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: [PATCH v2] scsi: Add 'retry_timeout' to avoid infinite command retry

2014-02-10 Thread Eiichi Tsukata


(2014/02/07 15:15), Libo Chen wrote:

On 2014/2/7 13:46, James Bottomley wrote:

On Fri, 2014-02-07 at 09:22 +0900, Eiichi Tsukata wrote:

Currently, scsi error handling in scsi_io_completion() tries to
unconditionally requeue scsi command when device keeps some error state.
For example, UNIT_ATTENTION causes infinite retry with
action == ACTION_RETRY.
This is because retryable errors are thought to be temporary and the scsi
device will soon recover from those errors. Normally, such retry policy is
appropriate because the device will soon recover from temporary error state.




But there is no guarantee that device is able to recover from error state
immediately. Actually, we've experienced an infinite retry on some hardware.
Therefore hardware error can results in infinite command retry loop.

Could you please add an analysis of the actual failure; which devices
and what conditions.



same question, can you explain?



I'm afraid I can't disclose the device name, because I haven't yet
confirmed with the device manufacturer that the device is responsible
for the problem. However, from the limited evidence, it seems that the
SCSI command was retried with UNIT_ATTENTION.

In the previous thread, Ewan reported that a storage array returned a 
CHECK_CONDITION
with invalid sense data, which caused the command to be retried 
indefinitely:

https://lkml.org/lkml/2013/8/20/498

So, we should try to avoid infinite retries in the SCSI midlayer, not in
each SCSI LLDD.



This patch adds 'retry_timeout' sysfs attribute which limits the retry time
of each scsi command. This attribute is located in scsi sysfs directory
for example "/sys/bus/scsi/devices/X:X:X:X/" and value is in seconds.
Once scsi command retry time is longer than this timeout,
the command is treated as failure. 'retry_timeout' is set to '0' by default
which means no timeout set.

Don't do this ... you're mixing a feature (which you'd need to justify)
with an apparent bug fix.

Once you dump all the complexity, I think the patch boils down to a
simple check before the action switch in scsi_io_completion():

if (action !=  ACTION_FAIL &&
time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
action = ACTION_FAIL;
description = "command timed out";
}



Sounds good!
Thanks for much simpler code. It's enough to fix the bug.
I'll resend the patch soon with the above code.

Eiichi.


James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] scsi: Add 'retry_timeout' to avoid infinite command retry

2014-02-06 Thread Eiichi Tsukata
Currently, scsi error handling in scsi_io_completion() unconditionally
requeues a scsi command while the device remains in an error state. For
example, UNIT_ATTENTION causes infinite retry with
action == ACTION_RETRY.
This is because retryable errors are assumed to be temporary, so such a
retry policy is normally appropriate: the device is expected to recover
from the temporary error state soon.

But there is no guarantee that the device can recover from the error
state immediately. Actually, we have experienced an infinite retry on
some hardware, so a hardware error can result in an infinite command
retry loop.

This patch adds a 'retry_timeout' sysfs attribute which limits the retry
time of each scsi command. The attribute is located in the scsi sysfs
directory, for example "/sys/bus/scsi/devices/X:X:X:X/", and its value
is in seconds. Once a command has been retried for longer than this
timeout, it is treated as a failure. 'retry_timeout' defaults to '0',
which means no timeout is set.

Usage:
 - To set retry timeout(set retry_timeout to 30 sec):
# echo 30 > /sys/bus/scsi/devices/X:X:X:X/retry_timeout

Changes in v2:
 - check retry timeout in scsi_io_completion() instead of scsi_softirq_done()

Signed-off-by: Eiichi Tsukata 
Cc: "James E.J. Bottomley" 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_lib.c|   22 ++
 drivers/scsi/scsi_scan.c   |1 +
 drivers/scsi/scsi_sysfs.c  |   32 
 include/scsi/scsi.h|1 +
 include/scsi/scsi_device.h |1 +
 5 files changed, 57 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7bd7f0d..813b287 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -741,6 +741,24 @@ static int __scsi_error_from_host_byte(struct scsi_cmnd 
*cmd, int result)
 }
 
 /*
+ * Check if the scsi command exceeded the retry timeout
+ */
+static int scsi_retry_timed_out(struct scsi_cmnd *cmd)
+{
+   unsigned int retry_timeout;
+
+   retry_timeout = cmd->device->retry_timeout;
+   if (retry_timeout &&
+   time_before(cmd->jiffies_at_alloc + retry_timeout, jiffies)) {
+   scmd_printk(KERN_ERR, cmd, "retry timeout, waited %us\n",
+   retry_timeout/HZ);
+   return 1;
+   }
+
+   return 0;
+}
+
+/*
  * Function:scsi_io_completion()
  *
  * Purpose: Completion processing for block device I/O requests.
@@ -989,6 +1007,10 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
action = ACTION_FAIL;
}
 
+   if ((action == ACTION_RETRY || action == ACTION_DELAYED_RETRY) &&
+   scsi_retry_timed_out(cmd))
+   action = ACTION_FAIL;
+
switch (action) {
case ACTION_FAIL:
/* Give up and fail the remainder of the request */
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 307a811..4ab044a 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -925,6 +925,7 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned 
char *inq_result,
sdev->no_dif = 1;
 
sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;
+   sdev->retry_timeout = SCSI_DEFAULT_RETRY_TIMEOUT;
 
if (*bflags & BLIST_SKIP_VPD_PAGES)
sdev->skip_vpd_pages = 1;
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 8ff62c2..eaa2118 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -627,6 +627,37 @@ sdev_store_eh_timeout(struct device *dev, struct 
device_attribute *attr,
 static DEVICE_ATTR(eh_timeout, S_IRUGO | S_IWUSR, sdev_show_eh_timeout, 
sdev_store_eh_timeout);
 
 static ssize_t
+sdev_show_retry_timeout(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct scsi_device *sdev;
+   sdev = to_scsi_device(dev);
+   return snprintf(buf, 20, "%u\n", sdev->retry_timeout / HZ);
+}
+
+static ssize_t
+sdev_store_retry_timeout(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct scsi_device *sdev;
+   unsigned int retry_timeout;
+   int err;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   sdev = to_scsi_device(dev);
+   err = kstrtouint(buf, 10, &retry_timeout);
+   if (err)
+   return err;
+   sdev->retry_timeout = retry_timeout * HZ;
+
+   return count;
+}
+static DEVICE_ATTR(retry_timeout, S_IRUGO | S_IWUSR,
+  sdev_show_retry_timeout, sdev_store_retry_timeout);
+
+static ssize_t
 store_rescan_field (struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
 {
@@ -797,6 +828,7 @@ static struct attribute *scsi_sdev_attrs

[PATCH v2] scsi: Add 'retry_timeout' to avoid infinite command retry

2014-02-06 Thread Eiichi Tsukata
Currently, scsi error handling in scsi_io_completion() tries to
requeue a scsi command unconditionally while the device stays in some
error state. For example, UNIT_ATTENTION causes an infinite retry with
action == ACTION_RETRY.
This happens because retryable errors are assumed to be temporary, so
such a retry policy is normally appropriate: the device is expected to
recover soon.

But there is no guarantee that the device can recover from the error
state immediately; we have actually observed an infinite retry on some
hardware. A hardware error can therefore result in an infinite command
retry loop.

This patch adds a 'retry_timeout' sysfs attribute which limits the retry
time of each scsi command. The attribute lives in the scsi sysfs
directory, for example /sys/bus/scsi/devices/X:X:X:X/, and its value is
in seconds. Once a command has been retried for longer than this timeout,
the command is treated as failed. 'retry_timeout' is set to '0' by
default, which means no timeout is set.

Usage:
 - To set the retry timeout to 30 seconds:
# echo 30 > /sys/bus/scsi/devices/X:X:X:X/retry_timeout
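
The timeout check that scsi_retry_timed_out() performs is built on the
kernel's wraparound-safe time_before() jiffies comparison. As a rough
illustration only (HZ=1000 and a 32-bit jiffies width are assumptions of
this sketch, not values taken from the patch), the logic can be modeled as:

```python
HZ = 1000  # assumed tick rate; the kernel's HZ is a build-time constant

def time_before(a: int, b: int, bits: int = 32) -> bool:
    """Model of the kernel's time_before(a, b): true if a is earlier than b,
    correct even after the jiffies counter has wrapped around."""
    mask = (1 << bits) - 1
    diff = (a - b) & mask
    if diff >= 1 << (bits - 1):   # reinterpret the difference as signed
        diff -= 1 << bits
    return diff < 0

def retry_timed_out(jiffies_at_alloc: int, retry_timeout: int,
                    jiffies_now: int) -> bool:
    """Sketch of scsi_retry_timed_out(): a zero timeout disables the check
    (the patch's default); otherwise fail once the deadline has passed."""
    return bool(retry_timeout) and \
        time_before(jiffies_at_alloc + retry_timeout, jiffies_now)

print(retry_timed_out(0, 30 * HZ, 31 * HZ))  # True: 31 s elapsed, 30 s limit
print(retry_timed_out(0, 30 * HZ, 10 * HZ))  # False: still inside the window
print(retry_timed_out(0, 0, 10**9))          # False: timeout disabled
```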

Changes in v2:
 - check retry timeout in scsi_io_completion() instead of scsi_softirq_done()

Signed-off-by: Eiichi Tsukata eiichi.tsukata...@hitachi.com
Cc: James E.J. Bottomley jbottom...@parallels.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_lib.c|   22 ++
 drivers/scsi/scsi_scan.c   |1 +
 drivers/scsi/scsi_sysfs.c  |   32 
 include/scsi/scsi.h|1 +
 include/scsi/scsi_device.h |1 +
 5 files changed, 57 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7bd7f0d..813b287 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -741,6 +741,24 @@ static int __scsi_error_from_host_byte(struct scsi_cmnd 
*cmd, int result)
 }
 
 /*
+ * Check if the scsi command exceeded the retry timeout
+ */
+static int scsi_retry_timed_out(struct scsi_cmnd *cmd)
+{
+   unsigned int retry_timeout;
+
+   retry_timeout = cmd->device->retry_timeout;
+   if (retry_timeout &&
+   time_before(cmd->jiffies_at_alloc + retry_timeout, jiffies)) {
+   scmd_printk(KERN_ERR, cmd, "retry timeout, waited %us\n",
+   retry_timeout/HZ);
+   return 1;
+   }
+
+   return 0;
+}
+
+/*
  * Function:scsi_io_completion()
  *
  * Purpose: Completion processing for block device I/O requests.
@@ -989,6 +1007,10 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
action = ACTION_FAIL;
}
 
+   if ((action == ACTION_RETRY || action == ACTION_DELAYED_RETRY) &&
+   scsi_retry_timed_out(cmd))
+   action = ACTION_FAIL;
+
switch (action) {
case ACTION_FAIL:
/* Give up and fail the remainder of the request */
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 307a811..4ab044a 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -925,6 +925,7 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned 
char *inq_result,
sdev->no_dif = 1;

sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;
+   sdev->retry_timeout = SCSI_DEFAULT_RETRY_TIMEOUT;

if (*bflags & BLIST_SKIP_VPD_PAGES)
sdev->skip_vpd_pages = 1;
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 8ff62c2..eaa2118 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -627,6 +627,37 @@ sdev_store_eh_timeout(struct device *dev, struct 
device_attribute *attr,
 static DEVICE_ATTR(eh_timeout, S_IRUGO | S_IWUSR, sdev_show_eh_timeout, 
sdev_store_eh_timeout);
 
 static ssize_t
+sdev_show_retry_timeout(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct scsi_device *sdev;
+   sdev = to_scsi_device(dev);
+   return snprintf(buf, 20, "%u\n", sdev->retry_timeout / HZ);
+}
+
+static ssize_t
+sdev_store_retry_timeout(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct scsi_device *sdev;
+   unsigned int retry_timeout;
+   int err;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   sdev = to_scsi_device(dev);
+   err = kstrtouint(buf, 10, &retry_timeout);
+   if (err)
+   return err;
+   sdev->retry_timeout = retry_timeout * HZ;
+
+   return count;
+}
+static DEVICE_ATTR(retry_timeout, S_IRUGO | S_IWUSR,
+  sdev_show_retry_timeout, sdev_store_retry_timeout);
+
+static ssize_t
 store_rescan_field (struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
 {
@@ -797,6 +828,7 @@ static struct attribute *scsi_sdev_attrs[] = {
&dev_attr_state.attr,

Re: [REVIEW PATCH] scsi: Add 'retry_timeout' to avoid infinite command retry

2014-02-05 Thread Eiichi Tsukata

(2014/02/06 1:55), James Bottomley wrote:

On Wed, 2014-02-05 at 14:47 +0900, Eiichi Tsukata wrote:

Currently, scsi error handling in scsi_decide_disposition() tries to
unconditionally requeue scsi command when device keeps some error state.
This is because retryable errors are thought to be temporary and the scsi
device will soon recover from those errors. Normally, such retry policy is
appropriate because the device will soon recover from temporary error state.

This description isn't very descriptive.  I presume you're talking about
the ADD_TO_MLQUEUE return from scsi_decide_disposition()?



Thanks for reviewing, James.

I was talking about ADD_TO_MLQUEUE and NEEDS_RETRY.
I'll fix the description.


But there is no guarantee that device is able to recover from error state
immediately. Some hardware error may prevent device from recovering.
Therefore a hardware error can result in an infinite command retry loop.

If you're talking about ADD_TO_MLQUEUE, I don't believe this is correct:
there's a test in scsi_softirq_done() that makes sure the maximum
lifetime is retries*timeout and fails the command after that.


I see, there's already a timeout in scsi_softirq_done(), so my patch is
not correct, as you say.

What I really want to do is to prevent a command from retrying
indefinitely in scsi_io_completion() with
action == ACTION_RETRY || action == ACTION_DELAYED_RETRY.
In the v2 patch, I'll move the retry timeout check from
scsi_softirq_done() to scsi_io_completion().

Eiichi


(2014/02/06 1:55), James Bottomley wrote:

On Wed, 2014-02-05 at 14:47 +0900, Eiichi Tsukata wrote:

Currently, scsi error handling in scsi_decide_disposition() tries to
unconditionally requeue scsi command when device keeps some error state.
This is because retryable errors are thought to be temporary and the scsi
device will soon recover from those errors. Normally, such retry policy is
appropriate because the device will soon recover from temporary error state.

This description isn't very descriptive.  I presume you're talking about
the ADD_TO_MLQUEUE return from scsi_decide_disposition()?


But there is no guarantee that device is able to recover from error state
immediately. Some hardware error may prevent device from recovering.
Therefore a hardware error can result in an infinite command retry loop.

If you're talking about ADD_TO_MLQUEUE, I don't believe this is correct:
there's a test in scsi_softirq_done() that makes sure the maximum
lifetime is retries*timeout and fails the command after that.

James



This patchs adds 'retry_timeout' sysfs attribute which limits the retry time
of each scsi command. Once scsi command retry time is longer than this timeout,
the command is treated as failure. 'retry_timeout' is set to '0' by default
which means no timeout set.

Signed-off-by: Eiichi Tsukata 
Cc: "James E.J. Bottomley" 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
  drivers/scsi/scsi_lib.c| 22 +-
  drivers/scsi/scsi_scan.c   |  1 +
  drivers/scsi/scsi_sysfs.c  | 32 
  include/scsi/scsi.h|  1 +
  include/scsi/scsi_device.h |  1 +
  5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7bd7f0d..a9db69e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1492,6 +1492,23 @@ static void scsi_kill_request(struct request *req, 
struct request_queue *q)
blk_complete_request(req);
  }
  
+/*

+ * Check if the scsi command exceeded the retry timeout
+ */
+static int scsi_retry_timed_out(struct scsi_cmnd *cmd)
+{
+   unsigned int retry_timeout;
+
+   retry_timeout = cmd->device->retry_timeout;
+   if (retry_timeout &&
+   time_before(cmd->jiffies_at_alloc + retry_timeout, jiffies)) {
+   scmd_printk(KERN_INFO, cmd, "retry timeout\n");
+   return 1;
+   }
+
+   return 0;
+}
+
  static void scsi_softirq_done(struct request *rq)
  {
struct scsi_cmnd *cmd = rq->special;
@@ -1512,7 +1529,10 @@ static void scsi_softirq_done(struct request *rq)
wait_for/HZ);
disposition = SUCCESS;
}
-   
+
+   if (scsi_retry_timed_out(cmd))
+   disposition = FAILED;
+
scsi_log_completion(cmd, disposition);
  
  	switch (disposition) {

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 307a811..4ab044a 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -925,6 +925,7 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned 
char *inq_result,
sdev->no_dif = 1;
  
  	sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;

+   sdev->retry_timeout = SCSI_DEFAULT_RETRY_TIMEOUT;
  
  	if (*bflags & BLIST_SKIP_VPD_PAGES)

sdev->skip_vpd_pages = 1;
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_

[REVIEW PATCH] scsi: Add 'retry_timeout' to avoid infinite command retry

2014-02-04 Thread Eiichi Tsukata
Currently, scsi error handling in scsi_decide_disposition() tries to
requeue a scsi command unconditionally while the device stays in some
error state. This happens because retryable errors are assumed to be
temporary, so such a retry policy is normally appropriate: the device is
expected to recover soon.

But there is no guarantee that the device can recover from the error
state immediately. Some hardware error may prevent the device from
recovering. Therefore a hardware error can result in an infinite command
retry loop.

This patch adds a 'retry_timeout' sysfs attribute which limits the retry
time of each scsi command. Once a command has been retried for longer
than this timeout, the command is treated as failed. 'retry_timeout' is
set to '0' by default, which means no timeout is set.

Signed-off-by: Eiichi Tsukata 
Cc: "James E.J. Bottomley" 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_lib.c| 22 +-
 drivers/scsi/scsi_scan.c   |  1 +
 drivers/scsi/scsi_sysfs.c  | 32 
 include/scsi/scsi.h|  1 +
 include/scsi/scsi_device.h |  1 +
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7bd7f0d..a9db69e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1492,6 +1492,23 @@ static void scsi_kill_request(struct request *req, 
struct request_queue *q)
blk_complete_request(req);
 }
 
+/*
+ * Check if the scsi command exceeded the retry timeout
+ */
+static int scsi_retry_timed_out(struct scsi_cmnd *cmd)
+{
+   unsigned int retry_timeout;
+
+   retry_timeout = cmd->device->retry_timeout;
+   if (retry_timeout &&
+   time_before(cmd->jiffies_at_alloc + retry_timeout, jiffies)) {
+   scmd_printk(KERN_INFO, cmd, "retry timeout\n");
+   return 1;
+   }
+
+   return 0;
+}
+
 static void scsi_softirq_done(struct request *rq)
 {
struct scsi_cmnd *cmd = rq->special;
@@ -1512,7 +1529,10 @@ static void scsi_softirq_done(struct request *rq)
wait_for/HZ);
disposition = SUCCESS;
}
-   
+
+   if (scsi_retry_timed_out(cmd))
+   disposition = FAILED;
+
scsi_log_completion(cmd, disposition);
 
switch (disposition) {
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 307a811..4ab044a 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -925,6 +925,7 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned 
char *inq_result,
sdev->no_dif = 1;
 
sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;
+   sdev->retry_timeout = SCSI_DEFAULT_RETRY_TIMEOUT;
 
if (*bflags & BLIST_SKIP_VPD_PAGES)
sdev->skip_vpd_pages = 1;
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 8ff62c2..eaa2118 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -627,6 +627,37 @@ sdev_store_eh_timeout(struct device *dev, struct 
device_attribute *attr,
 static DEVICE_ATTR(eh_timeout, S_IRUGO | S_IWUSR, sdev_show_eh_timeout, 
sdev_store_eh_timeout);
 
 static ssize_t
+sdev_show_retry_timeout(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct scsi_device *sdev;
+   sdev = to_scsi_device(dev);
+   return snprintf(buf, 20, "%u\n", sdev->retry_timeout / HZ);
+}
+
+static ssize_t
+sdev_store_retry_timeout(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct scsi_device *sdev;
+   unsigned int retry_timeout;
+   int err;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   sdev = to_scsi_device(dev);
+   err = kstrtouint(buf, 10, &retry_timeout);
+   if (err)
+   return err;
+   sdev->retry_timeout = retry_timeout * HZ;
+
+   return count;
+}
+static DEVICE_ATTR(retry_timeout, S_IRUGO | S_IWUSR,
+  sdev_show_retry_timeout, sdev_store_retry_timeout);
+
+static ssize_t
 store_rescan_field (struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
 {
@@ -797,6 +828,7 @@ static struct attribute *scsi_sdev_attrs[] = {
&dev_attr_state.attr,
&dev_attr_timeout.attr,
&dev_attr_eh_timeout.attr,
+   &dev_attr_retry_timeout.attr,
&dev_attr_iocounterbits.attr,
&dev_attr_iorequest_cnt.attr,
&dev_attr_iodone_cnt.attr,
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 66d42ed..545408d 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -16,6 +16,7 @@ struct scsi_cmnd;
 
 enum scsi_timeouts {
SCSI_DEFAULT_EH_TIMEOUT = 10 * HZ,
+   SCSI_DEFAULT_RETRY_TIMEOUT  = 0,/* disabled by default */
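
The sysfs store/show pair in the patch converts between user-visible
seconds and internally stored jiffies (store multiplies by HZ, show
divides). A minimal sketch of that round trip, assuming HZ=1000 purely
for illustration:

```python
HZ = 1000  # illustrative tick rate; the kernel's HZ is a build-time constant

def store_retry_timeout(seconds: int) -> int:
    """Model of the sysfs write path: the user writes seconds,
    the driver stores jiffies (seconds * HZ)."""
    return seconds * HZ

def show_retry_timeout(jiffies_val: int) -> str:
    """Model of the sysfs read path: stored jiffies shown as whole seconds."""
    return f"{jiffies_val // HZ}\n"

# A value written through the store path reads back unchanged.
print(show_retry_timeout(store_retry_timeout(30)))  # prints "30"
```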

[PATCH] scsi: Add printk to detect retry loop

2013-09-05 Thread Eiichi Tsukata
Currently, scsi error handling in scsi_decide_disposition() unconditionally
retries on some errors. This is because retriable errors are thought to be
temporary and the scsi device is expected to recover from them soon. But there
is no guarantee that the device is able to recover from the error state
immediately. The problem is that there is no easy way to detect a retry loop
from user space.

This patch adds a printk to let user space detect a command retry loop. When
the command retry count reaches the allowed count (scmd->allowed), the
kernel prints a message, which can be handled by a user space application.
Here the allowed count (scmd->allowed) is currently used as the finite retry
limit. Once the retry count reaches the allowed count on a device, further
messages are suppressed on that device to avoid flooding dmesg.
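
The one-shot reporting described above can be modeled in a few lines. This is
an illustrative sketch only; the Device class and note_retry() helper are
hypothetical stand-ins for scsi_device and the scsi_softirq_done() hunk:

```python
class Device:
    """Minimal stand-in for scsi_device with the patch's retry_over flag."""
    def __init__(self):
        self.retry_over = False

def note_retry(dev: Device, retries: int, allowed: int, log: list) -> int:
    """Model of the one-shot warning: log a message the first time the
    retry count reaches the allowed count, then suppress further messages
    for this device via the retry_over flag."""
    retries += 1
    if allowed > 0 and retries == allowed and not dev.retry_over:
        log.append(f"command retried {allowed} times")
        dev.retry_over = True
    return retries

dev, log, retries = Device(), [], 0
for _ in range(10):          # many completions of a looping command
    retries = note_retry(dev, retries, allowed=5, log=log)
print(log)                   # exactly one message is emitted
```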

Signed-off-by: Eiichi Tsukata 
Cc: "James E.J. Bottomley" 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_error.c  |3 +--
 drivers/scsi/scsi_lib.c|   14 ++
 include/scsi/scsi_device.h |1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2150596..31d10f4 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1615,8 +1615,7 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 * the request was not marked fast fail.  Note that above,
 * even if the request is marked fast fail, we still requeue
 * for queue congestion conditions (QUEUE_FULL or BUSY) */
-   if ((++scmd->retries) <= scmd->allowed
-   && !scsi_noretry_cmd(scmd)) {
+   if (scmd->retries < scmd->allowed && !scsi_noretry_cmd(scmd)) {
return NEEDS_RETRY;
} else {
/*
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 124392f..0198490 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1513,6 +1513,20 @@ static void scsi_softirq_done(struct request *rq)
disposition = SUCCESS;
}

+   /*
+* Print message if retry count exceeds allowed count.
+* This message can be used by user space application to detect
+* indefinite command retry loop.
+*/
+   if (cmd->allowed > 0 && ++cmd->retries == cmd->allowed) {
+   /* Once a command retry over was detected, suppress message */
+   if (!cmd->device->retry_over) {
+   scmd_printk(KERN_INFO, cmd,
+   "command retried %d times\n", cmd->allowed);
+   scsi_print_command(cmd);
+   cmd->device->retry_over = 1;
+   }
+   }
scsi_log_completion(cmd, disposition);
 
switch (disposition) {
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index a44954c..8751d82 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -160,6 +160,7 @@ struct scsi_device {
unsigned is_visible:1;  /* is the device visible in sysfs */
unsigned wce_default_on:1;  /* Cache is ON by default */
unsigned no_dif:1;  /* T10 PI (DIF) should be disabled */
+   unsigned retry_over:1;  /* retry count exceeded allowed count */
 
atomic_t disk_events_disable_depth; /* disable depth for disk events */
 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] scsi: Add printk to detect retry loop

2013-09-05 Thread Eiichi Tsukata
Hello,

This patch is to avoid an indefinite command retry loop.
The previous RFC patch is here:
http://marc.info/?t=13769053762=1=3

In the previous discussion, James said that once a retry loop is detected,
whether or not to panic (or offline the device) should be decided by the
user, and not in the scsi subsystem. So we need a way to detect a retry
loop and report it to user space.

This patch adds a printk to let user space detect a command retry loop. When
the command retry count reaches the allowed count (scmd->allowed), the
kernel prints a message, which can be handled by a user space application.
Here the allowed count (scmd->allowed) is currently used as the finite retry
limit. Once the retry count reaches the allowed count on a device, further
messages are suppressed on that device to avoid flooding dmesg.


---

Eiichi Tsukata (1):
  scsi: Add printk to detect retry loop


 drivers/scsi/scsi_error.c  |3 +--
 drivers/scsi/scsi_lib.c|   14 ++
 include/scsi/scsi_device.h |1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

-- 
Eiichi Tsukata
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: eiichi.tsukata...@hitachi.com



[PATCH] scsi: Add printk to detect retry loop

2013-09-05 Thread Eiichi Tsukata
Hello,

This patch is to avoid indefinite command retry loop.
The previous RFC patch is here:
http://marc.info/?t=13769053762r=1w=3

In the previous discuss, James says that once retry loop is detected,
whether or not to panic(offline) should be decided by user, and not in
scsi subsystem. So we need a way to detect retry loop and report it to
user space.

This patch adds printk to detect command retry loop in user space. When
the command retry count exceeds the allowed count(scmd-allowed), the
kernel prints messages, which can be handled in user space application.
Here the allowed count(scmd-allowed) is currently used as finite retry
limit count. Once retry count exceeds the allowed count on a device,
the message is suppressed on the device to avoid too much messages
outputted in dmesg.


---

Eiichi Tsukata (1):
  scsi: Add printk to detect retry loop


 drivers/scsi/scsi_error.c  |3 +--
 drivers/scsi/scsi_lib.c|   14 ++
 include/scsi/scsi_device.h |1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

-- 
Eiichi Tsukata
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: eiichi.tsukata...@hitachi.com

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] scsi: Add printk to detect retry loop

2013-09-05 Thread Eiichi Tsukata
Currently, scsi error handling in scsi_decide_disposition() unconditionally
retries on some errors. This is because retriable errors are thought to be
temporary and the scsi device will soon recover from those errors. But there
is no guarantee that the device is able to recover from error state
immediately. The problem is that there is no easy way to detect retry loop in
user space.

This patch adds printk to detect command retry loop in user space. When
the command retry count exceeds the allowed count(scmd-allowed), the
kernel prints messages, which can be handled in user space application.
Here the allowed count(scmd-allowed) is currently used as finite retry
limit count. Once retry count exceeds the allowed count on a device,
the message is suppressed on the device to avoid too much messages
outputted in dmesg.

Signed-off-by: Eiichi Tsukata eiichi.tsukata...@hitachi.com
Cc: James E.J. Bottomley jbottom...@parallels.com
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_error.c  |3 +--
 drivers/scsi/scsi_lib.c|   14 ++
 include/scsi/scsi_device.h |1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2150596..31d10f4 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1615,8 +1615,7 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 * the request was not marked fast fail.  Note that above,
 * even if the request is marked fast fail, we still requeue
 * for queue congestion conditions (QUEUE_FULL or BUSY) */
-   if ((++scmd->retries) <= scmd->allowed
-   && !scsi_noretry_cmd(scmd)) {
+   if (scmd->retries < scmd->allowed && !scsi_noretry_cmd(scmd)) {
return NEEDS_RETRY;
} else {
/*
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 124392f..0198490 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1513,6 +1513,20 @@ static void scsi_softirq_done(struct request *rq)
disposition = SUCCESS;
}

+   /*
+    * Print message if retry count exceeds allowed count.
+    * This message can be used by user space application to detect
+    * indefinite command retry loop.
+    */
+   if (cmd->allowed > 0 && ++cmd->retries == cmd->allowed) {
+       /* Once a command retry over was detected, suppress message */
+       if (!cmd->device->retry_over) {
+           scmd_printk(KERN_INFO, cmd,
+                   "command retried %d times\n", cmd->allowed);
+           scsi_print_command(cmd);
+           cmd->device->retry_over = 1;
+       }
+   }
scsi_log_completion(cmd, disposition);
 
switch (disposition) {
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index a44954c..8751d82 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -160,6 +160,7 @@ struct scsi_device {
unsigned is_visible:1;  /* is the device visible in sysfs */
unsigned wce_default_on:1;  /* Cache is ON by default */
unsigned no_dif:1;  /* T10 PI (DIF) should be disabled */
+   unsigned retry_over:1;  /* retry count exceeded allowed count */
 
atomic_t disk_events_disable_depth; /* disable depth for disk events */
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-26 Thread Eiichi Tsukata

(2013/08/23 21:26), Ric Wheeler wrote:

On 08/23/2013 05:10 AM, Eiichi Tsukata wrote:

(2013/08/21 3:09), Ewan Milne wrote:

On Tue, 2013-08-20 at 16:13 +0900, Eiichi Tsukata wrote:

(2013/08/19 23:30), James Bottomley wrote:

On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote:

Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, scsi error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary, and the scsi device is expected
to recover from them soon. Normally, such a retry policy is appropriate,
because the device soon recovers from a temporary error state.
But there is no guarantee that the device can recover from the error state
immediately; some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In
fact, a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an
infinite retry loop in our environment. As the comments in the kernel source
say, UNIT_ATTENTION means the device must have experienced a power glitch and
is expected to recover from that state immediately. But it seems that a
hardware error caused a permanent UNIT_ATTENTION error.

To solve the above problem, this patch introduces a scsi device "failfast
mode". If failfast mode is enabled, the retry count of every scsi command is
limited to scsi->allowed (== SD_MAX_RETRIES == 5). No command may retry
indefinitely; a command fails immediately once its retry count exceeds the
upper limit. Failfast mode is useful on mission-critical systems which are
required to keep running flawlessly, because they need to fail over to the
secondary system once they detect a failure.
By default, failfast mode is disabled, because a failfast policy is not
suitable for most use cases, which can accept I/O latency due to device
hardware errors.

To enable failfast mode (default disabled):
# echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
# echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast

Furthermore, I'm planning to make the upper limit count configurable.
Currently, I have two plans to implement it:
(1) set the same upper limit count for all errors.
(2) set an upper limit count for each error.
The first implementation is simple and easy to implement, but not flexible:
someone may want to set a different upper limit for each error, depending on
the scsi device they use. The second implementation satisfies such a
requirement but can be too fine-grained and annoying to configure, because
there are so many scsi error codes. The default of 5 retries may be too many
for some errors but too few for others.

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.


I'm afraid you'll need to propose another solution. We have a large
selection of commands which, by design, retry until the command exceeds
its timeout. UA is one of those (as are most of the others you're
limiting). How do you kick this device out of its UA return (because
that's the recovery that needs to happen)?

James




Thanks for reviewing, James.

Originally, I planned that once the retry count exceeded its limit,
a monitoring tool would stop the server, using the scsi printk error message
as a trigger.
In the current failfast mode implementation, the command fails when its
retry count exceeds the limit. However, I noticed that merely printing error
messages on retry count excess, without changing the retry logic, would be
enough to stop the server and fail over. But there is no guarantee that a
userspace application can work properly under disk failure conditions,
so now I'm considering that just calling panic() on retry excess is better.

For that reason, I propose adding a "panic_on_error" option as a sysfs
parameter: if panic_on_error mode is enabled, the server panics immediately
once it detects retry excess. Of course, it is disabled by default.

I would appreciate it if you could give me some comments.

Eiichi
--


For what it's worth, I've seen a report of a case where a storage array
returned a CHECK CONDITION with invalid sense data, which caused the
command to be retried indefinitely.


Thank you for commenting, Ewan.
I appreciate your information about indefinite retry on CHECK CONDITION.


I'm not sure what you can do about
this, if the device won't ever complete a command without an error.
Perhaps it should be offlined after sufficiently bad behavior.

I don't think you want to panic on an error, though. In a clustered
environment it is possible that the other systems will all fail in the
same way, for example.

-Ewan



Yes, basically the device should be offlined on error detection.
Just offlining the disk is enough when the error occurs on a disk other than
the OS-installed system disk; panic would be going too far in such a case.

However, in a clustered environment where each computer uses its own disk and
they do not share the same disk

Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-26 Thread Eiichi Tsukata

(2013/08/24 4:36), Ewan Milne wrote:

On Fri, 2013-08-23 at 06:19 -0700, James Bottomley wrote:

On Fri, 2013-08-23 at 18:10 +0900, Eiichi Tsukata wrote:

Yes, basically the device should be offlined on error detection.
Just offlining the disk is enough when the error occurs on a disk other than
the OS-installed system disk; panic would be going too far in such a case.

However, in a clustered environment where each computer uses its own disk
and they do not share the same disk, calling panic() will be suitable when
an error occurs in the system disk.


However, when not in a clustered environment, it won't be.  Decisions
about whether to panic the system or not are user space policy, and
should not be embedded into subsystems.  What we need to do is to come
up with a way of detecting the condition, reporting it and possibly
taking some action.


  Because even on such a disk error, a cluster monitoring
tool may not be able to detect the system failure while the heartbeat can
continue working.
So, I think basically offlining is enough, and panic is also necessary
in some cases.


The way I have seen this done in such a clustered environment is to have
the heartbeat agent on each system periodically attempt to access the
disk.  If that I/O hangs, other systems will see loss of heartbeat.
You really don't want to panic the kernel.  Among other things, it may
make it difficult to get the system up again later for long enough to
figure out what is wrong.



Sounds good.
Disk access on each heartbeat is a reasonable way to detect I/O errors.

But can that approach distinguish an indefinite command retry?
I'd like to tell indefinite retries apart from other disk errors.

I'm now considering a printk error message on retry count excess.
There should be some reporting mechanism in the kernel.

Eiichi
--


Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-26 Thread Eiichi Tsukata

(2013/08/23 22:19), James Bottomley wrote:

On Fri, 2013-08-23 at 18:10 +0900, Eiichi Tsukata wrote:

(2013/08/21 3:09), Ewan Milne wrote:

On Tue, 2013-08-20 at 16:13 +0900, Eiichi Tsukata wrote:

(2013/08/19 23:30), James Bottomley wrote:

On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote:

Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, scsi error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary, and the scsi device is expected
to recover from them soon. Normally, such a retry policy is appropriate,
because the device soon recovers from a temporary error state.
But there is no guarantee that the device can recover from the error state
immediately; some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In
fact, a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an
infinite retry loop in our environment. As the comments in the kernel source
say, UNIT_ATTENTION means the device must have experienced a power glitch and
is expected to recover from that state immediately. But it seems that a
hardware error caused a permanent UNIT_ATTENTION error.

To solve the above problem, this patch introduces a scsi device "failfast
mode". If failfast mode is enabled, the retry count of every scsi command is
limited to scsi->allowed (== SD_MAX_RETRIES == 5). No command may retry
indefinitely; a command fails immediately once its retry count exceeds the
upper limit. Failfast mode is useful on mission-critical systems which are
required to keep running flawlessly, because they need to fail over to the
secondary system once they detect a failure.
By default, failfast mode is disabled, because a failfast policy is not
suitable for most use cases, which can accept I/O latency due to device
hardware errors.

To enable failfast mode (default disabled):
# echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
# echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast

Furthermore, I'm planning to make the upper limit count configurable.
Currently, I have two plans to implement it:
(1) set the same upper limit count for all errors.
(2) set an upper limit count for each error.
The first implementation is simple and easy to implement, but not flexible:
someone may want to set a different upper limit for each error, depending on
the scsi device they use. The second implementation satisfies such a
requirement but can be too fine-grained and annoying to configure, because
there are so many scsi error codes. The default of 5 retries may be too many
for some errors but too few for others.

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.


I'm afraid you'll need to propose another solution.  We have a large
selection of commands which, by design, retry until the command exceeds
its timeout.  UA is one of those (as are most of the others you're
limiting).  How do you kick this device out of its UA return (because
that's the recovery that needs to happen)?

James




Thanks for reviewing, James.

Originally, I planned that once the retry count exceeded its limit,
a monitoring tool would stop the server, using the scsi printk error message
as a trigger.
In the current failfast mode implementation, the command fails when its
retry count exceeds the limit. However, I noticed that merely printing error
messages on retry count excess, without changing the retry logic, would be
enough to stop the server and fail over. But there is no guarantee that a
userspace application can work properly under disk failure conditions,
so now I'm considering that just calling panic() on retry excess is better.

For that reason, I propose adding a "panic_on_error" option as a sysfs
parameter: if panic_on_error mode is enabled, the server panics immediately
once it detects retry excess. Of course, it is disabled by default.

I would appreciate it if you could give me some comments.

Eiichi
--


For what it's worth, I've seen a report of a case where a storage array
returned a CHECK CONDITION with invalid sense data, which caused the
command to be retried indefinitely.


Thank you for commenting, Ewan.
I appreciate your information about indefinite retry on CHECK CONDITION.


I'm not sure what you can do about
this, if the device won't ever complete a command without an error.
Perhaps it should be offlined after sufficiently bad behavior.

I don't think you want to panic on an error, though.  In a clustered
environment it is possible that the other systems will all fail in the
same way, for example.

-Ewan



Yes, basically the device should be offlined on error detection.
Just offlining the disk is enough when the error occurs on a disk other than
the OS-installed system disk; panic would be going too far in such a case.

However, in a clustered environment where computers

Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-23 Thread Eiichi Tsukata

(2013/08/21 3:09), Ewan Milne wrote:

On Tue, 2013-08-20 at 16:13 +0900, Eiichi Tsukata wrote:

(2013/08/19 23:30), James Bottomley wrote:

On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote:

Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, scsi error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary, and the scsi device is expected
to recover from them soon. Normally, such a retry policy is appropriate,
because the device soon recovers from a temporary error state.
But there is no guarantee that the device can recover from the error state
immediately; some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In
fact, a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an
infinite retry loop in our environment. As the comments in the kernel source
say, UNIT_ATTENTION means the device must have experienced a power glitch and
is expected to recover from that state immediately. But it seems that a
hardware error caused a permanent UNIT_ATTENTION error.

To solve the above problem, this patch introduces a scsi device "failfast
mode". If failfast mode is enabled, the retry count of every scsi command is
limited to scsi->allowed (== SD_MAX_RETRIES == 5). No command may retry
indefinitely; a command fails immediately once its retry count exceeds the
upper limit. Failfast mode is useful on mission-critical systems which are
required to keep running flawlessly, because they need to fail over to the
secondary system once they detect a failure.
By default, failfast mode is disabled, because a failfast policy is not
suitable for most use cases, which can accept I/O latency due to device
hardware errors.

To enable failfast mode (default disabled):
   # echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
   # echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast

Furthermore, I'm planning to make the upper limit count configurable.
Currently, I have two plans to implement it:
(1) set the same upper limit count for all errors.
(2) set an upper limit count for each error.
The first implementation is simple and easy to implement, but not flexible:
someone may want to set a different upper limit for each error, depending on
the scsi device they use. The second implementation satisfies such a
requirement but can be too fine-grained and annoying to configure, because
there are so many scsi error codes. The default of 5 retries may be too many
for some errors but too few for others.

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.


I'm afraid you'll need to propose another solution.  We have a large
selection of commands which, by design, retry until the command exceeds
its timeout.  UA is one of those (as are most of the others you're
limiting).  How do you kick this device out of its UA return (because
that's the recovery that needs to happen)?

James




Thanks for reviewing, James.

Originally, I planned that once the retry count exceeded its limit,
a monitoring tool would stop the server, using the scsi printk error message
as a trigger.
In the current failfast mode implementation, the command fails when its
retry count exceeds the limit. However, I noticed that merely printing error
messages on retry count excess, without changing the retry logic, would be
enough to stop the server and fail over. But there is no guarantee that a
userspace application can work properly under disk failure conditions,
so now I'm considering that just calling panic() on retry excess is better.

For that reason, I propose adding a "panic_on_error" option as a sysfs
parameter: if panic_on_error mode is enabled, the server panics immediately
once it detects retry excess. Of course, it is disabled by default.

I would appreciate it if you could give me some comments.

Eiichi
--


For what it's worth, I've seen a report of a case where a storage array
returned a CHECK CONDITION with invalid sense data, which caused the
command to be retried indefinitely.


Thank you for commenting, Ewan.
I appreciate your information about indefinite retry on CHECK CONDITION.


I'm not sure what you can do about
this, if the device won't ever complete a command without an error.
Perhaps it should be offlined after sufficiently bad behavior.

I don't think you want to panic on an error, though.  In a clustered
environment it is possible that the other systems will all fail in the
same way, for example.

-Ewan



Yes, basically the device should be offlined on error detection.
Just offlining the disk is enough when the error occurs on a disk other than
the OS-installed system disk; panic would be going too far in such a case.

However, in a clustered environment where each computer uses its own disk
and they do not share the same disk, calling panic() will be suitable when
an error occurs in

Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-23 Thread Eiichi Tsukata

(2013/08/21 3:09), Ewan Milne wrote:

On Tue, 2013-08-20 at 16:13 +0900, Eiichi Tsukata wrote:

(2013/08/19 23:30), James Bottomley wrote:

On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote:

Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, scsi error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary, so the scsi device is expected
to recover from them soon. Normally such a retry policy is appropriate,
because the device will soon recover from a temporary error state.
But there is no guarantee that the device can recover from the error state
immediately; some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In
fact, a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an
infinite retry loop in our environment. As the comments in the kernel source
code say, UNIT_ATTENTION means the device must have experienced a power
glitch and is expected to recover from that state immediately. But it seems
that a hardware error caused a permanent UNIT_ATTENTION error.

To solve the above problem, this patch introduces a scsi device "failfast mode".
If failfast mode is enabled, the retry counts of all scsi commands are limited
to scsi->allowed (== SD_MAX_RETRIES == 5). No command is allowed to retry
indefinitely; a command fails immediately once its retry count exceeds the
upper limit. Failfast mode is useful on mission-critical systems which are
required to keep running flawlessly, because they need to fail over to the
secondary system once they detect a failure.
By default, failfast mode is disabled, because the failfast policy is not
suitable for most use cases, which can accept I/O latency due to device
hardware errors.

To enable failfast mode (disabled by default):
   # echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
   # echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast

Furthermore, I'm planning to make the upper limit count configurable.
Currently, I have two plans to implement it:
(1) set the same upper limit count for all errors.
(2) set an upper limit count for each error.
The first implementation is simple and easy to implement, but not flexible.
Some users may want a different upper limit for each error, depending on the
scsi device they use. The second implementation satisfies such a requirement,
but can be too fine-grained and annoying to configure, because there are so
many scsi error codes. The default of 5 retries may be too many for some
errors but too few for others.

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.


I'm afraid you'll need to propose another solution.  We have a large
selection of commands which, by design, retry until the command exceeds
its timeout.  UA is one of those (as are most of the others you're
limiting).  How do you kick this device out of its UA return (because
that's the recovery that needs to happen)?

James




Thanks for reviewing, James.

Originally, I planned that once the retry count exceeded its limit,
a monitoring tool would stop the server, using the scsi printk error message
as a trigger.
In the current failfast mode implementation, a command fails when its retry
count exceeds the limit. However, I noticed that merely printing error
messages on retry-count excess, without changing the retry logic, would be
enough to stop the server and fail over, though there is no guarantee that a
userspace application can work properly under a disk-failure condition.
So now I'm considering that simply calling panic() on retry excess is better.

For that reason, I propose adding a "panic_on_error" option as a sysfs
parameter: if panic_on_error mode is enabled, the server panics immediately
once it detects retry excess. Of course, it is disabled by default.
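
As a rough userspace sketch of the proposed behaviour (the names and the
helper are hypothetical, not the posted patch): within the retry budget the
command is retried; once the budget is exhausted, either the command fails
(the default) or the machine panics so a clustered peer can take over.

```c
/* Sketch of the proposed retry-excess handling. All names are
 * hypothetical; this is not the posted kernel patch. */
enum outcome { RETRY_CMD, FAIL_CMD, PANIC_SYSTEM };

static enum outcome complete_retry(int retries, int limit, int panic_on_error)
{
    if (retries < limit)
        return RETRY_CMD;               /* still within the allowed budget */
    /* budget exhausted: fail the command, or panic if the node should
     * hand over to a clustered peer */
    return panic_on_error ? PANIC_SYSTEM : FAIL_CMD;
}
```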

I would appreciate it if you could give me some comments.

Eiichi
--


For what it's worth, I've seen a report of a case where a storage array
returned a CHECK CONDITION with invalid sense data, which caused the
command to be retried indefinitely.


Thank you for commenting, Ewan.
I appreciate your information about indefinite retry on CHECK CONDITION.


I'm not sure what you can do about
this, if the device won't ever complete a command without an error.
Perhaps it should be offlined after sufficiently bad behavior.

I don't think you want to panic on an error, though.  In a clustered
environment it is possible that the other systems will all fail in the
same way, for example.

-Ewan



Yes, basically the device should be offlined on error detection.
Just offlining the disk is enough when the error occurs on a disk other than
the OS-installed system disk. Panic would be going too far in such a case.

However, in a clustered environment where each computer uses its own disk and
the disks are not shared, calling panic() is suitable when an error occurs on
the system disk. Because even on such a disk error

Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-20 Thread Eiichi Tsukata

(2013/08/19 23:30), James Bottomley wrote:

On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote:

Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, scsi error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary, so the scsi device is expected
to recover from them soon. Normally such a retry policy is appropriate,
because the device will soon recover from a temporary error state.
But there is no guarantee that the device can recover from the error state
immediately; some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In
fact, a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an
infinite retry loop in our environment. As the comments in the kernel source
code say, UNIT_ATTENTION means the device must have experienced a power
glitch and is expected to recover from that state immediately. But it seems
that a hardware error caused a permanent UNIT_ATTENTION error.

To solve the above problem, this patch introduces a scsi device "failfast mode".
If failfast mode is enabled, the retry counts of all scsi commands are limited
to scsi->allowed (== SD_MAX_RETRIES == 5). No command is allowed to retry
indefinitely; a command fails immediately once its retry count exceeds the
upper limit. Failfast mode is useful on mission-critical systems which are
required to keep running flawlessly, because they need to fail over to the
secondary system once they detect a failure.
By default, failfast mode is disabled, because the failfast policy is not
suitable for most use cases, which can accept I/O latency due to device
hardware errors.

To enable failfast mode (disabled by default):
  # echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
  # echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast

Furthermore, I'm planning to make the upper limit count configurable.
Currently, I have two plans to implement it:
(1) set the same upper limit count for all errors.
(2) set an upper limit count for each error.
The first implementation is simple and easy to implement, but not flexible.
Some users may want a different upper limit for each error, depending on the
scsi device they use. The second implementation satisfies such a requirement,
but can be too fine-grained and annoying to configure, because there are so
many scsi error codes. The default of 5 retries may be too many for some
errors but too few for others.

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.


I'm afraid you'll need to propose another solution.  We have a large
selection of commands which, by design, retry until the command exceeds
its timeout.  UA is one of those (as are most of the others you're
limiting).  How do you kick this device out of its UA return (because
that's the recovery that needs to happen)?

James




Thanks for reviewing, James.

Originally, I planned that once the retry count exceeded its limit,
a monitoring tool would stop the server, using the scsi printk error message
as a trigger.
In the current failfast mode implementation, a command fails when its retry
count exceeds the limit. However, I noticed that merely printing error
messages on retry-count excess, without changing the retry logic, would be
enough to stop the server and fail over, though there is no guarantee that a
userspace application can work properly under a disk-failure condition.
So now I'm considering that simply calling panic() on retry excess is better.

For that reason, I propose adding a "panic_on_error" option as a sysfs
parameter: if panic_on_error mode is enabled, the server panics immediately
once it detects retry excess. Of course, it is disabled by default.

I would appreciate it if you could give me some comments.

Eiichi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-19 Thread Eiichi Tsukata
Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, scsi error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary, so the scsi device is expected
to recover from them soon. Normally such a retry policy is appropriate,
because the device will soon recover from a temporary error state.
But there is no guarantee that the device can recover from the error state
immediately; some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In
fact, a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an
infinite retry loop in our environment. As the comments in the kernel source
code say, UNIT_ATTENTION means the device must have experienced a power
glitch and is expected to recover from that state immediately. But it seems
that a hardware error caused a permanent UNIT_ATTENTION error.

To solve the above problem, this patch introduces a scsi device "failfast mode".
If failfast mode is enabled, the retry counts of all scsi commands are limited
to scsi->allowed (== SD_MAX_RETRIES == 5). No command is allowed to retry
indefinitely; a command fails immediately once its retry count exceeds the
upper limit. Failfast mode is useful on mission-critical systems which are
required to keep running flawlessly, because they need to fail over to the
secondary system once they detect a failure.
By default, failfast mode is disabled, because the failfast policy is not
suitable for most use cases, which can accept I/O latency due to device
hardware errors.

To enable failfast mode (disabled by default):
 # echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
 # echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast
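
The capped-retry policy above can be modeled by a small userspace sketch
(illustrative names only, not kernel code): with failfast enabled, every
retryable error is capped at SD_MAX_RETRIES attempts; with it disabled,
errors the midlayer retries unconditionally, such as a persistent
UNIT ATTENTION, can loop without limit.

```c
#include <stdbool.h>

/* Userspace model of the proposed policy; the names are illustrative. */
#define SD_MAX_RETRIES 5

enum disposition { DO_RETRY, DO_FAIL };

static enum disposition decide(bool failfast, bool unlimited_error, int retries)
{
    if (!failfast && unlimited_error)
        return DO_RETRY;                /* current behaviour: retry forever */
    /* failfast, or an already-bounded error: cap at SD_MAX_RETRIES */
    return retries < SD_MAX_RETRIES ? DO_RETRY : DO_FAIL;
}
```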

Furthermore, I'm planning to make the upper limit count configurable.
Currently, I have two plans to implement it:
(1) set the same upper limit count for all errors.
(2) set an upper limit count for each error.
The first implementation is simple and easy to implement, but not flexible.
Some users may want a different upper limit for each error, depending on the
scsi device they use. The second implementation satisfies such a requirement,
but can be too fine-grained and annoying to configure, because there are so
many scsi error codes. The default of 5 retries may be too many for some
errors but too few for others.
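
Option (2) could look roughly like the following table lookup, with option (1)
as the fallback for errors without a specific entry. This is a sketch under
assumed names; the sense keys and limits are examples, not values from the
patch.

```c
/* Sketch of a per-error retry limit table with a global fallback.
 * Entries are illustrative examples only. */
struct retry_limit { int sense_key; int limit; };

static const struct retry_limit limits[] = {
    { 0x06 /* UNIT ATTENTION  */,  5 },
    { 0x0b /* ABORTED COMMAND */, 10 },
};

#define DEFAULT_RETRY_LIMIT 5

static int retry_limit_for(int sense_key)
{
    for (unsigned int i = 0; i < sizeof(limits) / sizeof(limits[0]); i++)
        if (limits[i].sense_key == sense_key)
            return limits[i].limit;
    return DEFAULT_RETRY_LIMIT;         /* option (1) as the fallback */
}
```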

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.

Signed-off-by: Eiichi Tsukata 
Cc: "James E.J. Bottomley" 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/scsi/scsi_error.c  |   44 +++-
 drivers/scsi/scsi_lib.c|   10 ++
 drivers/scsi/scsi_sysfs.c  |8 +++-
 include/scsi/scsi_device.h |1 +
 4 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2150596..1b6a4b6 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1442,7 +1442,7 @@ int scsi_noretry_cmd(struct scsi_cmnd *scmd)
  */
 int scsi_decide_disposition(struct scsi_cmnd *scmd)
 {
-   int rtn;
+   int rtn, disposition;
 
/*
 * if the device is offline, then we clearly just pass the result back
@@ -1492,12 +1492,14 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 * and not get stuck in a loop.
 */
case DID_SOFT_ERROR:
-   goto maybe_retry;
+   disposition = NEEDS_RETRY;
+   goto limited_retry;
case DID_IMM_RETRY:
-   return NEEDS_RETRY;
-
+   disposition = NEEDS_RETRY;
+   goto retry;
case DID_REQUEUE:
-   return ADD_TO_MLQUEUE;
+   disposition = ADD_TO_MLQUEUE;
+   goto retry;
case DID_TRANSPORT_DISRUPTED:
/*
 * LLD/transport was disrupted during processing of the IO.
@@ -1506,7 +1508,8 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 * based on its timers and recovery capablilities if
 * there are enough retries.
 */
-   goto maybe_retry;
+   disposition = NEEDS_RETRY;
+   goto limited_retry;
case DID_TRANSPORT_FAILFAST:
/*
 * The transport decided to failfast the IO (most likely
@@ -1524,7 +1527,8 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
/* fallthrough */
case DID_BUS_BUSY:
case DID_PARITY:
-   goto maybe_retry;
+   disposition = NEEDS_RETRY;
+   goto limited_retry;
case DID_TIME_OUT:
/*
 * when we scan the bus, we get timeout messages for
@@ -1566,17 +1570,21 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 * the em
