Re: [Linux v4.10.0-rc1] call-traces after suspend-resume (pm? i915? cpu/hotplug?)

2016-12-29 Thread Mika Kuoppala
Sedat Dilek <sedat.di...@gmail.com> writes:

> On Wed, Dec 28, 2016 at 11:32 PM, Rafael J. Wysocki <raf...@kernel.org> wrote:
>> On Wed, Dec 28, 2016 at 11:00 AM, Sedat Dilek <sedat.di...@gmail.com> wrote:
>>> On Wed, Dec 28, 2016 at 9:29 AM, Jani Nikula <jani.nik...@intel.com> wrote:
>>>> On Wed, 28 Dec 2016, Sedat Dilek <sedat.di...@gmail.com> wrote:
>>>>> On Tue, Dec 27, 2016 at 10:13 PM, Pavel Machek <pa...@ucw.cz> wrote:
>>>>>> Hi!
>>>>>>
>>>>>>> [ Add some pm | i915 | x86 folks ]
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have built Linux v4.10-rc1 today on my Ubuntu/precise AMD64 system
>>>>>>> and I see some call-traces.
>>>>>>> It is reproducible on suspend and resume.
>>>>>>>
>>>>>>> I cannot say which area touches the problem or if these are several
>>>>>>> independent problems.
>>>>>>>
>>>>>>> For a full dmesg-log see attachments (my linux-config is attached, too).
>>>>>>>
>>>>>>> Here some hunks...
>>>>>>>
>>>>>>> [   29.003601] BUG: sleeping function called from invalid context at
>>>>>>> drivers/base/power/runtime.c:1032
>>>>>>> [   29.003608] in_atomic(): 1, irqs_disabled(): 0, pid: 1469, name: Xorg
>>>>>>> [   29.003610] 1 lock held by Xorg/1469:
>>>>>>> [   29.003611]  #0:  (>struct_mutex){+.+.+.}, at:
>>>>>>> [] i915_mutex_lock_interruptible+0x43/0x140 [i915]
>>>>>>> [   29.003653] CPU: 0 PID: 1469 Comm: Xorg Not tainted
>>>>>>> 4.10.0-rc1-1-iniza-small #1
>>>>>>> [   29.003655] Hardware name: SAMSUNG ELECTRONICS CO., LTD.
>>>>>>> 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013
>>>>>>> [   29.003656] Call Trace:
>>>>>>
>>>>>> Just a note, at least 2 machines here refuse to resume with
>>>>>> v4.10-rc1. One has intel graphics, one has AMD. It may or may not have
>>>>>> common cause...
>>>>>>
>>>>>
>>>>> [ Correct linux-pm ML and add Mika & Jani ]
>>>>>
>>>>> Thanks for the feedback.
>>>>>
>>>>> There are some cpu/hotplug fixes post-v4.10-rc1.
>>>>> Give that a try.
>>>>>
>>>>> Yesterday, after answers from drm-intel folks I have seen that a
>>>>> cpu/hotplug commit [1] was reverted in
>>>>> drm-intel.git#drm-intel-nightly.
>>>>> I haven't tried that.
>>>>>
>>>>> It's good when Thomas knows of this and gets in contact with drm-intel 
>>>>> folks.
>>>>>
>>>>> Regards,
>>>>> - Sedat -
>>>>>
>>>>> [1] 
>>>>> https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-nightly=e558f178f5390185b7324ff4b816b52c6ae3a928
>>>>> [2] https://cgit.freedesktop.org/drm-intel/log/?h=drm-intel-nightly
>>>>>
>>>>> P.S.: Revert "cpu/hotplug: Prevent overwriting of callbacks"
>>>>>
>>>>> This reverts commit dc280d93623927570da279e99393879dbbab39e7
>>>>> Author: Thomas Gleixner <t...@linutronix.de>
>>>>> Date: Wed Dec 21 20:19:49 2016 +0100
>>>>> cpu/hotplug: Prevent overwriting of callbacks
>>>>>
>>>>> It started hanging all machines in CI s3 test:
>>>>> https://intel-gfx-ci.01.org/CI/igt@gem_exec_susp...@basic-s3.html
>>>>>
>>>>> Bisected-by: Mika Kuoppala <mika.kuopp...@linux.intel.com>
>>>>> Signed-off-by: Jani Nikula <jani.nik...@intel.com>
>>>>
>>>> Thomas -
>>>>
>>>> Indeed, basically all of the boxes in the intel-gfx CI hang at the
>>>> suspend/resume test with dc280d936239 ("cpu/hotplug: Prevent overwriting
>>>> of callbacks"), and after the revert in the tree that feeds to the CI,
>>>> we're back on track.
>>>>
>>>> I found [1], was hoping to get feedback from Mika whether that helps
>>>> before reporting. Chris also suggested [2] as a quick fix but I don't
>>>> know if anyone tried that.
>>>>
>>>
>>> Hi Jani,
>>>
>>> I know you were not CCed in the original thread, please see [5].
>>>
>>> The patchset from Thomas you mention [1] does fix one of the problems
>>> I have seen, please see [6].
>>> With these post-v4.10-rc1 patches applied a clean revert of Revert
>>> "cpu/hotplug: Prevent overwriting of callbacks" is not possible.
>>>
>>> Can you give a clear statement if the quick-fix from Chris is in
>>> combination with the above revert or not?
>>> Against v4.10-rc1?
>>> Tested together with the patchset of Thomas?
>>
>> Please test the Linus' tree from today, it should work.
>>
>
> Latest Linus tree (v4.10-rc1-17-g2d706e790f05) does not fix it.
>

Latest Linus tree 2d706e790f0508dff4fb72eca9b4892b79757feb fixes our S3
problems. It survives gem_exec_suspend --r basic-S3 on kabylake.

It contains the fix to the bisected commit:

commit b9d9d6911bd5c370ad4b3aa57d758c093d17aed5
Author: Thomas Gleixner <t...@linutronix.de>
Date:   Mon Dec 26 22:58:19 2016 +0100

smp/hotplug: Undo tglxs brainfart


-Mika


Re: [Linux v4.10.0-rc1] call-traces after suspend-resume (pm? i915? cpu/hotplug?)

2016-12-29 Thread Mika Kuoppala
Sedat Dilek  writes:

> On Wed, Dec 28, 2016 at 11:32 PM, Rafael J. Wysocki  wrote:
>> On Wed, Dec 28, 2016 at 11:00 AM, Sedat Dilek  wrote:
>>> On Wed, Dec 28, 2016 at 9:29 AM, Jani Nikula  wrote:
>>>> On Wed, 28 Dec 2016, Sedat Dilek  wrote:
>>>>> On Tue, Dec 27, 2016 at 10:13 PM, Pavel Machek  wrote:
>>>>>> Hi!
>>>>>>
>>>>>>> [ Add some pm | i915 | x86 folks ]
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have built Linux v4.10-rc1 today on my Ubuntu/precise AMD64 system
>>>>>>> and I see some call-traces.
>>>>>>> It is reproducible on suspend and resume.
>>>>>>>
>>>>>>> I cannot say which area touches the problem or if these are several
>>>>>>> independent problems.
>>>>>>>
>>>>>>> For a full dmesg-log see attachments (my linux-config is attached, too).
>>>>>>>
>>>>>>> Here some hunks...
>>>>>>>
>>>>>>> [   29.003601] BUG: sleeping function called from invalid context at
>>>>>>> drivers/base/power/runtime.c:1032
>>>>>>> [   29.003608] in_atomic(): 1, irqs_disabled(): 0, pid: 1469, name: Xorg
>>>>>>> [   29.003610] 1 lock held by Xorg/1469:
>>>>>>> [   29.003611]  #0:  (>struct_mutex){+.+.+.}, at:
>>>>>>> [] i915_mutex_lock_interruptible+0x43/0x140 [i915]
>>>>>>> [   29.003653] CPU: 0 PID: 1469 Comm: Xorg Not tainted
>>>>>>> 4.10.0-rc1-1-iniza-small #1
>>>>>>> [   29.003655] Hardware name: SAMSUNG ELECTRONICS CO., LTD.
>>>>>>> 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013
>>>>>>> [   29.003656] Call Trace:
>>>>>>
>>>>>> Just a note, at least 2 machines here refuse to resume with
>>>>>> v4.10-rc1. One has intel graphics, one has AMD. It may or may not have
>>>>>> common cause...
>>>>>>
>>>>>
>>>>> [ Correct linux-pm ML and add Mika & Jani ]
>>>>>
>>>>> Thanks for the feedback.
>>>>>
>>>>> There are some cpu/hotplug fixes post-v4.10-rc1.
>>>>> Give that a try.
>>>>>
>>>>> Yesterday, after answers from drm-intel folks I have seen that a
>>>>> cpu/hotplug commit [1] was reverted in
>>>>> drm-intel.git#drm-intel-nightly.
>>>>> I haven't tried that.
>>>>>
>>>>> It's good when Thomas knows of this and gets in contact with drm-intel 
>>>>> folks.
>>>>>
>>>>> Regards,
>>>>> - Sedat -
>>>>>
>>>>> [1] 
>>>>> https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-nightly=e558f178f5390185b7324ff4b816b52c6ae3a928
>>>>> [2] https://cgit.freedesktop.org/drm-intel/log/?h=drm-intel-nightly
>>>>>
>>>>> P.S.: Revert "cpu/hotplug: Prevent overwriting of callbacks"
>>>>>
>>>>> This reverts commit dc280d93623927570da279e99393879dbbab39e7
>>>>> Author: Thomas Gleixner 
>>>>> Date: Wed Dec 21 20:19:49 2016 +0100
>>>>> cpu/hotplug: Prevent overwriting of callbacks
>>>>>
>>>>> It started hanging all machines in CI s3 test:
>>>>> https://intel-gfx-ci.01.org/CI/igt@gem_exec_susp...@basic-s3.html
>>>>>
>>>>> Bisected-by: Mika Kuoppala 
>>>>> Signed-off-by: Jani Nikula 
>>>>
>>>> Thomas -
>>>>
>>>> Indeed, basically all of the boxes in the intel-gfx CI hang at the
>>>> suspend/resume test with dc280d936239 ("cpu/hotplug: Prevent overwriting
>>>> of callbacks"), and after the revert in the tree that feeds to the CI,
>>>> we're back on track.
>>>>
>>>> I found [1], was hoping to get feedback from Mika whether that helps
>>>> before reporting. Chris also suggested [2] as a quick fix but I don't
>>>> know if anyone tried that.
>>>>
>>>
>>> Hi Jani,
>>>
>>> I know you were not CCed in the original thread, please see [5].
>>>
>>> The patchset from Thomas you mention [1] does fix one of the problems
>>> I have seen, please see [6].
>>> With these post-v4.10-rc1 patches applied a clean revert of Revert
>>> "cpu/hotplug: Prevent overwriting of callbacks" is not possible.
>>>
>>> Can you give a clear statement if the quick-fix from Chris is in
>>> combination with the above revert or not?
>>> Against v4.10-rc1?
>>> Tested together with the patchset of Thomas?
>>
>> Please test the Linus' tree from today, it should work.
>>
>
> Latest Linus tree (v4.10-rc1-17-g2d706e790f05) does not fix it.
>

Latest Linus tree 2d706e790f0508dff4fb72eca9b4892b79757feb fixes our S3
problems. It survives gem_exec_suspend --r basic-S3 on kabylake.

It contains the fix to the bisected commit:

commit b9d9d6911bd5c370ad4b3aa57d758c093d17aed5
Author: Thomas Gleixner 
Date:   Mon Dec 26 22:58:19 2016 +0100

smp/hotplug: Undo tglxs brainfart


-Mika


Re: [PATCH v2] iommu/intel-iommu: fix pasid table size encoding

2016-12-14 Thread Mika Kuoppala
Jacob Pan <jacob.jun@linux.intel.com> writes:

> Different encodings are used to represent supported PASID bits
> and number of PASID table entries.
> The current code assigns ecap_pss directly to extended context
> table entry PTS which is wrong and could result in writing
> non-zero bits to the reserved fields. IOMMU fault reason
> 11 will be reported when reserved bits are nonzero.
> This patch converts ecap_pss to extend context entry pts encoding
> based on VT-d spec. Chapter 9.4 as follows:
>  - number of PASID bits = ecap_pss + 1
>  - number of PASID table entries = 2^(pts + 5)
> Software assigned limit of pasid_max value is also respected to
> match the allocation limitation of PASID table.
>
> cc: Mika Kuoppala <mika.kuopp...@linux.intel.com>
> cc: Ashok Raj <ashok@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun@linux.intel.com>

Tested-by: Mika Kuoppala <mika.kuopp...@intel.com>

> ---
>  drivers/iommu/intel-iommu.c | 23 ++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 27596e6..5d9cddc 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5173,6 +5173,25 @@ static void intel_iommu_remove_device(struct device 
> *dev)
>  }
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> +#define MAX_NR_PASID_BITS (20)
> +static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
> +{
> + /*
> +  * Convert ecap_pss to extend context entry pts encoding, also
> +  * respect the soft pasid_max value set by the iommu.
> +  * - number of PASID bits = ecap_pss + 1
> +  * - number of PASID table entries = 2^(pts + 5)
> +  * Therefore, pts = ecap_pss - 4
> +  * e.g. KBL ecap_pss = 0x13, PASID has 20 bits, pts = 15
> +  */
> + if (ecap_pss(iommu->ecap) < 5)
> + return 0;
> +
> + /* pasid_max is encoded as actual number of entries not the bits */
> + return find_first_bit((unsigned long *)>pasid_max,
> + MAX_NR_PASID_BITS) - 5;
> +}
> +
>  int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev 
> *sdev)
>  {
>   struct device_domain_info *info;
> @@ -5205,7 +5224,9 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, 
> struct intel_svm_dev *sd
>  
>   if (!(ctx_lo & CONTEXT_PASIDE)) {
>   context[1].hi = (u64)virt_to_phys(iommu->pasid_state_table);
> - context[1].lo = (u64)virt_to_phys(iommu->pasid_table) | 
> ecap_pss(iommu->ecap);
> + context[1].lo = (u64)virt_to_phys(iommu->pasid_table) |
> + intel_iommu_get_pts(iommu);
> +
>   wmb();
>   /* CONTEXT_TT_MULTI_LEVEL and CONTEXT_TT_DEV_IOTLB are both
>* extended to permit requests-with-PASID if the PASIDE bit
> -- 
> 2.7.4


Re: [PATCH v2] iommu/intel-iommu: fix pasid table size encoding

2016-12-14 Thread Mika Kuoppala
Jacob Pan  writes:

> Different encodings are used to represent supported PASID bits
> and number of PASID table entries.
> The current code assigns ecap_pss directly to extended context
> table entry PTS which is wrong and could result in writing
> non-zero bits to the reserved fields. IOMMU fault reason
> 11 will be reported when reserved bits are nonzero.
> This patch converts ecap_pss to extend context entry pts encoding
> based on VT-d spec. Chapter 9.4 as follows:
>  - number of PASID bits = ecap_pss + 1
>  - number of PASID table entries = 2^(pts + 5)
> Software assigned limit of pasid_max value is also respected to
> match the allocation limitation of PASID table.
>
> cc: Mika Kuoppala 
> cc: Ashok Raj 
> Signed-off-by: Jacob Pan 

Tested-by: Mika Kuoppala 

> ---
>  drivers/iommu/intel-iommu.c | 23 ++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 27596e6..5d9cddc 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5173,6 +5173,25 @@ static void intel_iommu_remove_device(struct device 
> *dev)
>  }
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> +#define MAX_NR_PASID_BITS (20)
> +static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
> +{
> + /*
> +  * Convert ecap_pss to extend context entry pts encoding, also
> +  * respect the soft pasid_max value set by the iommu.
> +  * - number of PASID bits = ecap_pss + 1
> +  * - number of PASID table entries = 2^(pts + 5)
> +  * Therefore, pts = ecap_pss - 4
> +  * e.g. KBL ecap_pss = 0x13, PASID has 20 bits, pts = 15
> +  */
> + if (ecap_pss(iommu->ecap) < 5)
> + return 0;
> +
> + /* pasid_max is encoded as actual number of entries not the bits */
> + return find_first_bit((unsigned long *)>pasid_max,
> + MAX_NR_PASID_BITS) - 5;
> +}
> +
>  int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev 
> *sdev)
>  {
>   struct device_domain_info *info;
> @@ -5205,7 +5224,9 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, 
> struct intel_svm_dev *sd
>  
>   if (!(ctx_lo & CONTEXT_PASIDE)) {
>   context[1].hi = (u64)virt_to_phys(iommu->pasid_state_table);
> - context[1].lo = (u64)virt_to_phys(iommu->pasid_table) | 
> ecap_pss(iommu->ecap);
> + context[1].lo = (u64)virt_to_phys(iommu->pasid_table) |
> + intel_iommu_get_pts(iommu);
> +
>   wmb();
>   /* CONTEXT_TT_MULTI_LEVEL and CONTEXT_TT_DEV_IOTLB are both
>* extended to permit requests-with-PASID if the PASIDE bit
> -- 
> 2.7.4


[PATCH] drm/i915: Add checks to i915_bind_vma

2015-04-22 Thread Mika Kuoppala
The current aliasing ppgtt implementation allocates
the page table structures on driver initialization
for the entire vm address space. Earlier the page tables
were allocated as array of struct pages, but introduction
of dynamic allocation of page structures changed the page
tables to be inside a page directory.

We have a detailed bug report where traversing of tables and
deferencing page_table[pte]->page oopses. This indicates that
our pre allocation of page tables has failed or that we get
corrupt vma which does not fit inside the vm area and throws
pte > 511.

Add more checks to catch the latter. Warn and bail out if
we get vma which is out of bounds for binding.

v2: Check vma node early (Chris)

Cc: Linus Torvalds 
Cc: Chris Wilson 
Cc: Michel Thierry 
Signed-off-by: Mika Kuoppala 
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c 
b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 0239fbf..2ffa8f6 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2746,6 +2746,13 @@ i915_get_ggtt_vma_pages(struct i915_vma *vma)
 int i915_vma_bind(struct i915_vma *vma, enum i915_cache_level cache_level,
  u32 flags)
 {
+
+   if (WARN_ON(!drm_mm_node_allocated(>node)))
+   return -EINVAL;
+
+   if (WARN_ON(vma->node.start > vma->vm->total - vma->node.size))
+   return -EINVAL;
+
if (i915_is_ggtt(vma->vm)) {
int ret = i915_get_ggtt_vma_pages(vma);
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] drm/i915: Add checks to i915_bind_vma

2015-04-22 Thread Mika Kuoppala
The current aliasing ppgtt implementation allocates
the page table structures on driver initialization
for the entire vm address space. Earlier the page tables
were allocated as array of struct pages, but introduction
of dynamic allocation of page structures changed the page
tables to be inside a page directory.

We have a detailed bug report where traversing of tables and
deferencing page_table[pte]-page oopses. This indicates that
our pre allocation of page tables has failed or that we get
corrupt vma which does not fit inside the vm area and throws
pte  511.

Add more checks to catch the latter. Warn and bail out if
we get vma which is out of bounds for binding.

v2: Check vma node early (Chris)

Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Chris Wilson ch...@chris-wilson.co.uk
Cc: Michel Thierry michel.thie...@intel.com
Signed-off-by: Mika Kuoppala mika.kuopp...@intel.com
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c 
b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 0239fbf..2ffa8f6 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2746,6 +2746,13 @@ i915_get_ggtt_vma_pages(struct i915_vma *vma)
 int i915_vma_bind(struct i915_vma *vma, enum i915_cache_level cache_level,
  u32 flags)
 {
+
+   if (WARN_ON(!drm_mm_node_allocated(vma-node)))
+   return -EINVAL;
+
+   if (WARN_ON(vma-node.start  vma-vm-total - vma-node.size))
+   return -EINVAL;
+
if (i915_is_ggtt(vma-vm)) {
int ret = i915_get_ggtt_vma_pages(vma);
 
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/