Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-24 Thread Dave Young
On 04/24/15 at 04:35pm, Baoquan He wrote:
 On 04/24/15 at 04:25pm, Dave Young wrote:
  Hi, Baoquan
  
   I support this patchset.
   
   We should not fear oldmem since reserved crashkernel region is similar.
   No one can guarantee that any crazy code won't step into crashkernel
   region just because 1st kernel says it's reversed for kdump kernel. Here
   the root table and context tables are also not built to allow legal code
   to danamge. Both of them has the risk to be corrupted, for trying our
   best to get a dumped vmcore the risk is worth being taken.
  
  old mem is mapped in 1st kernel so compare with the reserved crashkernel
  they are more likely to be corrupted. they are totally different. 
 
 Could you tell how and why they are different? Wrong code will choose
 root tables and context tables to danamge when they totally lose
 control?

iommu will map io address to system ram, right? not to reserved ram, but
yes I'm assuming the page table is right, but I was worrying they are corrupted
while kernel panic is happening.

 
  
   
   And the resetting pci way has been NACKed by David Woodhouse, the
   maintainer of intel iommu. Because the place calling the resetting pci
   code is ugly before kdump kernel or in kdump kernel. And as he said a
   certain device made mistakes why we blame on all devices. We should fix
   that device who made mistakes. 
  
  Resetting pci bus is not ugly than fixing a problem with risk and to fix
  the problem it introduced in the future.
 
 There's a problem, we fix the problem. If that's uglier, I need redefine
 the 'ugly' in my personal dict. You mean the problem it could introduce
 is wrong code will damage root table and context tables, why don't we
 fix that wrong code, but blame innocent context tables? So you mean
 these tables should deserve being damaged by wrong code?

I'm more than happy to see this issue can be fixed in the patchset, I do not
agree to add the code there with such problems. OTOH, for now seems there's
no way to fix it.

 
  
  I know it is late to speak out, but sorry I still object and have to NACK 
  this
  oldmem approach from my point.
  
  Thanks
  Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-24 Thread Baoquan He
On 04/15/15 at 02:48pm, Dave Young wrote:
 On 04/15/15 at 01:47pm, Li, ZhenHua wrote:
  On 04/15/2015 08:57 AM, Dave Young wrote:
  Again, I think it is bad to use old page table, below issues need consider:
  1) make sure old page table are reliable across crash
  2) do not allow writing oldmem after crash
  
  Please correct me if I'm wrong, or if above is not doable I think I will 
  vote for
  resetting pci bus.
  
  Thanks
  Dave
  
  Hi Dave,
  
  When updating the context tables, we have to write their address to root
  tables, this will cause writing to old mem.
  
  Resetting the pci bus has been discussed, please check this:
  http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
  https://lkml.org/lkml/2014/10/21/890

I support this patchset.

We should not fear oldmem since reserved crashkernel region is similar.
No one can guarantee that any crazy code won't step into crashkernel
region just because 1st kernel says it's reversed for kdump kernel. Here
the root table and context tables are also not built to allow legal code
to danamge. Both of them has the risk to be corrupted, for trying our
best to get a dumped vmcore the risk is worth being taken.

And the resetting pci way has been NACKed by David Woodhouse, the
maintainer of intel iommu. Because the place calling the resetting pci
code is ugly before kdump kernel or in kdump kernel. And as he said a
certain device made mistakes why we blame on all devices. We should fix
that device who made mistakes. 

As for me, periodically poked by customers to ask how iommu fix is
going, I really think this patchset is good enough. Aren't we going to
do thing just because there's a risk with tiny possibility or not perfect
enough. I think people won't agree. Otherwise kdump could have been
killed when author proposed it since crashkernel reserved region is
risky and could be corrupted by 1st kernel.

Anyway, let's comprimise a little. At worst it can be reverted if it's
not satisfactory.

Personal opinion.

By the way, I tested it and it works well on my HP z420 workstation.

Thanks
Baoquan


 
 I know one reason to use old pgtable is this looks better because it fixes the
 real problem, but it is not a good way if it introduce more problems because 
 of
 it have to use oldmem. I will be glad if this is not a problem but I have not
 been convinced.
 
 OTOH, there's many types of iommu, intel, amd, a lot of other types. They need
 their own fixes, so it looks not that elegant.
 
 For pci reset, it is not perfect, but it has another advantage, the patch is
 simpler. The problem I see from the old discusssion is, reset bus in 2nd 
 kernel
 is acceptable but it does not fix things on sparc platform. AFAIK current 
 reported
 problems are intel and amd iommu, at least pci reset stuff does not make it 
 worse.
 
 Thanks
 Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-24 Thread Baoquan He
On 04/24/15 at 04:25pm, Dave Young wrote:
 Hi, Baoquan
 
  I support this patchset.
  
  We should not fear oldmem since reserved crashkernel region is similar.
  No one can guarantee that any crazy code won't step into crashkernel
  region just because 1st kernel says it's reversed for kdump kernel. Here
  the root table and context tables are also not built to allow legal code
  to danamge. Both of them has the risk to be corrupted, for trying our
  best to get a dumped vmcore the risk is worth being taken.
 
 old mem is mapped in 1st kernel so compare with the reserved crashkernel
 they are more likely to be corrupted. they are totally different. 

Could you tell how and why they are different? Wrong code will choose
root tables and context tables to danamge when they totally lose
control?

 
  
  And the resetting pci way has been NACKed by David Woodhouse, the
  maintainer of intel iommu. Because the place calling the resetting pci
  code is ugly before kdump kernel or in kdump kernel. And as he said a
  certain device made mistakes why we blame on all devices. We should fix
  that device who made mistakes. 
 
 Resetting pci bus is not ugly than fixing a problem with risk and to fix
 the problem it introduced in the future.

There's a problem, we fix the problem. If that's uglier, I need redefine
the 'ugly' in my personal dict. You mean the problem it could introduce
is wrong code will damage root table and context tables, why don't we
fix that wrong code, but blame innocent context tables? So you mean
these tables should deserve being damaged by wrong code?

 
 I know it is late to speak out, but sorry I still object and have to NACK this
 oldmem approach from my point.
 
 Thanks
 Dave
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu