Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-19 Thread Alex Deucher
On Tue, Dec 18, 2018 at 7:08 PM Brad Love  wrote:
>
>
> On 18/12/2018 18.05, Brad Love wrote:
> > On 18/12/2018 17.46, Alex Deucher wrote:
> >> On Tue, Dec 18, 2018 at 7:45 AM Mauro Carvalho Chehab
> >>  wrote:
> >>> Em Mon, 17 Dec 2018 21:05:11 -0500
> >>> Alex Deucher  escreveu:
> >>>
>  On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab
>   wrote:
> > Em Sun, 16 Dec 2018 11:37:02 +0100
> > Markus Dobel  escreveu:
> >
> >> On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:
> >>> Em Thu, 06 Dec 2018 18:18:23 +0100
> >>> Markus Dobel  escreveu:
> >>>
>  Hi everyone,
> 
>  I will try if the hack mentioned fixes the issue for me on the 
>  weekend
>  (but I assume, as if effectively removes the function).
> >>> It should, but it keeps a few changes. Just want to be sure that what
> >>> would be left won't cause issues. If this works, the logic that would
> >>> solve Ryzen DMA fixes will be contained into a single point, making
> >>> easier to maintain it.
> >> Hi,
> >>
> >> I wanted to have this setup running stable for a few days before
> >> replying, that's why I am answering only now.
> >>
> >> But yes, as expected, with Mauro's hack, the driver has been stable for
> >> me for about a week, with several
> >> scheduled recordings in tvheadend, none of them missed.
> >>
> >> So, adding a reliable detection for affected chipsets, where the `if
> >> (1)` currently is, should work.
> > Markus,
> >
> > Thanks for testing!
> >
> > Brad/Alex,
> >
> > I guess we should then stick with this patch:
> > https://patchwork.linuxtv.org/patch/53351/
> >
> > The past approach that we used on cx88, bttv and other old drivers
> > were to patch drivers/pci/quirks.c, making them to "taint" DMA
> > memory controllers that were known to bad affect on media devices,
> > and then some logic at the drivers to check for such "taint".
> >
> > However, that would require to touch another subsystem, with
> > usually cause delays. Also, as Alex pointed, this could well
> > be just a matter of incompatibility between the cx23885 and
> > the Ryzen DMA controller, and may not affect any other drivers.
> >
> > So, let's start with a logic like what I proposed, fine
> > tuning it to the Ryzen DMA controllers with we know have
> > troubles with the driver.
> >
> > We need to list the PCI ID of the memory controllers at the
> > device ID table on that patch, though. At the RFC patch,
> > I just added an IOMMU PCI ID from a randon Ryzen CPU:
> >
> > +static struct {
> > +   int vendor, dev;
> > +} const broken_dev_id[] = {
> > +   /* According with
> > +* 
> > https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
> > +* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
> > +*/
> > +   { PCI_VENDOR_ID_AMD, 0x1451 },
> > +};
> > +
> >
> > Ideally, the ID for the affected Ryzen DMA engines should be there at
> > include/linux/pci_ids.h, instead of hard-coded inside a driver.
> >
> > Also, we should, instead, add there the PCI IDs of the DMA engines
> > that are known to have problems with the cx23885.
>  These aren't really DMA engines.  Isn't this just the pcie bridge on the 
>  CPU?
> >>> Yeah, it is not the DMA engine itself, but the CPU/chipset support for it.
> >>>
> >>> Let me be a little clearer. The Conexant chipsets for PCI/PCIe engines
> >>> have internally a RISC CPU that it is programmed, in runtime, to do
> >>> DMA scatter/gather. The actual DMA engine is there. For it to work, the
> >>> Northbridge (or the CPU chipset - as nowadays several chipsets integrated
> >>> the Northbridge inside an IP block at the CPU) has to do the counter part,
> >>> by allowing the board's DMA engine to access the mainboard's main memory,
> >>> usually via IOMMU, in a safe way[1].
> >>>
> >>> [1] preventing memory corruption if two devices try to do DMA to the
> >>> same area, or if the DMA from the board tries to write at the same
> >>> time the CPU tries to access it.
> >>>
> >>> Media PCI boards usually push the DMA logic to unusual conditions, as
> >>> a large amount of data is transferred, in a synchronous way,
> >>> between the PCIe card and memory.
> >>>
> >>> If the video stream is recorded, the same physical memory DMA mapped area
> >>> where the data is written by the video board could be used on another DMA
> >>> transfer via the HD disk controller.
> >>>
> >>> It is even possible to setup the Conexant's DMA engine to do transfers
> >>> directly to the GPU's internal memory, causing a PCI to PCI DMA transfer,
> >>> using V4L2 API overlay mode.
> >> Is this supported today?  I di

Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-18 Thread Brad Love


On 18/12/2018 18.05, Brad Love wrote:
> On 18/12/2018 17.46, Alex Deucher wrote:
>> On Tue, Dec 18, 2018 at 7:45 AM Mauro Carvalho Chehab
>>  wrote:
>>> Em Mon, 17 Dec 2018 21:05:11 -0500
>>> Alex Deucher  escreveu:
>>>
 On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab
  wrote:
> Em Sun, 16 Dec 2018 11:37:02 +0100
> Markus Dobel  escreveu:
>
>> On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:
>>> Em Thu, 06 Dec 2018 18:18:23 +0100
>>> Markus Dobel  escreveu:
>>>
 Hi everyone,

 I will try if the hack mentioned fixes the issue for me on the weekend
 (but I assume, as if effectively removes the function).
>>> It should, but it keeps a few changes. Just want to be sure that what
>>> would be left won't cause issues. If this works, the logic that would
>>> solve Ryzen DMA fixes will be contained into a single point, making
>>> easier to maintain it.
>> Hi,
>>
>> I wanted to have this setup running stable for a few days before
>> replying, that's why I am answering only now.
>>
>> But yes, as expected, with Mauro's hack, the driver has been stable for
>> me for about a week, with several
>> scheduled recordings in tvheadend, none of them missed.
>>
>> So, adding a reliable detection for affected chipsets, where the `if
>> (1)` currently is, should work.
> Markus,
>
> Thanks for testing!
>
> Brad/Alex,
>
> I guess we should then stick with this patch:
> https://patchwork.linuxtv.org/patch/53351/
>
> The past approach that we used on cx88, bttv and other old drivers
> were to patch drivers/pci/quirks.c, making them to "taint" DMA
> memory controllers that were known to bad affect on media devices,
> and then some logic at the drivers to check for such "taint".
>
> However, that would require to touch another subsystem, with
> usually cause delays. Also, as Alex pointed, this could well
> be just a matter of incompatibility between the cx23885 and
> the Ryzen DMA controller, and may not affect any other drivers.
>
> So, let's start with a logic like what I proposed, fine
> tuning it to the Ryzen DMA controllers with we know have
> troubles with the driver.
>
> We need to list the PCI ID of the memory controllers at the
> device ID table on that patch, though. At the RFC patch,
> I just added an IOMMU PCI ID from a randon Ryzen CPU:
>
> +static struct {
> +   int vendor, dev;
> +} const broken_dev_id[] = {
> +   /* According with
> +* 
> https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
> +* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
> +*/
> +   { PCI_VENDOR_ID_AMD, 0x1451 },
> +};
> +
>
> Ideally, the ID for the affected Ryzen DMA engines should be there at
> include/linux/pci_ids.h, instead of hard-coded inside a driver.
>
> Also, we should, instead, add there the PCI IDs of the DMA engines
> that are known to have problems with the cx23885.
 These aren't really DMA engines.  Isn't this just the pcie bridge on the 
 CPU?
>>> Yeah, it is not the DMA engine itself, but the CPU/chipset support for it.
>>>
>>> Let me be a little clearer. The Conexant chipsets for PCI/PCIe engines
>>> have internally a RISC CPU that it is programmed, in runtime, to do
>>> DMA scatter/gather. The actual DMA engine is there. For it to work, the
>>> Northbridge (or the CPU chipset - as nowadays several chipsets integrated
>>> the Northbridge inside an IP block at the CPU) has to do the counter part,
>>> by allowing the board's DMA engine to access the mainboard's main memory,
>>> usually via IOMMU, in a safe way[1].
>>>
>>> [1] preventing memory corruption if two devices try to do DMA to the
>>> same area, or if the DMA from the board tries to write at the same
>>> time the CPU tries to access it.
>>>
>>> Media PCI boards usually push the DMA logic to unusual conditions, as
>>> a large amount of data is transferred, in a synchronous way,
>>> between the PCIe card and memory.
>>>
>>> If the video stream is recorded, the same physical memory DMA mapped area
>>> where the data is written by the video board could be used on another DMA
>>> transfer via the HD disk controller.
>>>
>>> It is even possible to setup the Conexant's DMA engine to do transfers
>>> directly to the GPU's internal memory, causing a PCI to PCI DMA transfer,
>>> using V4L2 API overlay mode.
>> Is this supported today?  I didn't think all of the PCI and DMA API
>> changes had gone upstream yet for PCIe p2p support.  I think you'd
>> need to disable the IOMMU to work correctly.
>>
>>> There was a time where it used to be common to have Intel CPUs (or
>>> Intel-compatible CPUs) using non-Intel North

Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-18 Thread Brad Love


On 18/12/2018 17.46, Alex Deucher wrote:
> On Tue, Dec 18, 2018 at 7:45 AM Mauro Carvalho Chehab
>  wrote:
>> Em Mon, 17 Dec 2018 21:05:11 -0500
>> Alex Deucher  escreveu:
>>
>>> On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab
>>>  wrote:
 Em Sun, 16 Dec 2018 11:37:02 +0100
 Markus Dobel  escreveu:

> On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:
>> Em Thu, 06 Dec 2018 18:18:23 +0100
>> Markus Dobel  escreveu:
>>
>>> Hi everyone,
>>>
>>> I will try if the hack mentioned fixes the issue for me on the weekend
>>> (but I assume, as if effectively removes the function).
>> It should, but it keeps a few changes. Just want to be sure that what
>> would be left won't cause issues. If this works, the logic that would
>> solve Ryzen DMA fixes will be contained into a single point, making
>> easier to maintain it.
> Hi,
>
> I wanted to have this setup running stable for a few days before
> replying, that's why I am answering only now.
>
> But yes, as expected, with Mauro's hack, the driver has been stable for
> me for about a week, with several
> scheduled recordings in tvheadend, none of them missed.
>
> So, adding a reliable detection for affected chipsets, where the `if
> (1)` currently is, should work.
 Markus,

 Thanks for testing!

 Brad/Alex,

 I guess we should then stick with this patch:
 https://patchwork.linuxtv.org/patch/53351/

 The past approach that we used on cx88, bttv and other old drivers
 were to patch drivers/pci/quirks.c, making them to "taint" DMA
 memory controllers that were known to bad affect on media devices,
 and then some logic at the drivers to check for such "taint".

 However, that would require to touch another subsystem, with
 usually cause delays. Also, as Alex pointed, this could well
 be just a matter of incompatibility between the cx23885 and
 the Ryzen DMA controller, and may not affect any other drivers.

 So, let's start with a logic like what I proposed, fine
 tuning it to the Ryzen DMA controllers with we know have
 troubles with the driver.

 We need to list the PCI ID of the memory controllers at the
 device ID table on that patch, though. At the RFC patch,
 I just added an IOMMU PCI ID from a randon Ryzen CPU:

 +static struct {
 +   int vendor, dev;
 +} const broken_dev_id[] = {
 +   /* According with
 +* 
 https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
 +* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
 +*/
 +   { PCI_VENDOR_ID_AMD, 0x1451 },
 +};
 +

 Ideally, the ID for the affected Ryzen DMA engines should be there at
 include/linux/pci_ids.h, instead of hard-coded inside a driver.

 Also, we should, instead, add there the PCI IDs of the DMA engines
 that are known to have problems with the cx23885.
>>> These aren't really DMA engines.  Isn't this just the pcie bridge on the 
>>> CPU?
>> Yeah, it is not the DMA engine itself, but the CPU/chipset support for it.
>>
>> Let me be a little clearer. The Conexant chipsets for PCI/PCIe engines
>> have internally a RISC CPU that it is programmed, in runtime, to do
>> DMA scatter/gather. The actual DMA engine is there. For it to work, the
>> Northbridge (or the CPU chipset - as nowadays several chipsets integrated
>> the Northbridge inside an IP block at the CPU) has to do the counter part,
>> by allowing the board's DMA engine to access the mainboard's main memory,
>> usually via IOMMU, in a safe way[1].
>>
>> [1] preventing memory corruption if two devices try to do DMA to the
>> same area, or if the DMA from the board tries to write at the same
>> time the CPU tries to access it.
>>
>> Media PCI boards usually push the DMA logic to unusual conditions, as
>> a large amount of data is transferred, in a synchronous way,
>> between the PCIe card and memory.
>>
>> If the video stream is recorded, the same physical memory DMA mapped area
>> where the data is written by the video board could be used on another DMA
>> transfer via the HD disk controller.
>>
>> It is even possible to setup the Conexant's DMA engine to do transfers
>> directly to the GPU's internal memory, causing a PCI to PCI DMA transfer,
>> using V4L2 API overlay mode.
> Is this supported today?  I didn't think all of the PCI and DMA API
> changes had gone upstream yet for PCIe p2p support.  I think you'd
> need to disable the IOMMU to work correctly.
>
>> There was a time where it used to be common to have Intel CPUs (or
>> Intel-compatible CPUs) using non-Intel North Bridges. On such time,
>> we've seen a lot of troubles with PCI to PCI transfers most of them
>> when using non-Intel north bridges.
> Well p2p was p

Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-18 Thread Alex Deucher
On Tue, Dec 18, 2018 at 7:45 AM Mauro Carvalho Chehab
 wrote:
>
> Em Mon, 17 Dec 2018 21:05:11 -0500
> Alex Deucher  escreveu:
>
> > On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab
> >  wrote:
> > >
> > > Em Sun, 16 Dec 2018 11:37:02 +0100
> > > Markus Dobel  escreveu:
> > >
> > > > On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:
> > > > > Em Thu, 06 Dec 2018 18:18:23 +0100
> > > > > Markus Dobel  escreveu:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I will try if the hack mentioned fixes the issue for me on the 
> > > > >> weekend
> > > > >> (but I assume, as if effectively removes the function).
> > > > >
> > > > > It should, but it keeps a few changes. Just want to be sure that what
> > > > > would be left won't cause issues. If this works, the logic that would
> > > > > solve Ryzen DMA fixes will be contained into a single point, making
> > > > > easier to maintain it.
> > > >
> > > > Hi,
> > > >
> > > > I wanted to have this setup running stable for a few days before
> > > > replying, that's why I am answering only now.
> > > >
> > > > But yes, as expected, with Mauro's hack, the driver has been stable for
> > > > me for about a week, with several
> > > > scheduled recordings in tvheadend, none of them missed.
> > > >
> > > > So, adding a reliable detection for affected chipsets, where the `if
> > > > (1)` currently is, should work.
> > >
> > > Markus,
> > >
> > > Thanks for testing!
> > >
> > > Brad/Alex,
> > >
> > > I guess we should then stick with this patch:
> > > https://patchwork.linuxtv.org/patch/53351/
> > >
> > > The past approach that we used on cx88, bttv and other old drivers
> > > were to patch drivers/pci/quirks.c, making them to "taint" DMA
> > > memory controllers that were known to bad affect on media devices,
> > > and then some logic at the drivers to check for such "taint".
> > >
> > > However, that would require to touch another subsystem, with
> > > usually cause delays. Also, as Alex pointed, this could well
> > > be just a matter of incompatibility between the cx23885 and
> > > the Ryzen DMA controller, and may not affect any other drivers.
> > >
> > > So, let's start with a logic like what I proposed, fine
> > > tuning it to the Ryzen DMA controllers with we know have
> > > troubles with the driver.
> > >
> > > We need to list the PCI ID of the memory controllers at the
> > > device ID table on that patch, though. At the RFC patch,
> > > I just added an IOMMU PCI ID from a randon Ryzen CPU:
> > >
> > > +static struct {
> > > +   int vendor, dev;
> > > +} const broken_dev_id[] = {
> > > +   /* According with
> > > +* 
> > > https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
> > > +* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
> > > +*/
> > > +   { PCI_VENDOR_ID_AMD, 0x1451 },
> > > +};
> > > +
> > >
> > > Ideally, the ID for the affected Ryzen DMA engines should be there at
> > > include/linux/pci_ids.h, instead of hard-coded inside a driver.
> > >
> > > Also, we should, instead, add there the PCI IDs of the DMA engines
> > > that are known to have problems with the cx23885.
> >
> > These aren't really DMA engines.  Isn't this just the pcie bridge on the 
> > CPU?
>
> Yeah, it is not the DMA engine itself, but the CPU/chipset support for it.
>
> Let me be a little clearer. The Conexant chipsets for PCI/PCIe engines
> have internally a RISC CPU that it is programmed, in runtime, to do
> DMA scatter/gather. The actual DMA engine is there. For it to work, the
> Northbridge (or the CPU chipset - as nowadays several chipsets integrated
> the Northbridge inside an IP block at the CPU) has to do the counter part,
> by allowing the board's DMA engine to access the mainboard's main memory,
> usually via IOMMU, in a safe way[1].
>
> [1] preventing memory corruption if two devices try to do DMA to the
> same area, or if the DMA from the board tries to write at the same
> time the CPU tries to access it.
>
> Media PCI boards usually push the DMA logic to unusual conditions, as
> a large amount of data is transferred, in a synchronous way,
> between the PCIe card and memory.
>
> If the video stream is recorded, the same physical memory DMA mapped area
> where the data is written by the video board could be used on another DMA
> transfer via the HD disk controller.
>
> It is even possible to setup the Conexant's DMA engine to do transfers
> directly to the GPU's internal memory, causing a PCI to PCI DMA transfer,
> using V4L2 API overlay mode.

Is this supported today?  I didn't think all of the PCI and DMA API
changes had gone upstream yet for PCIe p2p support.  I think you'd
need to disable the IOMMU to work correctly.

>
> There was a time where it used to be common to have Intel CPUs (or
> Intel-compatible CPUs) using non-Intel North Bridges. On such time,
> we've seen a lot of troubles with PCI to PCI transf

Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-18 Thread Brad Love
Hi everyone,


On 18/12/2018 06.45, Mauro Carvalho Chehab wrote:
> Em Mon, 17 Dec 2018 21:05:11 -0500
> Alex Deucher  escreveu:
>
>> On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab
>>  wrote:
>>> Em Sun, 16 Dec 2018 11:37:02 +0100
>>> Markus Dobel  escreveu:
>>>  
 On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:  
> Em Thu, 06 Dec 2018 18:18:23 +0100
> Markus Dobel  escreveu:
>  
>> Hi everyone,
>>
>> I will try if the hack mentioned fixes the issue for me on the weekend
>> (but I assume, as if effectively removes the function).  
> It should, but it keeps a few changes. Just want to be sure that what
> would be left won't cause issues. If this works, the logic that would
> solve Ryzen DMA fixes will be contained into a single point, making
> easier to maintain it.  
 Hi,

 I wanted to have this setup running stable for a few days before
 replying, that's why I am answering only now.

 But yes, as expected, with Mauro's hack, the driver has been stable for
 me for about a week, with several
 scheduled recordings in tvheadend, none of them missed.

 So, adding a reliable detection for affected chipsets, where the `if
 (1)` currently is, should work.  
>>> Markus,
>>>
>>> Thanks for testing!
>>>
>>> Brad/Alex,
>>>
>>> I guess we should then stick with this patch:
>>> https://patchwork.linuxtv.org/patch/53351/
>>>
>>> The past approach that we used on cx88, bttv and other old drivers
>>> were to patch drivers/pci/quirks.c, making them to "taint" DMA
>>> memory controllers that were known to bad affect on media devices,
>>> and then some logic at the drivers to check for such "taint".
>>>
>>> However, that would require to touch another subsystem, with
>>> usually cause delays. Also, as Alex pointed, this could well
>>> be just a matter of incompatibility between the cx23885 and
>>> the Ryzen DMA controller, and may not affect any other drivers.
>>>
>>> So, let's start with a logic like what I proposed, fine
>>> tuning it to the Ryzen DMA controllers with we know have
>>> troubles with the driver.
>>>
>>> We need to list the PCI ID of the memory controllers at the
>>> device ID table on that patch, though. At the RFC patch,
>>> I just added an IOMMU PCI ID from a randon Ryzen CPU:
>>>
>>> +static struct {
>>> +   int vendor, dev;
>>> +} const broken_dev_id[] = {
>>> +   /* According with
>>> +* 
>>> https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
>>> +* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
>>> +*/
>>> +   { PCI_VENDOR_ID_AMD, 0x1451 },
>>> +};
>>> +
>>>
>>> Ideally, the ID for the affected Ryzen DMA engines should be there at
>>> include/linux/pci_ids.h, instead of hard-coded inside a driver.
>>>
>>> Also, we should, instead, add there the PCI IDs of the DMA engines
>>> that are known to have problems with the cx23885.  
>> These aren't really DMA engines.  Isn't this just the pcie bridge on the CPU?
> Yeah, it is not the DMA engine itself, but the CPU/chipset support for it.
>
> Let me be a little clearer. The Conexant chipsets for PCI/PCIe engines 
> have internally a RISC CPU that it is programmed, in runtime, to do
> DMA scatter/gather. The actual DMA engine is there. For it to work, the
> Northbridge (or the CPU chipset - as nowadays several chipsets integrated
> the Northbridge inside an IP block at the CPU) has to do the counter part,
> by allowing the board's DMA engine to access the mainboard's main memory,
> usually via IOMMU, in a safe way[1].
>
> [1] preventing memory corruption if two devices try to do DMA to the
> same area, or if the DMA from the board tries to write at the same
> time the CPU tries to access it.
>
> Media PCI boards usually push the DMA logic to unusual conditions, as
> a large amount of data is transferred, in a synchronous way,
> between the PCIe card and memory.
>
> If the video stream is recorded, the same physical memory DMA mapped area
> where the data is written by the video board could be used on another DMA
> transfer via the HD disk controller.
>
> It is even possible to setup the Conexant's DMA engine to do transfers 
> directly to the GPU's internal memory, causing a PCI to PCI DMA transfer,
> using V4L2 API overlay mode.
>
> There was a time where it used to be common to have Intel CPUs (or
> Intel-compatible CPUs) using non-Intel North Bridges. On such time,
> we've seen a lot of troubles with PCI to PCI transfers most of them
> when using non-Intel north bridges. 
>
> With some north bridges, having the same block of memory mapped
> for two DMA operations (where memory writes come from the video
> card and memory reads from the HD disk controller) was also
> problematic, as the IOMMU had issues on managing two kinds of
> transfer for the same physical memory block.
>
> The report we have on the 95

Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-18 Thread Mauro Carvalho Chehab
Em Mon, 17 Dec 2018 21:05:11 -0500
Alex Deucher  escreveu:

> On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab
>  wrote:
> >
> > Em Sun, 16 Dec 2018 11:37:02 +0100
> > Markus Dobel  escreveu:
> >  
> > > On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:  
> > > > Em Thu, 06 Dec 2018 18:18:23 +0100
> > > > Markus Dobel  escreveu:
> > > >  
> > > >> Hi everyone,
> > > >>
> > > >> I will try if the hack mentioned fixes the issue for me on the weekend
> > > >> (but I assume, as if effectively removes the function).  
> > > >
> > > > It should, but it keeps a few changes. Just want to be sure that what
> > > > would be left won't cause issues. If this works, the logic that would
> > > > solve Ryzen DMA fixes will be contained into a single point, making
> > > > easier to maintain it.  
> > >
> > > Hi,
> > >
> > > I wanted to have this setup running stable for a few days before
> > > replying, that's why I am answering only now.
> > >
> > > But yes, as expected, with Mauro's hack, the driver has been stable for
> > > me for about a week, with several
> > > scheduled recordings in tvheadend, none of them missed.
> > >
> > > So, adding a reliable detection for affected chipsets, where the `if
> > > (1)` currently is, should work.  
> >
> > Markus,
> >
> > Thanks for testing!
> >
> > Brad/Alex,
> >
> > I guess we should then stick with this patch:
> > https://patchwork.linuxtv.org/patch/53351/
> >
> > The past approach that we used on cx88, bttv and other old drivers
> > were to patch drivers/pci/quirks.c, making them to "taint" DMA
> > memory controllers that were known to bad affect on media devices,
> > and then some logic at the drivers to check for such "taint".
> >
> > However, that would require to touch another subsystem, with
> > usually cause delays. Also, as Alex pointed, this could well
> > be just a matter of incompatibility between the cx23885 and
> > the Ryzen DMA controller, and may not affect any other drivers.
> >
> > So, let's start with a logic like what I proposed, fine
> > tuning it to the Ryzen DMA controllers with we know have
> > troubles with the driver.
> >
> > We need to list the PCI ID of the memory controllers at the
> > device ID table on that patch, though. At the RFC patch,
> > I just added an IOMMU PCI ID from a randon Ryzen CPU:
> >
> > +static struct {
> > +   int vendor, dev;
> > +} const broken_dev_id[] = {
> > +   /* According with
> > +* 
> > https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
> > +* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
> > +*/
> > +   { PCI_VENDOR_ID_AMD, 0x1451 },
> > +};
> > +
> >
> > Ideally, the ID for the affected Ryzen DMA engines should be there at
> > include/linux/pci_ids.h, instead of hard-coded inside a driver.
> >
> > Also, we should, instead, add there the PCI IDs of the DMA engines
> > that are known to have problems with the cx23885.  
> 
> These aren't really DMA engines.  Isn't this just the pcie bridge on the CPU?

Yeah, it is not the DMA engine itself, but the CPU/chipset support for it.

Let me be a little clearer. The Conexant chipsets for PCI/PCIe engines 
have internally a RISC CPU that it is programmed, in runtime, to do
DMA scatter/gather. The actual DMA engine is there. For it to work, the
Northbridge (or the CPU chipset - as nowadays several chipsets integrated
the Northbridge inside an IP block at the CPU) has to do the counter part,
by allowing the board's DMA engine to access the mainboard's main memory,
usually via IOMMU, in a safe way[1].

[1] preventing memory corruption if two devices try to do DMA to the
same area, or if the DMA from the board tries to write at the same
time the CPU tries to access it.

Media PCI boards usually push the DMA logic to unusual conditions, as
a large amount of data is transferred, in a synchronous way,
between the PCIe card and memory.

If the video stream is recorded, the same physical memory DMA mapped area
where the data is written by the video board could be used on another DMA
transfer via the HD disk controller.

It is even possible to setup the Conexant's DMA engine to do transfers 
directly to the GPU's internal memory, causing a PCI to PCI DMA transfer,
using V4L2 API overlay mode.

There was a time where it used to be common to have Intel CPUs (or
Intel-compatible CPUs) using non-Intel North Bridges. On such time,
we've seen a lot of troubles with PCI to PCI transfers most of them
when using non-Intel north bridges. 

With some north bridges, having the same block of memory mapped
for two DMA operations (where memory writes come from the video
card and memory reads from the HD disk controller) was also
problematic, as the IOMMU had issues on managing two kinds of
transfer for the same physical memory block.

The report we have on the 95f408bb commit is:

   "media: cx23885: Ryzen DMA related RiSC engine stall fixes

Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-17 Thread Markus Dobel



Am 18. Dezember 2018 03:05:11 MEZ schrieb Alex Deucher :

>possibly?  It's still not clear to me that this is specific to ryzen
>chips rather than a problem with the DMA setup on the cx board.  Is
>there a downside to enabling the workaround in general?  

Hi Alex,

yes, there is. At least for me, the resetting function breaks the driver, 
making the card unresponsive after a few hours of uptime. Without that 
function, the card is perfectly stable.

Markus

-- 
Gesendet mit zwei Streichhölzern, einem Gummiband und etwas Draht.


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-17 Thread Alex Deucher
On Sun, Dec 16, 2018 at 9:23 AM Mauro Carvalho Chehab
 wrote:
>
> Em Sun, 16 Dec 2018 11:37:02 +0100
> Markus Dobel  escreveu:
>
> > On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:
> > > Em Thu, 06 Dec 2018 18:18:23 +0100
> > > Markus Dobel  escreveu:
> > >
> > >> Hi everyone,
> > >>
> > >> I will try if the hack mentioned fixes the issue for me on the weekend
> > >> (but I assume, as if effectively removes the function).
> > >
> > > It should, but it keeps a few changes. Just want to be sure that what
> > > would be left won't cause issues. If this works, the logic that would
> > > solve Ryzen DMA fixes will be contained into a single point, making
> > > easier to maintain it.
> >
> > Hi,
> >
> > I wanted to have this setup running stable for a few days before
> > replying, that's why I am answering only now.
> >
> > But yes, as expected, with Mauro's hack, the driver has been stable for
> > me for about a week, with several
> > scheduled recordings in tvheadend, none of them missed.
> >
> > So, adding a reliable detection for affected chipsets, where the `if
> > (1)` currently is, should work.
>
> Markus,
>
> Thanks for testing!
>
> Brad/Alex,
>
> I guess we should then stick with this patch:
> https://patchwork.linuxtv.org/patch/53351/
>
> The past approach that we used on cx88, bttv and other old drivers
> were to patch drivers/pci/quirks.c, making them to "taint" DMA
> memory controllers that were known to bad affect on media devices,
> and then some logic at the drivers to check for such "taint".
>
> However, that would require to touch another subsystem, with
> usually cause delays. Also, as Alex pointed, this could well
> be just a matter of incompatibility between the cx23885 and
> the Ryzen DMA controller, and may not affect any other drivers.
>
> So, let's start with a logic like what I proposed, fine
> tuning it to the Ryzen DMA controllers with we know have
> troubles with the driver.
>
> We need to list the PCI ID of the memory controllers at the
> device ID table on that patch, though. At the RFC patch,
> I just added an IOMMU PCI ID from a randon Ryzen CPU:
>
> +static struct {
> +   int vendor, dev;
> +} const broken_dev_id[] = {
> +   /* According with
> +* 
> https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
> +* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
> +*/
> +   { PCI_VENDOR_ID_AMD, 0x1451 },
> +};
> +
>
> Ideally, the ID for the affected Ryzen DMA engines should be there at
> include/linux/pci_ids.h, instead of hard-coded inside a driver.
>
> Also, we should, instead, add there the PCI IDs of the DMA engines
> that are known to have problems with the cx23885.

These aren't really DMA engines.  Isn't this just the pcie bridge on the CPU?


>
> There one thing that still bothers me: could this problem be due to
> some BIOS setup [1]? If so, are there any ways for dynamically
> disabling such features inside the driver?
>
> [1] like this: https://www.techarp.com/bios-guide/cpu-pci-write-buffer/
>

possibly?  It's still not clear to me that this is specific to ryzen
chips rather than a problem with the DMA setup on the cx board.  Is
there a downside to enabling the workaround in general?  The original
commit mentioned that xeon platforms were affected as well.  Is it
possible it's just particular platforms with wonky bioses?  Maybe DMI
matching would be better?

> Brad,
>
> From your reports about the DMA issues, do you know what generations
> of the Ryzen are affected?
>
> Alex,
>
> Do you know if are there any differences at the IP block for the
> DMA engine used on different Ryzen CPUs? I mean: I suspect that
> the engine for Ryzen 2nd generation would likely be different than
> the one at the 1st generation, but, along the same generation, does
> the Ryzen 3, 5, 7 and Threadripper use the same DMA engine?

+ Suravee.  I'm not really familiar with the changes, if any, that are
in the pcie bridges on various AMD CPUs.  Or if there are changes, it
would be hard to say whether this issue would affect them or not.

Alex


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-16 Thread Mauro Carvalho Chehab
Em Sun, 16 Dec 2018 11:37:02 +0100
Markus Dobel  escreveu:

> On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:
> > Em Thu, 06 Dec 2018 18:18:23 +0100
> > Markus Dobel  escreveu:
> >   
> >> Hi everyone,
> >> 
> >> I will try if the hack mentioned fixes the issue for me on the weekend 
> >> (but I assume, as if effectively removes the function).  
> > 
> > It should, but it keeps a few changes. Just want to be sure that what
> > would be left won't cause issues. If this works, the logic that would
> > solve Ryzen DMA fixes will be contained into a single point, making
> > easier to maintain it.  
> 
> Hi,
> 
> I wanted to have this setup running stable for a few days before 
> replying, that's why I am answering only now.
> 
> But yes, as expected, with Mauro's hack, the driver has been stable for 
> me for about a week, with several
> scheduled recordings in tvheadend, none of them missed.
> 
> So, adding a reliable detection for affected chipsets, where the `if 
> (1)` currently is, should work.

Markus,

Thanks for testing!

Brad/Alex,

I guess we should then stick with this patch:
https://patchwork.linuxtv.org/patch/53351/

The past approach that we used on cx88, bttv and other old drivers
were to patch drivers/pci/quirks.c, making them to "taint" DMA
memory controllers that were known to bad affect on media devices,
and then some logic at the drivers to check for such "taint".

However, that would require to touch another subsystem, with
usually cause delays. Also, as Alex pointed, this could well
be just a matter of incompatibility between the cx23885 and
the Ryzen DMA controller, and may not affect any other drivers.

So, let's start with a logic like what I proposed, fine
tuning it to the Ryzen DMA controllers with we know have
troubles with the driver. 

We need to list the PCI ID of the memory controllers at the
device ID table on that patch, though. At the RFC patch,
I just added an IOMMU PCI ID from a randon Ryzen CPU:

+static struct {
+   int vendor, dev;
+} const broken_dev_id[] = {
+   /* According with
+* 
https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,
+* 0x1451 is PCI ID for the IOMMU found on Ryzen 7
+*/
+   { PCI_VENDOR_ID_AMD, 0x1451 },
+};
+

Ideally, the ID for the affected Ryzen DMA engines should be there at
include/linux/pci_ids.h, instead of hard-coded inside a driver.

Also, we should, instead, add there the PCI IDs of the DMA engines
that are known to have problems with the cx23885.

There one thing that still bothers me: could this problem be due to
some BIOS setup [1]? If so, are there any ways for dynamically
disabling such features inside the driver?

[1] like this: https://www.techarp.com/bios-guide/cpu-pci-write-buffer/

Brad,

From your reports about the DMA issues, do you know what generations
of the Ryzen are affected? 

Alex,

Do you know if are there any differences at the IP block for the
DMA engine used on different Ryzen CPUs? I mean: I suspect that
the engine for Ryzen 2nd generation would likely be different than 
the one at the 1st generation, but, along the same generation, does
the Ryzen 3, 5, 7 and Threadripper use the same DMA engine?

Thanks,
Mauro


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-16 Thread Markus Dobel

On 06.12.2018 19:01, Mauro Carvalho Chehab wrote:

Em Thu, 06 Dec 2018 18:18:23 +0100
Markus Dobel  escreveu:


Hi everyone,

I will try if the hack mentioned fixes the issue for me on the weekend 
(but I assume, as if effectively removes the function).


It should, but it keeps a few changes. Just want to be sure that what
would be left won't cause issues. If this works, the logic that would
solve Ryzen DMA fixes will be contained into a single point, making
easier to maintain it.


Hi,

I wanted to have this setup running stable for a few days before 
replying, that's why I am answering only now.


But yes, as expected, with Mauro's hack, the driver has been stable for 
me for about a week, with several

scheduled recordings in tvheadend, none of them missed.

So, adding a reliable detection for affected chipsets, where the `if 
(1)` currently is, should work.


Regards,
Markus


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-06 Thread Mauro Carvalho Chehab
Em Thu, 6 Dec 2018 17:07:52 -0200
Mauro Carvalho Chehab  escreveu:

> Em Thu, 6 Dec 2018 13:36:24 -0500
> Alex Deucher  escreveu:
> 
> > On Thu, Dec 6, 2018 at 1:05 PM Mauro Carvalho Chehab  
> > wrote:  
> > >
> > > Em Thu, 06 Dec 2018 18:18:23 +0100
> > > Markus Dobel  escreveu:
> > >
> > > > Hi everyone,
> > > >
> > > > I will try if the hack mentioned fixes the issue for me on the weekend 
> > > > (but I assume, as if effectively removes the function).
> > >
> > > It should, but it keeps a few changes. Just want to be sure that what
> > > would be left won't cause issues. If this works, the logic that would
> > > solve Ryzen DMA fixes will be contained into a single point, making
> > > easier to maintain it.
> > >
> > > >
> > > > Just in case this is of interest, I neither have Ryzen nor Intel, but 
> > > > an HP Microserver G7 with an AMD Turion II Neo  N54L, so the machine is 
> > > > more on the slow side.
> > >
> > > Good to know. It would probably worth to check if this Ryzen
> > > bug occors with all versions of it or with just a subset.
> > > I mean: maybe it is only at the first gen or Ryzen and doesn't
> > > affect Ryzen 2 (or vice versa).
> > 
> > The original commit also mentions some Xeons are affected too.  Seems
> > like this is potentially an issue on the device side rather than the
> > platform.  
> 
> Maybe.
> 
> > >
> > > The PCI quirks logic will likely need to detect the PCI ID of
> > > the memory controllers found at the buggy CPUs, in order to enable
> > > the quirk only for the affected ones.
> > >
> > > It could be worth talking with AMD people in order to be sure about
> > > the differences at the DMA engine side.
> > >
> > 
> > It's not clear to me what the pci or platform quirk would do.  The
> > workaround seems to be in the driver, not on the platform.  
> 
> Yeah, the fix should be at the driver, but pci/quirk.c would be able
> to detect memory controllers that would require a hack inside the
> driver, in a similar way to what the media PCI drivers already do
> for DMA controllers that don't support pci2pci transfers.
> 
> There, basically the PCI core (drivers/pci/pci.c and 
> drivers/pci/quirks.c) sets a flag (pci_pci_problems) indicating
> potential issues.
> 
> Then, the driver compares such flag in order to enable the specific quirk.
> 
> Ok, there would be a different way to handle it. The driver could use a 
> logic similar to the one I wrote for drivers/edac/i7core_edac.c. There,
> the logic seeks for some specific PCI device IDs using pci_get_device(),
> adjusting the code accordingly, depending on the detected PCI devices.
> 
> In other words, running something like this (untested), at probe time should
> produce similar results:
> 
>   /*
>* FIXME: It probably makes sense to also be able to identify specific
>* versions of the same PCI ID, just in case a latter stepping got a
>* fix for the issue.
>*/
>   const static struct {
>   int vendor, dev;
>   } broken_dev_id[] = {
>   PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_foo,
>   PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_bar,
>   },
> 
>   bool cx23885_does_dma_require_reset(void) 
>   {
>   int i;
>   struct pci_dev *pdev = NULL;
> 
>   for (i = 0; i < sizeof(broken_dev_id); i++) {
>   pdev = pci_get_device(broken_dev_id[i].vendor, 
> broken_dev_id[i].dev, NULL);
>   if (pdev) {
>   pci_put_device(pdev);
>   return true;
>   }
>   }
>   return false;
>   }
> 
> Should work. In any case, we need to know what memory controllers 
> have problems, and what are their PCI IDs, and add them (if not there yet)
> at include/linux/pci_ids.h
> 
> Thanks,
> Mauro

To be clearer, I'm thinking on something like the (untested)
code below (untested).

PS.: the PCI ID used there may be wrong. I just added one in
order to have a proof of concept.

Thanks,
Mauro

[PATCH] media: cx23885: only reset DMA on problematic CPUs

It is reported that changeset 95f408bbc4e4 ("media: cx23885:
Ryzen DMA related RiSC engine stall fixes") caused regresssions
with other CPUs.

Ensure that the quirk will be applied only for the CPUs that
are known to cause problems.

Fixes: 95f408bbc4e4 ("media: cx23885: Ryzen DMA related RiSC engine stall 
fixes")
Signed-off-by: Mauro Carvalho Chehab 

diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
b/drivers/media/pci/cx23885/cx23885-core.c
index 39804d830305..48da7d194cc1 100644
--- a/drivers/media/pci/cx23885/cx23885-core.c
+++ b/drivers/media/pci/cx23885/cx23885-core.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -603,8 +604,13 @@ static void cx23885_risc_disasm(struct cx23885_tsport 
*port,
 
 static void cx23885_clear_bridge_error(struct cx23885_dev *dev)
 {
-   uint32_t r

Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-06 Thread Mauro Carvalho Chehab
Em Thu, 6 Dec 2018 13:36:24 -0500
Alex Deucher  escreveu:

> On Thu, Dec 6, 2018 at 1:05 PM Mauro Carvalho Chehab  
> wrote:
> >
> > Em Thu, 06 Dec 2018 18:18:23 +0100
> > Markus Dobel  escreveu:
> >  
> > > Hi everyone,
> > >
> > > I will try if the hack mentioned fixes the issue for me on the weekend 
> > > (but I assume, as if effectively removes the function).  
> >
> > It should, but it keeps a few changes. Just want to be sure that what
> > would be left won't cause issues. If this works, the logic that would
> > solve Ryzen DMA fixes will be contained into a single point, making
> > easier to maintain it.
> >  
> > >
> > > Just in case this is of interest, I neither have Ryzen nor Intel, but an 
> > > HP Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more 
> > > on the slow side.  
> >
> > Good to know. It would probably worth to check if this Ryzen
> > bug occors with all versions of it or with just a subset.
> > I mean: maybe it is only at the first gen or Ryzen and doesn't
> > affect Ryzen 2 (or vice versa).  
> 
> The original commit also mentions some Xeons are affected too.  Seems
> like this is potentially an issue on the device side rather than the
> platform.

Maybe.

> >
> > The PCI quirks logic will likely need to detect the PCI ID of
> > the memory controllers found at the buggy CPUs, in order to enable
> > the quirk only for the affected ones.
> >
> > It could be worth talking with AMD people in order to be sure about
> > the differences at the DMA engine side.
> >  
> 
> It's not clear to me what the pci or platform quirk would do.  The
> workaround seems to be in the driver, not on the platform.

Yeah, the fix should be at the driver, but pci/quirk.c would be able
to detect memory controllers that would require a hack inside the
driver, in a similar way to what the media PCI drivers already do
for DMA controllers that don't support pci2pci transfers.

There, basically the PCI core (drivers/pci/pci.c and 
drivers/pci/quirks.c) sets a flag (pci_pci_problems) indicating
potential issues.

Then, the driver compares such flag in order to enable the specific quirk.

Ok, there would be a different way to handle it. The driver could use a 
logic similar to the one I wrote for drivers/edac/i7core_edac.c. There,
the logic seeks for some specific PCI device IDs using pci_get_device(),
adjusting the code accordingly, depending on the detected PCI devices.

In other words, running something like this (untested), at probe time should
produce similar results:

/*
 * FIXME: It probably makes sense to also be able to identify specific
 * versions of the same PCI ID, just in case a latter stepping got a
 * fix for the issue.
 */
const static struct {
int vendor, dev;
} broken_dev_id[] = {
PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_foo,
PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_bar,
},

bool cx23885_does_dma_require_reset(void) 
{
int i;
struct pci_dev *pdev = NULL;

for (i = 0; i < sizeof(broken_dev_id); i++) {
pdev = pci_get_device(broken_dev_id[i].vendor, 
broken_dev_id[i].dev, NULL);
if (pdev) {
pci_put_device(pdev);
return true;
}
}
return false;
}

Should work. In any case, we need to know what memory controllers 
have problems, and what are their PCI IDs, and add them (if not there yet)
at include/linux/pci_ids.h

Thanks,
Mauro


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-06 Thread Alex Deucher
On Thu, Dec 6, 2018 at 1:05 PM Mauro Carvalho Chehab  wrote:
>
> Em Thu, 06 Dec 2018 18:18:23 +0100
> Markus Dobel  escreveu:
>
> > Hi everyone,
> >
> > I will try if the hack mentioned fixes the issue for me on the weekend (but 
> > I assume, as if effectively removes the function).
>
> It should, but it keeps a few changes. Just want to be sure that what
> would be left won't cause issues. If this works, the logic that would
> solve Ryzen DMA fixes will be contained into a single point, making
> easier to maintain it.
>
> >
> > Just in case this is of interest, I neither have Ryzen nor Intel, but an HP 
> > Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more on 
> > the slow side.
>
> Good to know. It would probably worth to check if this Ryzen
> bug occors with all versions of it or with just a subset.
> I mean: maybe it is only at the first gen or Ryzen and doesn't
> affect Ryzen 2 (or vice versa).

The original commit also mentions some Xeons are affected too.  Seems
like this is potentially an issue on the device side rather than the
platform.

>
> The PCI quirks logic will likely need to detect the PCI ID of
> the memory controllers found at the buggy CPUs, in order to enable
> the quirk only for the affected ones.
>
> It could be worth talking with AMD people in order to be sure about
> the differences at the DMA engine side.
>

It's not clear to me what the pci or platform quirk would do.  The
workaround seems to be in the driver, not on the platform.

Alex


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-06 Thread Mauro Carvalho Chehab
Em Thu, 06 Dec 2018 18:18:23 +0100
Markus Dobel  escreveu:

> Hi everyone,
> 
> I will try if the hack mentioned fixes the issue for me on the weekend (but I 
> assume, as if effectively removes the function).

It should, but it keeps a few changes. Just want to be sure that what
would be left won't cause issues. If this works, the logic that would
solve Ryzen DMA fixes will be contained into a single point, making
easier to maintain it.

> 
> Just in case this is of interest, I neither have Ryzen nor Intel, but an HP 
> Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more on the 
> slow side. 

Good to know. It would probably worth to check if this Ryzen
bug occors with all versions of it or with just a subset.
I mean: maybe it is only at the first gen or Ryzen and doesn't
affect Ryzen 2 (or vice versa).

The PCI quirks logic will likely need to detect the PCI ID of
the memory controllers found at the buggy CPUs, in order to enable
the quirk only for the affected ones.

It could be worth talking with AMD people in order to be sure about
the differences at the DMA engine side.

Thanks,
Mauro


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-06 Thread Markus Dobel
Hi everyone,

I will try if the hack mentioned fixes the issue for me on the weekend (but I 
assume, as if effectively removes the function).

Just in case this is of interest, I neither have Ryzen nor Intel, but an HP 
Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more on the 
slow side. 

Regards, Markus
-- 
Gesendet mit zwei Streichhölzern, einem Gummiband und etwas Draht.


Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-06 Thread Brad Love
Hi Mauro & Markus,


On 05/12/2018 05.07, Mauro Carvalho Chehab wrote:
> Em Sun, 21 Oct 2018 15:45:39 +0200
> Markus Dobel  escreveu:
>
>> The original commit (the one reverted in this patch) introduced a 
>> regression,
>> making a previously flawless adapter unresponsive after running a few 
>> hours
>> to days. Since I never experienced the problems that the original commit 
>> is
>> supposed to fix, I propose to revert the change until a regression-free
>> variant is found.
>>
>> Before submitting this, I've been running a system 24x7 with this revert 
>> for
>> several weeks now, and it's running stable again.
>>
>> It's not a pure revert, as the original commit does not revert cleanly
>> anymore due to other changes, but content-wise it is.
>>
>> Signed-off-by: Markus Dobel 
>> ---
>>   drivers/media/pci/cx23885/cx23885-core.c | 60 
>>   drivers/media/pci/cx23885/cx23885-reg.h  | 14 --
>>   2 files changed, 74 deletions(-)
>>
>> diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
>> b/drivers/media/pci/cx23885/cx23885-core.c
>> index 39804d830305..606f6fc0e68b 100644
>> --- a/drivers/media/pci/cx23885/cx23885-core.c
>> +++ b/drivers/media/pci/cx23885/cx23885-core.c
>> @@ -601,25 +601,6 @@ static void cx23885_risc_disasm(struct 
>> cx23885_tsport *port,
> Patch was mangled by your e-mailer: it broke longer lines, causing
> it to not apply.
>
> Also, before just reverting the entire thing, could you please check
> if the enclosed hack would solve it?
>
> If so, it should be easy to add a quirk at drivers/pci/quirks.c
> in order to detect the Ryzen models with a bad DMA engine that
> require periodic resets, and then make cx23885 to use it.
>
> We did similar tricks before with some broken DMA engines, at
> the time we had overlay support on drivers and AMD controllers
> didn't support PCI2PCI DMA transfers.
>
> Brad,
>
> Could you please address this issue?


I'll try to address this today or tomorrow. Since the original patch was
applied I have not received any complaints from ryzen users, but we have
accumulated a few reports from Intel users with a variety of
motherboards that do now encounter issue. Strangely system load affects
the repro; low/no load exhibits the error condition, high system load
everything is fine. I'll do my best to send in a ryzen specific patch by
the weekend.

Regards,

Brad



>
>
> diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
> b/drivers/media/pci/cx23885/cx23885-core.c
> index 39804d830305..8b012bee6b32 100644
> --- a/drivers/media/pci/cx23885/cx23885-core.c
> +++ b/drivers/media/pci/cx23885/cx23885-core.c
> @@ -603,8 +603,14 @@ static void cx23885_risc_disasm(struct cx23885_tsport 
> *port,
>  
>  static void cx23885_clear_bridge_error(struct cx23885_dev *dev)
>  {
> - uint32_t reg1_val = cx_read(TC_REQ); /* read-only */
> - uint32_t reg2_val = cx_read(TC_REQ_SET);
> + uint32_t reg1_val, reg2_val;
> +
> + /* TODO: check for Ryzen quirk */
> + if (1)
> + return;
> +
> + reg1_val = cx_read(TC_REQ); /* read-only */
> + reg2_val = cx_read(TC_REQ_SET);
>  
>   if (reg1_val && reg2_val) {
>   cx_write(TC_REQ, reg1_val);
>
>
>
> Thanks,
> Mauro



Re: [PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-12-05 Thread Mauro Carvalho Chehab
Em Sun, 21 Oct 2018 15:45:39 +0200
Markus Dobel  escreveu:

> The original commit (the one reverted in this patch) introduced a 
> regression,
> making a previously flawless adapter unresponsive after running a few 
> hours
> to days. Since I never experienced the problems that the original commit 
> is
> supposed to fix, I propose to revert the change until a regression-free
> variant is found.
> 
> Before submitting this, I've been running a system 24x7 with this revert 
> for
> several weeks now, and it's running stable again.
> 
> It's not a pure revert, as the original commit does not revert cleanly
> anymore due to other changes, but content-wise it is.
> 
> Signed-off-by: Markus Dobel 
> ---
>   drivers/media/pci/cx23885/cx23885-core.c | 60 
>   drivers/media/pci/cx23885/cx23885-reg.h  | 14 --
>   2 files changed, 74 deletions(-)
> 
> diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
> b/drivers/media/pci/cx23885/cx23885-core.c
> index 39804d830305..606f6fc0e68b 100644
> --- a/drivers/media/pci/cx23885/cx23885-core.c
> +++ b/drivers/media/pci/cx23885/cx23885-core.c
> @@ -601,25 +601,6 @@ static void cx23885_risc_disasm(struct 
> cx23885_tsport *port,

Patch was mangled by your e-mailer: it broke longer lines, causing
it to not apply.

Also, before just reverting the entire thing, could you please check
if the enclosed hack would solve it?

If so, it should be easy to add a quirk at drivers/pci/quirks.c
in order to detect the Ryzen models with a bad DMA engine that
require periodic resets, and then make cx23885 to use it.

We did similar tricks before with some broken DMA engines, at
the time we had overlay support on drivers and AMD controllers
didn't support PCI2PCI DMA transfers.

Brad,

Could you please address this issue?


diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
b/drivers/media/pci/cx23885/cx23885-core.c
index 39804d830305..8b012bee6b32 100644
--- a/drivers/media/pci/cx23885/cx23885-core.c
+++ b/drivers/media/pci/cx23885/cx23885-core.c
@@ -603,8 +603,14 @@ static void cx23885_risc_disasm(struct cx23885_tsport 
*port,
 
 static void cx23885_clear_bridge_error(struct cx23885_dev *dev)
 {
-   uint32_t reg1_val = cx_read(TC_REQ); /* read-only */
-   uint32_t reg2_val = cx_read(TC_REQ_SET);
+   uint32_t reg1_val, reg2_val;
+
+   /* TODO: check for Ryzen quirk */
+   if (1)
+   return;
+
+   reg1_val = cx_read(TC_REQ); /* read-only */
+   reg2_val = cx_read(TC_REQ_SET);
 
if (reg1_val && reg2_val) {
cx_write(TC_REQ, reg1_val);



Thanks,
Mauro


[PATCH] Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

2018-10-21 Thread Markus Dobel
The original commit (the one reverted in this patch) introduced a 
regression,
making a previously flawless adapter unresponsive after running a few 
hours
to days. Since I never experienced the problems that the original commit 
is

supposed to fix, I propose to revert the change until a regression-free
variant is found.

Before submitting this, I've been running a system 24x7 with this revert 
for

several weeks now, and it's running stable again.

It's not a pure revert, as the original commit does not revert cleanly
anymore due to other changes, but content-wise it is.

Signed-off-by: Markus Dobel 
---
 drivers/media/pci/cx23885/cx23885-core.c | 60 
 drivers/media/pci/cx23885/cx23885-reg.h  | 14 --
 2 files changed, 74 deletions(-)

diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
b/drivers/media/pci/cx23885/cx23885-core.c

index 39804d830305..606f6fc0e68b 100644
--- a/drivers/media/pci/cx23885/cx23885-core.c
+++ b/drivers/media/pci/cx23885/cx23885-core.c
@@ -601,25 +601,6 @@ static void cx23885_risc_disasm(struct 
cx23885_tsport *port,

}
 }

-static void cx23885_clear_bridge_error(struct cx23885_dev *dev)
-{
-   uint32_t reg1_val = cx_read(TC_REQ); /* read-only */
-   uint32_t reg2_val = cx_read(TC_REQ_SET);
-
-   if (reg1_val && reg2_val) {
-   cx_write(TC_REQ, reg1_val);
-   cx_write(TC_REQ_SET, reg2_val);
-   cx_read(VID_B_DMA);
-   cx_read(VBI_B_DMA);
-   cx_read(VID_C_DMA);
-   cx_read(VBI_C_DMA);
-
-   dev_info(&dev->pci->dev,
-   "dma in progress detected 0x%08x 0x%08x, clearing\n",
-   reg1_val, reg2_val);
-   }
-}
-
 static void cx23885_shutdown(struct cx23885_dev *dev)
 {
/* disable RISC controller */
@@ -665,8 +646,6 @@ static void cx23885_reset(struct cx23885_dev *dev)
cx_write(CLK_DELAY, cx_read(CLK_DELAY) & 0x8000);
cx_write(PAD_CTRL, 0x00500300);

-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
msleep(100);

cx23885_sram_channel_setup(dev, &dev->sram_channels[SRAM_CH01],
@@ -683,11 +662,6 @@ static void cx23885_reset(struct cx23885_dev *dev)
 	cx23885_sram_channel_setup(dev, &dev->sram_channels[SRAM_CH09], 128, 
0);


cx23885_gpio_setup(dev);
-
-   cx23885_irq_get_mask(dev);
-
-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
 }


@@ -702,8 +676,6 @@ static int cx23885_pci_quirks(struct cx23885_dev 
*dev)

if (dev->bridge == CX23885_BRIDGE_885)
cx_clear(RDR_TLCTL0, 1 << 4);

-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
return 0;
 }

@@ -1392,9 +1364,6 @@ int cx23885_start_dma(struct cx23885_tsport *port,
dprintk(1, "%s() w: %d, h: %d, f: %d\n", __func__,
dev->width, dev->height, dev->field);

-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
-
/* Stop the fifo and risc engine for this port */
cx_clear(port->reg_dma_ctl, port->dma_ctl_val);

@@ -1482,26 +1451,16 @@ int cx23885_start_dma(struct cx23885_tsport 
*port,

case CX23885_BRIDGE_888:
/* enable irqs */
dprintk(1, "%s() enabling TS int's and DMA\n", __func__);
-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
cx_set(port->reg_ts_int_msk,  port->ts_int_msk_val);
cx_set(port->reg_dma_ctl, port->dma_ctl_val);
-
-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
cx23885_irq_add(dev, port->pci_irqmask);
cx23885_irq_enable_all(dev);
-
-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
break;
default:
BUG();
}

cx_set(DEV_CNTRL2, (1<<5)); /* Enable RISC controller */
-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);

if (cx23885_boards[dev->board].portb == CX23885_MPEG_ENCODER)
cx23885_av_clk(dev, 1);
@@ -1509,11 +1468,6 @@ int cx23885_start_dma(struct cx23885_tsport 
*port,

if (debug > 4)
cx23885_tsport_reg_dump(port);

-   cx23885_irq_get_mask(dev);
-
-   /* clear dma in progress */
-   cx23885_clear_bridge_error(dev);
-
return 0;
 }

@@ -1521,26 +1475,12 @@ static int cx23885_stop_dma(struct 
cx23885_tsport *port)

 {
struct cx23885_dev *dev = port->dev;
u32 reg;
-   int delay = 0;
-   uint32_t reg1_val;
-   uint32_t reg2_val;

dprintk(1, "%s()\n", __func__);

/* Stop interrupts and DMA */
cx_clear(port->reg_ts_int_msk, port->ts_int_msk_val);
cx_clear(port->reg_dma_ctl, port->dma_ctl_val);
-	/* just in case wait for any dma to complete before allowing dealloc 
*/

-   mdelay(20);