Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
On Wed, Sep 23, 2015 at 09:07:41AM +1000, Gavin Shan wrote: >On Tue, Sep 22, 2015 at 12:43:03PM +0800, Wei Yang wrote: >>On Mon, Sep 21, 2015 at 09:49:45PM +1000, Gavin Shan wrote: >>>On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote: Current EEH infrastructure would avoid to handle EEH when a PE is passed to guest, while if this PE is a Child PE of the one hit EEH, host would handle this. By doing so, this would leads to guest hang. The correct way is avoid to handle it on host and let guest to recover. This patch avoids to handle EEH on a passed Child PE. >>> >>>Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting >>>EEH error, right? If so, I'm not sure if you really tested this code. Does >>>it work for you? >> >>Yes, I inject error on Parent Bus PE. >> >>> >>>When the parent PE (PF) is stopped for EEH recovery, it sounds impossible >>>that the child PE can't be affected and just escape from the error. The >>>question is how the guest can continue to work after the EEH recovery on >>>parent PE? >> >>What I see is the PF is covering and VF in guest is recovering. >> > >What do you mean by "covering"? Which PE's error is detected first in your >testing? There is potentially race here: when the VF PE's error is detected >first and guest tries to recover it. After the recovery happened on guest >side, the host detects the PF PE's error and tries to recover it. During >the recovery, the VF PE is total unusable but guest doesn't know that and >operate like usual. I'm not sure what kinds of problems it can incur, but >it would incur issues. > >On the other hand, if PF PE's error is detected on host first, and then the >guest detects the error on VF PE. After that, the host and guest try to do >recovery at same time. Host issues PE reset to PF PE, which isn't finished >yet. Then guest issues PE reset to VF PE, which will cause another EEH error. > >So I'm not sure if had this patch fully tested. If so, it seems the result is >just achieved by luck, I guess. > It looks really has some race condition. So the expected order should be PF's PE recovered then VF's PE recover in guest, right? >>> Signed-off-by: Wei Yang --- arch/powerpc/kernel/eeh_pe.c | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index 5cde950..c6d0e9f 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe, * callback returns something other than NULL, or no more PEs * to be traversed. */ +static void *__eeh_pe_get(void *data, void *flag); void *eeh_pe_traverse(struct eeh_pe *root, eeh_traverse_func fn, void *flag) { @@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root, void *ret; for (pe = root; pe; pe = eeh_pe_next(pe, root)) { + if (eeh_pe_passed(pe) && (fn != __eeh_pe_get)) + continue; >>> >>>The code change here seems ugly. >>> >>>The "flag" can be extended to carry the information to skip pass-through >>>PEs or not. So the function calling eeh_pe_traverse() decides to skip >>>pass-through PEs or not. >> >>I don't get the point, which "flag" you mean? Add a flag in eeh_pe? >> > void *eeh_pe_traverse(struct eeh_pe *root, eeh_traverse_func fn, void *flag) >^^ >This one > >The code needn't to be changed in a hurry though. I don't think it's right >way to fix the issue as discussed as above. > >>> ret = fn(pe, flag); if (ret) return ret; } @@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root, >>> /* Traverse root PE */ for (pe = root; pe; pe = eeh_pe_next(pe, root)) { + if (eeh_pe_passed(pe)) + continue; eeh_pe_for_each_dev(pe, edev, tmp) { ret = fn(edev, flag); if (ret) -- 2.5.0 >> >>-- >>Richard Yang >>Help you, Help me -- Richard Yang Help you, Help me ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
On Tue, Sep 22, 2015 at 12:43:03PM +0800, Wei Yang wrote: >On Mon, Sep 21, 2015 at 09:49:45PM +1000, Gavin Shan wrote: >>On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote: >>>Current EEH infrastructure would avoid to handle EEH when a PE is passed to >>>guest, while if this PE is a Child PE of the one hit EEH, host would handle >>>this. By doing so, this would leads to guest hang. The correct way is >>>avoid to handle it on host and let guest to recover. >>> >>>This patch avoids to handle EEH on a passed Child PE. >>> >> >>Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting >>EEH error, right? If so, I'm not sure if you really tested this code. Does >>it work for you? > >Yes, I inject error on Parent Bus PE. > >> >>When the parent PE (PF) is stopped for EEH recovery, it sounds impossible >>that the child PE can't be affected and just escape from the error. The >>question is how the guest can continue to work after the EEH recovery on >>parent PE? > >What I see is the PF is covering and VF in guest is recovering. > What do you mean by "covering"? Which PE's error is detected first in your testing? There is potentially race here: when the VF PE's error is detected first and guest tries to recover it. After the recovery happened on guest side, the host detects the PF PE's error and tries to recover it. During the recovery, the VF PE is total unusable but guest doesn't know that and operate like usual. I'm not sure what kinds of problems it can incur, but it would incur issues. On the other hand, if PF PE's error is detected on host first, and then the guest detects the error on VF PE. After that, the host and guest try to do recovery at same time. Host issues PE reset to PF PE, which isn't finished yet. Then guest issues PE reset to VF PE, which will cause another EEH error. So I'm not sure if had this patch fully tested. If so, it seems the result is just achieved by luck, I guess. >> >>>Signed-off-by: Wei Yang >>>--- >>> arch/powerpc/kernel/eeh_pe.c | 5 + >>> 1 file changed, 5 insertions(+) >>> >>>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c >>>index 5cde950..c6d0e9f 100644 >>>--- a/arch/powerpc/kernel/eeh_pe.c >>>+++ b/arch/powerpc/kernel/eeh_pe.c >>>@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe, >>> * callback returns something other than NULL, or no more PEs >>> * to be traversed. >>> */ >>>+static void *__eeh_pe_get(void *data, void *flag); >>> void *eeh_pe_traverse(struct eeh_pe *root, >>> eeh_traverse_func fn, void *flag) >>> { >>>@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root, >>> void *ret; >>> >>> for (pe = root; pe; pe = eeh_pe_next(pe, root)) { >>>+if (eeh_pe_passed(pe) && (fn != __eeh_pe_get)) >>>+continue; >> >>The code change here seems ugly. >> >>The "flag" can be extended to carry the information to skip pass-through >>PEs or not. So the function calling eeh_pe_traverse() decides to skip >>pass-through PEs or not. > >I don't get the point, which "flag" you mean? Add a flag in eeh_pe? > >>> void *eeh_pe_traverse(struct eeh_pe *root, >>> eeh_traverse_func fn, void *flag) ^^ This one The code needn't to be changed in a hurry though. I don't think it's right way to fix the issue as discussed as above. >> >>> ret = fn(pe, flag); >>> if (ret) return ret; >>> } >>>@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root, >> >>> >>> /* Traverse root PE */ >>> for (pe = root; pe; pe = eeh_pe_next(pe, root)) { >>>+if (eeh_pe_passed(pe)) >>>+continue; >>> eeh_pe_for_each_dev(pe, edev, tmp) { >>> ret = fn(edev, flag); >>> if (ret) >>>-- >>>2.5.0 >>> > >-- >Richard Yang >Help you, Help me ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
On Mon, Sep 21, 2015 at 09:49:45PM +1000, Gavin Shan wrote: >On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote: >>Current EEH infrastructure would avoid to handle EEH when a PE is passed to >>guest, while if this PE is a Child PE of the one hit EEH, host would handle >>this. By doing so, this would leads to guest hang. The correct way is >>avoid to handle it on host and let guest to recover. >> >>This patch avoids to handle EEH on a passed Child PE. >> > >Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting >EEH error, right? If so, I'm not sure if you really tested this code. Does >it work for you? Yes, I inject error on Parent Bus PE. > >When the parent PE (PF) is stopped for EEH recovery, it sounds impossible >that the child PE can't be affected and just escape from the error. The >question is how the guest can continue to work after the EEH recovery on >parent PE? What I see is the PF is covering and VF in guest is recovering. > >>Signed-off-by: Wei Yang >>--- >> arch/powerpc/kernel/eeh_pe.c | 5 + >> 1 file changed, 5 insertions(+) >> >>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c >>index 5cde950..c6d0e9f 100644 >>--- a/arch/powerpc/kernel/eeh_pe.c >>+++ b/arch/powerpc/kernel/eeh_pe.c >>@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe, >> * callback returns something other than NULL, or no more PEs >> * to be traversed. >> */ >>+static void *__eeh_pe_get(void *data, void *flag); >> void *eeh_pe_traverse(struct eeh_pe *root, >>eeh_traverse_func fn, void *flag) >> { >>@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root, >> void *ret; >> >> for (pe = root; pe; pe = eeh_pe_next(pe, root)) { >>+ if (eeh_pe_passed(pe) && (fn != __eeh_pe_get)) >>+ continue; > >The code change here seems ugly. > >The "flag" can be extended to carry the information to skip pass-through >PEs or not. So the function calling eeh_pe_traverse() decides to skip >pass-through PEs or not. I don't get the point, which "flag" you mean? Add a flag in eeh_pe? > >> ret = fn(pe, flag); >> if (ret) return ret; >> } >>@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root, > >> >> /* Traverse root PE */ >> for (pe = root; pe; pe = eeh_pe_next(pe, root)) { >>+ if (eeh_pe_passed(pe)) >>+ continue; >> eeh_pe_for_each_dev(pe, edev, tmp) { >> ret = fn(edev, flag); >> if (ret) >>-- >>2.5.0 >> -- Richard Yang Help you, Help me ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote: >Current EEH infrastructure would avoid to handle EEH when a PE is passed to >guest, while if this PE is a Child PE of the one hit EEH, host would handle >this. By doing so, this would leads to guest hang. The correct way is >avoid to handle it on host and let guest to recover. > >This patch avoids to handle EEH on a passed Child PE. > Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting EEH error, right? If so, I'm not sure if you really tested this code. Does it work for you? When the parent PE (PF) is stopped for EEH recovery, it sounds impossible that the child PE can't be affected and just escape from the error. The question is how the guest can continue to work after the EEH recovery on parent PE? >Signed-off-by: Wei Yang >--- > arch/powerpc/kernel/eeh_pe.c | 5 + > 1 file changed, 5 insertions(+) > >diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c >index 5cde950..c6d0e9f 100644 >--- a/arch/powerpc/kernel/eeh_pe.c >+++ b/arch/powerpc/kernel/eeh_pe.c >@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe, > * callback returns something other than NULL, or no more PEs > * to be traversed. > */ >+static void *__eeh_pe_get(void *data, void *flag); > void *eeh_pe_traverse(struct eeh_pe *root, > eeh_traverse_func fn, void *flag) > { >@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root, > void *ret; > > for (pe = root; pe; pe = eeh_pe_next(pe, root)) { >+ if (eeh_pe_passed(pe) && (fn != __eeh_pe_get)) >+ continue; The code change here seems ugly. The "flag" can be extended to carry the information to skip pass-through PEs or not. So the function calling eeh_pe_traverse() decides to skip pass-through PEs or not. > ret = fn(pe, flag); > if (ret) return ret; > } >@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root, > > /* Traverse root PE */ > for (pe = root; pe; pe = eeh_pe_next(pe, root)) { >+ if (eeh_pe_passed(pe)) >+ continue; > eeh_pe_for_each_dev(pe, edev, tmp) { > ret = fn(edev, flag); > if (ret) >-- >2.5.0 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev