On Fri, 19 Aug 2022 09:46:55 +0100 Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> On Thu, 18 Aug 2022 17:37:40 +0100
> Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
>
> > On Wed, 17 Aug 2022 17:16:19 +0100
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> >
> > > On Thu, 11 Aug 2022 17:46:55 -0700
> > > Dan Williams <dan.j.willi...@intel.com> wrote:
> > >
> > > > Dan Williams wrote:
> > > > > Bobo WL wrote:
> > > > > > Hi Dan,
> > > > > >
> > > > > > Thanks for your reply!
> > > > > >
> > > > > > On Mon, Aug 8, 2022 at 11:58 PM Dan Williams
> > > > > > <dan.j.willi...@intel.com> wrote:
> > > > > > >
> > > > > > > What is the output of:
> > > > > > >
> > > > > > > cxl list -MDTu -d decoder0.0
> > > > > > >
> > > > > > > ...? It might be the case that mem1 cannot be mapped by
> > > > > > > decoder0.0, or at least not in the specified order, or that
> > > > > > > validation check is broken.
> > > > > >
> > > > > > Command "cxl list -MDTu -d decoder0.0" output:
> > > > >
> > > > > Thanks for this, I think I know the problem, but will try some
> > > > > experiments with cxl_test first.
> > > >
> > > > Hmm, so my cxl_test experiment unfortunately passed so I'm not
> > > > reproducing the failure mode.
> > > > This is the result of creating x4 region
> > > > with devices directly attached to a single host-bridge:
> > > >
> > > > # cxl create-region -d decoder3.5 -w 4 -m -g 256 mem{12,10,9,11} -s $((1<<30))
> > > > {
> > > >   "region":"region8",
> > > >   "resource":"0xf1f0000000",
> > > >   "size":"1024.00 MiB (1073.74 MB)",
> > > >   "interleave_ways":4,
> > > >   "interleave_granularity":256,
> > > >   "decode_state":"commit",
> > > >   "mappings":[
> > > >     {
> > > >       "position":3,
> > > >       "memdev":"mem11",
> > > >       "decoder":"decoder21.0"
> > > >     },
> > > >     {
> > > >       "position":2,
> > > >       "memdev":"mem9",
> > > >       "decoder":"decoder19.0"
> > > >     },
> > > >     {
> > > >       "position":1,
> > > >       "memdev":"mem10",
> > > >       "decoder":"decoder20.0"
> > > >     },
> > > >     {
> > > >       "position":0,
> > > >       "memdev":"mem12",
> > > >       "decoder":"decoder22.0"
> > > >     }
> > > >   ]
> > > > }
> > > > cxl region: cmd_create_region: created 1 region
> > > >
> > > > > Did the commit_store() crash stop reproducing with latest cxl/preview
> > > > > branch?
> > > >
> > > > I missed the answer to this question.
> > > >
> > > > All of these changes are now in Linus' tree perhaps give that a try and
> > > > post the debug log again?
> > >
> > > Hi Dan,
> > >
> > > I've moved onto looking at this one.
> > > 1 HB, 2RP (to make it configure the HDM decoder in the QEMU HB, I'll tidy
> > > that up at some stage), 1 switch, 4 downstream switch ports each with a
> > > type 3
> > >
> > > I'm not getting a crash, but can't successfully setup a region.
> > > Upon adding the final target
> > > It's failing in check_last_peer() as pos < distance.
> > > Seems distance is 4 which makes me think it's using the wrong level of
> > > the heirarchy for some reason or that distance check is wrong.
> > > Wasn't a good idea to just skip that step though as it goes boom - though
> > > stack trace is not useful.
> >
> > Turns out really weird corruption happens if you accidentally back two
> > type3 devices with the same memory device. Who would have thought it :)
> >
> > That aside ignoring the check_last_peer() failure seems to make everything
> > work for this topology. I'm not seeing the crash, so my guess is we fixed
> > it somewhere along the way.
> >
> > Now for the fun one. I've replicated the crash if we have
> >
> > 1HB 1*RP 1SW, 4SW-DSP, 4Type3
> >
> > Now, I'd expect to see it not 'work' because the QEMU HDM decoder won't be
> > programmed but the null pointer dereference isn't related to that.
> >
> > The bug is straight forward. Not all decoders have commit callbacks...
> > Will send out a possible fix shortly.
> >
> For completeness I'm carrying this hack because I haven't gotten my head
> around the right fix for check_last_peer() failing on this test topology.
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index c49d9a5f1091..275e143bd748 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -978,7 +978,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  			rc = check_last_peer(cxled, ep, cxl_rr,
>  					     distance);
>  			if (rc)
> -				return rc;
> +				// return rc;
>  			goto out_target_set;
>  		}
>  		goto add_target;

I'm still carrying this hack and still haven't worked out the right fix.

Suggestions welcome! If not I'll hopefully get some time on this
towards the end of the week.

Jonathan
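
To illustrate why a distance of 4 cannot pass the "pos < distance" test
for a x4 region, here is a toy model. It is a sketch under stated
assumptions, not the kernel code: peer_distance(), position_rejected()
and count_rejected() are hypothetical helpers, and the formula used for
the distance (region targets divided by the number of targets the port
itself decodes to) is an assumption about how the value 4 arises on a
port with a single downstream target. Unlike the real check, it also
tests every position, including the first endpoint behind a port.

```c
#include <stdbool.h>

/*
 * Toy model only, not drivers/cxl/core/region.c.
 * ASSUMPTION: the expected spacing between peers below a port is the
 * number of region targets divided by the number of targets that the
 * port's own decoder fans out to.
 */
static int peer_distance(int region_targets, int port_targets)
{
	return region_targets / port_targets;
}

/* Models the rejection reported in the thread: fail when pos < distance. */
static bool position_rejected(int pos, int distance)
{
	return pos < distance;
}

/* Count how many region positions 0..region_targets-1 would be rejected. */
static int count_rejected(int region_targets, int port_targets)
{
	int distance = peer_distance(region_targets, port_targets);
	int rejected = 0;

	for (int pos = 0; pos < region_targets; pos++)
		if (position_rejected(pos, distance))
			rejected++;

	return rejected;
}
```

With 4 region targets behind a port that has a single target, the model
gives a distance of 4, and every position 0..3 is below it, so no
endpoint could ever satisfy the check. With 2 port targets the distance
drops to 2 and positions 2 and 3 pass.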