A quick follow-up to my prior message:

* I ran this again with the SSD model off and the assertion failure
  didn't happen. This rules out a problem caused simply by marss
  swapping a lot (I was a bit concerned that, since we are the only
  people doing disk swapping with marss right now, we would run into
  a problem with that).
* After looking at the SSD debug logs, it seems that there weren't any
  DMA transactions outstanding to the memory system when the assertion
  failure happened. This means that this was not an address conflict
  from software using the DMA buffer.
* I talked with Paul Rosenfeld and he said that the assertion failure
  seems to be on the path to the memory controller. I looked at the
  code further and it looks like memoryController::handle_interconnect_cb
  is returning false because pendingRequests_ did not have any open
  entries.

What I think is happening is that, because of contention for DRAMSim
(the SSD DMA and the normal marss last-level cache both use it), marss
builds up a long list of pending requests and eventually the request
queue overflows.

Does this explain what I'm seeing? If so, can I fix this simply by
making MEM_REQ_NUM larger (right now it is 64)? Or will that break
something else?

Thanks,
Jim Stevens

> Hello,
>
> We've been getting an assertion failure in my modified version of marss
> for SSD simulation...
>
> Completed 4583629000 cycles, 787459313 commits: 73877 Hz,
> 187432 insns/sec: rip ffffffff8108adc8 ffffffff8108e548
> ffffffff81013091 ffffffff81013091qemu-system-x86_64:
> ptlsim/build/cache/splitPhaseBus.cpp:371: bool
> Memory::SplitPhaseBus::BusInterconnect::broadcast_completed_cb(void*):
> Assertion `ret' failed.
>
>
> This is running parsec's fluidanimate simlarge with 64 MB of RAM.
>
> My SSD simulation code has to check the address stream coming back from
> DRAMSim and if the address is part of a DMA, then it sends the callback to
> the SSD and not back to the marss pipeline. My concern about this approach
> was that if the pipeline and the SSD both request the same address at the
> same time, that bad things could happen (e.g. assertion failures).
>
> Do you think this is a likely explanation of the above assertion failure?
>
> I was hoping that this wouldn't be an issue because the OS should know not
> to touch the DMA buffer until after it receives the IRQ (note: the IRQ
> isn't sent until after all DMA and disk interactions have completed). If
> this is indeed the problem, I think I have a fix for it (I'll have to
> check both the pending DMA requests AND the pending marss requests), but
> I'm worried this might break something else. And I'm confused as to why
> the software would ever touch the DMA buffer, so it makes me think I have
> a bug elsewhere (probably on the qemu side of things, which is where I
> extract the scatter gather lists for the DMA buffers).
>
> I'm currently running two more tests (with no SSD model and the SSD model
> with DMA off) to see if this error happens again. But I think this is the
> DMA code, given that the assertion failure happened in the ptlsim/cache
> subdirectory.
>
>
> Thanks,
>
> Jim Stevens
> University of Maryland, College Park
>
>
> _______________________________________________
> http://www.marss86.org
> Marss86-Devel mailing list
> [email protected]
> https://www.cs.binghamton.edu/mailman/listinfo/marss86-devel
